Uploaded image for project: 'MariaDB MaxScale'
  1. MariaDB MaxScale
  2. MXS-3062

Intermittent client connection hang due to packet-timing-sensitive defect in multi-statement response handling

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.21
    • Fix Version/s: 2.3.0
    • Component/s: readwritesplit
    • Labels:
      None

      Description

      Maxscale readwritesplit exhibits inconsistent/nondeterministic behavior when handling multi-statement responses consisting only of "OK" packets (not result sets), such as would occur with multiple UPDATE statements. When the defect is triggered, the client experiences a hung connection and the last query sent by the client is never forwarded to any server. If the client is in a transaction then all the work is lost, since commit is impossible.

      When the failure condition occurs, the client connection hangs because Maxscale has passed the complete response through to the client but Maxscale has failed to recognize that the response is in fact complete. The client sends its next query, which is buffered indefinitely by Maxscale, because of this incorrect internal state.

      Failure example:

      2020-07-02 14:03:36   info   : (82) [readwritesplit] > Autocommit: [enabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 180, type: QUERY_TYPE_WRITE, stmt:
        UPDATE test_mxs_multi_bug SET ts = NOW(6) WHERE id = 1;
        UPDATE test_mxs_multi_bug SET ts = NOW(6) WHERE id = 2;
        UPDATE test_mxs_multi_bug SET ts = NOW(6) WHERE id = 3;
       
      2020-07-02 14:03:36   info   : (82) [readwritesplit] Route query to master      [maria]:3306 <
      2020-07-02 14:03:36   info   : (82) [readwritesplit] Reply not yet complete. Waiting for 1 replies, got one from maria
      2020-07-02 14:03:36   info   : (82) [readwritesplit] Reply not yet complete. Waiting for 1 replies, got one from maria
      2020-07-02 14:03:36   info   : (82) [readwritesplit] Storing query (len: 13 cmd: 3), expecting 1 replies to current command
      

      "Storing query" refers to Maxscale stashing the client's next query, because it believes the query arrived early, while the response was still incomplete... but this is not actually what happened. The client sent this query after receiving a complete response.

      Importantly, the exact same query does not always fail. It is random.

      2020-07-02 14:00:27   info   : (80) [readwritesplit] > Autocommit: [enabled], trx is [open], cmd: (0x03) COM_QUERY, plen: 180, type: QUERY_TYPE_WRITE, stmt:
        UPDATE test_mxs_multi_bug SET ts = NOW(6) WHERE id = 1;
        UPDATE test_mxs_multi_bug SET ts = NOW(6) WHERE id = 2;
        UPDATE test_mxs_multi_bug SET ts = NOW(6) WHERE id = 3;
       
      2020-07-02 14:00:27   info   : (80) [readwritesplit] Route query to master      [maria]:3306 <
      2020-07-02 14:00:27   info   : (80) [readwritesplit] Reply not yet complete. Waiting for 1 replies, got one from maria
      2020-07-02 14:00:27   info   : (80) [readwritesplit] Reply complete, last reply from maria
      

      The difference in behavior in both scenarios is in the timing of the arrival of the three response packets from the server.

      With a debug build, I've confirmed that when that last two packets arrive together (in a single read from the socket), Maxscale correctly parses through the buffer and concludes that the reply is complete... but when the last packet arrives alone, the hang condition occurs. The timing of any earlier packets does not appear to be important, nor is the number of queries sent, though the issue is only easy to reproduce with 3 or more queries sent together.

      The issue was originally encountered in a production environment using the Connector/NET client, but the test case was created using the Perl DBD::mysql client. Because this is a timing-related defect that only impacts multi-statement queries, the likelihood of triggering the bug seems to vary with system conditions; however, the test case script should only need to be run a few times in succession to generate the observed hang when the final commit or rollback query is sent by the test script. Or it may fail on the first attempt. The behavior is not deterministic.

      Investigating this issue leads in a number of false directions, since it appears – to the application – that the server is hanging on commit or rollback, but in fact Maxscale is suspending query processing from the client. Maxscale's "Reply not yet complete" message "waiting for 1 replies" is easily misinterpreted as "waiting for 1 result set" when in fact it means "waiting for a reply from 1 server." Intuition suggests that the difference in the number of "Reply not yet complete" messages that appear is related to the number of server responses (result sets, OK packets) actually being expected, but this is also incorrect – the number of messages is impacted by the arrival timing on the wire of the individual responses. More responses arriving together (in a single read) results in fewer logged messages.

      Further debugging points to readwritesplit.cc setting n_eof to '2' when the behavior is correct...

      int n_eof = modutil_count_signal_packets(buffer, 0, &more, &state);
      

      ...but not setting it to '2' when the behavior is incorrect... which lead me to discover some potentially flawed logic in modutil.cc with an accompanying comment:

      // Treat complete multi-statement result sets that consist of only OK packets as a single result set
      // TODO: Review this, it doesn't look very convincing.
      if (only_ok && !more && num_packets > 1)
      {
          total = 2;
      }
      

      This indeed does not look very convincing.

      Changing this expression to (only_ok && !more && num_packets >= 1) resolves this issue for me, but I arrived at this intuitively rather than in a provably correct way, so I'm unclear whether this proposed fix actually resolves a bug... or perhaps introduces a new one that cancels out another bug in readwritesplit.cc.

      A simple test script is attached. Tested against MariaDB 10.3.23 using Maxscale 2.2.21 but the server version and client library do not appear to be relevant. The same code appears to be present in both newer and older Maxscale 2.x, so this is an issue that likely exists in all versions, but potentially rare in the wild since it only appears to impact multi-statement queries that are not SELECT.

        Attachments

          Activity

            People

            Assignee:
            markus makela markus makela
            Reporter:
            sqlbot Michael Ezzell
            Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved: