MariaDB Connector/J / CONJ-863

Ensure socket state when SocketTimeout occurs

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.2
    • Fix Version/s: 2.7.4
    • Component/s: aurora
    • Labels: None
    • Environment: Java 8, Aurora MySQL 5.7

    Description

      When the driver encounters a "SocketTimeoutException - Read timed out" it doesn't force-close the socket. This leaves the TCP connection open, and previously started transactions on the database are therefore never rolled back.

      If the read timeout happens because of temporarily high load or temporary locks, zombie transactions can keep holding locks or keep executing for a very long time, effectively locking a majority of the table and causing transactions to pile up.

      Example scenario (default 10s socketTimeout):
      00:00 -> BEGIN;
      00:00 -> DELETE FROM MyTable WHERE reference = '1234';
      00:10 -> Socket read timeout, the driver abandons the connection - but doesn't close the socket
      00:15 -> Worst case, the query is retried by the client because it failed before ... (causing even more hanging transactions)
      00:25 -> DELETE completes - but is never committed or rolled back, since the connection was abandoned by the driver (while the TCP connection is still valid/open).

      In our experience there is no recovery from this situation other than a manual database failover or a restart of all services, shutting down the JVMs to force-close all zombie TCP connections and the transactions caused by the driver.

      Fix
      After a couple of incidents on our production system we ended up manually patching the driver to run the following code in `AbstractQueryProtocol.handleIoException`:

      public class AbstractQueryProtocol extends AbstractConnectProtocol implements Protocol {
        public SQLException handleIoException(Exception initialException) {
          ....
          else if (initialException instanceof SocketTimeoutException) {
            maxSizeError = false;
            try {
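              // the read timeout left the socket open; force-closing it makes the server
              // drop the connection and roll back the pending transaction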
              destroySocket();
            } catch (Exception e) {
              logger.info("Failed to close socket after SocketTimeoutException - " + e.getMessage());
            }
          }
          ....
        }
      }
      

      Since MySQL 5.7/Aurora doesn't support MariaDB's `max_statement_time`, there is currently, to our knowledge, no good way of dealing with this issue (setting a wait_timeout of 10s would have the same effect, but causes other problems with connection pooling).
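
      For reference, a minimal sketch of the MariaDB-only server-side limit mentioned above (not part of the original report; URL, credentials and table are illustrative): with `max_statement_time` set on the session, the server itself aborts statements that exceed the limit, so no zombie query keeps running after a client-side timeout.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.Statement;

      public class MaxStatementTimeSketch {
        public static void main(String[] args) throws Exception {
          // MariaDB only - MySQL 5.7/Aurora has no equivalent of max_statement_time.
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:mariadb://localhost:3306/test", "user", "password");
               Statement st = conn.createStatement()) {
            // Server-side limit (seconds): statements exceeding it are aborted by the server.
            st.execute("SET SESSION max_statement_time=10");
            st.execute("DELETE FROM MyTable WHERE reference = '1234'");
          }
        }
      }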

      Reproduction
      1. Create MyTable
      2. Connect with the MariaDB/J driver - low socketTimeout
      3. Manually execute `LOCK TABLES MyTable WRITE`
      4. Run `BEGIN; INSERT INTO MyTable ..;` from Java
      5. After the read timeout has occurred, check `SHOW PROCESSLIST`: our transaction is still there, but it will never get committed or rolled back by the driver.
      Note: Same behaviour when using a very large table and attempting to execute any DML that exceeds the socketTimeout (a minimal JDBC sketch of these steps follows below).
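
      A minimal JDBC sketch of the reproduction steps above (not part of the original report; URL, credentials and the MyTable/reference column are illustrative, and the table lock is taken from a second connection instead of manually):

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.Statement;

      public class SocketTimeoutRepro {
        public static void main(String[] args) throws Exception {
          // Low socketTimeout (ms) so the read timeout triggers quickly.
          String url = "jdbc:mariadb://localhost:3306/test?socketTimeout=2000";
          try (Connection locker = DriverManager.getConnection(url, "user", "password");
               Connection victim = DriverManager.getConnection(url, "user", "password")) {
            // Step 3: hold a write lock so the INSERT below cannot finish in time.
            locker.createStatement().execute("LOCK TABLES MyTable WRITE");

            // Step 4: BEGIN + INSERT from Java.
            victim.setAutoCommit(false);
            try (Statement st = victim.createStatement()) {
              st.execute("INSERT INTO MyTable (reference) VALUES ('1234')");
            } catch (Exception e) {
              // The read timeout surfaces as an SQLException; with an unfixed driver the
              // underlying socket stays open and the transaction is never rolled back.
              System.out.println("Query failed: " + e.getMessage());
            }

            locker.createStatement().execute("UNLOCK TABLES");
            // Step 5: run SHOW PROCESSLIST on the server - the abandoned session is still there.
          }
        }
      }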

      Attachments

        Activity

          Diego Dupin added a comment -

          What you describe seems to rely on the socket timeout, not the query timeout.
          If you expect commands to take longer than the socket timeout (which defaults to 10s for aurora), it is better to set connection.setNetworkTimeout accordingly first.

          The query timeout for anything other than MariaDB >= 10.1.2 is handled differently: when the timeout is reached, another connection is created which issues a KILL QUERY <thread id> to make sure the running query is ended and the connection is left in a good state.
          (I've heard that Aurora doesn't return the correct <thread id> value, so this might be the reason why this is not working, but I cannot verify that since I don't have an Aurora instance. If so, then this is an Aurora bug.)
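
          A minimal sketch of the setNetworkTimeout suggestion (URL, credentials and the statement are illustrative; Connection.setNetworkTimeout is standard JDBC and takes an Executor plus a timeout in milliseconds):

          import java.sql.Connection;
          import java.sql.DriverManager;
          import java.util.concurrent.ExecutorService;
          import java.util.concurrent.Executors;

          public class NetworkTimeoutSketch {
            public static void main(String[] args) throws Exception {
              ExecutorService executor = Executors.newSingleThreadExecutor();
              try (Connection conn = DriverManager.getConnection(
                      "jdbc:mariadb://localhost:3306/test", "user", "password")) {
                // Raise the read timeout for this connection to 60s before running a
                // statement that may take longer than the 10s default used for aurora.
                conn.setNetworkTimeout(executor, 60_000);
                conn.createStatement().execute("DELETE FROM MyTable WHERE reference = '1234'");
              } finally {
                executor.shutdown();
              }
            }
          }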

          Emil von Wattenwyl added a comment - edited

          Yes, exactly - I'm talking about the socket timeout.

          If you expect commands to take longer than the socket timeout (which defaults to 10s for aurora), it is better to set connection.setNetworkTimeout accordingly first.

          In our case we don't have specific commands that we expect to take longer than 10s - and 10s seems like a good default to guarantee quick failover when the DB just disappears.

          What might push the application over the 10s threshold are rare temporary locks, caused by concurrent DML statements interfering with each other - or simply by a DDL command.

          In any case this is only the "symptom" - from our observation the real problem is that the driver doesn't close sockets that have reached the socket timeout. The zombie transactions on aurora are where we noticed it, but it can probably also cause resource exhaustion.

          (The KILL QUERY mechanism you describe would normally make the DB close the abandoned TCP connection from its side, I assume - more as a side effect than as a proper cleanup - so the reason this behaviour of not closing the sockets actually causes the observed problems on aurora is likely related to the wrong thread id issue.)

          I think simply making sure that the socket is closed when a SocketTimeoutException occurs removes this problem and other potential problems related to "abandoned sockets".

          Diego Dupin added a comment - edited

          What I mean about Aurora is that KILL QUERY <thread id> is executed, but since Aurora returns a wrong thread id, it doesn't do anything.
          This was a bug in a very old version of MaxScale, and a bug that has been reported on Aurora.
          If this is the case - I cannot verify that - then the command throws an error and the connection ends up in the wrong shape, like you indicate.

          If this is confirmed, Aurora has to correct this issue, not the driver.
          (The driver takes the thread id from the initial handshake packet.)

          Emil von Wattenwyl added a comment -

          So there is no difference in the behaviour between Aurora and MariaDB.

          I've added a test case to better illustrate what caused us some unexpected lock & load problems - see `/test-case-base/src/main/kotlin/ch/aquint/TestCase` for a detailed description and run the tests like:
          `./gradlew :test-case-official-driver:run`
          `./gradlew :test-case-fixed-driver:run`

          With the official driver, Thread 2 takes 30s before it can finally insert its rows, whereas with the fixed driver Thread 2 can insert its rows after a couple of seconds already. (I assume some MariaDB-internal purge timeout checks the session's TCP connection state.)

          The TCP connection is definitely left open by the driver and the query is left running after a SocketTimeoutException (10s default for aurora configurations), regardless of the DB engine.

          However, the issue seems less severe than I initially thought, because the transaction is killed once the long-running statement completes in the background; I'm not exactly sure why that happens.

          Honestly, after thinking about this topic a bit more, I'm not 100% sure anymore whether closing the TCP connection on SocketTimeout is the correct thing to do for all use cases. From our experience it certainly is, because it prevents a destructive pile-up of connections, locks and processes on the database when we get a lot of retries due to a bad query or an unexpected amount of data to delete. (In our case, basically an event stuck in the queue because it cannot be processed, since it constantly runs into the SocketTimeout. Of course this can be mitigated with better application design and more careful queries, to be absolutely sure you always stay under 10s - although that's sometimes not easy when dealing with a delete cascade touching an unknown number of related rows, etc.)
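
          For readers without the attachment, a rough Java approximation of the scenario the Kotlin test case exercises (not the attached TestCase itself; URL, credentials, table and timings are illustrative):

          import java.sql.Connection;
          import java.sql.DriverManager;

          public class LockPileUpSketch {
            // Thread 1 uses the 10s aurora-style socketTimeout; Thread 2 gets a larger one
            // so it can simply measure how long its INSERT stays blocked.
            static final String SHORT_TIMEOUT_URL = "jdbc:mariadb://localhost:3306/test?socketTimeout=10000";
            static final String LONG_TIMEOUT_URL = "jdbc:mariadb://localhost:3306/test?socketTimeout=60000";

            public static void main(String[] args) throws Exception {
              Thread t1 = new Thread(() -> {
                try (Connection c = DriverManager.getConnection(SHORT_TIMEOUT_URL, "user", "password")) {
                  c.setAutoCommit(false); // BEGIN
                  // A delete that takes longer than 10s -> SocketTimeoutException on the client.
                  c.createStatement().execute("DELETE FROM MyTable WHERE reference = '1234'");
                  c.commit();
                } catch (Exception e) {
                  // The connection is abandoned here; before the fix its socket stayed open,
                  // so the server kept the transaction and its row locks alive.
                  System.out.println("Thread 1 gave up: " + e.getMessage());
                }
              });

              Thread t2 = new Thread(() -> {
                long start = System.currentTimeMillis();
                try (Connection c = DriverManager.getConnection(LONG_TIMEOUT_URL, "user", "password")) {
                  // Blocks until the locks held by Thread 1's zombie transaction are released.
                  c.createStatement().execute("INSERT INTO MyTable (reference) VALUES ('1234')");
                  System.out.println("Thread 2 inserted after " + (System.currentTimeMillis() - start) + " ms");
                } catch (Exception e) {
                  System.out.println("Thread 2 failed: " + e.getMessage());
                }
              });

              t1.start();
              Thread.sleep(2_000); // let Thread 1 acquire its locks first
              t2.start();
              t1.join();
              t2.join();
            }
          }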

          Diego Dupin added a comment -

          Good points.

          To simplify: SocketTimeoutException is a special IOException, since when it is raised the socket is still valid.
          This leaves the driver in a wrong state concerning exchanges, with the socket still active. The socket must be explicitly closed in this specific case, so that the database can, for example, cancel the current transaction running on this connection.

          done with https://github.com/mariadb-corporation/mariadb-connector-j/commit/0f04e45d2555b39a08e693163300c5ca58cad645


          Emil von Wattenwyl added a comment -

          Exactly - thanks a lot for the support!

          People

            Assignee: Diego Dupin
            Reporter: Emil von Wattenwyl
            Votes: 1
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:
