MariaDB MaxScale / MXS-4183

Multiplexing fails with "Timed out when waiting for a connection"

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 6.4.0
    • Fix Version/s: 6.4.2, 22.08.0
    • Component/s: readwritesplit
    • Labels:
      None
    • Environment:
      AWS m2.micro CentOS 7 VM
    • Sprint:
      MXS-SPRINT-161, MXS-SPRINT-162

      Description

      I was testing our new connection queuing functionality:
      https://mariadb.com/kb/en/mariadb-maxscale-6-mariadb-maxscale-configuration-guide/#idle_session_pool_time

      I used MaxScale v6.4.0 connected to a MariaDB Enterprise Server 10.6.8-4 backend. I have attached the maxscale.cnf I used. My 10.6.8-4 ES configuration is vanilla except for the following:

      [server]
      log_error=/var/log/mariadb/error.log
      max_connections=15

      This pairs with the backend setting in my maxscale.cnf:

      max_routing_connections=10

      I leave extra max_connections on the backend to cover the persistent connections held by MariaDB Monitor and readwritesplit. I also like leaving a slot open so I can poll mariadb-admin stat proc on the backend as needed. In any case, I have validated that MaxScale is configured correctly and establishes no more than 10 "extra" connections to the backend when a client sends a workload of >= 10 concurrent queries.
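      The connection budget above is simple arithmetic, sketched here as a sanity check. The function name is hypothetical, and it only compares the two settings quoted in this report; it does not model how many connections the monitor or readwritesplit actually hold.

```python
def backend_headroom(max_connections: int, max_routing_connections: int) -> int:
    """Backend slots left over once MaxScale's routing connections are counted."""
    headroom = max_connections - max_routing_connections
    if headroom < 0:
        raise ValueError("max_routing_connections exceeds the backend's max_connections")
    return headroom


# Values from the configuration described in this report.
print(backend_headroom(max_connections=15, max_routing_connections=10))  # prints 5
```

      The 5 remaining slots are what is left for the monitor, other persistent connections and a manual admin session.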

      I generate the workload with:

      mariadb-slap -h ${your_maxscale_ip} -u slap -p"${mariadb_slap_user_password}" -v \
          -a \
          --create-schema='mariadbslap' \
          -x 5 \
          -y 5 \
          --auto-generate-sql-add-autoincrement \
          --auto-generate-sql-load-type=mixed \
          --auto-generate-sql-execute-number=1000 \
          --auto-generate-sql-unique-write-number=1000 \
          -c 1000 \
          -i 5;

      Markus Makela has successfully reproduced this issue with the above command, so I am not including the 260MB log file MaxScale generates with log_info=true. Note that to reproduce this, you will need to create appropriate maxscale and slap users on the MariaDB backend.

      When running the above, between 60 and 90 seconds into the workload mariadb-slap fails on a write query with:

      ERROR : Timed out when waiting for a connection.
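      To check the 60 to 90 second window on another setup, the run can be timed with a small wrapper. This is a minimal sketch: the flags are copied from the mariadb-slap command in this report, while the helper names are hypothetical and the timing is only meaningful if mariadb-slap exits once it hits the error.

```python
import subprocess
import time


def build_slap_argv(host: str, password: str) -> list[str]:
    """Build the mariadb-slap command line used in this report."""
    return [
        "mariadb-slap", "-h", host, "-u", "slap", f"-p{password}", "-v",
        "-a",
        "--create-schema=mariadbslap",
        "-x", "5",
        "-y", "5",
        "--auto-generate-sql-add-autoincrement",
        "--auto-generate-sql-load-type=mixed",
        "--auto-generate-sql-execute-number=1000",
        "--auto-generate-sql-unique-write-number=1000",
        "-c", "1000",
        "-i", "5",
    ]


def time_until_exit(argv: list[str]) -> tuple[int, float]:
    """Run a command and return (exit code, elapsed seconds)."""
    start = time.monotonic()
    result = subprocess.run(argv)
    return result.returncode, time.monotonic() - start
```

      Calling time_until_exit(build_slap_argv(maxscale_ip, slap_password)) returns the exit code and the elapsed time, which can be compared against the timestamps in MaxScale's log.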

      MaxScale's error log shows:

      2022-06-30 19:08:25   info   : [readwritesplit] (readwritesplit); Master 'mariadb' failed: #HY000: Timed out when waiting for a connection.
      2022-06-30 19:08:25   error  : [readwritesplit] (readwritesplit); Lost connection to the master server, closing session. Lost connection to master server while waiting for a result. Connection has been idle for 69 seconds. Error caused by: #HY000: Timed out when waiting for a connection.. Last close reason: <none>. Last error: 

      After that, a large number of sessions and connections are closed.

      Before approaching MaxScale Engineering, I reviewed all timeout configuration variables present in MaxScale and tried setting them to maximum or infinite-equivalent values. This did not change MaxScale's behavior. Note that the attached configuration file omits these extreme values and reflects only what the original use case required.

      From discussion with Markus Makela on Slack, the likely causes of this behavior are:

      1. A scheduling issue where a subset of connections is preferred over others, causing some client sessions to spend a long time waiting to execute
      2. An internal 60-second time limit
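      How the two suspected causes could combine can be seen in a toy queueing model. This is not MaxScale's scheduler: the 1000 clients and 10 backend slots come from this report and the 60-second limit from cause 2, while the 100 ms service time, the 10 queries per session and the two queue policies are assumptions chosen purely for illustration.

```python
from collections import deque


def worst_wait_ms(sessions, slots, queries_per_session, service_ms, fair):
    """Longest time any session waits for a slot, in milliseconds.

    fair=True serves waiters first-in first-out; fair=False always serves
    the most recently queued waiters, so one group keeps the slots.
    """
    waiting = deque((sid, 0) for sid in range(sessions))  # (session, queued at)
    remaining = [queries_per_session] * sessions
    now = 0
    worst = 0
    while waiting:
        take = waiting.popleft if fair else waiting.pop
        batch = [take() for _ in range(min(slots, len(waiting)))]
        worst = max(worst, max(now - since for _, since in batch))
        now += service_ms  # every slot runs one query per round
        for sid, _ in batch:
            remaining[sid] -= 1
            if remaining[sid] > 0:
                waiting.append((sid, now))  # queue again for the next query
    return worst


TIMEOUT_MS = 60_000  # the suspected internal limit
for fair in (True, False):
    wait = worst_wait_ms(1000, 10, 10, 100, fair)
    print(fair, wait, wait > TIMEOUT_MS)
# prints:
# True 9900 False
# False 99000 True
```

      In this model both policies finish the whole workload in the same 100 seconds, so a test that measures only total completion time cannot tell them apart. Only the unfair policy pushes some sessions past the 60-second limit.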

      My testing only measures how long MaxScale takes to complete all the queries fed to it, but real-world use cases are expected to care about per-connection latency, so correcting the scheduling issue is a top priority.

      However, the internal 60-second timeout is also a problem we should address. I ran a similar test on competing database proxy software and verified that it lets users configure all such timeouts, so MaxScale should provide the same functionality.

              People

              Assignee:
              esa.korhonen Esa Korhonen
              Reporter:
              rob.schwyzer@mariadb.com Rob Schwyzer