MariaDB Server: MDEV-36311

Sporadic 120-second connect timeouts, test suite failures

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: N/A
    • Fix Version/s: None
    • Component/s: Tests
    • Labels: None

    Description

      As a follow-up to MDEV-30232, it seems the testsuite is more generally affected by a bug in the network stack in the Linux kernel. The bug affects tests that deliberately crash the server.

      The bug could be seen in rpl.rpl_gtid_crash before this work-around was pushed:

      commit b6b6bb8d36b8dd72150647570833ec60a037a237
      Author: Kristian Nielsen <knielsen@knielsen-hq.org>
      Date:   Sat Mar 8 07:27:04 2025 +0100
       
          Fix sporadic failures of rpl.rpl_gtid_crash
      

      rpl.rpl_gtid_crash 'stmt'                w52 [ 9 fail ]
              Test ended at 2025-03-17 17:51:03
       
      CURRENT_TEST: rpl.rpl_gtid_crash
      mysqltest: In included file "./include/sync_with_master_gtid.inc": 
      included from /kvm/src/my/dev/mariadb2/mysql-test/suite/rpl/t/rpl_gtid_crash.test at line 509:
      At line 45: Failed to sync with master
       
      The result from queries just before the failure was:
      < snip >
      SET GLOBAL debug_dbug="+d,inject_error_writing_xid";
      BEGIN;
      INSERT INTO t1 VALUES (22);
      COMMIT;
      ERROR HY000: Error writing file 'master-bin' (errno: 28 "No space left on device")
      SET GLOBAL debug_dbug="+d,crash_dispatch_command_before";
      COMMIT;
      Got one of the listed errors
      INSERT INTO t1 VALUES (23);
      INSERT INTO t1 VALUES (24);
      SELECT * from t1 WHERE a > 10 ORDER BY a;
      a
      13
      14
      23
      24
      include/save_master_gtid.inc
      connection server_2;
      include/sync_with_master_gtid.inc
      Timeout in master_gtid_wait('0-1-11,1-1-2,2-1-1', 120), current slave GTID position is: 0-1-9,1-1-2,2-1-1.
       
      More results from queries before failure can be found in /kvm/src/my/dev/mariadb2/mysql-test/var/52/log/rpl_gtid_crash.log
      

      However, the problem is general and can affect different tests in different
      ways.

      The problem is described in detail here:

      https://lore.kernel.org/netdev/87sf0ldk41.fsf@urd.knielsen-hq.org/T/#u

      The TL;DR is that a connection made just as the server is being killed can hang indefinitely, with the client never being notified that the server has shut down. This causes the client to time out after --connect-timeout, which defaults to 120 seconds. When this happens in e.g. --source include/wait_until_disconnected.inc, the result is that the test stalls for the 120 seconds, doing nothing with the server stopped. This can cause various other failures due to timeouts, for example in replication tests when the slave IO thread gives up reconnecting to the master.
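      To make the scenario concrete, here is a minimal self-contained C sketch of the race (purely illustrative: the port number, the fork/kill structure and the never-calling-accept() server are my own assumptions, and on a correctly behaving kernel the queued connection is reset almost immediately, so this does not by itself reproduce the bug):

      /*
       * Hypothetical demo of the scenario, not a reliable reproducer.
       *
       * A "server" child opens a listening socket but never calls accept(),
       * mimicking a connection that lands in the accept queue just as
       * mariadbd is SIGKILLed. The parent connects, kills the child, and
       * then waits in poll() much like pvio_socket_wait_io_or_timeout().
       * Normally the kernel resets the queued connection when the listening
       * socket goes away and poll() returns quickly; when the kernel bug
       * strikes, poll() only returns after the full timeout.
       */
      #include <arpa/inet.h>
      #include <errno.h>
      #include <netinet/in.h>
      #include <poll.h>
      #include <signal.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      #include <unistd.h>

      #define PORT 19870          /* arbitrary demo port */
      #define TIMEOUT_MS 120000   /* the mysqltest --connect-timeout default */

      int main(void)
      {
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family= AF_INET;
        addr.sin_port= htons(PORT);
        addr.sin_addr.s_addr= htonl(INADDR_LOOPBACK);

        pid_t server= fork();
        if (server == 0)
        {
          /* "Server": listen but never accept(), then wait to be killed. */
          int one= 1;
          int lfd= socket(AF_INET, SOCK_STREAM, 0);
          setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
          if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) || listen(lfd, 16))
            _exit(1);
          pause();
          _exit(0);
        }

        sleep(1);                   /* let the child reach listen() */

        int cfd= socket(AF_INET, SOCK_STREAM, 0);
        if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        {
          perror("connect");
          return 1;
        }
        /* Handshake done; the connection sits in the child's accept queue. */

        kill(server, SIGKILL);      /* the "server crash" */
        waitpid(server, NULL, 0);

        /* Wait for the server greeting, the way the client library does. */
        struct pollfd p_fd= { .fd= cfd, .events= POLLIN };
        int rc;
        do
        {
          rc= poll(&p_fd, 1, TIMEOUT_MS);
        } while (rc == -1 && errno == EINTR);

        if (rc == 0)
          puts("poll() timed out: no FIN/RST was delivered (the hang)");
        else
          puts("poll() returned: the kernel reset the queued connection");
        return 0;
      }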

      How do we fix the problem and avoid these very random and mysterious failures in the test suite? Another attempt should be made to report the problem upstream to the kernel developers. But even if the kernel is fixed, we will need some workaround in the testsuite until all Buildbot kernels have been upgraded with the fix.

      Maybe we need to reduce the default value of mysqltest --connect-timeout to, say, 20 seconds, so that when the hang does happen it is short enough not to cause the test to fail.
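      For illustration, here is a hedged sketch of how a client bounds the connect phase with the standard MYSQL_OPT_CONNECT_TIMEOUT option in libmariadb, which is where (as far as I understand it) the mysqltest --connect-timeout value ends up; the 20-second value is only the suggestion above, not an agreed-on default:

      /* Build against libmariadb, e.g. with $(mariadb_config --cflags --libs). */
      #include <mysql.h>
      #include <stdio.h>

      int main(void)
      {
        MYSQL *mysql= mysql_init(NULL);
        unsigned int timeout= 20;     /* seconds; the proposed lower default */

        /* Cap the connect handshake at `timeout' seconds. */
        mysql_options(mysql, MYSQL_OPT_CONNECT_TIMEOUT, &timeout);

        if (!mysql_real_connect(mysql, "127.0.0.1", "root", "", "test",
                                19870, NULL, 0))
        {
          /* When the kernel bug is hit, this now fails after ~20 seconds
             instead of hanging for the full 120 seconds. */
          fprintf(stderr, "connect failed: %s\n", mysql_error(mysql));
        }
        mysql_close(mysql);
        return 0;
      }

      Whatever default we pick needs to stay above the time a legitimately slow connect can take on loaded Buildbot hosts, so that reconnect loops such as wait_until_connected_again.inc do not start failing for the wrong reason.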

      In more technical detail, here is an example of how this happens, with the evidence for the kernel bug from tcpdump:

      The test was run as follows, after reverting the above-mentioned workaround:

      (cd mysql-test/ && for i in $(seq 20) ; do script -e -c './mtr rpl.rpl_gtid_crash{,,}{,,}{,,} --parallel=81 --repeat=50 --client-rr' || exit 1 ; done)
      

      A test fails after some time, and the server error log shows the 120-second hang between one server crashing/stopping and the new server starting:

      rpl.rpl_gtid_crash 'stmt'                w52 [ 9 fail ]
      

      2025-03-17 17:47:01 5 [ERROR] mariadbd: Error writing file 'master-bin' (errno: 28 "No space left on device")
      dispatch_command: crash_dispatch_command_before: now
      SIGKILL myself
      $ /kvm/src/my/dev/mariadb2/sql/mariadbd --defaults-group-suffix=.1 --defaults-file=/kvm/src/my/dev/mariadb2/mysql-test/var/52/my.cnf --log-output=file --binlog-format=statement --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --log-bin=master-bin --loose-skip-stack-trace --loose-debug-sync-timeout=300
      2025-03-17 17:49:01 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32214)
      

      The --rr trace of the mysqltest run shows how poll() times out after 120 seconds:

      pvio_socket_wait_io_or_timeout (pvio=0x55f55ea3d3c0, is_read=1 '\001', 
          timeout=120000)
          at /kvm/src/my/dev/mariadb2/libmariadb/plugins/pvio/pvio_socket.c:509
       
      540	      rc= poll(&p_fd, 1, timeout);
      Elapsed Time (s): 12.267555
      (rr) 
      541	    } while (rc == -1 && errno == EINTR);
      Elapsed Time (s): 132.364463
      

      Here is the stack trace at the point of the poll(), showing that this happens in the reconnect code; in this test it is inside the include/wait_until_connected_again.inc include file:

      #0  pvio_socket_wait_io_or_timeout (pvio=0x558e0f5d6900, is_read=1 '\001', 
          timeout=90000)
          at /kvm/src/my/dev/mariadb2/libmariadb/plugins/pvio/pvio_socket.c:533
      #1  0x0000558dfc3965c0 in pvio_socket_read (pvio=0x558e0f5d6900, 
          buffer=0x558e0f5e59b0 "", length=16384)
          at /kvm/src/my/dev/mariadb2/libmariadb/plugins/pvio/pvio_socket.c:318
      #2  0x0000558dfc3792c8 in ma_pvio_read (pvio=0x558e0f5d6900, 
          buffer=0x558e0f5e59b0 "", length=16384)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/ma_pvio.c:255
      #3  0x0000558dfc379484 in ma_pvio_cache_read (pvio=0x558e0f5d6900, 
          buffer=0x558e0f5dcfc0 "", length=4)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/ma_pvio.c:297
      #4  0x0000558dfc39b140 in ma_real_read (net=0x7fffda47c000, 
          complen=0x7fffda47b9d8)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/ma_net.c:385
      #5  0x0000558dfc39b375 in ma_net_read (net=0x7fffda47c000)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/ma_net.c:439
      #6  0x0000558dfc36ca30 in ma_net_safe_read (mysql=0x7fffda47c000)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/mariadb_lib.c:210
      #7  0x0000558dfc36fd84 in mthd_my_real_connect (mysql=0x7fffda47c000, 
          host=0x558e0f4773b0 "127.0.0.1", user=0x558e0f46cc30 "root", 
          passwd=0x558e0f462170 "", db=0x558e0f4502f0 "test", port=19870, 
          unix_socket=0x0, client_flag=2160042636)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/mariadb_lib.c:1561
      #8  0x0000558dfc36f62e in mysql_real_connect (mysql=0x7fffda47c000, 
          host=0x558e0f4773b0 "127.0.0.1", user=0x558e0f46cc30 "root", 
          passwd=0x558e0f462170 "", db=0x558e0f4502f0 "test", port=19870, 
          unix_socket=0x0, client_flag=2160042636)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/mariadb_lib.c:1363
      #9  0x0000558dfc371053 in mariadb_reconnect (mysql=0x558e0eaab850)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/mariadb_lib.c:1841
      #10 0x0000558dfc36d13d in mthd_my_send_cmd (mysql=0x558e0eaab850, 
          command=COM_QUERY, arg=0x558e0f5dcec8 "show status", length=11, 
          skipp_check=1 '\001', opt_arg=0x0)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/mariadb_lib.c:398
      #11 0x0000558dfc36d53c in ma_simple_command (mysql=0x558e0eaab850, 
          command=COM_QUERY, arg=0x558e0f5dcec8 "show status", length=11, 
          skipp_check=1 '\001', opt_arg=0x0)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/mariadb_lib.c:472
      #12 0x0000558dfc3721a6 in mysql_send_query (mysql=0x558e0eaab850, 
          query=0x558e0f5dcec8 "show status", length=11)
          at /kvm/src/my/dev/mariadb2/libmariadb/libmariadb/mariadb_lib.c:2213
      #13 0x0000558dfc34b4b7 in wrap_mysql_send_query (mysql=0x558e0eaab850, 
          q=0x558e0f5dcec8 "show status", length=11)
          at /kvm/src/my/dev/mariadb2/client/../tests/nonblock-wrappers.h:211
      #14 0x0000558dfc35d768 in run_query_normal (cn=0x558e0ea24c88, 
          command=0x558e0f5dcd88, flags=3, query=0x558e0f5dcec8 "show status", 
          query_len=11, ds=0x558dfc76c240 <ds_res>, ds_warnings=0x7fffda47c750)
          at /kvm/src/my/dev/mariadb2/client/mysqltest.cc:8153
      #15 0x0000558dfc3606a3 in run_query (cn=0x558e0ea24c88, 
          command=0x558e0f5dcd88, flags=3)
          at /kvm/src/my/dev/mariadb2/client/mysqltest.cc:9538
      #16 0x0000558dfc362632 in main (argc=15, argv=0x7fffda47cb78)
          at /kvm/src/my/dev/mariadb2/client/mysqltest.cc:10381
      

      And here are the packets matching the client and server ports from tcpdump during the test run:

      $ tcpdump -r dump2.pcap port 37946 and port 20530
      reading from file dump2.pcap, link-type EN10MB (Ethernet), snapshot length 192
      17:47:01.532107 IP localhost.localdomain.37946 > localhost.localdomain.20530: Flags [S], seq 1614537942, win 65495, options [mss 65495,sackOK,TS val 2023193483 ecr 0,nop,wscale 7], length 0
      17:47:01.532128 IP localhost.localdomain.20530 > localhost.localdomain.37946: Flags [S.], seq 966114325, ack 1614537943, win 65483, options [mss 65495,sackOK,TS val 2023193483 ecr 2023193483,nop,wscale 7], length 0
      17:47:01.532138 IP localhost.localdomain.37946 > localhost.localdomain.20530: Flags [.], ack 1, win 512, options [nop,nop,TS val 2023193483 ecr 2023193483], length 0
      17:49:01.629162 IP localhost.localdomain.37946 > localhost.localdomain.20530: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 2023313580 ecr 2023193483], length 0
      17:49:01.629168 IP localhost.localdomain.20530 > localhost.localdomain.37946: Flags [R], seq 966114326, win 0, length 0
      

      This clearly shows how the client connection succeeds just as the server is being killed, and that no FIN or RST packet is sent to the client as it should be. Only after the 120-second client timeout does the client send a FIN (and the server replies with RST, as the socket is long gone).

            People

              Assignee: Unassigned
              Reporter: Kristian Nielsen (knielsen)
