Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-31427

MariaDB replication server's SQL Thread stuck at 'Waiting for prior transaction to commit'

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Duplicate
    • 10.6.10
    • 10.6.11
    • Replication
    • hw.system.vendor=HP
      hw.system.model=DL380 Gen10
      hw.cpu.vendor=Intel
      hw.cpu.model=Xeon Silver 4116 CPU
      hw.cpu.clock.speed.in.ghz=2.10
      hw.cpu.quantity=48
      hw.ram.in.gb=256
      os=RHEL 7.9

    Description

      Replication hangs often at a one Master - One Replica setup.
      From processlist the SQL worker threads seems to wait indefinitely with 'Waiting for prior transaction to commit' status.
      Once this happens replication is stuck and STOP SLAVE does not work, have to kill manually then works well for few days and issue occurs again.
      The queries before Relay_Log_Pos also does not seem out of ordinary (an insert and an update with PK).

      MariaDB : v10.6.10

      Processlist

      Id User Host db Command Time State Info Progress
      11 system user NULL Slave_IO 986857 Waiting for master to send event NULL 0.000
      13 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      14 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      15 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      16 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      18 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      17 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      20 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      21 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      19 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      22 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      23 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      24 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      25 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      26 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      27 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      28 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      29 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      30 system user NULL Slave_worker 57879 closing tables NULL 0.000
      31 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      32 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      34 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      33 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      35 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      36 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      37 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      38 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      39 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      40 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      41 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      42 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      43 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      44 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      45 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      46 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      47 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      48 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      49 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      50 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      51 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      52 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      53 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      54 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      55 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      56 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      57 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      58 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      59 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      60 system user NULL Slave_worker 57878 Waiting for prior transaction to commit NULL 0.000
      12 system user NULL Slave_SQL 57999 Waiting for room in worker thread event queue NULL 0.000

      Slave status

      Slave_IO_State: Waiting for master to send event
      Master_Host: a.b.c.d
      Master_User: repl
      Master_Port: 3307
      Connect_Retry: 60
      Master_Log_File: binary-log.024465
      Read_Master_Log_Pos: 312810975
      Relay_Log_File: mysql-1-relay-bin.004631
      Relay_Log_Pos: 201089830
      Relay_Master_Log_File: binary-log.024265
      Slave_IO_Running: Yes
      Slave_SQL_Running: Yes
      Replicate_Do_DB:
      Replicate_Ignore_DB: mysql
      Replicate_Do_Table:
      Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
      Replicate_Wild_Ignore_Table:
      Last_Errno: 0
      Last_Error:
      Skip_Counter: 0
      Exec_Master_Log_Pos: 610809588
      Relay_Log_Space: 76470879853
      Until_Condition: None
      Until_Log_File:
      Until_Log_Pos: 0
      Master_SSL_Allowed: No
      Master_SSL_CA_File:
      Master_SSL_CA_Path:
      Master_SSL_Cert:
      Master_SSL_Cipher:
      Master_SSL_Key:
      Seconds_Behind_Master: 57879
      Master_SSL_Verify_Server_Cert: No
      Last_IO_Errno: 0
      Last_IO_Error:
      Last_SQL_Errno: 0
      Last_SQL_Error:
      Replicate_Ignore_Server_Ids:
      Master_Server_Id: 2
      Master_SSL_Crl:
      Master_SSL_Crlpath:
      Using_Gtid: Current_Pos
      Gtid_IO_Pos: 0-2-703347858
      Replicate_Do_Domain_Ids:
      Replicate_Ignore_Domain_Ids:
      Parallel_Mode: optimistic
      SQL_Delay: 0
      SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Waiting for room in worker thread event queue
      Slave_DDL_Groups: 21059
      Slave_Non_Transactional_Groups: 0
      Slave_Transactional_Groups: 71502590

      Relay log events

      show relaylog events in 'mysql-1-relay-bin.004631' from 201088966 limit 20;
      -------------------------------------------------------------------------------------------------------------+

      Log_name Pos Event_type Server_id End_log_pos Info

      -------------------------------------------------------------------------------------------------------------+

      mysql-1-relay-bin.004631 201088966 Xid 2 610806615 COMMIT /* xid=33545593056 */
      mysql-1-relay-bin.004631 201088997 Gtid 2 610806659 BEGIN GTID 0-2-696762571 cid=33545631902
      mysql-1-relay-bin.004631 201089041 Table_map 2 610807665 table_id: 16927 (aDatabase.table1)
      mysql-1-relay-bin.004631 201089153 Write_rows_v1 2 610807864 table_id: 16927 flags: STMT_END_F
      mysql-1-relay-bin.004631 201089352 Table_map 2 610809239 table_id: 101 (aDatabase.table2)
      mysql-1-relay-bin.004631 201089481 Update_rows_v1 2 610809557 table_id: 101 flags: STMT_END_F
      mysql-1-relay-bin.004631 201089799 Xid 2 610809588 COMMIT /* xid=33545619314 */
      mysql-1-relay-bin.004631 201089830 Gtid 2 610809632 BEGIN GTID 0-2-696762572 cid=33545631902
      mysql-1-relay-bin.004631 201089874 Table_map 2 610810941 table_id: 16924 (aDatabase.table3)
      mysql-1-relay-bin.004631 201089988 Write_rows_v1 2 610811072 table_id: 16924 flags: STMT_END_F
      mysql-1-relay-bin.004631 201090119 Table_map 2 610823353 table_id: 16928 (aDatabase.table4)
      mysql-1-relay-bin.004631 201090240 Write_rows_v1 2 610829138 table_id: 16928 flags: STMT_END_F
      mysql-1-relay-bin.004631 201096025 Table_map 2 610830574 table_id: 72 (aDatabase.table5)
      mysql-1-relay-bin.004631 201096152 Update_rows_v1 2 610831036 table_id: 72 flags: STMT_END_F
      mysql-1-relay-bin.004631 201096614 Table_map 2 610832472 table_id: 72 (aDatabase.table5)
      mysql-1-relay-bin.004631 201096741 Update_rows_v1 2 610832934 table_id: 72 flags: STMT_END_F
      mysql-1-relay-bin.004631 201097203 Table_map 2 610834371 table_id: 72 (aDatabase.table5)
      mysql-1-relay-bin.004631 201097330 Update_rows_v1 2 610834833 table_id: 72 flags: STMT_END_F
      mysql-1-relay-bin.004631 201097792 Table_map 2 610836271 table_id: 72 (aDatabase.table5)
      mysql-1-relay-bin.004631 201097919 Update_rows_v1 2 610836733 table_id: 72 flags: STMT_END_F

      -------------------------------------------------------------------------------------------------------------+
      20 rows in set (0.078 sec)

      Replica my.cnf

      binlog-format=ROW
      binlog-ignore-db=mysql
      binlog-ignore-db=performance_schema
      binlog-ignore-db=test
      binlog_cache_size=4M
      binlog_commit_wait_count=48
      binlog_commit_wait_usec=20000
      character_set_server=utf8
      datadir=/data/mysql-1
      expire_logs_days=3
      general-log=0
      general_log_file=/var/opt/na/log/mysql-1/general-query.log
      gtid_strict_mode=1
      innodb=FORCE
      innodb_buffer_pool_chunk_size=128M
      innodb_buffer_pool_size=192G
      innodb_data_file_path=ibdata:100M:autoextend
      innodb_data_home_dir=/data/mysql-1
      innodb_fast_shutdown=1
      innodb_file_per_table=1
      innodb_flush_log_at_trx_commit=2
      innodb_flush_method=O_DIRECT
      innodb_force_recovery=0
      innodb_io_capacity=3000
      innodb_log_buffer_size=8M
      innodb_log_file_size=2047MB
      innodb_log_group_home_dir=/data/mysql-1
      innodb_open_files=10000
      innodb_print_all_deadlocks=1
      innodb_purge_threads=1
      interactive_timeout=420
      join_buffer_size=256K
      join_buffer_space_limit=2M
      join_cache_level=2
      log-bin=binary-log.
      log-error=/var/opt/na/log/mysql-1/error.log
      max_allowed_packet=33554432
      max_connections=2300
      max_heap_table_size=64M
      open-files-limit=65535
      optimizer_switch=default
      pid-file=/var/opt/na/run/mysql-1.pid
      port=3307
      query_cache_type=0
      replicate-ignore-db=mysql
      server-id=1
      skip-name-resolve
      slave_net_timeout=60
      slave_parallel_max_queued=1048576
      slave_parallel_mode=optimistic
      slave_parallel_threads=48
      slow-query-log=0
      slow_query_log_file=/var/opt/na/log/mysql-1/slow-query.log
      socket=/data/mysql-1/mysql.sock
      sql_mode=ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER
      sync_binlog=0
      sync_master_info=0
      sysdate_is_now=TRUE
      table_definition_cache=500
      table_open_cache=10000
      thread_handling=pool-of-threads
      tmp_table_size=64M
      tmpdir=/data/tmp/mysql-1
      transaction-isolation=READ-COMMITTED

      Attachments

        1. gbd_out.txt
          594 kB
        2. innodb.txt
          22 kB

        Issue Links

          Activity

            Thank you for your efforts. We will upgrade to latest release v10.6.14

            mariadbuser Mohamed Ismail added a comment - Thank you for your efforts. We will upgrade to latest release v10.6.14

            @knielsen , Would like to know if there's any work around this issue until we plan an upgrade.

            mariadbuser Mohamed Ismail added a comment - @knielsen , Would like to know if there's any work around this issue until we plan an upgrade.

            mariadbuser, you can try setting --thread-handling=one-thread-per-connection .
            If I read the code correctly, that will avoid the code path that hangs.

            knielsen Kristian Nielsen added a comment - mariadbuser , you can try setting --thread-handling=one-thread-per-connection . If I read the code correctly, that will avoid the code path that hangs.
            Elkin Andrei Elkin added a comment -

            knielsen, I also looked at this case presented in our mailing list, to reply with a guess of
            a hang/loop inside innodb:

            From: andrei.elkin@pp.inet.fi
            Subject: Re: [Maria-discuss] Backup on the replication server getting affected
            To: ragul rangarajan <ragulrangarajan@gmail.com>
            Cc: MariaDB discuss <maria-discuss@lists.launchpad.net>
            Date: Fri, 02 Jun 2023 10:46:37 +0300 (1 week, 1 hour, 17 minutes ago)
            Organization: Home sweet home
             
            Howdy Ragul,
             
            > Hi Andrei,
            >
            > Do we have any procedures to reproduce the issue MDEV-30780?
             
            Thanks for posting the gdb bt:s. They rule out 30780 yet not
            suggesting to me enough about the hang reason. This is something new to
            me and does deserve filing an MDEV ticket.
            Still I'd defer that until one has confirmed the same issue is seen
            on the latest 10.6. So you could run your load against the most recent
            slave version that'd be at least the safest (for our time).
             
            It might be (a slave worker) Thread 80 spinning inside
             
               #6  0x000055de407a0a3c in log_write_up_to (lsn=<optimized out>, lsn@entry=216757233923297, flush_to_disk=flush_to_disk@entry=false, rotate_key=rotate_key@entry=false, 
             
            a goto repeat "loop".
            That hopefully you can confirm any next time the hang appears back.
            Could you please check whether #6 calls iteratively indeed
            `group_commit_lock::release()`? (With  e.g
              (gdb) br thd_decrement_pending_ops thread 80
            of course the number may change:-)).
            
            

            While you must've penetrated deeply, perhaps you can confirm my guess?
            Thank you for looking into it!

            Andrei

            Elkin Andrei Elkin added a comment - knielsen , I also looked at this case presented in our mailing list, to reply with a guess of a hang/loop inside innodb: From: andrei.elkin@pp.inet.fi Subject: Re: [Maria-discuss] Backup on the replication server getting affected To: ragul rangarajan <ragulrangarajan@gmail.com> Cc: MariaDB discuss <maria-discuss@lists.launchpad.net> Date: Fri, 02 Jun 2023 10:46:37 +0300 (1 week, 1 hour, 17 minutes ago) Organization: Home sweet home   Howdy Ragul,   > Hi Andrei, > > Do we have any procedures to reproduce the issue MDEV-30780?   Thanks for posting the gdb bt:s. They rule out 30780 yet not suggesting to me enough about the hang reason. This is something new to me and does deserve filing an MDEV ticket. Still I'd defer that until one has confirmed the same issue is seen on the latest 10.6. So you could run your load against the most recent slave version that'd be at least the safest (for our time).   It might be (a slave worker) Thread 80 spinning inside   #6 0x000055de407a0a3c in log_write_up_to (lsn=<optimized out>, lsn@entry=216757233923297, flush_to_disk=flush_to_disk@entry=false, rotate_key=rotate_key@entry=false,   a goto repeat "loop". That hopefully you can confirm any next time the hang appears back. Could you please check whether #6 calls iteratively indeed `group_commit_lock::release()`? (With e.g (gdb) br thd_decrement_pending_ops thread 80 of course the number may change:-)). While you must've penetrated deeply, perhaps you can confirm my guess? Thank you for looking into it! Andrei

            Elkin, I agree the issue on the mailing list looks precisely like this issue. I wrote a reply on the list.

            - Kristian.

            knielsen Kristian Nielsen added a comment - Elkin , I agree the issue on the mailing list looks precisely like this issue. I wrote a reply on the list. - Kristian.

            People

              knielsen Kristian Nielsen
              mariadbuser Mohamed Ismail
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.