Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30780

parallel slave hangs after hit an error

Details

    Description

      After a parallel worker thread hits an error that must be error-stopping the slave
      show slave status does display the error along with YES of the slave running status, e.g

      show slave status\G
       
      *************************** 1. row ***************************
       
                      Slave_IO_State: Waiting for master to send event
       
                         Master_Host: 172.31.15.61
       
                         Master_User: db02replication
       
                         Master_Port: 3306
       
                       Connect_Retry: 60
       
                     Master_Log_File: mysql-bin.028940
       
                 Read_Master_Log_Pos: 1050157656
       
                      Relay_Log_File: relay-bin.000134
       
                       Relay_Log_Pos: 964684321
       
               Relay_Master_Log_File: mysql-bin.028938
       
                    Slave_IO_Running: Yes
       
                   Slave_SQL_Running: Yes
       
                     Replicate_Do_DB: 
       
                 Replicate_Ignore_DB: 
       
                  Replicate_Do_Table: 
       
              Replicate_Ignore_Table: 
       
             Replicate_Wild_Do_Table: 
       
         Replicate_Wild_Ignore_Table: 
       
                          Last_Errno: 1062
       
                          Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486
       
                        Skip_Counter: 0
       
                 Exec_Master_Log_Pos: 964684022
      

      Slave threads however instead of expected exiting may hang like

      +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
      | Id   | User         | Host               | db   | Command      | Time  | State                                         | Info             | Progress |
      +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
      |    5 | system user  |                    | NULL | Slave_IO     | 51160 | Waiting for master to send event              | NULL             |    0.000 |
      |   19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep        |     5 |                                               | NULL             |    0.000 |
      |   61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep        |    10 |                                               | NULL             |    0.000 |
      | 2394 | system user  |                    | NULL | Slave_worker | 50852 | closing tables                                | NULL             |    0.000 |
      | 2395 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2396 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2397 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2398 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2399 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2400 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2401 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2402 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2403 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2404 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2405 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2393 | system user  |                    | NULL | Slave_SQL    | 50860 | Waiting for room in worker thread event queue | NULL             |    0.000 |
      

      Slave_SQL may also hang in a different state.

      Upon analysis it turned out that closing tables worker got entrapped in endless looping
      in mark_start_commit_inner across already garbage-collected items including rgi->gco itself.
      The reason of the belated access is identified as possible out-of-order group committing
      in the error branch.

      The issue applies to both the conservative and optimistic modes.
      A patch, to be committed soon, fixes the case to reinforce group_commit_orderer-based order for errored-out workers.

      Attachments

        Issue Links

          Activity

            Elkin Andrei Elkin created issue -
            julien.fritsch Julien Fritsch made changes -
            Field Original Value New Value
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.11 [ 27614 ]
            Elkin Andrei Elkin made changes -
            Description After a parallel worker thread hits an error that must be error-stopping the slave
            {{show slave status}} does display the error along with {{YES}} of the slave running status, e.g

            {noformat}
            show slave status\G

            *************************** 1. row ***************************

                            Slave_IO_State: Waiting for master to send event

                               Master_Host: 172.31.15.61

                               Master_User: db02replication

                               Master_Port: 3306

                             Connect_Retry: 60

                           Master_Log_File: mysql-bin.028940

                       Read_Master_Log_Pos: 1050157656

                            Relay_Log_File: relay-bin.000134

                             Relay_Log_Pos: 964684321

                     Relay_Master_Log_File: mysql-bin.028938

                          Slave_IO_Running: Yes

                         Slave_SQL_Running: Yes

                           Replicate_Do_DB:

                       Replicate_Ignore_DB:

                        Replicate_Do_Table:

                    Replicate_Ignore_Table:

                   Replicate_Wild_Do_Table:

               Replicate_Wild_Ignore_Table:

                                Last_Errno: 1062

                                Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486

                              Skip_Counter: 0

                       Exec_Master_Log_Pos: 964684022
            {noformat}

            Slave threads however instead of expected exiting may hang like


            {noformat}
            +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
            | Id | User | Host | db | Command | Time | State | Info | Progress |
            +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
            | 5 | system user | | NULL | Slave_IO | 51160 | Waiting for master to send event | NULL | 0.000 |
            | 19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep | 5 | | NULL | 0.000 |
            | 61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep | 10 | | NULL | 0.000 |
            | 2394 | system user | | NULL | Slave_worker | 50852 | closing tables | NULL | 0.000 |
            | 2395 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2396 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2397 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2398 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2399 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2400 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2401 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2402 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2403 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2404 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2405 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2393 | system user | | NULL | Slave_SQL | 50860 | Waiting for room in worker thread event queue | NULL | 0.000 |
            {noformat}
            Slave_SQL may also hang in a different state.

            Upon analysis it turned out that {{closing tables}} worker got entrapped in endless looping
            in {{mark_start_commit_inner}} across already garbage-collected items including {{rgi->gco}} itself.
            The reason of the belated access is identified as possible out-of-order group committing
            in the error branch.

            A patch, to be committed soon, fixes the case to simplify logics of garbage-collecting of {{group_commit_orderer}} objects. An instance can be removed into a free list only by a worker
            that finishes the last group-of-events associated with it.

            After a parallel worker thread hits an error that must be error-stopping the slave
            {{show slave status}} does display the error along with {{YES}} of the slave running status, e.g

            {noformat}
            show slave status\G

            *************************** 1. row ***************************

                            Slave_IO_State: Waiting for master to send event

                               Master_Host: 172.31.15.61

                               Master_User: db02replication

                               Master_Port: 3306

                             Connect_Retry: 60

                           Master_Log_File: mysql-bin.028940

                       Read_Master_Log_Pos: 1050157656

                            Relay_Log_File: relay-bin.000134

                             Relay_Log_Pos: 964684321

                     Relay_Master_Log_File: mysql-bin.028938

                          Slave_IO_Running: Yes

                         Slave_SQL_Running: Yes

                           Replicate_Do_DB:

                       Replicate_Ignore_DB:

                        Replicate_Do_Table:

                    Replicate_Ignore_Table:

                   Replicate_Wild_Do_Table:

               Replicate_Wild_Ignore_Table:

                                Last_Errno: 1062

                                Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486

                              Skip_Counter: 0

                       Exec_Master_Log_Pos: 964684022
            {noformat}

            Slave threads however instead of expected exiting may hang like


            {noformat}
            +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
            | Id | User | Host | db | Command | Time | State | Info | Progress |
            +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
            | 5 | system user | | NULL | Slave_IO | 51160 | Waiting for master to send event | NULL | 0.000 |
            | 19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep | 5 | | NULL | 0.000 |
            | 61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep | 10 | | NULL | 0.000 |
            | 2394 | system user | | NULL | Slave_worker | 50852 | closing tables | NULL | 0.000 |
            | 2395 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2396 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2397 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2398 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2399 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2400 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2401 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2402 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2403 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2404 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2405 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2393 | system user | | NULL | Slave_SQL | 50860 | Waiting for room in worker thread event queue | NULL | 0.000 |
            {noformat}
            Slave_SQL may also hang in a different state.

            Upon analysis it turned out that {{closing tables}} worker got entrapped in endless looping
            in {{mark_start_commit_inner}} across already garbage-collected items including {{rgi->gco}} itself.
            The reason of the belated access is identified as possible out-of-order group committing
            in the error branch.

            A patch, to be committed soon, fixes the case to reinforce {{group_commit_orderer}}-based order for errored-out workers.

            Elkin Andrei Elkin made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            Elkin Andrei Elkin made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            Elkin Andrei Elkin added a comment -

            The patch is pushed to bb-10.4-andrei.

            Elkin Andrei Elkin added a comment - The patch is pushed to bb-10.4-andrei.
            Elkin Andrei Elkin made changes -
            Assignee Andrei Elkin [ elkin ] Brandon Nesterenko [ JIRAUSER48702 ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            julien.fritsch Julien Fritsch made changes -
            rpizzi Rick Pizzi (Inactive) added a comment - - edited

            Elkin I just hit this bug with a client which has CONSERVATIVE parallel replication.
            Will the patch cover that case as well?

            MariaDB [(none)]> show processlist;
            +---------+----------------------------+----------------------+--------------+--------------+---------+-----------------------------------------------+------------------+----------+
            | Id      | User                       | Host                 | db           | Command      | Time    | State                                         | Info             | Progress |
            +---------+----------------------------+----------------------+--------------+--------------+---------+-----------------------------------------------+------------------+----------+
            |      15 | system user                |                      | NULL         | Slave_IO     | 7336114 | Waiting for master to send event              | NULL             |    0.000 |
            |      17 | system user                |                      | NULL         | Slave_worker |  123681 | closing tables                                | NULL             |    0.000 |
            |      18 | system user                |                      | NULL         | Slave_worker |  123681 | Waiting for prior transaction to start commit | NULL             |    0.000 |
            |      19 | system user                |                      | NULL         | Slave_worker |  123681 | Waiting for prior transaction to start commit | NULL             |    0.000 |
            |      20 | system user                |                      | NULL         | Slave_worker |  123681 | Waiting for prior transaction to start commit | NULL             |    0.000 |
            |      16 | system user                |                      | NULL         | Slave_SQL    |  123685 | Reading event from the relay log              | NULL             |    0.000 |
            | 1319544 | skysql_maxscale_db00006746 | 10.103.64.226:46248  | NULL         | Sleep        |       1 |                                               | NULL             |    0.000 |
            | 1640887 | readonly_user              | 35.188.30.0:62489    | NULL         | Sleep        |     112 |                                               | NULL             |    0.000 |
            | 1640888 | readonly_user              | 35.188.30.0:62491    | chariot_case | Sleep        |     112 |                                               | NULL             |    0.000 |
            | 1644053 | readonly_user              | 35.188.30.0:65440    | chariot_case | Sleep        |      83 |                                               | NULL             |    0.000 |
            | 1650211 | root                       | localhost            | NULL         | Query        |   55897 | Killing slave                                 | stop slave       |    0.000 |
            

            rpizzi Rick Pizzi (Inactive) added a comment - - edited Elkin I just hit this bug with a client which has CONSERVATIVE parallel replication. Will the patch cover that case as well? MariaDB [(none)]> show processlist; +---------+----------------------------+----------------------+--------------+--------------+---------+-----------------------------------------------+------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +---------+----------------------------+----------------------+--------------+--------------+---------+-----------------------------------------------+------------------+----------+ | 15 | system user | | NULL | Slave_IO | 7336114 | Waiting for master to send event | NULL | 0.000 | | 17 | system user | | NULL | Slave_worker | 123681 | closing tables | NULL | 0.000 | | 18 | system user | | NULL | Slave_worker | 123681 | Waiting for prior transaction to start commit | NULL | 0.000 | | 19 | system user | | NULL | Slave_worker | 123681 | Waiting for prior transaction to start commit | NULL | 0.000 | | 20 | system user | | NULL | Slave_worker | 123681 | Waiting for prior transaction to start commit | NULL | 0.000 | | 16 | system user | | NULL | Slave_SQL | 123685 | Reading event from the relay log | NULL | 0.000 | | 1319544 | skysql_maxscale_db00006746 | 10.103.64.226:46248 | NULL | Sleep | 1 | | NULL | 0.000 | | 1640887 | readonly_user | 35.188.30.0:62489 | NULL | Sleep | 112 | | NULL | 0.000 | | 1640888 | readonly_user | 35.188.30.0:62491 | chariot_case | Sleep | 112 | | NULL | 0.000 | | 1644053 | readonly_user | 35.188.30.0:65440 | chariot_case | Sleep | 83 | | NULL | 0.000 | | 1650211 | root | localhost | NULL | Query | 55897 | Killing slave | stop slave | 0.000 |
            Elkin Andrei Elkin made changes -
            Summary optimistic parallel slave hangs after hit an error parallel slave hangs after hit an error
            Elkin Andrei Elkin made changes -
            Description After a parallel worker thread hits an error that must be error-stopping the slave
            {{show slave status}} does display the error along with {{YES}} of the slave running status, e.g

            {noformat}
            show slave status\G

            *************************** 1. row ***************************

                            Slave_IO_State: Waiting for master to send event

                               Master_Host: 172.31.15.61

                               Master_User: db02replication

                               Master_Port: 3306

                             Connect_Retry: 60

                           Master_Log_File: mysql-bin.028940

                       Read_Master_Log_Pos: 1050157656

                            Relay_Log_File: relay-bin.000134

                             Relay_Log_Pos: 964684321

                     Relay_Master_Log_File: mysql-bin.028938

                          Slave_IO_Running: Yes

                         Slave_SQL_Running: Yes

                           Replicate_Do_DB:

                       Replicate_Ignore_DB:

                        Replicate_Do_Table:

                    Replicate_Ignore_Table:

                   Replicate_Wild_Do_Table:

               Replicate_Wild_Ignore_Table:

                                Last_Errno: 1062

                                Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486

                              Skip_Counter: 0

                       Exec_Master_Log_Pos: 964684022
            {noformat}

            Slave threads however instead of expected exiting may hang like


            {noformat}
            +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
            | Id | User | Host | db | Command | Time | State | Info | Progress |
            +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
            | 5 | system user | | NULL | Slave_IO | 51160 | Waiting for master to send event | NULL | 0.000 |
            | 19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep | 5 | | NULL | 0.000 |
            | 61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep | 10 | | NULL | 0.000 |
            | 2394 | system user | | NULL | Slave_worker | 50852 | closing tables | NULL | 0.000 |
            | 2395 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2396 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2397 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2398 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2399 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2400 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2401 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2402 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2403 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2404 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2405 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2393 | system user | | NULL | Slave_SQL | 50860 | Waiting for room in worker thread event queue | NULL | 0.000 |
            {noformat}
            Slave_SQL may also hang in a different state.

            Upon analysis it turned out that {{closing tables}} worker got entrapped in endless looping
            in {{mark_start_commit_inner}} across already garbage-collected items including {{rgi->gco}} itself.
            The reason of the belated access is identified as possible out-of-order group committing
            in the error branch.

            A patch, to be committed soon, fixes the case to reinforce {{group_commit_orderer}}-based order for errored-out workers.

            After a parallel worker thread hits an error that must be error-stopping the slave
            {{show slave status}} does display the error along with {{YES}} of the slave running status, e.g

            {noformat}
            show slave status\G

            *************************** 1. row ***************************

                            Slave_IO_State: Waiting for master to send event

                               Master_Host: 172.31.15.61

                               Master_User: db02replication

                               Master_Port: 3306

                             Connect_Retry: 60

                           Master_Log_File: mysql-bin.028940

                       Read_Master_Log_Pos: 1050157656

                            Relay_Log_File: relay-bin.000134

                             Relay_Log_Pos: 964684321

                     Relay_Master_Log_File: mysql-bin.028938

                          Slave_IO_Running: Yes

                         Slave_SQL_Running: Yes

                           Replicate_Do_DB:

                       Replicate_Ignore_DB:

                        Replicate_Do_Table:

                    Replicate_Ignore_Table:

                   Replicate_Wild_Do_Table:

               Replicate_Wild_Ignore_Table:

                                Last_Errno: 1062

                                Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486

                              Skip_Counter: 0

                       Exec_Master_Log_Pos: 964684022
            {noformat}

            Slave threads however instead of expected exiting may hang like


            {noformat}
            +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
            | Id | User | Host | db | Command | Time | State | Info | Progress |
            +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
            | 5 | system user | | NULL | Slave_IO | 51160 | Waiting for master to send event | NULL | 0.000 |
            | 19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep | 5 | | NULL | 0.000 |
            | 61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep | 10 | | NULL | 0.000 |
            | 2394 | system user | | NULL | Slave_worker | 50852 | closing tables | NULL | 0.000 |
            | 2395 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2396 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2397 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2398 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2399 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2400 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2401 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2402 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2403 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2404 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2405 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
            | 2393 | system user | | NULL | Slave_SQL | 50860 | Waiting for room in worker thread event queue | NULL | 0.000 |
            {noformat}
            Slave_SQL may also hang in a different state.

            Upon analysis it turned out that {{closing tables}} worker got entrapped in endless looping
            in {{mark_start_commit_inner}} across already garbage-collected items including {{rgi->gco}} itself.
            The reason of the belated access is identified as possible out-of-order group committing
            in the error branch.

            The issue applies to both the conservative and optimistic modes.
            A patch, to be committed soon, fixes the case to reinforce {{group_commit_orderer}}-based order for errored-out workers.

            Elkin Andrei Elkin added a comment - - edited

            rpizzi thanks for the heads up. I confirm the issue is more general that was originally stated in the summary. I've updated it now. The patch is at review and will be available fairly soon.

            Elkin Andrei Elkin added a comment - - edited rpizzi thanks for the heads up. I confirm the issue is more general that was originally stated in the summary. I've updated it now. The patch is at review and will be available fairly soon.

            Approved (discussed patch on Slack).

            bnestere Brandon Nesterenko added a comment - Approved (discussed patch on Slack).
            bnestere Brandon Nesterenko made changes -
            Assignee Brandon Nesterenko [ JIRAUSER48702 ] Andrei Elkin [ elkin ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            Elkin Andrei Elkin made changes -
            Fix Version/s 10.4.29 [ 28510 ]
            Fix Version/s 10.5.20 [ 28512 ]
            Fix Version/s 10.6.13 [ 28514 ]
            Fix Version/s 10.8.8 [ 28518 ]
            Fix Version/s 10.9.6 [ 28520 ]
            Fix Version/s 10.10.4 [ 28522 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.11 [ 27614 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            forke Marcin Wanat added a comment -

            Why 10.11 is not included in Fix Version? It it not affected or it took a bit more work to patch this version too?

            forke Marcin Wanat added a comment - Why 10.11 is not included in Fix Version? It it not affected or it took a bit more work to patch this version too?
            bnestere Brandon Nesterenko made changes -
            Fix Version/s 10.11.3 [ 28524 ]

            forke It was accidentally missed while closing the ticket, it will be fixed in 10.11.3.

            bnestere Brandon Nesterenko added a comment - forke It was accidentally missed while closing the ticket, it will be fixed in 10.11.3.
            angelique.sklavounos Angelique Sklavounos (Inactive) made changes -
            forke Marcin Wanat added a comment - - edited

            The issue is probably not fully fixed.

            We are experiencing this issue on 10.11.2 under high load in conservative mode. Master has binlog commit wait count = 10 and slaves have 12 replication workers, replication is ROW based. Master have 100k+ updates/inserts per second and few DROP/CREATE tables (we are often using CREATE LIKE+RENAME TABLE+DROP as faster alternative to truncate large tables under high load to reduce latency for client).

            I have manually replaced rpl_rli.cc, rpl_parallel.h and rpl_parallel.cc from 10.11.2 with latest files from github 10.11 branch (it has merged this patch) and recompiled from source.
            Replication can still hang the same way, but not that often as previously.

            EDIT:
            There is one change. It does not have "Closing tables" state, but all workerd are NULL or waiting:

            MariaDB [(none)]> SHOW PROCESSLIST;
            +---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+
            | Id      | User        | Host                      | db      | Command      | Time | State                                         | Info             | Progress |
            +---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+
            |       4 | system user |                           | NULL    | Slave_IO     | 5845 | Waiting for master to send event              | NULL             |    0.000 |
            |       6 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
            |       7 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |
            |       9 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |
            |       8 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |
            |      10 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
            |      11 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
            |      12 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |
            |      13 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
            |      14 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
            |      15 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
            |      16 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
            |      17 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
            |       5 | system user |                           | NULL    | Slave_SQL    |  318 | Waiting for room in worker thread event queue | NULL             |    0.000 |
            | 2524245 | root        | localhost                 | NULL    | Query        |    0 | starting                                      | SHOW PROCESSLIST |    0.000 |
            +---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+
            

            forke Marcin Wanat added a comment - - edited The issue is probably not fully fixed. We are experiencing this issue on 10.11.2 under high load in conservative mode. Master has binlog commit wait count = 10 and slaves have 12 replication workers, replication is ROW based. Master have 100k+ updates/inserts per second and few DROP/CREATE tables (we are often using CREATE LIKE+RENAME TABLE+DROP as faster alternative to truncate large tables under high load to reduce latency for client). I have manually replaced rpl_rli.cc, rpl_parallel.h and rpl_parallel.cc from 10.11.2 with latest files from github 10.11 branch (it has merged this patch) and recompiled from source. Replication can still hang the same way, but not that often as previously. EDIT: There is one change. It does not have "Closing tables" state, but all workerd are NULL or waiting: MariaDB [(none)]> SHOW PROCESSLIST; +---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+ | 4 | system user | | NULL | Slave_IO | 5845 | Waiting for master to send event | NULL | 0.000 | | 6 | system user | | NULL | Slave_worker | 311 | NULL | NULL | 0.000 | | 7 | system user | | NULL | Slave_worker | 311 | Waiting for prior transaction to commit | NULL | 0.000 | | 9 | system user | | NULL | Slave_worker | 311 | Waiting for prior transaction to commit | NULL | 0.000 | | 8 | system user | | NULL | Slave_worker | 311 | Waiting for prior transaction to commit | NULL | 0.000 | | 10 | system user | | NULL | Slave_worker | 311 | NULL | NULL | 0.000 | | 11 | system user | | NULL | Slave_worker | 311 | NULL | NULL | 0.000 | | 12 | system user | | NULL | Slave_worker | 311 | Waiting for prior transaction to commit | NULL | 0.000 | | 13 | system user | | NULL | Slave_worker | 311 | NULL | NULL | 0.000 | | 14 | system user | | NULL | Slave_worker | 311 | NULL | NULL | 0.000 | | 15 | system user | | NULL | Slave_worker | 311 | NULL | NULL | 0.000 | | 16 | system user | | NULL | Slave_worker | 311 | NULL | NULL | 0.000 | | 17 | system user | | NULL | Slave_worker | 311 | NULL | NULL | 0.000 | | 5 | system user | | NULL | Slave_SQL | 318 | Waiting for room in worker thread event queue | NULL | 0.000 | | 2524245 | root | localhost | NULL | Query | 0 | starting | SHOW PROCESSLIST | 0.000 | +---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+
            danblack Daniel Black added a comment -

            forke, sorry only just saw your message.

            Is the problem still occurring? Are you running on 10.11.3 now?

            Was the 10.11.2+changes compiled with debug symbols (file mariadbd doesn't show "stripped"), or debug info packages of 10.11.3 installed.

            If so, can you obtain a backtrace on the running mariadbd instance? ref.

            danblack Daniel Black added a comment - forke , sorry only just saw your message. Is the problem still occurring? Are you running on 10.11.3 now? Was the 10.11.2+changes compiled with debug symbols ( file mariadbd doesn't show "stripped"), or debug info packages of 10.11.3 installed . If so, can you obtain a backtrace on the running mariadbd instance? ref .
            forke Marcin Wanat added a comment -

            danblack I was trying this commit on 10.11.2 (before 10.11.3 was released) and the issue was NOT fixed. But after upgrading to 10.11.3 the issue is resolved, so probably some more changes was required to resolve this issue. The problem is that in 10.11.3 there is critical bug MDEV-31234 that result in UNDO logs to grow indefinitely.

            This way none of these releases are usable in production. 10.11.2 hangs replication for no reason and 10.11.3 grow UNDO logs indefinitely.

            forke Marcin Wanat added a comment - danblack I was trying this commit on 10.11.2 (before 10.11.3 was released) and the issue was NOT fixed. But after upgrading to 10.11.3 the issue is resolved, so probably some more changes was required to resolve this issue. The problem is that in 10.11.3 there is critical bug MDEV-31234 that result in UNDO logs to grow indefinitely. This way none of these releases are usable in production. 10.11.2 hangs replication for no reason and 10.11.3 grow UNDO logs indefinitely.
            danblack Daniel Black added a comment -

            Thanks for confirming forke, your notes in MDEV-31234 while not responded to have been read.

            danblack Daniel Black added a comment - Thanks for confirming forke , your notes in MDEV-31234 while not responded to have been read.
            RagulR Ragul added a comment -

            Hi Andrei Elkin, Do we have any procedures to reproduce this issue and test it?
            Facing similar issue which can't be reproduced but occurs at random.

            RagulR Ragul added a comment - Hi Andrei Elkin , Do we have any procedures to reproduce this issue and test it? Facing similar issue which can't be reproduced but occurs at random.
            Elkin Andrei Elkin added a comment -

            RagulR howdy, I've just replied in the mailing list that your stacks do not fit to this case. Please find there what would be good to do next. Thank you and good luck! Andrei

            Elkin Andrei Elkin added a comment - RagulR howdy, I've just replied in the mailing list that your stacks do not fit to this case. Please find there what would be good to do next. Thank you and good luck! Andrei
            bnestere Brandon Nesterenko made changes -
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 172527 171882

            People

              Elkin Andrei Elkin
              Elkin Andrei Elkin
              Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.