[MDEV-30780] parallel slave hangs after hit an error - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.4(EOL), 10.5, 10.6, 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11
Fix Version/s: 10.11.3, 10.4.29, 10.5.20, 10.6.13, 10.8.8, 10.9.6, 10.10.4
Component/s: Replication
Labels:
None

Description

After a parallel worker thread hits an error that must be error-stopping the slave
show slave status does display the error along with YES of the slave running status, e.g

show slave status\G

*************************** 1. row ***************************

                Slave_IO_State: Waiting for master to send event

                   Master_Host: 172.31.15.61

                   Master_User: db02replication

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: mysql-bin.028940

           Read_Master_Log_Pos: 1050157656

                Relay_Log_File: relay-bin.000134

                 Relay_Log_Pos: 964684321

         Relay_Master_Log_File: mysql-bin.028938

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 1062

                    Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 964684022

Slave threads however instead of expected exiting may hang like

+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+

| Id   | User         | Host               | db   | Command      | Time  | State                                         | Info             | Progress |

+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+

|    5 | system user  |                    | NULL | Slave_IO     | 51160 | Waiting for master to send event              | NULL             |    0.000 |

|   19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep        |     5 |                                               | NULL             |    0.000 |

|   61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep        |    10 |                                               | NULL             |    0.000 |

| 2394 | system user  |                    | NULL | Slave_worker | 50852 | closing tables                                | NULL             |    0.000 |

| 2395 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2396 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2397 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2398 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2399 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2400 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2401 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2402 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2403 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2404 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2405 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |

| 2393 | system user  |                    | NULL | Slave_SQL    | 50860 | Waiting for room in worker thread event queue | NULL             |    0.000 |

Slave_SQL may also hang in a different state.

Upon analysis it turned out that closing tables worker got entrapped in endless looping
in mark_start_commit_inner across already garbage-collected items including rgi->gco itself.
The reason of the belated access is identified as possible out-of-order group committing
in the error branch.

The issue applies to both the conservative and optimistic modes.
A patch, to be committed soon, fixes the case to reinforce group_commit_orderer-based order for errored-out workers.

Attachments

Issue Links

relates to

MDEV-31052 Parallel slave hangs with binlog_alter_two_phase=ON and ALTER replication error

Open

MDEV-31448 Killing a replica thread awaiting its GCO can hang/crash a parallel replica

Closed

Activity

Ascending order - Click to sort in descending order

Andrei Elkin created issue - 2023-03-03 18:48

Julien Fritsch made changes - 2023-03-06 07:04

Field	Original Value	New Value
Fix Version/s		10.4 [ 22408 ]
Fix Version/s		10.5 [ 23123 ]
Fix Version/s		10.6 [ 24028 ]
Fix Version/s		10.8 [ 26121 ]
Fix Version/s		10.9 [ 26905 ]
Fix Version/s		10.10 [ 27530 ]
Fix Version/s		10.11 [ 27614 ]

Andrei Elkin made changes - 2023-03-06 16:23

Description

After a parallel worker thread hits an error that must be error-stopping the slave
{{show slave status}} does display the error along with {{YES}} of the slave running status, e.g

{noformat}
show slave status\G

*************************** 1. row ***************************

                Slave_IO_State: Waiting for master to send event

                   Master_Host: 172.31.15.61

                   Master_User: db02replication

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: mysql-bin.028940

           Read_Master_Log_Pos: 1050157656

                Relay_Log_File: relay-bin.000134

                 Relay_Log_Pos: 964684321

         Relay_Master_Log_File: mysql-bin.028938

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 1062

                    Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 964684022
{noformat}

Slave threads however instead of expected exiting may hang like

{noformat}
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| 5 | system user | | NULL | Slave_IO | 51160 | Waiting for master to send event | NULL | 0.000 |
| 19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep | 5 | | NULL | 0.000 |
| 61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep | 10 | | NULL | 0.000 |
| 2394 | system user | | NULL | Slave_worker | 50852 | closing tables | NULL | 0.000 |
| 2395 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2396 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2397 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2398 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2399 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2400 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2401 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2402 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2403 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2404 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2405 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2393 | system user | | NULL | Slave_SQL | 50860 | Waiting for room in worker thread event queue | NULL | 0.000 |
{noformat}
Slave_SQL may also hang in a different state.

Upon analysis it turned out that {{closing tables}} worker got entrapped in endless looping
in {{mark_start_commit_inner}} across already garbage-collected items including {{rgi->gco}} itself.
The reason of the belated access is identified as possible out-of-order group committing
in the error branch.

A patch, to be committed soon, fixes the case to simplify logics of garbage-collecting of {{group_commit_orderer}} objects. An instance can be removed into a free list only by a worker
that finishes the last group-of-events associated with it.

After a parallel worker thread hits an error that must be error-stopping the slave
{{show slave status}} does display the error along with {{YES}} of the slave running status, e.g

{noformat}
show slave status\G

*************************** 1. row ***************************

                Slave_IO_State: Waiting for master to send event

                   Master_Host: 172.31.15.61

                   Master_User: db02replication

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: mysql-bin.028940

           Read_Master_Log_Pos: 1050157656

                Relay_Log_File: relay-bin.000134

                 Relay_Log_Pos: 964684321

         Relay_Master_Log_File: mysql-bin.028938

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 1062

                    Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 964684022
{noformat}

Slave threads however instead of expected exiting may hang like

{noformat}
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| 5 | system user | | NULL | Slave_IO | 51160 | Waiting for master to send event | NULL | 0.000 |
| 19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep | 5 | | NULL | 0.000 |
| 61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep | 10 | | NULL | 0.000 |
| 2394 | system user | | NULL | Slave_worker | 50852 | closing tables | NULL | 0.000 |
| 2395 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2396 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2397 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2398 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2399 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2400 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2401 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2402 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2403 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2404 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2405 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2393 | system user | | NULL | Slave_SQL | 50860 | Waiting for room in worker thread event queue | NULL | 0.000 |
{noformat}
Slave_SQL may also hang in a different state.

Upon analysis it turned out that {{closing tables}} worker got entrapped in endless looping
in {{mark_start_commit_inner}} across already garbage-collected items including {{rgi->gco}} itself.
The reason of the belated access is identified as possible out-of-order group committing
in the error branch.

A patch, to be committed soon, fixes the case to reinforce {{group_commit_orderer}}-based order for errored-out workers.

Andrei Elkin made changes - 2023-03-07 16:33

Status

Open [ 1 ]

Confirmed [ 10101 ]

Andrei Elkin made changes - 2023-03-07 16:33

Status

Confirmed [ 10101 ]

In Progress [ 3 ]

Andrei Elkin added a comment - 2023-03-07 16:34

The patch is pushed to bb-10.4-andrei.

Andrei Elkin added a comment - 2023-03-07 16:34 The patch is pushed to bb-10.4-andrei.

Andrei Elkin made changes - 2023-03-07 16:34

Assignee	Andrei Elkin [ elkin ]	Brandon Nesterenko [ JIRAUSER48702 ]
Status	In Progress [ 3 ]	In Review [ 10002 ]

Julien Fritsch made changes - 2023-03-08 08:52

Link

This issue duplicates MENT-1699 [ MENT-1699 ]

Rick Pizzi (Inactive) added a comment - 2023-03-11 11:17 - edited

Elkin I just hit this bug with a client which has CONSERVATIVE parallel replication.
Will the patch cover that case as well?

MariaDB [(none)]> show processlist;

+---------+----------------------------+----------------------+--------------+--------------+---------+-----------------------------------------------+------------------+----------+

| Id      | User                       | Host                 | db           | Command      | Time    | State                                         | Info             | Progress |

+---------+----------------------------+----------------------+--------------+--------------+---------+-----------------------------------------------+------------------+----------+

|      15 | system user                |                      | NULL         | Slave_IO     | 7336114 | Waiting for master to send event              | NULL             |    0.000 |

|      17 | system user                |                      | NULL         | Slave_worker |  123681 | closing tables                                | NULL             |    0.000 |

|      18 | system user                |                      | NULL         | Slave_worker |  123681 | Waiting for prior transaction to start commit | NULL             |    0.000 |

|      19 | system user                |                      | NULL         | Slave_worker |  123681 | Waiting for prior transaction to start commit | NULL             |    0.000 |

|      20 | system user                |                      | NULL         | Slave_worker |  123681 | Waiting for prior transaction to start commit | NULL             |    0.000 |

|      16 | system user                |                      | NULL         | Slave_SQL    |  123685 | Reading event from the relay log              | NULL             |    0.000 |

| 1319544 | skysql_maxscale_db00006746 | 10.103.64.226:46248  | NULL         | Sleep        |       1 |                                               | NULL             |    0.000 |

| 1640887 | readonly_user              | 35.188.30.0:62489    | NULL         | Sleep        |     112 |                                               | NULL             |    0.000 |

| 1640888 | readonly_user              | 35.188.30.0:62491    | chariot_case | Sleep        |     112 |                                               | NULL             |    0.000 |

| 1644053 | readonly_user              | 35.188.30.0:65440    | chariot_case | Sleep        |      83 |                                               | NULL             |    0.000 |

| 1650211 | root                       | localhost            | NULL         | Query        |   55897 | Killing slave                                 | stop slave       |    0.000 |

Andrei Elkin made changes - 2023-03-11 19:15

Summary

optimistic parallel slave hangs after hit an error

parallel slave hangs after hit an error

Andrei Elkin made changes - 2023-03-11 19:17

Description

After a parallel worker thread hits an error that must be error-stopping the slave
{{show slave status}} does display the error along with {{YES}} of the slave running status, e.g

{noformat}
show slave status\G

*************************** 1. row ***************************

                Slave_IO_State: Waiting for master to send event

                   Master_Host: 172.31.15.61

                   Master_User: db02replication

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: mysql-bin.028940

           Read_Master_Log_Pos: 1050157656

                Relay_Log_File: relay-bin.000134

                 Relay_Log_Pos: 964684321

         Relay_Master_Log_File: mysql-bin.028938

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 1062

                    Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 964684022
{noformat}

Slave threads however instead of expected exiting may hang like

{noformat}
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| 5 | system user | | NULL | Slave_IO | 51160 | Waiting for master to send event | NULL | 0.000 |
| 19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep | 5 | | NULL | 0.000 |
| 61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep | 10 | | NULL | 0.000 |
| 2394 | system user | | NULL | Slave_worker | 50852 | closing tables | NULL | 0.000 |
| 2395 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2396 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2397 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2398 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2399 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2400 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2401 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2402 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2403 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2404 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2405 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2393 | system user | | NULL | Slave_SQL | 50860 | Waiting for room in worker thread event queue | NULL | 0.000 |
{noformat}
Slave_SQL may also hang in a different state.

Upon analysis it turned out that {{closing tables}} worker got entrapped in endless looping
in {{mark_start_commit_inner}} across already garbage-collected items including {{rgi->gco}} itself.
The reason of the belated access is identified as possible out-of-order group committing
in the error branch.

A patch, to be committed soon, fixes the case to reinforce {{group_commit_orderer}}-based order for errored-out workers.

After a parallel worker thread hits an error that must be error-stopping the slave
{{show slave status}} does display the error along with {{YES}} of the slave running status, e.g

{noformat}
show slave status\G

*************************** 1. row ***************************

                Slave_IO_State: Waiting for master to send event

                   Master_Host: 172.31.15.61

                   Master_User: db02replication

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: mysql-bin.028940

           Read_Master_Log_Pos: 1050157656

                Relay_Log_File: relay-bin.000134

                 Relay_Log_Pos: 964684321

         Relay_Master_Log_File: mysql-bin.028938

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 1062

                    Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 964684022
{noformat}

Slave threads however instead of expected exiting may hang like

{noformat}
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| 5 | system user | | NULL | Slave_IO | 51160 | Waiting for master to send event | NULL | 0.000 |
| 19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep | 5 | | NULL | 0.000 |
| 61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep | 10 | | NULL | 0.000 |
| 2394 | system user | | NULL | Slave_worker | 50852 | closing tables | NULL | 0.000 |
| 2395 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2396 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2397 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2398 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2399 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2400 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2401 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2402 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2403 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2404 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2405 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2393 | system user | | NULL | Slave_SQL | 50860 | Waiting for room in worker thread event queue | NULL | 0.000 |
{noformat}
Slave_SQL may also hang in a different state.

Upon analysis it turned out that {{closing tables}} worker got entrapped in endless looping
in {{mark_start_commit_inner}} across already garbage-collected items including {{rgi->gco}} itself.
The reason of the belated access is identified as possible out-of-order group committing
in the error branch.

The issue applies to both the conservative and optimistic modes.
A patch, to be committed soon, fixes the case to reinforce {{group_commit_orderer}}-based order for errored-out workers.

Andrei Elkin added a comment - 2023-03-11 19:19 - edited

rpizzi thanks for the heads up. I confirm the issue is more general that was originally stated in the summary. I've updated it now. The patch is at review and will be available fairly soon.

Andrei Elkin added a comment - 2023-03-11 19:19 - edited rpizzi thanks for the heads up. I confirm the issue is more general that was originally stated in the summary. I've updated it now. The patch is at review and will be available fairly soon.

Brandon Nesterenko added a comment - 2023-03-13 15:55

Approved (discussed patch on Slack).

Brandon Nesterenko added a comment - 2023-03-13 15:55 Approved (discussed patch on Slack).

Brandon Nesterenko made changes - 2023-03-13 15:55

Assignee	Brandon Nesterenko [ JIRAUSER48702 ]	Andrei Elkin [ elkin ]
Status	In Review [ 10002 ]	Stalled [ 10000 ]

Andrei Elkin made changes - 2023-03-16 17:32

Fix Version/s		10.4.29 [ 28510 ]
Fix Version/s		10.5.20 [ 28512 ]
Fix Version/s		10.6.13 [ 28514 ]
Fix Version/s		10.8.8 [ 28518 ]
Fix Version/s		10.9.6 [ 28520 ]
Fix Version/s		10.10.4 [ 28522 ]
Fix Version/s	10.4 [ 22408 ]
Fix Version/s	10.5 [ 23123 ]
Fix Version/s	10.6 [ 24028 ]
Fix Version/s	10.8 [ 26121 ]
Fix Version/s	10.9 [ 26905 ]
Fix Version/s	10.10 [ 27530 ]
Fix Version/s	10.11 [ 27614 ]
Resolution		Fixed [ 1 ]
Status	Stalled [ 10000 ]	Closed [ 6 ]

Marcin Wanat added a comment - 2023-04-11 13:21

Why 10.11 is not included in Fix Version? It it not affected or it took a bit more work to patch this version too?

Marcin Wanat added a comment - 2023-04-11 13:21 Why 10.11 is not included in Fix Version? It it not affected or it took a bit more work to patch this version too?

Brandon Nesterenko made changes - 2023-04-11 14:08

Fix Version/s

10.11.3 [ 28524 ]

Brandon Nesterenko added a comment - 2023-04-11 14:09

forke It was accidentally missed while closing the ticket, it will be fixed in 10.11.3.

Brandon Nesterenko added a comment - 2023-04-11 14:09 forke It was accidentally missed while closing the ticket, it will be fixed in 10.11.3.

Angelique Sklavounos (Inactive) made changes - 2023-04-13 21:39

Link

This issue relates to MDEV-31052 [ MDEV-31052 ]

Marcin Wanat added a comment - 2023-04-26 10:56 - edited

The issue is probably not fully fixed.

We are experiencing this issue on 10.11.2 under high load in conservative mode. Master has binlog commit wait count = 10 and slaves have 12 replication workers, replication is ROW based. Master have 100k+ updates/inserts per second and few DROP/CREATE tables (we are often using CREATE LIKE+RENAME TABLE+DROP as faster alternative to truncate large tables under high load to reduce latency for client).

I have manually replaced rpl_rli.cc, rpl_parallel.h and rpl_parallel.cc from 10.11.2 with latest files from github 10.11 branch (it has merged this patch) and recompiled from source.
Replication can still hang the same way, but not that often as previously.

EDIT:
There is one change. It does not have "Closing tables" state, but all workerd are NULL or waiting:

MariaDB [(none)]> SHOW PROCESSLIST;

+---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+

| Id      | User        | Host                      | db      | Command      | Time | State                                         | Info             | Progress |

+---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+

|       4 | system user |                           | NULL    | Slave_IO     | 5845 | Waiting for master to send event              | NULL             |    0.000 |

|       6 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |

|       7 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |

|       9 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |

|       8 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |

|      10 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |

|      11 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |

|      12 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |

|      13 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |

|      14 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |

|      15 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |

|      16 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |

|      17 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |

|       5 | system user |                           | NULL    | Slave_SQL    |  318 | Waiting for room in worker thread event queue | NULL             |    0.000 |

| 2524245 | root        | localhost                 | NULL    | Query        |    0 | starting                                      | SHOW PROCESSLIST |    0.000 |

+---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+

Daniel Black added a comment - 2023-05-30 00:40

forke, sorry only just saw your message.

Is the problem still occurring? Are you running on 10.11.3 now?

Was the 10.11.2+changes compiled with debug symbols (file mariadbd doesn't show "stripped"), or debug info packages of 10.11.3 installed.

If so, can you obtain a backtrace on the running mariadbd instance? ref.

Daniel Black added a comment - 2023-05-30 00:40 forke , sorry only just saw your message. Is the problem still occurring? Are you running on 10.11.3 now? Was the 10.11.2+changes compiled with debug symbols ( file mariadbd doesn't show "stripped"), or debug info packages of 10.11.3 installed . If so, can you obtain a backtrace on the running mariadbd instance? ref .

Marcin Wanat added a comment - 2023-05-30 08:04

danblack I was trying this commit on 10.11.2 (before 10.11.3 was released) and the issue was NOT fixed. But after upgrading to 10.11.3 the issue is resolved, so probably some more changes was required to resolve this issue. The problem is that in 10.11.3 there is critical bug MDEV-31234 that result in UNDO logs to grow indefinitely.

This way none of these releases are usable in production. 10.11.2 hangs replication for no reason and 10.11.3 grow UNDO logs indefinitely.

Marcin Wanat added a comment - 2023-05-30 08:04 danblack I was trying this commit on 10.11.2 (before 10.11.3 was released) and the issue was NOT fixed. But after upgrading to 10.11.3 the issue is resolved, so probably some more changes was required to resolve this issue. The problem is that in 10.11.3 there is critical bug MDEV-31234 that result in UNDO logs to grow indefinitely. This way none of these releases are usable in production. 10.11.2 hangs replication for no reason and 10.11.3 grow UNDO logs indefinitely.

Daniel Black added a comment - 2023-05-31 00:15

Thanks for confirming forke, your notes in ~~MDEV-31234~~ while not responded to have been read.

Daniel Black added a comment - 2023-05-31 00:15 Thanks for confirming forke , your notes in MDEV-31234 while not responded to have been read.

Ragul added a comment - 2023-06-02 05:50

Hi Andrei Elkin, Do we have any procedures to reproduce this issue and test it?
Facing similar issue which can't be reproduced but occurs at random.

Ragul added a comment - 2023-06-02 05:50 Hi Andrei Elkin , Do we have any procedures to reproduce this issue and test it? Facing similar issue which can't be reproduced but occurs at random.

Andrei Elkin added a comment - 2023-06-02 07:50

RagulR howdy, I've just replied in the mailing list that your stacks do not fit to this case. Please find there what would be good to do next. Thank you and good luck! Andrei

Andrei Elkin added a comment - 2023-06-02 07:50 RagulR howdy, I've just replied in the mailing list that your stacks do not fit to this case. Please find there what would be good to do next. Thank you and good luck! Andrei

Brandon Nesterenko made changes - 2023-06-09 21:47

Link

This issue relates to ~~MDEV-31448~~ [ ~~MDEV-31448~~ ]

Jira Automation (IT) made changes - 2024-07-04 01:12

Zendesk Related Tickets

172527 171882

People

Assignee:: Andrei Elkin

Reporter:: Andrei Elkin

Votes:: 1 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 2023-03-03 18:48

Updated:: 2024-07-07 19:48

Resolved:: 2023-03-16 17:32

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration