[MDEV-30780] Parallel slave hangs after hitting an error

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.4(EOL), 10.5, 10.6, 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11
Fix Version/s: 10.4.29, 10.5.20, 10.6.13, 10.8.8, 10.9.6, 10.10.4, 10.11.3
Component/s: Replication
Labels: None

Description

After a parallel worker thread hits an error that must stop the slave,
show slave status displays the error but still reports the slave running status as Yes, e.g.

show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 172.31.15.61
                   Master_User: db02replication
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: mysql-bin.028940
           Read_Master_Log_Pos: 1050157656
                Relay_Log_File: relay-bin.000134
                 Relay_Log_Pos: 964684321
         Relay_Master_Log_File: mysql-bin.028938
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 1062
                    Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 964684022
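
The inconsistent state in the output above (a non-zero Last_Errno while Slave_SQL_Running is still Yes) can be checked for mechanically. The following is a minimal sketch, assuming the vertical status output has already been captured as text; the helper names parse_vertical_status and error_but_sql_running are hypothetical and not part of any MariaDB tooling.

```python
def parse_vertical_status(text):
    """Parse vertical 'show slave status' output into a dict of field -> value."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the '*** 1. row ***' separator and lines without a field name.
        if not line or line.startswith("*") or ":" not in line:
            continue
        # Split on the first colon only: Last_Error values contain colons themselves.
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


def error_but_sql_running(status):
    """Return True when an error is recorded but the SQL thread still reports Yes."""
    errno = int(status.get("Last_Errno") or 0)
    return status.get("Slave_SQL_Running") == "Yes" and errno != 0
```

For the status shown in this report the check returns True, because Last_Errno is 1062 while Slave_SQL_Running is Yes. On its own this only flags the symptom; it does not prove that the worker threads are hung.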

However, instead of exiting as expected, the slave threads may hang, for example:

+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| Id   | User         | Host               | db   | Command      | Time  | State                                         | Info             | Progress |
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
|    5 | system user  |                    | NULL | Slave_IO     | 51160 | Waiting for master to send event              | NULL             |    0.000 |
|   19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep        |     5 |                                               | NULL             |    0.000 |
|   61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep        |    10 |                                               | NULL             |    0.000 |
| 2394 | system user  |                    | NULL | Slave_worker | 50852 | closing tables                                | NULL             |    0.000 |
| 2395 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2396 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2397 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2398 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2399 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2400 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2401 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2402 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2403 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2404 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2405 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2393 | system user  |                    | NULL | Slave_SQL    | 50860 | Waiting for room in worker thread event queue | NULL             |    0.000 |

Slave_SQL may also hang in a different state.
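
The thread list above can be summarized to spot the pattern: one worker sitting in "closing tables" for a long time while the others wait for the prior transaction. This is a rough sketch that assumes the table has been captured as text in the same pipe-delimited layout; the function names and the min_seconds threshold are assumptions chosen for illustration, not values from the report.

```python
from collections import Counter


def parse_processlist(table_text):
    """Turn a pipe-delimited thread table into a list of dicts keyed by column name."""
    rows = []
    header = None
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip '+---+' border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first data line carries the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def stalled_workers(rows, min_seconds):
    """Return Slave_worker rows that have been in their state for at least min_seconds."""
    return [r for r in rows
            if r["Command"] == "Slave_worker" and int(r["Time"]) >= min_seconds]


def worker_states(rows):
    """Count Slave_worker threads per State value."""
    return dict(Counter(r["State"] for r in rows if r["Command"] == "Slave_worker"))
```

Applied to the listing in this report, worker_states would show a single worker in "closing tables" and the remaining workers in "Waiting for prior transaction to start commit". Long Time values alone are only a heuristic, since a busy replica can legitimately show similar states for short periods.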

Analysis showed that the worker in the "closing tables" state was trapped in an endless loop
in mark_start_commit_inner, iterating over items that had already been garbage-collected, including rgi->gco itself.
The late access was traced to possible out-of-order group committing
in the error branch.

The issue affects both the conservative and optimistic modes.
A patch, to be committed soon, fixes this by enforcing group_commit_orderer-based ordering for errored-out workers.
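
The ordering idea can be illustrated with a toy model. The sketch below is not MariaDB code and does not reproduce the actual patch; CommitOrderer and run_workers are hypothetical names, and the model only shows the principle that a worker which errored out still waits for every earlier transaction before it signals completion, so nothing finishes out of order.

```python
import threading


class CommitOrderer:
    """Toy ordering primitive; a stand-in, not MariaDB's group_commit_orderer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_seq = 0   # sequence number that is allowed to finish next
        self.finished = []   # (seq, outcome) pairs in the order workers finished

    def finish_in_order(self, seq, outcome):
        with self._cond:
            # Wait for all earlier transactions, even when this one failed.
            while self._next_seq != seq:
                self._cond.wait()
            self.finished.append((seq, outcome))
            self._next_seq += 1
            self._cond.notify_all()


def run_workers(outcomes):
    """Run one thread per transaction and return the order in which they finished."""
    orderer = CommitOrderer()
    threads = []
    # Start in reverse so later transactions tend to reach the orderer first.
    for seq in reversed(range(len(outcomes))):
        t = threading.Thread(target=orderer.finish_in_order,
                             args=(seq, outcomes[seq]))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return orderer.finished
```

With outcomes such as ["ok", "error", "ok", "ok"], the errored transaction finishes strictly after transaction 0 and before transactions 2 and 3, regardless of thread scheduling. In the hang described here, the error branch could bypass this kind of ordering, which is what allowed the late access to already-freed items.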

Issue Links

relates to

MDEV-31052 Parallel slave hangs with binlog_alter_two_phase=ON and ALTER replication error (Open)

MDEV-31448 Killing a replica thread awaiting its GCO can hang/crash a parallel replica (Closed)

People

Assignee: Andrei Elkin
Reporter: Andrei Elkin
Votes: 1
Watchers: 10

Dates

Created: 2023-03-03 18:48
Updated: 2024-07-07 19:48
Resolved: 2023-03-16 17:32

MariaDB Server