Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
10.4(EOL), 10.5, 10.6, 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11
None
Description
After a parallel worker thread hits an error that must stop the slave,
show slave status displays the error but still reports Yes for the slave running status, e.g.:
show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.31.15.61
Master_User: db02replication
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.028940
Read_Master_Log_Pos: 1050157656
Relay_Log_File: relay-bin.000134
Relay_Log_Pos: 964684321
Relay_Master_Log_File: mysql-bin.028938
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486
Skip_Counter: 0
Exec_Master_Log_Pos: 964684022
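The inconsistency in the output above (an error is recorded while Slave_SQL_Running still reads Yes) can be checked mechanically. The following is a minimal sketch, assuming the status output has been captured as plain text; the function names are hypothetical helpers and not part of any MariaDB tooling.

```python
def parse_vertical_status(text):
    """Parse vertical 'Key: value' status output into a dict.

    Lines without a colon (row separators, stray table characters)
    are skipped.
    """
    status = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        status[key.strip()] = value.strip()
    return status


def sql_thread_state_is_inconsistent(status):
    """True when an error is recorded but the SQL thread still reports Yes."""
    running = status.get("Slave_SQL_Running") == "Yes"
    errored = status.get("Last_Errno", "0") != "0"
    return running and errored
```

Applied to the output above, the check would flag the replica, since Last_Errno is 1062 while Slave_SQL_Running is Yes.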
However, instead of exiting as expected, the slave threads may hang, for example:
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| Id   | User         | Host               | db   | Command      | Time  | State                                         | Info             | Progress |
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
|    5 | system user  |                    | NULL | Slave_IO     | 51160 | Waiting for master to send event              | NULL             |    0.000 |
|   19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep        |     5 |                                               | NULL             |    0.000 |
|   61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep        |    10 |                                               | NULL             |    0.000 |
| 2394 | system user  |                    | NULL | Slave_worker | 50852 | closing tables                                | NULL             |    0.000 |
| 2395 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2396 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2397 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2398 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2399 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2400 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2401 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2402 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2403 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2404 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2405 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2393 | system user  |                    | NULL | Slave_SQL    | 50860 | Waiting for room in worker thread event queue | NULL             |    0.000 |
Slave_SQL may also hang in a different state.
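The hang pattern in the process list above (every Slave_worker idle for a long time, one of them stuck in closing tables) can be spotted with a rough heuristic. This is only a sketch, assuming the process list has been captured as the client's pipe-delimited text; the function names and the one-hour default threshold are illustrative assumptions, and the check is not a definitive diagnosis of this bug.

```python
def parse_processlist(text):
    """Parse pipe-delimited table text into a list of row dicts.

    Border lines ('+---+') and lines holding only '|' are ignored.
    The first data line is treated as the header.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|") or line == "|":
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def workers_look_hung(rows, min_seconds=3600):
    """Heuristic: all workers idle >= min_seconds and one is 'closing tables'."""
    workers = [r for r in rows if r.get("Command") == "Slave_worker"]
    if not workers:
        return False
    all_old = all(int(r["Time"]) >= min_seconds for r in workers)
    one_closing = any(r["State"] == "closing tables" for r in workers)
    return all_old and one_closing
```

Because Slave_SQL may hang in a different state, the heuristic deliberately looks only at the worker rows.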
Analysis showed that the worker in the "closing tables" state was trapped in an endless loop
in mark_start_commit_inner, iterating over already garbage-collected items, including rgi->gco itself.
The late access was traced to a possible out-of-order group commit
in the error branch.
The issue applies to both the conservative and optimistic modes.
A patch, to be committed soon, fixes this by enforcing group_commit_orderer-based ordering for errored-out workers.
Issue Links
- relates to
  - MDEV-31052 Parallel slave hangs with binlog_alter_two_phase=ON and ALTER replication error (Open)
  - MDEV-31448 Killing a replica thread awaiting its GCO can hang/crash a parallel replica (Closed)