Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 10.4(EOL), 10.5, 10.6, 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11
- None
Description
After a parallel worker thread hits an error that should stop the slave,
show slave status displays the error but still reports Yes for the slave running status, e.g.:
show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.31.15.61
Master_User: db02replication
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.028940
Read_Master_Log_Pos: 1050157656
Relay_Log_File: relay-bin.000134
Relay_Log_Pos: 964684321
Relay_Master_Log_File: mysql-bin.028938
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486
Skip_Counter: 0
Exec_Master_Log_Pos: 964684022
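The inconsistent state in this output (a non-zero Last_Errno while Slave_SQL_Running is still Yes) can be checked mechanically. The following is a minimal sketch, assuming the vertical output has been captured as text; the function names parse_slave_status and looks_stuck_after_error are hypothetical and not part of MariaDB or any client library.

```python
def parse_slave_status(text):
    """Parse the vertical (\\G) output of SHOW SLAVE STATUS into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Row separators and stray lines contain no "Name: value" pair.
        if ":" not in line:
            continue
        # Split on the first colon only; values such as Last_Error
        # may contain further colons.
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def looks_stuck_after_error(fields):
    """True if an error is recorded while the SQL thread still reports Yes."""
    errno = int(fields.get("Last_Errno") or 0)
    return errno != 0 and fields.get("Slave_SQL_Running") == "Yes"


# Hypothetical excerpt using field values from the report.
sample = """
*************************** 1. row ***************************
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Last_Errno: 1062
Skip_Counter: 0
"""

fields = parse_slave_status(sample)
stuck = looks_stuck_after_error(fields)
```

This flags only the symptom visible in the status output; it says nothing about why the threads fail to exit.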
However, instead of exiting as expected, the slave threads may hang, for example:
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| 5 | system user | | NULL | Slave_IO | 51160 | Waiting for master to send event | NULL | 0.000 |
| 19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep | 5 | | NULL | 0.000 |
| 61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep | 10 | | NULL | 0.000 |
| 2394 | system user | | NULL | Slave_worker | 50852 | closing tables | NULL | 0.000 |
| 2395 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2396 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2397 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2398 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2399 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2400 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2401 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2402 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2403 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2404 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2405 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2393 | system user | | NULL | Slave_SQL | 50860 | Waiting for room in worker thread event queue | NULL | 0.000 |
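The pattern in this process list (one worker in closing tables, the rest waiting for a prior transaction to start commit, all with the same large Time) can be summarized from captured output. This is a rough sketch only: parse_processlist, summarize_hang, and the min_seconds threshold are hypothetical names and values chosen for illustration, and the heuristic is not a definitive test for this bug.

```python
WAITING = "Waiting for prior transaction to start commit"


def parse_processlist(text):
    """Parse the ASCII table printed by SHOW PROCESSLIST into a list of dicts."""
    header, rows = None, []
    for line in text.splitlines():
        line = line.strip()
        # Skip "+----+" borders, blank lines, and stray lone "|" lines.
        if not line.startswith("|") or line == "|":
            continue
        cells = [c.strip() for c in line.split("|")[1:-1]]
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def summarize_hang(rows, min_seconds=3600):
    """Count long-running workers stuck waiting and list the other worker states."""
    workers = [r for r in rows
               if r["Command"] == "Slave_worker" and int(r["Time"]) >= min_seconds]
    waiting = [r for r in workers if r["State"] == WAITING]
    others = sorted({r["State"] for r in workers if r["State"] != WAITING})
    return {"waiting": len(waiting), "other_states": others}


# A shortened copy of the table from the report.
sample = """
+------+
| Id | User | Host | db | Command | Time | State | Info | Progress |
+------+
| 5 | system user | | NULL | Slave_IO | 51160 | Waiting for master to send event | NULL | 0.000 |
| 2394 | system user | | NULL | Slave_worker | 50852 | closing tables | NULL | 0.000 |
| 2395 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2396 | system user | | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000 |
| 2393 | system user | | NULL | Slave_SQL | 50860 | Waiting for room in worker thread event queue | NULL | 0.000 |
"""

summary = summarize_hang(parse_processlist(sample))
```

A single worker in a state other than the waiting state, with every other worker queued behind it for hours, matches the shape of the hang shown here.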
Slave_SQL may also hang in a different state.
Analysis showed that the worker in the closing tables state was trapped in an endless loop
in mark_start_commit_inner, iterating over already garbage-collected items, including rgi->gco itself.
The belated access is attributed to possible out-of-order group commit
in the error branch.
The issue applies to both the conservative and optimistic modes.
A patch, to be committed soon, fixes this by enforcing group_commit_orderer-based ordering for errored-out workers.
Issue Links
- relates to
  - MDEV-31052 Parallel slave hangs with binlog_alter_two_phase=ON and ALTER replication error (Open)
  - MDEV-31448 Killing a replica thread awaiting its GCO can hang/crash a parallel replica (Closed)
Activity
Field | Original Value | New Value |
---|---|---|
Fix Version/s | 10.4 [ 22408 ] | |
Fix Version/s | 10.5 [ 23123 ] | |
Fix Version/s | 10.6 [ 24028 ] | |
Fix Version/s | 10.8 [ 26121 ] | |
Fix Version/s | 10.9 [ 26905 ] | |
Fix Version/s | 10.10 [ 27530 ] | |
Fix Version/s | 10.11 [ 27614 ] |
Description | [...] The reason of the belated access is identified as possible out-of-order group committing in the error branch. A patch, to be committed soon, fixes the case to simplify logics of garbage-collecting of {{group_commit_orderer}} objects. An instance can be removed into a free list only by a worker that finishes the last group-of-events associated with it. | [...] The reason of the belated access is identified as possible out-of-order group committing in the error branch. A patch, to be committed soon, fixes the case to reinforce {{group_commit_orderer}}-based order for errored-out workers. |
Status | Open [ 1 ] | Confirmed [ 10101 ] |
Status | Confirmed [ 10101 ] | In Progress [ 3 ] |
Assignee | Andrei Elkin [ elkin ] | Brandon Nesterenko [ JIRAUSER48702 ] |
Status | In Progress [ 3 ] | In Review [ 10002 ] |
Link | This issue duplicates MENT-1699 [ MENT-1699 ] |
Summary | optimistic parallel slave hangs after hit an error | parallel slave hangs after hit an error |
Description | [...] The reason of the belated access is identified as possible out-of-order group committing in the error branch. A patch, to be committed soon, fixes the case to reinforce {{group_commit_orderer}}-based order for errored-out workers. | [...] The reason of the belated access is identified as possible out-of-order group committing in the error branch. The issue applies to both the conservative and optimistic modes. A patch, to be committed soon, fixes the case to reinforce {{group_commit_orderer}}-based order for errored-out workers. |
Assignee | Brandon Nesterenko [ JIRAUSER48702 ] | Andrei Elkin [ elkin ] |
Status | In Review [ 10002 ] | Stalled [ 10000 ] |
Fix Version/s | 10.4.29 [ 28510 ] | |
Fix Version/s | 10.5.20 [ 28512 ] | |
Fix Version/s | 10.6.13 [ 28514 ] | |
Fix Version/s | 10.8.8 [ 28518 ] | |
Fix Version/s | 10.9.6 [ 28520 ] | |
Fix Version/s | 10.10.4 [ 28522 ] | |
Fix Version/s | 10.4 [ 22408 ] | |
Fix Version/s | 10.5 [ 23123 ] | |
Fix Version/s | 10.6 [ 24028 ] | |
Fix Version/s | 10.8 [ 26121 ] | |
Fix Version/s | 10.9 [ 26905 ] | |
Fix Version/s | 10.10 [ 27530 ] | |
Fix Version/s | 10.11 [ 27614 ] | |
Resolution | Fixed [ 1 ] | |
Status | Stalled [ 10000 ] | Closed [ 6 ] |
Fix Version/s | 10.11.3 [ 28524 ] |
Link | This issue relates to MDEV-31052 [ MDEV-31052 ] |
Link |
This issue relates to |
Zendesk Related Tickets | 172527 171882 |
The patch is pushed to bb-10.4-andrei.