Details
-
Bug
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Not a Bug
-
10.2.12, 10.2.32, 10.3(EOL)
-
None
Description
There is an async replication setup between two Galera clusters of 3 nodes each. In some cases SQL thread on slave cluster stops with error 1047. Error log content is the following:
...
|
2020-03-26 18:06:27 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 140233980, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 126225898; GTID position '100-100-5514,200-200-524506'
|
2020-03-26 18:06:55 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
|
2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
|
TRANSACTION 11656002, ACTIVE 0 sec starting index read
|
mysql tables in use 1, locked 1
|
MySQL thread id 1, OS thread handle 139923535603456, query id 6370090 Update_rows_log_event::find_row(5817624)
|
UPDATE impu SET user_state=0 WHERE id=40382
|
2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
|
TRANSACTION 11655993, ACTIVE 0 sec
|
, undo log entries 3
|
MySQL thread id 66, OS thread handle 139923264952064, query id 6370086 Unlocking tables
|
2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
|
RECORD LOCKS space id 74 page no 94 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 11655993 lock_mode X locks rec but not gap
|
Record lock, heap no 531 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
|
0: len 4; hex 80009dbe; asc ;;
|
1: len 6; hex 000000b1db39; asc 9;;
|
2: len 7; hex 13000001d813c6; asc ;;
|
3: len 1; hex 80; asc ;;
|
|
2020-03-26 18:09:26 139923535603456 [Note] InnoDB: SQL1: UPDATE impu SET user_state=0 WHERE id=40382��|^d
|
2020-03-26 18:09:26 139923535603456 [Note] InnoDB: SQL2: NULL
|
2020-03-26 18:09:26 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
|
2020-03-26 18:09:26 139923535603456 [Note] WSREP: Winning thread:
|
THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 5817624
|
SQL: UPDATE impu SET user_state=0 WHERE id=40382��|^d
|
2020-03-26 18:09:26 139923535603456 [Note] WSREP: Victim thread:
|
THD: 66, mode: local, state: committing, conflict: no conflict, seqno: -1
|
SQL: NULL
|
2020-03-26 18:09:26 139923264952064 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-549803, Internal MariaDB error code: 1047
|
2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000009' at position 146989842; GTID position '100-100-5514,200-200-549802'
|
2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
|
2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: slave restart: 3
|
2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: ready state reached
|
2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 146989842, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 132322234; GTID position '100-100-5514,200-200-549802'
|
2020-03-26 18:11:38 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
|
2020-03-26 18:13:53 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
|
2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
|
TRANSACTION 13221641, ACTIVE 0 sec starting index read
|
mysql tables in use 1, locked 1
|
MySQL thread id 1, OS thread handle 139923535603456, query id 7230511 Update_rows_log_event::find_row(6600241)
|
UPDATE impu SET user_state=1 WHERE id=20997
|
2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
|
TRANSACTION 13221628, ACTIVE 0 sec
|
, undo log entries 3
|
MySQL thread id 67, OS thread handle 139923264952064, query id 7230504 Unlocking tables
|
2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
|
RECORD LOCKS space id 74 page no 64 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 13221628 lock_mode X locks rec but not gap
|
Record lock, heap no 586 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
|
0: len 4; hex 80005205; asc R ;;
|
1: len 6; hex 000000c9befc; asc ;;
|
2: len 7; hex 6c000002b40ef4; asc l ;;
|
3: len 1; hex 81; asc ;;
|
|
2020-03-26 18:23:45 139923535603456 [Note] InnoDB: SQL1: UPDATE impu SET user_state=1 WHERE id=20997!�|^d
|
2020-03-26 18:23:45 139923535603456 [Note] InnoDB: SQL2: NULL
|
2020-03-26 18:23:45 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
|
2020-03-26 18:23:45 139923535603456 [Note] WSREP: Winning thread:
|
THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 6600241
|
SQL: UPDATE impu SET user_state=1 WHERE id=20997!�|^d
|
2020-03-26 18:23:45 139923535603456 [Note] WSREP: Victim thread:
|
THD: 67, mode: local, state: committing, conflict: no conflict, seqno: -1
|
SQL: NULL
|
2020-03-26 18:23:45 139923264952064 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-626547, Internal MariaDB error code: 1047
|
2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000009' at position 167400561; GTID position '100-100-5514,200-200-626546'
|
2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
|
2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: slave restart: 3
|
2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: ready state reached
|
2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 167400561, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 150809766; GTID position '100-100-5514,200-200-626546'
|
...
|
So, normally SQL thread restarts to (maybe) hit a conflict on some other row (see different id values in messages above. But in some cases it stops and replication does not continue. Like this:
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
|
TRANSACTION 360061759, ACTIVE 0 sec starting index read
|
mysql tables in use 1, locked 1
|
MySQL thread id 1, OS thread handle 139923535603456, query id 195459563 Update_rows_log_event::find_row(179994284)
|
UPDATE impu SET user_state=0 WHERE id=46832
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
|
TRANSACTION 360061748, ACTIVE 0 sec
|
, undo log entries 3
|
MySQL thread id 1830, OS thread handle 139923258246912, query id 195459555 Unlocking tables
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
|
RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061748 lock_mode X locks rec but not gap
|
Record lock, heap no 501 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
|
0: len 4; hex 8000b6f0; asc ;;
|
1: len 6; hex 000015761b34; asc v 4;;
|
2: len 7; hex 7f0000028920e5; asc ;;
|
3: len 1; hex 80; asc ;;
|
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: SQL1: UPDATE impu SET user_state=0 WHERE id=46832K�^d
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: SQL2: NULL
|
2020-03-29 0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
|
2020-03-29 0:48:27 139923535603456 [Note] WSREP: Winning thread:
|
THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994284
|
SQL: UPDATE impu SET user_state=0 WHERE id=46832K�^d
|
2020-03-29 0:48:27 139923535603456 [Note] WSREP: Victim thread:
|
THD: 1830, mode: local, state: committing, conflict: no conflict, seqno: -1
|
SQL: NULL
|
2020-03-29 0:48:27 139923258246912 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-15220900, Internal MariaDB error code: 1047
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353365130; GTID position '100-100-8368143,200-200-15220899'
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: slave restart: 3
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: ready state reached
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353365130, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313622044; GTID position '100-100-8368143,200-200-15220899'
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
|
TRANSACTION 360061781, ACTIVE 0 sec starting index read
|
mysql tables in use 1, locked 1
|
MySQL thread id 1, OS thread handle 139923535603456, query id 195459577 Update_rows_log_event::find_row(179994293)
|
UPDATE impu SET user_state=1 WHERE id=46841
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
|
TRANSACTION 360061773, ACTIVE 0 sec
|
, undo log entries 3
|
MySQL thread id 1838, OS thread handle 139923258246912, query id 195459574 Unlocking tables
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
|
RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061773 lock_mode X locks rec but not gap
|
Record lock, heap no 510 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
|
0: len 4; hex 8000b6f9; asc ;;
|
1: len 6; hex 000015761b4d; asc v M;;
|
2: len 7; hex 0e000001ce0547; asc G;;
|
3: len 1; hex 81; asc ;;
|
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: SQL1: UPDATE impu SET user_state=1 WHERE id=46841K�^d
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: SQL2: NULL
|
2020-03-29 0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
|
2020-03-29 0:48:27 139923535603456 [Note] WSREP: Winning thread:
|
THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994293
|
SQL: UPDATE impu SET user_state=1 WHERE id=46841K�^d
|
2020-03-29 0:48:27 139923535603456 [Note] WSREP: Victim thread:
|
THD: 1838, mode: local, state: committing, conflict: no conflict, seqno: -1
|
SQL: NULL
|
2020-03-29 0:48:27 139923258246912 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-15220902, Internal MariaDB error code: 1047
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353365612; GTID position '100-100-8368143,200-200-15220901'
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: slave restart: 3
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: ready state reached
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353365612, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313622526; GTID position '100-100-8368143,200-200-15220901'
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
|
TRANSACTION 360061796, ACTIVE 0 sec starting index read
|
mysql tables in use 1, locked 1
|
MySQL thread id 1, OS thread handle 139923535603456, query id 195459587 Update_rows_log_event::find_row(179994299)
|
UPDATE impu SET user_state=0 WHERE id=46840
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
|
TRANSACTION 360061795, ACTIVE 0 sec starting index read
|
mysql tables in use 1, locked 1
|
, undo log entries 1
|
MySQL thread id 1839, OS thread handle 139923258246912, query id 195459586
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
|
RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061795 lock_mode X locks rec but not gap
|
Record lock, heap no 509 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
|
0: len 4; hex 8000b6f8; asc ;;
|
1: len 6; hex 000015761b4f; asc v O;;
|
2: len 7; hex 0f000001c80803; asc ;;
|
3: len 1; hex 81; asc ;;
|
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: SQL1: UPDATE impu SET user_state=0 WHERE id=46840K�^d
|
2020-03-29 0:48:27 139923535603456 [Note] InnoDB: SQL2: NULL
|
2020-03-29 0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
|
2020-03-29 0:48:27 139923535603456 [Note] WSREP: Winning thread:
|
THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994299
|
SQL: UPDATE impu SET user_state=0 WHERE id=46840K�^d
|
2020-03-29 0:48:27 139923535603456 [Note] WSREP: Victim thread:
|
THD: 1839, mode: local, state: executing, conflict: no conflict, seqno: -1
|
SQL: NULL
|
2020-03-29 0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353366094; GTID position '100-100-8368143,200-200-15220903'
|
2020-03-29 0:51:31 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
|
2020-03-29 0:51:31 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
|
2020-03-29 0:55:08 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
|
2020-03-29 1:06:04 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
|
2020-03-29 1:08:24 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
|
2020-03-29 1:34:54 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
|
So, async replication had to be monitored and restarted "manually". After that:
2020-03-29 13:38:01 139923259475712 [Note] Master 'to_master_0': WSREP: ready state reached
|
2020-03-29 13:38:01 139923259475712 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353366094, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313623008; GTID position '100-100-8368143,200-200-15220903'
|
Question is: why with wsrep_restart_slave = 1 there are cases when slave restart does not happen automatically? It looks like a bug.
Attachments
Activity
Field | Original Value | New Value |
---|---|---|
Fix Version/s | 10.2 [ 14601 ] | |
Assignee | Jan Lindström [ jplindst ] |
Labels | need_feedback |
Labels | need_feedback |
Status | Open [ 1 ] | In Progress [ 3 ] |
Status | In Progress [ 3 ] | Stalled [ 10000 ] |
Status | Stalled [ 10000 ] | In Progress [ 3 ] |
Priority | Major [ 3 ] | Critical [ 2 ] |
Assignee | Jan Lindström [ jplindst ] | Seppo Jaakola [ seppo ] |
Assignee | Seppo Jaakola [ seppo ] | Jan Lindström [ jplindst ] |
Status | In Progress [ 3 ] | Stalled [ 10000 ] |
Assignee | Jan Lindström [ jplindst ] | Seppo Jaakola [ seppo ] |
Attachment | mysqld.1.err [ 55243 ] | |
Attachment | mysqld.2.err [ 55244 ] | |
Attachment | mysqld.3.err [ 55245 ] | |
Attachment | stdout.log [ 55246 ] |
Affects Version/s | 10.3 [ 22126 ] |
Status | Stalled [ 10000 ] | In Progress [ 3 ] |
Labels | need_feedback |
Status | In Progress [ 3 ] | Stalled [ 10000 ] |
Assignee | Seppo Jaakola [ seppo ] | Valerii Kravchuk [ valerii ] |
Assignee | Valerii Kravchuk [ valerii ] | Jan Lindström [ jplindst ] |
Labels | need_feedback |
Fix Version/s | N/A [ 14700 ] | |
Fix Version/s | 10.2 [ 14601 ] | |
Resolution | Not a Bug [ 6 ] | |
Status | Stalled [ 10000 ] | Closed [ 6 ] |
Workflow | MariaDB v3 [ 106696 ] | MariaDB v4 [ 157548 ] |
Zendesk Related Tickets | 158900 |
In case when slave did not restart automatically did node go Non-Primary state? MariaDB 10.2.36 contains some important fixes but not sure if they would help here. To further analyze, we would need full error logs and some instructions how to reproduce.