[MDEV-22136] wsrep_restart_slave = 1 does not always work - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Not a Bug
Affects Version/s: 10.2.12, 10.2.32, 10.3(EOL)
Fix Version/s: N/A
Component/s: Galera, Replication
Labels:
None

Description

There is an async replication setup between two Galera clusters of 3 nodes each. In some cases SQL thread on slave cluster stops with error 1047. Error log content is the following:

...

2020-03-26 18:06:27 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 140233980, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 126225898; GTID position '100-100-5514,200-200-524506'

2020-03-26 18:06:55 139923543996160 [Note] WSREP: Trying to continue unpaused monitor

2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:

TRANSACTION 11656002, ACTIVE 0 sec starting index read

mysql tables in use 1, locked 1

MySQL thread id 1, OS thread handle 139923535603456, query id 6370090 Update_rows_log_event::find_row(5817624)

UPDATE impu SET user_state=0 WHERE id=40382

2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:

TRANSACTION 11655993, ACTIVE 0 sec

, undo log entries 3

MySQL thread id 66, OS thread handle 139923264952064, query id 6370086 Unlocking tables

2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 74 page no 94 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 11655993 lock_mode X locks rec but not gap

Record lock, heap no 531 PHYSICAL RECORD: n_fields 4; compact format; info bits 0

 0: len 4; hex 80009dbe; asc     ;;

 1: len 6; hex 000000b1db39; asc      9;;

 2: len 7; hex 13000001d813c6; asc        ;;

 3: len 1; hex 80; asc  ;;

2020-03-26 18:09:26 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=0 WHERE id=40382��|^d

2020-03-26 18:09:26 139923535603456 [Note] InnoDB:  SQL2: NULL

2020-03-26 18:09:26 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:

2020-03-26 18:09:26 139923535603456 [Note] WSREP: Winning thread:

   THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 5817624

   SQL: UPDATE impu SET user_state=0 WHERE id=40382��|^d

2020-03-26 18:09:26 139923535603456 [Note] WSREP: Victim thread:

   THD: 66, mode: local, state: committing, conflict: no conflict, seqno: -1

   SQL: NULL

2020-03-26 18:09:26 139923264952064 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-549803, Internal MariaDB error code: 1047

2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000009' at position 146989842; GTID position '100-100-5514,200-200-549802'

2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue

2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: slave restart: 3

2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: ready state reached

2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 146989842, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 132322234; GTID position '100-100-5514,200-200-549802'

2020-03-26 18:11:38 139923543996160 [Note] WSREP: Trying to continue unpaused monitor

2020-03-26 18:13:53 139923543996160 [Note] WSREP: Trying to continue unpaused monitor

2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:

TRANSACTION 13221641, ACTIVE 0 sec starting index read

mysql tables in use 1, locked 1

MySQL thread id 1, OS thread handle 139923535603456, query id 7230511 Update_rows_log_event::find_row(6600241)

UPDATE impu SET user_state=1 WHERE id=20997

2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:

TRANSACTION 13221628, ACTIVE 0 sec

, undo log entries 3

MySQL thread id 67, OS thread handle 139923264952064, query id 7230504 Unlocking tables

2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 74 page no 64 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 13221628 lock_mode X locks rec but not gap

Record lock, heap no 586 PHYSICAL RECORD: n_fields 4; compact format; info bits 0

 0: len 4; hex 80005205; asc   R ;;

 1: len 6; hex 000000c9befc; asc       ;;

 2: len 7; hex 6c000002b40ef4; asc l      ;;

 3: len 1; hex 81; asc  ;;

2020-03-26 18:23:45 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=1 WHERE id=20997!�|^d

2020-03-26 18:23:45 139923535603456 [Note] InnoDB:  SQL2: NULL

2020-03-26 18:23:45 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:

2020-03-26 18:23:45 139923535603456 [Note] WSREP: Winning thread:

   THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 6600241

   SQL: UPDATE impu SET user_state=1 WHERE id=20997!�|^d

2020-03-26 18:23:45 139923535603456 [Note] WSREP: Victim thread:

   THD: 67, mode: local, state: committing, conflict: no conflict, seqno: -1

   SQL: NULL

2020-03-26 18:23:45 139923264952064 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-626547, Internal MariaDB error code: 1047

2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000009' at position 167400561; GTID position '100-100-5514,200-200-626546'

2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue

2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: slave restart: 3

2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: ready state reached

2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 167400561, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 150809766; GTID position '100-100-5514,200-200-626546'

...

So, normally SQL thread restarts to (maybe) hit a conflict on some other row (see different id values in messages above. But in some cases it stops and replication does not continue. Like this:

2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:

TRANSACTION 360061759, ACTIVE 0 sec starting index read

mysql tables in use 1, locked 1

MySQL thread id 1, OS thread handle 139923535603456, query id 195459563 Update_rows_log_event::find_row(179994284)

UPDATE impu SET user_state=0 WHERE id=46832

2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:

TRANSACTION 360061748, ACTIVE 0 sec

, undo log entries 3

MySQL thread id 1830, OS thread handle 139923258246912, query id 195459555 Unlocking tables

2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061748 lock_mode X locks rec but not gap

Record lock, heap no 501 PHYSICAL RECORD: n_fields 4; compact format; info bits 0

 0: len 4; hex 8000b6f0; asc     ;;

 1: len 6; hex 000015761b34; asc    v 4;;

 2: len 7; hex 7f0000028920e5; asc        ;;

 3: len 1; hex 80; asc  ;;

2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=0 WHERE id=46832K�^d

2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL2: NULL

2020-03-29  0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:

2020-03-29  0:48:27 139923535603456 [Note] WSREP: Winning thread:

   THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994284

   SQL: UPDATE impu SET user_state=0 WHERE id=46832K�^d

2020-03-29  0:48:27 139923535603456 [Note] WSREP: Victim thread:

   THD: 1830, mode: local, state: committing, conflict: no conflict, seqno: -1

   SQL: NULL

2020-03-29  0:48:27 139923258246912 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-15220900, Internal MariaDB error code: 1047

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353365130; GTID position '100-100-8368143,200-200-15220899'

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: slave restart: 3

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: ready state reached

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353365130, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313622044; GTID position '100-100-8368143,200-200-15220899'

2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:

TRANSACTION 360061781, ACTIVE 0 sec starting index read

mysql tables in use 1, locked 1

MySQL thread id 1, OS thread handle 139923535603456, query id 195459577 Update_rows_log_event::find_row(179994293)

UPDATE impu SET user_state=1 WHERE id=46841

2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:

TRANSACTION 360061773, ACTIVE 0 sec

, undo log entries 3

MySQL thread id 1838, OS thread handle 139923258246912, query id 195459574 Unlocking tables

2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061773 lock_mode X locks rec but not gap

Record lock, heap no 510 PHYSICAL RECORD: n_fields 4; compact format; info bits 0

 0: len 4; hex 8000b6f9; asc     ;;

 1: len 6; hex 000015761b4d; asc    v M;;

 2: len 7; hex 0e000001ce0547; asc       G;;

 3: len 1; hex 81; asc  ;;

2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=1 WHERE id=46841K�^d

2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL2: NULL

2020-03-29  0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:

2020-03-29  0:48:27 139923535603456 [Note] WSREP: Winning thread:

   THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994293

   SQL: UPDATE impu SET user_state=1 WHERE id=46841K�^d

2020-03-29  0:48:27 139923535603456 [Note] WSREP: Victim thread:

   THD: 1838, mode: local, state: committing, conflict: no conflict, seqno: -1

   SQL: NULL

2020-03-29  0:48:27 139923258246912 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-15220902, Internal MariaDB error code: 1047

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353365612; GTID position '100-100-8368143,200-200-15220901'

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: slave restart: 3

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: ready state reached

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353365612, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313622526; GTID position '100-100-8368143,200-200-15220901'

2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:

TRANSACTION 360061796, ACTIVE 0 sec starting index read

mysql tables in use 1, locked 1

MySQL thread id 1, OS thread handle 139923535603456, query id 195459587 Update_rows_log_event::find_row(179994299)

UPDATE impu SET user_state=0 WHERE id=46840

2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:

TRANSACTION 360061795, ACTIVE 0 sec starting index read

mysql tables in use 1, locked 1

, undo log entries 1

MySQL thread id 1839, OS thread handle 139923258246912, query id 195459586

2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061795 lock_mode X locks rec but not gap

Record lock, heap no 509 PHYSICAL RECORD: n_fields 4; compact format; info bits 0

 0: len 4; hex 8000b6f8; asc     ;;

 1: len 6; hex 000015761b4f; asc    v O;;

 2: len 7; hex 0f000001c80803; asc        ;;

 3: len 1; hex 81; asc  ;;

2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=0 WHERE id=46840K�^d

2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL2: NULL

2020-03-29  0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:

2020-03-29  0:48:27 139923535603456 [Note] WSREP: Winning thread:

   THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994299

   SQL: UPDATE impu SET user_state=0 WHERE id=46840K�^d

2020-03-29  0:48:27 139923535603456 [Note] WSREP: Victim thread:

   THD: 1839, mode: local, state: executing, conflict: no conflict, seqno: -1

   SQL: NULL

2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353366094; GTID position '100-100-8368143,200-200-15220903'

2020-03-29  0:51:31 139923543996160 [Note] WSREP: Trying to continue unpaused monitor

2020-03-29  0:51:31 139923543996160 [Note] WSREP: Trying to continue unpaused monitor

2020-03-29  0:55:08 139923543996160 [Note] WSREP: Trying to continue unpaused monitor

2020-03-29  1:06:04 139923543996160 [Note] WSREP: Trying to continue unpaused monitor

2020-03-29  1:08:24 139923543996160 [Note] WSREP: Trying to continue unpaused monitor

2020-03-29  1:34:54 139923543996160 [Note] WSREP: Trying to continue unpaused monitor

So, async replication had to be monitored and restarted "manually". After that:

2020-03-29 13:38:01 139923259475712 [Note] Master 'to_master_0': WSREP: ready state reached

2020-03-29 13:38:01 139923259475712 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353366094, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313623008; GTID position '100-100-8368143,200-200-15220903'

Question is: why with wsrep_restart_slave = 1 there are cases when slave restart does not happen automatically? It looks like a bug.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

mysqld.1.err
18 kB
2020-12-15 11:44
mysqld.2.err
175 kB
2020-12-15 11:44
mysqld.3.err
20 kB
2020-12-15 11:44
stdout.log
144 kB
2020-12-15 11:44

wsrep_restart_slave = 1 does not always work

Details

Description

Attachments

Attachments

Activity

People

Dates

Git Integration