Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-22136

wsrep_restart_slave = 1 does not always work

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Stalled (View Workflow)
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.3, 10.2.12, 10.2.32
    • Fix Version/s: 10.2
    • Component/s: Galera, Replication
    • Labels:
      None

      Description

      There is an async replication setup between two Galera clusters of 3 nodes each. In some cases SQL thread on slave cluster stops with error 1047. Error log content is the following:

      ...
      2020-03-26 18:06:27 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 140233980, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 126225898; GTID position '100-100-5514,200-200-524506'
      2020-03-26 18:06:55 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
      TRANSACTION 11656002, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      MySQL thread id 1, OS thread handle 139923535603456, query id 6370090 Update_rows_log_event::find_row(5817624)
      UPDATE impu SET user_state=0 WHERE id=40382
      2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
      TRANSACTION 11655993, ACTIVE 0 sec
      , undo log entries 3
      MySQL thread id 66, OS thread handle 139923264952064, query id 6370086 Unlocking tables
      2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 74 page no 94 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 11655993 lock_mode X locks rec but not gap
      Record lock, heap no 531 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
       0: len 4; hex 80009dbe; asc     ;;
       1: len 6; hex 000000b1db39; asc      9;;
       2: len 7; hex 13000001d813c6; asc        ;;
       3: len 1; hex 80; asc  ;;
       
      2020-03-26 18:09:26 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=0 WHERE id=40382��|^d
      2020-03-26 18:09:26 139923535603456 [Note] InnoDB:  SQL2: NULL
      2020-03-26 18:09:26 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2020-03-26 18:09:26 139923535603456 [Note] WSREP: Winning thread: 
         THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 5817624
         SQL: UPDATE impu SET user_state=0 WHERE id=40382��|^d
      2020-03-26 18:09:26 139923535603456 [Note] WSREP: Victim thread: 
         THD: 66, mode: local, state: committing, conflict: no conflict, seqno: -1
         SQL: NULL
      2020-03-26 18:09:26 139923264952064 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-549803, Internal MariaDB error code: 1047
      2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000009' at position 146989842; GTID position '100-100-5514,200-200-549802'
      2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
      2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: slave restart: 3
      2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: ready state reached
      2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 146989842, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 132322234; GTID position '100-100-5514,200-200-549802'
      2020-03-26 18:11:38 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-26 18:13:53 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
      TRANSACTION 13221641, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      MySQL thread id 1, OS thread handle 139923535603456, query id 7230511 Update_rows_log_event::find_row(6600241)
      UPDATE impu SET user_state=1 WHERE id=20997
      2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
      TRANSACTION 13221628, ACTIVE 0 sec
      , undo log entries 3
      MySQL thread id 67, OS thread handle 139923264952064, query id 7230504 Unlocking tables
      2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 74 page no 64 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 13221628 lock_mode X locks rec but not gap
      Record lock, heap no 586 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
       0: len 4; hex 80005205; asc   R ;;
       1: len 6; hex 000000c9befc; asc       ;;
       2: len 7; hex 6c000002b40ef4; asc l      ;;
       3: len 1; hex 81; asc  ;;
       
      2020-03-26 18:23:45 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=1 WHERE id=20997!�|^d
      2020-03-26 18:23:45 139923535603456 [Note] InnoDB:  SQL2: NULL
      2020-03-26 18:23:45 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2020-03-26 18:23:45 139923535603456 [Note] WSREP: Winning thread: 
         THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 6600241
         SQL: UPDATE impu SET user_state=1 WHERE id=20997!�|^d
      2020-03-26 18:23:45 139923535603456 [Note] WSREP: Victim thread: 
         THD: 67, mode: local, state: committing, conflict: no conflict, seqno: -1
         SQL: NULL
      2020-03-26 18:23:45 139923264952064 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-626547, Internal MariaDB error code: 1047
      2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000009' at position 167400561; GTID position '100-100-5514,200-200-626546'
      2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
      2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: slave restart: 3
      2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: ready state reached
      2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 167400561, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 150809766; GTID position '100-100-5514,200-200-626546'
      ...
      

      So, normally SQL thread restarts to (maybe) hit a conflict on some other row (see different id values in messages above. But in some cases it stops and replication does not continue. Like this:

      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
      TRANSACTION 360061759, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      MySQL thread id 1, OS thread handle 139923535603456, query id 195459563 Update_rows_log_event::find_row(179994284)
      UPDATE impu SET user_state=0 WHERE id=46832
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
      TRANSACTION 360061748, ACTIVE 0 sec
      , undo log entries 3
      MySQL thread id 1830, OS thread handle 139923258246912, query id 195459555 Unlocking tables
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061748 lock_mode X locks rec but not gap
      Record lock, heap no 501 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
       0: len 4; hex 8000b6f0; asc     ;;
       1: len 6; hex 000015761b34; asc    v 4;;
       2: len 7; hex 7f0000028920e5; asc        ;;
       3: len 1; hex 80; asc  ;;
       
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=0 WHERE id=46832K�^d
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL2: NULL
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Winning thread: 
         THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994284
         SQL: UPDATE impu SET user_state=0 WHERE id=46832K�^d
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Victim thread: 
         THD: 1830, mode: local, state: committing, conflict: no conflict, seqno: -1
         SQL: NULL
      2020-03-29  0:48:27 139923258246912 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-15220900, Internal MariaDB error code: 1047
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353365130; GTID position '100-100-8368143,200-200-15220899'
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: slave restart: 3
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: ready state reached
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353365130, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313622044; GTID position '100-100-8368143,200-200-15220899'
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
      TRANSACTION 360061781, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      MySQL thread id 1, OS thread handle 139923535603456, query id 195459577 Update_rows_log_event::find_row(179994293)
      UPDATE impu SET user_state=1 WHERE id=46841
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
      TRANSACTION 360061773, ACTIVE 0 sec
      , undo log entries 3
      MySQL thread id 1838, OS thread handle 139923258246912, query id 195459574 Unlocking tables
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061773 lock_mode X locks rec but not gap
      Record lock, heap no 510 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
       0: len 4; hex 8000b6f9; asc     ;;
       1: len 6; hex 000015761b4d; asc    v M;;
       2: len 7; hex 0e000001ce0547; asc       G;;
       3: len 1; hex 81; asc  ;;
       
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=1 WHERE id=46841K�^d
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL2: NULL
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Winning thread: 
         THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994293
         SQL: UPDATE impu SET user_state=1 WHERE id=46841K�^d
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Victim thread: 
         THD: 1838, mode: local, state: committing, conflict: no conflict, seqno: -1
         SQL: NULL
      2020-03-29  0:48:27 139923258246912 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-15220902, Internal MariaDB error code: 1047
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353365612; GTID position '100-100-8368143,200-200-15220901'
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: slave restart: 3
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: ready state reached
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353365612, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313622526; GTID position '100-100-8368143,200-200-15220901'
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
      TRANSACTION 360061796, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      MySQL thread id 1, OS thread handle 139923535603456, query id 195459587 Update_rows_log_event::find_row(179994299)
      UPDATE impu SET user_state=0 WHERE id=46840
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
      TRANSACTION 360061795, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      , undo log entries 1
      MySQL thread id 1839, OS thread handle 139923258246912, query id 195459586
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061795 lock_mode X locks rec but not gap
      Record lock, heap no 509 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
       0: len 4; hex 8000b6f8; asc     ;;
       1: len 6; hex 000015761b4f; asc    v O;;
       2: len 7; hex 0f000001c80803; asc        ;;
       3: len 1; hex 81; asc  ;;
       
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=0 WHERE id=46840K�^d
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL2: NULL
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Winning thread: 
         THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994299
         SQL: UPDATE impu SET user_state=0 WHERE id=46840K�^d
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Victim thread: 
         THD: 1839, mode: local, state: executing, conflict: no conflict, seqno: -1
         SQL: NULL
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353366094; GTID position '100-100-8368143,200-200-15220903'
      2020-03-29  0:51:31 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-29  0:51:31 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-29  0:55:08 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-29  1:06:04 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-29  1:08:24 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-29  1:34:54 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      

      So, async replication had to be monitored and restarted "manually". After that:

      2020-03-29 13:38:01 139923259475712 [Note] Master 'to_master_0': WSREP: ready state reached
      2020-03-29 13:38:01 139923259475712 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353366094, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313623008; GTID position '100-100-8368143,200-200-15220903'
      

      Question is: why with wsrep_restart_slave = 1 there are cases when slave restart does not happen automatically? It looks like a bug.

        Attachments

        1. mysqld.1.err
          18 kB
        2. mysqld.2.err
          175 kB
        3. mysqld.3.err
          20 kB
        4. stdout.log
          144 kB

          Activity

            People

            Assignee:
            jplindst Jan Lindström
            Reporter:
            valerii Valerii Kravchuk
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Dates

              Created:
              Updated: