Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-22136

wsrep_restart_slave = 1 does not always work

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Not a Bug
    • 10.2.12, 10.2.32, 10.3(EOL)
    • N/A
    • Galera, Replication
    • None

    Description

      There is an async replication setup between two Galera clusters of 3 nodes each. In some cases SQL thread on slave cluster stops with error 1047. Error log content is the following:

      ...
      2020-03-26 18:06:27 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 140233980, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 126225898; GTID position '100-100-5514,200-200-524506'
      2020-03-26 18:06:55 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
      TRANSACTION 11656002, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      MySQL thread id 1, OS thread handle 139923535603456, query id 6370090 Update_rows_log_event::find_row(5817624)
      UPDATE impu SET user_state=0 WHERE id=40382
      2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
      TRANSACTION 11655993, ACTIVE 0 sec
      , undo log entries 3
      MySQL thread id 66, OS thread handle 139923264952064, query id 6370086 Unlocking tables
      2020-03-26 18:09:26 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 74 page no 94 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 11655993 lock_mode X locks rec but not gap
      Record lock, heap no 531 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
       0: len 4; hex 80009dbe; asc     ;;
       1: len 6; hex 000000b1db39; asc      9;;
       2: len 7; hex 13000001d813c6; asc        ;;
       3: len 1; hex 80; asc  ;;
       
      2020-03-26 18:09:26 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=0 WHERE id=40382��|^d
      2020-03-26 18:09:26 139923535603456 [Note] InnoDB:  SQL2: NULL
      2020-03-26 18:09:26 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2020-03-26 18:09:26 139923535603456 [Note] WSREP: Winning thread: 
         THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 5817624
         SQL: UPDATE impu SET user_state=0 WHERE id=40382��|^d
      2020-03-26 18:09:26 139923535603456 [Note] WSREP: Victim thread: 
         THD: 66, mode: local, state: committing, conflict: no conflict, seqno: -1
         SQL: NULL
      2020-03-26 18:09:26 139923264952064 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-549803, Internal MariaDB error code: 1047
      2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000009' at position 146989842; GTID position '100-100-5514,200-200-549802'
      2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
      2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: slave restart: 3
      2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': WSREP: ready state reached
      2020-03-26 18:09:26 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 146989842, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 132322234; GTID position '100-100-5514,200-200-549802'
      2020-03-26 18:11:38 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-26 18:13:53 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
      TRANSACTION 13221641, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      MySQL thread id 1, OS thread handle 139923535603456, query id 7230511 Update_rows_log_event::find_row(6600241)
      UPDATE impu SET user_state=1 WHERE id=20997
      2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
      TRANSACTION 13221628, ACTIVE 0 sec
      , undo log entries 3
      MySQL thread id 67, OS thread handle 139923264952064, query id 7230504 Unlocking tables
      2020-03-26 18:23:45 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 74 page no 64 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 13221628 lock_mode X locks rec but not gap
      Record lock, heap no 586 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
       0: len 4; hex 80005205; asc   R ;;
       1: len 6; hex 000000c9befc; asc       ;;
       2: len 7; hex 6c000002b40ef4; asc l      ;;
       3: len 1; hex 81; asc  ;;
       
      2020-03-26 18:23:45 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=1 WHERE id=20997!�|^d
      2020-03-26 18:23:45 139923535603456 [Note] InnoDB:  SQL2: NULL
      2020-03-26 18:23:45 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2020-03-26 18:23:45 139923535603456 [Note] WSREP: Winning thread: 
         THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 6600241
         SQL: UPDATE impu SET user_state=1 WHERE id=20997!�|^d
      2020-03-26 18:23:45 139923535603456 [Note] WSREP: Victim thread: 
         THD: 67, mode: local, state: committing, conflict: no conflict, seqno: -1
         SQL: NULL
      2020-03-26 18:23:45 139923264952064 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-626547, Internal MariaDB error code: 1047
      2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000009' at position 167400561; GTID position '100-100-5514,200-200-626546'
      2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
      2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: slave restart: 3
      2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': WSREP: ready state reached
      2020-03-26 18:23:45 139923264952064 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000009' at position 167400561, relay log '/var/lib/mybinlog/relaylog-to_master_0.000012' position: 150809766; GTID position '100-100-5514,200-200-626546'
      ...
      

      So, normally SQL thread restarts to (maybe) hit a conflict on some other row (see different id values in messages above. But in some cases it stops and replication does not continue. Like this:

      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
      TRANSACTION 360061759, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      MySQL thread id 1, OS thread handle 139923535603456, query id 195459563 Update_rows_log_event::find_row(179994284)
      UPDATE impu SET user_state=0 WHERE id=46832
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
      TRANSACTION 360061748, ACTIVE 0 sec
      , undo log entries 3
      MySQL thread id 1830, OS thread handle 139923258246912, query id 195459555 Unlocking tables
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061748 lock_mode X locks rec but not gap
      Record lock, heap no 501 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
       0: len 4; hex 8000b6f0; asc     ;;
       1: len 6; hex 000015761b34; asc    v 4;;
       2: len 7; hex 7f0000028920e5; asc        ;;
       3: len 1; hex 80; asc  ;;
       
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=0 WHERE id=46832K�^d
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL2: NULL
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Winning thread: 
         THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994284
         SQL: UPDATE impu SET user_state=0 WHERE id=46832K�^d
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Victim thread: 
         THD: 1830, mode: local, state: committing, conflict: no conflict, seqno: -1
         SQL: NULL
      2020-03-29  0:48:27 139923258246912 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-15220900, Internal MariaDB error code: 1047
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353365130; GTID position '100-100-8368143,200-200-15220899'
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: slave restart: 3
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: ready state reached
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353365130, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313622044; GTID position '100-100-8368143,200-200-15220899'
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
      TRANSACTION 360061781, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      MySQL thread id 1, OS thread handle 139923535603456, query id 195459577 Update_rows_log_event::find_row(179994293)
      UPDATE impu SET user_state=1 WHERE id=46841
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
      TRANSACTION 360061773, ACTIVE 0 sec
      , undo log entries 3
      MySQL thread id 1838, OS thread handle 139923258246912, query id 195459574 Unlocking tables
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061773 lock_mode X locks rec but not gap
      Record lock, heap no 510 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
       0: len 4; hex 8000b6f9; asc     ;;
       1: len 6; hex 000015761b4d; asc    v M;;
       2: len 7; hex 0e000001ce0547; asc       G;;
       3: len 1; hex 81; asc  ;;
       
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=1 WHERE id=46841K�^d
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL2: NULL
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Winning thread: 
         THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994293
         SQL: UPDATE impu SET user_state=1 WHERE id=46841K�^d
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Victim thread: 
         THD: 1838, mode: local, state: committing, conflict: no conflict, seqno: -1
         SQL: NULL
      2020-03-29  0:48:27 139923258246912 [ERROR] Master 'to_master_0': Slave SQL: Node has dropped from cluster, Gtid 200-200-15220902, Internal MariaDB error code: 1047
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353365612; GTID position '100-100-8368143,200-200-15220901'
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: Slave error due to node temporarily non-primarySQL slave will continue
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: slave restart: 3
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': WSREP: ready state reached
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353365612, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313622526; GTID position '100-100-8368143,200-200-15220901'
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Priority TRANSACTION:
      TRANSACTION 360061796, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      MySQL thread id 1, OS thread handle 139923535603456, query id 195459587 Update_rows_log_event::find_row(179994299)
      UPDATE impu SET user_state=0 WHERE id=46840
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** Victim TRANSACTION:
      TRANSACTION 360061795, ACTIVE 0 sec starting index read
      mysql tables in use 1, locked 1
      , undo log entries 1
      MySQL thread id 1839, OS thread handle 139923258246912, query id 195459586
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
      RECORD LOCKS space id 74 page no 104 n bits 720 index PRIMARY of table `bf_lock_test`.`impu` trx id 360061795 lock_mode X locks rec but not gap
      Record lock, heap no 509 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
       0: len 4; hex 8000b6f8; asc     ;;
       1: len 6; hex 000015761b4f; asc    v O;;
       2: len 7; hex 0f000001c80803; asc        ;;
       3: len 1; hex 81; asc  ;;
       
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL1: UPDATE impu SET user_state=0 WHERE id=46840K�^d
      2020-03-29  0:48:27 139923535603456 [Note] InnoDB:  SQL2: NULL
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: cluster conflict due to high priority abort for threads:
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Winning thread: 
         THD: 1, mode: applier, state: executing, conflict: no conflict, seqno: 179994299
         SQL: UPDATE impu SET user_state=0 WHERE id=46840K�^d
      2020-03-29  0:48:27 139923535603456 [Note] WSREP: Victim thread: 
         THD: 1839, mode: local, state: executing, conflict: no conflict, seqno: -1
         SQL: NULL
      2020-03-29  0:48:27 139923258246912 [Note] Master 'to_master_0': Slave SQL thread exiting, replication stopped in log 'binlog.000002' at position 353366094; GTID position '100-100-8368143,200-200-15220903'
      2020-03-29  0:51:31 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-29  0:51:31 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-29  0:55:08 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-29  1:06:04 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-29  1:08:24 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      2020-03-29  1:34:54 139923543996160 [Note] WSREP: Trying to continue unpaused monitor
      

      So, async replication had to be monitored and restarted "manually". After that:

      2020-03-29 13:38:01 139923259475712 [Note] Master 'to_master_0': WSREP: ready state reached
      2020-03-29 13:38:01 139923259475712 [Note] Master 'to_master_0': Slave SQL thread initialized, starting replication in log 'binlog.000002' at position 353366094, relay log '/var/lib/mybinlog/relaylog-to_master_0.000004' position: 313623008; GTID position '100-100-8368143,200-200-15220903'
      

      Question is: why with wsrep_restart_slave = 1 there are cases when slave restart does not happen automatically? It looks like a bug.

      Attachments

        1. mysqld.1.err
          18 kB
          Jan Lindström
        2. mysqld.2.err
          175 kB
          Jan Lindström
        3. mysqld.3.err
          20 kB
          Jan Lindström
        4. stdout.log
          144 kB
          Jan Lindström

        Activity

          People

            jplindst Jan Lindström (Inactive)
            valerii Valerii Kravchuk
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.