Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-28461

semisync-slave server recovery fails to rollback prepared transaction

    XMLWordPrintable

Details

    Description

      • Test config
      • Primary(master) : 192.168.254.24
      • Primary(master) : 192.168.254.25
      • MaxScale : 192.168.254.26
      • test table schema and query.
      • max client sessions : 5000 concurrent user
      • workload generator and test project file : Jmeter [^sysbench_insert_only_10.6.jmx]

      MariaDB [test]> show create table au_test\G
      *************************** 1. row ***************************
             Table: au_test
      Create Table: CREATE TABLE `au_test` (
        `id` bigint(20) NOT NULL AUTO_INCREMENT,
        `b` varchar(10) NOT NULL COMMENT '@@hostname',
        `c` datetime DEFAULT NULL,
        `d` datetime DEFAULT NULL,
        `e` bigint(20) DEFAULT NULL,
        `f` varchar(500) DEFAULT NULL,
        `g` varchar(100) DEFAULT NULL,
        PRIMARY KEY (`id`)
      ) ENGINE=InnoDB AUTO_INCREMENT=457948 DEFAULT CHARSET=latin1;
       
      insert into au_test (b,c,d,e,f,g) values (@@hostname,now(), now(), last_insert_id(), (select variable_value from information_schema.session_status where variable_name = 'Rpl_semi_sync_slave_status'), cast(@@gtid_binlog_state as varchar(100)));
      

      Once master joins as a master once again after `kill -9 $(pidof mariadbd)` against master node, slave status changes to `running` instead of 'slave, running'

      Here is the output of `maxctrl list status`:

       
      {"timestamp":"2022-05-03 14:26:04.907","command":"maxctrl list servers","status":true,"output":"
      ┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐
      │ Server  │ Address        │ Port │ Connections │ State           │ GTID        │
      ├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
      │ primary │ 192.168.254.2433064195        │ Master, Running │ 0-1-7533546
      ├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
      │ replica │ 192.168.254.2533064072        │ Slave, Running  │ 0-2-7496224
      └─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘
       
      _this is the time when master server was forcefully killed:
      ----------------------------------------------------------------
      [root@primary my.cnf.d]# date
      Tue May  3 14:26:04 KST 2022
      [root@primary my.cnf.d]# kill -9 $(pidof mariadbd)
      ----------------------------------------------------------------_
       
      {"timestamp":"2022-05-03 14:26:07.547","command":"maxctrl list servers","status":true,"output":"
      ┌─────────┬────────────────┬──────┬─────────────┬────────────────┬─────────────┐
      │ Server  │ Address        │ Port │ Connections │ State          │ GTID        │
      ├─────────┼────────────────┼──────┼─────────────┼────────────────┼─────────────┤
      │ primary │ 192.168.254.2433060           │ Down           │ 0-1-7533546
      ├─────────┼────────────────┼──────┼─────────────┼────────────────┼─────────────┤
      │ replica │ 192.168.254.2533064303        │ Slave, Running │ 0-2-7496224
      └─────────┴────────────────┴──────┴─────────────┴────────────────┴─────────────┘
       
       
      {"timestamp":"2022-05-03 14:26:10.969","command":"maxctrl list servers","status":true,"output":"
      ┌─────────┬────────────────┬──────┬─────────────┬────────────────┬─────────────┐
      │ Server  │ Address        │ Port │ Connections │ State          │ GTID        │
      ├─────────┼────────────────┼──────┼─────────────┼────────────────┼─────────────┤
      │ primary │ 192.168.254.2433060           │ Down           │ 0-1-7533546
      ├─────────┼────────────────┼──────┼─────────────┼────────────────┼─────────────┤
      │ replica │ 192.168.254.2533064344        │ Slave, Running │ 0-2-7496224
      └─────────┴────────────────┴──────┴─────────────┴────────────────┴─────────────┘
       
       
      {"timestamp":"2022-05-03 14:26:13.606","command":"maxctrl list servers","status":true,"output":" => master was successfully rejoin as master again.
      ┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐
      │ Server  │ Address        │ Port │ Connections │ State           │ GTID        │
      ├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
      │ primary │ 192.168.254.2433062275        │ Master, Running │ 0-1-7534828
      ├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
      │ replica │ 192.168.254.2533064344        │ Slave, Running  │ 0-2-7496224
      └─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘
       
       
      *{"timestamp":"2022-05-03 14:26:15.677","command":"maxctrl list servers","status":true,"output":"  => At this point  Slave got changed into Running state due to Duplicate key error.
      ┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐
      │ Server  │ Address        │ Port │ Connections │ State           │ GTID        │
      ├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
      │ primary │ 192.168.254.2433063069        │ Master, Running │ 0-1-7534828
      ├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
      │ replica │ 192.168.254.2533060           │ Running         │ 0-1-7496290
      └─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘
      *
       
      {"timestamp":"2022-05-03 14:26:26.720","command":"maxctrl list servers","status":true,"output":"
      ┌─────────┬────────────────┬──────┬─────────────┬─────────────────┬─────────────┐
      │ Server  │ Address        │ Port │ Connections │ State           │ GTID        │
      ├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
      │ primary │ 192.168.254.2433062927        │ Master, Running │ 0-1-7534828
      ├─────────┼────────────────┼──────┼─────────────┼─────────────────┼─────────────┤
      │ replica │ 192.168.254.2533060           │ Running         │ 0-1-7496290
      └─────────┴────────────────┴──────┴─────────────┴─────────────────┴─────────────┘
      

      I've run this test for quite some time, but still hitting this error quite a lot.

      -slave server error log

      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 816
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 1577
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 643
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 981
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 812
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 1398
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 803
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 857
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 833
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 894
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 1788
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 846
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 879
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 1118
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 490
      2022-05-03 14:25:50 0 [Note] Rolled back transaction with xid 1330
      2022-05-03 14:25:50 0 [Note] Successfully truncated binlog file:./mariadb-bin.000007 from previous file size 1358 to pos:360 to remove transactions starting from GTID 0-2-7496224
      2022-05-03 14:25:50 0 [Note] Server socket created on IP: '0.0.0.0'.
      2022-05-03 14:25:50 0 [Warning] 'proxies_priv' entry '@% root@node1' ignored in --skip-name-resolve mode.
      2022-05-03 14:25:50 0 [Note] Reading of all Master_info entries succeeded
      2022-05-03 14:25:50 0 [Note] Added new Master_info '' to hash table
      2022-05-03 14:25:50 0 [Note] /usr/sbin/mariadbd: ready for connections.
      Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server
      2022-05-03 14:25:53 7 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'
      2022-05-03 14:25:53 7 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='192.168.254.24', master_port='3306', master_log_file='', master_log_pos='4'.
      2022-05-03 14:25:53 7 [Note] Previous Using_Gtid=No. New Using_Gtid=Current_Pos
      2022-05-03 14:25:53 10 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.254.24:3306' in log '' at position 4
      2022-05-03 14:25:53 11 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './mariadb-relay-bin.000001' position: 4; GTID position '0-2-7496223'
      2022-05-03 14:25:53 10 [Note] Slave I/O thread: connected to master 'repl@192.168.254.24:3306',replication starts at GTID position '0-2-7496223'
      2022-05-03 14:26:04 10 [ERROR] Error reading packet from server: Lost connection to server during query (server_errno=2013)
      2022-05-03 14:26:04 10 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb-bin.000021' at position 41641390; GTID position '0-1-7534857'
      2022-05-03 14:26:04 10 [ERROR] Slave I/O: error reconnecting to master 'repl@192.168.254.24:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '192.168.254.24' (111 "Connection refused"), Internal MariaDB error code: 2003
      2022-05-03 14:26:13 11 [Warning] Timeout waiting for reply of binlog (file: mariadb-bin.000009, pos: 904), semi-sync up to file , position 0.
      2022-05-03 14:26:13 11 [Note] Semi-sync replication switched OFF.
      2022-05-03 14:26:13 11 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table test.au_test; Duplicate entry '417427' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mariadb-bin.000021, end_log_pos 22320629, Gtid 0-1-7496291, Internal MariaDB error code: 1062
      2022-05-03 14:26:13 11 [Warning] Slave: Duplicate entry '417427' for key 'PRIMARY' Error_code: 1062
      2022-05-03 14:26:13 11 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.000021' position 22320161; GTID position '0-1-7496290'
      2022-05-03 14:26:13 11 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb-bin.000021' at position 22320161; GTID position '0-1-7496290', master: 192.168.254.24:3306
      2022-05-03 14:27:04 10 [Note] Slave: connected to master 'repl@192.168.254.24:3306',replication resumed in log 'mariadb-bin.000021' at position 41641390
      2022-05-03 14:31:14 2854 [Warning] Aborted connection 2854 to db: 'test' user: 'allen' host: '192.168.254.26' (Got timeout reading communication packets)
      2022-05-03 14:31:14 2876 [Warning] Aborted connection 2876 to db: 'test' user: 'allen' host: '192.168.254.26' (Got timeout reading communication packets)
      

      nothing much in master's error log.

      2022-05-03 14:26:10 0 [Note] Server socket created on IP: '0.0.0.0'.
      2022-05-03 14:26:10 0 [Warning] 'proxies_priv' entry '@% root@node1' ignored in --skip-name-resolve mode.
      2022-05-03 14:26:10 0 [Note] Reading of all Master_info entries succeeded
      2022-05-03 14:26:10 0 [Note] Added new Master_info '' to hash table
      2022-05-03 14:26:10 0 [Note] /usr/sbin/mariadbd: ready for connections.
      Version: '10.6.7-3-MariaDB-enterprise-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Enterprise Server
      2022-05-03 14:26:14 802 [Note] InnoDB: Number of pools: 2
      2022-05-03 14:26:21 1851 [Note] InnoDB: Number of pools: 3
      2022-05-03 14:26:32 354 [Warning] Timeout waiting for reply of binlog (file: mariadb-bin.000022, pos: 859), semi-sync up to file , position 0.
      2022-05-03 14:26:32 354 [Note] Semi-sync replication switched OFF.
      2022-05-03 14:27:04 1991 [Note] Start binlog_dump to slave_server(2), pos(mariadb-bin.000021, 41641390), using_gtid(1), gtid('0-1-7534857')
      2022-05-03 14:27:04 1991 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(./mariadb-bin.000022, 4)
      2022-05-03 14:27:04 1 [Note] Semi-sync replication switched ON with slave (server_id: 2) at (mariadb-bin.000022, 972296)
      

      rep_master_semi.7z - master(192.168.254.24) server's semi_sync status and show slave status
      rep_slave_semi.7z - slave(192.168.254.25) server's semi_sync status and show slave status

      Attachments

        1. maxscale.cnf
          3 kB
        2. primary-server.cnf
          2 kB
        3. rep_master_semi.7z
          31 kB
        4. rep_slave_semi.7z
          31 kB
        5. replica-server.cnf
          2 kB
        6. sysbench_insert_only_10.6.jmx
          31 kB

        Activity

          People

            Elkin Andrei Elkin
            allen.lee@mariadb.com Allen Lee (Inactive)
            Votes:
            1 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.