[MDEV-19572] Async slave on Galera node stops due to 1942 Error during COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 0: Created: 2019-05-23  Updated: 2020-08-25  Resolved: 2019-11-25

Status: Closed
Project: MariaDB Server
Component/s: Galera, Replication
Affects Version/s: 10.1, 10.2.16, 10.2.23
Fix Version/s: 10.1.44, 10.2.30, 10.3.21, 10.4.11

Type: Bug Priority: Critical
Reporter: Valerii Kravchuk Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-16936 Master and Galera Cluster as slave Closed
relates to MDEV-18641 "Resource deadlock avoided during COM... Closed
relates to MDEV-21096 Galera cluster crashed after starting... Closed

 Description   

When changes to MyISAM table are replicated from async MariaDB 10.1.x master to a node in MariaDB 10.2.x Galera cluster, in GTID-based replication setup, replication eventually breaks with the following slave status:

MariaDB [test]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: ao756
                  Master_User: root
                  Master_Port: 3310
                Connect_Retry: 60
              Master_Log_File: ao756-bin.000004
          Read_Master_Log_Pos: 59590494
               Relay_Log_File: mysql-node1-relay-bin.000002
                Relay_Log_Pos: 14872226
        Relay_Master_Log_File: ao756-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1942
                   Last_Error: Error during COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 0:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 29774344
              Relay_Log_Space: 44688691
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1942
               Last_SQL_Error: Error during COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 0:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 4100-1-550731
      Replicate_Do_Domain_Ids:
  Replicate_Ignore_Domain_Ids:
                Parallel_Mode: conservative
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
1 row in set (0,02 sec)
 
MariaDB [test]> show variables like 'gtid%';\
+------------------------+---------------+
| Variable_name          | Value         |
+------------------------+---------------+
| gtid_binlog_pos        | 4100-1-380354 |
| gtid_binlog_state      | 4100-1-380354 |
| gtid_current_pos       | 4100-1-380353 |
| gtid_domain_id         | 4100          |
| gtid_ignore_duplicates | OFF           |
| gtid_seq_no            | 0             |
| gtid_slave_pos         | 4100-1-380353 |
| gtid_strict_mode       | OFF           |
+------------------------+---------------+
8 rows in set (0,01 sec)
 
MariaDB [test]> select * from mysql.gtid_slave_pos;
+-----------+--------+-----------+--------+
| domain_id | sub_id | server_id | seq_no |
+-----------+--------+-----------+--------+
|      4100 | 599107 |         1 | 380352 |
|      4100 | 599108 |         1 | 380353 |
+-----------+--------+-----------+--------+
2 rows in set (0,06 sec)
 
MariaDB [test]> show create table mysql.gtid_slavee_pos\G
ERROR 1146 (42S02): Table 'mysql.gtid_slavee_pos' doesn't exist
MariaDB [test]> show create table mysql.gtid_slave_pos\G
*************************** 1. row ***************************
       Table: gtid_slave_pos
Create Table: CREATE TABLE `gtid_slave_pos` (
  `domain_id` int(10) unsigned NOT NULL,
  `sub_id` bigint(20) unsigned NOT NULL,
  `server_id` int(10) unsigned NOT NULL,
  `seq_no` bigint(20) unsigned NOT NULL,
  PRIMARY KEY (`domain_id`,`sub_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Replication slave GTID position'
1 row in set (0,05 sec)
 
MariaDB [test]> select version();
+---------------------+
| version()           |
+---------------------+
| 10.2.23-MariaDB-log |
+---------------------+
1 row in set (0,06 sec)

In the error log we see:

...
2019-05-23 16:55:10 140069099460352 [Note] Slave SQL thread initialized, starting replication in log 'ao756-bin.000004' at position 28452044, relay log './mysql-node1-relay-bin.000002' position: 13549926; GTID position '4100-1-372797'
2019-05-23 16:55:10 140069099460352 [Warning] Slave SQL: Could not execute Write_rows_v1 event on table test.ttmyis; Duplicate entry '161766' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log ao756-bin.000004, end_log_pos 28452166, Gtid 4100-1-372798, Internal MariaDB error code: 1062
2019-05-23 16:55:10 140069099460352 [Warning] WSREP: SQL statement was ineffective thd: 61940 buf: 126
schema: test
QUERY: COMMIT
=> Skipping replication
2019-05-23 16:55:10 140069099460352 [ERROR] Slave SQL: Error 'Got error 35 "Resource deadlock avoided" during COMMIT' on query. Default database: 'test'. Query: 'COMMIT', Gtid 4100-1-372800, Internal MariaDB error code: 1180
2019-05-23 16:55:10 140069099460352 [ERROR] Slave SQL: Node has dropped from cluster, Gtid 4100-1-372800, Internal MariaDB error code: 1047
...
2019-05-23 16:55:54 140069099460352 [Note] Slave SQL thread initialized, starting replication in log 'ao756-bin.000004' at position 29774344, relay log './mysql-node1-relay-bin.000002' position: 14872226; GTID position '4100-1-380353'
2019-05-23 16:55:54 140069099460352 [Warning] WSREP: BF applier failed to open_and_lock_tables: 0, fatal: 0 wsrep = (exec_mode: 0 conflict_state: 0 seqno: -1)
2019-05-23 16:55:54 140069099460352 [ERROR] Slave SQL: Error during COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 0: , Gtid 4100-1-380354, Internal MariaDB error code: 1942
2019-05-23 16:55:54 140069099460352 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'ao756-bin.000004' position 29774344; GTID position '4100-1-380353'
2019-05-23 16:55:54 140069099460352 [Note] Slave SQL thread exiting, replication stopped in log 'ao756-bin.000004' at position 29774344; GTID position '4100-1-380353'
...

To end up like this I've created a fresh MariaDB master started as follows:

bin/mysqld_safe --no-defaults --port=3310 --log-bin --gtid_domain_id=4100 --server_id=1 --binlog_format=row --sync_binlog=0 --innodb_flush_log_at_trx_commit=2 &

and set up also a local 2-node Galera cluster with nodes started as:

bin/mysqld_safe --defaults-file=/home/openxs/galera/mynode1.cnf --wsrep-new-cluster &
bin/mysqld_safe --defaults-file=/home/openxs/galera/mynode2.cnf &

while related configuration files are:

openxs@ao756:~/dbs/maria10.2$ cat /home/openxs/galera/mynode1.cnf
[mysqld]
datadir=/home/openxs/galera/node1
port=3306
socket=/tmp/mysql-node1.sock
pid-file=/tmp/mysql-node1.pid
log-error=/tmp/mysql-node1.err
binlog_format=ROW
innodb_autoinc_lock_mode=2
 
wsrep_on=ON
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_name = singlebox
wsrep_node_name = node1
wsrep_cluster_address=gcomm://127.0.0.1:4567,127.0.0.1:5020,127.0.0.1:6020?pc.wait_prim=no
 
log_bin
log_slave_updates=1
binlog_row_image=FULL
gtid_domain_id=4100
wsrep_gtid_domain_id=4100
wsrep_gtid_mode=ON
wsrep_restart_slave=1
server_id=2
slave-skip-errors=1062
innodb_flush_log_at_trx_commit=2
sync_binlog=0
openxs@ao756:~/dbs/maria10.2$ cat /home/openxs/galera/mynode2.cnf
[mysqld]
datadir=/home/openxs/galera/node2
port=3307
socket=/tmp/mysql-node2.sock
pid-file=/tmp/mysql-node2.pid
log-error=/tmp/mysql-node2.err
binlog_format=ROW
innodb_autoinc_lock_mode=2
 
wsrep_on=ON
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_cluster_name = singlebox
wsrep_node_name = node2
wsrep_cluster_address=gcomm://127.0.0.1:4567,127.0.0.1:5020,127.0.0.1:6020?pc.wait_prim=no
wsrep_provider_options = "base_port=5020;"
 
log_bin
log_slave_updates=1
binlog_row_image=FULL
gtid_domain_id=4100
wsrep_gtid_domain_id=4100
wsrep_gtid_mode=ON
wsrep_restart_slave=1
server_id=3
slave-skip-errors=1062
innodb_flush_log_at_trx_commit=2
sync_binlog=0
openxs@ao756:~/dbs/maria10.2$

The I've created two tables on master, in test database:

 CREATE TABLE `ttinno` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `c1` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
 
CREATE TABLE `ttmyis` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `c1` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM;
 
insert into ttinno(c1) values (1), (2), (3);
insert into ttmyis(c1) values (1), (2), (3);

Then I pointed out slave on node1 of cluster to master:

set global gtid_slave_pos='4100-1-0';
change master to master_host='ao756', master_port=3310, master_user='root', master_use_gtid=slave_pos;

and started replication. After checking that initial stat is replicated to cluster I've executed the following mysqlslap commands pointing to async master:

bin/mysqlslap --concurrency=2 --create-schema=test --no-drop --number-of-queries=100000 --iterations=1 --query='insert into ttmyis(c1) values(rand()*100); insert into ttinno(c1) values(rand()*100);' -uroot --host=127.0.0.1 --port=3310 test
bin/mysqlslap --concurrency=2 --create-schema=test --no-drop --number-of-queries=100000 --iterations=1 --query='start transaction; update ttmyis set c1=100 where id = rand()*1000; insert into ttinno(c1) values(rand()*100); commit;' -uroot --host=127.0.0.1 --port=3310 test
bin/mysqlslap --concurrency=4 --create-schema=test --no-drop --number-of-queries=100000 --iterations=1 --query='start transaction; update ttinno set c1=100 where id = rand()*1000; insert into ttmyis(c1) values(rand()*100); commit;' -uroot --host=127.0.0.1 --port=3310 test

Second run was eventually cancelled as it was too slow. When the last mysqlslap completed I've checked node1 and noted slave is broken there in thee way explained above.



 Comments   
Comment by Jan Lindström (Inactive) [ 2019-09-18 ]

seppo Any idea what this error really could mean and we could not repeat it so hard to say how to repeat.

Comment by Seppo Jaakola [ 2019-10-10 ]

This issue appears to be hard to reproduce with the given test scenario. Exercised the test with 10.4 and 10.2.20 versions, test load continues for a couple of hours, but no problems surfaced.

Comment by Seppo Jaakola [ 2019-10-10 ]

I tried with few more modified mysqlslap loads, but still no problems in my test system.

Looking at the code, about how mysql.gtid_slave_pos is accessed, it seems that duplicate key errors should not be possible with single threaded slave setup. And there appears to be some bug reports of similar failures, with multi threaded slave configuration.

@Valerii , can you confirm that your test configuration was exactly like what is reported in this issue? And how long test run was needed to make the failure to happen?

Comment by Valerii Kravchuk [ 2019-10-10 ]

The configuration was exactly as shared in the bug report. I just copy/pasted from my test netbook. I've presented .cnf files for both nodes involved. I do not see any explicit setting for slave_parallel_threads, so it must be 0 by default, no parallel slave.

I tried all that on a slow 2 cores netbook. What also had to matter is the fact that one of tables in transaction is MyISAM.

Generated at Thu Feb 08 08:52:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.