Details
Type: Bug
Status: In Progress
Priority: Critical
Resolution: Unresolved
Affects Version/s: 10.6.19, 11.4.5
Description
Tested on: 10.6.19-15 Enterprise (consistent), 10.6.19 Community (consistent), 11.4.5 Community (consistent)
- You will need 3 instances to make this occur: Master (db1) / Slave (db2) / Slave (db3).
- The backup is taken on the running slave (db2); after restoring it on the other slave (db3), that slave's replication is pointed at the master (db1).
- If inserts on the master are slowed down to one per second, the failure does not occur, and CHECKSUM TABLE shows that the tables are perfect copies.
- If the table is ENGINE=INNODB, the failure does not occur.
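The "perfect copies" check from the bullets above can be sketched as follows. The checksum values below are fabricated so the comparison itself runs without a server; in a live setup each captured line would come from `mariadb -h <host> -ABNe "CHECKSUM TABLE Rpa.aria_table;"`.

```shell
#!/bin/sh
# Compare CHECKSUM TABLE output captured from master and replica.
# The output format is "<table> <checksum>"; these values are made up.
master_out='Rpa.aria_table 1067839921'
replica_out='Rpa.aria_table 1067839921'
m=$(printf '%s\n' "$master_out" | awk '{print $2}')
r=$(printf '%s\n' "$replica_out" | awk '{print $2}')
if [ "$m" = "$r" ]; then
  echo "MATCH"            # tables are identical copies
else
  echo "MISMATCH ($m vs $r)"  # replica diverged, as in this bug
fi
```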
On master, create a table:
Drop schema if exists Rpa; Create schema Rpa; Use Rpa;
DROP TABLE IF EXISTS `aria_table`;
CREATE TABLE `aria_table` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `message` longtext NOT NULL,
  `createdOn` datetime NOT NULL DEFAULT current_timestamp(),
  `processId` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_svax` (`processId`)
) ENGINE=ARIA AUTO_INCREMENT=7295 DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_general_ci PAGE_CHECKSUM=1;
On master, run a bash script that quickly inserts into the aria_table:
#!/bin/bash
touch t.txt
mariadb -Ae "truncate table Rpa.aria_table;"
ii=0
while [ -f t.txt ]; do
  ii=$(( ii + 1 ))
  mariadb -Ae "use Rpa; INSERT INTO aria_table ( message, createdOn, processId) VALUES ( 'Lorem ipsum dolor sit amet, et pharetra nulla tincidunt.', now(), round(rand()*10000000));"
  echo $ii
done
echo "finished"
Verify that db2 is a slave of db1 and that the slave is running.
On db3, stop server, destroy the data:
systemctl stop mariadb
datadir=/var/lib/mysql
logdir=/var/log/mysql
rm -fr $datadir/* $logdir/*
On db2, use mariabackup to stream a backup into the datadir of db3:
datadir=/var/lib/mysql
replica=192.168.8.113
mariabackup --user=root --backup --stream=xbstream | ssh -o StrictHostKeyChecking=NO root@"$replica" -t "cd $datadir; mbstream -x"
On db3, set up replication to db1:
mhost=db1.edw.ee
datadir=/var/lib/mysql
logdir=/var/log/mysql
mport=3306
mpw=password
muser=repl

# Test the connectivity from this replica node to the primary:
mariadb -h$mhost -u$muser -p$mpw -ABNe "select now(); select @@hostname;"

cd $datadir

gtid=$(cat $datadir/*_binlog_info | tail -1 | awk '{print $3}')
echo $gtid $mhost $mport $muser $mpw

mariabackup --prepare --target-dir=$datadir

chown -R mysql:mysql $datadir
chown -R mysql:mysql $logdir

systemctl start mariadb

mariadb -ABNe "stop slave; reset slave; set global gtid_slave_pos='$gtid'; change master to master_host='$mhost', master_port=$mport, master_user='$muser', master_password='$mpw', master_use_gtid=slave_pos; start slave;"

mariadb -Ae "show replica status\G"
ERROR OCCURS!
Last_SQL_Error: Error 'Duplicate entry '9699' for key 'PRIMARY'' on query. Default database: 'Rpa'. Query: 'INSERT INTO aria_table ( message, createdOn, processId) VALUES ( 'Lorem ipsum dolor sit amet, et pharetra nulla tincidunt.', now(), round(rand()*10000000))'
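The `gtid=$(...)` extraction in the setup script above can be exercised in isolation. This is a minimal sketch against a fabricated `*_binlog_info` file; the file contents (binlog name, position, GTID) are made up, but the parsing pipeline is the one from the repro script: take the last line, print the third whitespace-separated column.

```shell
#!/bin/sh
# Emulate a mariabackup *_binlog_info file, whose lines have the form
# "<binlog file> <position> <GTID of the last change>".
tmp=$(mktemp -d)
printf '1.000001 1234 0-1-100\n1.000002 3333478 0-1-7978\n' > "$tmp/xtrabackup_binlog_info"

# Same pipeline as the repro script: last line, third column.
gtid=$(cat "$tmp"/*_binlog_info | tail -1 | awk '{print $3}')
echo "$gtid"   # prints 0-1-7978

rm -r "$tmp"
```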
Activity
I performed the above repro steps on CS 11.4.6 with one exception: since all three instances, Master (db1) / Slave (db2) / Slave (db3), are running on the same VM, I used a plain mariabackup (see below) instead of --stream=xbstream.
mariabackup --version
mariabackup based on MariaDB server 10.11.8-MariaDB debian-linux-gnu (x86_64)

mariabackup --user=root --port=12704 --socket=/test/MD170225-mariadb-11.4.6-linux-x86_64-opt/socket_slave.sock --backup --target-dir=$datadir

[00] 2025-02-26 17:18:38 Backup created in directory '/test/MD170225-mariadb-11.4.6-linux-x86_64-opt/data_slave2/'
[00] 2025-02-26 17:18:38 MySQL binlog position: filename '1.000002', position '3333478', GTID of the last change '0-1-7978'
[00] 2025-02-26 17:18:38 Writing backup-my.cnf
[00] 2025-02-26 17:18:38 ...done
[00] 2025-02-26 17:18:38 Writing xtrabackup_info
[00] 2025-02-26 17:18:38 ...done
[00] 2025-02-26 17:18:38 Redo log (from LSN 47791 to 3628255) was copied.
[00] 2025-02-26 17:18:38 completed OK!
I couldn't hit the above error after backup/restore. Slave (db3) successfully restored the backup generated on Slave (db2) and then it successfully started replication from Master (db1) post '$gtid'. So the above problem is not reproducible locally.
I added the following extra settings on Slave (db2), which are not in the original steps:
log_bin=1
log_slave_updates=1
I performed the same steps (again with a plain mariabackup instead of --stream=xbstream) on 11.5.4, the release on which the customer reported consistent reproduction, but I couldn't hit the error. Replication was successful after backup/restore.
Master (db1) binlog_format = ROW
Slave (db2) binlog_format = MIXED
MariaDB [Rpa]> SELECT version();
+----------------+
| version()      |
+----------------+
| 11.4.5-MariaDB |
+----------------+
1 row in set (0.000 sec)

MariaDB [Rpa]> system ./bin/mariabackup --version;
./bin/mariabackup: Deprecated program name. It will be removed in a future release, use '/test/MDEV-36143-MD270225-mariadb-11.4.5-linux-x86_64-opt/bin/mariadb-backup' instead
./bin/mariabackup based on MariaDB server 11.4.5-MariaDB Linux (x86_64)
MariaDB [Rpa]>
> May I ask you to update the remaining estimate, please?

julien.fritsch The only thing I'm doing differently from the original repro steps is that I'm not using mariabackup --stream=xbstream. I shall try the same, so I'll need 1d more.
julien.fritsch Yes, on CS 11.5.4 with mariabackup --stream=xbstream the above problem is consistently hit!
./bin/mariabackup --user=root --port=11531 --socket=/test/MDEV-36143-MD270225-mariadb-11.4.5-linux-x86_64-opt/socket_slave.sock --backup --stream=xbstream | ssh -o StrictHostKeyChecking=NO root@"$replica" -t "cd $datadir; mbstream -x"
...
[00] 2025-02-27 20:57:48 Finished backing up non-InnoDB tables and files
[00] 2025-02-27 20:57:48 Waiting for log copy thread to read lsn 107434590
[00] 2025-02-27 20:57:49 Start copying aria log file tail: /test/MDEV-36143-MD270225-mariadb-11.4.5-linux-x86_64-opt/data_slave//aria_log.00000001
[00] 2025-02-27 20:57:49 Stop copying aria log file tail: /test/MDEV-36143-MD270225-mariadb-11.4.5-linux-x86_64-opt/data_slave//aria_log.00000001, copied 8192 bytes
[00] 2025-02-27 20:57:49 BACKUP STAGE BLOCK_COMMIT
[00] 2025-02-27 20:57:49 Copied file ./mysql/slow_log.CSM for log table `mysql`.`slow_log`, 35 bytes
[00] 2025-02-27 20:57:49 Copied file ./mysql/slow_log.CSV for log table `mysql`.`slow_log`, 0 bytes
[00] 2025-02-27 20:57:49 Copied file ./mysql/general_log.CSV for log table `mysql`.`general_log`, 0 bytes
[00] 2025-02-27 20:57:49 Copied file ./mysql/general_log.CSM for log table `mysql`.`general_log`, 35 bytes
[00] 2025-02-27 20:57:49 mariabackup: The latest check point (for incremental): '70765024'
[00] 2025-02-27 20:57:49 Waiting for log copy thread to read lsn 18446744073709551615
[00] 2025-02-27 20:57:50 Start copying statistics aria tables.
[00] 2025-02-27 20:57:50 aria table file ./mysql/column_stats.MAI is copied successfully.
[00] 2025-02-27 20:57:50 aria table file ./mysql/column_stats.MAD is copied successfully.
[00] 2025-02-27 20:57:50 aria table file ./mysql/index_stats.MAI is copied successfully.
[00] 2025-02-27 20:57:50 aria table file ./mysql/index_stats.MAD is copied successfully.
[00] 2025-02-27 20:57:50 aria table file ./mysql/table_stats.MAI is copied successfully.
[00] 2025-02-27 20:57:50 aria table file ./mysql/table_stats.MAD is copied successfully.
[00] 2025-02-27 20:57:50 Stop copying statistics aria tables.
[00] 2025-02-27 20:57:50 Start copying aria log file tail: /test/MDEV-36143-MD270225-mariadb-11.4.5-linux-x86_64-opt/data_slave//aria_log.00000001
[00] 2025-02-27 20:57:50 Stop copying aria log file tail: /test/MDEV-36143-MD270225-mariadb-11.4.5-linux-x86_64-opt/data_slave//aria_log.00000001, copied 8192 bytes
[00] 2025-02-27 20:57:50 Streaming mariadb_backup_binlog_info
[00] 2025-02-27 20:57:50 ...done
[00] 2025-02-27 20:57:50 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
[00] 2025-02-27 20:57:50 BACKUP STAGE END
[00] 2025-02-27 20:57:50 Executing BACKUP STAGE END
[00] 2025-02-27 20:57:50 All tables unlocked
[00] 2025-02-27 20:57:50 Backup created in directory '/test/MDEV-36143-MD270225-mariadb-11.4.5-linux-x86_64-opt/mariadb_backup_files/'
[00] 2025-02-27 20:57:50 MySQL binlog position: filename '1.000002', position '109356715', GTID of the last change '0-1-237737'
[00] 2025-02-27 20:57:50 Streaming backup-my.cnf
[00] 2025-02-27 20:57:50 ...done
[00] 2025-02-27 20:57:50 Streaming mariadb_backup_info
[00] 2025-02-27 20:57:50 ...done
[00] 2025-02-27 20:57:50 Redo log (from LSN 70765024 to 107442118) was copied.
[00] 2025-02-27 20:57:50 completed OK!
MariaDB [(none)]> show replica status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: repl_user
Master_Port: 12523
Connect_Retry: 60
Master_Log_File: 1.000002
Read_Master_Log_Pos: 122896212
Relay_Log_File: qa-susil-1-relay-bin.000002
Relay_Log_Pos: 606
Relay_Master_Log_File: 1.000002
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Could not execute Write_rows_v1 event on table Rpa.aria_table; Duplicate entry '1724' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log 1.000002, end_log_pos 0
Skip_Counter: 0
Exec_Master_Log_Pos: 109356572
Relay_Log_Space: 8301308
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: Yes
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1062
Last_SQL_Error: Could not execute Write_rows_v1 event on table Rpa.aria_table; Duplicate entry '1724' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log 1.000002, end_log_pos 0
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-1-267171
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 1
Slave_Transactional_Groups: 0
Replicate_Rewrite_DB:
1 row in set (0.000 sec)

MariaDB [(none)]> SELECT VERSION();
+----------------+
| VERSION()      |
+----------------+
| 11.4.5-MariaDB |
+----------------+
1 row in set (0.000 sec)

MariaDB [(none)]>
So, I confirm this customer issue is a valid bug. The problem occurs only with mariabackup --stream=xbstream; a normal backup works fine.
julien.fritsch After I confirmed the issue locally, I thought you (I mean the product manager) would be the one to take it further. I think I'm wrong here, so I'm assigning this to my dev lead bnestere.
I've found the issue here. It pertains to the online backup feature, which Aria supports when the table both 1) is transactional and 2) uses page_checksum=1.
To explain, first, there is a backup lock which mariabackup takes when saving the state of the server. This will prevent transactions from updating tables which aren't configured for online backup.
Then, when the slave is replicating events, in addition to re-executing the regular transactions as they occurred on the master, it also updates another table pertaining to the slave state, mysql.gtid_slave_pos.
If the mysql.gtid_slave_pos table is not configured for online backup, the state of the replica will be inconsistent during the backup. That is, the updates to the regular tables targeted by the replicated transaction proceed, but the slave gets stuck at commit time: it cannot lock mysql.gtid_slave_pos, because the backup lock is already held by mariabackup and mysql.gtid_slave_pos isn't configured for online backup.
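The two conditions named above (transactional + page_checksum=1) can be illustrated with a small sketch. The helper name and the DDL-text heuristic below are my own, purely for illustration: the real server decides online-backup eligibility from the table's metadata, not by grepping `SHOW CREATE TABLE` output, and the captured DDL strings here are fabricated.

```shell
#!/bin/sh
# Hypothetical check: does a table's DDL opt it into Aria online backup?
# Per the analysis above, ENGINE=Aria with both TRANSACTIONAL=1 and
# PAGE_CHECKSUM=1 is required. In a live setup the DDL text would come from:
#   mariadb -ABNe "SHOW CREATE TABLE mysql.gtid_slave_pos\G"
supports_online_backup() {
  case "$1" in *ENGINE=Aria*|*ENGINE=ARIA*) ;; *) return 1;; esac
  case "$1" in *TRANSACTIONAL=1*) ;; *) return 1;; esac
  case "$1" in *PAGE_CHECKSUM=1*) ;; *) return 1;; esac
  return 0
}

ddl='CREATE TABLE `t1` (a int) ENGINE=Aria PAGE_CHECKSUM=1 TRANSACTIONAL=1'
if supports_online_backup "$ddl"; then
  echo "online backup OK"
else
  echo "NOT online-backup safe - replica state can go inconsistent"
fi
```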
To figure out: should the Aria engine support online backup on a replica? Even if mysql.gtid_slave_pos were configured to be transactional and use page_checksum=1, I think there would still be a race condition between these updates, where mariabackup could back up new transaction data while reading an outdated GTID to restore replication at.
Re-assigning to monty, as he wants to fix the issue.
A couple of notes from our discussion, for inspiration toward a fix. The fix should be either:
1) everything done in Aria in one commit, or
2) a transaction that continues so the gtid_slave_pos can be updated.
Note too: look at how InnoDB does this for inspiration.
And an MTR test which simplifies the issue:
#
# This test shows that mariabackup can take an inconsistent backup of a slave
# while replicating transactions from Aria tables configured to use online
# backup (page_checksum=1 and transactional=1). This is because the slave adds
# an extra implicit part of the transaction to also update the
# mysql.gtid_slave_pos table separately. When the BACKUP_COMMIT lock is taken,
# the updates to the replicated Aria table with online backup are allowed to
# proceed, but then the slave hangs when trying to update mysql.gtid_slave_pos.
#
# Note the test doesn't actually use mariabackup, but rather, emulates its
# locking behavior using the backup staging commands manually.
#
# References:
# MDEV-36143: xbstream gets the sequence number incorrect when active table
# is ENGINE=ARIA
#

--source include/have_innodb.inc
--source include/have_binlog_format_row.inc
--source include/master-slave.inc

--echo #
--echo # Initialize test schema
--connection master
CREATE TABLE t1 (a int) ENGINE=ARIA TRANSACTIONAL=1 PAGE_CHECKSUM=1;
--source include/save_master_gtid.inc

--echo #
--echo # Slave: Take BLOCK_COMMIT backup lock on slave
--connect(extra_slave,127.0.0.1,root,,test,$SLAVE_MYPORT)
--connection extra_slave
backup stage start;
backup stage flush;
backup stage block_ddl;
backup stage block_commit;

--echo #
--echo # Master: Insert into aria table
--connection master
INSERT INTO t1 VALUES (1);
--let $master_gtid= `SELECT @@global.gtid_binlog_pos`

--echo #
--echo # Wait for slave to start replicating transaction
--connection slave

--let $wait_condition= select count(*)=1 from test.t1;
--source include/wait_condition.inc

--echo # Validating slave state..
--let $slave_gtid= `SELECT @@global.gtid_slave_pos`
if (`SELECT strcmp("$master_gtid","$slave_gtid")`)
{
  --echo # The following transaction was replicated on the slave after BLOCK_COMMIT backup lock was taken
  select * from t1;

  --echo # Yet the GTID state could not update
  --echo # GTID for the transaction: $master_gtid
  --echo # Slave state GTID: $slave_gtid

  --echo # Slave state inconsistent on backup: BLOCK_COMMIT backup lock was taken on running slave before running a transaction, yet a partial transaction was allowed through
  --echo # Mariabackup would create an invalid slave when restoring the backup taken at this point
  --die Slave state inconsistent on backup: BLOCK_COMMIT backup lock was taken on running slave before running a transaction, yet a partial transaction was allowed through
}

--connection extra_slave
backup stage end;

--echo #
--echo # Cleanup
--connection master
drop table t1;
--source include/save_master_gtid.inc
--connection slave
--source include/sync_with_master_gtid.inc

--source include/rpl_end.inc
--echo # End of test
Also, updating the remaining time to 4h per Monty's request.
I have been busy finishing my tasks for the next release of MariaDB.
I will start looking at this as my main task on Monday next week.
Working on it... I will update my findings soon.