[MDEV-23586] Mariabackup: GTID saved for replication in 10.4.14 is wrong
Created: 2020-08-26  Updated: 2021-08-27  Resolved: 2020-09-25

| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | mariabackup, Server |
| Affects Version/s: | 10.4.14 |
| Fix Version/s: | 10.4.16, 10.5.7 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Rick Pizzi | Assignee: | Michael Widenius |
| Resolution: | Fixed | Votes: | 4 |
| Labels: | None |
| Issue Links: |
| Description |
It appears that the fix for https://jira.mariadb.org/browse/MDEV-21953 has introduced a serious regression: any backup taken with MariaDB Server 10.4.14 and above saves the wrong GTID.

To reproduce, take a mariabackup from a 10.4.14 instance, then notice that the reported position refers to an incorrect GTID. In the example below, the backup corresponds to binlog position 25035029, but the reported GTID 0-1-18691 is wrong and, if used, yields several duplicate-key errors. The correct GTID for that position would be 0-1-18699, as reported by the BINLOG_GTID_POS() function.

This wrong GTID is saved to xtrabackup_slave_info, and since the binlog position is no longer saved there (WHY???), the resulting backup no longer carries valid point-in-time information.

Thanks to aponzo, who found and verified this issue.

Workaround: downgrade the server to 10.4.13.
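A minimal sketch of the consistency check the report describes, using the values quoted above. The file contents below are a mock-up of what mariabackup writes into xtrabackup_slave_info (a real check reads the file from the backup directory), and the BINLOG_GTID_POS() comparison needs a live server, so it is shown as a comment:

```shell
# Mock-up of xtrabackup_slave_info as implied by this report; a real
# check would read the file produced by mariabackup in the backup dir.
cat > xtrabackup_slave_info <<'EOF'
SET GLOBAL gtid_slave_pos = '0-1-18691';
CHANGE MASTER TO master_use_gtid = slave_pos;
EOF

# Extract the GTID the backup recorded.
saved_gtid=$(sed -n "s/.*gtid_slave_pos = '\([^']*\)'.*/\1/p" xtrabackup_slave_info)
echo "saved GTID: $saved_gtid"    # prints: saved GTID: 0-1-18691

# On a live server, one would compare it against (not runnable here):
#   mysql -e "SELECT BINLOG_GTID_POS('m2-bin.000007', 25035029)"
# which in this report returned 0-1-18699, i.e. a mismatch.
```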
| Comments |
| Comment by Rick Pizzi [ 2020-08-26 ] |

From further testing:
| Comment by Geoff Montee (Inactive) [ 2020-08-26 ] |
Are you using the position printed by mariabackup --prepare to validate the correctness of the position in xtrabackup_binlog_info? i.e. this position?:

If so, then I want to point out that the position printed by mariabackup --prepare is not guaranteed to be consistent, so it is not a good idea to use it to validate the position in xtrabackup_binlog_info. In fact, the position printed by mariabackup --prepare should never be used at all; for more information about that, see MDEV-21611. Only the positions in xtrabackup_binlog_info and xtrabackup_slave_info are currently guaranteed to be consistent by Mariabackup.

In other words, it is totally normal and expected if the position saved in xtrabackup_binlog_info differs from the position printed by Mariabackup, because only the former position is supposed to be consistent.

Is your position in xtrabackup_binlog_info actually incorrect? I don't know, so I can't comment on that. I just want to point out that it is unsafe to use the position printed by mariabackup --prepare to make any judgement about the correctness of the position in xtrabackup_binlog_info.
| Comment by Rick Pizzi [ 2020-08-27 ] |

The printed position and the position saved in the file are identical.
| Comment by Michael Widenius [ 2020-08-27 ] |

What is PIT? Point in time? ("Binlog position" would maybe be a better term in that case.) From what I understand from the above, the printed binlog file and position are correct.
| Comment by Rick Pizzi [ 2020-08-27 ] |

Yes, sorry, I meant the binlog position, which is correct; only the GTID is wrong. The issue is that in the file xtrabackup_slave_info only the GTID is saved. I think it would be best to go back to also printing the position...

Rick
| Comment by Rick Pizzi [ 2020-08-27 ] |

julien.fritsch A test case for this issue is trivial to build: just set up one master and one replica, with GTID enabled and parallel slave threads > 1, then take a mariabackup on the replica and verify that the GTID saved in the slave_info file does not correspond to the GTID of the saved position as reported by BINLOG_GTID_POS().
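The recipe above can be captured as a replica configuration fragment. This is a sketch with assumed values (any slave_parallel_threads greater than 1 should do); the option names are standard MariaDB ones:

```ini
[mysqld]
server_id              = 2
log_bin                = m2-bin
log_slave_updates      = ON
gtid_strict_mode       = ON
slave_parallel_threads = 4          # must be > 1 to expose the bug
slave_parallel_mode    = optimistic
```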
| Comment by Vladislav Lesin [ 2020-09-01 ] |

The binlog position is obtained with SHOW MASTER STATUS in mariabackup. The GTID of the last change is the gtid_current_pos variable, which is obtained with a SHOW VARIABLES statement (see write_binlog_info()). Both statements are executed under the BACKUP STAGE BLOCK_COMMIT lock (unless the --no-lock option is used).

I failed to repeat this with an MTR test. I started replication with several slave threads and set watchpoints on the variables used to compute the binlog position (mysql_bin_log.log_file.pos_in_file, mysql_bin_log.log_file.write_pos, mysql_bin_log.log_file.request_pos; see also MYSQL_BIN_LOG::raw_get_current_log(), the function used to compute the binlog position when SHOW MASTER STATUS is executed). I also set breakpoints on the function used to update gtid_current_pos (rpl_binlog_state::update_nolock()). Then I took the BLOCK_COMMIT lock on the slave and executed a bunch of inserts on the master. There were no hits after BLOCK_COMMIT on the slave.

I am trying to repeat the bug with sysbench.
| Comment by Andrea Ponzo (Inactive) [ 2020-09-02 ] |

vlad.lesin, let me explain this issue better:

and we have:

[root@m2 andre2]# cat xtrabackup_slave_info

Of course we will add the MASTER_HOST of A. Doing this, we get duplicate-key errors because the above gtid_slave_pos is wrong. If we take the binlog file and position of server B:

MariaDB [test]> SELECT BINLOG_GTID_POS("m2-bin.000007", 25035029);

it is different from that reported in xtrabackup_slave_info. So the MASTER_LOG_FILE and MASTER_LOG_POS are correct if we want to set up server C as a slave of B, but the GTID is not. The sequence number difference is due to the number of parallel workers used.

Can you please report, on your test, the output of:

Thanks
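The sequence-number gap can be seen directly from the two GTIDs quoted in this thread (a small illustration; the literal values stand in for what a real system would return from the backup file and from BINLOG_GTID_POS()):

```shell
saved="0-1-18691"    # GTID recorded in xtrabackup_slave_info (from this report)
actual="0-1-18699"   # GTID from BINLOG_GTID_POS() for the same position

# Strip everything up to the last '-' to get the sequence numbers.
seq_saved=${saved##*-}
seq_actual=${actual##*-}
echo "gap: $((seq_actual - seq_saved)) transactions"   # prints: gap: 8 transactions
```

Per the comment above, the size of this gap tracks the number of parallel replication workers in flight when the backup was taken.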
| Comment by Vladislav Lesin [ 2020-09-02 ] |

My current assumption, which I am trying to reproduce in an MTR test, is the following. Consider the following code:

tc_log->log_and_order() flushes the binlog and changes the binlog position. commit_one_phase_2() acquires the backup block commit lock. As we can see, in the case when is_real_trans==true, the binlog is flushed before the block commit lock is acquired. So there can be the following scenario:
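A simplified timeline of the scenario described (pseudocode, not the actual server source; the statements on the mariabackup side are the ones named in write_binlog_info() earlier in this thread):

```
transaction thread (ha_commit_trans, simplified):
  1. tc_log->log_and_order()   -> trx written to binlog; binlog position advances
  2. commit_one_phase_2()      -> blocks here waiting for MDL_BACKUP_COMMIT
  3. after commit              -> gtid_current_pos is finally updated

mariabackup, running between steps 1 and 2:
  BACKUP STAGE BLOCK_COMMIT             -> acquires the commit lock
  SHOW MASTER STATUS                    -> sees the NEW binlog position (trx flushed)
  SHOW VARIABLES .. 'gtid_current_pos'  -> still the OLD GTID (trx not committed)
  => the saved binlog position and the saved GTID disagree
```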
| Comment by Vladislav Lesin [ 2020-09-03 ] |

monty, I have just confirmed the above scenario with a test case: https://github.com/MariaDB/server/commit/cbcb921bb83be8b84142dfa73583724c8af843a8.
| Comment by Vladislav Lesin [ 2020-09-04 ] |

I have just added one more test case to reproduce the scenario without replication: https://github.com/MariaDB/server/commit/f19ff5b68e70214c4d75a2165ce8f5ef8cd2739f. The problem is the same: binlog flushing in ha_commit_trans() is not covered by the backup lock.
| Comment by Michael Widenius [ 2020-09-07 ] |

Code to review is in bb-10.4-monty
| Comment by Michael Widenius [ 2020-09-25 ] |

Fixed by partly reverting
The original problem for
The main part of the bug fix is to release the MDL_BACKUP_COMMIT lock while