[MDEV-32673] Handle InnoDB binlog position recovery for RESET MASTER or binlog name change Created: 2023-11-03 Updated: 2023-12-01 |
|
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Replication, Storage Engine - InnoDB |
| Affects Version/s: | 10.4 |
| Fix Version/s: | 10.4, 10.5, 10.6, 10.11, 11.0, 11.1, 11.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Kristian Nielsen | Assignee: | Sergei Golubchik |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
|
This is a re-filing of The problem is demonstrated by a test case here: https://github.com/MariaDB/server/commits/knielsen_mdev22351 With every transaction commit, InnoDB stores the corresponding binlog position in one of 128 slots in rseg headers. During recovery (or mariabackup --prepare), the most recent entry must be chosen as the one to recover. This is currently done by comparing (filename,offset) with lexicographical ordering. This usually works, because binlog files are named BASE.000001, BASE.000002, ... But if the server is restarted with a new basename that compares lower than the old, or if RESET MASTER is run, then an old entry can compare higher on the filename than a newer entry. This can lead to the old entry errorneously being recovered as the current binlog position. The above git branch contains a proof-of-concept fix that adds an incrementing version number to entries, but it requires extending the format of the entries in the rseg headers. This was deemed undesirable. An alternative fix is as suggested by Marko in |
| Comments |
| Comment by Sergei Golubchik [ 2023-11-03 ] | ||||||||||||||||||||||||||||||||||||
|
As far as I remember the need for InnoDB to store binlog position was removed around 2005, with the implementation of 2pc, which provided a different generic way of binlog-engine synchronization. I'm not sure we need to extend the server to allow InnoDB to maintain better the information it should've stopped maintaining 18 years ago. | ||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-03 ] | ||||||||||||||||||||||||||||||||||||
|
It's done to be able to take an xtrabackup/mariabackup of the InnoDB tablespaces and use it to provision a slave with transactionally consistent binlog start position, without having to block the server with any locks during the backup. | ||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-11-07 ] | ||||||||||||||||||||||||||||||||||||
|
serg, in InnoDB trx_t, there used to be fields mysql_log_file_name, mysql_master_log_file_name, and repl_wait_binlog_name. For persistent storage, I found TRX_SYS_MYSQL_MASTER_LOG_INFO and TRX_SYS_MYSQL_LOG_INFO. In MySQL 5.7, the field TRX_SYS_MYSQL_MASTER_LOG_INFO was removed in 2014 and its parent and grandparent commits. | ||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-11-07 ] | ||||||||||||||||||||||||||||||||||||
|
knielsen, while reviewing this, can you please suggest a test for covering changes of log_bin_basename? | ||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2023-11-07 ] | ||||||||||||||||||||||||||||||||||||
|
marko, please, don't change the storage engine API to support this hack. If it cannot be fixed purely inside InnoDB (with some kind of a new hack, I presume), then let's make mariabackup not to rely on it and merge the removal change from MySQL | ||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-07 ] | ||||||||||||||||||||||||||||||||||||
|
Review sent: Looks good, just one thing to fix regarding relaylogs and one suggestion for adding a comment. | ||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2023-11-07 ] | ||||||||||||||||||||||||||||||||||||
|
still, this code is redundant since 2005, MySQL removed it in 2014. We should remove it, not improve! Don't push server changes from your commit. | ||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-07 ] | ||||||||||||||||||||||||||||||||||||
|
serg, can you help me understand what code it is you think is redundant since 2005 and should be removed? mariabackup is a fork of xtrabackup which works by dirty-copying innodb data files and using redo-logs to recover a consistent backup from that, I don't think that is redundant. And to provision a slave from a mariabackup, a binlog position is needed consistent with the recovered backup. To do that without locking the server, the binlog position is stored transactionally in redo log. This is also necessary since the binlog is separate from InnoDB redo lock and requires 2-phase commit. So also not redundant? And now we have a regression in 10.4 (as shown with testcase) that the backup can have wrong binlog position in some cases. We have two different suggested fixes (with different pros and cons), I'm trying to understand if you have a third idea? - Kristian. | ||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-11-08 ] | ||||||||||||||||||||||||||||||||||||
|
serg, which removal are you talking about? mysql-5.7.44 (the last release of the MySQL 5.7 series) defines and uses TRX_SYS_MYSQL_LOG_INFO. Also the currently latest 8.0 and trunk branches of MySQL store and retrieve something at TRX_SYS_MYSQL_LOG_INFO. | ||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-11-08 ] | ||||||||||||||||||||||||||||||||||||
|
knielsen, thank you for the regression test. I verified that it fails if the fix is absent:
| ||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2023-11-09 ] | ||||||||||||||||||||||||||||||||||||
|
knielsen, yes. in the 2pc protocol synchronization of InnoDB and binlog is done via XIDs. Binlog position is the last XID committed in InnoDB, I presume. Supposedly, if InnoDB stores GTID, one can use GTID directly, that's more convenient. Otherwise, one can scan the binlog for the correct Xid_log_event. | ||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-09 ] | ||||||||||||||||||||||||||||||||||||
|
The problem we need to solve here is precisely being able to compare transactions to identify "the last committed in InnoDB". XIDs and GTIDs cannot be used for this purpose, they don't compare in binlog commit order. The binlog file/offset does compare in binlog order. Except for the cases of RESET MASTER and binlog file rename, which is what Marko's patch addresses. RESET MASTER deletes the binlog, so not sure how the binlog could be consulted to resolve this issue. There's another patch https://github.com/MariaDB/server/commit/398dc1fd3cde05669e6e635396da0f9b0738aa6e which solves this by allocating an extra number that compares in binlog order. But this requires InnoDB data format change, which we wanted to avoid. This all comes because of a scalability improvement in InnoDB where there are now multiple entries identifying committed transactions, and mariabackup needs to identify the most recent one. Before, this information was written into a single location in the TRX_SYS tablespace, so no possibility to pick the wrong one. It is well understood that this whole area could be greatly improved. That's why for example Marko and I are discussing how to move the binlog to be transactionally written under the InnoDB write-ahead log. But this is a separate issue of solving a specific regression in provisioning a new slave with mariabackup. | ||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-11-10 ] | ||||||||||||||||||||||||||||||||||||
|
Salve knielsen! A XID-based method to identify the "the last committed in InnoDB" (in binlog) actually exist. I remember pointing to you on the last Unconference in a slight different context though, that I also thought about assign XID := GTID which perhaps is even better, contemplating a fruit of your labor. | ||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-11-10 ] | ||||||||||||||||||||||||||||||||||||
|
Elkin, I think that changing the interpretation of persistent data structures actually constitutes a file format change. If some "BGC module" needs to be changed, how do you guarantee compatibility between (say) 10.4.32 and 10.4.33 running as primary or replica, either way around? This hypothetical "BGC module" change would be absent in 10.4.32. I think that we must generally avoid any file format changes in GA releases unless there is a very strong reason for that and a tested way to prevent a downgrade to older incompatible versions, such as in the case of You may have a different interpretation of what a "file format change" means. For example, | ||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-11-10 ] | ||||||||||||||||||||||||||||||||||||
|
marko, this point is reasonable but when it comes to files, the Innodb logs or mysqld binlog, there's no format change. The OLD <-> NEW replication concern would be fair, but it does not exist, because we don't replicate Xid_log_event::"seq_no" (we actually do, but it's fair to think it's filtered out of replication). | ||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-11 ] | ||||||||||||||||||||||||||||||||||||
|
Andrei, I don't understand how XIDs can be allocated in binlog order? The | ||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-11-13 ] | ||||||||||||||||||||||||||||||||||||
|
As far as I understand, the scenario that this bug affects is
As far as I understand, this implies --binlog-info=ON a.k.a. 1 when log_bin is enabled on the server. Kristian’s test case in my proposed fix explicitly specifies that option. See also | ||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2023-11-13 ] | ||||||||||||||||||||||||||||||||||||
|
I specifically wrote above
and then you won't need XID values to compare in commit order. | ||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-13 ] | ||||||||||||||||||||||||||||||||||||
|
Elkin, serg, if you want to participate in the decisions about how to fix this bug, then please take the time to understand the problem and the proposed solutions. Suggesting to scan the binlog for Xid_log_event makes no sense:
Marko's suggested patch fixes the problem by implementing a mechanism for replication to inform the storage engine when the binlog is initialized at RESET MASTER and server startup. This seems a reasonable thing to have, could also be useful for other engines and/or other situations. Discussing refinements to the new API will be very welcome, it's good to think such things through to plan for future enhancements. Or alternatively, suggest a concrete way to fix the bug, for which a testcase exists to verify possible solutions. Preferably with some prototype patch/pseudocode to show the proposed solution. | ||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-11-13 ] | ||||||||||||||||||||||||||||||||||||
|
knielsen, my proposal of changing XID::"seq_no" evaluation (honestly) assumed of course XID as relevant. Otherwise, as you've clarified, that should belong to the future. I guess I recalled the immediate issue, but not fully sure. In the 3rd bullet do you mean the last committed trx would be found committed at recovery too, | ||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-11-13 ] | ||||||||||||||||||||||||||||||||||||
|
knielsen, to the future part related question, as a concept I consider to assign set @@seq_no = counter++ to the user trx near innobase_xa_prepare and | ||||||||||||||||||||||||||||||||||||
| Comment by Michael Widenius [ 2023-11-20 ] | ||||||||||||||||||||||||||||||||||||
|
Why not have mariabackup ask the server under BACKUP STAGE BLOCK_COMMIT for the current binlog position and store that as part of the backup? | ||||||||||||||||||||||||||||||||||||
| Comment by Michael Widenius [ 2023-11-20 ] | ||||||||||||||||||||||||||||||||||||
|
MariaDB community backup is not using BACKUP STAGE BLOCK_COMMIT yet, but that should be fixed in 10.6 before next release | ||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-11-20 ] | ||||||||||||||||||||||||||||||||||||
|
knielsen, marko, serg: could we consider an existing handlerton::ha_flush_logs and engine would get the intent of the binlog RESETing, rather than the binlog rotation, through some flag in current_thd? | ||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-11-28 ] | ||||||||||||||||||||||||||||||||||||
|
I think that this is waiting for serg to suggest a better solution than the two in the Description. |