[MDEV-32673] Handle InnoDB binlog position recovery for RESET MASTER or binlog name change Created: 2023-11-03  Updated: 2023-12-01

Status: Stalled
Project: MariaDB Server
Component/s: Replication, Storage Engine - InnoDB
Affects Version/s: 10.4
Fix Version/s: 10.4, 10.5, 10.6, 10.11, 11.0, 11.1, 11.2

Type: Bug Priority: Major
Reporter: Kristian Nielsen Assignee: Sergei Golubchik
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-22351 InnoDB may report incorrect binlog po... Closed

 Description   

This is a re-filing of MDEV-22351.
The original fix of MDEV-22351 was not working and has been reverted.

The problem is demonstrated by a test case here:

https://github.com/MariaDB/server/commits/knielsen_mdev22351

With every transaction commit, InnoDB stores the corresponding binlog position in one of 128 slots in rseg headers. During recovery (or mariabackup --prepare), the most recent entry must be chosen as the one to recover.

This is currently done by comparing (filename,offset) with lexicographical ordering. This usually works, because binlog files are named BASE.000001, BASE.000002, ...

But if the server is restarted with a new basename that compares lower than the old, or if RESET MASTER is run, then an old entry can compare higher on the filename than a newer entry. This can lead to the old entry errorneously being recovered as the current binlog position.

The above git branch contains a proof-of-concept fix that adds an incrementing version number to entries, but it requires extending the format of the entries in the rseg headers. This was deemed undesirable.

An alternative fix is as suggested by Marko in MDEV-22351. If the server would inform storage engines when the filename of the binlog changes (from RESET MASTER, or a config change and server restart), then InnoDB can clear all the old entries and write a new entry with the correct current position. Such clear+write should probably be done as a single mini-transaction to make it crash-safe.



 Comments   
Comment by Sergei Golubchik [ 2023-11-03 ]

As far as I remember the need for InnoDB to store binlog position was removed around 2005, with the implementation of 2pc, which provided a different generic way of binlog-engine synchronization.

I'm not sure we need to extend the server to allow InnoDB to maintain better the information it should've stopped maintaining 18 years ago.

Comment by Kristian Nielsen [ 2023-11-03 ]

It's done to be able to take an xtrabackup/mariabackup of the InnoDB tablespaces and use it to provision a slave with transactionally consistent binlog start position, without having to block the server with any locks during the backup.
It's kind of a hack, but it's been there since forever, and users rely on it.

Comment by Marko Mäkelä [ 2023-11-07 ]

serg, in InnoDB trx_t, there used to be fields mysql_log_file_name, mysql_master_log_file_name, and repl_wait_binlog_name. For persistent storage, I found TRX_SYS_MYSQL_MASTER_LOG_INFO and TRX_SYS_MYSQL_LOG_INFO. In MySQL 5.7, the field TRX_SYS_MYSQL_MASTER_LOG_INFO was removed in 2014 and its parent and grandparent commits.

Comment by Marko Mäkelä [ 2023-11-07 ]

knielsen, while reviewing this, can you please suggest a test for covering changes of log_bin_basename?

Comment by Sergei Golubchik [ 2023-11-07 ]

marko, please, don't change the storage engine API to support this hack. If it cannot be fixed purely inside InnoDB (with some kind of a new hack, I presume), then let's make mariabackup not to rely on it and merge the removal change from MySQL

Comment by Kristian Nielsen [ 2023-11-07 ]

Review sent:

https://lists.mariadb.org/hyperkitty/list/developers@lists.mariadb.org/thread/CJZ6QRA6FSU5ATHARZG57ZM3TPOCJMMK/

Looks good, just one thing to fix regarding relaylogs and one suggestion for adding a comment.

Comment by Sergei Golubchik [ 2023-11-07 ]

still, this code is redundant since 2005, MySQL removed it in 2014. We should remove it, not improve! Don't push server changes from your commit.

Comment by Kristian Nielsen [ 2023-11-07 ]

serg, can you help me understand what code it is you think is redundant since 2005 and should be removed?

mariabackup is a fork of xtrabackup which works by dirty-copying innodb data files and using redo-logs to recover a consistent backup from that, I don't think that is redundant.

And to provision a slave from a mariabackup, a binlog position is needed consistent with the recovered backup. To do that without locking the server, the binlog position is stored transactionally in redo log. This is also necessary since the binlog is separate from InnoDB redo lock and requires 2-phase commit. So also not redundant?

And now we have a regression in 10.4 (as shown with testcase) that the backup can have wrong binlog position in some cases. We have two different suggested fixes (with different pros and cons), I'm trying to understand if you have a third idea?

- Kristian.

Comment by Marko Mäkelä [ 2023-11-08 ]

serg, which removal are you talking about? mysql-5.7.44 (the last release of the MySQL 5.7 series) defines and uses TRX_SYS_MYSQL_LOG_INFO. Also the currently latest 8.0 and trunk branches of MySQL store and retrieve something at TRX_SYS_MYSQL_LOG_INFO.

Comment by Marko Mäkelä [ 2023-11-08 ]

knielsen, thank you for the regression test. I verified that it fails if the fix is absent:

--- /mariadb/10.4m/mysql-test/suite/mariabackup/binlog.result	2023-11-08 16:40:52.616593017 +0200
+++ /mariadb/10.4m/mysql-test/suite/mariabackup/binlog.reject	2023-11-08 16:46:34.560730165 +0200
@@ -27,7 +27,7 @@
      INTO TABLE t2 FIELDS ESCAPED BY '' (file, pos);
 UPDATE t2 SET file= REPLACE(REPLACE(file, './', ''), '.\\', '');
 file_ok	pos_ok
-OK	OK
+Wrong file: master-bin.000003 expected: master-bin.000001	Wrong position: 507 expected: 657
 *** Check correct position after server restart.
 # restart
 INSERT INTO t VALUES (8);
@@ -36,7 +36,7 @@
      INTO TABLE t2 FIELDS ESCAPED BY '' (file, pos);
 UPDATE t2 SET file= REPLACE(REPLACE(file, './', ''), '.\\', '');
 file_ok	pos_ok
-OK	OK
+Wrong file: master-bin.000003 expected: master-bin.000002	OK
 *** Check correct position after rename binlog basename.
 # restart: --log-bin=000-bin
 INSERT INTO t VALUES (9);
@@ -45,7 +45,7 @@
      INTO TABLE t2 FIELDS ESCAPED BY '' (file, pos);
 UPDATE t2 SET file= REPLACE(REPLACE(file, './', ''), '.\\', '');
 file_ok	pos_ok
-OK	OK
+Wrong file: master-bin.000003 expected: 000-bin.000001	Wrong position: 507 expected: 504
 *** Check correct position after delete binlog files and restart
 # restart
 TRUNCATE TABLE t2;
@@ -53,5 +53,5 @@
      INTO TABLE t2 FIELDS ESCAPED BY '' (file, pos);
 UPDATE t2 SET file= REPLACE(REPLACE(file, './', ''), '.\\', '');
 file_ok	pos_ok
-OK	OK
+Wrong file: master-bin.000003 expected: master-bin.000001	Wrong position: 507 expected: 4
 DROP TABLE t,t2;

Comment by Sergei Golubchik [ 2023-11-09 ]

knielsen, yes. in the 2pc protocol synchronization of InnoDB and binlog is done via XIDs. Binlog position is the last XID committed in InnoDB, I presume. Supposedly, if InnoDB stores GTID, one can use GTID directly, that's more convenient. Otherwise, one can scan the binlog for the correct Xid_log_event.

Comment by Kristian Nielsen [ 2023-11-09 ]

The problem we need to solve here is precisely being able to compare transactions to identify "the last committed in InnoDB". XIDs and GTIDs cannot be used for this purpose, they don't compare in binlog commit order.

The binlog file/offset does compare in binlog order. Except for the cases of RESET MASTER and binlog file rename, which is what Marko's patch addresses.

RESET MASTER deletes the binlog, so not sure how the binlog could be consulted to resolve this issue.

There's another patch https://github.com/MariaDB/server/commit/398dc1fd3cde05669e6e635396da0f9b0738aa6e which solves this by allocating an extra number that compares in binlog order. But this requires InnoDB data format change, which we wanted to avoid.

This all comes because of a scalability improvement in InnoDB where there are now multiple entries identifying committed transactions, and mariabackup needs to identify the most recent one. Before, this information was written into a single location in the TRX_SYS tablespace, so no possibility to pick the wrong one.

It is well understood that this whole area could be greatly improved. That's why for example Marko and I are discussing how to move the binlog to be transactionally written under the InnoDB write-ahead log. But this is a separate issue of solving a specific regression in provisioning a new slave with mariabackup.

Comment by Andrei Elkin [ 2023-11-10 ]

Salve knielsen!

A XID-based method to identify the "the last committed in InnoDB" (in binlog) actually exist. I remember pointing to you on the last Unconference in a slight different context though, that
trx:s can be enumerated for binlogging (in that order) at the time they acquire xids for themself. The current XID value is based on the query-id which can be changed to
a-next-to-binlog sequence number.
Of course the BGC module would have to adapt to that, esp in the slave case, so that
in the end we should gain binlog trx:s ordered by XID:"seq_no".

I also thought about assign XID := GTID which perhaps is even better, contemplating a fruit of your labor.

Comment by Marko Mäkelä [ 2023-11-10 ]

Elkin, I think that changing the interpretation of persistent data structures actually constitutes a file format change. If some "BGC module" needs to be changed, how do you guarantee compatibility between (say) 10.4.32 and 10.4.33 running as primary or replica, either way around? This hypothetical "BGC module" change would be absent in 10.4.32.

I think that we must generally avoid any file format changes in GA releases unless there is a very strong reason for that and a tested way to prevent a downgrade to older incompatible versions, such as in the case of MDEV-13564. It would also be a good idea to introduce an option to retain the old format (like innodb_safe_truncate=OFF in MDEV-13564).

You may have a different interpretation of what a "file format change" means. For example, MDEV-17567 in MariaDB Server 10.6 introduced one more log file for the recovery of DDL operations. If you start MariaDB Server 10.5 on a data directory of a killed 10.6 server, it may actually work in practice, as long as there was no DDL operation that had to be recovered. Even if we ignore the ddl_recovery.log file and changes outside InnoDB, I would not claim that there was no file format change. Yes, the data format of the persistent InnoDB files was not changed, but the DDL operations were heavily refactored and the interpretation of the pre-existing system column DB_TRX_ID in data dictionary tables was changed. Strictly speaking, changing the interpretation of persistently stored data is a file format change.

Comment by Andrei Elkin [ 2023-11-10 ]

marko, this point is reasonable but when it comes to files, the Innodb logs or mysqld binlog, there's no format change.
I am talking about some integer type value changes, to merely 'enumerate' differently.

The OLD <-> NEW replication concern would be fair, but it does not exist, because we don't replicate Xid_log_event::"seq_no" (we actually do, but it's fair to think it's filtered out of replication).
NEW server master or slave would run an upgraded BGC module whose affects apply to XID::"seq_no" and Xid_log_event::"seq_no" values, never to their formats,
for Innodb/server recovery.
The only natural requirement is to not upgrade until recovery has done.

Comment by Kristian Nielsen [ 2023-11-11 ]

Andrei, I don't understand how XIDs can be allocated in binlog order? The
XID is needed in innobase_xa_prepare(), at that time the commit order is not
yet determined.

Comment by Marko Mäkelä [ 2023-11-13 ]

As far as I understand, the scenario that this bug affects is

mariadb-backup --backup --no-lock …

As far as I understand, this implies --binlog-info=ON a.k.a. 1 when log_bin is enabled on the server. Kristian’s test case in my proposed fix explicitly specifies that option.

See also MDEV-26322 and MDEV-18917. There seems to be some misunderstanding here, but I would trust knielsen more than the big fat warnings in the documentation of the --no-lock option.

Comment by Sergei Golubchik [ 2023-11-13 ]

I specifically wrote above

Otherwise, one can scan the binlog for the correct Xid_log_event.

and then you won't need XID values to compare in commit order.

Comment by Kristian Nielsen [ 2023-11-13 ]

Elkin, serg, if you want to participate in the decisions about how to fix this bug, then please take the time to understand the problem and the proposed solutions.

Suggesting to scan the binlog for Xid_log_event makes no sense:

  • We do not have the XID, nor do we need it as we have the binlog coordinate of the Xid_log_event directly.
  • We do not have the binlog at the time of mariabackup --prepare, as mariabackup does not backup binlog files.
  • The specific bug here is about transactions committed shortly before RESET MASTER is run. RESET MASTER deletes the binlog files, so it is not possible to scan them to reliably identify the old commits.

Marko's suggested patch fixes the problem by implementing a mechanism for replication to inform the storage engine when the binlog is initialized at RESET MASTER and server startup. This seems a reasonable thing to have, could also be useful for other engines and/or other situations.

Discussing refinements to the new API will be very welcome, it's good to think such things through to plan for future enhancements.

Or alternatively, suggest a concrete way to fix the bug, for which a testcase exists to verify possible solutions. Preferably with some prototype patch/pseudocode to show the proposed solution.

Comment by Andrei Elkin [ 2023-11-13 ]

knielsen, my proposal of changing XID::"seq_no" evaluation (honestly) assumed of course XID as relevant. Otherwise, as you've clarified, that should belong to the future.

I guess I recalled the immediate issue, but not fully sure. In the 3rd bullet do you mean the last committed trx would be found committed at recovery too,
and its binlog coordinates would become a part of --binlog-info content? If that's correct we'd go with Marko's plan in GA:s, but also consider to switch to GTID instead of file:pos in 11.4?

Comment by Andrei Elkin [ 2023-11-13 ]

knielsen, to the future part related question, as a concept I consider to assign set @@seq_no = counter++ to the user trx near innobase_xa_prepare and
exploit the same or refined/extended logics of the slave BGC that orders/waits-for commit parents.

Comment by Michael Widenius [ 2023-11-20 ]

Why not have mariabackup ask the server under BACKUP STAGE BLOCK_COMMIT for the current binlog position and store that as part of the backup?
As there are no commits, the current binlog position is well known. This will also work if InnoDB would not be enabled.

Comment by Michael Widenius [ 2023-11-20 ]

MariaDB community backup is not using BACKUP STAGE BLOCK_COMMIT yet, but that should be fixed in 10.6 before next release

Comment by Andrei Elkin [ 2023-11-20 ]

knielsen, marko, serg: could we consider an existing handlerton::ha_flush_logs and engine would get the intent of the binlog RESETing, rather than the binlog rotation, through some flag in current_thd?

Comment by Marko Mäkelä [ 2023-11-28 ]

I think that this is waiting for serg to suggest a better solution than the two in the Description.

Generated at Thu Feb 08 10:33:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.