[MDEV-30165] X-lock on supremum for prepared transaction for RR Created: 2022-12-06  Updated: 2023-12-22  Resolved: 2023-09-22

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.15
Fix Version/s: 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3

Type: Bug Priority: Critical
Reporter: Vladislav Lesin Assignee: Vladislav Lesin
Resolution: Fixed Votes: 1
Labels: parallelslave

Issue Links:
Problem/Incident
causes MDEV-32272 lock_release_on_prepare_try() does no... Closed
causes MDEV-32353 lock_release_on_prepare_try() does no... Closed
Relates
relates to MDEV-29642 Server Crash During XA Prepare Can Br... Closed
relates to MDEV-31949 slow parallel replication of user xa In Review
relates to MDEV-28709 unexpected X lock on Supremum in READ... Closed

 Description   

Statement-based replication stops because on the slave there is an X-lock on the supremum record held by a prepared transaction.



 Comments   
Comment by Vladislav Lesin [ 2022-12-19 ]

An update.

I have created a custom build branch with the following fixes:

1. lock_release_on_prepare() is invoked for any isolation level and contains a check for lock->is_gap(), but any lock on the supremum is considered a gap lock. We can't just release a lock with the supremum bit set for RR on prepare, because other bits can be set in it too. The solution is to create a gap lock for the supremum (currently the "gap" bit is always cleared on the supremum), which will then be released by the MDEV-26682 fix on XA PREPARE. Such a fix can cause extra memory consumption and needs to be tested carefully before merging to trunk.

2. trx_t::set_skip_lock_inheritance() must be invoked at the very beginning of lock_release_on_prepare(). Currently trx_t::set_skip_lock_inheritance() is invoked at the end of lock_release_on_prepare(), when lock_sys and trx are released, so there can be a case where locks are released on prepare but the "do not inherit gap locks" bit has not yet been set, and a page split inherits a lock to the supremum.

3. Besides, I would print out a stack trace if the transaction is prepared in lock_rec_set_nth_bit(), and create one more build for the customer, which would include the above fixes.
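To illustrate the subtlety in point (1): in InnoDB a lock on the supremum pseudo-record behaves as a gap lock even when its LOCK_GAP bit is clear, so lock->is_gap() alone cannot decide what may be released on XA PREPARE. A minimal sketch of that rule — the constant values are assumptions mirroring lock0lock.h, not the actual source:

```cpp
// Illustrative sketch only, not MariaDB source code.
#include <cassert>
#include <cstdint>

constexpr uint32_t PAGE_HEAP_NO_SUPREMUM = 1;  // heap number of the supremum (assumed)
constexpr uint32_t LOCK_GAP = 512;             // assumed flag value

// Effective "gap" semantics of one bit in a record lock's bitmap: either the
// lock object carries LOCK_GAP, or the bit refers to the supremum pseudo-record.
inline bool behaves_as_gap_lock(uint32_t type_mode, uint32_t heap_no)
{
  return (type_mode & LOCK_GAP) != 0 || heap_no == PAGE_HEAP_NO_SUPREMUM;
}
```

This is why simply releasing every lock for which is_gap() holds would miss supremum locks, and why releasing every lock object with the supremum bit set would be wrong: that same object may cover ordinary records too.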

It is under testing now.

But I have two notes about it.

1. MDEV-26682 does not fully protect replication from lock wait timeouts. It releases S- and gap locks, but X-locks can still cause timeouts if the prepared transaction is neither committed nor rolled back for long enough. Besides, there can also be next-key locks in RR, which are not released on prepare if they are exclusive. Should we fix MDEV-30165? An X-lock on the supremum can be set in RR, and it's a valid operation. If we remove such locks on prepare, how will this help us avoid lock wait timeouts?

2. After analysing a log file with extra logging for timed-out waiting locks, provided by the customer, I found that the waited-for transaction is a recovered prepared one. It was recovered, but neither committed nor rolled back for long enough to cause a lock wait timeout error for the waiting transaction. InnoDB just recovers prepared XA transactions from the redo log during the recovery procedure; the higher layer should decide what to do with such transactions. There is nothing to do here from the InnoDB perspective.

Summary.

The current changes in the custom build make the prepared transaction less locking, like MDEV-26682, but they do not fully avoid lock wait timeout errors, and they do not fix the root cause: recovered prepared transactions that stay uncommitted and not rolled back for a long time. We should address this to the replication team.

Comment by Aurélien LEQUOY [ 2023-06-08 ]

With binlog_format = statement + parallel replication, it happened to me when I restarted replication after a load made with mydumper; I got the same problem. (It was without a crash.)

Comment by Aurélien LEQUOY [ 2023-06-08 ]

GTID was active.

Comment by Marko Mäkelä [ 2023-08-22 ]

Could fixing MDEV-31949 fix this?

Comment by Marko Mäkelä [ 2023-09-15 ]

The fix looks correct to me. The debug check lock_sys.assert_locked(cell) could be moved from lock_rec_unlock_supremum() to lock_rec_rebuild_waiting_queue() to better document the concurrency protection. Also, it would be good to add a comment to lock_rec_rebuild_waiting_queue() that it is similar to lock_rec_dequeue_from_page().

We may want to introduce a new member function for this check that is invoked in both lock_release_on_prepare() and lock_release_on_prepare_try():

if (lock->mode() != LOCK_X || lock->is_gap())

Based on MDEV-30567, I believe that this translates into two conditional branches, while with some bitwise arithmetic we can get by with one conditional branch:

if ((lock->type_mode & (LOCK_MODE_MASK | LOCK_GAP)) == LOCK_X)

or (in this case, because we know that the LOCK_WAIT bit cannot be set):

if ((lock->type_mode & (LOCK_REC_NOT_GAP - 1)) == LOCK_X)

The last expression would correspond to a predicate name is_rec_granted_exclusive_not_gap().
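A compilable sketch of the three forms and their equivalence, with assumed constant values mirroring InnoDB's lock0lock.h encoding (this is an illustration, not the actual source):

```cpp
// Illustrative sketch only; constant values are assumptions, not lock0lock.h.
#include <cassert>
#include <cstdint>

constexpr uint32_t LOCK_X = 3;               // exclusive lock mode
constexpr uint32_t LOCK_MODE_MASK = 0xF;     // low bits hold the mode
constexpr uint32_t LOCK_WAIT = 256;          // single flag bits above the mode
constexpr uint32_t LOCK_GAP = 512;
constexpr uint32_t LOCK_REC_NOT_GAP = 1024;

// Two-branch form of the release check: lock->mode() != LOCK_X || lock->is_gap()
inline bool release_on_prepare(uint32_t m)
{ return (m & LOCK_MODE_MASK) != LOCK_X || (m & LOCK_GAP); }

// One-branch form: the mask covers the mode bits and LOCK_GAP at once.
inline bool x_not_gap_one_branch(uint32_t m)
{ return (m & (LOCK_MODE_MASK | LOCK_GAP)) == LOCK_X; }

// Exploits LOCK_WAIT being impossible here: every bit below LOCK_REC_NOT_GAP
// other than the LOCK_X mode bits must be clear.
inline bool is_rec_granted_exclusive_not_gap(uint32_t m)
{ return (m & (LOCK_REC_NOT_GAP - 1)) == LOCK_X; }
```

The last two predicates agree only because LOCK_WAIT cannot be set at this point, as noted above; with LOCK_WAIT set they would differ.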

Comment by Vladislav Lesin [ 2023-09-18 ]

My previous fix was based on the idea of making locks on the supremum true gap locks, to allow the code which releases gap locks on "XA prepare" to release supremum locks too.

Currently, all the locks on the supremum are not-gap locks, but everywhere in the code they are treated as gap locks. This approach allows reusing bits in lock bitmaps instead of creating a new lock object. It is assumed that creating not-gap locks is a more likely case than creating gap locks, and that is why reusing a not-gap lock bitmap is more likely than reusing a gap-lock bitmap.

That's why my fix is not efficient from the perspective of memory usage, and I decided to rework it. The idea is to reset the supremum lock and to rebuild the waiting queue on "XA prepare". The corresponding commit has passed RQG testing, and I am currently working on code review fixes.
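A toy model of that idea (hypothetical structures, not InnoDB source): clear the supremum bit from a granted lock's bitmap, then re-scan the queue and grant any waiter that no longer conflicts.

```cpp
// Toy model only; real InnoDB lock objects and queues look different.
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t PAGE_HEAP_NO_SUPREMUM = 1;  // assumed heap number

struct RecLock {
  std::bitset<64> heap_bits;  // one bit per heap number on the page
  bool waiting = false;       // true if the lock is still waiting to be granted
};

// Drop the supremum bit from a granted lock; return whether anything changed.
inline bool unlock_supremum(RecLock& lock)
{
  if (!lock.heap_bits.test(PAGE_HEAP_NO_SUPREMUM)) return false;
  lock.heap_bits.reset(PAGE_HEAP_NO_SUPREMUM);
  return true;
}

// Grant any waiter that no longer conflicts. In this toy model a waiter
// conflicts only if some granted lock still covers one of its heap numbers.
inline void rebuild_waiting_queue(std::vector<RecLock>& queue)
{
  for (RecLock& w : queue)
    if (w.waiting) {
      bool blocked = false;
      for (const RecLock& g : queue)
        if (!g.waiting && (g.heap_bits & w.heap_bits).any()) blocked = true;
      if (!blocked) w.waiting = false;
    }
}
```

The point of the rebuild step is that removing the supremum bit may unblock waiters, so the queue has to be re-evaluated in order, much like lock_rec_dequeue_from_page() does after a normal release.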

Comment by Kristian Nielsen [ 2023-09-21 ]

I'm deeply concerned about the amount of changes to work around what is fundamentally a wrong design of the replication of user XA transactions, MDEV-32020.

This bug seems to be just one example of an endless stream of changes around the code that will significantly complicate future development and likely introduce new bugs. I believe this will very negatively affect the value of MariaDB for the vast majority of users, all for an extremely marginal use case of XA-prepare-on-master-XA-commit-on-slave. And I don't believe it will ever be successful, as it doesn't address the underlying design problems, MDEV-32020.

Please consider instead fixing the XA code to replicate at XA COMMIT, not XA PREPARE, as it should, and as it used to. Then a lot of problems around this just disappear. And the recovery on the slave of a transaction that was XA PREPAREd on the master can be implemented without sprinkling code and if-statements all over the replication, InnoDB, and binlog code. I'm happy to help with this.

Comment by Marko Mäkelä [ 2023-09-21 ]

knielsen, thank you, I really appreciate your comment, and that you filed MDEV-32020.

I agree that this as well as MDEV-28709 or MDEV-26682 are work-arounds for something that would better be fixed at a higher layer, specifically, the replication of distributed transactions.

Comment by Andrei Elkin [ 2023-09-21 ]

knielsen, no less than Marko, I appreciate your constructive critique and your ideas, always full of merit. When it comes to the 'wrong design', I must admit it was not perfect; a few decisions could have been taken earlier. Nevertheless, there is nothing fundamental in the way of two-phase XA replication performing bug-free, including MDEV-32020, where I mentioned a solution: add the forced index(es) into the Rows-log-event context and force using them on the slave too, to provide (this must go without saying) the fastest failover compared with other methods, as well as a framework for further online replication.
It is indeed a field of great technical challenges. Yet so far we have emerged successfully from every difficulty, for which I must be grateful to our InnoDB colleagues.

Comment by Marko Mäkelä [ 2023-09-22 ]

This change may be useful on its own, independent of replication, to reduce some avoidable locking conflicts. This has been extensively tested, and I do not foresee any correctness problems. But I do tend to agree with knielsen’s views in MDEV-32020.

Comment by Kristian Nielsen [ 2023-09-22 ]

I did a quick test, and I see that the gap locks are released only after the XA PREPARE is binlogged, so the binlog order between conflicting transactions should be ok.

But now I wonder how it is ensured that the binlog order is consistent with the order of XA PREPARE inside InnoDB? This is usually handled through innobase_commit_ordered(), but I think that is not involved here, as there is no commit during XA PREPARE.

I'm wondering if this causes a problem for mariabackup --no-lock?
Can it happen that the binlog has the XA PREPARE of transactions T1 and T2 in this order, while InnoDB has in its redo log the transactions prepared in the opposite order, T2, T1?

And then if mariabackup --no-lock snapshots the database at the point where T2 is XA PREPAREd, but T1 is not, then there will be no valid binlog position corresponding to this state, so a slave cannot be provisioned from the backup?

Comment by Andrei Elkin [ 2023-12-22 ]

> And then if mariabackup --no-lock snapshots ...

While InnoDB can't (maybe yet) record the last binlog position of an XA PREPARE, those could be recovered on a slave being provisioned from the binlog, just as normal transactions are. Say the redo log contains C1,C2,P3,P4 (C stands for COMMIT, P for XA PREPARE), and the binlog has only C1,C2,P3. At slave server recovery, P4 is to be rolled back, but P3 would advance the replication start position (which initially specifies "next to C2").
Notice that in place of P3 in the binlog there can be any P:i. That is, the binlog does not have to be ordered for the P:s in the redo order. And while that is so, I see only one concern: identification of in-doubt P:s. For that purpose, the last binlog position of a C that InnoDB remembers upon crash can safely serve as the origin for scanning for P:s in the binlog. E.g. C2 is such an origin (2 is the binlog index): any P1 or earlier need not be considered in-doubt; P3 or P4 must be.
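The scan described above could be sketched as follows (hypothetical helper and types, purely illustrative; the server has no such function):

```cpp
// Illustrative sketch only; not MariaDB server code.
#include <cassert>
#include <cstdint>
#include <vector>

struct BinlogEvent {
  uint32_t pos;        // binlog position (index in the example above)
  bool is_xa_prepare;  // true for P events, false for C events
  uint64_t xid;        // transaction identifier
};

// Return the XIDs of XA PREPAREs that must be treated as in-doubt: only
// those at or after the last committed position that InnoDB remembers.
inline std::vector<uint64_t>
in_doubt_prepares(const std::vector<BinlogEvent>& binlog,
                  uint32_t last_committed_pos)
{
  std::vector<uint64_t> out;
  for (const BinlogEvent& e : binlog)
    if (e.is_xa_prepare && e.pos >= last_committed_pos)
      out.push_back(e.xid);
  return out;
}
```

In the example, scanning from the position of C2 would flag P3 (and P4, were it binlogged) as in-doubt, while any earlier P is excluded.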

Generated at Thu Feb 08 10:14:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.