[MDEV-31840] Parallel replication undetected deadlocks with outside transaction Created: 2023-08-03 Updated: 2023-09-03 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.6.14 |
| Fix Version/s: | 10.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Kristian Nielsen | Assignee: | Kristian Nielsen |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Description |
|
Consider (parallel) replicated transactions T1 and T2 that must commit in this order, along with an extra transaction U that is not ordered with respect to T1 and T2 (U can be a user transaction, or replicated in a separate domain_id). Suppose further that row locks cause U to wait on T2 and T1 to wait on U:

T1 -> U -> T2

Since T2 must wait_for_prior_commit on T1, this is a deadlock. It will not be caught by the InnoDB deadlock checker, as that checker does not know about wait_for_prior_commit.

In 10.4 this was resolved by a deadlock kill, though this seems to be somewhat accidental. It looks like InnoDB in 10.4 would traverse the wait-for graph in the deadlock detector and report to parallel replication all transitive waits in the graph, i.e. in this case it would report both the direct wait T1->U and the indirect wait T1->T2. The latter causes T2 to be deadlock killed and the deadlock resolved.

In 10.6 these transitive waits seem to be no longer reported, at least from first testing (see below for testcase). This means that replication will hang until we get a lock wait timeout.

It wasn't really intended in my original design that the storage engine would be required to report all transitive waits in the wait-for graph. It might be expensive to add, though on the other hand a lock wait is already expensive.

I'm also not sure how important this problem is. Normally, on a slave, we would not expect there to be user (non-replicated) transactions, much less ones that conflict with replicated transactions; nor are such conflicts expected between different GTID domain IDs. Still, it's not nice to have deadlocks that are not detected by the server.

Of course, the root of the problem is really the lack of a server-wide deadlock detector, so that InnoDB and parallel replication each have only part of the picture, and neither can solve the problem 100%. I think there will be other similar cases of missing deadlock detection, e.g. with multi-engine transactions.
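To illustrate why the cycle is invisible to InnoDB alone, here is a minimal sketch (not MariaDB code; the graph representation and DFS helper are mine) of cycle detection over a wait-for graph. With only the row-lock waits T1->U and U->T2 there is no cycle, so a lock-based detector reports nothing; adding the commit-order edge T2->T1 from wait_for_prior_commit closes the cycle, which only a detector aware of both kinds of waits can find.

```python
# Sketch only: a directed wait-for graph with DFS cycle detection.
# Edges are (waiter, blocker) pairs.

def find_cycle(edges):
    """Return a list of nodes forming a cycle, or None if the graph is acyclic."""
    graph = {}
    for waiter, blocker in edges:
        graph.setdefault(waiter, []).append(blocker)

    def dfs(node, path, visited):
        visited.add(node)
        path.append(node)
        for nxt in graph.get(node, []):
            if nxt in path:
                # Found a back-edge; the cycle is the path from nxt onward.
                return path[path.index(nxt):]
            if nxt not in visited:
                cycle = dfs(nxt, path, visited)
                if cycle:
                    return cycle
        path.pop()
        return None

    visited = set()
    for node in list(graph):
        if node not in visited:
            cycle = dfs(node, [], visited)
            if cycle:
                return cycle
    return None

# Row-lock waits only, as the InnoDB deadlock checker sees them: no cycle.
lock_waits = [("T1", "U"), ("U", "T2")]
assert find_cycle(lock_waits) is None

# Adding the commit-order constraint (T2 must wait_for_prior_commit on T1)
# closes the cycle T1 -> U -> T2 -> T1.
full_graph = lock_waits + [("T2", "T1")]
assert set(find_cycle(full_graph)) == {"T1", "U", "T2"}
```

This also shows why reporting transitive waits (T1->T2 in addition to T1->U) lets parallel replication resolve the deadlock: with that extra information it can see that a later-committing transaction is blocking an earlier one and issue the deadlock kill itself.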
I have not decided what the resolution of this issue should be, but for now I at least want to make sure the issue is documented. Test case:
|