[MDEV-26670] Unable to maintain replication since upgrading from 10.4 to 10.5 Created: 2021-09-23 Updated: 2021-11-22 Resolved: 2021-11-22 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.5.12 |
| Fix Version/s: | 10.5.13, 10.6.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Jon Ellis | Assignee: | Marko Mäkelä |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | regression | ||
| Environment: |
Ubuntu Focal 20.04.3 LTS |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
We're seeing the following error on a very regular basis when replicating a workload of XA transactions with a 10.5.12 primary / secondary:
The replication snapshot is set up with:
We've tried changing slave_parallel_mode back to conservative and non-zero values of slave_parallel_threads, but nothing keeps replication alive consistently. We're seeing this across multiple installs. What can we do to debug / analyze this problem? |
| Comments |
| Comment by Andrei Elkin [ 2021-09-24 ] |
|
jellisgwn: Howdy! Thanks for the report. We've been analyzing similar cases. Do you have a script that would let us reproduce it? |
| Comment by Andrei Elkin [ 2021-09-24 ] |
|
jellisgwn, at any rate could you please upload
Thank you. |
| Comment by Andrei Elkin [ 2021-09-24 ] |
|
jellisgwn: also, if you are hitting (or expect to hit) this sort of timeout error, you could check a work-around: 1. XA ROLLBACK the XA-prepared transactions (reported by XA RECOVER). Notice that p.1 may involve a big number of "casualties"; it can be refined to roll back only the transactions involved in the timeout error, which you would have to identify yourself (sure, we could help with that as well). |
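A minimal sketch of that work-around on the slave, assuming the timeout has already happened (the xid shown is a placeholder; the real values come from your own XA RECOVER output):

```sql
-- List transactions left in the XA-prepared state.
XA RECOVER;

-- For each xid reported above, roll it back. 'example-xid-1' is a
-- made-up placeholder for the "data" column of XA RECOVER; if the xid
-- contains non-printable bytes, XA RECOVER FORMAT='SQL' (where
-- available) prints it in a directly reusable form.
XA ROLLBACK 'example-xid-1';
```

This unconditionally discards the prepared work, so it only makes sense once you have decided those transactions will never be committed from the application side.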
| Comment by Jon Ellis [ 2021-09-24 ] |
|
Hi Andrei. Sadly we haven't yet been able to narrow down a reproduction. It has been plaguing us in prod for weeks, and if there is a pattern as to which workloads trigger it... we haven't yet identified it. Will work on providing the requested uploads over the next few days! It's likely not going to be possible to share the binlog as the data is sensitive, but we'll do what we can. |
| Comment by Andrei Elkin [ 2021-09-25 ] |
|
jellisgwn: Fyi, we have already identified a few patterns leading to the timeout. They involve non-unique indexes and also FK constraints (whether their uniqueness matters will be clarified soon). |
| Comment by Andrei Elkin [ 2021-09-27 ] |
|
We have caught a use case involving a secondary (unique) index. According to the lock status, an XA-prepared transaction apparently gains an incorrect S GAP lock (via its Write_rows_log_event, i.e. INSERT) which blocks the following transaction's Write_rows to that table. |
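While replication is stalled on such a lock, the wait can be observed on the slave with something like the following (a sketch; these Information Schema views exist in MariaDB but only show waits while they are in progress, and column availability can vary by version):

```sql
-- Pair each waiting transaction with the transaction blocking it.
SELECT r.trx_id    AS waiting_trx,
       r.trx_query AS waiting_query,
       b.trx_id    AS blocking_trx,
       b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;

-- The TRANSACTIONS section here also describes the waited-for record
-- lock, including whether it is a gap lock on a secondary index.
SHOW ENGINE INNODB STATUS;
```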
| Comment by Jon Ellis [ 2021-09-27 ] |
|
Hi Andrei, that's great news! We'll see if we can confirm that the tables giving us issues all involve secondary unique indexes. |
| Comment by Sergei Golubchik [ 2021-09-29 ] |
|
So far there are three isolated test cases that exhibit this behavior. One is MDEV-26652, and it should be fixed in 10.6. You can try 10.6 to see if the issue goes away. Two others are in And by "try" I mean not in production, of course. |
| Comment by Jon Ellis [ 2021-09-30 ] |
|
Sergei, when you said:
I'm confused by what it means to try the READ-COMMITTED isolation level "on the slave". Our application is using READ-COMMITTED, but is there replication configuration that changes the slave's isolation level? That seems strange... |
| Comment by Sergei Golubchik [ 2021-09-30 ] |
|
The default isolation level is REPEATABLE-READ, and I assume that's what most people use. I suggested that if you use REPEATABLE-READ, then you can set --transaction-isolation=READ-COMMITTED (only on the slave so that it'd have less effect on your application) and see if it'll help. If you're already using READ-COMMITTED, then, of course, this suggestion doesn't apply. You can still try 10.6 to see if you're affected by MDEV-26652. |
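Concretely, Sergei's suggestion can be applied on the replica only, roughly like this (a sketch; in 10.5 the server variable is tx_isolation, while the config-file / command-line option is spelled transaction-isolation):

```sql
-- On the slave only. SET GLOBAL affects new connections, so restart
-- the SQL thread to pick it up:
SET GLOBAL tx_isolation = 'READ-COMMITTED';
STOP SLAVE;
START SLAVE;

-- Or persistently, in the replica's my.cnf under [mysqld]:
--   transaction-isolation = READ-COMMITTED
```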
| Comment by Jon Ellis [ 2021-10-19 ] |
|
Sergei, your response prompted some further investigation / experiments.

By snapshotting a database and binlog and bringing them to a test environment, we were able to run the secondary with READ_COMMITTED as the default isolation... and this, in turn, made it possible to play the binlog back without issue. This was (is?) surprising. As previously noted, our workload is created via an XA connection at READ_COMMITTED; yet that causes problems when the binlog is processed on a secondary whose default is REPEATABLE_READ. My expectation was that the locking would be encapsulated within the binlog and that the default isolation of the secondary would have no impact... this is obviously not the case. Was this behavior something that changed between 10.4 and 10.5?

With the above progress we took the step of setting the default isolation of both the primary and the secondary to READ_COMMITTED. Since that time replication has returned to its normal, pre-10.5, state of stability.

One other thing to note: use of 10.6 had no impact on the issue; only changing the default isolation was required. |
| Comment by Andrei Elkin [ 2021-10-20 ] |
|
jellisgwn: It's good to hear you have sorted it out! Locks are not represented within binlog events, and apparently in your case the slave creates unnecessary ones. 10.5 changed XA binlogging to log the prepared part of the transaction separately from the following COMMIT|ROLLBACK part. |
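For context, the statements making up such a transaction look like this ('trx1' and t1 are placeholder names); the point of the 10.5 change is that everything up to and including XA PREPARE is binlogged as one unit, with the XA COMMIT logged and replicated separately, possibly much later:

```sql
XA START 'trx1';
INSERT INTO t1 VALUES (1);   -- replicated as a Write_rows event
XA END 'trx1';
XA PREPARE 'trx1';           -- in 10.5+, binlogged at this point

-- ... arbitrarily later, possibly from another connection:
XA COMMIT 'trx1';            -- binlogged as a separate event group
```

Between the two event groups the slave applier holds the transaction in the prepared state, which is exactly the window in which the spurious gap lock discussed above can block the following transactions.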
| Comment by Sergei Golubchik [ 2021-11-22 ] |
|
I'll close it as a duplicate of |