[MDEV-31949] slow parallel replication of user xa Created: 2023-08-18  Updated: 2024-02-06

Status: In Review
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.5, 10.6, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2, 11.3
Fix Version/s: 10.5, 10.6, 10.11, 11.1, 11.2

Type: Bug Priority: Critical
Reporter: Andrei Elkin Assignee: Kristian Nielsen
Resolution: Unresolved Votes: 1
Labels: triage

Issue Links:
Blocks
is blocked by MDEV-32257 Assertion `thd->is_killed()' failed i... Closed
is blocked by MDEV-32272 lock_release_on_prepare_try() does no... Closed
is blocked by MDEV-32347 Stack smashing/looping, ASAN use-afte... Closed
is blocked by MDEV-32470 MDEV-31949: use-after-poison in xid_t... Closed
is blocked by MDEV-32830 refactor XA binlogging for better int... In Review
is blocked by MDEV-32852 Assertion `is_prepared_xa(thd) || ((t... Closed
is blocked by MDEV-32857 Assertion `thd->lex->sql_command == S... Closed
Issue split
Problem/Incident
causes MDEV-32257 Assertion `thd->is_killed()' failed i... Closed
causes MDEV-32347 Stack smashing/looping, ASAN use-afte... Closed
causes MDEV-32463 SIGSEGV in __memmove_avx_unaligned_er... Closed
causes MDEV-32469 MDEV-31949: MTR outcome difference: M... Closed
causes MDEV-32470 MDEV-31949: use-after-poison in xid_t... Closed
is caused by MDEV-742 LP:803649 - Xa recovery failed on cli... Closed
Relates
relates to MDEV-33168 XA crash-recovery base on engines pre... In Progress
relates to MDEV-21469 Implement crash-safe logging of the u... Stalled
relates to MDEV-30165 X-lock on supremum for prepared trans... Closed
relates to MDEV-32455 Implicit rollback by LOAD INDEX or CA... Confirmed
relates to MDEV-32463 SIGSEGV in __memmove_avx_unaligned_er... Closed

 Description   

With some type of load, such as one statement "light" transactions, the user xa may perform much poorer on the optimistic parallel slave than an equivalent normal - BEGIN..COMMIT - load.

This ticket aims at fixing the slowness while the XA high-availability is to be preserved.

The whole work is arranged in two parts:
1. refactoring of the user xa binary logging to facilitate slave side parallel execution
MDEV-32830;
2. the slave side change of the parallel scheduler to apply Round-Robin distribution
for the user XA transactions.



 Comments   
Comment by Kristian Nielsen [ 2023-08-22 ]

I'm trying to follow what the underlying issue here is...
Andrei mentions some discussions in the Work Log, is there anywhere I can follow these discussions?

Comment by Marko Mäkelä [ 2023-08-23 ]

knielsen, according to vlad.lesin’s analysis with offcputime from https://github.com/iovisor/bcc/ (which involved rebuilding some libraries with -fno-omit-frame-pointer so that we can get valid stack traces), the function Xid_apply_log_event::do_apply_event(rpl_group_info*) is waiting 34.46% of the execution time when the workload uses XA PREPARE and XA COMMIT. With a similar workload that uses regular transactions, the wait time in that function would be 1.85%.

Comment by Andrei Elkin [ 2023-08-24 ]

knielsen, there's no good written summary yet, I am close to compose one. In short the slowness must be caused by skew of xa transactions load distribution to parallel works.
The XA trx:s are logged on master in two phases: the prepare (XAP - consists of XA START..PREPARE) and commit (XAC - just XA COMMIT xid). Both XAP and XAC find their worker (the same then) on slave not with round-robin but via a hash function on xid. Such distribution may create `Wait for prior commit` sequences of a much greater size in terms of commit_sub_id than the number of workers (which is the (about) worst of the normal trx:s).
To get such numbers I made a perf monitoring patch that computes sub_id - last_committed_sub_id when a worker is about to wait (for prior commits).

At this point I've found a way to subject XAP and XAC distributions to round-robin (to round-robin XAP alone and assign the following XAC to XAP's worker is much simpler but the skew remains and to my benchmarking the performance was still not good).
I am having a poc patch (whose [maybe apparent] troubles include to make workers race properly for a XID that first is to be prepared, then be released, for to be finally available for committing) that I am about to start benchmarking.
I was very busy trying to make those numbers yesterday (that's why I could not see your question too).
Please wait for few more hours and come back with the XA round-robin results.

Discussions started on slack 'cos it dealt with a customer case, their data in particular which we used to emulate their load etc.
Later we were able to involve community and you personally into the process. But that was time already when the skew had been reliably detected to strongly suggest what to do next.

Myself I need to learn how to release at some checkpoints even raw results of my personal r&d process.

Comment by Andrei Elkin [ 2023-08-24 ]

After exercising a finally constructed (but yet fully stable yet) round-robin XAP, XAC I definitely see both a grown performance and scale up with the number of worker.
knielsen, bnestere: I'm preparing to push two "intermediate" commits - one for worker stats and the other for the round-robin XA - soon to encourage your review/assessments/separate independent benchmarking.
We sure can discuss tech details on Zulip.

Comment by Andrei Elkin [ 2023-09-19 ]

After reviewing on a poc level, I am taking it back to full completion.

Comment by Roel Van de Paar [ 2023-09-26 ]

Status update: patch nearing completion, 10.6 rebase next by developers. Also commenced preliminary runs against patch branch.

Created MENT-1957 Assertion `thd->is_killed()' failed in XID_cache_element::acquired_to_recovered (opt) or acquire_xid (dbg) on XA ROLLBACK

Comment by Roel Van de Paar [ 2023-09-26 ]

Created MENT-1958 safe_mutex: Found wrong usage of mutex 'flush_list_mutex' and 'mutex'

Comment by Roel Van de Paar [ 2023-09-28 ]

Created MDEV-32272 Failing assertion: table->n_rec_locks == 0 in dict_sys_t::remove on DROP TABLE.
While this related bug is not caused by MDEV-31949, it causes interference with testing of MDEV-31949 and re-testing of MDEV-32257.

Comment by Roel Van de Paar [ 2023-10-04 ]

Created MDEV-32347 Stack smashing/looping, ASAN use-after-poison in xid_t::eq/event_xid_t::serialize, SIGSEGV in serialize_xid and Assertion `is_async_xac || thd->lex->xid->eq(thd->transaction->xid_state.get_xid())' failed in binlog_rollback_flush_trx_cache upon LOAD INDEX

Comment by Roel Van de Paar [ 2023-10-13 ]

Created MDEV-32469: MDEV-31949: MTR outcome difference: MTR testcase from MDEV-27512 crashes in BASE but not PATCH
Created MDEV-32463: SIGSEGV in __memmove_avx_unaligned_erms from a memcpy in xid_t::set (sql/handler.h:896) from Gtid_log_event::Gtid_log_event
Created MDEV-32470: MDEV-31949: use-after-poison in xid_t::key_length()

Comment by Andrei Elkin [ 2023-10-14 ]

Roel, while bb-10.6-MDEV-31949 is ready for retesting, I am squashing lately collected commits into one and also invite Kristian Nielsen for external review once that's done (by the end of today).

Considering the customer pressure we need to focus fully on this case, and prepare ourselves to retesting after KN's notes (if any) too.

Comment by Andrei Elkin [ 2023-10-18 ]

Howdy Kristian,

it'd be great to have your note of the patch. As we exchanged on Zulip its current 10.6-based branch may not be one for the CS. Also feel free to ignore the top commit whose test's failure
is not caused by this Round-Robin XA work (despite being headed with MDEV-31949: Fix rpl_xa_prepare_gtid_fail; that part belongs to MDEV-21469/21777).

If it suits you well, exchanges about the branch would be good have via github and naturally Zulip.

Looking forward to hear from you soon,

Andrei

Comment by Kristian Nielsen [ 2023-10-22 ]

Review sent:
https://lists.mariadb.org/hyperkitty/list/developers@lists.mariadb.org/thread/OKY77SPCQNIBMJ7VX2TIU5DBE747TYFW/

Comment by Roel Van de Paar [ 2023-11-21 ]

Status update: patch development continues, now in progress in MDEV-32830 which is a pre-req for this ticket. That ticket is in review by KN & early tests are being run.

Comment by Andrei Elkin [ 2023-11-24 ]

The part 1 of the description is at review.
The part 2 is largely has been cherry-picked from the 1st version of the patch , currently in bb-10.6-MDEV-31949.
It should be sent to review in few hours after all polishing, as requested by reviewers has been done.

Comment by Andrei Elkin [ 2023-11-28 ]

BB has showed just one rpl.rpl_parallel_optimistic_xa_lsu_off failure on binlog-less slave.
This needs investigating tomorrow before to change to In-review.

Comment by Andrei Elkin [ 2023-12-08 ]

Michael Widenius, there're a good news! I finally got a pettry good performance with one little hunk (to be added) to bb-10.6-MDEV-31949.
It was a crazy overlook form my part which was about a forgotten trx->flush_log_later = true; in TC_LOG::run_commit_ordered () -> ... innobase_commit_by_xid() new path for XA-COMMIT.

Performance just rocks now.
I am wrapping up the "ver1" branch for review by Sergei (the XA binlogging refactoring), the slave side by Kristian, and the last commit - which is due by tomorrow - for Innodb. In particular I need to work out interfaces for Marko's blessing.

Comment by Andrei Elkin [ 2023-12-15 ]

bb-10.6-MDEV-31949 is updated into close-to-review-ready branch. The branch structure is of three I-III parts each consists with 2-3 commits dedicated to a specific "big enough to keep it separate" issue.
I may rebase it tomorrow after fully completion of part I,II self-review. The Innodb part III is practically ready ^ marko.

Comment by Marko Mäkelä [ 2023-12-16 ]

Note: Unlike normal ROLLBACK, XA ROLLBACK must really be durably written, because there already must have been a preceding durable write of a XA PREPARE. Non-XA transactions are not written to the binlog before commit, so for normal ROLLBACK there is no need for any durability. InnoDB will simply roll back any recovered incomplete transactions that did not reach the XA PREPARE state. The only mentions about durability that I was able to find in the commit messages were about XA PREPARE and XA COMMIT. The current commit messages mostly say ‘what’ was changed but not ‘why’.

I would like to see a clear description how durability is guaranteed and recovery will work for all of XA PREPARE, XA ROLLBACK, and XA COMMIT in various scenarios. This should also cover a clear description (including call stacks) of the InnoDB mini-transaction commits and the writes of the binlog events, as well as where the writes are being made durable. I did not see any tracking of the mini-transaction commit LSN among the code changes. If we are deferring some calls to durable writes, we must keep track of the LSN so that log_write_up_to(lsn, true) can be called by the time it is really needed. (Other InnoDB threads may write the buffered log records, maybe even durably.) I do not see any tests that would try to prevent InnoDB page flushing or log checkpoints from occurring.

I believe that when the safe setting of sync_binlog=1 is enabled, each of these operations XA PREPARE, XA ROLLBACK, XA COMMIT must be durably written to both the binlog and the storage engine before an OK packet can be sent to the client or any replication connections. It could be helpful to make use of the interface that was implemented in MDEV-24341.

Comment by Andrei Elkin [ 2023-12-16 ]

marko, good notes and questions, thanks for checking!
To XA-ROLLBACK, good that you highlight on it, as admitted elsewhere that's covered but need crash-recovery tests.
I will expand the commit message to carry the 'why'.
Informally, as binlog writes for a group of xa:s go first (that's MDEV-742 logging protocol) and the following engine prepare,commit,rollback do next in the one-by-one manner in the current branch, all but the last of them are made to avoid durable write (the commit III.2).

To the how durability is guaranteed and recovery will work, that is the subject of MDEV-21469 binlog replay for XA recovery. And a measure taken in the III.2 to restore write's durability is for the current branch not to show any different with BASE in this regard.
Specifically when XA-operation returns its effect is guaranteed, the same as in BASE, to be durable even without MDEV-21469 fixes, which covers your last paragraph's concern.

Integration of binlogging with MDEV-24341may have enormous potential, I need to take a look and measure, at or after MDEV-21469, whether the current binlog leader conducted one-by-one engine prepare could be turned parallel with the leader to durably "stamp" everything before.

Comment by Andrei Elkin [ 2023-12-19 ]

Howdy Kristian,

bb-10.6-MDEV-31949 is ready for review for all its three parts; commits in the branch are titled with a roman number of the part.

Part I I formally arrange as MDEV-32830, it's on Serg's laps now.
Part III is for Marko and you.

(I am going to preserve the branch structure with the parts at accounting of any notes that are expected of course).

Comment by Andrei Elkin [ 2023-12-19 ]

I have to push twice to settle down tests, a new binlog.binlog_xa_mdev-31949 particular. No functional changes.

Comment by Kristian Nielsen [ 2023-12-21 ]

Let me just state once and for all:

The idea to binlog first and prepare in engine later for XA prepare will not be approved. Don't spend more effort in this direction.

  • The engines, InnoDB in particular, have very mature and carefully designed and optimized write-ahead logging. This should be the primary source of recovery, not the binlog which is not very well designed or suitable as a high-performance and scalable write-ahead log. And the binlog is optional, the engine needs to implement recovery from its own log anyway. This should not be duplicated in the binlog.
  • For normal transactions, we call innobase_xa_prepare() before binlogging. We should not do the opposite semantics for user XA PREPARE.
  • Recovering the transaction from the binlog is not possible in all cases. Statement-based binlogging is a supported user configuration and does not admit this, even row-based binlogging doesn't support it in all cases.

For these reasons, the engine prepare must come before the binlog write for XA PREPARE. Again, patches that does the opposite will be rejected in review.

Comment by Andrei Elkin [ 2023-12-21 ]

knielsen, thanks you for expressing your opinion!

I read in your comments both technical arguments, that I comment on once again at the end of this message,
and pessimism of course rooted by your huge experience.
Not to challenge the latter and while it's all common truth about Innodb in the 1st bullet,
I just don't see how that can object to the choice of the recovery source.
Binlog always has been point in time recovery method, which the server recovery emulates in this case.

MDEV-21469 idea of recovery has its practical reasons. There are many actually, and one major one as you know or remember is
further elimination of one of two fsync:s. Back in MDEV-18959 I proposed it'd be Innodb's to go.
Yet the immediate actual reason is that In the context of the slow parallel slave the binlog write for XA-prepare first, so far, this hton:s ordering
was the only way to improve performance.

I've not given up on the "conventional" ordering though and continued with experiments and analysis after having send MDEV-31949 for review. Perhaps we'll have something good out of that soon. Fingers crossed.

So to the technical part, despite all scares about the STATEMENT format danger there exists only the ROW format BUG-32020 which is of a specific and familiar class of ROW format replication.
The lack of unique indexes and use of non default on master makes slave to execute a trx to end up different non-unique index locks.
That could lead of course to various bad things like data consistency, but without XA no hang.

Should the indexes be specified with the event, the bug must be gone.

Comment by Mason Sharp [ 2024-01-05 ]

What are the next steps? knielsen are you still reviewing, or Elkin are you making changes?

Comment by Andrei Elkin [ 2024-01-06 ]

masonmariadb, the whole work includes now the IVth recovery commit. Sergei is reviewing the general binlogging part I. split from this ticket as MDEV-32830, which has been refined to reflect the recovery concern raised by Kristian.
The part II is on Kristian, I'll notify him once the part IV is finally readied for review by Sergei and him (ETA: Mon).
The part III is about few Innodb changes that Marko needs look at, also when the IV is out.

Comment by Kristian Nielsen [ 2024-01-07 ]

I've started implementing MDEV-32020 to fix the XA stuff properly.

Comment by Andrei Elkin [ 2024-01-08 ]

The branch is force-updated

ff8c19e7b64...82aedbe9213 HEAD -> bb-10.6-MDEV-31949 (forced update)

to preserve its structure of four parts.
Part I,II,III remain with In-Review status, the part IV needs incorporation with a Xid_log_list_event piece being worked on by Brandon.

knielsen, 82aedbe9213 combined with the upcoming Xid_log_list_event must be a recovery solution that you prefer.
Feel free to check it soon.
It am sure it should not hold you on anymore for reviewing the part II.

I'd be really glad to discuss your plan of implementation of xa-prepare deferred execution that you must've been about in the MDEV-32020 c omment.
Having your design presented may help me to understand what you mean in 'proper'. I pointed it a number of times already that
that bug case itself relates to the XA replication framework rather remotely. Much more tightly it deals with freaking specifics that unique key absent Rows events have manifested in various ways. It is resolvable within few hours by standard means. While
the deferring XA transaction execution may be a solution too, I do feel uncomfortable about that, especially when it's rather clear that the deferring method must ensue a number of design hurdles esp recovery related.

Comment by Andrei Elkin [ 2024-01-09 ]

Updated the branch in the IV recovery part/HEAD of MDEV-33168:

82aedbe9213...a71f13489d0 HEAD -> bb-10.6-MDEV-31949 (forced update)

The current HEAD is enriched with a number of mtr tests and fixes few glitches in implementation of business logics.
Integration with Xid_log_list_event is scheduled for tomorrow.

Comment by Andrei Elkin [ 2024-01-22 ]

Howdy julien.fritsch!

The whole issue is of 3 parts:
1. refactoring of XA binloggin (MDEV-32830)
2. parallel slave XA execution performance
3. XA binlog-recovery (MDEV-33168)

The first two were ready for review for some time. serg ] has provided some notes to the part I. There must be more to come after my reply about a week ago. knielsen did so to the part II of a previous version of the patch and those of his notes have been accounted in the current branch. He also strongly suggested to change the part 3's plan from the MDEV-21649 method to the MDEV-33168 one.
The latter p.3 has been under my testing for few last days. I am about to publish it for review and QA preliminary testing: ETA 1-2 days.

Generated at Thu Feb 08 10:27:41 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.