Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6, 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4, 11.5(EOL)
Labels: None
Description
This ticket offers an alternative to the MDEV-33667 solution for improving parallel slave performance under an XA transaction load.
Round-robin distribution is combined with assigning XA "COMPLETE" events (XA COMMIT or XA ROLLBACK) to the same
slave workers that may currently be processing the respective XA PREPARE parts.
The algorithm tries to reach a fair distribution, but it cannot execute an
XA PREPARE in parallel with its XA COMMIT (or ROLLBACK).
Currently the knielsen_xa_sched_minimal_fix branch contains such a solution.
10.6 Rebase branch: bb-10.6-MDEV-33668
https://github.com/MariaDB/server/commits/bb-10.6-MDEV-33668/
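As a rough illustration of the scheduling idea above (a minimal sketch with invented names such as Scheduler and xid_to_worker_, not the actual server code): event groups are distributed round-robin, except that an XA "COMPLETE" event group is routed to the worker that received the matching XA PREPARE.

    #include <cstddef>
    #include <string>
    #include <unordered_map>

    // Illustrative sketch only; names are invented for this example.
    enum class EventKind { ORDINARY, XA_PREPARE, XA_COMPLETE /* COMMIT or ROLLBACK */ };

    struct EventGroup {
      EventKind kind;
      std::string xid;      // empty for ordinary transactions
    };

    class Scheduler {
    public:
      explicit Scheduler(size_t n_workers) : n_workers_(n_workers) {}

      // Returns the worker index chosen for this event group.
      size_t schedule(const EventGroup &ev)
      {
        if (ev.kind == EventKind::XA_COMPLETE)
        {
          // Route the COMMIT/ROLLBACK to the worker that may still be
          // processing the corresponding XA PREPARE part.
          auto it = xid_to_worker_.find(ev.xid);
          if (it != xid_to_worker_.end())
          {
            size_t w = it->second;
            xid_to_worker_.erase(it);
            return w;
          }
          // PREPARE not tracked (e.g. applied long ago): fall through.
        }
        size_t w = next_rr_++ % n_workers_;   // plain round-robin
        if (ev.kind == EventKind::XA_PREPARE)
          xid_to_worker_[ev.xid] = w;         // remember for the COMPLETE part
        return w;
      }

    private:
      size_t n_workers_;
      size_t next_rr_ = 0;
      std::unordered_map<std::string, size_t> xid_to_worker_;
    };

Because the COMPLETE part always lands on the same worker as its PREPARE, the two cannot execute in parallel with each other, which is the limitation noted above.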
Attachments
Issue Links
- is blocked by
  MDEV-27512 Assertion `! thd->transaction_rollback_request' failed in rows_event_stmt_cleanup (Closed)
- relates to
  MDEV-742 LP:803649 - Xa recovery failed on client disconnection (Closed)
  MDEV-21107 Assertion `!tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id' failed in finish_event_group (Confirmed)
- split from
  MDEV-31949 slow parallel replication of user xa (Stalled)
Activity
Pushed a small fix to bb-10.6-MDEV-33668, taking the same fix already used in Andrei's rev0 and rev1 patches.
There was a wait_for_prior_commit() in XA_prepare_log_event::do_commit() before doing the actual XA PREPARE. This would prevent all XA PREPARE events from group-committing with any prior transactions, which would hurt performance of parallel replication of workloads with a significant fraction of XA transactions.
Andrei's ver0 work (ref) includes a lot of testcase work. Will this need to be added to this patch also, and will it be the same?
These tests (rpl_xa_prepare_gtid_fail, rpl_xa_concurrent_2pc, rpl_xa_empty_transaction, rpl_parallel_xa_same_xid and binlog_xa_prepared_bugs) have proven stable in bb-10.6-MDEV-33668 when copied (including related files) from ver0.
roel, salve! I'm going to push fixes to stabilize rpl_xa_prepare_gtid_fail separately. The other 31949 branches are not relevant in the current context. E.g. the rpl_parallel_xa_same_xid changes in origin/bb-10.6-MDEV-31949_ver0_opt are only for "patient" repeating of duplicate xid:s, while in this ticket's branch not even a single retry can occur for this reason (just as in the base 10.6).
On the other hand, a new rpl_parallel_multi_domain_xa test has been added to the branch with the purpose of proving
multiple-gtid-domain parallel applying.
Could you please look into it, as we need this sort of setup tested rather aggressively as well.
I have not yet heard from axel about the benchmarking results, but it looks like this branch will be pushed as the solution to the slowness issue.
Cheers!
Howdy knielsen. I've left a few notes on your branch and followed up with a few small commits. It's a great piece of work, and I've grown convinced we should go along with it rather than with MDEV-33667 as a "hot fix" for the slowness.
While we wait for Axel's benchmarking (a blessing I have no doubts about), could you please review the current branch in full on your own? My only concern was about the maybe_active_xid... array size and the scan search in it. The size is kept under control, and to my understanding the latter can't really be a concern while the number of workers is not big.
Elkin, I looked over the complete patch series, looks good to me.
I squashed some commits to simplify the commit history and force-pushed the bb-10.6-MDEV-33668 branch. I kept the preparatory/refactoring patch separate; I think it's nice to keep the distinction between the refactoring and the change of logic in XA scheduling.
Agree with your comments about the size of maybe_active_xid. There will be some cost to the scanning of the array to search for XID. This could be improved using a hash table; also a more careful accounting of history could reduce the size of generations. But I'm not sure it will be a problem in practice, so maybe just go with the simple approach for now. Even if the size of the array would be significant, this only affects the SQL driver thread, and it seems unlikely that the impact would be severe enough to make it a bottleneck.
Let me know if there is something else I should do on this MDEV at this point.
- Kristian.
monty (knielsen), there is now a clearer picture of performance, produced by axel. I am pasting the comparison table for the few branches investigated.
Optimistic parallel mode slave is run with --log-bin and --log-slave-updates.
#thd | A (tps) | B (tps) | C (tps) | D (tps)
-----+---------+---------+---------+--------
   1 |    2150 |    2092 |    2109 |    2217
   2 |    2226 |    2395 |    2503 |    2845
   4 |    2572 |    3434 |    3380 |    3851
   8 |    3024 |    5180 |    4419 |    5560
  16 |    3649 |    7801 |    5746 |    8171
  32 |    4358 |   10628 |    7852 |   11011
  64 |    5107 |   15094 |    9323 |   14565

A: mariadb-10.6-567c0973591, branch 10.6
B: mariadb-10.6-093efc509b1, branch bb-10.6-MDEV-31949_ver0_opt
C: mariadb-10.6-0dc13677f3d, branch bb-10.6-MDEV-33668
D: mariadb-10.6-1bdc9bc8077, branch bb-10.6-MDEV-31949
(bb-10.6-MDEV-31949_ver0_opt should've been renamed to bb-10.6-MDEV-33667).
As you can see, with 4 workers bb-10.6-MDEV-33668 is better than 10.6 by about 30%,
and with 8 workers by about 50%. This is somewhat worse than my "capsule" benchmarking showed.
Testing has been restarted on the bb-10.6-MDEV-33668 branch as updated yesterday.
I looked over the patch and have the following thoughts. Points 2-4 are quite minor (just for consideration), though I think it would be good to address the first point (and it wouldn't impact Roel's testing).
1) The test doesn't cover a pretty big aspect of the patch: generations. It'd be good to wrap the main loop with another loop that separates the XA PREPARE from the XA COMMIT/ROLLBACK by <slave_parallel_workers>*<index_of_new_loop> transactions, for 0-4 generations.
2) The granularity of a generation seems like it can overzealously fix the scheduling for transactions in the last vulnerable generation - can we infer it based on sub_id? I imagine that may allow a little more parallelism.
3) The freeze_size(&maybe_active_xid) call can be done after the whole event group has been queued, so we don't impede "real work to be done" with maintenance work.
4) Though I doubt comparing each xid for just three generations would be too much overhead, I wonder if a sliding Bloom filter could be used as a quick initial check of XID existence before attempting to compare each xid (a rough sketch follows after this list). I don't necessarily think we should incorporate this now (after all, the first rule of code optimization is: don't optimize your code), but I'm just leaving it as a future thought.
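To make the "future thought" in point 4 a bit more concrete, here is a hedged sketch of what a sliding (two-segment) Bloom filter over recently tracked XIDs could look like; all names are invented for illustration and nothing like this exists in the patch:

    #include <array>
    #include <bitset>
    #include <cstddef>
    #include <functional>
    #include <string>
    #include <utility>

    // Two alternating Bloom segments: inserts go to the current segment,
    // lookups consult both; when older XIDs no longer need tracking, the
    // stale segment is cleared and becomes current.  A false positive only
    // costs a (correct but redundant) scan of the exact XID list; false
    // negatives cannot occur for XIDs still being tracked.
    class SlidingXidFilter {
    public:
      void insert(const std::string &xid)
      {
        std::pair<size_t, size_t> h = hashes(xid);
        segs_[cur_].set(h.first);
        segs_[cur_].set(h.second);
      }

      bool maybe_contains(const std::string &xid) const
      {
        std::pair<size_t, size_t> h = hashes(xid);
        for (const auto &s : segs_)
          if (s.test(h.first) && s.test(h.second))
            return true;
        return false;
      }

      // Call when a generation's XIDs no longer need to be remembered.
      void advance_generation()
      {
        cur_ ^= 1;
        segs_[cur_].reset();
      }

    private:
      static constexpr size_t kBits = 1 << 14;

      static std::pair<size_t, size_t> hashes(const std::string &xid)
      {
        size_t h = std::hash<std::string>{}(xid);
        return { h % kBits, (h / kBits) % kBits };
      }

      std::array<std::bitset<kBits>, 2> segs_{};
      int cur_ = 0;
    };

Lookups that hit the filter would still fall back to the exact per-generation XID comparison; the filter only short-circuits the common case where the XID has not been seen recently.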
Agree that a test for generations would be good.
The use of generations in the patch was a minimal way of ensuring
correctness in the (presumably rare) case of duplicate XID between
transactions. The granularity of generations is calculated in a very
conservative manner, and generations can be quite a lot longer than they
need to be. This does not impact parallelism if XIDs are all distinct. But it
does lead to a longer linear search for the XID inside the SQL driver
thread, which will have little impact as long as the SQL driver thread is
not a bottleneck.
As I understand, this was found not to be a problem in practice in
performance testing. But if/when we want to improve it, the obvious idea
would be to have a hash table mapping XID to the sched_bucket it was last
scheduled to. This would eliminate the linear search length that depends on
the estimated size of generations.
In pathological cases it can be necessary to keep track of an XID for
arbitrarily long. In principle, we could schedule thousands of event groups
between workers W1 and W2 due to XA dependencies, and then the next event
group scheduled on W3 could in principle run in parallel with any or all of
them. So I don't think sub_id by itself is enough.
Some improvement is though possible using sub_id. For example, suppose we
schedule event groups E1 and E10 (say) to W1, and we scheduled E15 to W2,
where 1, 10, 15 are the sub_ids for the event groups. Then we know that E15
will wait for E10 to commit (because 15>10, sub_id comparison). So then it's
safe to schedule the next event group to W2, even if it has the same XA XID
as E1.
So in summary, the generations is just a simple and low-cost way to
conservatively bound the time for which an XID needs to be remembered. And a
hash table of the XIDs could be used to eliminate the linear search through
the XID history, if required.
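A hedged sketch (invented names and structure, not the actual branch code) of how the summary above could look when combined: a hash table maps each recently seen XID to the bucket and sub_id of its last event group, and the sub_id refinement from the E1/E10/E15 example decides when pinning a duplicate XID to the same bucket is unnecessary.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Illustrative sketch only.  Per recently seen XID, remember which
    // bucket (worker) it was last scheduled to and the sub_id of that
    // event group.  A later event group with the same XID is pinned to
    // that bucket unless the bucket round-robin would pick already holds
    // newer work that, by sub_id comparison, commits after the remembered
    // event group.
    struct XidHistory {
      struct Entry { size_t bucket; uint64_t sub_id; };

      // Call for every event group scheduled to `bucket`, XA or not.
      void note_scheduled(size_t bucket, uint64_t sub_id)
      {
        last_sub_id_on_bucket_[bucket] = sub_id;
      }

      // Record the scheduling decision for an XA event group carrying `xid`.
      void remember_xid(const std::string &xid, size_t bucket, uint64_t sub_id)
      {
        history_[xid] = Entry{bucket, sub_id};
      }

      // Decide the bucket for a new event group carrying `xid`, given the
      // bucket plain round-robin would pick.
      size_t choose_bucket(const std::string &xid, size_t rr_bucket) const
      {
        auto it = history_.find(xid);
        if (it == history_.end())
          return rr_bucket;              // XID not seen recently: no constraint
        const Entry &prev = it->second;
        auto lb = last_sub_id_on_bucket_.find(rr_bucket);
        if (lb != last_sub_id_on_bucket_.end() && lb->second > prev.sub_id)
          return rr_bucket;              // rr_bucket already waits for prev (the E15 vs E10 case)
        return prev.bucket;              // otherwise pin to the predecessor's bucket
      }

      // Pruning of entries older than the tracked generations (what bounds
      // the size of this table) is omitted here.
      std::unordered_map<std::string, Entry> history_;
      std::unordered_map<size_t, uint64_t> last_sub_id_on_bucket_;
    };

The pruning policy, i.e. how long an XID stays in the table, is where the generation bookkeeping described above would come in.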
I'm not sure about point (3), maybe I did not catch your meaning. This all
runs in the SQL driver thread, and freeze_size() is just a realloc(), I
don't understand how moving this could significantly affect "real work to be
done"?
To clarify point (3), I only mean it can be moved to a point after we've queued all events of the current transaction, so the resizing can be done (by the SQL thread) in parallel with the worker thread executing the transaction. Though I wouldn't classify it as significant, just for thought.
Agreed that sub_id isn't enough alone to replace the idea of a generation. My thought seemed to parallel the example you gave of using sub_id as an improvement. That is, to calculate a relative generation using sub_id (while still maintaining an XID list for active generations), then I imagine we wouldn't need to store the XIDs of the third/last generation (which seems to be in line with your example). Just wanted to define it to be sure we're on the same page. Again, not something I'd say is significant.
Thanks for the in-depth elaboration!
As promised, a detailed testing status update.
* I am currently focusing on debugging these issues seen during testing thus far:
1. [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-20, Internal MariaDB error code: 1964 (and variations thereof)
2. [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event '.' at 4, the last event read from 'binlog.000009' at 2595, the last byte read from 'binlog.000009' at 2702.', Internal MariaDB error code: 1236 (and variations thereof)
3. [ERROR] Slave SQL: Error executing row event: 'You can't combine write-locking of system tables with other tables or lock types', Gtid 0-15-143, Internal MariaDB error code: 1428 (and variations thereof)
4. [Warning] Slave: Got error 140 "Wrong create options" from storage engine InnoDB Error_code: 1030 (w/ slave halt). This one may be XA related.
5. (Slave Crash) SIGSEGV|dict_table_t::is_active_ddl|trx_t::rollback_low|trx_t::rollback|trx_rollback_last_sql_stat_for_mysql
6. (Slave Assertion) thd->transaction->stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)|SIGABRT|close_thread_tables|mysql_execute_command|mysql_parse|Query_log_event::do_apply_event
7. Other crashes/assertions which may or may not be related
* After that, or concurrently where feasible, the plan is to do the following tests:
8. Stress test related XA replication MTR tests (and verify any failures against BASE as needed)
9. Run the MENT-1905 XA lua scripts in various configurations. This is important as these scripts have repeatedly shown various issues in this area.
My total overall estimate, keeping in mind my availability (4 days/week), remains the 17th of April. If it is possible to finish testing before then, it will be done.
Andrei also mentioned that multiple GTID domains are very important to check in connection with this ticket; he indicated this is one area where the fixes may behave unexpectedly. For this, I propose that knielsen or Elkin add an MTR test.
Also of some interest is the assertion
!look_for_rollover || start_scan_slot != slot|SIGABRT|trx_assign_rseg_low|trx_start_low|trx_start_if_not_started_xa_low|create_table_info_t::create_foreign_keys
though it is seen in base runs as well.
Roel, can you please create sub-tasks for all of the above-listed errors (including how to repeat them / what you did to get them) so that we can get them fixed?
Yes; all the issues above - as well as various others seen in trunk/base during testing - will be debugged and logged, provided they prove to be bugs.
In case it was not clear, the list of issues 1-6 above were only seen in the patch tree, not in trunk/base.
Hi bnestere,
I pushed a couple patches to branch knielsen_xa_sched_minimal_fix that implement
what I described in my previous comment.
One is a testcase for generations, as you also mentioned. It runs a lot of
random XA transactions with different distances between XA PREPARE and XA
COMMIT, and with some duplicate XIDs. This test was successful at failing
when I deliberately introduced a couple bugs in the generation accounting.
Unfortunately that test fails sporadically, both with my patches and without
(vanilla 10.6). I think perhaps this is because the slave's GTID position is
updated too early, so there is a race where MASTER_GTID_WAIT() has returned
successfully but the corresponding transaction is not yet visible in the
InnoDB table. Something for Andrei to look into perhaps?
The second is a small patch that rewrites the computation of when a new
generation starts. This seems much better than the original code, both
simpler and more precise: it just uses the FIFO order of scheduling to see
when the last bucket of the previous generation has been scheduled for the
current generation, instead of the overly clever index manipulation that
significantly over-estimates the time before a new generation is reached.
The third patch is more complex. It replaces the linear array scan of
recently scheduled XIDs with a single hash table lookup. So it avoids the
potentially costly scan in the original patch if the workload has a high
density of XA transactions and the number of worker threads is large. It
also implements the refinement using sub_id to avoid XID dependencies in
some cases where they would otherwise be implied by generation
considerations alone.
I don't necessarily suggest including these new patches in the upcoming
release or custom build. They are not critical; the functionality should be
the same with or without. I just wanted to show how I envisioned the minimal
patch's accounting of dependencies and generations could look with a more
complex and full implementation.
- Kristian.
A few reviews were done. Two commits were pushed:
89c907bd4f7..d90a2b44add HEAD -> 10.6
Elkin, what will happen with the additional patches that bnestere and knielsen discussed/proposed? Also, where does this leave MDEV-31949 / MENT-1905?
Roel, that code is currently in Kristian's branch. I thought the best place for it would be 11.5, as a polished version of this bug's fixes.
MDEV-31949 is still aimed at 11.6.
Elkin Ack, thank you. Should a new ticket be created for those patches? As they will require testing, I think we should look at 11.6 for those as well.
Also, is this patch (MDEV-33668 inc the additional patches) fully compatible with MDEV-31949?
Roel, a new ticket, of course. I'll report one when I have fully understood its merge with MDEV-31949. The latter is different on the slave side; maybe the merge with this solution will lessen that difference, and 31949 also refines the master side because of recovery (MDEV-21777, MDEV-32830).
From the list above (all bugs reproduced in BASE as well, except #6 for the moment):
1. [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid 0-1-20, Internal MariaDB error code: 1964 (and variations thereof)
> This is now MDEV-34010
2. [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'could not find next log; the first event '.' at 4, the last event read from 'binlog.000009' at 2595, the last byte read from 'binlog.000009' at 2702.', Internal MariaDB error code: 1236 (and variations thereof)
> Filtered after analysis
3. [ERROR] Slave SQL: Error executing row event: 'You can't combine write-locking of system tables with other tables or lock types', Gtid 0-15-143, Internal MariaDB error code: 1428 (and variations thereof)
> This is now MDEV-34011
4. [Warning] Slave: Got error 140 "Wrong create options" from storage engine InnoDB Error_code: 1030 (w/ slave halt). This one may be XA related.
> This is now MDEV-33961
5. (Slave Crash) SIGSEGV|dict_table_t::is_active_ddl|trx_t::rollback_low|trx_t::rollback|trx_rollback_last_sql_stat_for_mysql
> This is now MDEV-34051
6. (Slave Assertion) thd->transaction->stmt.is_empty() || thd->in_sub_stmt || (thd->state_flags & Open_tables_state::BACKUPS_AVAIL)|SIGABRT|close_thread_tables|mysql_execute_command|mysql_parse|Query_log_event::do_apply_event
> Ref MDEV-32372
7. !look_for_rollover || start_scan_slot != slot|SIGABRT|trx_assign_rseg_low|trx_start_low|trx_start_if_not_started_xa_low|create_table_info_t::create_foreign_keys and
!look_for_rollover || start_scan_slot != slot|SIGABRT|trx_assign_rseg_low|trx_start_low|trx_start_internal_low|dict_stats_fetch_from_ps
> This is now MDEV-33917
This comment, previously updated as work on the bugs progressed, is now complete.
Set myself to review Kristian's branch.