[MDEV-29684] Testing and improving 'cluster conflict resolving' with 10.4 and later Created: 2022-10-03 Updated: 2023-01-16 Due: 2023-10-28 Resolved: 2023-01-16 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Fix Version/s: | 10.4.28, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3 |
| Type: | Task | Priority: | Critical |
| Reporter: | Seppo Jaakola | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Description |
|
There are a number of bug reports of crashes or hangs related to cluster-wide conflict resolving. To troubleshoot these, it would be good to run generic cluster conflict tests, to make the issue(s) easier to reproduce. Run a high-intensity write-write conflict load in the cluster, with varying SQL access patterns. Analyze and troubleshoot the issues that surface from the testing. |
| Comments |
| Comment by Jan Lindström (Inactive) [ 2022-10-13 ] | |||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2022-10-13 ] | |||||||||||||||||||||||||||||||||||||
|
ramesh branch: bb-10.4- This requires multi-master testing with a load that includes KILL [QUERY|CONNECTION] statements. | |||||||||||||||||||||||||||||||||||||
| Comment by Ramesh Sivaraman [ 2022-10-25 ] | |||||||||||||||||||||||||||||||||||||
|
seppo jplindst An RQG run using the KILL CONNECTION/QUERY grammar shows a server hang. The issue seems to be related to MDL. Please find attached the RQG grammar, error log and stack trace. mysql.err
| |||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2022-10-25 ] | |||||||||||||||||||||||||||||||||||||
|
seppo To me it seems that KILL does not BF-kill a conflicting DDL if it owns an MDL lock. In the error log I found the following (see new.err):
| |||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2022-10-25 ] | |||||||||||||||||||||||||||||||||||||
|
I found that in wsrep_abort_thd we have:
Based on the debug output, OPTIMIZE TABLE is aborted and INSERT is in the aborting state. | |||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2022-10-25 ] | |||||||||||||||||||||||||||||||||||||
|
From the same run:
| |||||||||||||||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2022-10-26 ] | |||||||||||||||||||||||||||||||||||||
|
The pull request should not affect MDL lock conflict resolving, and these tests are specifically for MDL conflicts. Do you have a reason to believe that the PR has caused this behavior, and is a regression? | |||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2022-10-26 ] | |||||||||||||||||||||||||||||||||||||
|
ramesh Can you verify whether this MDL problem is also reproducible on the 10.4 branch? | |||||||||||||||||||||||||||||||||||||
| Comment by Ramesh Sivaraman [ 2022-10-26 ] | |||||||||||||||||||||||||||||||||||||
|
jplindst Could not reproduce the MDL issue on the latest 10.4 base build yet. | |||||||||||||||||||||||||||||||||||||
| Comment by Ramesh Sivaraman [ 2022-10-28 ] | |||||||||||||||||||||||||||||||||||||
|
seppo Please find attached the stack trace: all_bt_28102022.txt
| |||||||||||||||||||||||||||||||||||||
| Comment by Ramesh Sivaraman [ 2022-11-16 ] | |||||||||||||||||||||||||||||||||||||
|
seppo The latest commit still has a hang issue. Please find attached the full stack trace: bt_all_v2.txt
| |||||||||||||||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2022-12-13 ] | |||||||||||||||||||||||||||||||||||||
|
I completed an RQG run with 10 connections against one node in a cluster, with the attached oltp_and_ddl_v1.yy grammar. This test appears to generate a load with a mixture of KILL commands, DML, and DDL using both the TOI and RSU methods. The loaded node hangs very soon after the test starts. The reason for the hang appears to be RSU execution, which desyncs and pauses the node while a TOI replication is still executing. Because of the paused state, the earlier TOI execution cannot complete, and because of the executing TOI, the RSU cannot get the necessary MDL locks, hence the hang. | |||||||||||||||||||||||||||||||||||||
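The circular wait described above can be illustrated with a toy wait-for graph. This is only a sketch of the reasoning, not code from the server: the node labels and the `find_cycle` helper are hypothetical names chosen for illustration, and the edges simply restate the two dependencies mentioned in the comment (TOI waits for the paused node to resume, RSU waits for MDL locks held by TOI).

```python
# Toy wait-for graph for the hang described above. The actor and resource
# names are illustrative labels, not identifiers from the MariaDB source.

def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Back edge found: the cycle is the tail of the current path.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for nxt in waits_for.get(node, ()):
            cycle = visit(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in waits_for:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Edge "A -> B" means "A cannot proceed until B does".
waits_for = {
    # The TOI operation cannot finish while the node is paused by RSU.
    "TOI execution": ["node resume (held by RSU)"],
    # The node is resumed only when the RSU statement completes.
    "node resume (held by RSU)": ["RSU execution"],
    # RSU needs MDL locks that the running TOI operation still holds.
    "RSU execution": ["MDL locks (held by TOI)"],
    "MDL locks (held by TOI)": ["TOI execution"],
}

cycle = find_cycle(waits_for)
print(" -> ".join(cycle) if cycle else "no cycle")
```

If either edge is removed (for example, if RSU did not pause the node until the in-flight TOI had completed), the graph has no cycle and neither operation would block the other indefinitely in this model.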
| Comment by Ramesh Sivaraman [ 2022-12-14 ] | |||||||||||||||||||||||||||||||||||||
|
seppo We can reproduce the problem even without the {{SET wsrep_OSU_method..}} statement. This is a sporadic issue, so start the RQG run multiple times to reproduce the problem. Please find attached the grammar oltp_and_ddl_v2.yy. RQG command to reproduce the hang problem:
| |||||||||||||||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2022-12-27 ] | |||||||||||||||||||||||||||||||||||||
|
This new RQG grammar brings two other scenarios to the surface. I will submit a separate MDEV issue for each, as they are not related to the fixes suggested in the PRs for this MDEV. The surfacing issues are:
| |||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2023-01-10 ] | |||||||||||||||||||||||||||||||||||||
| |||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-01-11 ] | |||||||||||||||||||||||||||||||||||||
|
seppo, I provided some review comments on a 10.6 version of this: 1, 2, 3, 4. Please address them. I did not look at other versions yet. | |||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-01-11 ] | |||||||||||||||||||||||||||||||||||||
|
For the 10.6 version, a fix of |