[MDEV-29684] Testing and improving 'cluster conflict resolving' with 10.4 and later Created: 2022-10-03  Updated: 2023-01-16  Due: 2023-10-28  Resolved: 2023-01-16

Status: Closed
Project: MariaDB Server
Component/s: Galera
Fix Version/s: 10.4.28, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3

Type: Task Priority: Critical
Reporter: Seppo Jaakola Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File all_bt_28102022.txt     Text File bt_all.txt     Text File bt_all_v2.txt     File mysql.err     File new.err     File oltp.zz     File oltp_and_ddl_v1.yy     File oltp_and_ddl_v2.yy    
Issue Links:
Relates
relates to MDEV-30317 Transaction savepoint may cause failu... Closed

 Description   

There are a number of bug reports of cluster wide conflict resolving related crashes or hangs. To troubleshoot these, it would be good to run generic cluster conflict tests, just for easier reproducing of the issue(s).

Run high intensity write-write conflict load in the cluster, with varying SQL access patterns.
Also, some mtr tests exercising high priority victim aborting, appear to be disabled in 10.4. Enable these and fix the test if it appears to be non-deterministic.

Analyze and troubleshoot issues surfacing from the testing.



 Comments   
Comment by Jan Lindström (Inactive) [ 2022-10-13 ]
Comment by Jan Lindström (Inactive) [ 2022-10-13 ]

ramesh branch : bb-10.4-MDEV-29684-galera

This requires multi-master testing so that you have KILL [QUERY|CONNECTION] clauses there.

Comment by Ramesh Sivaraman [ 2022-10-25 ]

seppo jplindst RQG run shows server hang issue using KILL connection/query grammar. Seems the issue is related to MDL. Please find the attached RQG grammar, error log and stack trace. mysql.err oltp_and_ddl_v1.yy oltp.zz bt_all.txt

MariaDB [(none)]> show processlist;
+----+-------------+-----------------+------+---------+------+---------------------------------+---------------------------------------------------------------------------------------+----------+
| Id | User        | Host            | db   | Command | Time | State                           | Info                                                                                  | Progress |
+----+-------------+-----------------+------+---------+------+---------------------------------+---------------------------------------------------------------------------------------+----------+
|  1 | system user |                 | NULL | Sleep   | 1488 | wsrep aborter idle              | NULL                                                                                  |    0.000 |
|  2 | system user |                 | NULL | Sleep   | 1488 | Closing tables                  | NULL                                                                                  |    0.000 |
|  3 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                  |    0.000 |
|  4 | system user |                 | NULL | Daemon  | NULL | InnoDB purge coordinator        | NULL                                                                                  |    0.000 |
|  5 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                  |    0.000 |
|  6 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                  |    0.000 |
|  7 | system user |                 | NULL | Daemon  | NULL | InnoDB shutdown handler         | NULL                                                                                  |    0.000 |
| 11 | system user |                 | NULL | Sleep   | 1487 | Closing tables                  | NULL                                                                                  |    0.000 |
| 12 | system user |                 | NULL | Sleep   | 1487 | wsrep applier idle              | NULL                                                                                  |    0.000 |
| 14 | system user |                 | NULL | Sleep   | 1487 | wsrep applier idle              | NULL                                                                                  |    0.000 |
| 22 | ramesh      | localhost:43962 | test | Query   | 1445 | acquiring total order isolation | KILL CONNECTION @conid /* QNO 981 CON_ID 22 */                                        |    0.000 |
| 23 | ramesh      | localhost:43972 | test | Query   | 1445 | Commit                          | INSERT INTO `oltp5` ( `id` ) VALUES ( 35403 )  /* QNO 924 CON_ID 23 */                |    0.000 |
| 24 | ramesh      | localhost:43974 | test | Query   | 1446 | acquiring total order isolation | TRUNCATE `oltp6` /* QNO 940 CON_ID 24 */                                              |    0.000 |
| 27 | ramesh      | localhost:53284 | test | Query   | 1445 | acquiring total order isolation | KILL QUERY @conid /* QNO 923 CON_ID 27 */                                             |    0.000 |
| 29 | ramesh      | localhost:53304 | test | Killed  | 1445 | acquiring total order isolation | KILL QUERY @conid /* QNO 956 CON_ID 29 */                                             |    0.000 |
| 30 | ramesh      | localhost:53318 | test | Query   | 1446 | Commit                          | INSERT INTO `oltp4` ( `id`, `k`) VALUES ( NULL, 565051392 ) /* QNO 913 CON_ID 30 */   |    0.000 |
| 32 | ramesh      | localhost:53334 | test | Query   | 1415 | acquiring total order isolation | ALTER ONLINE TABLE `oltp1` PARTITION BY KEY(`id`) /* QNO 926 CON_ID 32 */             |    0.000 |
| 33 | ramesh      | localhost:53346 | test | Query   | 1205 | acquiring total order isolation | TRUNCATE `oltp4` /* QNO 923 CON_ID 33 */                                              |    0.000 |
| 34 | ramesh      | localhost:53360 | test | Query   | 1446 | Waiting for table metadata lock | ALTER ONLINE TABLE `oltp6` ROW_FORMAT=COMPACT /* QNO 905 CON_ID 34 */                 |    0.000 |
| 35 | ramesh      | localhost:53374 | test | Query   | 1295 | acquiring total order isolation | OPTIMIZE TABLE `oltp3` /* QNO 918 CON_ID 35 */                                        |    0.000 |
| 36 | ramesh      | localhost:53376 | test | Query   | 1445 | Freeing items                   | INSERT INTO `oltp6` ( `id`, `k`) VALUES ( NULL, 1681063936 )  /* QNO 905 CON_ID 36 */ |    0.000 |
| 38 | ramesh      | localhost:53392 | test | Query   | 1445 | Init                            | KILL CONNECTION @conid /* QNO 908 CON_ID 31 */                                        |    0.000 |
| 40 | ramesh      | localhost:53406 | test | Query   | 1445 | Init                            | KILL CONNECTION @conid /* QNO 956 CON_ID 21 */                                        |    0.000 |
| 41 | ramesh      | localhost:37020 | test | Query   | 1445 | Init                            | COMMIT /* QNO 943 CON_ID 25 */                                                        |    0.000 |
| 42 | ramesh      | localhost:37022 | test | Query   | 1446 | Commit                          | INSERT INTO `oltp4` ( `id`, `pad` ) VALUES ( NULL, 'm' )  /* QNO 921 CON_ID 26 */     |    0.000 |
| 43 | ramesh      | localhost:37036 | test | Query   |  995 | Init                            | KILL CONNECTION @conid /* QNO 930 CON_ID 28 */                                        |    0.000 |
| 44 | root        | localhost       | NULL | Query   |    0 | Init                            | show processlist                                                                      |    0.000 |
| 45 | root        | localhost       | NULL | Query   | 1319 | Checking permissions            | create database test_one                                                              |    0.000 |
+----+-------------+-----------------+------+---------+------+---------------------------------+---------------------------------------------------------------------------------------+----------+
28 rows in set (0.001 sec)
 
MariaDB [(none)]> 

Comment by Jan Lindström (Inactive) [ 2022-10-25 ]

seppo To me it seems like KILL does not bf kill conflicting DDL if it owns MDL-lock. From error log I found following (see new.err):

2022-10-25 18:01:52 24 [Note] WSREP: MDL conflict db=test table=oltp8 ticket=4 solved by abort
2022-10-25 18:01:53 39 [Note] WSREP: MDL conflict 
schema:  test
request: (24 	seqno 490 	wsrep (toi, exec, aborted) cmd 3 3 	ALTER ONLINE TABLE `oltp8` ROW_FORMAT=DYNAMIC /* QNO 942 CON_ID 24 */)
granted: (43 	seqno -1 	wsrep (local, exec, must_abort) cmd 3 50 	KILL QUERY @conid /* QNO 956 CON_ID 34 */)
2022-10-25 18:01:53 39 [Note] WSREP: MDL ticket: type: MDL_SHARED_WRITE space: TABLE db: test name: oltp8 (Waiting for table metadata lock)
2022-10-25 18:01:53 39 [Note] WSREP: MDL conflict-> BF abort
schema:  test
request: (24 	seqno 490 	wsrep (toi, exec, aborted) cmd 3 3 	ALTER ONLINE TABLE `oltp8` ROW_FORMAT=DYNAMIC /* QNO 942 CON_ID 24 */)
granted: (43 	seqno -1 	wsrep (local, exec, must_abort) cmd 3 50 	KILL QUERY @conid /* QNO 956 CON_ID 34 */)
2022-10-25 18:01:53 39 [Note] WSREP: MDL ticket: type: MDL_SHARED_WRITE space: TABLE db: test name: oltp8 (Waiting for table metadata lock)
2022-10-25 18:01:53 39 [Note] WSREP: wsrep_abort_thd not effective: bf 22620232664832 victim 22620235196160 wsrep 1 wsrep_on 1 RSU 0 TOI 1 aborting 1
2022-10-25 18:01:53 39 [Note] WSREP: MDL conflict db=test table=oltp8 ticket=4 solved by abort

Comment by Jan Lindström (Inactive) [ 2022-10-25 ]

I found that in wsrep_abort_thd we have:

  if ((WSREP(bf_thd) ||
       ((WSREP_ON || bf_thd->variables.wsrep_OSU_method == WSREP_OSU_RSU) &&
	 wsrep_thd_is_toi(bf_thd))) &&
       !wsrep_thd_is_aborting(victim_thd))
 
but wsrep_thd_is_aborting(victim_thd) == true

Based on debug output OPTIMIZE TABLE is aborted and INSERT is aborting state.

Comment by Jan Lindström (Inactive) [ 2022-10-25 ]

From same run

MariaDB [(none)]> show processlist;
+----+-------------+-----------------+------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id | User        | Host            | db   | Command | Time | State                           | Info                                                                                                 | Progress |
+----+-------------+-----------------+------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
|  1 | system user |                 | NULL | Sleep   |  376 | wsrep aborter idle              | NULL                                                                                                 |    0.000 |
|  2 | system user |                 | NULL | Sleep   |  376 | Closing tables                  | NULL                                                                                                 |    0.000 |
|  5 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
|  6 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
|  3 | system user |                 | NULL | Daemon  | NULL | InnoDB purge coordinator        | NULL                                                                                                 |    0.000 |
|  4 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
|  7 | system user |                 | NULL | Daemon  | NULL | InnoDB shutdown handler         | NULL                                                                                                 |    0.000 |
| 11 | system user |                 | NULL | Sleep   |  375 | wsrep applier idle              | NULL                                                                                                 |    0.000 |
| 12 | system user |                 | NULL | Sleep   |  375 | Closing tables                  | NULL                                                                                                 |    0.000 |
| 14 | system user |                 | NULL | Sleep   |  375 | wsrep applier idle              | NULL                                                                                                 |    0.000 |
| 22 | ramesh      | localhost:50776 | test | Query   |  295 | acquiring total order isolation | OPTIMIZE TABLE `oltp8` /* QNO 982 CON_ID 22 */                                                       |    0.000 |
| 23 | ramesh      | localhost:50788 | test | Query   |  325 | Commit                          | INSERT INTO `oltp10` ( `id` ) VALUES ( NULL ),( NULL ),( NULL ),( NULL )  /* QNO 940 CON_ID 23 */    |    0.000 |
| 24 | ramesh      | localhost:50802 | test | Query   |  325 | Waiting for table metadata lock | ALTER ONLINE TABLE `oltp8` ROW_FORMAT=DYNAMIC /* QNO 942 CON_ID 24 */                                |    0.000 |
| 27 | ramesh      | localhost:50836 | test | Query   |  325 | Commit                          | INSERT INTO `oltp7` ( `id` ) VALUES ( NULL ),( NULL ),( NULL ),( NULL )  /* QNO 934 CON_ID 27 */     |    0.000 |
| 28 | ramesh      | localhost:50852 | test | Query   |  295 | acquiring total order isolation | KILL CONNECTION @conid /* QNO 931 CON_ID 28 */                                                       |    0.000 |
| 31 | ramesh      | localhost:50878 | test | Query   |  115 | acquiring total order isolation | KILL CONNECTION @conid /* QNO 929 CON_ID 31 */                                                       |    0.000 |
| 35 | ramesh      | localhost:50900 | test | Query   |   85 | acquiring total order isolation | KILL CONNECTION @conid /* QNO 941 CON_ID 35 */                                                       |    0.000 |
| 36 | ramesh      | localhost:50916 | test | Query   |  325 | acquiring total order isolation | KILL QUERY @conid /* QNO 927 CON_ID 36 */                                                            |    0.000 |
| 37 | ramesh      | localhost:54008 | test | Query   |   25 | Init                            | INSERT INTO `oltp5` ( `id` ) VALUES ( NULL ),( NULL ),( NULL ),( NULL )  /* QNO 953 CON_ID 29 */     |    0.000 |
| 39 | ramesh      | localhost:54032 | test | Query   |   25 | Init                            | INSERT INTO `oltp8` ( `id`, `id`) VALUES ( NULL, -699531264 )  /* QNO 959 CON_ID 33 */               |    0.000 |
| 40 | ramesh      | localhost:54034 | test | Query   |  325 | acquiring total order isolation | OPTIMIZE TABLE `oltp8` /* QNO 943 CON_ID 26 */                                                       |    0.000 |
| 42 | ramesh      | localhost:52848 | test | Query   |   25 | Init                            | SELECT CONCAT('SELECT * FROM INFORMATION_SCHEMA.',table_name,' ORDER BY RAND() LIMIT 10') FROM INFOR |    0.000 |
| 43 | ramesh      | localhost:52850 | test | Killed  |  325 | acquiring total order isolation | KILL QUERY @conid /* QNO 956 CON_ID 34 */                                                            |    0.000 |
| 44 | ramesh      | localhost:52856 | test | Query   |  325 | Commit                          | INSERT INTO `oltp1` ( `id` ) VALUES ( NULL ),( NULL ),( NULL ),( NULL )  /* QNO 983 CON_ID 21 */     |    0.000 |
| 45 | ramesh      | localhost:52864 | test | Killed  |  325 | Commit                          | INSERT INTO `oltp7` ( `id`, `pad` ) VALUES ( NULL, 'v' ) /* QNO 952 CON_ID 25 */                     |    0.000 |
| 46 | ramesh      | localhost:52876 | test | Query   |   25 | Init                            | INSERT INTO `oltp4` ( `c` ) VALUES ( 's' )  /* QNO 939 CON_ID 32 */                                  |    0.000 |
| 47 | ramesh      | localhost:52890 | test | Query   |  145 | acquiring total order isolation | ALTER ONLINE TABLE `oltp9` FORCE, LOCK=SHARED, ALGORITHM=INPLACE /* QNO 961 CON_ID 25 */             |    0.000 |
| 48 | root        | localhost       | NULL | Query   |  177 | acquiring total order isolation | create database test_one                                                                             |    0.000 |
| 49 | root        | localhost       | NULL | Query   |    0 | Init                            | show processlist                                                                                     |    0.000 |
+----+-------------+-----------------+------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
29 rows in set (0.001 sec)
 
MariaDB [(none)]> 

Comment by Seppo Jaakola [ 2022-10-26 ]

The pull request should not affect MDL lock conflict resolving, and these tests are specifically for MDL conflicts. Do you have a reason to believe that the PR has caused this behavior, and is a regression?
I have later work for MDL conflict resolving, expect a PR for that shortly.

Comment by Jan Lindström (Inactive) [ 2022-10-26 ]

ramesh Can you verify using 10.4 branch is this MDL problem reproducible there also?

Comment by Ramesh Sivaraman [ 2022-10-26 ]

jplindst Could not reproduce the MDL issue on the latest 10.4 base build yet.

Comment by Ramesh Sivaraman [ 2022-10-28 ]

seppo PFA stack all_bt_28102022.txt from hung server built with latest changes

9da381796a49b74643808972bcc3028b15962dfd

Comment by Ramesh Sivaraman [ 2022-11-16 ]

seppo The latest commit still has a hung issue. PFA full stack bt_all_v2.txt

10.4.28 665f81757b2a7b29099307f6c5a15e04964502a0

MariaDB [test]> show processlist;
+----+-------------+-----------------+------+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+----------+
| Id | User        | Host            | db   | Command | Time  | State                           | Info                                                                                             | Progress |
+----+-------------+-----------------+------+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+----------+
|  1 | system user |                 | NULL | Sleep   | 54136 | wsrep aborter idle              | NULL                                                                                             |    0.000 |
|  2 | system user |                 | NULL | Sleep   | 54136 | Closing tables                  | NULL                                                                                             |    0.000 |
|  3 | system user |                 | NULL | Daemon  |  NULL | InnoDB purge coordinator        | NULL                                                                                             |    0.000 |
|  6 | system user |                 | NULL | Daemon  |  NULL | InnoDB purge worker             | NULL                                                                                             |    0.000 |
|  4 | system user |                 | NULL | Daemon  |  NULL | InnoDB purge worker             | NULL                                                                                             |    0.000 |
|  5 | system user |                 | NULL | Daemon  |  NULL | InnoDB purge worker             | NULL                                                                                             |    0.000 |
|  7 | system user |                 | NULL | Daemon  |  NULL | InnoDB shutdown handler         | NULL                                                                                             |    0.000 |
| 12 | system user |                 | NULL | Sleep   | 54136 | Closing tables                  | NULL                                                                                             |    0.000 |
| 11 | system user |                 | NULL | Sleep   | 54136 | wsrep applier idle              | NULL                                                                                             |    0.000 |
| 14 | system user |                 | NULL | Sleep   | 54136 | wsrep applier idle              | NULL                                                                                             |    0.000 |
| 25 | ramesh      | localhost:33900 | test | Query   | 54084 | Waiting for table metadata lock | OPTIMIZE TABLE `oltp8` /* QNO 956 CON_ID 25 */                                                   |    0.000 |
| 27 | ramesh      | localhost:46682 | test | Query   | 54084 | Commit                          | INSERT INTO `oltp7` ( `id`, `k`) VALUES ( NULL, 2059206656 ) /* QNO 974 CON_ID 27 */             |    0.000 |
| 28 | ramesh      | localhost:46696 | test | Killed  | 54084 | acquiring total order isolation | KILL QUERY @conid /* QNO 979 CON_ID 28 */                                                        |    0.000 |
| 31 | ramesh      | localhost:46720 | test | Query   | 53934 | acquiring total order isolation | KILL CONNECTION @conid /* QNO 1001 CON_ID 31 */                                                  |    0.000 |
| 32 | ramesh      | localhost:46728 | test | Query   | 54084 | acquiring total order isolation | KILL CONNECTION @conid /* QNO 965 CON_ID 32 */                                                   |    0.000 |
| 33 | ramesh      | localhost:46732 | test | Query   | 54085 | Commit                          | INSERT INTO `oltp11` ( `pad` ) VALUES ( 't' )  /* QNO 963 CON_ID 33 */                           |    0.000 |
| 34 | ramesh      | localhost:46742 | test | Query   | 54084 | Commit                          | INSERT INTO `oltp4` ( `id` ) VALUES ( NULL ),( NULL ),( NULL ),( NULL )  /* QNO 992 CON_ID 34 */ |    0.000 |
| 38 | root        | localhost       | test | Query   | 53968 | acquiring total order isolation | create database test_one                                                                         |    0.000 |
| 42 | root        | localhost       | test | Query   |     0 | Init                            | show processlist                                                                                 |    0.000 |
+----+-------------+-----------------+------+---------+-------+---------------------------------+--------------------------------------------------------------------------------------------------+----------+
19 rows in set (0.001 sec)
 
MariaDB [test]> 

Comment by Seppo Jaakola [ 2022-12-13 ]

I completed a RQG run with 10 connections against one node in a cluster, with the attached oltp_and_ddl_v1.yy grammar. It appears that this test is generating a load with mixture of KILL commands, DML, and DDL both with TOI and RSU method. The loaded node will hang very soon after starting the test. The reason for this hang appears to be RSU execution, which desyncs and pauses the node, while there is ongoing TOI replication in execution. Because of the paused state, the earlier TOI execution cannot complete, and because of the executing TOI, the RSU cannot get necessary MDL locks => hence the hang.
This is rather artificial situation, as DDL is supposed to happen either by TOI or RSU, not mixed. The fix for such hang should be to make the RSU to wait until all ongoing TOI has completed. This may lead to RSU starving with this test load, though.

Comment by Ramesh Sivaraman [ 2022-12-14 ]

seppo We can reproduce the problem even without the {{SET wsrep_OSU_method..}}statement. This is a sporadic issue, so start the RQG run multiple times to reproduce the problem.

PFA grammar oltp_and_ddl_v2.yy to reproduce the issue.

RQG command to reproduce the hang problem.

perl runall-new.pl \
   --basedir=/test/mtest/mariadb-10.4.28-linux-x86_64 \
   --vardir=/home/ramesh/galera-rqg-test \
   --mysqld=--wsrep-provider=/test/mtest/mariadb-10.4.28-linux-x86_64/lib/libgalera_smm.so \
   --gendata=conf/mariadb/oltp.zz \
   --grammar=conf/mariadb/oltp_and_ddl_v2.yy \
   --threads=16 \
   --galera=mss \
   --mysqld=--wsrep_sst_method=rsync \
   --mysqld=--core \
   --mysqld=--general-log \
   --mysqld=--general-log-file=queries.log \
   --mysqld=--log-output=file \
   --mysqld=--wsrep-debug=0 \
   --mysqld=--wsrep-sync-wait=15 \
   --mysqld=--wsrep_retry_autocommit=0 \
   --mysqld=--wsrep_slave_threads=12 \
   --mysqld=--wsrep_log_conflicts=1 \
   --mysqld=--wsrep_on=1 \
   --mysqld=--default-storage-engine=innodb \
   --mysqld=--sort_buffer_size=200M \
   --mysqld=--innodb-lock-wait-timeout=1 \
   --mysqld=--gtid_domain_id=10 \
   --mysqld=--wsrep_gtid_domain_id=100 \
   --mysqld=--wsrep_gtid_mode=ON \
   --mysqld=--wsrep_slave_threads=4 \
   --mysqld=--slave_parallel_threads=4 \
   --mysqld=--server_id=11 \
   --mysqld=--gtid_strict_mode=1 \
   --mysqld=--log_slave_updates=ON \
   --mysqld=--log_bin=binlog \
   --mysqld=--binlog_format=ROW \
   --mysqld=--master_info_repository=TABLE \
   --mysqld=--relay_log_info_repository=TABLE 

Comment by Seppo Jaakola [ 2022-12-27 ]

This new RQG grammar makes two other scenarios to surface, I will submit separate MDEV issue for both, as they are not related to the fixes suggested in PR's for this MDEV. The surfacing issues are:

  • KILL command is issued from inside of a transaction. And further, this KILL execution can be a target of another KILL command. BF aborting a KILL inside a transaction appears to cause a crash.
  • SAVEPOINT command will be replicated as query log event. If transaction containing such SAVEPOINT event, has to replay, the replaying may fail for un-assigned character set information.
Comment by Jan Lindström (Inactive) [ 2023-01-10 ]
Comment by Marko Mäkelä [ 2023-01-11 ]

seppo, I provided some review comments to a 10.6 version of this: 1, 2, 3, 4. Please address it.

I did not look at other versions yet.

Comment by Marko Mäkelä [ 2023-01-11 ]

For the 10.6 version, a fix of MDEV-29860 could be useful to have.

Generated at Thu Feb 08 10:10:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.