[MDEV-33136] GCF-1060 test hangs when BF-abort logic mistreats transactions with explicit MDL locks Created: 2023-12-28  Updated: 2024-01-18

Status: In Progress
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.32
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Denis Protivensky Assignee: Denis Protivensky
Resolution: Unresolved Votes: 0
Labels: None

Attachments: HTML File backtrace_all     File mysqld.1.err     File mysqld.2.err    
Issue Links:
Relates
relates to MDEV-32160 GCF-1060 test failure due to wsrep MD... Stalled

 Description   

The following else if branch of wsrep_handle_mdl_conflict():

    else if (granted_thd->lex->sql_command == SQLCOM_FLUSH ||
             granted_thd->mdl_context.has_explicit_locks())
    {
      WSREP_DEBUG("BF thread waiting for FLUSH");

doesn't consider that there may be regular transactions having explicit MDL locks.

Example output:

2023-12-28 17:44:17 2 [Note] WSREP: Wsrep_high_priority_service::apply_toi: 1831
2023-12-28 17:44:17 2 [Note] WSREP: assigned new next query and  trx id: 4379
T@4    : 17:44:17.637771 Query_log_event::do_apply_event: query: TRUNCATE TABLE t1
T@4    : 17:44:17.637797 reset_current_stmt_binlog_format_row: debug: temporary_tables: no, in_sub_stmt: no, system_thread: SYSTEM_THREAD_SLAVE_SQL
2023-12-28 17:44:17 2 [Note] WSREP: MDL conflict·
schema:  test
request: (2     seqno 1831  wsrep (toi, exec, committed) cmd 0 8    TRUNCATE TABLE t1)
granted: (237   seqno -1    wsrep (local, exec, executing) cmd 3 5  INSERT INTO t1 VALUE (4, 'z'))
2023-12-28 17:44:17 2 [Note] WSREP: MDL ticket: type: MDL_SHARED_WRITE space: TABLE db: test name: t1 (Waiting for table metadata lock)
2023-12-28 17:44:17 2 [Note] WSREP: BF thread waiting for FLUSH
2023-12-28 17:44:17 2 [Note] WSREP: MDL ticket: type: MDL_SHARED_WRITE space: TABLE db: test name: t1 (Waiting for table metadata lock)

With the debug output in that branch fixed to print the SQL query of the lock holder, the log shows:

2023-12-28 20:09:11 2 [Note] WSREP: Wsrep_high_priority_service::apply_toi: 6599
2023-12-28 20:09:11 2 [Note] WSREP: assigned new next query and  trx id: 16942
T@4    : 20:09:11.867236 Query_log_event::do_apply_event: query: TRUNCATE TABLE t1
T@4    : 20:09:11.867266 reset_current_stmt_binlog_format_row: debug: temporary_tables: no, in_sub_stmt: no, system_thread: SYSTEM_THREAD_SLAVE_SQL
2023-12-28 20:09:11 2 [Note] WSREP: MDL conflict·
schema:  test
request: (2     seqno 6599  wsrep (toi, exec, committed) cmd 0 8    TRUNCATE TABLE t1)
granted: (907   seqno -1    wsrep (local, exec, executing) cmd 3 5  INSERT INTO t1 VALUE (4, 'z'))
2023-12-28 20:09:11 2 [Note] WSREP: MDL ticket: type: MDL_SHARED_WRITE space: TABLE db: test name: t1 (Waiting for table metadata lock)
2023-12-28 20:09:11 2 [Note] WSREP: BF thread waiting for INSERT INTO t1 VALUE (4, 'z')
2023-12-28 20:09:11 2 [Note] WSREP: MDL ticket: type: MDL_SHARED_WRITE space: TABLE db: test name: t1 (Waiting for table metadata lock)

In this case no BF-abort happens, because the DML operation INSERT INTO t1 VALUE (4, 'z'), which holds explicit MDL locks, is treated as if it were FLUSH TABLES, which it is not. As a result, the operation is never aborted and the BF thread keeps waiting for the table metadata lock.
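
To make the misclassification concrete, here is a minimal standalone sketch of the quoted condition. It is not the server code: SqlCommand, GrantedThd, BfAction and reported_decision() are hypothetical stand-ins introduced only for illustration, and the sketch models just this one branch, ignoring whatever checks precede it in wsrep_handle_mdl_conflict().

```cpp
// Hypothetical, simplified stand-ins for the server types (assumptions,
// not the MariaDB definitions).
enum class SqlCommand { Flush, Insert, Truncate, Other };

struct GrantedThd {
  SqlCommand command;       // models granted_thd->lex->sql_command
  bool has_explicit_locks;  // models granted_thd->mdl_context.has_explicit_locks()
};

enum class BfAction { WaitForLockHolder, AbortLockHolder };

// Models only the condition quoted in the description: the BF thread waits
// when the lock holder runs FLUSH *or* holds any explicit MDL lock.
// Everything else is assumed to be BF-aborted in this simplified model.
inline BfAction reported_decision(const GrantedThd &granted) {
  if (granted.command == SqlCommand::Flush || granted.has_explicit_locks)
    return BfAction::WaitForLockHolder;  // the "BF thread waiting for FLUSH" path
  return BfAction::AbortLockHolder;
}
```

Under this model, reported_decision({SqlCommand::Insert, true}) returns BfAction::WaitForLockHolder, which matches the log above: the INSERT takes the "waiting for FLUSH" path only because it holds explicit MDL locks. The same INSERT without explicit locks would be aborted. The sketch deliberately does not propose a corrected condition, since the right way to tell such a DML apart from a genuine FLUSH or explicit-lock holder is not established in this report.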

Why a DML operation would hold explicit MDL locks in the first place is still an open question.

