[MDEV-22543] Galera SST donation fails, FLUSH TABLES WITH READ LOCK times out - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL)
Fix Version/s: 10.2.35, 10.3.26, 10.4.16, 10.5.7
Component/s: Galera, Tests, MTR
Labels: None

Description

SST donation occasionally fails under heavy load because FLUSH TABLES WITH READ LOCK times out after waiting for only one second:

2020-05-13 10:18:44 139871772944128 [Note] WSREP: sst_donor_thread signaled with 0

2020-05-13 10:18:45 139871226160896 [Note] WSREP: Flushing tables for SST...

2020-05-13 10:18:46 139871226160896 [Warning] WSREP: Error executing 'FLUSH TABLES WITH READ LOCK': 1205 (Lock wait timeout exceeded; try restarting transaction)

2020-05-13 10:18:46 139871226160896 [ERROR] WSREP: Failed to flush and lock tables

The following MTR test demonstrates the issue by issuing an UPDATE on the donor node and pausing its execution at a sync point after some MDL locks have been taken. When node_2 then tries to join via SST, the lock wait timeout occurs.

# The test verifies that the FLUSH TABLES WITH READ LOCK does not

# time out if it needs to wait for another MDL lock for short duration

# during SST donation.

--source include/galera_cluster.inc

--let $node_1 = node_1

--let $node_2 = node_2

--source include/auto_increment_offset_save.inc

--let $galera_connection_name = node_1_ctrl

--let $galera_server_number = 1

--source include/galera_connect.inc

# Run UPDATE on node_1 and make it block before table locks are taken.

# This should block FTWRL.

--connection node_1

CREATE TABLE t1 (f1 INT PRIMARY KEY, f2 INT);

INSERT INTO t1 VALUES (1, 1);

SET DEBUG_SYNC = "before_lock_tables_takes_lock SIGNAL sync_point_reached WAIT_FOR sync_point_continue";

--send UPDATE t1 SET f2 = 2 WHERE f1 = 1

--connection node_1_ctrl

SET DEBUG_SYNC = "now WAIT_FOR sync_point_reached";

# Restart node_2, force SST.

--connection node_2

--source include/shutdown_mysqld.inc

--remove_file $MYSQLTEST_VARDIR/mysqld.2/data/grastate.dat

# Restart without waiting. The UPDATE should block FTWRL on node_1,

# so the SST cannot be completed and node_2 cannot join before

# UPDATE connection is signalled to continue.

--exec echo "restart:$start_mysqld_params" > $_expect_file_name

# If the bug is present, FTWRL times out on node_1 in a couple of

# seconds and node_2 fails to join.

--sleep 10

--connection node_1_ctrl

SET DEBUG_SYNC = "now SIGNAL sync_point_continue";

--connection node_1

--reap

SET DEBUG_SYNC = "RESET";

--connection node_2

--enable_reconnect

--source include/wait_until_connected_again.inc

--connection node_1

DROP TABLE t1;

--source include/auto_increment_offset_restore.inc

The apparent cause of the early timeout is the following check in MDL_context::acquire_lock():

    /* Check if the client is gone while we were waiting. */
    if (! thd_is_connected(m_owner->get_thd()))
    {
      /*
       * The client is disconnected. Don't wait forever:
       * assume it's the same as a wait timeout, this
       * ensures all error handling is correct.
       */
      wait_status= MDL_wait::TIMEOUT;
      break;
    }

The call to thd_is_connected() always returns false for the SST donor THD, so if the lock wait lasts longer than one second, the wait is aborted with a timeout.
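The effect of that check can be illustrated with a small stand-alone simulation. This is only a sketch, not the server code: `WaitStatus`, `WaitResult`, `simulate_acquire_lock` and its parameters are hypothetical names invented for this illustration, and the one-second wait slice is an assumption taken from the "more than one second" behaviour described above. No real sleeping or locking takes place; each loop iteration stands for one slice of waiting.

```cpp
// Hypothetical, simplified stand-ins; these are NOT the MariaDB types.
enum class WaitStatus { GRANTED, TIMEOUT };

struct WaitResult {
  WaitStatus status;
  int seconds_waited;
};

// Simulates the shape of the wait loop described in this report:
// wait in one-second slices (assumed slice length) and, after each
// slice that did not yield the lock, check whether the client is
// still connected.
//
//   client_connected     what the connectivity check reports for the THD
//   lock_free_after_s    simulated second at which the blocker releases
//   lock_wait_timeout_s  configured upper bound for the whole wait
WaitResult simulate_acquire_lock(bool client_connected,
                                 int lock_free_after_s,
                                 int lock_wait_timeout_s)
{
  int waited = 0;
  while (waited < lock_wait_timeout_s) {
    ++waited;  // one simulated one-second wait slice

    if (waited >= lock_free_after_s)
      return {WaitStatus::GRANTED, waited};

    // Mirrors the quoted check: a THD reported as disconnected is
    // treated as if its wait had timed out.
    if (!client_connected)
      return {WaitStatus::TIMEOUT, waited};
  }
  return {WaitStatus::TIMEOUT, waited};
}
```

With a blocker that releases after 10 simulated seconds and a 50-second limit, `simulate_acquire_lock(false, 10, 50)` gives up with `TIMEOUT` after the first slice, which corresponds to the donor THD case in this report, while `simulate_acquire_lock(true, 10, 50)` is granted the lock after 10 slices.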

Issue Links

causes: MDEV-23483 Set Galera SST thd as system thread (Closed)

People

Assignee: Jan Lindström (Inactive)

Reporter: Teemu Ollakka

Votes: 3

Watchers: 5

Dates

Created: 2020-05-13 09:27

Updated: 2025-11-19 11:21

Resolved: 2020-08-11 10:01
