[MDEV-22052] Server with wsrep enabled doesn't respect lock wait timeouts under FLUSH TABLE WITH READ LOCK

Details

Type: Bug
Status: Stalled
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Fix Version/s: 10.4(EOL)
Component/s: Locking, wsrep
Labels:
None

Description

Note: might be related to MDEV-22051. The test case is similar, but the one in MDEV-22051 shows what happens if DDL is attempted in the thread holding the lock, while this one shows what happens if DDL is attempted concurrently with the lock.

--source include/galera_cluster.inc
--source include/have_innodb.inc

FLUSH TABLES WITH READ LOCK;

--connect (con1,localhost,root,,)
SET lock_wait_timeout= 1;
--error ER_LOCK_WAIT_TIMEOUT
CREATE TABLE t1 (a INT) ENGINE=InnoDB;

The expected result is that after 1 second the CREATE attempt fails with the timeout set for the session. Instead, the statement hangs seemingly forever.

Reproducible on 10.1-10.5, debug and non-debug alike.
Not reproducible if wsrep is disabled.
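
For comparison, the expected behaviour can be sketched as a two-session SQL transcript against a server running without wsrep. This is only an illustration and not part of the original test case: the session labels are hypothetical, and the outcome noted in the comments is what the description says should happen, not captured output.

```sql
-- Session 1: take the global read lock and keep the connection open
FLUSH TABLES WITH READ LOCK;

-- Session 2: a separate connection with a 1-second lock wait timeout
SET SESSION lock_wait_timeout = 1;
-- Expected: fails with ER_LOCK_WAIT_TIMEOUT after roughly 1 second
CREATE TABLE t1 (a INT) ENGINE=InnoDB;

-- Session 1: release the global read lock
UNLOCK TABLES;
```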

Issue Links

relates to

MDEV-22051 Protocol::end_statement(): Assertion `0' failed on Galera node upon DDL attempt with conflicting lock

Closed

Activity

Jan Lindström (Inactive) added a comment - 2020-04-09 06:16

Fixed on edc3899d9781e98b4328931884527913ebffb11f

Elena Stepanova added a comment - 2020-04-12 14:55 - edited

It still doesn't respect lock wait timeouts.

Only now, if it's started with a configuration like the one the galera test suite provides (whatever it is), the DDL fails with ER_UNKNOWN_COM_ERROR ("Aborting TOI: Global Read-Lock (FTWRL) in place") right away, regardless of the configured lock_wait_timeout; and if the server is started as a single node with WSREP enabled, e.g. with

--wsrep_on=ON --wsrep_cluster_address=gcomm:// --wsrep_provider=/home/elenst/galera/galera-4.so --innodb_autoinc_lock_mode=2 --innodb_doublewrite=1 --binlog-format=row

startup options, then it hangs seemingly forever as it did before, also regardless of the lock_wait_timeout.

Also, here is another test case which still hangs even with the patch above, also with the standard MTR configuration under suite/galera:

--source include/galera_cluster.inc

CREATE TABLE t1 (a INT) ENGINE=InnoDB;
LOCK TABLE t1 WRITE;

--connect (con1,localhost,root,,test)
SET lock_wait_timeout= 1;
--error ER_LOCK_WAIT_TIMEOUT
CREATE VIEW v1 AS SELECT * FROM t1;

Jan Lindström (Inactive) added a comment - 2020-05-18 07:24

elenst Does it really need to wait up to lock_wait_timeout seconds? In my opinion, if we detect problematic usage, it is correct to return right away. This is the only way, at least for brute force transactions, as they can't wait.

Elena Stepanova added a comment - 2020-05-18 08:52 - edited

No, it doesn't have to wait if a problem is revealed right away in a consistent manner (and if it really is the kind of problem which should make Galera fail, I'll leave it to Galera experts to decide on that).
But please read the whole comment. This FTWRL detection only covers a subset of the problem; the comment shows at least two cases where the check doesn't help, which is why it is not a proper fix.

lock_wait_timeout has a widespread effect on server operation. If it isn't respected, there are probably numerous other cases where it would be visible, and a specific hack around FTWRL won't fix the whole issue.

People

Assignee: Jan Lindström

Reporter: Elena Stepanova

Votes: 0

Watchers: 4

Dates

Created: 2020-03-26 17:27

Updated: 2023-04-11 06:01

MariaDB Server