[MDEV-11709] lock_wait_timeout has no effect when slave_parallel_mode!=none Created: 2017-01-03  Updated: 2017-06-19  Resolved: 2017-06-19

Status: Closed
Project: MariaDB Server
Component/s: Locking, Replication
Affects Version/s: 10.1.19
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Søren Kröger Assignee: Kristian Nielsen
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Ubuntu LTS 16



 Description   

On a slave server with slave_parallel_mode set to "conservative", and with another session holding an active lock, lock_wait_timeout has no effect.

How to reproduce:

  1. Create a replicating slave with incoming data
  2. Make sure slave_parallel_mode is set to "conservative" (other modes than none may cause the same problem, but I have not tested them)
  3. Create a global lock

    FLUSH TABLES WITH READ LOCK;
    

  4. Start another mysql session and do

    set lock_wait_timeout=1;  FLUSH TABLES WITH READ LOCK;
    

Result:
The flush command will hang (possibly until the previous lock is released).
In the processlist, the flush command is shown in the "Waiting for worker threads to pause for global read lock" state.

Expected:
The flush command should timeout like this:

MariaDB [(none)]> set global lock_wait_timeout=1;  FLUSH TABLES WITH READ LOCK;
Query OK, 0 rows affected (0.00 sec)
 
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

Workaround:

    STOP SLAVE;
    SET GLOBAL slave_parallel_mode=none;
    START SLAVE;



 Comments   
Comment by Elena Stepanova [ 2017-01-03 ]

I'm not sure there is a bug here.
Semantically, lock_wait_timeout applies to waiting for locks (metadata locks). In this case, the query is waiting for something other than a lock – "Waiting for worker threads to pause" – so it makes sense that the lock wait timeout does not apply here.
It looks more like a job for the max_statement_time setting, which would indeed work in this case.
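For example, the second session's FLUSH could be bounded with a statement timeout instead (a sketch of the suggested approach; the one-second value is illustrative):

    -- max_statement_time is in seconds; 0 means no limit
    SET SESSION max_statement_time=1;
    FLUSH TABLES WITH READ LOCK;
    -- expected to abort with ER_STATEMENT_TIMEOUT once the limit is exceeded,
    -- instead of hanging in "Waiting for worker threads to pause ..."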

That said, I'll still assign it to the locking expert svoj to double-check and confirm (or object).

Comment by Sergey Vojtovich [ 2017-06-16 ]

I tend to agree with Elena, but I also feel like lock_wait_timeout might be more convenient.

knielsen, do you feel that rpl_pause_for_ftwrl() can be thought of as a lock? Should we honour lock_wait_timeout there?

Comment by Kristian Nielsen [ 2017-06-16 ]

rpl_pause_for_ftwrl() is not a lock.
Rather, it is waiting for the replication threads to reach a particular idle
state, before FTWRL can proceed. This is needed to fix a problem where the
server could get deadlocked.
The pause should only happen during the process of taking the FTWRL inside
sql_parse.cc, not for the full duration of a lock.

Maybe what happens here is that the replication worker threads are waiting
for the first FLUSH TABLES WITH READ LOCK? And the worker threads might be
running with a different lock_wait_timeout.

Probably a similar thing would be seen if replicating a long-running
transaction (eg. INSERT INTO t1 VALUES (sleep(100)))?
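One way to probe this hypothesis (a sketch; note that replication worker threads pick up the global lock_wait_timeout, while SET without GLOBAL in the FTWRL session changes only that session's value):

    -- affects only the current connection
    SET SESSION lock_wait_timeout=1;
    -- compare what the workers would see against the session value
    SHOW GLOBAL VARIABLES LIKE 'lock_wait_timeout';
    SHOW SESSION VARIABLES LIKE 'lock_wait_timeout';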

A similar case is STOP SLAVE. It also needs to wait for running replicated
transactions to complete, and that also does not time out on
lock_wait_timeout.

I don't mind if you make rpl_pause_for_ftwrl() time out on
lock_wait_timeout. But it might be somewhat tricky to implement correctly. I
am thinking about the case where some worker threads are already
successfully paused while others are still running, and maybe another FTWRL
is running in parallel...

Comment by Sergey Vojtovich [ 2017-06-19 ]

If this wait is not a lock, then timing out on lock_wait_timeout is semantically wrong. Please use max_statement_time as Elena suggested.

knielsen, rpl_pause_for_ftwrl() already has code to handle interruption, though I have no idea how well it works.

Generated at Thu Feb 08 07:52:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.