[MDEV-17350] FLUSH TABLES keeps tables locked after timeout Created: 2018-10-02 Updated: 2023-04-27 |
|
| Status: | Confirmed |
| Project: | MariaDB Server |
| Component/s: | Locking |
| Affects Version/s: | 5.5, 10.0, 10.1, 10.1.21, 10.1.34, 10.2, 10.3, 10.4 |
| Fix Version/s: | 10.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Thomas Mischke | Assignee: | Michael Widenius |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | upstream |
| Environment: |
for example Linux version 2.6.32-696.10.1.el6.x86_64 (mockbuild@c1bl.rdu2.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-18) (GCC) ) #1 SMP Tue Aug 22 18:51:35 UTC 2017 |
| Description |
|
Transactions started while the database is waiting for a "FLUSH TABLES WITH READ LOCK" to complete remain blocked even after the FLUSH command hits a timeout. In our environment this happens as follows:

Session 1: select * from a; # a very long running statement, 1800 seconds for example
Session 2: SET lock_wait_timeout = 10; FLUSH TABLES WITH READ LOCK;
Session 3: select * from a where a.x=1; # executed while session 2 is waiting in the flush

After 10 seconds, session 2 gets a timeout and we close its connection. But session 3 is still blocked: the list of running statements shows that it is "Waiting for table flush". Only once session 1 returns does session 3 continue. This is the behavior we see with version 10.1.34.

In our live environment (version 10.1.21) we see a slightly different behavior of session 3: instead of waiting for session 1 to finish, it continues after exactly 600 seconds (our slow query log is full of such statements, started shortly after we issued "flush tables with read lock"). In the slow query log we see one more interesting aspect. The relevant line reads:

While our production tables have lots of indices, we were able to reproduce the issue with a newly created table:

Is there any way we can tell session 3 to continue? |
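For concreteness, the scenario can be written out per session like this (a sketch only: the table name a and column x come from the description, and the timings are illustrative):

```sql
-- Session 1: a very long running statement keeps the table open
SELECT * FROM a;                 -- runs for e.g. 1800 seconds

-- Session 2: flush with a short lock timeout
SET lock_wait_timeout = 10;
FLUSH TABLES WITH READ LOCK;     -- gives up after 10 seconds; the connection is then closed

-- Session 3: started while session 2 was still waiting
SELECT * FROM a WHERE a.x = 1;   -- stays "Waiting for table flush" until session 1 finishes
```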
| Comments |
| Comment by Elena Stepanova [ 2018-10-03 ] |
|
Thanks for the report. Reproducible as described on all major versions of MariaDB / MySQL, which would often indicate that it's intended legacy behavior (however weird), but I can't find anything in the MySQL documentation that would confirm it. Unless we are missing something, it's a really nasty flaw. The test case below represents the scenario in the description, with only two changes:

The test case fails when select * from t1 in con1 hits a timeout. It shouldn't happen.
| Comment by Sergey Vojtovich [ 2018-10-04 ] |
|
Yes, this is intended behaviour, and strictly speaking it has nothing to do with locking. The client waits for the invalidated table to be evicted from the table definition cache, which happens only when previously started statements release this table. I doubt it is feasible to roll back the FLUSH TABLES effect on error, that is, to "uninvalidate" flushed tables. At the very least it means we would need to track who invalidated what, which is a fairly major effort. On a positive note, there is some progress going on to make backups less painful; probably it could help in your situation as well. |
| Comment by Sergey Vojtovich [ 2018-10-05 ] |
|
thomas.mischke, could you briefly explain how you use FLUSH TABLES WITH READ LOCK? Perhaps we could come up with some alternative. |
| Comment by Thomas Mischke [ 2018-10-05 ] |
|
We are using it for our backup. Since the database is running in production while the backup is performed, we set the timeout for FLUSH TABLES WITH READ LOCK to three seconds and keep trying (every 10 seconds) until we get the lock. Our data folder resides on an LVM volume, so we then create a snapshot, issue UNLOCK TABLES, and start copying the content of the snapshot to our backup storage. This complete backup runs once a day. In addition, we have an incremental backup (that is, every hour we copy the binlog of our server instance).

Additional information: after your comment from yesterday, I read about BACKUP LOCK. Looking forward to it. But some other comments I read made me wonder: do I need FLUSH TABLES WITH READ LOCK with InnoDB tables at all, or can we go without it (if we have the binlog at hand)? |
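The daily procedure described above might be sketched as follows (a sketch only: the volume and snapshot names are placeholders, and the LVM steps are shown as comments because they run in the shell, not in SQL):

```sql
SET SESSION lock_wait_timeout = 3;   -- give up after three seconds
FLUSH TABLES WITH READ LOCK;         -- the backup script retries this every 10 seconds until it succeeds
-- shell: lvcreate --snapshot --name datadir-snap --size 10G /dev/vg0/datadir   (placeholder names)
UNLOCK TABLES;                       -- release the lock as soon as the snapshot exists
-- shell: mount the snapshot, copy its contents to backup storage, then remove it with lvremove
```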
| Comment by Sergey Vojtovich [ 2018-10-05 ] |
|
thomas.mischke, if only InnoDB tables matter and the LVM snapshot produces something like copy-on-write pages, you definitely don't need the FLUSH TABLES part, but I still feel the WITH READ LOCK part matters. Mainly because concurrent DDL may get InnoDB metadata and .frm metadata out of sync, that is, you may end up with unrecoverable (or painfully recoverable) data. Another problem might be that InnoDB data files get updated concurrently and the snapshot sees a partially updated page (not sure if this is possible; most probably it is); then you may also get unrecoverable data. WITH READ LOCK most probably protects against this as well.

Since you copy data anyway, you may consider trying mariabackup. It works around partially updated pages by verifying the page checksum and retrying if it doesn't match. There is a mode in which it doesn't even have to issue FLUSH TABLES WITH READ LOCK. However, I'm not sure what mariabackup's status is quality-wise.

Another option might be implementing syntax to acquire the global read lock without flushing tables. From InnoDB's point of view this protection shouldn't differ from the current FLUSH TABLES WITH READ LOCK. wlad, could you correct me if I'm wrong? |
| Comment by Vladislav Vaintroub [ 2018-10-05 ] |
|
The FLUSH TABLES WITH READ LOCK part is needed for non-InnoDB tables (system tables, mysql.user and such, in case they are modified). It is also needed in case DDL is running: if the backup is caught in the middle of a DDL operation, it might not be recoverable, even for InnoDB. FLUSH TABLES WITH READ LOCK is also used to get the current binlog position, to bootstrap a new replication slave from a backup.

Current alternatives to FTWRL:

The downside, as in the case of LOCK TABLES READ, is that you also need to compute the list of tables yourself from the information schema, and ensure no new tables are being created while taking the backup. There is also a plan for backup locks that are lighter-weight than FTWRL.

But if you just care about InnoDB, and can ensure that no DDL is running, yes, you can take snapshots without any locks. You can try mariabackup --backup --no-lock, but in this case you cannot ensure consistency of MyISAM tables or get a consistent binlog position. |
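As a sketch, the per-table alternatives alluded to above could look like this (t1 and t2 are placeholder table names; as noted, the list would have to be computed from information_schema by the backup script):

```sql
-- Alternative 1: plain per-table read locks
LOCK TABLES t1 READ, t2 READ;
-- ... take the filesystem / LVM snapshot here ...
UNLOCK TABLES;

-- Alternative 2: flush and lock only the named tables
FLUSH TABLES t1, t2 FOR EXPORT;
-- ... copy or snapshot the table files here ...
UNLOCK TABLES;
```

Both avoid the server-wide flush, at the cost of having to enumerate the tables and prevent concurrent table creation yourself.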
| Comment by Thomas Mischke [ 2018-10-09 ] |
|
Thanks for pointing out the alternatives. I will take a closer look at mariabackup in any case (even if I find we need FTWRL). I am quite sure that DDL is constantly being used in some of our databases, but I will double-check. It seems we will have to live with the current situation until 10.4, that is, try to avoid long-running statements when the backup wants to start, and use our replication server for backups (it is a bit smaller than the main server, thus the backup is painful for other reasons...). |