[MDEV-17772] 3 way lock : ALTER, MDL, BACKUP STAGE BLOCK_DDL - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.4.1
Fix Version/s: 10.4.1
Component/s: Locking
Labels:
None

Description

Here is an unexpected wait with "Backup locks" (similar to the one described in ~~MDEV-15636~~, FTWRL-related, but Backup locks were supposed to be better than FTWRL, and be instant in most cases) Backup locks was supposed not to wait for SELECTs, or ALTER in progress, but in this case, it does. Moreover, in the case below, it waits until the end of a transaction, while no DDL, DML or SELECT is currently running (even if example has a DDL command, it is waiting and it did not start to run yet).

create table t1(i int) engine innodb;

Connection 1 ( Acquire MDL lock)

MariaDB [test]> start transaction;
Query OK, 0 rows affected (0.000 sec)

MariaDB [test]> select 1 from t1; # <-- acquires MDL lock
Empty set (0.001 sec)

Connection 2 (ALTER TABLE)

MariaDB [test]> alter table t1 add column (j int); # <-- waits on MDL

Connection 3 (BACKUP STAGE ...)

MariaDB [(none)]> backup stage start;
Query OK, 0 rows affected (0.000 sec)

MariaDB [(none)]> backup stage flush;
Query OK, 0 rows affected (0.002 sec)

MariaDB [(none)]> backup stage block_ddl; # <-- waits on something

Attachments

Issue Links

relates to

MDEV-5336 Implement BACKUP STAGE for safe external backups

Closed

MDEV-15636 mariabackup --lock-ddl-per-table hangs in FLUSH TABLES due to MDL conflict if ALTER TABLE issued

Closed

Activity

Ascending order - Click to sort in descending order

View 7 older comments

Vladislav Vaintroub added a comment - 2018-11-20 17:10 - edited

The deadlock I'm referring to is the mentioned ~~MDEV-15636~~ (mariabackup --lock-ddl-per-table hangs in FLUSH TABLES due to MDL conflict if ALTER TABLE issued)

Since this bug is prominently featured in the description ~~MDEV-5336~~(Implement LOCK FOR BACKUP) assumed the problem is known/understood. if not, here is what happens.

This happens with mariabackup that runs with --lock-ddl-per-table option

a) A connection inside mariabackup is holding MDLs for all Innodb tables. MDL is acquired as SELECT 1 from table, a transaction. Transaction is not commited until the end of backup.
b) A user connection is doing ALTER TABLE for which MDL is held. ALTER waits.
c) Another connection inside mariabackup is trying to acquire FTWRL.. It waits for ALTER in b), which is waits for transaction holding MDL in a), which is not commited until the end of backup.

This ^ is a deadlock.

To resolve the deadlock, mariabackup issues KILL QUERY for the ALTER in b). This allows FTWRL to proceed. Another possible "solution" is --no-lock which avoids FTWRL.

So,
-I tested BACKUP STAGE with mariabackup.
-I read ~~MDEV-5336~~. It said in description "This lock will also solve the problem with ~~MDEV-15636~~ (killing running queries that conflicts with FLUSH) as the backup locks will
not conflict with other DDL locks"
-I removed KILL QUERY from mariabackup. I wanted to see whether the statement "this lock will also solve the problem with ~~MDEV-15636~~" holds.
-As a result , mariabackup hanged in a test that tests "lock-ddl-per-table" option.

My conclusion therefore was that it did not really solve the problem with ~~MDEV-15636~~.

When you have a fix for that, I will remove KILL QUERY from the backup code.

(I did not decide yet, whether to remove --lock-ddl-per-table option, maybe there is some weird case, where people would still like to use it. At least it turned out to be a useful tool to test BACKUP STAGE so far)

Vladislav Vaintroub added a comment - 2018-11-20 17:10 - edited The deadlock I'm referring to is the mentioned MDEV-15636 (mariabackup --lock-ddl-per-table hangs in FLUSH TABLES due to MDL conflict if ALTER TABLE issued) Since this bug is prominently featured in the description MDEV-5336 (Implement LOCK FOR BACKUP) assumed the problem is known/understood. if not, here is what happens. This happens with mariabackup that runs with --lock-ddl-per-table option a) A connection inside mariabackup is holding MDLs for all Innodb tables. MDL is acquired as SELECT 1 from table, a transaction. Transaction is not commited until the end of backup . b) A user connection is doing ALTER TABLE for which MDL is held. ALTER waits. c) Another connection inside mariabackup is trying to acquire FTWRL.. It waits for ALTER in b), which is waits for transaction holding MDL in a), which is not commited until the end of backup . This ^ is a deadlock. To resolve the deadlock, mariabackup issues KILL QUERY for the ALTER in b). This allows FTWRL to proceed. Another possible "solution" is --no-lock which avoids FTWRL. So, -I tested BACKUP STAGE with mariabackup. -I read MDEV-5336 . It said in description "This lock will also solve the problem with MDEV-15636 (killing running queries that conflicts with FLUSH) as the backup locks will not conflict with other DDL locks" -I removed KILL QUERY from mariabackup. I wanted to see whether the statement "this lock will also solve the problem with MDEV-15636 " holds. -As a result , mariabackup hanged in a test that tests "lock-ddl-per-table" option. My conclusion therefore was that it did not really solve the problem with MDEV-15636 . When you have a fix for that, I will remove KILL QUERY from the backup code. (I did not decide yet, whether to remove --lock-ddl-per-table option, maybe there is some weird case, where people would still like to use it. At least it turned out to be a useful tool to test BACKUP STAGE so far)

Michael Widenius added a comment - 2018-11-21 13:10 - edited

Hi!

Thanks for the explanation, it explains clearly the problem!

The issue is how inplace/online ALTER TABLE is implemented.
It first takes a DDL lock and then open tables in shared mode. Later, in mysql_sql_alter() and mysql_inplace_alter_table() it upgrades the table to exclusive lock and this blocks because of the active transaction.
Because of the DDL lock that is hold, there is no way for BACKUP STAGE BLOCK_DDL to proceed.

It's possible to fix this for ALTER TABLE, but you have the same issue with DROP TABLE and other DDL's, so there is no easy to fix it.
Another solution would be to change lock order and first do table level locks and then backup locks. The problem with this is that we could the the backup lock while the DDL is waiting for the transaction to end, but the side effect is that while the ddl is waiting, no DML's can be done at the tables, which would be even worse.

The solution is simple to not allow --lock-ddl-per-table when using backup locks.
As we want to allow DDL's during the major part of the backup, until BLOCK_DLL, this is the right thing to do anyway.
--lock-ddl-per-table is also not needed as we now have BACKUP STAGE BLOCK_DDL that does the same, but for all tables (not only innodb tables).

I checked also this scenario with the old FTWRL:
tr1: begin ; select 1 from t1;
tr2: alter table t1;
tr3: FLUSH TABLES WITH READ LOCK;

FTWRL locks on both 10.3 and new 10.4 as it as has to wait for ALTER TABLE that takes a MDL_INTENTION_EXLUSIVE/MDL_BACKUP_DDL lock which blocks the MDL_SHARED/MDL_BACKUP_FTWRL1 lock needed in lock_global_read_lock()

Michael Widenius added a comment - 2018-11-21 13:10 - edited Hi! Thanks for the explanation, it explains clearly the problem! The issue is how inplace/online ALTER TABLE is implemented. It first takes a DDL lock and then open tables in shared mode. Later, in mysql_sql_alter() and mysql_inplace_alter_table() it upgrades the table to exclusive lock and this blocks because of the active transaction. Because of the DDL lock that is hold, there is no way for BACKUP STAGE BLOCK_DDL to proceed. It's possible to fix this for ALTER TABLE, but you have the same issue with DROP TABLE and other DDL's, so there is no easy to fix it. Another solution would be to change lock order and first do table level locks and then backup locks. The problem with this is that we could the the backup lock while the DDL is waiting for the transaction to end, but the side effect is that while the ddl is waiting, no DML's can be done at the tables, which would be even worse. The solution is simple to not allow --lock-ddl-per-table when using backup locks. As we want to allow DDL's during the major part of the backup, until BLOCK_DLL, this is the right thing to do anyway. --lock-ddl-per-table is also not needed as we now have BACKUP STAGE BLOCK_DDL that does the same, but for all tables (not only innodb tables). I checked also this scenario with the old FTWRL: tr1: begin ; select 1 from t1; tr2: alter table t1; tr3: FLUSH TABLES WITH READ LOCK; FTWRL locks on both 10.3 and new 10.4 as it as has to wait for ALTER TABLE that takes a MDL_INTENTION_EXLUSIVE/MDL_BACKUP_DDL lock which blocks the MDL_SHARED/MDL_BACKUP_FTWRL1 lock needed in lock_global_read_lock()

Vladislav Vaintroub added a comment - 2018-11-21 13:30 - edited

As I said, I'm not certain of the usefulness of lock-ddl-per-table.
I'd rather keep it, and I would also keep other things like --no-lock. The options are there, they are documented, and in case of bugs in BACKUP STAGE people could still can use them as workaround.

I can document them as "not needed anymore", and "deprecated", but they have not previously gone through any deprecation phase, so they need to be there for the time being.

I would also like to see fix in the server.
The promise of the new LOCK is that it does not block in presence of SELECTS, UPDATES, does not wait for ends of transactions, and even is able to run simultaneously with running DDLs .
Basically, the promise if that it is instant, almost always.

This needs to still hold if there is a non-running DDL present, otherwise BACKUP LOCK waits for all of the above

Vladislav Vaintroub added a comment - 2018-11-21 13:30 - edited As I said, I'm not certain of the usefulness of lock-ddl-per-table. I'd rather keep it, and I would also keep other things like --no-lock. The options are there, they are documented, and in case of bugs in BACKUP STAGE people could still can use them as workaround. I can document them as "not needed anymore", and "deprecated", but they have not previously gone through any deprecation phase, so they need to be there for the time being. I would also like to see fix in the server. The promise of the new LOCK is that it does not block in presence of SELECTS, UPDATES, does not wait for ends of transactions, and even is able to run simultaneously with running DDLs . Basically, the promise if that it is instant, almost always. This needs to still hold if there is a non-running DDL present, otherwise BACKUP LOCK waits for all of the above

Michael Widenius added a comment - 2018-11-21 13:46

Yes, it's instant under some premisses. One is that the connection that is using BACKUP LOCKS is not holding any locks of it self.
There is no easy way to fix that if the connection holds locks that causes other connections to wait in DML or DDL, then it will cause BACKUP LOCK's to fail.

The purpose of BACKUP LOCK BLOCK_DDL is to wait for all running DDL's (those that has DDL lock) to finish and stop new's from being issued. This exactly what it does.
ALTER TABLE is fixed so that while it's copying data it's not blocking BLOCK_DDL.

I am open for suggestion of how to fix it that doesn't involve dark magic .
This discussions is however better to have on IRC

Michael Widenius added a comment - 2018-11-21 13:46 Yes, it's instant under some premisses. One is that the connection that is using BACKUP LOCKS is not holding any locks of it self. There is no easy way to fix that if the connection holds locks that causes other connections to wait in DML or DDL, then it will cause BACKUP LOCK's to fail. The purpose of BACKUP LOCK BLOCK_DDL is to wait for all running DDL's (those that has DDL lock) to finish and stop new's from being issued. This exactly what it does. ALTER TABLE is fixed so that while it's copying data it's not blocking BLOCK_DDL. I am open for suggestion of how to fix it that doesn't involve dark magic . This discussions is however better to have on IRC

Vladislav Vaintroub added a comment - 2018-11-29 00:55

reworded the description so there is no reference to deadlock, as it seems confusing for some readers. Now it is called an "unexpected wait"

Vladislav Vaintroub added a comment - 2018-11-29 00:55 reworded the description so there is no reference to deadlock, as it seems confusing for some readers. Now it is called an "unexpected wait"

MariaDB Server

3 way lock : ALTER, MDL, BACKUP STAGE BLOCK_DDL

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration