[MDEV-25899] intermediate files operations are not protected by backup locks Created: 2021-06-11  Updated: 2021-06-20  Resolved: 2021-06-20

Status: Closed
Project: MariaDB Server
Component/s: Backup, Server
Affects Version/s: 10.6
Fix Version/s: 10.6.2

Type: Bug
Priority: Blocker
Reporter: Vladislav Lesin
Assignee: Vladislav Vaintroub
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
PartOf
is part of MDEV-25854 Restoring a backup may result in garb... Closed
Relates
relates to MDEV-5336 Implement BACKUP STAGE for safe exter... Closed
relates to MDEV-25666 Atomic DDL: InnoDB: Operating system ... Closed

 Description   

After the MDEV-25666 fix, the test mariabackup.innodb_ddl_on_intermediate_table fails in the following way:

  1. mariabackup holds the "BACKUP STAGE START" lock (see MDEV-5336 for details).
  2. The server starts an "ALTER TABLE" with the "COPY" algorithm. copy_data_between_tables() is invoked; it holds the MDL_BACKUP_ALTER_COPY lock and copies data to the intermediate table. InnoDB writes a FILE_CREATE redo log record.
  3. mariabackup requests the BLOCK_COMMIT lock, and the lock is granted. mariabackup remembers the LSN just after the lock is granted.
  4. The server finishes copying data in copy_data_between_tables() and tries to upgrade its lock to MDL_BACKUP_DDL, which conflicts with the lock held by mariabackup. The lock request times out and copy_data_between_tables() returns an error.
  5. mysql_alter_table() drops the intermediate table. This drop is not protected by MDL_BACKUP_DDL, so the table is dropped and InnoDB writes a FILE_DELETE redo log record.
  6. mariabackup waits until the redo log has been read up to the LSN remembered in step 3. At that point FILE_CREATE for the intermediate table has been read, but FILE_DELETE has not.
  7. mariabackup executes backup_fix_ddl(). Because FILE_CREATE was read for the intermediate table but FILE_DELETE was not, it tries to back up that table and fails when opening it, because the table was deleted in step 5.
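
The sequence above can be condensed into a small simulation. This is only a sketch and not mariabackup code: the RedoRecord struct, the files_to_copy() function, the LSN values and the file name used in the usage note are hypothetical; only the FILE_CREATE and FILE_DELETE record names come from this report.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a redo log record describing a file operation.
enum class FileOp { Create, Delete };  // stands in for FILE_CREATE / FILE_DELETE

struct RedoRecord {
  std::uint64_t lsn;
  FileOp op;
  std::string name;
};

// Returns the files a backup would try to copy if it stops parsing the log
// at stop_lsn: files with a Create record and no later Delete record that
// lies at or below stop_lsn. The log must be ordered by LSN.
std::set<std::string> files_to_copy(const std::vector<RedoRecord>& log,
                                    std::uint64_t stop_lsn) {
  std::set<std::string> created;
  std::uint64_t previous_lsn = 0;
  for (const RedoRecord& rec : log) {
    assert(rec.lsn >= previous_lsn);  // input is expected to be LSN-ordered
    previous_lsn = rec.lsn;
    if (rec.lsn > stop_lsn) break;  // records past the remembered LSN are never seen
    if (rec.op == FileOp::Create) {
      created.insert(rec.name);
    } else {
      created.erase(rec.name);
    }
  }
  return created;
}
```

With a Create record at LSN 100 (step 2), a remembered LSN of 200 (step 3) and a Delete record at LSN 300 (step 5), files_to_copy(log, 200) still contains the intermediate table, although the file no longer exists on disk. Reading up to LSN 300 would yield an empty set.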

For reference: mariabackup parses the InnoDB redo log during backup and remembers all file operations it reads from the log. It then copies newly created files, and for renamed and deleted files it creates special files that are processed during the backup prepare phase.
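
The bookkeeping described in this paragraph might look roughly like the following. This is a simplified sketch under stated assumptions, not the actual backup_fix_ddl() logic: the FileOpRecord and BackupAction types and the plan_actions() function are hypothetical names, and real corner cases (for example several renames of the same file) are ignored.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical representation of a file operation read from the redo log.
enum class FileOpKind { Create, Rename, Delete };

struct FileOpRecord {
  FileOpKind kind;
  std::string name;      // file the record applies to
  std::string new_name;  // only meaningful for Rename
};

// What the backup has to do for a file once log parsing is finished.
enum class BackupAction { CopyNewFile, WriteRenameMarker, WriteDeleteMarker };

// Folds the remembered operations into one pending action per file name.
std::map<std::string, BackupAction> plan_actions(
    const std::vector<FileOpRecord>& ops) {
  std::map<std::string, BackupAction> plan;
  for (const FileOpRecord& op : ops) {
    auto it = plan.find(op.name);
    const bool created_during_backup =
        it != plan.end() && it->second == BackupAction::CopyNewFile;
    switch (op.kind) {
      case FileOpKind::Create:
        plan[op.name] = BackupAction::CopyNewFile;
        break;
      case FileOpKind::Rename:
        assert(!op.new_name.empty());
        if (it != plan.end()) plan.erase(it);
        // A file created during the backup is simply copied under its new name.
        plan[op.new_name] = created_during_backup
                                ? BackupAction::CopyNewFile
                                : BackupAction::WriteRenameMarker;
        break;
      case FileOpKind::Delete:
        if (created_during_backup) {
          plan.erase(it);  // created and deleted again: nothing to back up
        } else {
          plan[op.name] = BackupAction::WriteDeleteMarker;
        }
        break;
    }
  }
  return plan;
}
```

In this model a file that is created and deleted within the parsed part of the log leaves no pending action, which is why the failure in this report only appears when the Delete record falls outside the part of the log that was read.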

Before the MDEV-25666 fix, mariabackup ignored temporary and intermediate tables, and backup_fix_ddl() worked well. After the fix it no longer works, because operations on intermediate files are not protected by the MDL_BACKUP_DDL lock.

It is not obvious to me how to fix this, or whether it is a bug at all. The transformation of a #sql-* file into a real tablespace should not happen, and some BACKUP STAGE or FTWRL should prevent that last rename during backup (see MDEV-5336). It might be that the errors in the MDEV-25666 log should not be treated as errors.

The test: https://github.com/MariaDB/server/tree/10.6-MDEV-25899



 Comments   
Comment by Marko Mäkelä [ 2021-06-16 ]

For native ALTER TABLE we also have the problem that rollback is not protected by backup locks at all. The original reasoning why the rollback is not necessarily protected by MDL is that during the final phase of ALTER TABLE…LOCK=NONE the lock upgrade to MDL_EXCLUSIVE may time out, and as a result we would want to roll back the operation. This rollback may delete and rename files if the operation was a table-rebuilding ALTER.
Relevant code in sql_table.cc:

  /* Set MDL_BACKUP_DDL */
  if (backup_reset_alter_copy_lock(thd))
    goto rollback;

We would invoke the rollback if there is a conflict with a backup lock, instead of waiting for the backup lock!
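
The difference between rolling back on a conflict and waiting for the backup lock can be illustrated with a toy model. Everything in it is hypothetical (the ToyServer struct, the OnConflict policy and the finish_alter() function); it does not reproduce the server's MDL code and only contrasts the two behaviours.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical policies for a conflict between ALTER TABLE and a backup lock.
enum class OnConflict { RollBack, WaitForBackup };

struct ToyServer {
  bool backup_holds_lock = false;   // true while the backup blocks DDL
  std::vector<std::string> events;  // what the ALTER did, in order
  int file_ops_during_backup = 0;   // file changes made while DDL was blocked
};

// Models the final phase of a table-rebuilding ALTER TABLE.
void finish_alter(ToyServer& server, OnConflict policy) {
  if (server.backup_holds_lock) {
    if (policy == OnConflict::RollBack) {
      // The rollback path is taken without holding the DDL backup lock,
      // so its file operation happens while the backup is still running.
      server.events.push_back("rollback: delete intermediate file");
      ++server.file_ops_during_backup;
      return;
    }
    server.events.push_back("wait for backup lock release");
    server.backup_holds_lock = false;  // models the backup completing
  }
  assert(!server.backup_holds_lock);
  server.events.push_back("commit: rename intermediate file");
}
```

With the RollBack policy the model records one file operation while the backup lock is held; with WaitForBackup it records none, because the rename happens only after the backup has released its lock.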

Comment by Marko Mäkelä [ 2021-06-16 ]

I expect that MDEV-25854 can fully address most problems related to restoring a backup where a DDL operation on the server was blocked by backup locks. However, I expect that if the server was running a rollback of a native ALTER TABLE during the last phase of the backup, we can end up with a corrupted backup.

Comment by Sergei Golubchik [ 2021-06-20 ]

It seems that "rollback of native ALTER TABLE" isn't in the scope of this bug report. This was specifically about #sql temporary files created by ALTER TABLE ... ALGORITHM=COPY. And this was, apparently, fixed by MDEV-25854.
