MariaDB Server / MDEV-25662

Atomic DDL: InnoDB: Expected tablespace id ... but found 0 upon crash recovery

Details

    Description

      The failure comes in two flavors.

      One looks like this:

      bb-10.6-monty-innodb f179fdefd1

      2021-05-11 19:14:17 0 [ERROR] InnoDB: Expected tablespace id 67 but found 0 in the file ./test/tt13.ibd
      2021-05-11 19:14:17 0 [Note] DDL_LOG: Crash recovery executed 2 entries
      

That is, the error message is thrown once, and then recovery finishes. I'm not sure whether it indicates a real problem or is just a bogus message here.

      Another one looks like this:

      2021-05-12  3:39:01 0 [Note] InnoDB: Creating #2 encryption thread id 140169671862016 total threads 4.
      2021-05-12  3:39:01 0 [Note] InnoDB: Creating #3 encryption thread id 140169663469312 total threads 4.
      2021-05-12  3:39:01 0 [Note] InnoDB: Creating #4 encryption thread id 140169655076608 total threads 4.
      2021-05-12  3:39:01 0 [Note] InnoDB: Buffer pool(s) load completed at 210512  3:39:01
      2021-05-12  3:39:01 0 [ERROR] InnoDB: Expected tablespace id 10 but found 0 in the file ./test/D.ibd
      2021-05-12  3:39:01 0 [ERROR] InnoDB: Expected tablespace id 10 but found 0 in the file ./test/D.ibd
      2021-05-12  3:39:01 0 [ERROR] InnoDB: Expected tablespace id 10 but found 0 in the file ./test/D.ibd
      ...
      2021-05-12  3:39:01 0 [ERROR] InnoDB: Expected tablespace id 10 but found 0 in the file ./test/D.ibd
      

Thousands of identical messages are produced within a second. I don't know whether it is an endless loop or whether it would eventually finish, because the test killed the server after encountering these messages.

In the first case, the server was running DDL in 2 threads when it was killed with SIGKILL. According to the general log, the last queries were:

      27 Query    DROP /* QNO 16561 CON_ID 27 */ TABLE IF EXISTS vv11, tt13
      26 Query    ALTER /* QNO 507 CON_ID 26 */ TABLE t3 ENGINE=InnoDB, ADD COLUMN IF NOT EXISTS icol3 MEDIUMINT UNSIGNED ZEROFILL DEFAULT 2, LOCK=SHARED
      

In the second case, a previously taken MariaBackup backup was being restored.

Logs and data directories are available for both cases. rr traces are available for the first one.

At least the recovery failure on the previously created datadir from case 1 is reproducible on bb-10.6-monty-innodb 98ff31770.
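
For reference, the "Expected tablespace id ... but found 0" message is printed when the id that InnoDB expects for the file does not match the id stored in the file's own first page; "found 0" means page 0 of the .ibd carries space id 0, as it would if the page had never been fully initialized. Below is a minimal standalone sketch (not InnoDB code; it assumes the classic page-header layout with a 4-byte big-endian space id at byte offset 34 of page 0, FIL_PAGE_SPACE_ID) for inspecting what a file claims about itself:

// Minimal standalone sketch (not InnoDB code): print the tablespace id that
// an .ibd file's first page claims to have. Assumes the classic InnoDB page
// header layout, with a 4-byte big-endian space id at byte offset 34
// (FIL_PAGE_SPACE_ID) of page 0; "found 0" in the error above corresponds
// to this field being 0.
#include <cstdint>
#include <cstdio>
#include <fstream>

int main(int argc, char **argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <path/to/table.ibd>\n", argv[0]);
    return 1;
  }
  std::ifstream f(argv[1], std::ios::binary);
  unsigned char hdr[38];
  if (!f.read(reinterpret_cast<char *>(hdr), sizeof hdr)) {
    std::fprintf(stderr, "cannot read the page 0 header of %s\n", argv[1]);
    return 1;
  }
  // 32-bit big-endian value at offset 34 of page 0.
  const uint32_t space_id = uint32_t(hdr[34]) << 24 | uint32_t(hdr[35]) << 16 |
                            uint32_t(hdr[36]) << 8 | uint32_t(hdr[37]);
  std::printf("space id stored in %s: %u\n", argv[1], space_id);
  return 0;
}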

          Activity

marko Marko Mäkelä added a comment:

Neither trace in the -1 directory works for me. The more interesting one (the recovery) ends prematurely:

            [ERROR /home/roc/rr/rr/src/TraceStream.cc:1122:read_mapped_region()] Metadata of /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/sql/mariadbd changed: replay divergence likely, but continuing anyway. inode: 81948310/83127419; mode: 33277/33277; uid: 1001/1001; gid: 1001/1001; size: 181508408/181508408; mtime: 1620826716/1620752489
            Version: '10.6.1-MariaDB-debug-log'  socket: '/home/mdbe/atomic_ddl/workdirs/worker1/job14/current1_1/mysql.sock'  port: 10510  Source distribution
            [FATAL /home/roc/rr/rr/src/ReplaySession.cc:762:guard_unexpected_signal()] 
             (task 36504 (rec:19071) at time 136822)
             -> Assertion `false' failed to hold. Replay got unrecorded signal SIGSEGV while awaiting signal
            

            In the -2 directory I see no rr replay traces.
But the backup directories or a copy of the data directory could be enough to diagnose this.


elenst Elena Stepanova added a comment:

marko,

I've checked the traces in the -1 directory.
The first one (before recovery) seems to work, the plugin error flood aside. At least it goes all the way to the SIGKILL, as it should. I've also created a packed copy; packing seems to help get rid of the annoying errors.
The second one (recovery) indeed doesn't work, but that shouldn't matter: recovery fails reliably on the provided datadir, with the new binaries too, so you can always produce a new trace. I've created one for your convenience, rr_upon_recovery, along with its packed copy.


marko Marko Mäkelä added a comment:

Thank you, the -1 directory works for me now. The server was killed in the middle of DROP /* QNO 16561 CON_ID 27 */ TABLE IF EXISTS vv11, tt13. The transaction for removing persistent statistics was durably committed:

            #12 0x00005606c4e5840a in dict_stats_drop_table (
                db_and_table=0x7fe7c523c630 "test/tt13", 
                errstr=0x7fe7c523c190 "\300\301#\305\347\177", errstr_sz=1024)
                at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/dict/dict0stats.cc:3606
            #13 0x00005606c4cb9ba3 in row_drop_table_for_mysql (
                name=0x7fe7c523c630 "test/tt13", trx=0x49327cd686d8, 
                sqlcom=SQLCOM_DROP_TABLE, create_failed=false, nonatomic=true)
                at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/row/row0mysql.cc:3332
            

The dict_stats_drop_table() call returned to row_drop_table_for_mysql(), but row_drop_table_for_mysql() did not finish before the SIGKILL arrived; the signal hit during the execution of fil_delete_tablespace().

This seems to be an omission in MDEV-18518: we are attempting to delete the file before the transaction has been committed. In reality, only the deletion of the persistent statistics for the table had been committed (in a separate transaction, due to a bad design that we inherited from MySQL), but the actual DROP TABLE had not.
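
To make the window concrete, here is a toy model of the ordering (a simplified sketch, not the actual InnoDB code paths; the two booleans merely stand in for fil_delete_tablespace() having run and for the commit record having become durable in the redo log):

// Toy model of the window described above (not InnoDB code). If the process
// is killed after the file deletion but before the commit record becomes
// durable, recovery wants to roll the DROP back while the file is already gone.
#include <cstdio>

struct State {
  bool file_exists = true;     // stand-in for ./test/tt13.ibd being on disk
  bool commit_durable = false; // stand-in for the DROP's commit record in the redo log
};

// Ordering before the fix: delete the file first, then commit.
static State drop_table_before_fix(int kill_after_step) {
  State s;
  if (kill_after_step == 0) return s;
  s.file_exists = false;              // fil_delete_tablespace() equivalent
  if (kill_after_step == 1) return s; // <-- the SIGKILL in this bug lands here
  s.commit_durable = true;            // redo log write of the commit
  return s;
}

static void recover(const State &s) {
  if (!s.commit_durable && !s.file_exists)
    std::puts("rollback needed, but the tablespace file is already gone");
  else
    std::puts("consistent: either nothing happened or the DROP fully committed");
}

int main() {
  for (int kill_after_step = 0; kill_after_step <= 2; kill_after_step++)
    recover(drop_table_before_fix(kill_after_step));
  return 0;
}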


marko Marko Mäkelä added a comment:

Correction: the problem is an omission in MDEV-25506, which I reopened. We must not delete files before the transaction is committed.


marko Marko Mäkelä added a comment:

During the testing of the third part of MDEV-25506 (which was pushed to 10.6 yesterday), this type of failure was no longer found. Before that fix, we would delete .ibd files before the DDL transaction was committed. So, if the server had been killed between the deletion and the redo log write of the commit, the DDL transaction would be rolled back, but the file would already be gone.
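
A self-contained sketch of the corrected ordering (again not the actual InnoDB code; toy_table.ibd and toy_commit.log are hypothetical files standing in for the tablespace file and the redo log): the commit is recorded durably first, and only then is the data file removed, so a kill between the two steps leaves a committed DROP whose file removal can simply be repeated during recovery.

// Sketch of the principle behind the fix (not InnoDB code): record the commit
// durably before removing the data file.
#include <cstdio>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

int main() {
  const fs::path datafile = "toy_table.ibd";   // stand-in for the tablespace file
  const fs::path commitlog = "toy_commit.log"; // stand-in for the redo log

  std::ofstream(datafile) << "table data";

  // Step 1: durably record the commit of the DROP (a real implementation
  // would also fsync the log here before proceeding).
  std::ofstream(commitlog, std::ios::app) << "DROP TABLE toy_table committed\n";

  // Step 2: only now remove the file. If the process dies between step 1 and
  // step 2, recovery reads the commit record and redoes this removal.
  fs::remove(datafile);
  std::puts("file removed only after the commit record was written");
  return 0;
}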


            People

              marko Marko Mäkelä
              elenst Elena Stepanova