MariaDB Server / MDEV-25662

Atomic DDL: InnoDB: Expected tablespace id ... but found 0 upon crash recovery

Details

    Description

      The failure comes in two flavors.

      One looks like this:

      bb-10.6-monty-innodb f179fdefd1

      2021-05-11 19:14:17 0 [ERROR] InnoDB: Expected tablespace id 67 but found 0 in the file ./test/tt13.ibd
      2021-05-11 19:14:17 0 [Note] DDL_LOG: Crash recovery executed 2 entries
      

That is, the error message is thrown once, and then recovery finishes. I'm not sure whether it indicates a real problem or is just a bogus message here.

      Another one looks like this:

      2021-05-12  3:39:01 0 [Note] InnoDB: Creating #2 encryption thread id 140169671862016 total threads 4.
      2021-05-12  3:39:01 0 [Note] InnoDB: Creating #3 encryption thread id 140169663469312 total threads 4.
      2021-05-12  3:39:01 0 [Note] InnoDB: Creating #4 encryption thread id 140169655076608 total threads 4.
      2021-05-12  3:39:01 0 [Note] InnoDB: Buffer pool(s) load completed at 210512  3:39:01
      2021-05-12  3:39:01 0 [ERROR] InnoDB: Expected tablespace id 10 but found 0 in the file ./test/D.ibd
      2021-05-12  3:39:01 0 [ERROR] InnoDB: Expected tablespace id 10 but found 0 in the file ./test/D.ibd
      2021-05-12  3:39:01 0 [ERROR] InnoDB: Expected tablespace id 10 but found 0 in the file ./test/D.ibd
      ...
      2021-05-12  3:39:01 0 [ERROR] InnoDB: Expected tablespace id 10 but found 0 in the file ./test/D.ibd
      

Thousands of identical messages are produced within a second. I don't know whether it is an endless loop or whether it would eventually finish, because the test killed the server after encountering these messages.

In the first case, the server was running DDL in 2 threads when it was killed with SIGKILL. According to the general log, the last queries were:

      27 Query    DROP /* QNO 16561 CON_ID 27 */ TABLE IF EXISTS vv11, tt13
      26 Query    ALTER /* QNO 507 CON_ID 26 */ TABLE t3 ENGINE=InnoDB, ADD COLUMN IF NOT EXISTS icol3 MEDIUMINT UNSIGNED ZEROFILL DEFAULT 2, LOCK=SHARED
      

In the second case, a previously taken MariaBackup backup was being restored.

Logs and data directories are available for both cases. rr traces are available for the first one.

At least the recovery failure on the previously created datadir from case 1 is reproducible on bb-10.6-monty-innodb 98ff31770.
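
For reference, the "Expected tablespace id ... but found 0" message is printed when the id that InnoDB expects for the file does not match the id stored in the file's own first page; "found 0" means page 0 of the .ibd carries space id 0, as it would if the page had never been fully initialized. Below is a minimal standalone sketch (not InnoDB code; it assumes the classic page-header layout with a 4-byte big-endian space id at byte offset 34 of page 0, FIL_PAGE_SPACE_ID) for inspecting what a file claims about itself:

// Minimal standalone sketch (not InnoDB code): print the tablespace id that
// an .ibd file's first page claims to have. Assumes the classic InnoDB page
// header layout, with a 4-byte big-endian space id at byte offset 34
// (FIL_PAGE_SPACE_ID) of page 0; "found 0" in the error above corresponds
// to this field being 0.
#include <cstdint>
#include <cstdio>
#include <fstream>

int main(int argc, char **argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <path/to/table.ibd>\n", argv[0]);
    return 1;
  }
  std::ifstream f(argv[1], std::ios::binary);
  unsigned char hdr[38];
  if (!f.read(reinterpret_cast<char *>(hdr), sizeof hdr)) {
    std::fprintf(stderr, "cannot read the page 0 header of %s\n", argv[1]);
    return 1;
  }
  // 32-bit big-endian value at offset 34 of page 0.
  const uint32_t space_id = uint32_t(hdr[34]) << 24 | uint32_t(hdr[35]) << 16 |
                            uint32_t(hdr[36]) << 8 | uint32_t(hdr[37]);
  std::printf("space id stored in %s: %u\n", argv[1], space_id);
  return 0;
}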

          Activity

marko Marko Mäkelä added a comment:

Neither trace in the -1 directory works for me. The more interesting one (the recovery) ends prematurely:

            [ERROR /home/roc/rr/rr/src/TraceStream.cc:1122:read_mapped_region()] Metadata of /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/sql/mariadbd changed: replay divergence likely, but continuing anyway. inode: 81948310/83127419; mode: 33277/33277; uid: 1001/1001; gid: 1001/1001; size: 181508408/181508408; mtime: 1620826716/1620752489
            Version: '10.6.1-MariaDB-debug-log'  socket: '/home/mdbe/atomic_ddl/workdirs/worker1/job14/current1_1/mysql.sock'  port: 10510  Source distribution
            [FATAL /home/roc/rr/rr/src/ReplaySession.cc:762:guard_unexpected_signal()] 
             (task 36504 (rec:19071) at time 136822)
             -> Assertion `false' failed to hold. Replay got unrecorded signal SIGSEGV while awaiting signal
            

            In the -2 directory I see no rr replay traces.
But the backup directories or a copy of the data directory could be enough to diagnose this.


elenst Elena Stepanova added a comment:

marko,

I've checked the traces in the -1 directory.
The first one (before recovery) seems to work, the plugin error flood aside. At least it goes all the way to the SIGKILL, as it should. I've also created a packed copy; packing seems to help get rid of the annoying errors.
The second one (recovery) indeed doesn't work, but that shouldn't matter: recovery fails reliably on the provided datadir, with the new binaries too, so you can always produce a new trace. I've created one for your convenience, rr_upon_recovery, along with its packed copy.


marko Marko Mäkelä added a comment:

Thank you, the -1 directory works for me now. The server was killed in the middle of DROP /* QNO 16561 CON_ID 27 */ TABLE IF EXISTS vv11, tt13. The transaction for removing persistent statistics was durably committed:

            #12 0x00005606c4e5840a in dict_stats_drop_table (
                db_and_table=0x7fe7c523c630 "test/tt13", 
                errstr=0x7fe7c523c190 "\300\301#\305\347\177", errstr_sz=1024)
                at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/dict/dict0stats.cc:3606
            #13 0x00005606c4cb9ba3 in row_drop_table_for_mysql (
                name=0x7fe7c523c630 "test/tt13", trx=0x49327cd686d8, 
                sqlcom=SQLCOM_DROP_TABLE, create_failed=false, nonatomic=true)
                at /home/mdbe/atomic_ddl/bb-10.6-monty-for-rr/storage/innobase/row/row0mysql.cc:3332
            

The dict_stats_drop_table() call returned to row_drop_table_for_mysql(), but row_drop_table_for_mysql() did not finish before the SIGKILL arrived; the signal hit during the execution of fil_delete_tablespace().

This seems to be an omission in MDEV-18518: we are attempting to delete the file before the transaction has been committed. In reality, only the deletion of the persistent statistics for the table had been committed (in a separate transaction, due to a bad design that we inherited from MySQL), but the actual DROP TABLE had not.
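
To make the window concrete, here is a toy model of the ordering (a simplified sketch, not the actual InnoDB code paths; the two booleans merely stand in for fil_delete_tablespace() having run and for the commit record having become durable in the redo log):

// Toy model of the window described above (not InnoDB code). If the process
// is killed after the file deletion but before the commit record becomes
// durable, recovery wants to roll the DROP back while the file is already gone.
#include <cstdio>

struct State {
  bool file_exists = true;     // stand-in for ./test/tt13.ibd being on disk
  bool commit_durable = false; // stand-in for the DROP's commit record in the redo log
};

// Ordering before the fix: delete the file first, then commit.
static State drop_table_before_fix(int kill_after_step) {
  State s;
  if (kill_after_step == 0) return s;
  s.file_exists = false;              // fil_delete_tablespace() equivalent
  if (kill_after_step == 1) return s; // <-- the SIGKILL in this bug lands here
  s.commit_durable = true;            // redo log write of the commit
  return s;
}

static void recover(const State &s) {
  if (!s.commit_durable && !s.file_exists)
    std::puts("rollback needed, but the tablespace file is already gone");
  else
    std::puts("consistent: either nothing happened or the DROP fully committed");
}

int main() {
  for (int kill_after_step = 0; kill_after_step <= 2; kill_after_step++)
    recover(drop_table_before_fix(kill_after_step));
  return 0;
}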


marko Marko Mäkelä added a comment:

Correction: the problem is an omission in MDEV-25506, which I reopened. We must not delete files before the transaction is committed.


marko Marko Mäkelä added a comment:

During the testing of the third part of MDEV-25506 (which was pushed to 10.6 yesterday), this type of failure was no longer found. Before that fix, we would delete .ibd files before the DDL transaction was committed. So, if the server had been killed between the deletion and the redo log write of the commit, the DDL transaction would be rolled back, but the file would already be gone.
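
A self-contained sketch of the corrected ordering (again not the actual InnoDB code; toy_table.ibd and toy_commit.log are hypothetical files standing in for the tablespace file and the redo log): the commit is recorded durably first, and only then is the data file removed, so a kill between the two steps leaves a committed DROP whose file removal can simply be repeated during recovery.

// Sketch of the principle behind the fix (not InnoDB code): record the commit
// durably before removing the data file.
#include <cstdio>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

int main() {
  const fs::path datafile = "toy_table.ibd";   // stand-in for the tablespace file
  const fs::path commitlog = "toy_commit.log"; // stand-in for the redo log

  std::ofstream(datafile) << "table data";

  // Step 1: durably record the commit of the DROP (a real implementation
  // would also fsync the log here before proceeding).
  std::ofstream(commitlog, std::ios::app) << "DROP TABLE toy_table committed\n";

  // Step 2: only now remove the file. If the process dies between step 1 and
  // step 2, recovery reads the commit record and redoes this removal.
  fs::remove(datafile);
  std::puts("file removed only after the commit record was written");
  return 0;
}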


            People

              marko Marko Mäkelä
              elenst Elena Stepanova