Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-28923

atomic.rename_table occasionally fails

Details

    Description

      While MDEV-28870 reduced the failure probability of test atomic.rename_table, the test still occasionally fails, especially on the kvm-ubsan builder. Here is the relevant part of one failure:

      10.6 0fa19fdebf0925be6ec5503938d541332f259cb5

      CURRENT_TEST: atomic.rename_table
      mysqltest: At line 155: query 'let $res=`select t1.a+t2.b+t3.c+t4.d from t1,t2,t3,t4`' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
      2022-06-22  8:00:48 0 [ERROR] InnoDB: Tablespace 5 was not found at ./test/t5.ibd.
      2022-06-22  8:00:48 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore this and to permanently lose all changes to the tablespace.
      2022-06-22  8:00:48 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1447] with error Tablespace not found
      

      I would need at least a copy of a data directory that failed to recover. Preferrably from 10.8 or a later branch, because the MDEV-14425 log file format is easier to read, thanks to the lack of any block framing. An rr record trace of the killed server could significantly ease the analysis.

      Attachments

        Issue Links

          Activity

            I checked a copy of the data directory for a 10.9 kvm-ubsan failure that had failed like this:

            10.9 f421d8f50dc6f730b1d01b0443880178e550e03a

            CURRENT_TEST: atomic.rename_table
            mysqltest: At line 155: query 'let $res=`select t1.a+t2.b+t3.c+t4.d from t1,t2,t3,t4`' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
            ***Warnings generated in error logs during shutdown after running tests: atomic.rename_table
             
            2022-06-22 16:21:45 0 [ERROR] InnoDB: Tablespace 8 was not found at ./test/t5.ibd.
            

            wget https://hasky.askmonty.org/logs/kvm-ubsan/1824-var.tar.gz
            tar xzf 1824-var.tar.gz
            cd 10.9
            rsync -avil --delete ../var/log/atomic.rename_table-innodb/mysqld.1/data/ /dev/shm/data/
            sql/mariadbd --datadir /dev/shm/data
            

            data.tar.xz is a copy of the data directory, for posterity. I think that the original copy will be available for a limited time only.

            The first mention of the name test/t5 in the ib_logfile0 looks like a WRITE|0x80 record for SYS_TABLES at LSN (and file byte offset) 0x10e71, presumably updating an earlier record:

            00010e70: 80bf 030b 0201 0774 6573 742f 7435 7800  .......test/t5x.
            00010e80: 80bf 004e b296 cd01 7c68 b37f 2029 0008  ...N....|h.. )..
            

            The second mention is a FILE_RENAME mini-transaction at 0x10f00 (the last 5 bytes are the end marker and the checksum):

            00010f00: a00f 0500 2e2f 7465 7374 2f74 312e 6962  ...../test/t1.ib
            00010f10: 6400 2e2f 7465 7374 2f74 352e 6962 6401  d../test/t5.ibd.
            00010f20: 935c 03a1
            

            Note: At this point, the tablespace ID was 5.

            There is quite a bit of renaming going on. It is best to search the ib_logfile0 for test/t5.ibd and not test/t5 (which would be updates of SYS_TABLES.NAME). There is also a FILE_RENAME for renaming test/t5.ibd back to test/t1.ibd as tablespace 5.

            The last record that mentions test/t5.ibd is a FILE_RENAME of test/t4.ibd to test/t5.ibd:

            00077610: ____ ____ ____ ____ __a0 0f08 002e 2f74  ............../t
            00077620: 6573 742f 7434 2e69 6264 002e 2f74 6573  est/t4.ibd../tes
            00077630: 742f 7435 2e69 6264 0144 89f4 65__ ____  t/t5.ibd........
            

            The tablespace identifier here is 8, just like in the message.

            All data files test/t1.ibd, test/t2.ibd, test/t4.ibd, test/t5.ibd in the data directory are 6*16384 bytes long and consist of NUL bytes.

            The last checkpoint is 0xec63, and it seems to be straight at the end of the ./mtr --bootstrap. All of the test/ tables were created after that, so the data directory should be fully recoverable.

            thiru, it looks like the deferred recovery logic of MDEV-24626 still has some trouble with rename operations.

            marko Marko Mäkelä added a comment - I checked a copy of the data directory for a 10.9 kvm-ubsan failure that had failed like this: 10.9 f421d8f50dc6f730b1d01b0443880178e550e03a CURRENT_TEST: atomic.rename_table mysqltest: At line 155: query 'let $res=`select t1.a+t2.b+t3.c+t4.d from t1,t2,t3,t4`' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB' … ***Warnings generated in error logs during shutdown after running tests: atomic.rename_table   2022-06-22 16:21:45 0 [ERROR] InnoDB: Tablespace 8 was not found at ./test/t5.ibd. wget https://hasky.askmonty.org/logs/kvm-ubsan/1824-var.tar.gz tar xzf 1824-var.tar.gz cd 10.9 rsync -avil --delete ../var/log/atomic.rename_table-innodb/mysqld.1/data/ /dev/shm/data/ sql/mariadbd --datadir /dev/shm/data data.tar.xz is a copy of the data directory, for posterity. I think that the original copy will be available for a limited time only. The first mention of the name test/t5 in the ib_logfile0 looks like a WRITE|0x80 record for SYS_TABLES at LSN (and file byte offset) 0x10e71, presumably updating an earlier record: 00010e70: 80bf 030b 0201 0774 6573 742f 7435 7800 .......test/t5x. 00010e80: 80bf 004e b296 cd01 7c68 b37f 2029 0008 ...N....|h.. ).. The second mention is a FILE_RENAME mini-transaction at 0x10f00 (the last 5 bytes are the end marker and the checksum): 00010f00: a00f 0500 2e2f 7465 7374 2f74 312e 6962 ...../test/t1.ib 00010f10: 6400 2e2f 7465 7374 2f74 352e 6962 6401 d../test/t5.ibd. 00010f20: 935c 03a1 Note: At this point, the tablespace ID was 5. There is quite a bit of renaming going on. It is best to search the ib_logfile0 for test/t5.ibd and not test/t5 (which would be updates of SYS_TABLES.NAME ). There is also a FILE_RENAME for renaming test/t5.ibd back to test/t1.ibd as tablespace 5. The last record that mentions test/t5.ibd is a FILE_RENAME of test/t4.ibd to test/t5.ibd : 00077610: ____ ____ ____ ____ __a0 0f08 002e 2f74 ............../t 00077620: 6573 742f 7434 2e69 6264 002e 2f74 6573 est/t4.ibd../tes 00077630: 742f 7435 2e69 6264 0144 89f4 65__ ____ t/t5.ibd........ The tablespace identifier here is 8, just like in the message. All data files test/t1.ibd , test/t2.ibd , test/t4.ibd , test/t5.ibd in the data directory are 6*16384 bytes long and consist of NUL bytes. The last checkpoint is 0xec63, and it seems to be straight at the end of the ./mtr --bootstrap . All of the test/ tables were created after that, so the data directory should be fully recoverable. thiru , it looks like the deferred recovery logic of MDEV-24626 still has some trouble with rename operations.

            I debugged an rr replay trace of the failed recovery, checking all occurrences of fil_name_process for space_id==8. Recovery starts scanning the log at the following records that were written by fil_names_clear() corresponding to the latest checkpoint:

            00076280: 0300 58ee 42f4 0181 2a0c 73b0 1201 002e  ..X.B...*.s.....
            00076290: 2f6d 7973 716c 2f69 6e6e 6f64 625f 7461  /mysql/innodb_ta
            000762a0: 626c 655f 7374 6174 732e 6962 64b0 1202  ble_stats.ibd...
            000762b0: 002e 2f6d 7973 716c 2f69 6e6e 6f64 625f  ../mysql/innodb_
            000762c0: 696e 6465 785f 7374 6174 732e 6962 64b0  index_stats.ibd.
            000762d0: 0105 002e 2f74 6573 742f 7431 2e69 6264  ..../test/t1.ibd
            000762e0: b001 0600 2e2f 7465 7374 2f74 322e 6962  ...../test/t2.ib
            000762f0: 64b0 0107 002e 2f74 6573 742f 7433 2e69  d...../test/t3.i
            00076300: 6264 b001 0800 2e2f 7465 7374 2f74 342e  bd...../test/t4.
            00076310: 6962 64fa 0000 0000 0000 0000 ec63 01ff  ibd..........c..
            00076320: 3078 e846 0008 82be 0600 b200 80c3 0006  0x.F............
            

            Then it rewinds to the checkpoint LSN (0xec63) and scans all records. There is quite a bit of renaming back and forth between t4 and t5. Here are all scanned FILE_ records for tablespace 8:

            LSN LSN (hexadecimal) operation file name(s)
            483979 0x7628b FILE_MODIFY t4
            66182 0x10286 FILE_CREATE t4
            66204 0x1029c FILE_MODIFY t4
            180948 0x2c2d4 FILE_RENAME t4 to t5
            182463 0x2c8bf FILE_RENAME t5 to t4
            237607 0x3a027 FILE_RENAME t4 to t5
            239131 0x3a61b FILE_RENAME t5 to t4
            256208 0x3e8d0 FILE_RENAME t4 to t5
            260787 0x3fab3 FILE_RENAME t5 to t4
            316432 0x4d410 FILE_RENAME t4 to t5
            318110 0x4da9e FILE_RENAME t5 to t4
            335362 0x51e02 FILE_RENAME t4 to t5
            339791 0x52f4f FILE_RENAME t5 to t4
            394187 0x603cb FILE_RENAME t4 to t5
            395712 0x609c0 FILE_RENAME t5 to t4
            412313 0x64a99 FILE_RENAME t4 to t5
            416784 0x65c10 FILE_RENAME t5 to t4
            470565 0x72e25 FILE_RENAME t4 to t5
            472090 0x7341a FILE_RENAME t5 to t4
            483979 0x7628b FILE_MODIFY t4
            488985 0x77619 FILE_RENAME t4 to t5
            marko Marko Mäkelä added a comment - I debugged an rr replay trace of the failed recovery, checking all occurrences of fil_name_process for space_id==8 . Recovery starts scanning the log at the following records that were written by fil_names_clear() corresponding to the latest checkpoint: 00076280: 0300 58ee 42f4 0181 2a0c 73b0 1201 002e ..X.B...*.s..... 00076290: 2f6d 7973 716c 2f69 6e6e 6f64 625f 7461 /mysql/innodb_ta 000762a0: 626c 655f 7374 6174 732e 6962 64b0 1202 ble_stats.ibd... 000762b0: 002e 2f6d 7973 716c 2f69 6e6e 6f64 625f ../mysql/innodb_ 000762c0: 696e 6465 785f 7374 6174 732e 6962 64b0 index_stats.ibd. 000762d0: 0105 002e 2f74 6573 742f 7431 2e69 6264 ..../test/t1.ibd 000762e0: b001 0600 2e2f 7465 7374 2f74 322e 6962 ...../test/t2.ib 000762f0: 64b0 0107 002e 2f74 6573 742f 7433 2e69 d...../test/t3.i 00076300: 6264 b001 0800 2e2f 7465 7374 2f74 342e bd...../test/t4. 00076310: 6962 64fa 0000 0000 0000 0000 ec63 01ff ibd..........c.. 00076320: 3078 e846 0008 82be 0600 b200 80c3 0006 0x.F............ Then it rewinds to the checkpoint LSN (0xec63) and scans all records. There is quite a bit of renaming back and forth between t4 and t5 . Here are all scanned FILE_ records for tablespace 8: LSN LSN (hexadecimal) operation file name(s) 483979 0x7628b FILE_MODIFY t4 66182 0x10286 FILE_CREATE t4 66204 0x1029c FILE_MODIFY t4 180948 0x2c2d4 FILE_RENAME t4 to t5 182463 0x2c8bf FILE_RENAME t5 to t4 237607 0x3a027 FILE_RENAME t4 to t5 239131 0x3a61b FILE_RENAME t5 to t4 256208 0x3e8d0 FILE_RENAME t4 to t5 260787 0x3fab3 FILE_RENAME t5 to t4 316432 0x4d410 FILE_RENAME t4 to t5 318110 0x4da9e FILE_RENAME t5 to t4 335362 0x51e02 FILE_RENAME t4 to t5 339791 0x52f4f FILE_RENAME t5 to t4 394187 0x603cb FILE_RENAME t4 to t5 395712 0x609c0 FILE_RENAME t5 to t4 412313 0x64a99 FILE_RENAME t4 to t5 416784 0x65c10 FILE_RENAME t5 to t4 470565 0x72e25 FILE_RENAME t4 to t5 472090 0x7341a FILE_RENAME t5 to t4 483979 0x7628b FILE_MODIFY t4 488985 0x77619 FILE_RENAME t4 to t5

            The test is executing a multi-table RENAME, with crashes injected after different numbers of performed operations. The log file of the execution said:

            10.9 f421d8f50dc6f730b1d01b0443880178e550e03a

            "engine: innodb  crash point: ddl_log_rename_after_stat_tables  position: 3"
            "engine: innodb  crash point: ddl_log_rename_after_stat_tables  position: 4"
            "engine: innodb  crash point: ddl_log_rename_after_stat_tables  position: 5"
             
            More results from queries before failure can be found in /dev/shm/var/2/log/rename_table.log
            

            This indicates that we successfully recovered from earlier injected crashes, and the final one, an injected crash after the very last RENAME operation, failed to recover:

                  --error 0,2013
                  rename table t1 to t5, t2 to t1, t5 to t2, t4 to t5, t3 to t4;
            

            In case no crash was injected, the operation would be rolled back by executing the following:

                    rename table t4 to t3, t5 to t4, t2 to t5, t1 to t2, t5 to t1;
            

            This operation would swap t1 with t2 and ‘increment’ the names of t3 and t4.
            The tablespace identifiers and file names before and after the first multi-table RENAME are expected to be as follows:
            Before: (5,t1),(6,t2),(7,t3),(8,t4)
            After: (5,t2),(6,t1),(7,t4),(8,t5)
            The recv_spaces at the time the error message is displayed agrees with this, but deferred_spaces is missing any entry for the tablespace identifier 8 or the file name t5.
            There were no FILE_DELETE operations in the log. The test simply creates 4 tables and then renames them back and forth using the above statements, with injected crashes in between. Some data in the tables is being modified as well.

            marko Marko Mäkelä added a comment - The test is executing a multi-table RENAME , with crashes injected after different numbers of performed operations. The log file of the execution said: 10.9 f421d8f50dc6f730b1d01b0443880178e550e03a … "engine: innodb crash point: ddl_log_rename_after_stat_tables position: 3" "engine: innodb crash point: ddl_log_rename_after_stat_tables position: 4" "engine: innodb crash point: ddl_log_rename_after_stat_tables position: 5"   More results from queries before failure can be found in /dev/shm/var/2/log/rename_table.log This indicates that we successfully recovered from earlier injected crashes, and the final one, an injected crash after the very last RENAME operation, failed to recover: --error 0,2013 rename table t1 to t5, t2 to t1, t5 to t2, t4 to t5, t3 to t4; In case no crash was injected, the operation would be rolled back by executing the following: rename table t4 to t3, t5 to t4, t2 to t5, t1 to t2, t5 to t1; This operation would swap t1 with t2 and ‘increment’ the names of t3 and t4. The tablespace identifiers and file names before and after the first multi-table RENAME are expected to be as follows: Before: (5,t1),(6,t2),(7,t3),(8,t4) After: (5,t2),(6,t1),(7,t4),(8,t5) The recv_spaces at the time the error message is displayed agrees with this, but deferred_spaces is missing any entry for the tablespace identifier 8 or the file name t5 . There were no FILE_DELETE operations in the log. The test simply creates 4 tables and then renames them back and forth using the above statements, with injected crashes in between. Some data in the tables is being modified as well.

            Thanks to some debugging by thiru, who was able to reproduce the issue locally, the problem was that fil_name_process() was wrongly avoiding a call to fil_ibd_load() when the tablespace identifier occurred in deferred_spaces.

            marko Marko Mäkelä added a comment - Thanks to some debugging by thiru , who was able to reproduce the issue locally, the problem was that fil_name_process() was wrongly avoiding a call to fil_ibd_load() when the tablespace identifier occurred in deferred_spaces .

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.