Details

    Description

      dump/load, check table, drop table, create table ... as ... all seem to crash the server.

      [ERROR] [FATAL] InnoDB: SYS_COLUMNS.TABLE_ID mismatch

          Activity

            elenst, MDEV-11585 introduced in 10.2.3 a bug that was fixed in 10.2.5 by MDEV-11927.
            If manttila originally created the database in 10.2.3 or 10.2.4 and then upgraded to 10.2.5 or later, it is theoretically possible that this is somehow a duplicate of MDEV-11927.

            The symptom of MDEV-11927 was that the secondary index on SYS_TABLES.ID was corrupted, because a buffered delete-mark or purge operation was not merged.
            Here, the symptom is that there is a mismatch reported on SYS_COLUMNS.TABLE_ID, apparently because we have a SYS_TABLES record with no matching records in SYS_COLUMNS. MariaDB or InnoDB should not allow the creation of a table with no columns.
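To make the reported condition concrete, here is a minimal sketch of the consistency rule being violated: every SYS_TABLES row should have at least one SYS_COLUMNS row with the same table ID. The in-memory dictionaries, the table names and the helper `tables_without_columns` are hypothetical stand-ins for illustration only; this is not how InnoDB stores or checks its data dictionary.

```python
from typing import Dict, List, Tuple


def tables_without_columns(
    sys_tables: Dict[str, int],
    sys_columns: List[Tuple[int, str]],
) -> List[str]:
    """Return names of tables whose ID has no row in the column list.

    sys_tables maps table name -> table ID (toy model of SYS_TABLES).
    sys_columns holds (table ID, column name) pairs (toy model of SYS_COLUMNS).
    """
    ids_with_columns = {table_id for table_id, _ in sys_columns}
    return sorted(
        name
        for name, table_id in sys_tables.items()
        if table_id not in ids_with_columns
    )


# Hypothetical data: test/t2 has a SYS_TABLES entry but no columns at all.
sys_tables = {"test/t1": 10, "test/t2": 11}
sys_columns = [(10, "a"), (10, "b")]

orphans = tables_without_columns(sys_tables, sys_columns)
print(orphans)
```

In this toy data set, `test/t2` is the kind of record that would trigger the mismatch when its definition is loaded.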

            I do not think that this can be a duplicate of MDEV-11927. When loading a table definition, we would use the clustered index record of SYS_TABLES. So, even if the index on SYS_TABLES.ID was corrupted, it should have nothing to do with this error.

            (Comment above by marko, Marko Mäkelä)

            Sorry, my links are not quite correct there. I meant to say that the same problem was initially reported in MDEV-11894:

            Jan 24 00:44:21 uhu1 mysqld[14620]: 2017-01-24  0:44:21 140172607566592 [ERROR] [FATAL] InnoDB: SYS_COLUMNS.TABLE_ID mismatch
            Jan 24 00:44:21 uhu1 mysqld[14620]: 2017-01-24 00:44:21 0x7f7c7a7b3300  InnoDB: Assertion failure in thread 140172607566592 in file ut0ut.cc line 949
            

            Later it was closed as fixed in scope of MDEV-11927, which, in turn, was said to be caused by MDEV-11585.

            (Comment above by elenst, Elena Stepanova)

            I checked the code, and there is no secondary index defined on SYS_COLUMNS.
            MDEV-11894 reported two problems. One looks the same as this one and could have been introduced by MDEV-11585 and fixed by MDEV-11927.
            But, rgpublic mentioned TRUNCATE TABLE in MDEV-11927. Could it be that the MySQL 5.7 WL#6501 TRUNCATE TABLE is not entirely crash-safe and could cause the ID mismatch?

            I see that the function row_truncate_update_table_id() is updating the table_id in multiple InnoDB data dictionary tables. I seem to remember that dict_load_table() is bypassing the undo log, essentially using the READ UNCOMMITTED isolation level. If that is the case, it would explain the mismatch. To reproduce the problem, we would need a crash in the middle of that function, between the update of the SYS_TABLES and SYS_COLUMNS records.
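The hypothesis in the paragraph above can be sketched as a toy model: the table ID is changed in SYS_TABLES first and in SYS_COLUMNS second, and a reader that ignores transaction boundaries looks at the dictionary in between. Everything here (`update_table_id`, `load_table`, `SimulatedCrash`, the dictionaries) is a hypothetical illustration. It does not model redo or undo logging or the TRUNCATE fix-up performed on restart, so it only shows what the suspected window would look like, not that the window exists in the server.

```python
class SimulatedCrash(Exception):
    """Stands in for a server kill between the two dictionary updates."""


def update_table_id(sys_tables, sys_columns, name, new_id, crash_midway=False):
    """Toy two-step ID change: SYS_TABLES first, then SYS_COLUMNS."""
    old_id = sys_tables[name]
    sys_tables[name] = new_id
    if crash_midway:
        # SYS_COLUMNS still carries old_id at this point.
        raise SimulatedCrash(name)
    for row in sys_columns:
        if row["table_id"] == old_id:
            row["table_id"] = new_id


def load_table(sys_tables, sys_columns, name):
    """Toy loader that reads whatever is there, with no rollback applied."""
    table_id = sys_tables[name]
    columns = [r["name"] for r in sys_columns if r["table_id"] == table_id]
    if not columns:
        raise RuntimeError("SYS_COLUMNS.TABLE_ID mismatch")
    return columns


# Case 1: the update is interrupted after SYS_TABLES was changed.
tables = {"test/t": 10}
columns = [{"table_id": 10, "name": "a"}]
try:
    update_table_id(tables, columns, "test/t", 11, crash_midway=True)
except SimulatedCrash:
    pass
try:
    load_table(tables, columns, "test/t")
    error = None
except RuntimeError as exc:
    error = str(exc)

# Case 2: the update runs to completion, so both tables agree.
tables_ok = {"test/t": 10}
columns_ok = [{"table_id": 10, "name": "a"}]
update_table_id(tables_ok, columns_ok, "test/t", 11)
ok_columns = load_table(tables_ok, columns_ok, "test/t")

print(error, ok_columns)
```

In the model, only the interrupted update leaves the loader without matching columns, which is why a crash between the two record updates would be needed to reproduce the problem this way.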

            It seems to me that MDEV-11927 fixed a crash that could have triggered the bug that demonstrates that the ‘new more crash-safe’ TRUNCATE TABLE is not actually crash-safe.

            manttila, can you confirm if any TRUNCATE TABLE was executed on InnoDB tables prior to this crash?

            (Comment above by marko, Marko Mäkelä)

            With the current 10.2, TRUNCATE TABLE appears to be crash-safe. I tested as follows:

            --source include/have_innodb.inc
            create table t(a int) engine=innodb;
            insert into t values(42);
            truncate table t;
            select * from t;
            drop table t;
            

            I set a breakpoint on row_truncate_update_table_id and once it was reached (on the TRUNCATE statement), also on row_upd.
            The first row_upd() call was for node->table->name = "SYS_TABLES". The second call was for "SYS_COLUMNS". At that point, I did

            call log_write_up_to(log_sys->lsn, true)
            run
            

            to kill and restart the server. On the restart, I got a call to row_upd() from truncate_t::update_root_page_no()/row_truncate_update_sys_tables_during_fix_up()/truncate_t::fixup_tables_in_non_system_tablespace(), and then a call to row_truncate_update_table_id() from row_truncate_update_sys_tables_during_fix_up().

            The only problem that I see in the TRUNCATE recovery is that it is not being skipped if innodb_force_recovery>=3 is specified, and that could cause a lock conflict with the previous attempt of TRUNCATE that was interrupted by a server kill. The TRUNCATE recovery appears to be in the correct place.

            So, after all, it is possible that elenst made the correct conclusion, and that the corruption behind the SYS_COLUMNS.TABLE_ID mismatch actually was fixed by MDEV-11927.

            (Comment above by marko, Marko Mäkelä)

            Back when I fixed MDEV-11927, I did not study the exact mechanism by which it would lead to the fatal error message that no matching SYS_COLUMNS.TABLE_ID is found.
            But still, this looks very much like a duplicate of the problem that was reported in MDEV-11927. The corruption could have been introduced into the data by the 10.2.4 server, and an upgrade to a version with the bug fix (10.2.5 or later) would suffer from the already present corruption.

            (Comment above by marko, Marko Mäkelä)

            People

              marko Marko Mäkelä
              manttila Manu Anttila