MariaDB Server / MDEV-20198

Instant ALTER TABLE is not crash-safe

Details

    Description

      I'm using the official MariaDB mariadb:10.4.6-bionic image and I have a problem: my tables corrupt themselves. This has happened at least four times this month:

      • on two different dev machines using docker-compose
      • on the staging machine using docker swarm

      Every time, the symptom is that a table exists in the information schema but cannot be read or altered at all. Any operation leads to a 'Table doesn't exist in engine' error.
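
      Concretely, the failure looks something like this (using the table from the log below; the exact error text may vary slightly between versions):

      SHOW TABLES LIKE 'specialist';    -- the table is still listed
      SELECT * FROM specialist LIMIT 1;
      -- ERROR 1932 (42S02): Table 'sfdb.specialist' doesn't exist in engine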

      The container logs contain this error at startup:

      [ERROR] InnoDB: Table sfdb.specialist contains unrecognizable instant ALTER metadata

      I've checked the server system variables with

      SHOW VARIABLES;

      alter_algorithm has the value DEFAULT, which means (if I understand the documentation correctly) that it should not use the INSTANT algorithm, only INPLACE or COPY.
      https://mariadb.com/kb/en/library/server-system-variables/#alter_algorithm
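
      A narrower check than SHOW VARIABLES, plus the per-statement ALGORITHM clause the documentation describes (the table t and column c are just examples):

      SHOW GLOBAL VARIABLES LIKE 'alter_algorithm';
      -- a specific algorithm can also be requested per statement:
      ALTER TABLE t ADD COLUMN c INT, ALGORITHM=INPLACE;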

      Trying to dig into the MariaDB codebase led me to this check:
      https://github.com/MariaDB/server/blob/cccfa9dcfe2c161779824c01a84edfa64fc4378a/storage/innobase/btr/btr0cur.cc#L465

      But I do not have enough knowledge of the inner workings of MariaDB to understand what is going on.

      What am I doing wrong? Should I set some flag or configuration option to have a stable database in Docker?

      PS: if this JIRA is not the correct place to ask, can you point me to a more suitable place to post my question?

      Attachments

        1. MDEV-20198.yy
          0.7 kB
        2. REPLAY_SIMP.sh
          11 kB
        3. simp_MDEV-20198.cfg
          43 kB


          Activity

            anel Anel Husakovic added a comment -

            Hi nreynisama,
            Can you provide a test case so we can reproduce the error?
            Thanks

            nreynisama Nicolas Reynis (Inactive) added a comment -

            Sorry, I don't have a test case; we are having a hard time isolating the problem. It just happens sporadically, for no apparent reason.

            I was hoping there might be some misconfiguration on my side or some best practice I failed to follow.

            anel Anel Husakovic added a comment -

            Hi nreynisama,
            I will close this MDEV for now; feel free to reopen it if you find a bug, and please provide a failing test so that it is easier to reproduce and work on.
            Additionally, if you have questions you can use our Zulip chat or IRC (#maria).
            Thanks

            judahnator Judah Wright added a comment -

            I am seeing the same issue with our database. We are using version 10.4.7 on both Ubuntu Bionic and Disco, as well as CentOS.

            On server restart, there is maybe a 20-30% chance that one of two specific tables will get the error described above. It is always one of these two tables.
            Sep 6 13:58:58 homestead mysqld[1693]: 2019-09-06 13:58:58 1 [ERROR] InnoDB: Table `mydatabase`.`agencies` contains unrecognizable instant ALTER metadata

            This is reproducible locally, on a few co-workers' machines, and on a few different cloud providers. It is not just with production data either; if we wipe and re-seed the database with test data, the issue persists.

            I cannot seem to reproduce the problem independent of our specific schema, but I would be happy to share it privately if that might help.

            jgcovas Juan Gabriel Covas added a comment - edited

            Hi, I'm affected by MDEV-18543, where I'm unable to transport InnoDB tables: the schema metadata is wrong / cannot be read by IMPORT TABLESPACE after a DROP COLUMN, or an ADD COLUMN that is not in the last position, is done with the "instant" ALTER algorithm.

            The only way I've found to run 10.4.7 with ALTERs that avoid the "instant" code path is to set alter_algorithm=COPY (sadly), so DEFAULT is not safe (it will use INSTANT in some cases).
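
            A minimal sketch of that workaround; besides the --alter-algorithm=copy server option, the variable can also be set at runtime:

            SET GLOBAL alter_algorithm='COPY';    -- applies to new connections
            SET SESSION alter_algorithm='COPY';   -- applies to the current connection only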

            I just want to point this out in case it is a solution for Nicolas to achieve stability with his MariaDB servers and this reported bug.

            Perhaps some of us can find time to try to reproduce this, but for me it is definitely a matter of doing ALTERs on those tables (with the DEFAULT alter_algorithm) and getting the metadata corrupted as described in this bug report (perhaps also involving a MariaDB instance restart, I don't know...).


            nreynisama Nicolas Reynis (Inactive) added a comment -

            Thanks Juan, I'll give it a shot.

            CvD Christopher v D added a comment -

            So, we are having the exact same problem in our company with MariaDB 10.4.10-1, currently only affecting Linux and macOS environments. It seems to us that putting a machine into sleep mode and waking it corrupts the InnoDB tablespace data in some way. Not only production tables are being corrupted, but internal InnoDB tables as well. We haven't fiddled with any config settings other than log sizes, and we are using lower_case_table_names = 1.
            We switched back to MariaDB 10.3. So far, the problem has not occurred any more.

            nreynisama Nicolas Reynis (Inactive) added a comment - edited

            @Christopher:
            Setting `--alter-algorithm=copy` as Juan suggested solved the problem for us.
            This option was introduced in MariaDB 10.3.7; I suspect that every version since then is unreliable with the default configuration.

            CvD Christopher v D added a comment -

            @Nicolas
            Ty, we will try it.

            marko Marko Mäkelä added a comment -

            I suspect that this ticket reports what was fixed in MariaDB 10.4.7 in MDEV-19916. The corruption of the metadata could remain unnoticed until the server is restarted or the table definition is evicted from, and reloaded into, the InnoDB data dictionary cache.

            jgcovas, you could prevent instantaneous operation and request a table rebuild with the FORCE keyword. That is supported even with ALGORITHM=INPLACE.
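
            For example (t being a placeholder for the affected table):

            ALTER TABLE t FORCE, ALGORITHM=INPLACE;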

            Once the table is corrupted, I am afraid that it is very hard to fix it. Restoring from a SQL dump or a backup should work.


            jgcovas Juan Gabriel Covas added a comment -

            Thanks Marko. I've tested MDEV-18543 under MariaDB 10.4.10 and the problem is still there. Anyway, I also noted that specifying "ALGORITHM" on the ALTER statement does NOT work as expected (if I'm not wrong).

            If the MariaDB server is run with alter_algorithm=COPY, then it does NOT matter if you put ALGORITHM=INSTANT on the ALTER statement: the table is OK afterwards (no instant bug there). If you restart the server with alter_algorithm=DEFAULT, then the ALTER gets the table metadata corrupted (instant bug). This confuses me, but I am following up in case it leads to something (shouldn't the ALTER honor the ALGORITHM keyword?). Setting the session alter_algorithm to INSTANT also "works" (makes the ALTER corrupt the table metadata).
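
            A minimal sketch of the behaviour described above (table t and columns c, d are placeholders; the outcomes are as I observed them, not documented guarantees):

            SET SESSION alter_algorithm='COPY';
            ALTER TABLE t ADD COLUMN c INT, ALGORITHM=INSTANT;   -- still executed as COPY; the table stays OK
            SET SESSION alter_algorithm='DEFAULT';
            ALTER TABLE t ADD COLUMN d INT FIRST;                -- may be executed as INSTANT and corrupt the metadata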


            obissick Oren Bissick (Inactive) added a comment -

            I am having the same issue. It started when I updated from 10.4.8 to 10.4.10; I downgraded back to 10.4.8, but the issue still exists. It also seems to only happen to tables with a dash in the name.

            marko Marko Mäkelä added a comment -

            mleich reproduced this error in internal testing of something else. In that case, the server was killed during an instant column reorder (MDEV-15562) while it was executing in the following stack trace:

            btr_page_reorganize
            btr_cur_pessimistic_update
            innobase_instant_try
            

            At the point of the SIGKILL, the durably written state of the page contained a metadata record stub for which the metadata BLOB had not been written out yet. Therefore, btr_cur_instant_init_low() would fail due to the following:

            		uint len = mach_read_from_4(ptr + BTR_EXTERN_LEN + 4);
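            		/* The metadata BLOB must have a nonzero length,
            		start at offset FIL_PAGE_DATA on its first page,
            		and reside in this index's tablespace; otherwise
            		the metadata is treated as unrecognizable. */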
            		if (!len
            		    || mach_read_from_4(ptr + BTR_EXTERN_OFFSET)
            		    != FIL_PAGE_DATA
            		    || mach_read_from_4(ptr + BTR_EXTERN_SPACE_ID)
            		    != space->id) {
            			goto incompatible;
            		}
            

            I think that the correct fix is that when the metadata BLOB pointer is all-zero, btr_cur_instant_init_low() should assume that the metadata record is in an uncommitted state and pretend that it does not exist. In fact, at that point in time we should have access to the transaction metadata and could easily check whether DB_TRX_ID points to an incomplete transaction.


            marko Marko Mäkelä added a comment -

            Here is the start of a fix.

            diff --git a/storage/innobase/btr/btr0cur.cc b/storage/innobase/btr/btr0cur.cc
            index ceac301a502..3b4200efd9d 100644
            --- a/storage/innobase/btr/btr0cur.cc
            +++ b/storage/innobase/btr/btr0cur.cc
            @@ -536,6 +536,19 @@ static dberr_t btr_cur_instant_init_low(dict_index_t* index, mtr_t* mtr)
             			goto incompatible;
             		}
             
            +		if (!memcmp(ptr,
            +			    field_ref_zero, BTR_EXTERN_FIELD_REF_SIZE)
            +		    && trx_sys.is_registered(
            +			    current_trx(),
            +			    mach_read_from_6(rec + trx_id_offset))) {
            +			/* The server must have been killed during
            +			an instant ALTER TABLE operation after a
            +			metadata record BLOB stub with an
            +			all-zero metadata BLOB pointer was durably written. */
            +			ut_a(!index->is_instant()); // FIXME
            +			return DB_SUCCESS;
            +		}
            +
             		uint len = mach_read_from_4(ptr + BTR_EXTERN_LEN + 4);
             		if (!len
             		    || mach_read_from_4(ptr + BTR_EXTERN_OFFSET)
            

            The fix is incomplete, and the ut_a() assertion would fail for the data directory that I have.

            According to the rr replay of the killed server, an instant ADD COLUMN as well as some column reordering had already been executed on the table before the column-reordering ALTER TABLE started to execute.

            I think that we would have to fetch a previous version of the metadata record where the metadata is available. If a previous version is not found (the server was killed during the very first column-reordering ALTER TABLE), then !index->is_instant() should hold and we can return DB_SUCCESS.

            I seem to remember that dict_load_indexes() is using the READ UNCOMMITTED isolation level. If we have to refer to the preceding version of the metadata record, I think that we would also have to refer to the corresponding version of the data dictionary tables when loading the metadata.

            The correct fix might be that we return a special error code from btr_cur_instant_init_low() and then retry loading the data dictionary definition using the earlier version, and on a subsequent call to btr_cur_instant_init_low() pass some state that tells that the previous version of the metadata record needs to be used.

            For fault injection testing, I think that we must let the server hang at this point:

            diff --git a/storage/innobase/btr/btr0cur.cc b/storage/innobase/btr/btr0cur.cc
            index ceac301a502..93396c558f6 100644
            --- a/storage/innobase/btr/btr0cur.cc
            +++ b/storage/innobase/btr/btr0cur.cc
            @@ -5123,6 +5123,7 @@ btr_cur_pessimistic_update(
             			would have too many fields, and we would be unable to
             			know the size of the freed record. */
             			btr_page_reorganize(page_cursor, index, mtr);
            +			DEBUG_SYNC_C("update_metadata");
             			rec = page_cursor->rec;
             			rec_offs_make_valid(rec, index, true, *offsets);
             			if (page_cursor->block->page.id().page_no()
            

            To test the scenario where the server is killed during the very first MDEV-15562 type operation (such as ALTER TABLE MODIFY COLUMN b FIRST), I think that we can use the existing DEBUG_SYNC point before_row_ins_extern_latch.


            marko Marko Mäkelä added a comment -

            Something like this should cover the scenario where the server is killed during a first-time MDEV-15562 operation:

            --source include/have_innodb.inc
            --source include/have_debug.inc
            --source include/have_debug_sync.inc
            CREATE TABLE t(a INT) ENGINE=InnoDB;
            INSERT INTO t VALUES(1);
            connect (con1,localhost,root,,);
            SET DEBUG_DBUG='+d,row_ins_extern_checkpoint';
            SET DEBUG_SYNC='before_row_ins_extern_latch SIGNAL stuck WAIT_FOR ever';
            send ALTER TABLE t ADD COLUMN b INT FIRST;
            connection default;
            SET DEBUG_SYNC='now WAIT_FOR stuck';
            --let $restart_timeout=0
            --source include/restart_mysqld.inc
            disconnect con1;
            SELECT * FROM t;
            DROP TABLE t;
            

            This test did not deliver the expected result for me: the table was recovered as (b,a)=(NULL,1) instead of (a)=(1).


            marko Marko Mäkelä added a comment -

            This was fixed by MDEV-27234 in MariaDB Server 10.6.8, by implementing the READ COMMITTED isolation level in data dictionary recovery, instead of the previous READ UNCOMMITTED.

            Before 10.6, DDL operations are not crash-safe, not even if we ignore the mismatch between the two data dictionaries (the .frm files and what is stored inside InnoDB). The .frm file mismatch mostly affected ALTER TABLE and was fixed in MDEV-25180. Other DDL crash safety issues in InnoDB were fixed in MDEV-25506 and MDEV-25919. It is not feasible to implement any of those fixes in earlier major releases.


            People

              Assignee: marko Marko Mäkelä
              Reporter: nreynisama Nicolas Reynis (Inactive)
              Votes: 0
              Watchers: 10
