MariaDB Server / MDEV-20198

Instant ALTER TABLE is not crash-safe

Details

    Description

      I'm using the official MariaDB mariadb:10.4.6-bionic image and I have a problem: my tables corrupt themselves. This has happened at least four times this month:

      • on two different dev machines using docker-compose
      • on the staging machine using docker swarm

      Every time the symptom is the same: a table exists in the information schema, but it cannot be read or altered at all. Any operation fails with a 'Table doesn't exist in engine' error.

      The container logs contain this error at startup:

      [ERROR] InnoDB: Table sfdb.specialist contains unrecognizable instant ALTER metadata

      I've checked the server system variables with

      SHOW VARIABLES;

      and alter_algorithm has the value DEFAULT, which (if I understand the documentation correctly) means it should not use the INSTANT algorithm, only INPLACE and COPY.
      https://mariadb.com/kb/en/library/server-system-variables/#alter_algorithm

      Trying to dig into the MariaDB codebase led me to this check:
      https://github.com/MariaDB/server/blob/cccfa9dcfe2c161779824c01a84edfa64fc4378a/storage/innobase/btr/btr0cur.cc#L465

      But I do not have enough knowledge of the inner workings of MariaDB to understand what is going on.

      What am I doing wrong? Should I set some flag or configuration to have a stable database in docker?

      PS: if this JIRA is not the correct place to ask, can you point me to a more suitable place to post my question?

      Attachments

        1. MDEV-20198.yy
          0.7 kB
        2. REPLAY_SIMP.sh
          11 kB
        3. simp_MDEV-20198.cfg
          43 kB

        Issue Links

          Activity

            Having the same issue. It started when I updated from 10.4.8 to 10.4.10; I downgraded back to 10.4.8, but the issue still exists. It also seems to only happen to tables with a dash in the name.

            obissick Oren Bissick (Inactive) added a comment

            mleich reproduced this error in internal testing of something else. In that case, the server was killed during an instant column reorder (MDEV-15562) while it was executing in the following stack trace:

            btr_page_reorganize
            btr_cur_pessimistic_update
            innobase_instant_try
            

            At the point of the SIGKILL, the durably written state of the page contained a metadata record stub for which the metadata BLOB had not yet been written out. Therefore, btr_cur_instant_init_low() would fail due to the following:

            		uint len = mach_read_from_4(ptr + BTR_EXTERN_LEN + 4);
            		if (!len
            		    || mach_read_from_4(ptr + BTR_EXTERN_OFFSET)
            		    != FIL_PAGE_DATA
            		    || mach_read_from_4(ptr + BTR_EXTERN_SPACE_ID)
            		    != space->id) {
            			goto incompatible;
            		}
            

            I think that the correct fix is: when the metadata BLOB pointer is all zero, btr_cur_instant_init_low() should assume that the metadata record is in an uncommitted state and pretend that it does not exist. In fact, at that point we should have access to the transaction metadata and could easily check whether DB_TRX_ID points to an incomplete transaction.

            marko Marko Mäkelä added a comment

            Here is the start of a fix.

            diff --git a/storage/innobase/btr/btr0cur.cc b/storage/innobase/btr/btr0cur.cc
            index ceac301a502..3b4200efd9d 100644
            --- a/storage/innobase/btr/btr0cur.cc
            +++ b/storage/innobase/btr/btr0cur.cc
            @@ -536,6 +536,19 @@ static dberr_t btr_cur_instant_init_low(dict_index_t* index, mtr_t* mtr)
             			goto incompatible;
             		}
             
            +		if (!memcmp(ptr,
            +			    field_ref_zero, BTR_EXTERN_FIELD_REF_SIZE)
            +		    && trx_sys.is_registered(
            +			    current_trx(),
            +			    mach_read_from_6(rec + trx_id_offset))) {
            +			/* The server must have been killed during
            +			an instant ALTER TABLE operation after a
            +			metadata record BLOB stub with an
            +			all-zero metadata BLOB pointer was durably written. */
            +			ut_a(!index->is_instant()); // FIXME
            +			return DB_SUCCESS;
            +		}
            +
             		uint len = mach_read_from_4(ptr + BTR_EXTERN_LEN + 4);
             		if (!len
             		    || mach_read_from_4(ptr + BTR_EXTERN_OFFSET)
            

            The fix is incomplete, and the ut_a() assertion would fail for the data directory that I have.

            According to the rr replay of the killed server, an instant ADD COLUMN as well as some column reordering had already been executed on the table before the column-reordering ALTER TABLE started to execute.

            I think that we would have to fetch a previous version of the metadata record where the metadata is available. If a previous version is not found (the server was killed during the very first column-reordering ALTER TABLE), then !index->is_instant() should hold and we can return DB_SUCCESS.

            I seem to remember that dict_load_indexes() uses the READ UNCOMMITTED isolation level. If we have to refer to the preceding version of the metadata record, I think that we would also have to refer to the corresponding version of the data dictionary tables when loading the metadata.

            The correct fix might be that we return a special error code from btr_cur_instant_init_low() and then retry loading the data dictionary definition using the earlier version, and on a subsequent call to btr_cur_instant_init_low() pass some state that tells that the previous version of the metadata record needs to be used.

            For fault injection testing, I think that we must let the server hang at this point:

            diff --git a/storage/innobase/btr/btr0cur.cc b/storage/innobase/btr/btr0cur.cc
            index ceac301a502..93396c558f6 100644
            --- a/storage/innobase/btr/btr0cur.cc
            +++ b/storage/innobase/btr/btr0cur.cc
            @@ -5123,6 +5123,7 @@ btr_cur_pessimistic_update(
             			would have too many fields, and we would be unable to
             			know the size of the freed record. */
             			btr_page_reorganize(page_cursor, index, mtr);
            +			DEBUG_SYNC_C("update_metadata");
             			rec = page_cursor->rec;
             			rec_offs_make_valid(rec, index, true, *offsets);
             			if (page_cursor->block->page.id().page_no()
            

            To test the scenario where the server is being killed during the very first MDEV-15562 type operation (such as ALTER TABLE MODIFY COLUMN b FIRST), I think that we can use the existing DEBUG_SYNC point before_row_ins_extern_latch.

            marko Marko Mäkelä added a comment

            Something like this should cover the scenario where the server is killed during a first-time MDEV-15562 operation:

            --source include/have_innodb.inc
            --source include/have_debug.inc
            --source include/have_debug_sync.inc
            CREATE TABLE t(a INT) ENGINE=InnoDB;
            INSERT INTO t VALUES(1);
            connect (con1,localhost,root,,);
            SET DEBUG_DBUG='+d,row_ins_extern_checkpoint';
            SET DEBUG_SYNC='before_row_ins_extern_latch SIGNAL stuck WAIT_FOR ever';
            send ALTER TABLE t ADD COLUMN b INT FIRST;
            connection default;
            SET DEBUG_SYNC='now WAIT_FOR stuck';
            --let $restart_timeout=0
            --source include/restart_mysqld.inc
            disconnect con1;
            SELECT * FROM t;
            DROP TABLE t;
            

            This test did not deliver the expected result for me: the table was recovered as (b,a)=(NULL,1) instead of (a)=(1).

            marko Marko Mäkelä added a comment

            This was fixed by MDEV-27234 in MariaDB Server 10.6.8, by implementing the READ COMMITTED isolation level in data dictionary recovery, instead of the previous READ UNCOMMITTED.

            Before 10.6, DDL operations are not crash-safe, not even if we ignore the mismatch between the two data dictionaries (the .frm files and what is stored inside InnoDB). The .frm file mismatch mostly affected ALTER TABLE and was fixed in MDEV-25180. Other DDL crash safety issues in InnoDB were fixed in MDEV-25506 and MDEV-25919. It is not feasible to implement any of those fixes in earlier major releases.

            marko Marko Mäkelä added a comment

            People

              Assignee: marko Marko Mäkelä
              Reporter: nreynisama Nicolas Reynis (Inactive)
              Votes: 0
              Watchers: 10
