Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-15641

InnoDB crash while committing table-rebuilding ALTER TABLE

Details

    Description

      Currently, mysql_inplace_alter_table() in sql/sql_table.cc does the following:

        // Upgrade to EXCLUSIVE before commit.
        if (wait_while_table_is_used(thd, table, HA_EXTRA_PREPARE_FOR_RENAME))
          goto rollback;
      

      During this time, the ALTER TABLE thread is not doing anything useful. If this is a table-rebuilding ALTER TABLE operation (something that cannot be done as ALGORITHM=INSTANT (MDEV-13134) until parts of MDEV-11424 are finished, or if rebuild was requested by specifying the FORCE keyword), then it would make sense to let InnoDB invoke row_log_table_apply() after a request to upgrade the lock has been submitted.

      I would like to consider adding a variant of the ha_innobase::inplace_alter_table() call that would let InnoDB invoke the next batch of row_log_table_apply(). Once this call returns, the caller would check if the ALTER TABLE operation was killed or the MDL upgrade timed out, or the MDL was granted. As long as the lock wait should continue, the storage engine would be called again to do useful work during the wait.

      Attachments

        Issue Links

          Activity

            The row_log_table_apply() is actually invoked while holding both dict_sys->mutex and dict_operation_lock. If there is a lot of log to apply, this may actually cause InnoDB to crash.
            I collaborated with a user on the IRC #maria channel who reported a crash in MySQL 5.7.21 with the following kind of test case:

            1. {{CREATE TABLE t }} with many secondary indexes.
            2. INSERT INTO t … 20 million rows.
            3. Issue ALTER TABLE t FORCE, LOCK=NONE;
            4. While the table is being copied, do: BEGIN; INSERT INTO t VALUES(…); (single row, do not commit)
            5. Wait for the ALTER TABLE to reach the stage where it is waiting in the MDL upgrade (no I/O taking place).
            6. Issue DELETE FROM t; COMMIT; in the open transaction.
            7. While this is executing, try to access any (possibly other) InnoDB table, or CREATE TABLE…ENGINE=InnoDB. It will be blocked too!
            8. Wait for the server crash like this:

              2018-03-23T15:19:10.847595Z 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
              

            marko Marko Mäkelä added a comment - The row_log_table_apply() is actually invoked while holding both dict_sys->mutex and dict_operation_lock . If there is a lot of log to apply, this may actually cause InnoDB to crash. I collaborated with a user on the IRC #maria channel who reported a crash in MySQL 5.7.21 with the following kind of test case: {{CREATE TABLE t }} with many secondary indexes. INSERT INTO t … 20 million rows. Issue ALTER TABLE t FORCE, LOCK=NONE; While the table is being copied, do: BEGIN; INSERT INTO t VALUES(…); (single row, do not commit) Wait for the ALTER TABLE to reach the stage where it is waiting in the MDL upgrade (no I/O taking place). Issue DELETE FROM t; COMMIT; in the open transaction. While this is executing, try to access any (possibly other) InnoDB table, or CREATE TABLE…ENGINE=InnoDB . It will be blocked too! Wait for the server crash like this: 2018-03-23T15:19:10.847595Z 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            quantuml3ap quantuml3ap added a comment -

            Step 1:

            A table with ~25M rows.

            | users_wc | CREATE TABLE `users_wc` (
              `id` int(11) NOT NULL AUTO_INCREMENT,
              `name` varchar(50) DEFAULT NULL,
              `email` varchar(100) DEFAULT NULL,
              `active` mediumint(9) DEFAULT NULL,
              `active_woi` mediumint(9) DEFAULT NULL,
              `parent_id` int(11) DEFAULT NULL,
              PRIMARY KEY (`id`),
              KEY `parent_id` (`parent_id`),
              CONSTRAINT `users_wc_ibfk_1` FOREIGN KEY (`parent_id`) REFERENCES `users` (`id`)
            ) ENGINE=InnoDB AUTO_INCREMENT=32319559 DEFAULT CHARSET=utf8 |
            

            Step 2:

            Have 2 sessions - S1 and S2.

            In S1 run:

            ALTER TABLE users_wc ADD COLUMN random_col varchar(50) DEFAULT 'hello world foo bar';
            

            In S2 do this:

            BEGIN;
             
            INSERT INTO users_wc SET name = 'a', email = 'a';
            

            Now when the alter awaits X MDL before the commit phase (post-alter phase), try this in S2 (deleting 18M records):

            DELETE FROM users_wc LIMIT 18000000;
            (took ~9min)
            COMMIT;
            

            Now `SHOW PROCESSLIST` will show "committing alter table to storage engine". Saw this for ~15min until Mysql crashed and restarted.

            Final observation:

            #sql-ib... files were there, had to drop them and realized the ALTER hadn't completed successfully. The column was not added.

            quantuml3ap quantuml3ap added a comment - Step 1: A table with ~25M rows. | users_wc | CREATE TABLE `users_wc` ( `id` int (11) NOT NULL AUTO_INCREMENT, ` name ` varchar (50) DEFAULT NULL , `email` varchar (100) DEFAULT NULL , `active` mediumint(9) DEFAULT NULL , `active_woi` mediumint(9) DEFAULT NULL , `parent_id` int (11) DEFAULT NULL , PRIMARY KEY (`id`), KEY `parent_id` (`parent_id`), CONSTRAINT `users_wc_ibfk_1` FOREIGN KEY (`parent_id`) REFERENCES `users` (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=32319559 DEFAULT CHARSET=utf8 | Step 2: Have 2 sessions - S1 and S2. In S1 run: ALTER TABLE users_wc ADD COLUMN random_col varchar (50) DEFAULT 'hello world foo bar' ; In S2 do this: BEGIN ;   INSERT INTO users_wc SET name = 'a' , email = 'a' ; Now when the alter awaits X MDL before the commit phase (post-alter phase), try this in S2 (deleting 18M records): DELETE FROM users_wc LIMIT 18000000; (took ~9min) COMMIT ; Now `SHOW PROCESSLIST` will show "committing alter table to storage engine". Saw this for ~15min until Mysql crashed and restarted. Final observation: #sql-ib... files were there, had to drop them and realized the ALTER hadn't completed successfully. The column was not added.

            The problematic code would be removed when MDEV-16329 implements cross-engine online ALTER TABLE.

            marko Marko Mäkelä added a comment - The problematic code would be removed when MDEV-16329 implements cross-engine online ALTER TABLE.

            It seems the problem is even easier to hit with partitioned table ALTERed, see upstream https://bugs.mysql.com/bug.php?id=94610

            valerii Valerii Kravchuk added a comment - It seems the problem is even easier to hit with partitioned table ALTERed, see upstream https://bugs.mysql.com/bug.php?id=94610

            Currently, there are three things blocked while applying concurrently
            DMLs in commit phase:
            1) DML on the particular table (due to MDL_EXCLUSIVE on the table)
            2) InnoDB DDLs (due to dict_operation_lock)
            3) Purge thread, stats thread, the master thread (2nd reason)

            The approach to apply the log of concurrent DMLs before acquiring dict_sys latches and
            dict_operation_lock. So (2), (3) shouldn't be a issue anymore.

            thiru Thirunarayanan Balathandayuthapani added a comment - Currently, there are three things blocked while applying concurrently DMLs in commit phase: 1) DML on the particular table (due to MDL_EXCLUSIVE on the table) 2) InnoDB DDLs (due to dict_operation_lock) 3) Purge thread, stats thread, the master thread (2nd reason) The approach to apply the log of concurrent DMLs before acquiring dict_sys latches and dict_operation_lock. So (2), (3) shouldn't be a issue anymore.

            Patch approved by kevg. I would like to run some RQG test before pushing it to 10.2

            thiru Thirunarayanan Balathandayuthapani added a comment - Patch approved by kevg . I would like to run some RQG test before pushing it to 10.2

            People

              thiru Thirunarayanan Balathandayuthapani
              marko Marko Mäkelä
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.