[MDEV-15641] InnoDB crash while committing table-rebuilding ALTER TABLE Created: 2018-03-23  Updated: 2021-02-16  Resolved: 2019-07-10

Status: Closed
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Storage Engine - InnoDB, Storage Engine - XtraDB
Affects Version/s: 10.0, 10.1, 10.2, 10.3
Fix Version/s: 10.2.26, 10.3.17, 10.4.7

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 1
Labels: crash, hang, online-ddl, performance, upstream

Issue Links:
Relates
relates to MDEV-16329 Engine-independent online ALTER TABLE Closed
relates to MDEV-11424 Instant ALTER TABLE of failure-free r... Closed
relates to MDEV-13134 Introduce ALTER TABLE attributes ALGO... Closed

 Description   

Currently, mysql_inplace_alter_table() in sql/sql_table.cc does the following:

  // Upgrade to EXCLUSIVE before commit.
  if (wait_while_table_is_used(thd, table, HA_EXTRA_PREPARE_FOR_RENAME))
    goto rollback;

During this time, the ALTER TABLE thread is not doing anything useful. If this is a table-rebuilding ALTER TABLE operation (something that cannot be done as ALGORITHM=INSTANT (MDEV-13134) until parts of MDEV-11424 are finished, or if rebuild was requested by specifying the FORCE keyword), then it would make sense to let InnoDB invoke row_log_table_apply() after a request to upgrade the lock has been submitted.

I would like to consider adding a variant of the ha_innobase::inplace_alter_table() call that would let InnoDB invoke the next batch of row_log_table_apply(). Once this call returns, the caller would check if the ALTER TABLE operation was killed or the MDL upgrade timed out, or the MDL was granted. As long as the lock wait should continue, the storage engine would be called again to do useful work during the wait.



 Comments   
Comment by Marko Mäkelä [ 2018-03-23 ]

The row_log_table_apply() is actually invoked while holding both dict_sys->mutex and dict_operation_lock. If there is a lot of log to apply, this may actually cause InnoDB to crash.
I collaborated with a user on the IRC #maria channel who reported a crash in MySQL 5.7.21 with the following kind of test case:

  1. {{CREATE TABLE t }} with many secondary indexes.
  2. INSERT INTO t … 20 million rows.
  3. Issue ALTER TABLE t FORCE, LOCK=NONE;
  4. While the table is being copied, do: BEGIN; INSERT INTO t VALUES(…); (single row, do not commit)
  5. Wait for the ALTER TABLE to reach the stage where it is waiting in the MDL upgrade (no I/O taking place).
  6. Issue DELETE FROM t; COMMIT; in the open transaction.
  7. While this is executing, try to access any (possibly other) InnoDB table, or CREATE TABLE…ENGINE=InnoDB. It will be blocked too!
  8. Wait for the server crash like this:

    2018-03-23T15:19:10.847595Z 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
    

Comment by quantuml3ap [ 2018-03-23 ]

Step 1:

A table with ~25M rows.

| users_wc | CREATE TABLE `users_wc` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(50) DEFAULT NULL,
  `email` varchar(100) DEFAULT NULL,
  `active` mediumint(9) DEFAULT NULL,
  `active_woi` mediumint(9) DEFAULT NULL,
  `parent_id` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `parent_id` (`parent_id`),
  CONSTRAINT `users_wc_ibfk_1` FOREIGN KEY (`parent_id`) REFERENCES `users` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=32319559 DEFAULT CHARSET=utf8 |

Step 2:

Have 2 sessions - S1 and S2.

In S1 run:

ALTER TABLE users_wc ADD COLUMN random_col varchar(50) DEFAULT 'hello world foo bar';

In S2 do this:

BEGIN;
 
INSERT INTO users_wc SET name = 'a', email = 'a';

Now when the alter awaits X MDL before the commit phase (post-alter phase), try this in S2 (deleting 18M records):

DELETE FROM users_wc LIMIT 18000000;
(took ~9min)
COMMIT;

Now `SHOW PROCESSLIST` will show "committing alter table to storage engine". Saw this for ~15min until Mysql crashed and restarted.

Final observation:

#sql-ib... files were there, had to drop them and realized the ALTER hadn't completed successfully. The column was not added.

Comment by Marko Mäkelä [ 2019-01-28 ]

The problematic code would be removed when MDEV-16329 implements cross-engine online ALTER TABLE.

Comment by Valerii Kravchuk [ 2019-03-15 ]

It seems the problem is even easier to hit with partitioned table ALTERed, see upstream https://bugs.mysql.com/bug.php?id=94610

Comment by Thirunarayanan Balathandayuthapani [ 2019-07-04 ]

Currently, there are three things blocked while applying concurrently
DMLs in commit phase:
1) DML on the particular table (due to MDL_EXCLUSIVE on the table)
2) InnoDB DDLs (due to dict_operation_lock)
3) Purge thread, stats thread, the master thread (2nd reason)

The approach to apply the log of concurrent DMLs before acquiring dict_sys latches and
dict_operation_lock. So (2), (3) shouldn't be a issue anymore.

Comment by Thirunarayanan Balathandayuthapani [ 2019-07-05 ]

Patch approved by kevg. I would like to run some RQG test before pushing it to 10.2

Generated at Thu Feb 08 08:22:55 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.