[MDEV-32103] InnoDB ALTER TABLE is not crash safe Created: 2023-09-05  Updated: 2023-10-03  Resolved: 2023-09-11

Status: Closed
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Storage Engine - InnoDB
Affects Version/s: N/A
Fix Version/s: 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3

Type: Bug Priority: Blocker
Reporter: Matthias Leich Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: corruption, recovery, regression, rr-profile-analyzed

Issue Links:
Problem/Incident
is caused by MDEV-30100 Assertion `purge_sys.tail.trx_no <= p... Closed
Relates
relates to MDEV-24341 Innodb - do not block in foreground t... Closed
relates to MDEV-25180 Atomic ALTER TABLE Closed

 Description   

origin/bb-10.6-MDEV-32068 a26c51a190030443063362eb3f21c510fcb3c734 2023-09-05T16:30:53+03:00
but observed on other trees in history too.
 
Scenario:
1. Start server, generate some initial data
2. Several sessions run concurrent a DDL/DML mix
3. Intentional crash, restart with success
4. Check the data
SELECT `col2` FROM `test`.`t3` FORCE INDEX (`Marvão_uidx1`)  WHERE `col2` IS NULL OR `col2` IS NOT NULL harvests 1030: Got error 1 "Operation not permitted" from storage engine InnoDB.
The server error log contains for the time of the SELECT ... FORCE ...
[rr 3698310 16732]2023-09-05  7:42:03 6 [ERROR] Cannot find index Marvão_uidx1 in InnoDB index dictionary.
[rr 3698310 16736]2023-09-05  7:42:03 6 [ERROR] InnoDB indexes are inconsistent with what defined in .frm for table ./test/t3
[rr 3698310 16740]2023-09-05  7:42:03 6 [ERROR] InnoDB: Table test/t3 contains 2 indexes inside InnoDB, which is different from the number of indexes 3 defined in the .frm file. See https://mariadb.com/kb/en/innodb-troubleshooting/
The table t3 is ROW_FORMAT = Compact.
 
sdp:/data1/results/1693922571/TBR-1311--CATCH-0001$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
sdp:/data1/results/1693922571/TBR-1311--CATCH-0001/1/fbackup/ contains a file backup
made after the crash before the restart attempt.
 
RQG
===
# git clone https://github.com/mleich1/rqg --branch <pick the right branch> RQG
#
# GIT_SHOW: HEAD -> master, origin/master, origin/HEAD d5889a108c00974367f47657cabe4804fe7df566 2023-09-04T13:53:19+02:00
# rqg.pl  : Version 4.4.0 (2023-08)
#
# $RQG_HOME/rqg.pl \
# --gendata=conf/mariadb/table_stress.zz \
# --gendata_sql=conf/mariadb/table_stress.sql \
# --grammar=conf/mariadb/table_stress_innodb.yy \
# --reporters=CrashRecovery \
# --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
# --mysqld=--loose-idle_write_transaction_timeout=0 \
# --mysqld=--loose-idle_transaction_timeout=0 \
# --mysqld=--loose-idle_readonly_transaction_timeout=0 \
# --mysqld=--connect_timeout=60 \
# --mysqld=--interactive_timeout=28800 \
# --mysqld=--slave_net_timeout=60 \
# --mysqld=--net_read_timeout=30 \
# --mysqld=--net_write_timeout=60 \
# --mysqld=--loose-table_lock_wait_timeout=50 \
# --mysqld=--wait_timeout=28800 \
# --no_mask \
# --queries=10000000 \
# --seed=random \
# --reporters=None \
# --reporters=ErrorLog \
# --reporters=Deadlock \
# --validators=None \
# --mysqld=--log_output=none \
# --mysqld=--log_bin_trust_function_creators=1 \
# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
# --engine=InnoDB \
# --restart_timeout=240 \
# --mysqld=--plugin-load-add=file_key_management.so \
# --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
# --mysqld=--plugin-load-add=provider_lzo.so \
# --mysqld=--plugin-load-add=provider_bzip2.so \
# --mysqld=--plugin-load-add=provider_lzma.so \
# --mysqld=--plugin-load-add=provider_snappy.so \
# --mysqld=--plugin-load-add=provider_lz4.so \
# --mysqld=--loose-innodb_compression_level=1 \
# --duration=300 \
# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
# --mysqld=--lock-wait-timeout=15 \
# --mysqld=--innodb-lock-wait-timeout=10 \
# --mysqld=--loose-innodb_fast_shutdown=0 \
# --mysqld=--sql_mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION \
# --mysqld=--innodb_file_per_table=1 \
# --mysqld=--loose-innodb_read_only_compressed=OFF \
# --mysqld=--innodb_stats_persistent=on \
# --mysqld=--innodb_adaptive_hash_index=off \
# --mysqld=--innodb_sort_buffer_size=65536 \
# --mysqld=--innodb_random_read_ahead=OFF \
# --mysqld=--log-bin \
# --mysqld=--sync-binlog=1 \
# --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
# --mysqld=--loose-max-statement-time=30 \
# --threads=33 \
# --mysqld=--innodb-use-native-aio=0 \
# --mysqld=--loose-gdb \
# --mysqld=--loose-debug-gdb \
# --rr=Extended \
# --rr_options=--chaos --wait \
# --mysqld=--innodb_undo_log_truncate=OFF \
# --mysqld=--innodb_rollback_on_timeout=OFF \
# --mysqld=--innodb_page_size=16K \
# --mysqld=--innodb-buffer-pool-size=256M \
#  <local settings>



 Comments   
Comment by Marko Mäkelä [ 2023-09-06 ]

The first server instance after bootstrap executed an ALTER TABLE slightly before it was killed:

Thread 54 hit Breakpoint 2, ha_innobase::commit_inplace_alter_table (this=0xc826406c8b0, altered_table=0x7f2b2f087d90, ha_alter_info=0x7f2b2f087cd0, commit=true) at /data/Server/bb-10.6-MDEV-32068/storage/innobase/handler/handler0alter.cc:11093
11093	{
1: m_prebuilt.table.name = {m_name = 0x7aac48017d70 "test/t3"}
3: m_user_thd.query_string = {string = {str = 0xc8264028220 "ALTER TABLE t3 ADD FULLTEXT INDEX IF NOT EXISTS `Marvão_ftidx2` ( col_text ), ADD UNIQUE KEY IF NOT EXISTS `Marvão_uidx1` ( col2 /* 100800 ASC */, col1 /* 100800 DESC */ )  /* E_R Thread2 QNO 332 CO"..., length = 210}, cs = 0x55f97b094520 <my_charset_latin1>}
Current event: 433927
 
Thread 47 received signal SIGKILL, Killed.
[Switching to Thread 3460236.3474134]
0x0000000070000002 in syscall_traced ()
(rr) when
Current event: 434122

It looks like the last persistent write (update of log_sys.flushed_to_disk_lsn) was some time earlier in this thread:

#6  0x000055f97a6a09e4 in trx_t::commit_persist (this=this@entry=0x1ac73ffcea80) at /data/Server/bb-10.6-MDEV-32068/storage/innobase/trx/trx0trx.cc:1583
#7  0x000055f97a700611 in trx_t::commit (this=this@entry=0x1ac73ffcea80, deleted=std::vector of length 0, capacity 0) at /data/Server/bb-10.6-MDEV-32068/storage/innobase/dict/drop.cc:238
#8  0x000055f97a5e50b1 in commit_unlock_and_unlink (trx=0x1ac73ffcea80) at /data/Server/bb-10.6-MDEV-32068/storage/innobase/handler/handler0alter.cc:4339
#9  0x000055f97a5ec0cc in prepare_inplace_alter_table_dict (ha_alter_info=<optimized out>, altered_table=<optimized out>, old_table=<optimized out>, table_name=<optimized out>, flags=<optimized out>, 
    flags2=<optimized out>, fts_doc_id_col=<optimized out>, add_fts_doc_id=<optimized out>, add_fts_doc_id_idx=<optimized out>) at /data/Server/bb-10.6-MDEV-32068/storage/innobase/handler/handler0alter.cc:7352
#10 0x000055f97a5ee8e2 in ha_innobase::prepare_inplace_alter_table (this=<optimized out>, altered_table=<optimized out>, ha_alter_info=<optimized out>)
    at /data/Server/bb-10.6-MDEV-32068/storage/innobase/handler/ha_innodb.h:695
#11 0x000055f97a1d802d in mysql_inplace_alter_table (target_mdl_request=0x7f2b2f0885e0, alter_ctx=0x7f2b2f089740, trigger_param=0x7f2b2f088190, ddl_log_state=0x7f2b2f087c70, ha_alter_info=0x7f2b2f087cd0, 
    altered_table=0x7f2b2f087d90, table=0xc826408d4c8, table_list=0xc8264028470, thd=0xc826401dad8) at /data/Server/bb-10.6-MDEV-32068/sql/sql_table.cc:7361

The later call to ha_innobase::commit_inplace_alter_table() returned false (indicating success), while log_sys.lsn=19710260 and log_sys.flushed_to_disk_lsn=19691339 (at the value of the above write). This is definitely wrong. These values remained until the SIGKILL. Because there was no call to innodb_check_version on the subsequent server startup, I suppose that the SQL layer had replaced the .frm file and cleared the ddl_recovery.log. That subsequent recovery reported the following:

2023-09-05  7:41:47 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=42486,42486
2023-09-05  7:41:50 0 [Note] InnoDB: 10.6.16 started; log sequence number 19709789; transaction id 6814
2023-09-05  7:42:03 6 [ERROR] Cannot find index Marvão_uidx1 in InnoDB index dictionary.
2023-09-05  7:42:03 6 [ERROR] InnoDB indexes are inconsistent with what defined in .frm for table ./test/t3
2023-09-05  7:42:03 6 [ERROR] InnoDB: Table test/t3 contains 2 indexes inside InnoDB, which is different from the number of indexes 3 defined in the .frm file. See https://mariadb.com/kb/en/innodb-troubleshooting/

I believe that this is a regression due to MDEV-30100. We must durably write the log for committing a DDL transaction.

Comment by Marko Mäkelä [ 2023-09-06 ]

The atomic DDL commit that fails to be persisted is here:

(rr) backtrace
#0  trx_t::commit_in_memory (mtr=0x7f2b2f087610, this=0x1ac73ffcea80) at /data/Server/bb-10.6-MDEV-32068/storage/innobase/trx/trx0trx.cc:1366
#1  trx_t::commit_low (this=0x1ac73ffcea80, mtr=0x7f2b2f087610) at /data/Server/bb-10.6-MDEV-32068/storage/innobase/trx/trx0trx.cc:1569
#2  0x000055f97a6a09e4 in trx_t::commit_persist (this=this@entry=0x1ac73ffcea80) at /data/Server/bb-10.6-MDEV-32068/storage/innobase/trx/trx0trx.cc:1583
#3  0x000055f97a700611 in trx_t::commit (this=this@entry=0x1ac73ffcea80, deleted=std::vector of length 0, capacity 0) at /data/Server/bb-10.6-MDEV-32068/storage/innobase/dict/drop.cc:238
#4  0x000055f97a5f1ae6 in ha_innobase::commit_inplace_alter_table (this=<optimized out>, altered_table=<optimized out>, ha_alter_info=<optimized out>, commit=<optimized out>)
    at /data/Server/bb-10.6-MDEV-32068/storage/innobase/handler/handler0alter.cc:11549
(rr) print mtr->m_commit_lsn
$1 = 19710260

Comment by Marko Mäkelä [ 2023-09-06 ]

The reason why the DDL commit was not written is related to MDEV-24341. trx_flush_log_if_needed() is invoking an asynchronous write, and we fail to wait for that to complete before returning from the DDL operation:

  if ((cb.m_param= thd_increment_pending_ops(trx->mysql_thd)))
  {
    cb.m_callback = (void (*)(void *)) thd_decrement_pending_ops;
    log_write_up_to(lsn, flush, false, &cb);
  }

Comment by Marko Mäkelä [ 2023-09-07 ]

It looks like this must have been broken for a longer time. MDEV-24341 was implemented already in MariaDB Server 10.6.0. The last fundamental InnoDB change related to atomic ALTER TABLE (MDEV-25180) was the third part of MDEV-25506 in MariaDB Server 10.6.2.

wlad confirmed to me that we would either need a call to thd->async_state.wait_for_pending_ops(); after the DDL operation returns from InnoDB, or a synchronous call of log_write_up_to() (without a completion callback) is needed, to ensure the correct durability on DDL.

Comment by Marko Mäkelä [ 2023-09-07 ]

After all, this does look like a regression due to this preparatory change of MDEV-30100, which would cause the LSN to be 0 for the calls log_write_up_to(commit_lsn, true) in DDL operations.

In fact, mleich did catch this type of failure while testing MDEV-30100, and that preparatory change was revised accordingly (but insufficiently). I think that we must revert that change:

diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
index 751d7567881..e0eb723426f 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -1238,7 +1238,7 @@ static void trx_flush_log_if_needed(lsn_t lsn, trx_t *trx)
   if (log_sys.get_flushed_lsn() > lsn)
     return;
 
-  const bool flush= trx->dict_operation ||
+  const bool flush=
     (srv_file_flush_method != SRV_NOSYNC &&
      (srv_flush_log_at_trx_commit & 1));
 

We can make use of the field trx_t::flush_log_later in all DDL operations. The only DDL operation that does not invoke trx_t::commit(std::vector<pfs_os_file_t> &) is ha_innobase::rename_table().

Generated at Thu Feb 08 10:28:51 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.