[MDEV-23134] SEGV in dict_load_table_one during restart after server crash Created: 2020-07-09  Updated: 2020-07-23  Resolved: 2020-07-23

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2.33, 10.3.24, 10.5.5, 10.4
Fix Version/s: 10.2.33, 10.3.24, 10.4.14, 10.5.5

Type: Bug Priority: Major
Reporter: Matthias Leich Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 0
Labels: rr-profile-analyzed

Attachments: File MDEV-23134.yy    
Issue Links:
Relates
relates to MDEV-21419 Server crash upon startup after resto... Closed
relates to MDEV-23236 [draft] ASAN heap-use-after-free in d... Closed
relates to MDEV-23257 SEGV in dict_load_table_one after se... Open

 Description   

Workflow of the test
1. Start the server
2. One session runs random DDL on the table t1.
3. SIGKILL the server process at some point of time with ongoing 2.
4. Attempt to restart that server which fails with SEGV.
 
Thread 1 received signal SIGSEGV, Segmentation fault.
0x0000555aaccc746f in dict_load_table_one (name=..., ignore_err=ignore_err@entry=6, fk_tables=std::deque with 0 elements) at storage/innobase/dict/dict0load.cc:2992
2992                                    dict_table_get_first_index(table)->page);
(rr) bt
#0  0x0000555aaccc746f in dict_load_table_one (name=..., ignore_err=ignore_err@entry=6, fk_tables=std::deque with 0 elements) at storage/innobase/dict/dict0load.cc:2992
#1  0x0000555aaccc7c21 in dict_load_table (name=name@entry=0x555ab005dd80 "test/FTS_", '0' <repeats 13 times>, "645_", '0' <repeats 13 times>, "7e6_INDEX_1", ignore_err=ignore_err@entry=6) at storage/innobase/dict/dict0load.cc:2745
#2  0x0000555aaccb447f in dict_table_open_on_name (table_name=0x555ab005dd80 "test/FTS_", '0' <repeats 13 times>, "645_", '0' <repeats 13 times>, "7e6_INDEX_1", dict_locked=dict_locked@entry=1, try_drop=try_drop@entry=0, ignore_err=ignore_err@entry=6)
    at storage/innobase/dict/dict0dict.cc:1097
#3  0x0000555aacd3f040 in fts_drop_table (trx=trx@entry=0x7f610ce731e8, table_name=<optimized out>) at storage/innobase/fts/fts0fts.cc:1431
#4  0x0000555aacd42dae in fts_drop_aux_table_from_vector (trx=trx@entry=0x7f610ce731e8, tables=tables@entry=0x555ab0064748) atstorage/innobase/fts/fts0fts.cc:6507
#5  0x0000555aacd44dbf in fts_check_and_drop_orphaned_tables (trx=<optimized out>, tables=tables@entry=0x555ab007d828) at storage/innobase/fts/fts0fts.cc:6859
#6  0x0000555aacd4ab1b in fts_drop_orphaned_tables () at storage/innobase/fts/fts0fts.cc:6951
#7  0x0000555aacb667e4 in srv_start (create_new_db=<optimized out>) at storage/innobase/srv/srv0start.cc:1888
#8  0x0000555aac94e3ad in innodb_init (p=<optimized out>) at storage/innobase/handler/ha_innodb.cc:4001
#9  0x0000555aac57b8f8 in ha_initialize_handlerton (plugin=0x555aafa68d30) at sql/handler.cc:623
#10 0x0000555aac2d83b4 in plugin_initialize (tmp_root=tmp_root@entry=0x7ffdb5f7ae60, plugin=plugin@entry=0x555aafa68d30, argc=argc@entry=0x555aadbb0280 <remaining_argc>, argv=argv@entry=0x555aafa285f0, options_only=<optimized out>)
    at sql/sql_plugin.cc:1459
#11 0x0000555aac2d9a59 in plugin_init (argc=argc@entry=0x555aadbb0280 <remaining_argc>, argv=0x555aafa285f0, flags=0) at sql/sql_plugin.cc:1752
#12 0x0000555aac1b3459 in init_server_components () at sql/mysqld.cc:4909
#13 0x0000555aac1b8658 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at sql/mysqld.cc:5492
#14 0x0000555aac1a8b33 in main (argc=<optimized out>, argv=<optimized out>) at sql/main.cc:25
(rr)
 
origin/10.5 1813d92d0c505a1c752f4a9d6e37db6bffe4efdd 2020-07-02T09:41:44+03:00
 
RQG
-------
git clone https://github.com/mleich1/rqg --branch experimental RQG
origin/experimental 437813fbbff324a52a6317d2ef262729e5645242 2020-07-08T13:41:44+02:00
 
perl rqg.pl \                        
--duration=100 \
--queries=10000000 \
--threads=1 \
--no_mask \
--seed=random \
--rpl_mode=none \
--engine=InnoDB \
--gendata=conf/runtime/alter_online.zz \
--mysqld=--net_write_timeout=60 \
--mysqld=--log-output=none \
--mysqld=--loose-idle_readonly_transaction_timeout=0 \
--mysqld=--lock-wait-timeout=86400 \
--mysqld=--connect_timeout=60 \
--mysqld=--log-bin \
--mysqld=--slave_net_timeout=60 \
--mysqld=--loose_innodb_use_native_aio=0 \
--mysqld=--loose-table_lock_wait_timeout=50 \
--mysqld=--loose_innodb_lock_schedule_algorithm=fcfs \
--mysqld=--wait_timeout=28800 \
--mysqld=--loose-max-statement-time=30 \
--mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
--mysqld=--innodb-lock-wait-timeout=50 \
--mysqld=--loose-debug_assert_on_not_freed_memory=0 \
--mysqld=--net_read_timeout=30 \
--mysqld=--loose-idle_transaction_timeout=0 \
--mysqld=--log_bin_trust_function_creators=1 \
--mysqld=--loose-idle_write_transaction_timeout=0 \
--mysqld=--interactive_timeout=28800 \
--reporters=Backtrace,CrashRecovery1,Deadlock1,ErrorLog \
--validators=None \
--grammar=MDEV-23134.yy \
--workdir=<local settings> \
--vardir=<local settings> \
--mtr-build-thread=<local settings> \
--basedir1=<local settings> \
--script_debug=_nix_ \
--rr=Server \
--rr_options=--chaos
 
The issue seems to be somehow a sibling of https://jira.mariadb.org/browse/MDEV-21419 even though
mariabackup is not involved at all.
It might be necessary to run the RQG test several times because some of the tests will
pass without failure and other fail with other issues like
-   Restart with success but
    Verifying table: `test`.`t1`
    DBD::mysql::st execute failed: Table 'test.t1' doesn't exist in engine at lib/GenTest/Reporter/CrashRecovery1.pm line 234.
- Restart with success but
  INFO: Reporter 'CrashRecovery1': Executing OPTIMIZE TABLE `test`.`t1`.
  'test.t1',  'optimize', 'note', 'Table does not support optimize, doing recreate + analyze instead'
  'test.t1', 'optimize', 'error', 'Got error 11 "Resource temporarily unavailable" from storage engine InnoDB'
  'test.t1', 'optimize', 'status', 'Operation failed'
 
The same test shows on origin/10.4 41221091f6541ea83eeec70732b92b94746a44e7 2020-07-08T23:27:19+03:00
quite similar effects.
 
Replay also on MariaDB
- 10.3.24 origin/10.3 f3f23b5c4bdc669ad0af4a1c79bd70c40ed9c594 2020-07-05T16:31:34+03:00
- 10.2.33 origin/10.2 737c3025e9ed55855ee66806ad14e9e7e7852fa7 2020-07-10T00:01:24+05:30



 Comments   
Comment by Thirunarayanan Balathandayuthapani [ 2020-07-18 ]

Test case for this issue:

--source include/have_innodb.inc
--source include/have_debug.inc
--source include/have_debug_sync.inc
 
# Embedded server does not support restarting
--source include/not_embedded.inc
 
--let $_server_id= `SELECT @@server_id`
--let $_expect_file_name= $MYSQLTEST_VARDIR/tmp/mysqld.$_server_id.expect
 
CREATE TABLE t1(f1 INT NOT NULL PRIMARY KEY,
                f2 CHAR(100), f3 CHAR(100),
                fulltext(f2))ENGINE=InnoDB;
INSERT INTO t1 VALUES(1, "thiru", "mysql");
--disable_result_log
--disable_query_log
set debug_dbug="+d,fts_aux_table_drop";
# Write file to make mysql-test-run.pl expect crash
--exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
 
--error 2013
ALTER TABLE t1 DROP PRIMARY KEY;
--source include/start_mysqld.inc
SELECT * FROM t1;
DROP TABLE t1;

Test case need debug_execute_if to kill the server at the correct place.

 
diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc
index b304c03f7ed..7a715966775 100644
--- a/storage/innobase/row/row0mysql.cc
+++ b/storage/innobase/row/row0mysql.cc
@@ -3724,6 +3724,10 @@ row_drop_table_for_mysql(
                                NULL, table->name.m_name, IBD, false);
                }
 
+               DBUG_EXECUTE_IF("fts_aux_table_drop",
+                               if (strstr(tablename, "BEING_DELETED_CACHE")) {
+                                       ut_ad(0);
+                               });
                /* Free the dict_table_t object. */
                err = row_drop_table_from_cache(tablename, table, trx);
                if (err != DB_SUCCESS) {

InnoDB gets killed after removing table entry from sys tables, sys tablespaces etc..
While restarting the server, InnoDB tries to open the table via tablename. It tries to
load the index for it. But sys_indexes doesn't have any corresponding entry for table id.
But dict_table_load_one() tries to access clustered index.

Comment by Marko Mäkelä [ 2020-07-22 ]

You are suggesting a simple fix:

diff --git a/storage/innobase/dict/dict0load.cc b/storage/innobase/dict/dict0load.cc
index 4014d28c2fb..9f4f76c4751 100644
--- a/storage/innobase/dict/dict0load.cc
+++ b/storage/innobase/dict/dict0load.cc
@@ -3059,7 +3059,7 @@ dict_load_table_one(
 corrupted:
                                table->corrupted = true;
                                table->file_unreadable = true;
-                       } else {
+                       } else if (dict_table_get_first_index(table)) {
                                const page_id_t page_id(
                                        table->space,
                                        dict_table_get_first_index(table)

But, should we not add that condition to the preceding if expression, so that we would set the flags in table? If there is no clustered index defined for a table, the table is certainly as corrupted and unreadable as if the data file were missing.

Generated at Thu Feb 08 09:20:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.