[MDEV-18654] Failing assertion: sym_node->table != NULL in buildbot with innodb_fts.sync_ddl and outside Created: 2019-02-20  Updated: 2020-07-17  Resolved: 2020-07-17

Status: Closed
Project: MariaDB Server
Component/s: Full-text Search, Storage Engine - InnoDB, Storage Engine - XtraDB, Tests
Affects Version/s: 10.1, 10.2, 10.3, 10.4
Fix Version/s: 10.5.1

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 0
Labels: None

Attachments: File mdev18654a.test     HTML File threads_full_2019-05-05    
Issue Links:
Relates
relates to MDEV-13564 TRUNCATE TABLE and undo tablespace tr... Closed
relates to MDEV-16678 Use MDL for innodb background threads... Closed
relates to MDEV-16855 Fix fts_sync() synchronization in InnoDB Closed
relates to MDEV-18220 [Draft] ASAN heap-use-after-free in f... Closed
relates to MDEV-19529 InnoDB hang on DROP FULLTEXT INDEX Closed
relates to MDEV-19647 Server hangs after dropping full text... Closed

 Description   

http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/16190

nnodb_fts.sync_ddl 'innodb'             w2 [ fail ]
        Test ended at 2019-02-04 11:04:55
 
CURRENT_TEST: innodb_fts.sync_ddl
Warning: /mnt/buildbot/build/mariadb-10.3.13/libmysqld/examples/mysqltest_embedded: unknown variable 'loose-ssl-ca=/mnt/buildbot/build/mariadb-10.3.13/mysql-test/std_data/cacert.pem'
Warning: /mnt/buildbot/build/mariadb-10.3.13/libmysqld/examples/mysqltest_embedded: unknown variable 'loose-ssl-cert=/mnt/buildbot/build/mariadb-10.3.13/mysql-test/std_data/client-cert.pem'
Warning: /mnt/buildbot/build/mariadb-10.3.13/libmysqld/examples/mysqltest_embedded: unknown variable 'loose-ssl-key=/mnt/buildbot/build/mariadb-10.3.13/mysql-test/std_data/client-key.pem'
Warning: /mnt/buildbot/build/mariadb-10.3.13/libmysqld/examples/mysqltest_embedded: unknown option '--loose-skip-ssl'
2019-02-04 11:04:53 0xae6f5b40  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.3.13/storage/innobase/pars/pars0pars.cc line 815
InnoDB: Failing assertion: sym_node->table != NULL
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
mysqltest got signal 6
read_command_buf (0x829cac98): TRUN
conn->name (0x82c7c220): 
conn->cur_query (0x82c933d8): TRUNCATE TABLE t1
Attempting backtrace...
stack_bottom = 0x0 thread_stack 0x49000
/mnt/buildbot/build/mariadb-10.3.13/libmysqld/examples/mysqltest_embedded(my_print_stacktrace+0x3c)[0x803a871a]
mysys/stacktrace.c:269(my_print_stacktrace)[0x8038af28]
client/mysqltest.cc:9092(dump_backtrace())[0x8038af66]
addr2line: '': No such file
[0xb7725c14]
[0xb7725c31]
/lib/i386-linux-gnu/libc.so.6(gsignal+0x39)[0xb7091e89]
/lib/i386-linux-gnu/libc.so.6(abort+0x157)[0xb70933e7]
include/sync0types.h:1130(my_atomic_loadlint(unsigned int const*))[0x80d768e2]
pars/pars0pars.cc:817(pars_retrieve_table_def(sym_node_t*))[0x80e2a457]
pars/pars0pars.cc:1315(pars_insert_statement(sym_node_t*, void*, sel_node_t*))[0x80e2aff0]
innobase/pars0grm.y:375(yyparse())[0x80dfbc60]
pars/pars0pars.cc:2135(pars_sql(pars_info_t*, char const*))[0x80e2c23e]
fts/fts0sql.cc:184(fts_parse_sql(fts_table_t*, pars_info_t*, char const*))[0x80c5aafd]
fts/fts0fts.cc:3930(fts_write_node(trx_t*, que_fork_t**, fts_table_t*, fts_string_t*, fts_node_t*))[0x80b69295]
fts/fts0fts.cc:4071(fts_sync_write_words(trx_t*, fts_index_cache_t*, bool))[0x80b69841]
fts/fts0fts.cc:4152(fts_sync_index(fts_sync_t*, fts_index_cache_t*))[0x80b69e0e]
fts/fts0fts.cc:4397(fts_sync(fts_sync_t*, bool, bool, bool))[0x80b6a6de]
fts/fts0fts.cc:4482(fts_sync_table(dict_table_t*, bool, bool, bool))[0x80b6a9c5]
fts/fts0opt.cc:2829(fts_optimize_sync_table(unsigned long long))[0x80d2c078]
fts/fts0opt.cc:2942(fts_optimize_thread(void*))[0x80d2c469]
/lib/i386-linux-gnu/libpthread.so.0(+0x62b5)[0xb76fc2b5]
/lib/i386-linux-gnu/libc.so.6(clone+0x6e)[0xb714d16e]
Writing a core file...



 Comments   
Comment by Elena Stepanova [ 2019-05-05 ]

On 10.2: http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/16895/steps/mtr_nm/logs/stdio

The last occurrence in buildbot was over a month ago, but that might be just because it happens rarely. The problem itself is still there, I'm getting it in my tests on the current 10.2 (also with a low probability):

10.2 ce195987

2019-05-05 22:34:22 0x7fd9b3fff700  InnoDB: Assertion failure in file /data/src/10.2/storage/innobase/pars/pars0pars.cc line 815
InnoDB: Failing assertion: sym_node->table != NULL
 
#6  0x000055fb803504e8 in ut_dbg_assertion_failed (expr=0x55fb808aa587 "sym_node->table != NULL", file=0x55fb808aa1e8 "/data/src/10.2/storage/innobase/pars/pars0pars.cc", line=815) at /data/src/10.2/storage/innobase/ut/ut0dbg.cc:60
#7  0x000055fb8022ef79 in pars_retrieve_table_def (sym_node=0x7fd9a0152d28) at /data/src/10.2/storage/innobase/pars/pars0pars.cc:815
#8  0x000055fb8022fc98 in pars_insert_statement (table_sym=0x7fd9a0152d28, values_list=0x7fd9a0152e30, select=0x0) at /data/src/10.2/storage/innobase/pars/pars0pars.cc:1315
#9  0x000055fb804af3b7 in yyparse () at pars0grm.y:374
#10 0x000055fb80231040 in pars_sql (info=0x7fd9a0152318, str=0x7fd9a0003390 "PROCEDURE P() IS\nBEGIN\nINSERT INTO $index_table_name VALUES (:token, :first_doc_id,  :last_doc_id, :doc_count, :ilist);\nEND;\n") at /data/src/10.2/storage/innobase/pars/pars0pars.cc:2127
#11 0x000055fb8048fe09 in fts_parse_sql (fts_table=0x7fd9b3ffe8b0, info=0x7fd9a0152318, sql=0x55fb80982978 "BEGIN\nINSERT INTO $index_table_name VALUES (:token, :first_doc_id,  :last_doc_id, :doc_count, :ilist);") at /data/src/10.2/storage/innobase/fts/fts0sql.cc:183
#12 0x000055fb80473e53 in fts_write_node (trx=0x7fd9cba2c3a8, graph=0x7fd9580517c0, fts_table=0x7fd9b3ffe8b0, word=0x7fd958207530, node=0x7fd958205b98) at /data/src/10.2/storage/innobase/fts/fts0fts.cc:3952
#13 0x000055fb80474414 in fts_sync_write_words (trx=0x7fd9cba2c3a8, index_cache=0x7fd958120b28, unlock_cache=true) at /data/src/10.2/storage/innobase/fts/fts0fts.cc:4090
#14 0x000055fb804749b0 in fts_sync_index (sync=0x7fd958051660, index_cache=0x7fd958120b28) at /data/src/10.2/storage/innobase/fts/fts0fts.cc:4174
#15 0x000055fb8047529f in fts_sync (sync=0x7fd958051660, unlock_cache=true, wait=false, has_dict=false) at /data/src/10.2/storage/innobase/fts/fts0fts.cc:4419
#16 0x000055fb8047558e in fts_sync_table (table=0x7fd9580503a8, unlock_cache=true, wait=false, has_dict=false) at /data/src/10.2/storage/innobase/fts/fts0fts.cc:4503
#17 0x000055fb80484bc3 in fts_optimize_sync_table (table_id=193) at /data/src/10.2/storage/innobase/fts/fts0opt.cc:2830
#18 0x000055fb80484f90 in fts_optimize_thread (arg=0x55fb8345b630) at /data/src/10.2/storage/innobase/fts/fts0opt.cc:2943
#19 0x00007fd9d238c4a4 in start_thread (arg=0x7fd9b3fff700) at pthread_create.c:456
#20 0x00007fd9d08d4d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

All threads: threads_full_2019-05-05

Comment by Marko Mäkelä [ 2019-05-06 ]

The problem is that fts_sync_table() is attempting to access a nonexistent internal table for the fulltext index. Likely explanations include that the table was renamed to a different database, or it was dropped (DROP INDEX, DROP TABLE) or rebuilt (ALTER TABLE, TRUNCATE, OPTIMIZE TABLE).

I would tend to believe that fts_optimize_thread() fails to properly increment the table reference count, or to acquire an InnoDB table lock, or a metadata lock (MDL).

Comment by Marko Mäkelä [ 2019-05-06 ]

The crash dump says that the current statement is TRUNCATE. MDEV-13564 implements that internally using DROP TABLE, among other things. It could be that this implementation makes the crash more probable.

Comment by Marko Mäkelä [ 2019-05-06 ]

The test is using debug instrumentation that does not look deterministic:

			case FTS_MSG_SYNC_TABLE:
				DBUG_EXECUTE_IF(
					"fts_instrument_msg_sync_sleep",
					os_thread_sleep(300000););
 
				fts_optimize_sync_table(
					*static_cast<table_id_t*>(msg->ptr));

That said, this 0.3-second sleep should not be blamed for the crash. I see that the table is being opened properly, incrementing the reference count:

/** Sync fts cache of a table
@param[in]	table_id	table id */
void
fts_optimize_sync_table(
	table_id_t	table_id)
{
	dict_table_t*   table = NULL;
 
	table = dict_table_open_on_id(table_id, FALSE, DICT_TABLE_OP_NORMAL);
 
	if (table) {
		if (dict_table_has_fts_index(table) && table->fts->cache) {
			fts_sync_table(table, true, false, false);
		}
 
		dict_table_close(table, FALSE, FALSE);
	}
}

I see that there was an attempt to fix this in MySQL 5.6.38 that was merged to MariaDB as well. Maybe the convoluted piece of code that was added to row_drop_table_for_mysql() appears to suffer from a race condition with fts_sync():

			for (;;) {
				bool retry = false;
				if (dict_fts_index_syncing(table)) {
					retry = true;
				}
				if (!retry) {
					break;
				}
				DICT_BG_YIELD(trx);
			}
			row_mysql_unlock_data_dictionary(trx);
			fts_optimize_remove_table(table);
			row_mysql_lock_data_dictionary(trx);

The problem appears to be that only after we successfully checked that fts_sync() was not running, we will remove the table from the queue. I believe that we should first invoke fts_optimize_remove_table(table) and then ensure that the indexes have really been removed from the fts_optimize_thread queue. Alternatively, set some flag on the table or index so that it will not be picked up by fts_sync_table().

I was unable to repeat the failure even with the following patch:

diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc
index a06c3cbd1a8..2c06273ffbc 100644
--- a/storage/innobase/row/row0mysql.cc
+++ b/storage/innobase/row/row0mysql.cc
@@ -3348,6 +3348,7 @@ row_drop_table_for_mysql(
 				}
 				DICT_BG_YIELD(trx);
 			}
+			os_thread_sleep(300000);
 			row_mysql_unlock_data_dictionary(trx);
 			fts_optimize_remove_table(table);
 			row_mysql_lock_data_dictionary(trx);

Comment by Marko Mäkelä [ 2019-05-06 ]

The test case was added in MDEV-16855, which cleaned up something in the synchronization.

The function fts_optimize_remove_table() always waited for the removal of the table from the work queue, starting from the very first public release of the code in MySQL 5.6.4.

The problem seems to be that the removal and the check for activity are reversed in row_drop_table_for_mysql(). I hope that the problem will disappear with the following code change:

diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc
index a06c3cbd1a8..72e3d6ea75b 100644
--- a/storage/innobase/row/row0mysql.cc
+++ b/storage/innobase/row/row0mysql.cc
@@ -3338,18 +3338,11 @@ row_drop_table_for_mysql(
 			ut_ad(!table->fts->add_wq);
 			ut_ad(lock_trx_has_sys_table_locks(trx) == 0);
 
-			for (;;) {
-				bool retry = false;
-				if (dict_fts_index_syncing(table)) {
-					retry = true;
-				}
-				if (!retry) {
-					break;
-				}
-				DICT_BG_YIELD(trx);
-			}
 			row_mysql_unlock_data_dictionary(trx);
-			fts_optimize_remove_table(table);
+			do {
+				fts_optimize_remove_table(table);
+			} while (dict_fts_index_syncing(table));
+
 			row_mysql_lock_data_dictionary(trx);
 		}
 

Comment by Marko Mäkelä [ 2019-05-06 ]

On shutdown, fts_drop_index() can invoke fts_free() in an unsafe fashion:

			while (index->index_fts_syncing
				&& !trx_is_interrupted(trx)) {
				DICT_BG_YIELD(trx);
			}
 
			fts_free(table);

The predicate trx_is_interrupted(trx) would hold on all transactions on shutdown.

MDEV-18220 should be very closely related to this. Perhaps that test case involved KILL QUERY?

I cannot find any logic in the concurrency rules around the field dict_index_t::index_fts_syncing. I think that we’d better remove this field and ensure correctness by design, somehow.

Comment by Elena Stepanova [ 2019-05-06 ]

The attached MTR test case mdev18654a.test failes with this assertion failure on 10.2 54d0a55ad debug build (with and without the patch above) frequently enough. It's highly non-deterministic, but short. Run with e.g. --repeat=100. It usually fails within 30 attempts for me, which takes 1-2 minutes.
The test case is dirty and for reproducing purposes only!

Please note that since the assertion fails upon shutdown, MTR doesn't produce a proper failure report, it looks like this:

***Warnings generated in error logs during shutdown after running tests: bug.mdev18654a
 
Attempting backtrace. You can use the following information to find out
 
mysql-test-run: *** ERROR: Test suite aborted

But the error log should show the assertion failure.

Comment by Marko Mäkelä [ 2019-05-06 ]

For me, mdev18654a.test is crashing instantly, even after I removed dict_index_t::index_fts_syncing and tried to implement more appropriate synchronization.

Comment by Marko Mäkelä [ 2019-05-07 ]

mdev18654a.test repeats the problem for me only in a -O2 build, not non-optimized debug build. We fail to find an internal table for one of the fulltext indexes of table id 169 (table400_innodb_int_autoinc which was renamed to #sql2 name prefix and is about to be dropped). Here are a couple of occasions: table name ending in d2_INDEX_6, and another one ending in 103_INDEX_6. It would appear to me that these tables were never created in the first place.

Comment by Marko Mäkelä [ 2019-05-08 ]

The bug revealed by mdev18654a.test is very interesting. It turns out that the supposedly ‘missing’ table actually still exists not only in SYS_TABLES, but also in dict_sys. The test is actually repeating MDEV-18220: fts_get_table_name_prefix() is dereferencing a freed table->name that used to contain the #sql name of the intermediate copy of the base table. The current table->name would already point to the user table name at that point. AddressSanitizer confirms this:

10.2 e0271a7b43c6df652c6a074858853a6d0da20c1e

==9899==ERROR: AddressSanitizer: heap-use-after-free on address 0x60300003c160 at pc 0x55fe91f1c349 bp 0x7f259be91050 sp 0x7f259be907f8
READ of size 2 at 0x60300003c160 thread T15
    #0 0x55fe91f1c348 in __interceptor_strlen (/dev/shm/10.2c7/sql/mysqld+0xc68348)
    #1 0x55fe936e1b55 in fts_get_table_name_prefix(fts_table_t const*) /mariadb/10.2/storage/innobase/fts/fts0sql.cc:108:34
    #2 0x55fe936e1f72 in fts_get_table_name(fts_table_t const*, char*) /mariadb/10.2/storage/innobase/fts/fts0sql.cc:148:16
freed by thread T29 here:
    #0 0x55fe91fb1162 in realloc (/dev/shm/10.2c7/sql/mysqld+0xcfd162)
    #1 0x55fe93548e97 in dict_table_rename_in_cache(dict_table_t*, char const*, bool, bool) /mariadb/10.2/storage/innobase/dict/dict0dict.cc:1728:4
    #2 0x55fe93196ad9 in row_rename_table_for_mysql(char const*, char const*, trx_t*, bool, bool) /mariadb/10.2/storage/innobase/row/row0mysql.cc:4503:9
    #3 0x55fe92eedb0f in innobase_rename_table(trx_t*, char const*, char const*, bool, bool) /mariadb/10.2/storage/innobase/handler/ha_innodb.cc:13403:10

If the freed string contains suitable garbage, then fts_get_table_name_prefix() would return a string that starts with FTS_, lacking the database name prefix and a '/'. Only a few internal SYS_ tables that exist in the InnoDB system tablespace lack the '/' character in their name. I wasted at least 1 day because I missed the fact that the table name is incorrect.

I think that we need two independent changes: the removal of dict_index_t::index_fts_syncing and the related cleanup (in this MDEV-18654), and the fix of the race condition with rename operations (MDEV-18220).

Comment by Marko Mäkelä [ 2019-05-09 ]

While the dict_index_t::index_fts_syncing feels ugly and seems to be prone to race conditions, my current attempt at removing it would cause problems as pointed out by thiru. A better solution would be to implement MDL protection for the fts_sync() operation, in MDEV-16678.

For now, I think that we can do some minor cleanup in fts0opt.cc but not touch the logic related to dict_index_t::index_fts_syncing.

Comment by Thirunarayanan Balathandayuthapani [ 2019-05-09 ]

/* Make a concurrent Drop fts Index to wait until sync of that
        fts index is happening in the background */
        for (int retry_count = 0;;) {
                bool    retry = false;
 
                for (inplace_alter_handler_ctx** pctx = ctx_array;
                    *pctx; pctx++) {
                        ha_innobase_inplace_ctx*        ctx
                                = static_cast<ha_innobase_inplace_ctx*>(*pctx);
                        DBUG_ASSERT(new_clustered == ctx->need_rebuild());
 
                        if (dict_fts_index_syncing(ctx->old_table)) {
                                retry = true;
                                break;
                        }
 
                        if (new_clustered && dict_fts_index_syncing(ctx->new_table)) {
                                retry = true;
                                break;
                        }
                }

We don't need to check for the new table in ha_innobase::commit_inplace_alter_table(). Because we just did fts_optimize_remove_table for new table
just before locking data dictionary. The simplest solution is that acquiring MDL lock for fts_sync(). So that DDL won't drop the table or index in the mean time.
It is already done for purge thread when it encounters virtual column.

Comment by Marko Mäkelä [ 2019-05-10 ]

I pushed some cleanup as the first commit associated with MDEV-18220. Because the logic related to dict_index_t::index_fts_syncing was not touched, the assertion should be able to fail when using KILL QUERY or when shutting down the server.

Comment by Marko Mäkelä [ 2020-07-17 ]

elenst, could this have been fixed in 10.5 by MDEV-16678?

Comment by Elena Stepanova [ 2020-07-17 ]

Among my test failure records, I have one of June 3rd, 2020 which was automatically recognized as MDEV-18654. The logs are no longer available, so I can't confirm or deny it was an accurate recognition. It happened on 10.4 main branch, revision hash isn't recorded, but it's typically up-to-date when the tests are run.

Comment by Marko Mäkelä [ 2020-07-17 ]

On buildbot cross-reference, the latest failure of this type was on bb-10.5-MDEV-19176 branch on 2019-12-20 and 2019-12-22, for two builders of the same commit. It is possible that the commit was at fault. Its parent commit 305081a7354f4ef17d2e16ca16f747ee754fee69 did include MDEV-16678, which I believe should have fixed this bug.

There were some follow-up fixes to MDEV-16678 for the MDL acquisition. The last one is MDEV-21344, which was applied to 10.5.1. I think that we can conclude that this bug was fixed in MariaDB 10.5.1.

Generated at Thu Feb 08 08:45:44 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.