|
On 10.2: http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/16895/steps/mtr_nm/logs/stdio
The last occurrence in buildbot was over a month ago, but that might just be because it happens rarely. The problem itself is still there; I'm getting it in my tests on the current 10.2 (also with a low probability):
|
10.2 ce195987

2019-05-05 22:34:22 0x7fd9b3fff700 InnoDB: Assertion failure in file /data/src/10.2/storage/innobase/pars/pars0pars.cc line 815
InnoDB: Failing assertion: sym_node->table != NULL

#6 0x000055fb803504e8 in ut_dbg_assertion_failed (expr=0x55fb808aa587 "sym_node->table != NULL", file=0x55fb808aa1e8 "/data/src/10.2/storage/innobase/pars/pars0pars.cc", line=815) at /data/src/10.2/storage/innobase/ut/ut0dbg.cc:60
#7 0x000055fb8022ef79 in pars_retrieve_table_def (sym_node=0x7fd9a0152d28) at /data/src/10.2/storage/innobase/pars/pars0pars.cc:815
#8 0x000055fb8022fc98 in pars_insert_statement (table_sym=0x7fd9a0152d28, values_list=0x7fd9a0152e30, select=0x0) at /data/src/10.2/storage/innobase/pars/pars0pars.cc:1315
#9 0x000055fb804af3b7 in yyparse () at pars0grm.y:374
#10 0x000055fb80231040 in pars_sql (info=0x7fd9a0152318, str=0x7fd9a0003390 "PROCEDURE P() IS\nBEGIN\nINSERT INTO $index_table_name VALUES (:token, :first_doc_id, :last_doc_id, :doc_count, :ilist);\nEND;\n") at /data/src/10.2/storage/innobase/pars/pars0pars.cc:2127
#11 0x000055fb8048fe09 in fts_parse_sql (fts_table=0x7fd9b3ffe8b0, info=0x7fd9a0152318, sql=0x55fb80982978 "BEGIN\nINSERT INTO $index_table_name VALUES (:token, :first_doc_id, :last_doc_id, :doc_count, :ilist);") at /data/src/10.2/storage/innobase/fts/fts0sql.cc:183
#12 0x000055fb80473e53 in fts_write_node (trx=0x7fd9cba2c3a8, graph=0x7fd9580517c0, fts_table=0x7fd9b3ffe8b0, word=0x7fd958207530, node=0x7fd958205b98) at /data/src/10.2/storage/innobase/fts/fts0fts.cc:3952
#13 0x000055fb80474414 in fts_sync_write_words (trx=0x7fd9cba2c3a8, index_cache=0x7fd958120b28, unlock_cache=true) at /data/src/10.2/storage/innobase/fts/fts0fts.cc:4090
#14 0x000055fb804749b0 in fts_sync_index (sync=0x7fd958051660, index_cache=0x7fd958120b28) at /data/src/10.2/storage/innobase/fts/fts0fts.cc:4174
#15 0x000055fb8047529f in fts_sync (sync=0x7fd958051660, unlock_cache=true, wait=false, has_dict=false) at /data/src/10.2/storage/innobase/fts/fts0fts.cc:4419
#16 0x000055fb8047558e in fts_sync_table (table=0x7fd9580503a8, unlock_cache=true, wait=false, has_dict=false) at /data/src/10.2/storage/innobase/fts/fts0fts.cc:4503
#17 0x000055fb80484bc3 in fts_optimize_sync_table (table_id=193) at /data/src/10.2/storage/innobase/fts/fts0opt.cc:2830
#18 0x000055fb80484f90 in fts_optimize_thread (arg=0x55fb8345b630) at /data/src/10.2/storage/innobase/fts/fts0opt.cc:2943
#19 0x00007fd9d238c4a4 in start_thread (arg=0x7fd9b3fff700) at pthread_create.c:456
#20 0x00007fd9d08d4d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
|
All threads: threads_full_2019-05-05
|
|
The problem is that fts_sync_table() is attempting to access a nonexistent internal table for the fulltext index. Likely explanations include that the table was renamed to a different database, or it was dropped (DROP INDEX, DROP TABLE) or rebuilt (ALTER TABLE, TRUNCATE, OPTIMIZE TABLE).
I would tend to believe that fts_optimize_thread() fails to properly increment the table reference count, or to acquire an InnoDB table lock, or a metadata lock (MDL).
|
|
The crash dump says that the current statement is TRUNCATE. MDEV-13564 implements that internally using DROP TABLE, among other things. It could be that this implementation makes the crash more probable.
|
|
The test is using debug instrumentation that does not look deterministic:
	case FTS_MSG_SYNC_TABLE:
		DBUG_EXECUTE_IF(
			"fts_instrument_msg_sync_sleep",
			os_thread_sleep(300000););

		fts_optimize_sync_table(
			*static_cast<table_id_t*>(msg->ptr));
|
That said, this 0.3-second sleep should not be blamed for the crash. I see that the table is being opened properly, incrementing the reference count:
/** Sync fts cache of a table
@param[in]	table_id	table id */
void
fts_optimize_sync_table(
	table_id_t	table_id)
{
	dict_table_t*	table = NULL;

	table = dict_table_open_on_id(table_id, FALSE, DICT_TABLE_OP_NORMAL);

	if (table) {
		if (dict_table_has_fts_index(table) && table->fts->cache) {
			fts_sync_table(table, true, false, false);
		}

		dict_table_close(table, FALSE, FALSE);
	}
}
|
I see that there was an attempt to fix this in MySQL 5.6.38 that was merged to MariaDB as well. The convoluted piece of code that was added to row_drop_table_for_mysql() appears to suffer from a race condition with fts_sync():
	for (;;) {
		bool retry = false;
		if (dict_fts_index_syncing(table)) {
			retry = true;
		}
		if (!retry) {
			break;
		}
		DICT_BG_YIELD(trx);
	}
	row_mysql_unlock_data_dictionary(trx);
	fts_optimize_remove_table(table);
	row_mysql_lock_data_dictionary(trx);
The problem appears to be that we remove the table from the queue only after we have observed that fts_sync() is not running; between that check and the removal, a new sync can start. I believe that we should first invoke fts_optimize_remove_table(table) and then ensure that the indexes have really been removed from the fts_optimize_thread queue. Alternatively, set some flag on the table or index so that it will not be picked up by fts_sync_table().
I was unable to repeat the failure even with the following patch:
diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc
index a06c3cbd1a8..2c06273ffbc 100644
--- a/storage/innobase/row/row0mysql.cc
+++ b/storage/innobase/row/row0mysql.cc
@@ -3348,6 +3348,7 @@ row_drop_table_for_mysql(
 		}
 		DICT_BG_YIELD(trx);
 	}
+	os_thread_sleep(300000);
 	row_mysql_unlock_data_dictionary(trx);
 	fts_optimize_remove_table(table);
 	row_mysql_lock_data_dictionary(trx);
|
|
|
The test case was added in MDEV-16855, which cleaned up something in the synchronization.
The function fts_optimize_remove_table() always waited for the removal of the table from the work queue, starting from the very first public release of the code in MySQL 5.6.4.
The problem seems to be that the removal and the check for activity are reversed in row_drop_table_for_mysql(). I hope that the problem will disappear with the following code change:
diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc
index a06c3cbd1a8..72e3d6ea75b 100644
--- a/storage/innobase/row/row0mysql.cc
+++ b/storage/innobase/row/row0mysql.cc
@@ -3338,18 +3338,11 @@ row_drop_table_for_mysql(
 	ut_ad(!table->fts->add_wq);
 	ut_ad(lock_trx_has_sys_table_locks(trx) == 0);

-	for (;;) {
-		bool retry = false;
-		if (dict_fts_index_syncing(table)) {
-			retry = true;
-		}
-		if (!retry) {
-			break;
-		}
-		DICT_BG_YIELD(trx);
-	}
 	row_mysql_unlock_data_dictionary(trx);
-	fts_optimize_remove_table(table);
+	do {
+		fts_optimize_remove_table(table);
+	} while (dict_fts_index_syncing(table));
+
 	row_mysql_lock_data_dictionary(trx);
 	}
|
|
|
|
On shutdown, fts_drop_index() can invoke fts_free() in an unsafe fashion:
	while (index->index_fts_syncing
	       && !trx_is_interrupted(trx)) {
		DICT_BG_YIELD(trx);
	}

	fts_free(table);
|
The predicate trx_is_interrupted(trx) would hold on all transactions on shutdown.
MDEV-18220 should be very closely related to this. Perhaps that test case involved KILL QUERY?
I cannot find any logic in the concurrency rules around the field dict_index_t::index_fts_syncing. I think that we’d better remove this field and ensure correctness by design, somehow.
|
|
The attached MTR test case mdev18654a.test fails with this assertion failure frequently enough on a 10.2 54d0a55ad debug build (with and without the patch above). It's highly non-deterministic, but short. Run with e.g. --repeat=100. For me it usually fails within 30 attempts, which takes 1-2 minutes.
The test case is dirty and for reproducing purposes only!
Please note that since the assertion fails upon shutdown, MTR doesn't produce a proper failure report, it looks like this:
***Warnings generated in error logs during shutdown after running tests: bug.mdev18654a

Attempting backtrace. You can use the following information to find out

mysql-test-run: *** ERROR: Test suite aborted
|
But the error log should show the assertion failure.
|
|
For me, mdev18654a.test is crashing instantly, even after I removed dict_index_t::index_fts_syncing and tried to implement more appropriate synchronization.
|
|
mdev18654a.test repeats the problem for me only in a -O2 build, not in a non-optimized debug build. We fail to find an internal table for one of the fulltext indexes of table id 169 (table400_innodb_int_autoinc, which was renamed to a #sql2 name prefix and is about to be dropped). Here are a couple of occasions: a table name ending in d2_INDEX_6, and another one ending in 103_INDEX_6. It would appear to me that these tables were never created in the first place.
|
|
The bug revealed by mdev18654a.test is very interesting. It turns out that the supposedly ‘missing’ table actually still exists not only in SYS_TABLES, but also in dict_sys. The test is actually repeating MDEV-18220: fts_get_table_name_prefix() is dereferencing a freed table->name that used to contain the #sql name of the intermediate copy of the base table. The current table->name would already point to the user table name at that point. AddressSanitizer confirms this:
|
10.2 e0271a7b43c6df652c6a074858853a6d0da20c1e

==9899==ERROR: AddressSanitizer: heap-use-after-free on address 0x60300003c160 at pc 0x55fe91f1c349 bp 0x7f259be91050 sp 0x7f259be907f8
READ of size 2 at 0x60300003c160 thread T15
#0 0x55fe91f1c348 in __interceptor_strlen (/dev/shm/10.2c7/sql/mysqld+0xc68348)
#1 0x55fe936e1b55 in fts_get_table_name_prefix(fts_table_t const*) /mariadb/10.2/storage/innobase/fts/fts0sql.cc:108:34
#2 0x55fe936e1f72 in fts_get_table_name(fts_table_t const*, char*) /mariadb/10.2/storage/innobase/fts/fts0sql.cc:148:16
…

freed by thread T29 here:
#0 0x55fe91fb1162 in realloc (/dev/shm/10.2c7/sql/mysqld+0xcfd162)
#1 0x55fe93548e97 in dict_table_rename_in_cache(dict_table_t*, char const*, bool, bool) /mariadb/10.2/storage/innobase/dict/dict0dict.cc:1728:4
#2 0x55fe93196ad9 in row_rename_table_for_mysql(char const*, char const*, trx_t*, bool, bool) /mariadb/10.2/storage/innobase/row/row0mysql.cc:4503:9
#3 0x55fe92eedb0f in innobase_rename_table(trx_t*, char const*, char const*, bool, bool) /mariadb/10.2/storage/innobase/handler/ha_innodb.cc:13403:10
|
If the freed string contains suitable garbage, then fts_get_table_name_prefix() would return a string that starts with FTS_, lacking the database name prefix and a '/'. Only a few internal SYS_ tables that exist in the InnoDB system tablespace lack the '/' character in their name. I wasted at least 1 day because I missed the fact that the table name is incorrect.
I think that we need two independent changes: the removal of dict_index_t::index_fts_syncing and the related cleanup (in this MDEV-18654), and the fix of the race condition with rename operations (MDEV-18220).
|
|
While the dict_index_t::index_fts_syncing field feels ugly and seems to be prone to race conditions, my current attempt at removing it would cause problems, as pointed out by thiru. A better solution would be to implement MDL protection for the fts_sync() operation, in MDEV-16678.
For now, I think that we can do some minor cleanup in fts0opt.cc but not touch the logic related to dict_index_t::index_fts_syncing.
|
|
	/* Make a concurrent Drop fts Index to wait until sync of that
	fts index is happening in the background */
	for (int retry_count = 0;;) {
		bool retry = false;

		for (inplace_alter_handler_ctx** pctx = ctx_array;
		     *pctx; pctx++) {
			ha_innobase_inplace_ctx* ctx
				= static_cast<ha_innobase_inplace_ctx*>(*pctx);
			DBUG_ASSERT(new_clustered == ctx->need_rebuild());

			if (dict_fts_index_syncing(ctx->old_table)) {
				retry = true;
				break;
			}

			if (new_clustered && dict_fts_index_syncing(ctx->new_table)) {
				retry = true;
				break;
			}
		}
|
We don't need to check the new table in ha_innobase::commit_inplace_alter_table(), because we already called fts_optimize_remove_table() for the new table just before locking the data dictionary. The simplest solution is to acquire an MDL lock for fts_sync(), so that DDL cannot drop the table or index in the meantime.
This is already done for the purge thread when it encounters a virtual column.
|
|
I pushed some cleanup as the first commit associated with MDEV-18220. Because the logic related to dict_index_t::index_fts_syncing was not touched, the assertion can still fail when using KILL QUERY or when shutting down the server.
|
|
elenst, could this have been fixed in 10.5 by MDEV-16678?
|
|
Among my test failure records, I have one of June 3rd, 2020 which was automatically recognized as MDEV-18654. The logs are no longer available, so I can't confirm or deny it was an accurate recognition. It happened on 10.4 main branch, revision hash isn't recorded, but it's typically up-to-date when the tests are run.
|
|
On buildbot cross-reference, the latest failure of this type was on bb-10.5-MDEV-19176 branch on 2019-12-20 and 2019-12-22, for two builders of the same commit. It is possible that the commit was at fault. Its parent commit 305081a7354f4ef17d2e16ca16f747ee754fee69 did include MDEV-16678, which I believe should have fixed this bug.
There were some follow-up fixes to MDEV-16678 for the MDL acquisition. The last one is MDEV-21344, which was applied to 10.5.1. I think that we can conclude that this bug was fixed in MariaDB 10.5.1.
|