[MDEV-23651] InnoDB: Failing assertion: !space->referenced() on DROP TABLE Created: 2020-09-02  Updated: 2020-12-17  Resolved: 2020-09-03

Status: Closed
Project: MariaDB Server
Component/s: Encryption, Storage Engine - InnoDB
Affects Version/s: 10.3.7, 10.4.0, 10.5.0
Fix Version/s: 10.3.26, 10.4.16, 10.5.7

Type: Bug Priority: Major
Reporter: Matthias Leich Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: not-10.1, not-10.2, race, regression, rr-profile-analyzed

Issue Links:
Problem/Incident
is caused by MDEV-15983 Reduce fil_system.mutex contention fu... Closed
Relates
relates to MDEV-15949 InnoDB: Failing assertion: space->n_p... Closed
relates to MDEV-23776 Test encryption.create_or_replace fai... Closed

 Description   

Assert hit during RQG testing.
# 2020-09-02T03:02:48 [4059144] | [rr 4061206 1280885]2020-09-02 02:59:21 0x7f08f187d300[rr 4061206 1280888]  InnoDB: Assertion failure in file /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc line 2243
# 2020-09-02T03:02:48 [4059144] | [rr 4061206 1280890]InnoDB: Failing assertion: !space->referenced()
# 2020-09-02T03:02:48 [4059144] | [rr 4061206 1280892]InnoDB: We intentionally generate a memory trap.
...
Query (0x62b000126238): ALTER TABLE t2 ADD PRIMARY KEY ( col_int, col_text(9) )  /* E_R Thread11 QNO 444 CON_ID 62 */
# 2020-09-02T03:02:48 [4059144] | Connection ID (thread ID): 62
# 2020-09-02T03:02:48 [4059144] | [rr 4061206 1300994]Status: NOT_KILLED
 
(rr) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f08efbd9859 in __GI_abort () at abort.c:79
#2  0x00005637b2e99aa8 in ut_dbg_assertion_failed (expr=expr@entry=0x5637b3f9ca00 "!space->referenced()", file=file@entry=0x5637b3fa3090 "/home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc", line=line@entry=2243)
    at /home/mleich/Server/10.5A/storage/innobase/ut/ut0dbg.cc:60
#3  0x00005637b316faf3 in fil_delete_tablespace (id=<optimized out>, if_exists=if_exists@entry=false, detached_handles=detached_handles@entry=0x7f08f18728c0) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2243
#4  0x00005637b2c487aa in row_drop_table_for_mysql (name=<optimized out>, trx=trx@entry=0x22773c0f2b90, sqlcom=sqlcom@entry=SQLCOM_DROP_DB, create_failed=create_failed@entry=false, nonatomic=<optimized out>, nonatomic@entry=false)
    at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3722
#5  0x00005637b320ffef in fts_drop_table (trx=trx@entry=0x22773c0f2b90, table_name=table_name@entry=0x7f08f1873bc0 "test/FTS_", '0' <repeats 13 times>, "26a_DELETED") at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1443
#6  0x00005637b321090d in fts_drop_common_tables (trx=trx@entry=0x22773c0f2b90, fts_table=fts_table@entry=0x7f08f1873f60) at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1577
#7  0x00005637b3210b4c in fts_drop_tables (trx=trx@entry=0x22773c0f2b90, table=table@entry=0x619000274820) at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1688
#8  0x00005637b2c47a8b in row_drop_ancillary_fts_tables (trx=0x22773c0f2b90, table=0x619000274820) at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3214
#9  row_drop_table_for_mysql (name=name@entry=0x7f08f1875730 "test/#sql-alter-3df816-3e", trx=trx@entry=0x22773c0f2b90, sqlcom=sqlcom@entry=SQLCOM_ALTER_TABLE, create_failed=create_failed@entry=false, nonatomic=<optimized out>, nonatomic@entry=true)
    at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3682
#10 0x00005637b2840018 in ha_innobase::delete_table (this=this@entry=0x62b000128fb0, name=name@entry=0x61b0005a1140 "./test/#sql-alter-3df816-3e", sqlcom=sqlcom@entry=SQLCOM_ALTER_TABLE)
    at /home/mleich/Server/10.5A/storage/innobase/handler/ha_innodb.cc:13211
#11 0x00005637b281c22e in ha_innobase::delete_table (this=0x62b000128fb0, name=0x61b0005a1140 "./test/#sql-alter-3df816-3e") at /home/mleich/Server/10.5A/storage/innobase/handler/ha_innodb.cc:13338
#12 0x00005637b1b65996 in hton_drop_table (hton=<optimized out>, path=<optimized out>) at /home/mleich/Server/10.5A/sql/handler.cc:564
#13 0x00005637b19acec9 in THD::rm_temporary_table (this=this@entry=0x62b00011f218, base=0x615000001e98, path=0x61b0005a1140 "./test/#sql-alter-3df816-3e") at /home/mleich/Server/10.5A/sql/temporary_tables.cc:703
#14 0x00005637b19af049 in THD::free_tmp_table_share (this=this@entry=0x62b00011f218, share=share@entry=0x61b0005a0b98, delete_table=delete_table@entry=true) at /home/mleich/Server/10.5A/sql/handler.h:1718
#15 0x00005637b19b4f8b in THD::drop_temporary_table (this=this@entry=0x62b00011f218, table=table@entry=0x619000956a98, is_trans=is_trans@entry=0x0, delete_table=delete_table@entry=true) at /home/mleich/Server/10.5A/sql/temporary_tables.cc:669
#16 0x00005637b15a3be7 in mysql_alter_table (thd=thd@entry=0x62b00011f218, new_db=<optimized out>, new_name=new_name@entry=0x62b000124060, create_info=create_info@entry=0x7f08f1878e50, table_list=<optimized out>, table_list@entry=0x62b0001263d0, 
    alter_info=alter_info@entry=0x7f08f1878d40, order_num=<optimized out>, order=<optimized out>, ignore=<optimized out>, if_exists=<optimized out>) at /home/mleich/Server/10.5A/sql/sql_table.cc:10997
#17 0x00005637b174eb72 in Sql_cmd_alter_table::execute (this=<optimized out>, thd=0x62b00011f218) at /home/mleich/Server/10.5A/sql/structs.h:559
#18 0x00005637b12b662a in mysql_execute_command (thd=thd@entry=0x62b00011f218) at /home/mleich/Server/10.5A/sql/sql_parse.cc:5952
#19 0x00005637b126454a in mysql_parse (thd=thd@entry=0x62b00011f218, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x7f08f187aed0, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false)
    at /home/mleich/Server/10.5A/sql/sql_parse.cc:7994
#20 0x00005637b1295bb6 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x62b00011f218, packet=packet@entry=0x629001b26219 " ALTER TABLE t2 ADD PRIMARY KEY ( col_int, col_text(9) )  /* E_R Thread11 QNO 444 CON_ID 62 */ ", 
    packet_length=packet_length@entry=95, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /home/mleich/Server/10.5A/sql/sql_class.h:1251
#21 0x00005637b12a105d in do_command (thd=0x62b00011f218) at /home/mleich/Server/10.5A/sql/sql_parse.cc:1348
#22 0x00005637b173467c in do_handle_one_connection (connect=<optimized out>, connect@entry=0x6080000031b8, put_in_cache=put_in_cache@entry=true) at /home/mleich/Server/10.5A/sql/sql_connect.cc:1410
#23 0x00005637b17358b2 in handle_one_connection (arg=arg@entry=0x6080000031b8) at /home/mleich/Server/10.5A/sql/sql_connect.cc:1312
#24 0x00005637b25b3d03 in pfs_spawn_thread (arg=0x61500000a818) at /home/mleich/Server/10.5A/storage/perfschema/pfs.cc:2201
#25 0x000032542082d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#26 0x00007f08efcd6103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(rr)
 
origin/10.5, 10.5 0af6e525212f81c2da8185817c6ca2e7c3ac036a 2020-09-02T09:29:57+10:00
The server claims to be 10.5.6.
 
Getting RQG
==========
git clone https://github.com/mleich1/rqg --branch experimental RQG
origin/experimental 8762b0e26ec3c127695bd36b19a00aa6723705d4 2020-09-01T13:43:47+02:00
 
perl rqg.pl \
--no-mask \
--seed=random \
--engine=InnoDB \
--queries=100000000 \
--restart_timeout=120 \
--mysqld=--log_output=none \
--mysqld=--loose-innodb-use-atomic-writes \
--mysqld=--plugin-load-add=file_key_management.so \
--mysqld=--loose-file-key-management-filename=/RQG/conf/mariadb/encryption_keys.txt \
--mysqld=--innodb_lock_schedule_algorithm=fcfs \
--mysqld=--loose-idle_write_transaction_timeout=0 \
--mysqld=--loose-idle_transaction_timeout=0 \
--mysqld=--loose-idle_readonly_transaction_timeout=0 \
--mysqld=--connect_timeout=60 \
--mysqld=--interactive_timeout=28800 \
--mysqld=--slave_net_timeout=60 \
--mysqld=--net_read_timeout=30 \
--mysqld=--net_write_timeout=60 \
--mysqld=--loose-table_lock_wait_timeout=50 \
--mysqld=--wait_timeout=28800 \
--mysqld=--lock-wait-timeout=86400 \
--mysqld=--innodb-lock-wait-timeout=50 \
--reporters=Backtrace \
--reporters=ErrorLog \
--reporters=Deadlock1 \
--validators=None \
--mysqld=--log-bin \
--mysqld=--log_bin_trust_function_creators=1 \
--mysqld=--loose-debug_assert_on_not_freed_memory=0 \
--duration=300 \
--mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
--grammar=conf/mariadb/table_stress_innodb.yy \
--gendata=conf/mariadb/table_stress.zz \
--gendata_sql=conf/mariadb/table_stress.sql \
--mysqld=--loose-innodb_buffer_pool_instances=3 \
--mysqld=--loose-innodb_doublewrite=1 \
--mysqld=--loose-innodb-encryption-threads=7 \
--mysqld=--innodb_encryption_rotate_key_age=0 \
--redefine=conf/mariadb/redefine_innodb_undo.yy \
--threads=13 \
--mysqld=--innodb-encrypt-log \
--mysqld=--innodb-encrypt-tables \
--duration=300 \
--no_mask \
--workdir=<local settings< \
--vardir=<local settings< \
--mtr-build-thread=<local settings< \
--basedir1=<local settings< \
--script_debug=_nix_ \
--rr=Server \
--rr_options=--chaos
 
Error tagging pattern for RQG
[ 'TBR-671-MDEV-23651'    , 'InnoDB: Assertion failure in file .{1,150}fil0fil.cc.{1,200}InnoDB: Failing assertion: \!space->referenced\(\)' ],



 Comments   
Comment by Marko Mäkelä [ 2020-09-03 ]

The reference count was incremented from 0 (and never reached 0 after that) in

10.5

Thread 8 hit Hardware watchpoint 1: -location $2->n_pending_ops._M_i
 
Old value = 0
New value = 1
fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
    space=<optimized out>, this=<optimized out>)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
1498	  return space;
(rr) when
Current event: 1280868
(rr) bt
#0  fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
    space=<optimized out>, this=<optimized out>)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
#1  fil_space_next (space=0x0, recheck=<optimized out>, 
    encrypt=<optimized out>)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1514
#2  0x00005637b31a4e19 in fil_crypt_find_space_to_rotate (
    key_state=key_state@entry=0x6d50794c9d40, 
    state=state@entry=0x6d50794c9d60, recheck=recheck@entry=0x6d50794c9d30)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1575
#3  0x00005637b31a5846 in fil_crypt_thread ()
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1163
1594			state->space = fil_space_next(state->space, *recheck,
(rr) s
fil_space_next (space=0x6160020dab98, recheck=false, encrypt=false)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1510
1510	{
(rr) n
1511	  mutex_enter(&fil_system.mutex);
(rr) 
[Switching to Thread 4061206.4071781]
 
Thread 9 hit Hardware watchpoint 1: -location $2->n_pending_ops._M_i
 
Old value = 1
New value = 2
fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
    space=<optimized out>, this=<optimized out>)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
1498	  return space;
(rr) bt
#0  fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
    space=<optimized out>, this=<optimized out>)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
#1  fil_space_next (space=0x0, recheck=<optimized out>, 
    encrypt=<optimized out>)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1514
#2  0x00005637b31a4e19 in fil_crypt_find_space_to_rotate (
    key_state=key_state@entry=0x640001027d40, 
    state=state@entry=0x640001027d60, recheck=recheck@entry=0x640001027d30)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1575   0x00003254208317db <+43>:	xchg   %eax,(%rdi)
=> 0x00003254208317dd <+45>:	cmp    $0x1,%eax
   0x00003254208317e0 <+48>:	jg     0x325420831820 <__GI___pthread_mutex_unlock+112>
 
#3  0x00005637b31a5846 in fil_crypt_thread ()
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1163
(rr) when
Current event: 1280872

At the same time when Thread 8 is about to decrement the reference count for this tablespace, Thread 9 will acquire the same tablespace for processing. A bit later, Thread 9 will release the reference count and Thread 10 will increment and decrement it, all while Thread 8 is still waiting for fil_system.mutex:

Thread 8 (Thread 4061206.4071774):
#0  __pthread_mutex_unlock_usercnt (decr=1, mutex=0x6d50794c97a8) at pthread_mutex_unlock.c:58
#1  __GI___pthread_mutex_unlock (mutex=0x6d50794c97a8) at pthread_mutex_unlock.c:357
#2  0x00005637b2837118 in OSMutex::exit (this=0x6d50794c97a0) at /home/mleich/Server/10.5A/storage/innobase/include/sync0types.h:434
#3  MutexDebug<TTASEventMutex<GenericPolicy> >::set (thread_id=18446744073709551615, line=0, filename=0x0, mutex=0x0, this=0x6d50794c9770) at /home/mleich/Server/10.5A/storage/innobase/include/sync0policy.h:59
#4  MutexDebug<TTASEventMutex<GenericPolicy> >::enter (line=1511, filename=0x5637b3fa6b00 "/home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc", mutex=..., this=0x5637b4c20ab0 <fil_system+48>) at /home/mleich/Server/10.5A/storage/innobase/include/sync0policy.h:126
#5  PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x5637b4c20aa0 <fil_system+32>, n_spins=<optimized out>, n_delay=<optimized out>, name=name@entry=0x5637b3fa6b00 "/home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc", line=line@entry=1511) at /home/mleich/Server/10.5A/storage/innobase/include/ib0mutex.h:588
#6  0x00005637b31a052a in fil_space_next (space=0x6160020dab98, recheck=<optimized out>, encrypt=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1511

This situation continues until the assertion fails, at event 1280908. It is unclear why the function is stuck here:

   0x00003254208317db <+43>:	xchg   %eax,(%rdi)
=> 0x00003254208317dd <+45>:	cmp    $0x1,%eax
   0x00003254208317e0 <+48>:	jg     0x325420831820 <__GI___pthread_mutex_unlock+112>

It looks like the failure occurs simply be because rr decided not to let this thread run:

Thread 9 hit Hardware watchpoint 1: -location $2->n_pending_ops._M_i
 
Old value = 1
New value = 2
fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
    space=<optimized out>, this=<optimized out>)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
1498	  return space;
1: x/i $pc
=> 0x5637b31a0b31 <fil_space_next(fil_space_t*, bool, bool)+1864>:	
    lea    -0x1e0(%rbx),%rax
(rr) disa 1
(rr) thr 8
[Switching to thread 8 (Thread 4061206.4071774)]
#0  __pthread_mutex_unlock_usercnt (decr=1, mutex=0x6d50794c97a8)
    at pthread_mutex_unlock.c:58
58	pthread_mutex_unlock.c: No such file or directory.
(rr) display
1: x/i $pc
=> 0x3254208317dd <__GI___pthread_mutex_unlock+45>:	cmp    $0x1,%eax
(rr) si
[rr 4061206 1280885]2020-09-02 02:59:21 0x7f08f187d300[rr 4061206 1280888]  InnoDB: Assertion failure in file /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc line 2243

I think that this bug can only be fixed by removing the reliance on the tablespace reference count in background operations, and always acquiring proper MDL (or an InnoDB table reference). MDEV-19514 and MDEV-16678 solved this problem in 10.5 in many other scenarios. But, even with MDEV-19514 fixed, as noted in MDEV-11634, ibuf_insert_low() could invoke ibuf_read_merge_pages() to merge other buffered changes before buffering one more change, and that merge is not necessarily covered by any MDL or InnoDB table reference.

Comment by Marko Mäkelä [ 2020-09-03 ]

Even though the execution trace involves an internal table of a FULLTEXT INDEX, I think that this race condition between background operations and deleting a data file is possible in any DDL operation that involves rebuilding or dropping a normal user table.

Comment by Marko Mäkelä [ 2020-09-03 ]

I initially forgot to check fil_space_t::stop_new_ops. It turns out that fil_system_t::keyrotate_next() is kind-of disregarding it:

Thread 50 hit Hardware watchpoint 3: -location sp->stop_new_ops
 
Old value = false
New value = true
fil_check_pending_operations (id=607, truncate=truncate@entry=false, 
    path=path@entry=0x7f08f1871960)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2037
2037			if (sp->crypt_data) {
(rr) when
Current event: 1280847
(rr) c
Continuing.
 
Thread 50 hit Hardware watchpoint 2: -location sp->n_pending_ops
 
Old value = 
    {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 0}, <No data fields>}
New value = 
    {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 1}, <No data fields>}
fil_check_pending_operations (id=607, truncate=truncate@entry=false, 
    path=path@entry=0x7f08f1871960)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2039
2039				mutex_exit(&fil_system.mutex);
(rr) c
Continuing.
 
Thread 50 hit Hardware watchpoint 2: -location sp->n_pending_ops
 
Old value = 
    {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 1}, <No data fields>}
New value = 
    {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 0}, <No data fields>}
fil_space_t::release (this=0x6160020dab98)
    at /home/mleich/Server/10.5A/storage/innobase/include/fil0fil.h:509
509		bool release() { auto n= n_pending_ops--; ut_ad(n); return n == 1; }
(rr) bt
#0  fil_space_t::release (this=0x6160020dab98)
    at /home/mleich/Server/10.5A/storage/innobase/include/fil0fil.h:509
#1  fil_check_pending_operations (id=607, truncate=truncate@entry=false, 
    path=path@entry=0x7f08f1871960)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2042
#2  0x00005637b316e33c in fil_delete_tablespace (id=<optimized out>, 
    if_exists=if_exists@entry=false, 
    detached_handles=detached_handles@entry=0x7f08f18728c0)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2172
#3  0x00005637b2c487aa in row_drop_table_for_mysql (name=<optimized out>, 
    trx=trx@entry=0x22773c0f2b90, sqlcom=sqlcom@entry=SQLCOM_DROP_DB, 
    create_failed=create_failed@entry=false, nonatomic=<optimized out>, 
    nonatomic@entry=false)
    at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3722
#4  0x00005637b320ffef in fts_drop_table (trx=trx@entry=0x22773c0f2b90, 
    table_name=table_name@entry=0x7f08f1873bc0 "test/FTS_", '0' <repeats 13 times>, "26a_DELETED")
    at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1443
#5  0x00005637b321090d in fts_drop_common_tables (
    trx=trx@entry=0x22773c0f2b90, fts_table=fts_table@entry=0x7f08f1873f60)
    at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1577
#6  0x00005637b3210b4c in fts_drop_tables (trx=trx@entry=0x22773c0f2b90, 
    table=table@entry=0x619000274820)
    at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1688
#7  0x00005637b2c47a8b in row_drop_ancillary_fts_tables (trx=0x22773c0f2b90, 
    table=0x619000274820)
    at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3214
#8  row_drop_table_for_mysql (
    name=name@entry=0x7f08f1875730 "test/#sql-alter-3df816-3e", 
    trx=trx@entry=0x22773c0f2b90, sqlcom=sqlcom@entry=SQLCOM_ALTER_TABLE, 
    create_failed=create_failed@entry=false, nonatomic=<optimized out>, 
    nonatomic@entry=true)
    at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3682
#9  0x00005637b2840018 in ha_innobase::delete_table (
    this=this@entry=0x62b000128fb0, 
    name=name@entry=0x61b0005a1140 "./test/#sql-alter-3df816-3e", 
    sqlcom=sqlcom@entry=SQLCOM_ALTER_TABLE)
    at /home/mleich/Server/10.5A/storage/innobase/handler/ha_innodb.cc:13211
#10 0x00005637b281c22e in ha_innobase::delete_table (this=0x62b000128fb0, 
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(rr) c
Continuing.
[Switching to Thread 4061206.4071774]
 
Thread 7 hit Hardware watchpoint 2: -location sp->n_pending_ops
 
Old value = 
    {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 0}, <No data fields>}
New value = 
    {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 1}, <No data fields>}
fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
    space=<optimized out>, this=<optimized out>)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
1498	  return space;
(rr) bt
#0  fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
    space=<optimized out>, this=<optimized out>)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
#1  fil_space_next (space=0x0, recheck=<optimized out>, 
    encrypt=<optimized out>)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1514
#2  0x00005637b31a4e19 in fil_crypt_find_space_to_rotate (
    key_state=key_state@entry=0x6d50794c9d40, 
    state=state@entry=0x6d50794c9d60, recheck=recheck@entry=0x6d50794c9d30)
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1575
#3  0x00005637b31a5846 in fil_crypt_thread ()
    at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1163
#4  0x000032542082d609 in start_thread (arg=<optimized out>)
    at pthread_create.c:477
#5  0x00007f08efcd6103 in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

According to this extract from the same rr replay trace, the reference count correctly reached 0 when the ALTER TABLE…ALGORITHM=COPY operation dropped the original copy of the table. But, fil_system_t::keyrotate_next() had already checked fil_space_t::is_stopping() before any of the above occurred:

Thread 8 hit Hardware access (read/write) watchpoint 4: *$5
 
Value = false
0x00005637b319b12e in fil_crypt_space_needs_rotation (
    state=state@entry=0x640001027d60, 
    key_state=key_state@entry=0x640001027d40, 
    recheck=recheck@entry=0x640001027d30)
    at /home/mleich/Server/10.5A/storage/innobase/include/fil0fil.h:419
419		bool is_stopping() const { return stop_new_ops;	}
(rr) when
Current event: 1280814

I think that this race condition was introduced by MDEV-15983 in 10.3.7 when fil_space_release() (which acquired fil_system->mutex) was replaced with fil_space_t::release().

A possible fix might be to fuse stop_new_ops into the most significant bit of n_pending_ops, and make fil_space_t::acquire() use a compare-and-swap operation spinloop that may fail, similar to rw_lock_lock_word_decr(). This should guarantee that n_pending_ops cannot be incremented after stop_new_ops has been set.

Comment by Matthias Leich [ 2020-09-03 ]

bb-10.4-marko commit 0212c417408983dfd0e973774200d5418603fa88 containing the patch for the current MDEV
behaved well during RQG testing. There were no issues related to the patch.

Generated at Thu Feb 08 09:24:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.