Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-23651

InnoDB: Failing assertion: !space->referenced() on DROP TABLE

Details

    Description

      Assert hit during RQG testing.
      # 2020-09-02T03:02:48 [4059144] | [rr 4061206 1280885]2020-09-02 02:59:21 0x7f08f187d300[rr 4061206 1280888]  InnoDB: Assertion failure in file /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc line 2243
      # 2020-09-02T03:02:48 [4059144] | [rr 4061206 1280890]InnoDB: Failing assertion: !space->referenced()
      # 2020-09-02T03:02:48 [4059144] | [rr 4061206 1280892]InnoDB: We intentionally generate a memory trap.
      ...
      Query (0x62b000126238): ALTER TABLE t2 ADD PRIMARY KEY ( col_int, col_text(9) )  /* E_R Thread11 QNO 444 CON_ID 62 */
      # 2020-09-02T03:02:48 [4059144] | Connection ID (thread ID): 62
      # 2020-09-02T03:02:48 [4059144] | [rr 4061206 1300994]Status: NOT_KILLED
       
      (rr) bt
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      #1  0x00007f08efbd9859 in __GI_abort () at abort.c:79
      #2  0x00005637b2e99aa8 in ut_dbg_assertion_failed (expr=expr@entry=0x5637b3f9ca00 "!space->referenced()", file=file@entry=0x5637b3fa3090 "/home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc", line=line@entry=2243)
          at /home/mleich/Server/10.5A/storage/innobase/ut/ut0dbg.cc:60
      #3  0x00005637b316faf3 in fil_delete_tablespace (id=<optimized out>, if_exists=if_exists@entry=false, detached_handles=detached_handles@entry=0x7f08f18728c0) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2243
      #4  0x00005637b2c487aa in row_drop_table_for_mysql (name=<optimized out>, trx=trx@entry=0x22773c0f2b90, sqlcom=sqlcom@entry=SQLCOM_DROP_DB, create_failed=create_failed@entry=false, nonatomic=<optimized out>, nonatomic@entry=false)
          at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3722
      #5  0x00005637b320ffef in fts_drop_table (trx=trx@entry=0x22773c0f2b90, table_name=table_name@entry=0x7f08f1873bc0 "test/FTS_", '0' <repeats 13 times>, "26a_DELETED") at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1443
      #6  0x00005637b321090d in fts_drop_common_tables (trx=trx@entry=0x22773c0f2b90, fts_table=fts_table@entry=0x7f08f1873f60) at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1577
      #7  0x00005637b3210b4c in fts_drop_tables (trx=trx@entry=0x22773c0f2b90, table=table@entry=0x619000274820) at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1688
      #8  0x00005637b2c47a8b in row_drop_ancillary_fts_tables (trx=0x22773c0f2b90, table=0x619000274820) at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3214
      #9  row_drop_table_for_mysql (name=name@entry=0x7f08f1875730 "test/#sql-alter-3df816-3e", trx=trx@entry=0x22773c0f2b90, sqlcom=sqlcom@entry=SQLCOM_ALTER_TABLE, create_failed=create_failed@entry=false, nonatomic=<optimized out>, nonatomic@entry=true)
          at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3682
      #10 0x00005637b2840018 in ha_innobase::delete_table (this=this@entry=0x62b000128fb0, name=name@entry=0x61b0005a1140 "./test/#sql-alter-3df816-3e", sqlcom=sqlcom@entry=SQLCOM_ALTER_TABLE)
          at /home/mleich/Server/10.5A/storage/innobase/handler/ha_innodb.cc:13211
      #11 0x00005637b281c22e in ha_innobase::delete_table (this=0x62b000128fb0, name=0x61b0005a1140 "./test/#sql-alter-3df816-3e") at /home/mleich/Server/10.5A/storage/innobase/handler/ha_innodb.cc:13338
      #12 0x00005637b1b65996 in hton_drop_table (hton=<optimized out>, path=<optimized out>) at /home/mleich/Server/10.5A/sql/handler.cc:564
      #13 0x00005637b19acec9 in THD::rm_temporary_table (this=this@entry=0x62b00011f218, base=0x615000001e98, path=0x61b0005a1140 "./test/#sql-alter-3df816-3e") at /home/mleich/Server/10.5A/sql/temporary_tables.cc:703
      #14 0x00005637b19af049 in THD::free_tmp_table_share (this=this@entry=0x62b00011f218, share=share@entry=0x61b0005a0b98, delete_table=delete_table@entry=true) at /home/mleich/Server/10.5A/sql/handler.h:1718
      #15 0x00005637b19b4f8b in THD::drop_temporary_table (this=this@entry=0x62b00011f218, table=table@entry=0x619000956a98, is_trans=is_trans@entry=0x0, delete_table=delete_table@entry=true) at /home/mleich/Server/10.5A/sql/temporary_tables.cc:669
      #16 0x00005637b15a3be7 in mysql_alter_table (thd=thd@entry=0x62b00011f218, new_db=<optimized out>, new_name=new_name@entry=0x62b000124060, create_info=create_info@entry=0x7f08f1878e50, table_list=<optimized out>, table_list@entry=0x62b0001263d0, 
          alter_info=alter_info@entry=0x7f08f1878d40, order_num=<optimized out>, order=<optimized out>, ignore=<optimized out>, if_exists=<optimized out>) at /home/mleich/Server/10.5A/sql/sql_table.cc:10997
      #17 0x00005637b174eb72 in Sql_cmd_alter_table::execute (this=<optimized out>, thd=0x62b00011f218) at /home/mleich/Server/10.5A/sql/structs.h:559
      #18 0x00005637b12b662a in mysql_execute_command (thd=thd@entry=0x62b00011f218) at /home/mleich/Server/10.5A/sql/sql_parse.cc:5952
      #19 0x00005637b126454a in mysql_parse (thd=thd@entry=0x62b00011f218, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x7f08f187aed0, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false)
          at /home/mleich/Server/10.5A/sql/sql_parse.cc:7994
      #20 0x00005637b1295bb6 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x62b00011f218, packet=packet@entry=0x629001b26219 " ALTER TABLE t2 ADD PRIMARY KEY ( col_int, col_text(9) )  /* E_R Thread11 QNO 444 CON_ID 62 */ ", 
          packet_length=packet_length@entry=95, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /home/mleich/Server/10.5A/sql/sql_class.h:1251
      #21 0x00005637b12a105d in do_command (thd=0x62b00011f218) at /home/mleich/Server/10.5A/sql/sql_parse.cc:1348
      #22 0x00005637b173467c in do_handle_one_connection (connect=<optimized out>, connect@entry=0x6080000031b8, put_in_cache=put_in_cache@entry=true) at /home/mleich/Server/10.5A/sql/sql_connect.cc:1410
      #23 0x00005637b17358b2 in handle_one_connection (arg=arg@entry=0x6080000031b8) at /home/mleich/Server/10.5A/sql/sql_connect.cc:1312
      #24 0x00005637b25b3d03 in pfs_spawn_thread (arg=0x61500000a818) at /home/mleich/Server/10.5A/storage/perfschema/pfs.cc:2201
      #25 0x000032542082d609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #26 0x00007f08efcd6103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      (rr)
       
      origin/10.5, 10.5 0af6e525212f81c2da8185817c6ca2e7c3ac036a 2020-09-02T09:29:57+10:00
      The server claims to be 10.5.6.
       
      Getting RQG
      ==========
      git clone https://github.com/mleich1/rqg --branch experimental RQG
      origin/experimental 8762b0e26ec3c127695bd36b19a00aa6723705d4 2020-09-01T13:43:47+02:00
       
      perl rqg.pl \
      --no-mask \
      --seed=random \
      --engine=InnoDB \
      --queries=100000000 \
      --restart_timeout=120 \
      --mysqld=--log_output=none \
      --mysqld=--loose-innodb-use-atomic-writes \
      --mysqld=--plugin-load-add=file_key_management.so \
      --mysqld=--loose-file-key-management-filename=/RQG/conf/mariadb/encryption_keys.txt \
      --mysqld=--innodb_lock_schedule_algorithm=fcfs \
      --mysqld=--loose-idle_write_transaction_timeout=0 \
      --mysqld=--loose-idle_transaction_timeout=0 \
      --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      --mysqld=--connect_timeout=60 \
      --mysqld=--interactive_timeout=28800 \
      --mysqld=--slave_net_timeout=60 \
      --mysqld=--net_read_timeout=30 \
      --mysqld=--net_write_timeout=60 \
      --mysqld=--loose-table_lock_wait_timeout=50 \
      --mysqld=--wait_timeout=28800 \
      --mysqld=--lock-wait-timeout=86400 \
      --mysqld=--innodb-lock-wait-timeout=50 \
      --reporters=Backtrace \
      --reporters=ErrorLog \
      --reporters=Deadlock1 \
      --validators=None \
      --mysqld=--log-bin \
      --mysqld=--log_bin_trust_function_creators=1 \
      --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      --duration=300 \
      --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
      --grammar=conf/mariadb/table_stress_innodb.yy \
      --gendata=conf/mariadb/table_stress.zz \
      --gendata_sql=conf/mariadb/table_stress.sql \
      --mysqld=--loose-innodb_buffer_pool_instances=3 \
      --mysqld=--loose-innodb_doublewrite=1 \
      --mysqld=--loose-innodb-encryption-threads=7 \
      --mysqld=--innodb_encryption_rotate_key_age=0 \
      --redefine=conf/mariadb/redefine_innodb_undo.yy \
      --threads=13 \
      --mysqld=--innodb-encrypt-log \
      --mysqld=--innodb-encrypt-tables \
      --duration=300 \
      --no_mask \
      --workdir=<local settings< \
      --vardir=<local settings< \
      --mtr-build-thread=<local settings< \
      --basedir1=<local settings< \
      --script_debug=_nix_ \
      --rr=Server \
      --rr_options=--chaos
       
      Error tagging pattern for RQG
      [ 'TBR-671-MDEV-23651'    , 'InnoDB: Assertion failure in file .{1,150}fil0fil.cc.{1,200}InnoDB: Failing assertion: \!space->referenced\(\)' ],
      

      Attachments

        Issue Links

          Activity

            The reference count was incremented from 0 (and never reached 0 after that) in

            10.5

            Thread 8 hit Hardware watchpoint 1: -location $2->n_pending_ops._M_i
             
            Old value = 0
            New value = 1
            fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
                space=<optimized out>, this=<optimized out>)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
            1498	  return space;
            (rr) when
            Current event: 1280868
            (rr) bt
            #0  fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
                space=<optimized out>, this=<optimized out>)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
            #1  fil_space_next (space=0x0, recheck=<optimized out>, 
                encrypt=<optimized out>)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1514
            #2  0x00005637b31a4e19 in fil_crypt_find_space_to_rotate (
                key_state=key_state@entry=0x6d50794c9d40, 
                state=state@entry=0x6d50794c9d60, recheck=recheck@entry=0x6d50794c9d30)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1575
            #3  0x00005637b31a5846 in fil_crypt_thread ()
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1163
            …
            1594			state->space = fil_space_next(state->space, *recheck,
            (rr) s
            fil_space_next (space=0x6160020dab98, recheck=false, encrypt=false)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1510
            1510	{
            (rr) n
            1511	  mutex_enter(&fil_system.mutex);
            (rr) 
            [Switching to Thread 4061206.4071781]
             
            Thread 9 hit Hardware watchpoint 1: -location $2->n_pending_ops._M_i
             
            Old value = 1
            New value = 2
            fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
                space=<optimized out>, this=<optimized out>)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
            1498	  return space;
            (rr) bt
            #0  fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
                space=<optimized out>, this=<optimized out>)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
            #1  fil_space_next (space=0x0, recheck=<optimized out>, 
                encrypt=<optimized out>)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1514
            #2  0x00005637b31a4e19 in fil_crypt_find_space_to_rotate (
                key_state=key_state@entry=0x640001027d40, 
                state=state@entry=0x640001027d60, recheck=recheck@entry=0x640001027d30)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1575   0x00003254208317db <+43>:	xchg   %eax,(%rdi)
            => 0x00003254208317dd <+45>:	cmp    $0x1,%eax
               0x00003254208317e0 <+48>:	jg     0x325420831820 <__GI___pthread_mutex_unlock+112>
             
            #3  0x00005637b31a5846 in fil_crypt_thread ()
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1163
            …
            (rr) when
            Current event: 1280872
            

            At the same time when Thread 8 is about to decrement the reference count for this tablespace, Thread 9 will acquire the same tablespace for processing. A bit later, Thread 9 will release the reference count and Thread 10 will increment and decrement it, all while Thread 8 is still waiting for fil_system.mutex:

            Thread 8 (Thread 4061206.4071774):
            #0  __pthread_mutex_unlock_usercnt (decr=1, mutex=0x6d50794c97a8) at pthread_mutex_unlock.c:58
            #1  __GI___pthread_mutex_unlock (mutex=0x6d50794c97a8) at pthread_mutex_unlock.c:357
            #2  0x00005637b2837118 in OSMutex::exit (this=0x6d50794c97a0) at /home/mleich/Server/10.5A/storage/innobase/include/sync0types.h:434
            #3  MutexDebug<TTASEventMutex<GenericPolicy> >::set (thread_id=18446744073709551615, line=0, filename=0x0, mutex=0x0, this=0x6d50794c9770) at /home/mleich/Server/10.5A/storage/innobase/include/sync0policy.h:59
            #4  MutexDebug<TTASEventMutex<GenericPolicy> >::enter (line=1511, filename=0x5637b3fa6b00 "/home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc", mutex=..., this=0x5637b4c20ab0 <fil_system+48>) at /home/mleich/Server/10.5A/storage/innobase/include/sync0policy.h:126
            #5  PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x5637b4c20aa0 <fil_system+32>, n_spins=<optimized out>, n_delay=<optimized out>, name=name@entry=0x5637b3fa6b00 "/home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc", line=line@entry=1511) at /home/mleich/Server/10.5A/storage/innobase/include/ib0mutex.h:588
            #6  0x00005637b31a052a in fil_space_next (space=0x6160020dab98, recheck=<optimized out>, encrypt=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1511
            

            This situation continues until the assertion fails, at event 1280908. It is unclear why the function is stuck here:

               0x00003254208317db <+43>:	xchg   %eax,(%rdi)
            => 0x00003254208317dd <+45>:	cmp    $0x1,%eax
               0x00003254208317e0 <+48>:	jg     0x325420831820 <__GI___pthread_mutex_unlock+112>
            

            It looks like the failure occurs simply be because rr decided not to let this thread run:

            Thread 9 hit Hardware watchpoint 1: -location $2->n_pending_ops._M_i
             
            Old value = 1
            New value = 2
            fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
                space=<optimized out>, this=<optimized out>)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
            1498	  return space;
            1: x/i $pc
            => 0x5637b31a0b31 <fil_space_next(fil_space_t*, bool, bool)+1864>:	
                lea    -0x1e0(%rbx),%rax
            (rr) disa 1
            (rr) thr 8
            [Switching to thread 8 (Thread 4061206.4071774)]
            #0  __pthread_mutex_unlock_usercnt (decr=1, mutex=0x6d50794c97a8)
                at pthread_mutex_unlock.c:58
            58	pthread_mutex_unlock.c: No such file or directory.
            (rr) display
            1: x/i $pc
            => 0x3254208317dd <__GI___pthread_mutex_unlock+45>:	cmp    $0x1,%eax
            (rr) si
            [rr 4061206 1280885]2020-09-02 02:59:21 0x7f08f187d300[rr 4061206 1280888]  InnoDB: Assertion failure in file /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc line 2243
            

            I think that this bug can only be fixed by removing the reliance on the tablespace reference count in background operations, and always acquiring proper MDL (or an InnoDB table reference). MDEV-19514 and MDEV-16678 solved this problem in 10.5 in many other scenarios. But, even with MDEV-19514 fixed, as noted in MDEV-11634, ibuf_insert_low() could invoke ibuf_read_merge_pages() to merge other buffered changes before buffering one more change, and that merge is not necessarily covered by any MDL or InnoDB table reference.

            marko Marko Mäkelä added a comment - The reference count was incremented from 0 (and never reached 0 after that) in 10.5 Thread 8 hit Hardware watchpoint 1: -location $2->n_pending_ops._M_i   Old value = 0 New value = 1 fil_system_t::keyrotate_next (encrypt=false, recheck=false, space=<optimized out>, this=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498 1498 return space; (rr) when Current event: 1280868 (rr) bt #0 fil_system_t::keyrotate_next (encrypt=false, recheck=false, space=<optimized out>, this=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498 #1 fil_space_next (space=0x0, recheck=<optimized out>, encrypt=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1514 #2 0x00005637b31a4e19 in fil_crypt_find_space_to_rotate ( key_state=key_state@entry=0x6d50794c9d40, state=state@entry=0x6d50794c9d60, recheck=recheck@entry=0x6d50794c9d30) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1575 #3 0x00005637b31a5846 in fil_crypt_thread () at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1163 … 1594 state->space = fil_space_next(state->space, *recheck, (rr) s fil_space_next (space=0x6160020dab98, recheck=false, encrypt=false) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1510 1510 { (rr) n 1511 mutex_enter(&fil_system.mutex); (rr) [Switching to Thread 4061206.4071781]   Thread 9 hit Hardware watchpoint 1: -location $2->n_pending_ops._M_i   Old value = 1 New value = 2 fil_system_t::keyrotate_next (encrypt=false, recheck=false, space=<optimized out>, this=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498 1498 return space; (rr) bt #0 fil_system_t::keyrotate_next (encrypt=false, recheck=false, space=<optimized out>, this=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498 #1 fil_space_next (space=0x0, recheck=<optimized out>, encrypt=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1514 #2 0x00005637b31a4e19 in fil_crypt_find_space_to_rotate ( key_state=key_state@entry=0x640001027d40, state=state@entry=0x640001027d60, recheck=recheck@entry=0x640001027d30) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1575 0x00003254208317db <+43>: xchg %eax,(%rdi) => 0x00003254208317dd <+45>: cmp $0x1,%eax 0x00003254208317e0 <+48>: jg 0x325420831820 <__GI___pthread_mutex_unlock+112>   #3 0x00005637b31a5846 in fil_crypt_thread () at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1163 … (rr) when Current event: 1280872 At the same time when Thread 8 is about to decrement the reference count for this tablespace, Thread 9 will acquire the same tablespace for processing. A bit later, Thread 9 will release the reference count and Thread 10 will increment and decrement it, all while Thread 8 is still waiting for fil_system.mutex : Thread 8 (Thread 4061206.4071774): #0 __pthread_mutex_unlock_usercnt (decr=1, mutex=0x6d50794c97a8) at pthread_mutex_unlock.c:58 #1 __GI___pthread_mutex_unlock (mutex=0x6d50794c97a8) at pthread_mutex_unlock.c:357 #2 0x00005637b2837118 in OSMutex::exit (this=0x6d50794c97a0) at /home/mleich/Server/10.5A/storage/innobase/include/sync0types.h:434 #3 MutexDebug<TTASEventMutex<GenericPolicy> >::set (thread_id=18446744073709551615, line=0, filename=0x0, mutex=0x0, this=0x6d50794c9770) at /home/mleich/Server/10.5A/storage/innobase/include/sync0policy.h:59 #4 MutexDebug<TTASEventMutex<GenericPolicy> >::enter (line=1511, filename=0x5637b3fa6b00 "/home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc", mutex=..., this=0x5637b4c20ab0 <fil_system+48>) at /home/mleich/Server/10.5A/storage/innobase/include/sync0policy.h:126 #5 PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x5637b4c20aa0 <fil_system+32>, n_spins=<optimized out>, n_delay=<optimized out>, name=name@entry=0x5637b3fa6b00 "/home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc", line=line@entry=1511) at /home/mleich/Server/10.5A/storage/innobase/include/ib0mutex.h:588 #6 0x00005637b31a052a in fil_space_next (space=0x6160020dab98, recheck=<optimized out>, encrypt=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1511 This situation continues until the assertion fails, at event 1280908. It is unclear why the function is stuck here: 0x00003254208317db <+43>: xchg %eax,(%rdi) => 0x00003254208317dd <+45>: cmp $0x1,%eax 0x00003254208317e0 <+48>: jg 0x325420831820 <__GI___pthread_mutex_unlock+112> It looks like the failure occurs simply be because rr decided not to let this thread run: Thread 9 hit Hardware watchpoint 1: -location $2->n_pending_ops._M_i   Old value = 1 New value = 2 fil_system_t::keyrotate_next (encrypt=false, recheck=false, space=<optimized out>, this=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498 1498 return space; 1: x/i $pc => 0x5637b31a0b31 <fil_space_next(fil_space_t*, bool, bool)+1864>: lea -0x1e0(%rbx),%rax (rr) disa 1 (rr) thr 8 [Switching to thread 8 (Thread 4061206.4071774)] #0 __pthread_mutex_unlock_usercnt (decr=1, mutex=0x6d50794c97a8) at pthread_mutex_unlock.c:58 58 pthread_mutex_unlock.c: No such file or directory. (rr) display 1: x/i $pc => 0x3254208317dd <__GI___pthread_mutex_unlock+45>: cmp $0x1,%eax (rr) si [rr 4061206 1280885]2020-09-02 02:59:21 0x7f08f187d300[rr 4061206 1280888] InnoDB: Assertion failure in file /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc line 2243 I think that this bug can only be fixed by removing the reliance on the tablespace reference count in background operations, and always acquiring proper MDL (or an InnoDB table reference). MDEV-19514 and MDEV-16678 solved this problem in 10.5 in many other scenarios. But, even with MDEV-19514 fixed, as noted in MDEV-11634 , ibuf_insert_low() could invoke ibuf_read_merge_pages() to merge other buffered changes before buffering one more change, and that merge is not necessarily covered by any MDL or InnoDB table reference.

            Even though the execution trace involves an internal table of a FULLTEXT INDEX, I think that this race condition between background operations and deleting a data file is possible in any DDL operation that involves rebuilding or dropping a normal user table.

            marko Marko Mäkelä added a comment - Even though the execution trace involves an internal table of a FULLTEXT INDEX , I think that this race condition between background operations and deleting a data file is possible in any DDL operation that involves rebuilding or dropping a normal user table.

            I initially forgot to check fil_space_t::stop_new_ops. It turns out that fil_system_t::keyrotate_next() is kind-of disregarding it:

            Thread 50 hit Hardware watchpoint 3: -location sp->stop_new_ops
             
            Old value = false
            New value = true
            fil_check_pending_operations (id=607, truncate=truncate@entry=false, 
                path=path@entry=0x7f08f1871960)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2037
            2037			if (sp->crypt_data) {
            (rr) when
            Current event: 1280847
            (rr) c
            Continuing.
             
            Thread 50 hit Hardware watchpoint 2: -location sp->n_pending_ops
             
            Old value = 
                {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 0}, <No data fields>}
            New value = 
                {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 1}, <No data fields>}
            fil_check_pending_operations (id=607, truncate=truncate@entry=false, 
                path=path@entry=0x7f08f1871960)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2039
            2039				mutex_exit(&fil_system.mutex);
            (rr) c
            Continuing.
             
            Thread 50 hit Hardware watchpoint 2: -location sp->n_pending_ops
             
            Old value = 
                {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 1}, <No data fields>}
            New value = 
                {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 0}, <No data fields>}
            fil_space_t::release (this=0x6160020dab98)
                at /home/mleich/Server/10.5A/storage/innobase/include/fil0fil.h:509
            509		bool release() { auto n= n_pending_ops--; ut_ad(n); return n == 1; }
            (rr) bt
            #0  fil_space_t::release (this=0x6160020dab98)
                at /home/mleich/Server/10.5A/storage/innobase/include/fil0fil.h:509
            #1  fil_check_pending_operations (id=607, truncate=truncate@entry=false, 
                path=path@entry=0x7f08f1871960)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2042
            #2  0x00005637b316e33c in fil_delete_tablespace (id=<optimized out>, 
                if_exists=if_exists@entry=false, 
                detached_handles=detached_handles@entry=0x7f08f18728c0)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2172
            #3  0x00005637b2c487aa in row_drop_table_for_mysql (name=<optimized out>, 
                trx=trx@entry=0x22773c0f2b90, sqlcom=sqlcom@entry=SQLCOM_DROP_DB, 
                create_failed=create_failed@entry=false, nonatomic=<optimized out>, 
                nonatomic@entry=false)
                at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3722
            #4  0x00005637b320ffef in fts_drop_table (trx=trx@entry=0x22773c0f2b90, 
                table_name=table_name@entry=0x7f08f1873bc0 "test/FTS_", '0' <repeats 13 times>, "26a_DELETED")
                at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1443
            #5  0x00005637b321090d in fts_drop_common_tables (
                trx=trx@entry=0x22773c0f2b90, fts_table=fts_table@entry=0x7f08f1873f60)
                at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1577
            #6  0x00005637b3210b4c in fts_drop_tables (trx=trx@entry=0x22773c0f2b90, 
                table=table@entry=0x619000274820)
                at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1688
            #7  0x00005637b2c47a8b in row_drop_ancillary_fts_tables (trx=0x22773c0f2b90, 
                table=0x619000274820)
                at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3214
            #8  row_drop_table_for_mysql (
                name=name@entry=0x7f08f1875730 "test/#sql-alter-3df816-3e", 
                trx=trx@entry=0x22773c0f2b90, sqlcom=sqlcom@entry=SQLCOM_ALTER_TABLE, 
                create_failed=create_failed@entry=false, nonatomic=<optimized out>, 
                nonatomic@entry=true)
                at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3682
            #9  0x00005637b2840018 in ha_innobase::delete_table (
                this=this@entry=0x62b000128fb0, 
                name=name@entry=0x61b0005a1140 "./test/#sql-alter-3df816-3e", 
                sqlcom=sqlcom@entry=SQLCOM_ALTER_TABLE)
                at /home/mleich/Server/10.5A/storage/innobase/handler/ha_innodb.cc:13211
            #10 0x00005637b281c22e in ha_innobase::delete_table (this=0x62b000128fb0, 
            --Type <RET> for more, q to quit, c to continue without paging--q
            Quit
            (rr) c
            Continuing.
            [Switching to Thread 4061206.4071774]
             
            Thread 7 hit Hardware watchpoint 2: -location sp->n_pending_ops
             
            Old value = 
                {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 0}, <No data fields>}
            New value = 
                {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 1}, <No data fields>}
            fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
                space=<optimized out>, this=<optimized out>)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
            1498	  return space;
            (rr) bt
            #0  fil_system_t::keyrotate_next (encrypt=false, recheck=false, 
                space=<optimized out>, this=<optimized out>)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498
            #1  fil_space_next (space=0x0, recheck=<optimized out>, 
                encrypt=<optimized out>)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1514
            #2  0x00005637b31a4e19 in fil_crypt_find_space_to_rotate (
                key_state=key_state@entry=0x6d50794c9d40, 
                state=state@entry=0x6d50794c9d60, recheck=recheck@entry=0x6d50794c9d30)
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1575
            #3  0x00005637b31a5846 in fil_crypt_thread ()
                at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1163
            #4  0x000032542082d609 in start_thread (arg=<optimized out>)
                at pthread_create.c:477
            #5  0x00007f08efcd6103 in clone ()
                at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            According to this extract from the same rr replay trace, the reference count correctly reached 0 when the ALTER TABLE…ALGORITHM=COPY operation dropped the original copy of the table. But, fil_system_t::keyrotate_next() had already checked fil_space_t::is_stopping() before any of the above occurred:

            Thread 8 hit Hardware access (read/write) watchpoint 4: *$5
             
            Value = false
            0x00005637b319b12e in fil_crypt_space_needs_rotation (
                state=state@entry=0x640001027d60, 
                key_state=key_state@entry=0x640001027d40, 
                recheck=recheck@entry=0x640001027d30)
                at /home/mleich/Server/10.5A/storage/innobase/include/fil0fil.h:419
            419		bool is_stopping() const { return stop_new_ops;	}
            (rr) when
            Current event: 1280814
            

            I think that this race condition was introduced by MDEV-15983 in 10.3.7 when fil_space_release() (which acquired fil_system->mutex) was replaced with fil_space_t::release().

            A possible fix might be to fuse stop_new_ops into the most significant bit of n_pending_ops, and make fil_space_t::acquire() use a compare-and-swap operation spinloop that may fail, similar to rw_lock_lock_word_decr(). This should guarantee that n_pending_ops cannot be incremented after stop_new_ops has been set.

            marko Marko Mäkelä added a comment - I initially forgot to check fil_space_t::stop_new_ops . It turns out that fil_system_t::keyrotate_next() is kind-of disregarding it: Thread 50 hit Hardware watchpoint 3: -location sp->stop_new_ops   Old value = false New value = true fil_check_pending_operations (id=607, truncate=truncate@entry=false, path=path@entry=0x7f08f1871960) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2037 2037 if (sp->crypt_data) { (rr) when Current event: 1280847 (rr) c Continuing.   Thread 50 hit Hardware watchpoint 2: -location sp->n_pending_ops   Old value = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 0}, <No data fields>} New value = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 1}, <No data fields>} fil_check_pending_operations (id=607, truncate=truncate@entry=false, path=path@entry=0x7f08f1871960) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2039 2039 mutex_exit(&fil_system.mutex); (rr) c Continuing.   Thread 50 hit Hardware watchpoint 2: -location sp->n_pending_ops   Old value = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 1}, <No data fields>} New value = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 0}, <No data fields>} fil_space_t::release (this=0x6160020dab98) at /home/mleich/Server/10.5A/storage/innobase/include/fil0fil.h:509 509 bool release() { auto n= n_pending_ops--; ut_ad(n); return n == 1; } (rr) bt #0 fil_space_t::release (this=0x6160020dab98) at /home/mleich/Server/10.5A/storage/innobase/include/fil0fil.h:509 #1 fil_check_pending_operations (id=607, truncate=truncate@entry=false, path=path@entry=0x7f08f1871960) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2042 #2 0x00005637b316e33c in fil_delete_tablespace (id=<optimized out>, if_exists=if_exists@entry=false, detached_handles=detached_handles@entry=0x7f08f18728c0) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0fil.cc:2172 #3 0x00005637b2c487aa in row_drop_table_for_mysql (name=<optimized out>, trx=trx@entry=0x22773c0f2b90, sqlcom=sqlcom@entry=SQLCOM_DROP_DB, create_failed=create_failed@entry=false, nonatomic=<optimized out>, nonatomic@entry=false) at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3722 #4 0x00005637b320ffef in fts_drop_table (trx=trx@entry=0x22773c0f2b90, table_name=table_name@entry=0x7f08f1873bc0 "test/FTS_", '0' <repeats 13 times>, "26a_DELETED") at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1443 #5 0x00005637b321090d in fts_drop_common_tables ( trx=trx@entry=0x22773c0f2b90, fts_table=fts_table@entry=0x7f08f1873f60) at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1577 #6 0x00005637b3210b4c in fts_drop_tables (trx=trx@entry=0x22773c0f2b90, table=table@entry=0x619000274820) at /home/mleich/Server/10.5A/storage/innobase/fts/fts0fts.cc:1688 #7 0x00005637b2c47a8b in row_drop_ancillary_fts_tables (trx=0x22773c0f2b90, table=0x619000274820) at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3214 #8 row_drop_table_for_mysql ( name=name@entry=0x7f08f1875730 "test/#sql-alter-3df816-3e", trx=trx@entry=0x22773c0f2b90, sqlcom=sqlcom@entry=SQLCOM_ALTER_TABLE, create_failed=create_failed@entry=false, nonatomic=<optimized out>, nonatomic@entry=true) at /home/mleich/Server/10.5A/storage/innobase/row/row0mysql.cc:3682 #9 0x00005637b2840018 in ha_innobase::delete_table ( this=this@entry=0x62b000128fb0, name=name@entry=0x61b0005a1140 "./test/#sql-alter-3df816-3e", sqlcom=sqlcom@entry=SQLCOM_ALTER_TABLE) at /home/mleich/Server/10.5A/storage/innobase/handler/ha_innodb.cc:13211 #10 0x00005637b281c22e in ha_innobase::delete_table (this=0x62b000128fb0, --Type <RET> for more, q to quit, c to continue without paging--q Quit (rr) c Continuing. [Switching to Thread 4061206.4071774]   Thread 7 hit Hardware watchpoint 2: -location sp->n_pending_ops   Old value = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 0}, <No data fields>} New value = {<std::__atomic_base<unsigned long>> = {static _S_alignment = 8, _M_i = 1}, <No data fields>} fil_system_t::keyrotate_next (encrypt=false, recheck=false, space=<optimized out>, this=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498 1498 return space; (rr) bt #0 fil_system_t::keyrotate_next (encrypt=false, recheck=false, space=<optimized out>, this=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1498 #1 fil_space_next (space=0x0, recheck=<optimized out>, encrypt=<optimized out>) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1514 #2 0x00005637b31a4e19 in fil_crypt_find_space_to_rotate ( key_state=key_state@entry=0x6d50794c9d40, state=state@entry=0x6d50794c9d60, recheck=recheck@entry=0x6d50794c9d30) at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1575 #3 0x00005637b31a5846 in fil_crypt_thread () at /home/mleich/Server/10.5A/storage/innobase/fil/fil0crypt.cc:1163 #4 0x000032542082d609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #5 0x00007f08efcd6103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 According to this extract from the same rr replay trace, the reference count correctly reached 0 when the ALTER TABLE…ALGORITHM=COPY operation dropped the original copy of the table. But, fil_system_t::keyrotate_next() had already checked fil_space_t::is_stopping() before any of the above occurred: Thread 8 hit Hardware access (read/write) watchpoint 4: *$5   Value = false 0x00005637b319b12e in fil_crypt_space_needs_rotation ( state=state@entry=0x640001027d60, key_state=key_state@entry=0x640001027d40, recheck=recheck@entry=0x640001027d30) at /home/mleich/Server/10.5A/storage/innobase/include/fil0fil.h:419 419 bool is_stopping() const { return stop_new_ops; } (rr) when Current event: 1280814 I think that this race condition was introduced by MDEV-15983 in 10.3.7 when fil_space_release() (which acquired fil_system->mutex ) was replaced with fil_space_t::release() . A possible fix might be to fuse stop_new_ops into the most significant bit of n_pending_ops , and make fil_space_t::acquire() use a compare-and-swap operation spinloop that may fail, similar to rw_lock_lock_word_decr() . This should guarantee that n_pending_ops cannot be incremented after stop_new_ops has been set.

            bb-10.4-marko commit 0212c417408983dfd0e973774200d5418603fa88 containing the patch for the current MDEV
            behaved well during RQG testing. There were no issues related to the patch.

            mleich Matthias Leich added a comment - bb-10.4-marko commit 0212c417408983dfd0e973774200d5418603fa88 containing the patch for the current MDEV behaved well during RQG testing. There were no issues related to the patch.

            People

              marko Marko Mäkelä
              mleich Matthias Leich
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.