MariaDB Server: MDEV-21217

innodb_force_recovery=2 may wrongly abort the rollback of recovered transactions

Description

      The assert hit during RQG testing was first observed on
      origin/10.5 8ed646f0712b8f459439dfcdc3cf4ae45d79ec95 2019-12-02T13:35:54+03:00
      The corresponding artifacts (archive with core file etc.) of that failing test had to be deleted later.
       
      The server error log contained
      #0  __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:62
      #1  0x00005632ef370321 in my_write_core (sig=6) at mysys/stacktrace.c:518
      #2  0x00005632eea7659f in handle_fatal_signal (sig=6) at sql/signal_handler.cc:343
      #3  <signal handler called>
      #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
      #5  0x00007f8ef32fc37a in __GI_abort () at abort.c:89
      #6  0x00007f8ef32f2b47 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x5632ef83c7d4 "trx->lock.table_cached == 0", file=file@entry=0x5632ef83c3a0 "storage/innobase/trx/trx0trx.cc", 
          line=line@entry=381, function=function@entry=0x5632ef83e8d0 <trx_create()::__PRETTY_FUNCTION__> "trx_t* trx_create()") at assert.c:92
      #7  0x00007f8ef32f2bf2 in __GI___assert_fail (assertion=0x5632ef83c7d4 "trx->lock.table_cached == 0", file=0x5632ef83c3a0 "storage/innobase/trx/trx0trx.cc", line=381, 
          function=0x5632ef83e8d0 <trx_create()::__PRETTY_FUNCTION__> "trx_t* trx_create()") at assert.c:101
      #8  0x00005632ef0f14bc in trx_create () at storage/innobase/trx/trx0trx.cc:381
      #9  0x00005632eee729f2 in innobase_trx_allocate (thd=0x7f8eb8001bd8) at storage/innobase/handler/ha_innodb.cc:2526
      #10 0x00005632eee72ad1 in check_trx_exists (thd=0x7f8eb8001bd8) at storage/innobase/handler/ha_innodb.cc:2551
      #11 0x00005632eee9e303 in ha_innobase::update_thd (this=0x7f8eb800f3c0, thd=0x7f8eb8001bd8) at storage/innobase/handler/ha_innodb.cc:2767
      #12 0x00005632eee8d0dc in ha_innobase::info_low (this=0x7f8eb800f3c0, flag=282, is_analyze=false) at storage/innobase/handler/ha_innodb.cc:14493
      #13 0x00005632eee8dd1d in ha_innobase::info (this=0x7f8eb800f3c0, flag=282) at storage/innobase/handler/ha_innodb.cc:14806
      #14 0x00005632eee7a5ae in ha_innobase::open (this=0x7f8eb800f3c0, name=0x7f8eb800bda8 "./mysql/gtid_slave_pos") at storage/innobase/handler/ha_innodb.cc:6117
      #15 0x00005632eea7f3f8 in handler::ha_open (this=0x7f8eb800f3c0, table_arg=0x7f8eb800e1c8, name=0x7f8eb800bda8 "./mysql/gtid_slave_pos", mode=2, test_if_locked=18, mem_root=0x0, partitions_to_open=0x0)
          at sql/handler.cc:2736
      #16 0x00005632ee8722a9 in open_table_from_share (thd=0x7f8eb8001bd8, share=0x7f8eb800b810, alias=0x7f8eec21e3a8, db_stat=33, prgflag=8, ha_open_flags=18, outparam=0x7f8eb800e1c8, is_create_table=false, 
          partitions_to_open=0x0) at sql/table.cc:4094
      #17 0x00005632ee68dcc4 in open_table (thd=0x7f8eb8001bd8, table_list=0x7f8eec21e360, ot_ctx=0x7f8eec21e190) at sql/sql_base.cc:2074
      #18 0x00005632ee69175d in open_and_process_table (thd=0x7f8eb8001bd8, tables=0x7f8eec21e360, counter=0x7f8eec21e224, flags=0, prelocking_strategy=0x7f8eec21e2a8, has_prelocking_list=false, ot_ctx=0x7f8eec21e190)
          at sql/sql_base.cc:3837
      #19 0x00005632ee6929f7 in open_tables (thd=0x7f8eb8001bd8, options=..., start=0x7f8eec21e208, counter=0x7f8eec21e224, flags=0, prelocking_strategy=0x7f8eec21e2a8) at sql/sql_base.cc:4311
      #20 0x00005632ee694b74 in open_and_lock_tables (thd=0x7f8eb8001bd8, options=..., tables=0x7f8eec21e360, derived=false, flags=0, prelocking_strategy=0x7f8eec21e2a8) at sql/sql_base.cc:5204
      #21 0x00005632ee64ea15 in open_and_lock_tables (thd=0x7f8eb8001bd8, tables=0x7f8eec21e360, derived=false, flags=0) at sql/sql_base.h:506
      #22 0x00005632ee8bc86d in scan_one_gtid_slave_pos_table (thd=0x7f8eb8001bd8, hash=0x7f8eec21ee10, array=0x7f8eec21edb0, tablename=0x7f8eb800a6e0, out_hton=0x7f8eec21ea70) at sql/rpl_rli.cc:1531
      #23 0x00005632ee8bd633 in load_gtid_state_cb (thd=0x7f8eb8001bd8, table_name=0x7f8eb800a6e0, arg=0x7f8eec21edd0) at sql/rpl_rli.cc:1806
      #24 0x00005632ee8bd12a in scan_all_gtid_slave_pos_table (thd=0x7f8eb8001bd8, cb=0x5632ee8bd5f1 <load_gtid_state_cb(THD*, LEX_CSTRING*, void*)>, cb_data=0x7f8eec21edd0) at sql/rpl_rli.cc:1675
      #25 0x00005632ee8bd7f1 in rpl_load_gtid_slave_state (thd=0x7f8eb8001bd8) at sql/rpl_rli.cc:1843
      #26 0x00005632eebba326 in binlog_background_thread (arg=0x0) at sql/log.cc:9971
      #27 0x00007f8ef413f6da in start_thread (arg=0x7f8eec21f700) at pthread_create.c:456
      #28 0x00007f8ef33cdd7f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
       
      What follows are the results of another replay on
      origin/10.5 d2c593c2a66549abe67c4e73dbefeb876fa5ed9d 2020-06-12T16:22:03+03:00, a build with ASAN
      which reports itself as 10.5.4.
      ...
      2020-06-12 16:35:45 0 [Note] Crash recovery finished.
      ...
      2020-06-12 16:35:50 0 [Note] /home/mleich/Server_bin/10.5_asan//bin/mysqld: ready for connections.
      2020-06-12T16:36:58 [67872] | Version: '10.5.4-MariaDB-debug-log'  socket: ...
      2020-06-12T16:36:58 [67872] | mysqld: /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:376: trx_t* trx_create(): Assertion `trx->lock.table_cached == 0' failed.
      ....    Running under "rr" together with ASAN etc. meant that the usual backtrace did not appear in the server error log.
       
      Backtrace based on rr trace:
      -------------------------------------------
      Thread 2 (Thread 59357.95740):
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
      #1  0x00007fda70738801 in __GI_abort () at abort.c:79
      #2  0x00007fda7072839a in __assert_fail_base (fmt=0x7fda708af7d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x55db86822240 "trx->lock.table_cached == 0", 
          file=file@entry=0x55db86821a00 "/home/mleich/10.5/storage/innobase/trx/trx0trx.cc", line=line@entry=376, function=function@entry=0x55db86827220 <trx_create()::__PRETTY_FUNCTION__> "trx_t* trx_create()") at assert.c:92
      #3  0x00007fda70728412 in __GI___assert_fail (assertion=0x55db86822240 "trx->lock.table_cached == 0", file=0x55db86821a00 "/home/mleich/10.5/storage/innobase/trx/trx0trx.cc", line=376, 
          function=0x55db86827220 <trx_create()::__PRETTY_FUNCTION__> "trx_t* trx_create()") at assert.c:101
      #4  0x000055db8523576a in trx_create () at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:376
      #5  0x000055db84cf2b02 in innobase_trx_allocate (thd=0x62b00007e218) at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:2465
      #6  0x000055db84cf2cea in check_trx_exists (thd=0x62b00007e218) at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:2490
      #7  0x000055db84d5aa1f in ha_innobase::update_thd (this=0x61d0007e90b8, thd=0x62b00007e218) at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:2706
      #8  0x000055db84d32ede in ha_innobase::info_low (this=0x61d0007e90b8, flag=282, is_analyze=false) at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:14395
      #9  0x000055db84d34c73 in ha_innobase::info (this=0x61d0007e90b8, flag=282) at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:14707
      #10 0x000055db84d041fd in ha_innobase::open (this=0x61d0007e90b8, name=0x61b000080250 "./test/E") at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:6032
      #11 0x000055db84220eee in handler::ha_open (this=0x61d0007e90b8, table_arg=0x620000040098, name=0x61b000080250 "./test/E", mode=2, test_if_locked=18, mem_root=0x0, partitions_to_open=0x0) at /home/mleich/10.5/sql/handler.cc:2906
      #12 0x000055db83d862c7 in open_table_from_share (thd=0x62b00007e218, share=0x61b00007fcb8, alias=0x629000401280, db_stat=33, prgflag=8, ha_open_flags=18, outparam=0x620000040098, is_create_table=false, partitions_to_open=0x0)
          at /home/mleich/10.5/sql/table.cc:4185
      #13 0x000055db838ad314 in open_table (thd=0x62b00007e218, table_list=0x629000401238, ot_ctx=0x373f36466140) at /home/mleich/10.5/sql/sql_base.cc:1981
      #14 0x000055db838b6f22 in open_and_process_table (thd=0x62b00007e218, tables=0x629000401238, counter=0x373f36466260, flags=1090, prelocking_strategy=0x373f364662a0, has_prelocking_list=false, ot_ctx=0x373f36466140)
          at /home/mleich/10.5/sql/sql_base.cc:3780
      #15 0x000055db838b9aca in open_tables (thd=0x62b00007e218, options=..., start=0x373f36466230, counter=0x373f36466260, flags=1090, prelocking_strategy=0x373f364662a0) at /home/mleich/10.5/sql/sql_base.cc:4252
      #16 0x000055db838a2eae in open_tables (thd=0x62b00007e218, tables=0x373f36466230, counter=0x373f36466260, flags=1090, prelocking_strategy=0x373f364662a0) at /home/mleich/10.5/sql/sql_base.h:264
      #17 0x000055db838bf6de in open_normal_and_derived_tables (thd=0x62b00007e218, tables=0x629000401238, flags=1090, dt_phases=35) at /home/mleich/10.5/sql/sql_base.cc:5221
      #18 0x000055db838bfc5a in open_tables_only_view_structure (thd=0x62b00007e218, table_list=0x629000401238, can_deadlock=false) at /home/mleich/10.5/sql/sql_base.cc:5272
      #19 0x000055db83c0c0c2 in fill_schema_table_by_open (thd=0x62b00007e218, mem_root=0x62b000083ce0, is_show_fields_or_keys=true, table=0x624000488138, schema_table=0x55db87399260 <schema_tables+1920>, orig_db_name=0x62b000085800, 
          orig_table_name=0x62b000085810, open_tables_state_backup=0x373f36468510, can_deadlock=false) at /home/mleich/10.5/sql/sql_show.cc:4600
      #20 0x000055db83c0ebd4 in get_all_tables (thd=0x62b00007e218, tables=0x62b000087238, cond=0x0) at /home/mleich/10.5/sql/sql_show.cc:5125
      #21 0x000055db83c41bc3 in get_schema_tables_result (join=0x62b0000883a0, executed_place=PROCESSED_BY_JOIN_EXEC) at /home/mleich/10.5/sql/sql_show.cc:8673
      #22 0x000055db83b143b9 in JOIN::exec_inner (this=0x62b0000883a0) at /home/mleich/10.5/sql/sql_select.cc:4401
      #23 0x000055db83b12268 in JOIN::exec (this=0x62b0000883a0) at /home/mleich/10.5/sql/sql_select.cc:4225
      #24 0x000055db83b16322 in mysql_select (thd=0x62b00007e218, tables=0x62b000087238, fields=..., conds=0x0, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2684619520, result=0x62b000088370, unit=0x62b000082230, 
          select_lex=0x62b000082a30) at /home/mleich/10.5/sql/sql_select.cc:4649
      #25 0x000055db83ae84d3 in handle_select (thd=0x62b00007e218, lex=0x62b000082168, result=0x62b000088370, setup_tables_done_option=0) at /home/mleich/10.5/sql/sql_select.cc:417
      #26 0x000055db83a58ca5 in execute_sqlcom_select (thd=0x62b00007e218, all_tables=0x62b000087238) at /home/mleich/10.5/sql/sql_parse.cc:6208
      #27 0x000055db83a47fd4 in mysql_execute_command (thd=0x62b00007e218) at /home/mleich/10.5/sql/sql_parse.cc:3939
      #28 0x000055db83a63df9 in mysql_parse (thd=0x62b00007e218, rawbuf=0x62b000085238 "SHOW KEYS FROM `test`.`E`", length=25, parser_state=0x373f3646a8f0, is_com_multi=false, is_next_command=false) at /home/mleich/10.5/sql/sql_parse.cc:7992
      #29 0x000055db83a3a94d in dispatch_command (command=COM_QUERY, thd=0x62b00007e218, packet=0x6290003ed219 "SHOW KEYS FROM `test`.`E`", packet_length=25, is_com_multi=false, is_next_command=false) at /home/mleich/10.5/sql/sql_parse.cc:1874
      #30 0x000055db83a371cb in do_command (thd=0x62b00007e218) at /home/mleich/10.5/sql/sql_parse.cc:1355
      #31 0x000055db83e64708 in do_handle_one_connection (connect=0x6080045091b8, put_in_cache=true) at /home/mleich/10.5/sql/sql_connect.cc:1411
      #32 0x000055db83e64066 in handle_one_connection (arg=0x6080045091b8) at /home/mleich/10.5/sql/sql_connect.cc:1313
      #33 0x000055db84b1cac1 in pfs_spawn_thread (arg=0x615000008f18) at /home/mleich/10.5/storage/perfschema/pfs.cc:2201
      #34 0x0000104016d8d6db in start_thread (arg=0x373f3646c300) at pthread_create.c:463
      #35 0x00007fda7081988f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
       
      Workflow of the RQG test
      0. Bootstrap
      1. Start the server with innodb_force_recovery=2.
      2. Create and fill all tables which get used later.
      3. Several concurrent sessions run excessive DML on these tables.
           ~ 280 s after the start of step 3, the server process gets killed (SIGKILL).
      4. The attempt to restart that server with innodb_force_recovery=2 succeeded.
      5. Then come checks of the integrity of the data content, and there a
           SHOW KEYS FROM `test`.`E` caused the assert.
       
      RQG
      git clone https://github.com/mleich1/rqg --branch experimental RQG
      origin/experimental 611a31e38a3bc157392c0748c95991b6a248bb3b 2020-06-09T15:31:46+02:00
       
      perl rqg.pl \                        
      --grammar=conf/mariadb/innodb_compression_encryption.yy \
      --gendata=conf/mariadb/innodb_compression_encryption.zz \
      --mysqld=--plugin-load-add=file_key_management.so \
      --mysqld=--loose-file-key-management-filename=conf/mariadb/encryption_keys.txt \
      --mysqld=--innodb_use_native_aio=1 \
      --mysqld=--innodb_stats_persistent=off \
      --mysqld=--innodb_lock_schedule_algorithm=fcfs \
      --mysqld=--loose-idle_write_transaction_timeout=0 \
      --mysqld=--loose-idle_transaction_timeout=0 \
      --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      --mysqld=--connect_timeout=60 \
      --mysqld=--interactive_timeout=28800 \
      --mysqld=--slave_net_timeout=60 \
      --mysqld=--net_read_timeout=30 \
      --mysqld=--net_write_timeout=60 \
      --mysqld=--loose-table_lock_wait_timeout=50 \
      --mysqld=--wait_timeout=28800 \
      --mysqld=--lock-wait-timeout=86400 \
      --mysqld=--innodb-lock-wait-timeout=50 \
      --no-mask \
      --queries=10000000 \
      --duration=300 \
      --seed=random \
      --reporters=Backtrace \
      --reporters=ErrorLog \
      --reporters=Deadlock1 \
      --reporters=CrashRecovery1 \
      --validators=None \
      --mysqld=--log_output=none \
      --mysqld=--log-bin \
      --mysqld=--log_bin_trust_function_creators=1 \
      --mysqld=--loose-max-statement-time=30 \
      --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      --engine=InnoDB \
      --restart_timeout=120 \
      --threads=2 \
      --mysqld=--innodb_force_recovery=2 \
      --mysqld=--innodb_page_size=8K \
      --mysqld=--innodb-buffer-pool-size=256M \
      --duration=300 \
      --no_mask \
      --workdir=<local settings> \
      --vardir=<local settings> \
      --mtr-build-thread=<local settings> \
      --basedir1=<local settings> \
      --basedir2=<local settings> \
      --script_debug=_nix_ \
      --rr=Server \
      --rr_options=--chaos
       
      Certain options are not required for replay.
       
      
      

          Activity

            We resurrected table locks for the recovered transaction:

            10.5 d2c593c2a66549abe67c4e73dbefeb876fa5ed9d

            #0  0x000055db84eaa5ae in lock_table_create (table=0x619000028120, 
                type_mode=1, trx=0x2edc0254fb90, c_lock=0x0)
                at /home/mleich/10.5/storage/innobase/lock/lock0lock.cc:3485
            #1  0x000055db84eadc4d in lock_table_ix_resurrect (table=0x619000028120, 
                trx=0x2edc0254fb90)
                at /home/mleich/10.5/storage/innobase/lock/lock0lock.cc:3942
            #2  0x000055db85238b88 in trx_resurrect_table_locks (trx=0x2edc0254fb90, 
                undo=0x61200000b458)
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:644
            #3  0x000055db852399c2 in trx_resurrect (undo=0x61200000b458, 
                rseg=0x61300000a4d8, start_time=1591979736, start_time_micro=289221474758, 
                rows_to_undo=0x7ffc53b2fa30, is_old_insert=false)
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:725
            #4  0x000055db8523a0c0 in trx_lists_init_at_db_start ()
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:786
            …
            (rr) when
            Current event: 108171
            

            Later, the transaction was rolled back, but we did not clear trx->lock.table_cached before freeing the transaction:

            10.5 d2c593c2a66549abe67c4e73dbefeb876fa5ed9d

            Thread 62 hit Breakpoint 5, trx_free (trx=@0x46c635b407c8: 0x2edc0254fb90)
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:394
            (rr) bt
            #0  trx_free (trx=@0x46c635b407c8: 0x2edc0254fb90)
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:394
            #1  0x000055db852379c2 in trx_free_at_shutdown (trx=0x2edc0254fb90)
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:552
            #2  0x000055db8521df14 in trx_rollback_recovered (all=true)
                at /home/mleich/10.5/storage/innobase/trx/trx0roll.cc:809
            #3  0x000055db8521e3ae in trx_rollback_all_recovered ()
                at /home/mleich/10.5/storage/innobase/trx/trx0roll.cc:838
            #4  0x0000104016d8d6db in start_thread (arg=0x46c635b41700)
                at pthread_create.c:463
            #5  0x00007fda7081988f in clone ()
                at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            (rr) when
            Current event: 115082
            

            A little later, the transaction object will be reused in trx_create(), and the stale trx->lock.table_cached value will trigger the assertion.
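
            For illustration, here is a minimal stand-alone sketch of that reuse pattern (plain C++, not InnoDB code; fake_trx and its table_cached member are hypothetical stand-ins for the real trx_t state):

            #include <cassert>
            #include <vector>

            struct fake_trx {
              int table_cached = 0;               // count of cached table lock objects
              void reset() { table_cached = 0; }  // what trx_init() would normally do
            };

            int main() {
              std::vector<fake_trx*> pool;

              fake_trx* trx = new fake_trx;
              trx->table_cached = 1;              // recovery resurrected a table lock

              // The rollback was aborted early, so the normal commit path that
              // would call trx->reset() never runs; the object is returned to
              // the pool with stale state (cf. trx_free_at_shutdown() above).
              pool.push_back(trx);

              // Later the pooled object is handed out again and is expected to
              // be clean -- this mirrors the trx_create() assertion failure.
              fake_trx* reused = pool.back();
              assert(reused->table_cached == 0);  // fails, as in the report
              return 0;
            }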

            It appears that normally the field would be cleared by the trx_init(this) call in trx_t::commit_in_memory(). But here we are incorrectly marking the rollback as interrupted:

            	if (UNIV_UNLIKELY(trx_get_dict_operation(trx) == TRX_DICT_OP_NONE
            			  && !srv_undo_sources
            			  && !srv_is_being_started)
            	    && (srv_fast_shutdown == 3 || trx == trx_roll_crash_recv_trx)) {
            		/* Shutdown has been initiated. */
            		trx->error_state = DB_INTERRUPTED;
            		return NULL;
            	}
            

            Shutdown was not initiated, but we had failed to set srv_undo_sources if the value of innodb_force_recovery is 2, 3, or 4.
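
            As a purely illustrative model (the local variables below only mimic the server state described above; they are not the real globals), the quoted predicate evaluates to true for the crash-recovery rollback even though no shutdown was initiated:

            #include <cstdio>

            int main() {
              // State during the background rollback under innodb_force_recovery=2:
              bool dict_op_none         = true;   // TRX_DICT_OP_NONE for the recovered trx
              bool srv_undo_sources     = false;  // never set, because force_recovery is 2
              bool srv_is_being_started = false;  // startup has already finished
              int  srv_fast_shutdown    = 0;      // no shutdown is in progress
              bool is_crash_recv_trx    = true;   // trx == trx_roll_crash_recv_trx

              bool abort_rollback = dict_op_none && !srv_undo_sources
                                    && !srv_is_being_started
                                    && (srv_fast_shutdown == 3 || is_crash_recv_trx);

              // Prints 1: the rollback is flagged as DB_INTERRUPTED although
              // shutdown was never initiated.
              std::printf("%d\n", abort_rollback);
              return 0;
            }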
            mleich, can you please try the following patch?

            diff --git a/storage/innobase/srv/srv0start.cc b/storage/innobase/srv/srv0start.cc
            index 3fe78bc1909..4f68af25bcf 100644
            --- a/storage/innobase/srv/srv0start.cc
            +++ b/storage/innobase/srv/srv0start.cc
            @@ -1875,6 +1875,7 @@ dberr_t srv_start(bool create_new_db)
             
             		if (srv_force_recovery < SRV_FORCE_NO_TRX_UNDO
             		    && !srv_read_only_mode) {
            +			srv_undo_sources = true;
             			/* Drop partially created indexes. */
             			row_merge_drop_temp_indexes();
             			/* Drop garbage tables. */
            @@ -1915,8 +1916,9 @@ dberr_t srv_start(bool create_new_db)
             		ut_ad(srv_force_recovery >= SRV_FORCE_NO_UNDO_LOG_SCAN
             		      || !purge_sys.enabled());
             
            +		srv_undo_sources = !high_level_read_only;
            +
             		if (srv_force_recovery < SRV_FORCE_NO_BACKGROUND) {
            -			srv_undo_sources = true;
             			/* Create the dict stats gathering task */
             			dict_stats_start();
             			/* Create the thread that will optimize the
            

            marko Marko Mäkelä added a comment

            Sorry, my patch cannot work as is. srv_undo_sources is being used as a predicate that tells whether the additional threads or tasks exist.
            Here is a less intrusive patch that I think should work in this scenario:

            diff --git a/storage/innobase/row/row0undo.cc b/storage/innobase/row/row0undo.cc
            index 398e31a5ae5..c4b0af997f5 100644
            --- a/storage/innobase/row/row0undo.cc
            +++ b/storage/innobase/row/row0undo.cc
            @@ -477,6 +477,7 @@ row_undo_step(
             	ut_ad(que_node_get_type(node) == QUE_NODE_UNDO);
             
             	if (UNIV_UNLIKELY(trx_get_dict_operation(trx) == TRX_DICT_OP_NONE
            +			  && trx != trx_roll_crash_recv_trx
             			  && !srv_undo_sources
             			  && !srv_is_being_started)
             	    && (srv_fast_shutdown == 3 || trx == trx_roll_crash_recv_trx)) {
            

            I will try to come up with a test case. Basically, we should start up the server with innodb_force_recovery=2 and expect the rollback to complete. I hope that this will be observable with a READ UNCOMMITTED transaction.

            marko Marko Mäkelä added a comment

            I think that we must update the predicate to check for srv_shutdown_state==SRV_SHUTDOWN_INITIATED that was introduced in MDEV-22827. That should be the least intrusive fix.
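
            For illustration only, here is the same toy model as above, but with the early-exit keyed on the shutdown state (the enum below is a simplified stand-in, not the server's actual definition, and this is not necessarily the committed fix):

            #include <cstdio>

            enum srv_shutdown_t { SRV_SHUTDOWN_NONE, SRV_SHUTDOWN_INITIATED };

            int main() {
              // Same scenario: background rollback under innodb_force_recovery=2,
              // no shutdown in progress.
              srv_shutdown_t srv_shutdown_state = SRV_SHUTDOWN_NONE;
              bool dict_op_none      = true;
              int  srv_fast_shutdown = 0;
              bool is_crash_recv_trx = true;

              bool abort_rollback = dict_op_none
                                    && srv_shutdown_state == SRV_SHUTDOWN_INITIATED
                                    && (srv_fast_shutdown == 3 || is_crash_recv_trx);

              // Prints 0: the recovered transaction keeps rolling back unless a
              // shutdown has actually been initiated.
              std::printf("%d\n", abort_rollback);
              return 0;
            }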

            I am able to reproduce the problem with the following test case, which is not entirely deterministic, because it depends on the relative timing of trx_rollback_all_recovered and the srv_is_being_started=false assignment in srv_start():

            --source include/have_innodb.inc
            --source include/not_embedded.inc
            --source include/have_sequence.inc
             
            FLUSH TABLES;
             
            connect (con0,localhost,root);
            CREATE TABLE t0 (a INT PRIMARY KEY) ENGINE=InnoDB;
            BEGIN;
            INSERT INTO t0 SELECT * FROM seq_1_to_10000;
             
            connection default;
            # Persist the above incomplete transaction.
            SET GLOBAL innodb_flush_log_at_trx_commit=1;
            CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;
             
            --let $restart_parameters= --innodb-force-recovery=2
            --let $shutdown_timeout=0
            --source include/restart_mysqld.inc
            --let $restart_parameters=
            --let $shutdown_timeout=
             
            disconnect con0;
            connection default;
            SELECT * FROM t0 LOCK IN SHARE MODE;
            DROP TABLE t0,t1;
            

            Without the fix, the SELECT would hang, because the rollback was aborted and the locks were not released.

            marko Marko Mäkelä added a comment

            I believe that this logic was broken in MDEV-12352. The fix will depend on MDEV-22769.

            marko Marko Mäkelä added a comment

            Pushed to 10.2 and merged to 10.3 (with conflict resolution).

            marko Marko Mäkelä added a comment

            People

              Assignee: Marko Mäkelä (marko)
              Reporter: Matthias Leich (mleich)