MariaDB Server: MDEV-21217

innodb_force_recovery=2 may wrongly abort the rollback of recovered transactions

Description

      The assert hit during RQG testing was first observed on
      origin/10.5 8ed646f0712b8f459439dfcdc3cf4ae45d79ec95 2019-12-02T13:35:54+03:00
      The corresponding artifacts (archive with core file etc.) of that failing test had to be deleted later.
       
      The server error log contained
      #0  __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:62
      #1  0x00005632ef370321 in my_write_core (sig=6) at mysys/stacktrace.c:518
      #2  0x00005632eea7659f in handle_fatal_signal (sig=6) at sql/signal_handler.cc:343
      #3  <signal handler called>
      #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
      #5  0x00007f8ef32fc37a in __GI_abort () at abort.c:89
      #6  0x00007f8ef32f2b47 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x5632ef83c7d4 "trx->lock.table_cached == 0", file=file@entry=0x5632ef83c3a0 "storage/innobase/trx/trx0trx.cc", 
          line=line@entry=381, function=function@entry=0x5632ef83e8d0 <trx_create()::__PRETTY_FUNCTION__> "trx_t* trx_create()") at assert.c:92
      #7  0x00007f8ef32f2bf2 in __GI___assert_fail (assertion=0x5632ef83c7d4 "trx->lock.table_cached == 0", file=0x5632ef83c3a0 "storage/innobase/trx/trx0trx.cc", line=381, 
          function=0x5632ef83e8d0 <trx_create()::__PRETTY_FUNCTION__> "trx_t* trx_create()") at assert.c:101
      #8  0x00005632ef0f14bc in trx_create () at storage/innobase/trx/trx0trx.cc:381
      #9  0x00005632eee729f2 in innobase_trx_allocate (thd=0x7f8eb8001bd8) at storage/innobase/handler/ha_innodb.cc:2526
      #10 0x00005632eee72ad1 in check_trx_exists (thd=0x7f8eb8001bd8) at storage/innobase/handler/ha_innodb.cc:2551
      #11 0x00005632eee9e303 in ha_innobase::update_thd (this=0x7f8eb800f3c0, thd=0x7f8eb8001bd8) at storage/innobase/handler/ha_innodb.cc:2767
      #12 0x00005632eee8d0dc in ha_innobase::info_low (this=0x7f8eb800f3c0, flag=282, is_analyze=false) at storage/innobase/handler/ha_innodb.cc:14493
      #13 0x00005632eee8dd1d in ha_innobase::info (this=0x7f8eb800f3c0, flag=282) at storage/innobase/handler/ha_innodb.cc:14806
      #14 0x00005632eee7a5ae in ha_innobase::open (this=0x7f8eb800f3c0, name=0x7f8eb800bda8 "./mysql/gtid_slave_pos") at storage/innobase/handler/ha_innodb.cc:6117
      #15 0x00005632eea7f3f8 in handler::ha_open (this=0x7f8eb800f3c0, table_arg=0x7f8eb800e1c8, name=0x7f8eb800bda8 "./mysql/gtid_slave_pos", mode=2, test_if_locked=18, mem_root=0x0, partitions_to_open=0x0)
          at sql/handler.cc:2736
      #16 0x00005632ee8722a9 in open_table_from_share (thd=0x7f8eb8001bd8, share=0x7f8eb800b810, alias=0x7f8eec21e3a8, db_stat=33, prgflag=8, ha_open_flags=18, outparam=0x7f8eb800e1c8, is_create_table=false, 
          partitions_to_open=0x0) at sql/table.cc:4094
      #17 0x00005632ee68dcc4 in open_table (thd=0x7f8eb8001bd8, table_list=0x7f8eec21e360, ot_ctx=0x7f8eec21e190) at sql/sql_base.cc:2074
      #18 0x00005632ee69175d in open_and_process_table (thd=0x7f8eb8001bd8, tables=0x7f8eec21e360, counter=0x7f8eec21e224, flags=0, prelocking_strategy=0x7f8eec21e2a8, has_prelocking_list=false, ot_ctx=0x7f8eec21e190)
          at sql/sql_base.cc:3837
      #19 0x00005632ee6929f7 in open_tables (thd=0x7f8eb8001bd8, options=..., start=0x7f8eec21e208, counter=0x7f8eec21e224, flags=0, prelocking_strategy=0x7f8eec21e2a8) at sql/sql_base.cc:4311
      #20 0x00005632ee694b74 in open_and_lock_tables (thd=0x7f8eb8001bd8, options=..., tables=0x7f8eec21e360, derived=false, flags=0, prelocking_strategy=0x7f8eec21e2a8) at sql/sql_base.cc:5204
      #21 0x00005632ee64ea15 in open_and_lock_tables (thd=0x7f8eb8001bd8, tables=0x7f8eec21e360, derived=false, flags=0) at sql/sql_base.h:506
      #22 0x00005632ee8bc86d in scan_one_gtid_slave_pos_table (thd=0x7f8eb8001bd8, hash=0x7f8eec21ee10, array=0x7f8eec21edb0, tablename=0x7f8eb800a6e0, out_hton=0x7f8eec21ea70) at sql/rpl_rli.cc:1531
      #23 0x00005632ee8bd633 in load_gtid_state_cb (thd=0x7f8eb8001bd8, table_name=0x7f8eb800a6e0, arg=0x7f8eec21edd0) at sql/rpl_rli.cc:1806
      #24 0x00005632ee8bd12a in scan_all_gtid_slave_pos_table (thd=0x7f8eb8001bd8, cb=0x5632ee8bd5f1 <load_gtid_state_cb(THD*, LEX_CSTRING*, void*)>, cb_data=0x7f8eec21edd0) at sql/rpl_rli.cc:1675
      #25 0x00005632ee8bd7f1 in rpl_load_gtid_slave_state (thd=0x7f8eb8001bd8) at sql/rpl_rli.cc:1843
      #26 0x00005632eebba326 in binlog_background_thread (arg=0x0) at sql/log.cc:9971
      #27 0x00007f8ef413f6da in start_thread (arg=0x7f8eec21f700) at pthread_create.c:456
      #28 0x00007f8ef33cdd7f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
       
      What follows are the results of another replay on
      origin/10.5 d2c593c2a66549abe67c4e73dbefeb876fa5ed9d 2020-06-12T16:22:03+03:00, a build with ASAN
      which reports itself as 10.5.4.
      ...
      2020-06-12 16:35:45 0 [Note] Crash recovery finished.
      ...
      2020-06-12 16:35:50 0 [Note] /home/mleich/Server_bin/10.5_asan//bin/mysqld: ready for connections.
      2020-06-12T16:36:58 [67872] | Version: '10.5.4-MariaDB-debug-log'  socket: ...
      2020-06-12T16:36:58 [67872] | mysqld: /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:376: trx_t* trx_create(): Assertion `trx->lock.table_cached == 0' failed.
      ....    Running under "rr" together with ASAN etc. meant that the usual backtrace did not appear in the server error log.
       
      Backtrace based on rr trace:
      -------------------------------------------
      Thread 2 (Thread 59357.95740):
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
      #1  0x00007fda70738801 in __GI_abort () at abort.c:79
      #2  0x00007fda7072839a in __assert_fail_base (fmt=0x7fda708af7d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x55db86822240 "trx->lock.table_cached == 0", 
          file=file@entry=0x55db86821a00 "/home/mleich/10.5/storage/innobase/trx/trx0trx.cc", line=line@entry=376, function=function@entry=0x55db86827220 <trx_create()::__PRETTY_FUNCTION__> "trx_t* trx_create()") at assert.c:92
      #3  0x00007fda70728412 in __GI___assert_fail (assertion=0x55db86822240 "trx->lock.table_cached == 0", file=0x55db86821a00 "/home/mleich/10.5/storage/innobase/trx/trx0trx.cc", line=376, 
          function=0x55db86827220 <trx_create()::__PRETTY_FUNCTION__> "trx_t* trx_create()") at assert.c:101
      #4  0x000055db8523576a in trx_create () at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:376
      #5  0x000055db84cf2b02 in innobase_trx_allocate (thd=0x62b00007e218) at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:2465
      #6  0x000055db84cf2cea in check_trx_exists (thd=0x62b00007e218) at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:2490
      #7  0x000055db84d5aa1f in ha_innobase::update_thd (this=0x61d0007e90b8, thd=0x62b00007e218) at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:2706
      #8  0x000055db84d32ede in ha_innobase::info_low (this=0x61d0007e90b8, flag=282, is_analyze=false) at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:14395
      #9  0x000055db84d34c73 in ha_innobase::info (this=0x61d0007e90b8, flag=282) at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:14707
      #10 0x000055db84d041fd in ha_innobase::open (this=0x61d0007e90b8, name=0x61b000080250 "./test/E") at /home/mleich/10.5/storage/innobase/handler/ha_innodb.cc:6032
      #11 0x000055db84220eee in handler::ha_open (this=0x61d0007e90b8, table_arg=0x620000040098, name=0x61b000080250 "./test/E", mode=2, test_if_locked=18, mem_root=0x0, partitions_to_open=0x0) at /home/mleich/10.5/sql/handler.cc:2906
      #12 0x000055db83d862c7 in open_table_from_share (thd=0x62b00007e218, share=0x61b00007fcb8, alias=0x629000401280, db_stat=33, prgflag=8, ha_open_flags=18, outparam=0x620000040098, is_create_table=false, partitions_to_open=0x0)
          at /home/mleich/10.5/sql/table.cc:4185
      #13 0x000055db838ad314 in open_table (thd=0x62b00007e218, table_list=0x629000401238, ot_ctx=0x373f36466140) at /home/mleich/10.5/sql/sql_base.cc:1981
      #14 0x000055db838b6f22 in open_and_process_table (thd=0x62b00007e218, tables=0x629000401238, counter=0x373f36466260, flags=1090, prelocking_strategy=0x373f364662a0, has_prelocking_list=false, ot_ctx=0x373f36466140)
          at /home/mleich/10.5/sql/sql_base.cc:3780
      #15 0x000055db838b9aca in open_tables (thd=0x62b00007e218, options=..., start=0x373f36466230, counter=0x373f36466260, flags=1090, prelocking_strategy=0x373f364662a0) at /home/mleich/10.5/sql/sql_base.cc:4252
      #16 0x000055db838a2eae in open_tables (thd=0x62b00007e218, tables=0x373f36466230, counter=0x373f36466260, flags=1090, prelocking_strategy=0x373f364662a0) at /home/mleich/10.5/sql/sql_base.h:264
      #17 0x000055db838bf6de in open_normal_and_derived_tables (thd=0x62b00007e218, tables=0x629000401238, flags=1090, dt_phases=35) at /home/mleich/10.5/sql/sql_base.cc:5221
      #18 0x000055db838bfc5a in open_tables_only_view_structure (thd=0x62b00007e218, table_list=0x629000401238, can_deadlock=false) at /home/mleich/10.5/sql/sql_base.cc:5272
      #19 0x000055db83c0c0c2 in fill_schema_table_by_open (thd=0x62b00007e218, mem_root=0x62b000083ce0, is_show_fields_or_keys=true, table=0x624000488138, schema_table=0x55db87399260 <schema_tables+1920>, orig_db_name=0x62b000085800, 
          orig_table_name=0x62b000085810, open_tables_state_backup=0x373f36468510, can_deadlock=false) at /home/mleich/10.5/sql/sql_show.cc:4600
      #20 0x000055db83c0ebd4 in get_all_tables (thd=0x62b00007e218, tables=0x62b000087238, cond=0x0) at /home/mleich/10.5/sql/sql_show.cc:5125
      #21 0x000055db83c41bc3 in get_schema_tables_result (join=0x62b0000883a0, executed_place=PROCESSED_BY_JOIN_EXEC) at /home/mleich/10.5/sql/sql_show.cc:8673
      #22 0x000055db83b143b9 in JOIN::exec_inner (this=0x62b0000883a0) at /home/mleich/10.5/sql/sql_select.cc:4401
      #23 0x000055db83b12268 in JOIN::exec (this=0x62b0000883a0) at /home/mleich/10.5/sql/sql_select.cc:4225
      #24 0x000055db83b16322 in mysql_select (thd=0x62b00007e218, tables=0x62b000087238, fields=..., conds=0x0, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2684619520, result=0x62b000088370, unit=0x62b000082230, 
          select_lex=0x62b000082a30) at /home/mleich/10.5/sql/sql_select.cc:4649
      #25 0x000055db83ae84d3 in handle_select (thd=0x62b00007e218, lex=0x62b000082168, result=0x62b000088370, setup_tables_done_option=0) at /home/mleich/10.5/sql/sql_select.cc:417
      #26 0x000055db83a58ca5 in execute_sqlcom_select (thd=0x62b00007e218, all_tables=0x62b000087238) at /home/mleich/10.5/sql/sql_parse.cc:6208
      #27 0x000055db83a47fd4 in mysql_execute_command (thd=0x62b00007e218) at /home/mleich/10.5/sql/sql_parse.cc:3939
      #28 0x000055db83a63df9 in mysql_parse (thd=0x62b00007e218, rawbuf=0x62b000085238 "SHOW KEYS FROM `test`.`E`", length=25, parser_state=0x373f3646a8f0, is_com_multi=false, is_next_command=false) at /home/mleich/10.5/sql/sql_parse.cc:7992
      #29 0x000055db83a3a94d in dispatch_command (command=COM_QUERY, thd=0x62b00007e218, packet=0x6290003ed219 "SHOW KEYS FROM `test`.`E`", packet_length=25, is_com_multi=false, is_next_command=false) at /home/mleich/10.5/sql/sql_parse.cc:1874
      #30 0x000055db83a371cb in do_command (thd=0x62b00007e218) at /home/mleich/10.5/sql/sql_parse.cc:1355
      #31 0x000055db83e64708 in do_handle_one_connection (connect=0x6080045091b8, put_in_cache=true) at /home/mleich/10.5/sql/sql_connect.cc:1411
      #32 0x000055db83e64066 in handle_one_connection (arg=0x6080045091b8) at /home/mleich/10.5/sql/sql_connect.cc:1313
      #33 0x000055db84b1cac1 in pfs_spawn_thread (arg=0x615000008f18) at /home/mleich/10.5/storage/perfschema/pfs.cc:2201
      #34 0x0000104016d8d6db in start_thread (arg=0x373f3646c300) at pthread_create.c:463
      #35 0x00007fda7081988f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
       
      Workflow of the RQG test
      0. Bootstrap
      1. Start the server with innodb_force_recovery=2.
      2. Create and fill all tables which get used later.
      3. Several concurrent sessions run excessive DML on these tables.
           ~ 280 s after the start of step 3, the server process gets killed (SIGKILL).
      4. The attempt to restart that server with innodb_force_recovery=2 succeeded.
      5. Then come checks of the integrity of the data content, and there a
           SHOW KEYS FROM `test`.`E` caused the assert.
       
      RQG
      git clone https://github.com/mleich1/rqg --branch experimental RQG
      origin/experimental 611a31e38a3bc157392c0748c95991b6a248bb3b 2020-06-09T15:31:46+02:00
       
      perl rqg.pl \                        
      --grammar=conf/mariadb/innodb_compression_encryption.yy \
      --gendata=conf/mariadb/innodb_compression_encryption.zz \
      --mysqld=--plugin-load-add=file_key_management.so \
      --mysqld=--loose-file-key-management-filename=conf/mariadb/encryption_keys.txt \
      --mysqld=--innodb_use_native_aio=1 \
      --mysqld=--innodb_stats_persistent=off \
      --mysqld=--innodb_lock_schedule_algorithm=fcfs \
      --mysqld=--loose-idle_write_transaction_timeout=0 \
      --mysqld=--loose-idle_transaction_timeout=0 \
      --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      --mysqld=--connect_timeout=60 \
      --mysqld=--interactive_timeout=28800 \
      --mysqld=--slave_net_timeout=60 \
      --mysqld=--net_read_timeout=30 \
      --mysqld=--net_write_timeout=60 \
      --mysqld=--loose-table_lock_wait_timeout=50 \
      --mysqld=--wait_timeout=28800 \
      --mysqld=--lock-wait-timeout=86400 \
      --mysqld=--innodb-lock-wait-timeout=50 \
      --no-mask \
      --queries=10000000 \
      --duration=300 \
      --seed=random \
      --reporters=Backtrace \
      --reporters=ErrorLog \
      --reporters=Deadlock1 \
      --reporters=CrashRecovery1 \
      --validators=None \
      --mysqld=--log_output=none \
      --mysqld=--log-bin \
      --mysqld=--log_bin_trust_function_creators=1 \
      --mysqld=--loose-max-statement-time=30 \
      --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      --engine=InnoDB \
      --restart_timeout=120 \
      --threads=2 \
      --mysqld=--innodb_force_recovery=2 \
      --mysqld=--innodb_page_size=8K \
      --mysqld=--innodb-buffer-pool-size=256M \
      --duration=300 \
      --no_mask \
      --workdir=<local settings> \
      --vardir=<local settings> \
      --mtr-build-thread=<local settings> \
      --basedir1=<local settings> \
      --basedir2=<local settings> \
      --script_debug=_nix_ \
      --rr=Server \
      --rr_options=--chaos
       
      Certain options are not required for replay.
       
      
      

          Activity

            We resurrected table locks for the recovered transaction:

            10.5 d2c593c2a66549abe67c4e73dbefeb876fa5ed9d

            #0  0x000055db84eaa5ae in lock_table_create (table=0x619000028120, 
                type_mode=1, trx=0x2edc0254fb90, c_lock=0x0)
                at /home/mleich/10.5/storage/innobase/lock/lock0lock.cc:3485
            #1  0x000055db84eadc4d in lock_table_ix_resurrect (table=0x619000028120, 
                trx=0x2edc0254fb90)
                at /home/mleich/10.5/storage/innobase/lock/lock0lock.cc:3942
            #2  0x000055db85238b88 in trx_resurrect_table_locks (trx=0x2edc0254fb90, 
                undo=0x61200000b458)
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:644
            #3  0x000055db852399c2 in trx_resurrect (undo=0x61200000b458, 
                rseg=0x61300000a4d8, start_time=1591979736, start_time_micro=289221474758, 
                rows_to_undo=0x7ffc53b2fa30, is_old_insert=false)
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:725
            #4  0x000055db8523a0c0 in trx_lists_init_at_db_start ()
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:786
            …
            (rr) when
            Current event: 108171
            

            Later, the transaction was rolled back, but we did not clear trx->lock.table_cached before freeing the transaction:

            10.5 d2c593c2a66549abe67c4e73dbefeb876fa5ed9d

            Thread 62 hit Breakpoint 5, trx_free (trx=@0x46c635b407c8: 0x2edc0254fb90)
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:394
            (rr) bt
            #0  trx_free (trx=@0x46c635b407c8: 0x2edc0254fb90)
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:394
            #1  0x000055db852379c2 in trx_free_at_shutdown (trx=0x2edc0254fb90)
                at /home/mleich/10.5/storage/innobase/trx/trx0trx.cc:552
            #2  0x000055db8521df14 in trx_rollback_recovered (all=true)
                at /home/mleich/10.5/storage/innobase/trx/trx0roll.cc:809
            #3  0x000055db8521e3ae in trx_rollback_all_recovered ()
                at /home/mleich/10.5/storage/innobase/trx/trx0roll.cc:838
            #4  0x0000104016d8d6db in start_thread (arg=0x46c635b41700)
                at pthread_create.c:463
            #5  0x00007fda7081988f in clone ()
                at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            (rr) when
            Current event: 115082
            

            A little later, the transaction object will be reused in trx_create(), and the stale trx->lock.table_cached value will trigger the assertion.
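
            For illustration, here is a minimal stand-alone sketch of that reuse pattern (plain C++, not InnoDB code; fake_trx and its table_cached member are hypothetical stand-ins for the real trx_t state):

            #include <cassert>
            #include <vector>

            struct fake_trx {
              int table_cached = 0;               // count of cached table lock objects
              void reset() { table_cached = 0; }  // what trx_init() would normally do
            };

            int main() {
              std::vector<fake_trx*> pool;

              fake_trx* trx = new fake_trx;
              trx->table_cached = 1;              // recovery resurrected a table lock

              // The rollback was aborted early, so the normal commit path that
              // would call trx->reset() never runs; the object is returned to
              // the pool with stale state (cf. trx_free_at_shutdown() above).
              pool.push_back(trx);

              // Later the pooled object is handed out again and is expected to
              // be clean -- this mirrors the trx_create() assertion failure.
              fake_trx* reused = pool.back();
              assert(reused->table_cached == 0);  // fails, as in the report
              return 0;
            }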

            It appears that normally the field would be cleared by the trx_init(this) call in trx_t::commit_in_memory(). But here we are incorrectly marking the rollback as interrupted:

            	if (UNIV_UNLIKELY(trx_get_dict_operation(trx) == TRX_DICT_OP_NONE
            			  && !srv_undo_sources
            			  && !srv_is_being_started)
            	    && (srv_fast_shutdown == 3 || trx == trx_roll_crash_recv_trx)) {
            		/* Shutdown has been initiated. */
            		trx->error_state = DB_INTERRUPTED;
            		return NULL;
            	}
            

            Shutdown was not initiated, but we had failed to set srv_undo_sources if the value of innodb_force_recovery is 2, 3, or 4.
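
            As a purely illustrative model (the local variables below only mimic the server state described above; they are not the real globals), the quoted predicate evaluates to true for the crash-recovery rollback even though no shutdown was initiated:

            #include <cstdio>

            int main() {
              // State during the background rollback under innodb_force_recovery=2:
              bool dict_op_none         = true;   // TRX_DICT_OP_NONE for the recovered trx
              bool srv_undo_sources     = false;  // never set, because force_recovery is 2
              bool srv_is_being_started = false;  // startup has already finished
              int  srv_fast_shutdown    = 0;      // no shutdown is in progress
              bool is_crash_recv_trx    = true;   // trx == trx_roll_crash_recv_trx

              bool abort_rollback = dict_op_none && !srv_undo_sources
                                    && !srv_is_being_started
                                    && (srv_fast_shutdown == 3 || is_crash_recv_trx);

              // Prints 1: the rollback is flagged as DB_INTERRUPTED although
              // shutdown was never initiated.
              std::printf("%d\n", abort_rollback);
              return 0;
            }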
            mleich, can you please try the following patch?

            diff --git a/storage/innobase/srv/srv0start.cc b/storage/innobase/srv/srv0start.cc
            index 3fe78bc1909..4f68af25bcf 100644
            --- a/storage/innobase/srv/srv0start.cc
            +++ b/storage/innobase/srv/srv0start.cc
            @@ -1875,6 +1875,7 @@ dberr_t srv_start(bool create_new_db)
             
             		if (srv_force_recovery < SRV_FORCE_NO_TRX_UNDO
             		    && !srv_read_only_mode) {
            +			srv_undo_sources = true;
             			/* Drop partially created indexes. */
             			row_merge_drop_temp_indexes();
             			/* Drop garbage tables. */
            @@ -1915,8 +1916,9 @@ dberr_t srv_start(bool create_new_db)
             		ut_ad(srv_force_recovery >= SRV_FORCE_NO_UNDO_LOG_SCAN
             		      || !purge_sys.enabled());
             
            +		srv_undo_sources = !high_level_read_only;
            +
             		if (srv_force_recovery < SRV_FORCE_NO_BACKGROUND) {
            -			srv_undo_sources = true;
             			/* Create the dict stats gathering task */
             			dict_stats_start();
             			/* Create the thread that will optimize the
            

            marko Marko Mäkelä added a comment

            Sorry, my patch cannot work as is. srv_undo_sources is being used as a predicate that tells whether the additional threads or tasks exist.
            Here is a less intrusive patch that I think should work in this scenario:

            diff --git a/storage/innobase/row/row0undo.cc b/storage/innobase/row/row0undo.cc
            index 398e31a5ae5..c4b0af997f5 100644
            --- a/storage/innobase/row/row0undo.cc
            +++ b/storage/innobase/row/row0undo.cc
            @@ -477,6 +477,7 @@ row_undo_step(
             	ut_ad(que_node_get_type(node) == QUE_NODE_UNDO);
             
             	if (UNIV_UNLIKELY(trx_get_dict_operation(trx) == TRX_DICT_OP_NONE
            +			  && trx != trx_roll_crash_recv_trx
             			  && !srv_undo_sources
             			  && !srv_is_being_started)
             	    && (srv_fast_shutdown == 3 || trx == trx_roll_crash_recv_trx)) {
            

            I will try to come up with a test case. Basically, we should start up the server with innodb_force_recovery=2 and expect the rollback to complete. I hope that this will be observable with a READ UNCOMMITTED transaction.

            marko Marko Mäkelä added a comment

            I think that we must update the predicate to check for srv_shutdown_state==SRV_SHUTDOWN_INITIATED that was introduced in MDEV-22827. That should be the least intrusive fix.
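
            For illustration only, here is the same toy model as above, but with the early-exit keyed on the shutdown state (the enum below is a simplified stand-in, not the server's actual definition, and this is not necessarily the committed fix):

            #include <cstdio>

            enum srv_shutdown_t { SRV_SHUTDOWN_NONE, SRV_SHUTDOWN_INITIATED };

            int main() {
              // Same scenario: background rollback under innodb_force_recovery=2,
              // no shutdown in progress.
              srv_shutdown_t srv_shutdown_state = SRV_SHUTDOWN_NONE;
              bool dict_op_none      = true;
              int  srv_fast_shutdown = 0;
              bool is_crash_recv_trx = true;

              bool abort_rollback = dict_op_none
                                    && srv_shutdown_state == SRV_SHUTDOWN_INITIATED
                                    && (srv_fast_shutdown == 3 || is_crash_recv_trx);

              // Prints 0: the recovered transaction keeps rolling back unless a
              // shutdown has actually been initiated.
              std::printf("%d\n", abort_rollback);
              return 0;
            }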

            I am able to reproduce the problem with the following test case, which is not entirely deterministic, because it depends on the relative timing of trx_rollback_all_recovered and the srv_is_being_started=false assignment in srv_start():

            --source include/have_innodb.inc
            --source include/not_embedded.inc
            --source include/have_sequence.inc
             
            FLUSH TABLES;
             
            connect (con0,localhost,root);
            CREATE TABLE t0 (a INT PRIMARY KEY) ENGINE=InnoDB;
            BEGIN;
            INSERT INTO t0 SELECT * FROM seq_1_to_10000;
             
            connection default;
            # Persist the above incomplete transaction.
            SET GLOBAL innodb_flush_log_at_trx_commit=1;
            CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;
             
            --let $restart_parameters= --innodb-force-recovery=2
            --let $shutdown_timeout=0
            --source include/restart_mysqld.inc
            --let $restart_parameters=
            --let $shutdown_timeout=
             
            disconnect con0;
            connection default;
            SELECT * FROM t0 LOCK IN SHARE MODE;
            DROP TABLE t0,t1;
            

            Without the fix, the SELECT would hang, because the rollback was aborted and the locks were not released.

            marko Marko Mäkelä added a comment

            I believe that this logic was broken in MDEV-12352. The fix will depend on MDEV-22769.

            marko Marko Mäkelä added a comment

            Pushed to 10.2 and merged to 10.3 (with conflict resolution).

            marko Marko Mäkelä added a comment

            People

              Assignee: Marko Mäkelä (marko)
              Reporter: Matthias Leich (mleich)