[MDEV-31393] partial server freeze after LOCK TABLES ...; FLUSH TABLES; Created: 2023-06-02  Updated: 2023-12-07

Status: Stalled
Project: MariaDB Server
Component/s: Locking, Storage Engine - InnoDB
Affects Version/s: 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 10.6.14
Fix Version/s: 10.4, 10.5, 10.6, 10.11, 11.0, 11.1

Type: Bug Priority: Major
Reporter: Matthias Leich Assignee: Michael Widenius
Resolution: Unresolved Votes: 0
Labels: corruption, hang, rr-profile-analyzed

Attachments: File MDEV-31393.yy     File MDEV-31393.zz     File MDEV-31393_Extreme.cfg    
Issue Links:
Relates
relates to MDEV-31941 Assertion failures upon using OQGraph... Confirmed
relates to MDEV-5336 Implement BACKUP STAGE for safe exter... Closed

 Description   

Scenario:
1. Start the server and generate some initial data
2. Two "worker" session runs a DDL/DML mix
3. After some time the testing tool random query generator (RQG) detects that 
    one of the session running the DDL/DML mix is left over and has exceeded some
    threshold for the execution of its statement and assumes some server hang/freeze.
   - 21:19:51  (~ 352s after the "worker" session ID 17 started its activity) 
     ID -- COMMAND -- TIME -- INFO -- state
     13 -- Sleep -- 352 -- <undef> -- ok
     14 -- Killed -- 273 -- FLUSH TABLES /* E_R Thread1 QNO 1230 CON_ID 14 */ -- stalled?
     50 -- Query -- 0 -- SHOW FULL PROCESSLIST -- ok
  - 21:21:14   Why does ID 14 not switch to "Sleep"?
     ID -- COMMAND -- TIME -- INFO -- state
     14 -- Killed -- 356 -- FLUSH TABLES /* E_R Thread1 QNO 1230 CON_ID 14 */
     60 -- Query -- 0 -- SHOW FULL PROCESSLIST
 
# 2023-06-01T21:21:14 [446281] Reporter 'Deadlock': Content of processlist ---------- begin
# 2023-06-01T21:21:14 [446281] Reporter 'Deadlock': ID -- COMMAND -- TIME -- INFO -- state
# 2023-06-01T21:21:14 [446281] Reporter 'Deadlock': 13 -- Sleep -- 435 -- <undef> -- ok
# 2023-06-01T21:21:14 [446281] Reporter 'Deadlock': 14 -- Killed -- 356 -- FLUSH TABLES /* E_R 
 
pluto:/data/results/1685644381/Freeze$ gdb -c ./1/data/core /data/Server_bin/bb-10.6-MDEV-30986B_asan_Og/bin/mysqld
 
Unfortunately I cannot exclude that RQG raised a false alarm.
 
RQG
===
# git clone https://github.com/mleich1/rqg --branch <pick the right branch> RQG
#
# GIT_SHOW: HEAD -> master, origin/master, origin/HEAD 2cfe27f6ea2427e6d80314d26649e804d3bdf1c6 2023-05-26T16:45:32+02:00
# rqg.pl  : Version 4.2.1 (2022-12)
#
# $RQG_HOME/rqg.pl \
# --gendata=conf/engines/innodb/full_text_search.zz \
# --max_gd_duration=1200 \
# --short_column_names \
# --grammar=conf/engines/innodb/full_text_search.yy \
# --redefine=conf/mariadb/alter_table.yy \
# --redefine=conf/mariadb/instant_add.yy \
# --redefine=conf/mariadb/modules/alter_table_columns.yy \
# --redefine=conf/mariadb/bulk_insert.yy \
# --redefine=conf/mariadb/modules/foreign_keys.yy \
# --redefine=conf/mariadb/modules/locks.yy \
# --redefine=conf/mariadb/modules/sql_mode.yy \
# --redefine=conf/mariadb/redefine_temporary_tables.yy \
# --redefine=conf/mariadb/versioning.yy \
# --redefine=conf/mariadb/sequences.yy \
# --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
# --mysqld=--loose-idle_write_transaction_timeout=0 \
# --mysqld=--loose-idle_transaction_timeout=0 \
# --mysqld=--loose-idle_readonly_transaction_timeout=0 \
# --mysqld=--connect_timeout=60 \
# --mysqld=--interactive_timeout=28800 \
# --mysqld=--slave_net_timeout=60 \
# --mysqld=--net_read_timeout=30 \
# --mysqld=--net_write_timeout=60 \
# --mysqld=--loose-table_lock_wait_timeout=50 \
# --mysqld=--wait_timeout=28800 \
# --mysqld=--lock-wait-timeout=86400 \
# --mysqld=--innodb-lock-wait-timeout=50 \
# --no_mask \
# --queries=10000000 \
# --seed=random \
# --reporters=None \
# --reporters=ErrorLog \
# --reporters=Deadlock \
# --validators=None \
# --validators=None \
# --mysqld=--log_output=none \
# --mysqld=--log_bin_trust_function_creators=1 \
# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
# --engine=InnoDB \
# --restart_timeout=240 \
# --mysqld=--plugin-load-add=file_key_management.so \
# --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
# --mysqld=--plugin-load-add=provider_lzo.so \
# --mysqld=--plugin-load-add=provider_bzip2.so \
# --mysqld=--plugin-load-add=provider_lzma.so \
# --mysqld=--plugin-load-add=provider_snappy.so \
# --mysqld=--plugin-load-add=provider_lz4.so \
# --mysqld=--loose-innodb_compression_level=1 \
# --duration=300 \
# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
# --mysqld=--loose-innodb_read_only_compressed=OFF \
# --mysqld=--loose-innodb-sync-debug \
# --mysqld=--innodb_stats_persistent=off \
# --mysqld=--innodb_adaptive_hash_index=off \
# --mysqld=--innodb_random_read_ahead=ON \
# --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
# --mysqld=--loose-max-statement-time=30 \
# --threads=1 \
# --mysqld=--innodb_use_native_aio=1 \
# --mysqld=--innodb_undo_log_truncate=ON \
# --mysqld=--loose_innodb_change_buffering=none \
# --mysqld=--innodb_undo_tablespaces=16 \
# --mysqld=--innodb_rollback_on_timeout=ON \
# --mysqld=--innodb_page_size=32K \
# --mysqld=--innodb-buffer-pool-size=256M \
# <local settings>



 Comments   
Comment by Marko Mäkelä [ 2023-06-02 ]

For the record, the only interesting threads among the stack traces of the killed process were the following:
FLUSH TABLES executing something that is intended to be part of FLUSH TABLES…FOR EXPORT:

#6  0x000055cec54a950c in purge_sys_t::stop (this=0x55cec7a35c80 <purge_sys>) at /data/Server/bb-10.6-MDEV-30986B/storage/innobase/srv/srv0srv.cc:1358
#7  0x000055cec5484742 in row_quiesce_table_start (table=0x618000116908, trx=trx@entry=0x7f99e42fc340) at /data/Server/bb-10.6-MDEV-30986B/storage/innobase/row/row0quiesce.cc:533
#8  0x000055cec4fe9908 in ha_innobase::external_lock (this=0x61d0002026b8, thd=<optimized out>, lock_type=<optimized out>) at /data/Server/bb-10.6-MDEV-30986B/storage/innobase/handler/ha_innodb.cc:16079
#9  0x000055cec45f4d00 in handler::ha_external_lock (this=0x61d0002026b8, thd=thd@entry=0x62b00009a218, lock_type=lock_type@entry=0) at /data/Server/bb-10.6-MDEV-30986B/sql/handler.cc:7128

A purge worker task is waiting for a metadata lock:

#4  0x000055cec41a181c in inline_mysql_cond_timedwait (src_line=1207, src_file=0x55cec5f5de20 "/data/Server/bb-10.6-MDEV-30986B/sql/mdl.cc", abstime=0x7f998224fca0, mutex=0x62b00001c370, that=0x62b00001c420) at /data/Server/bb-10.6-MDEV-30986B/include/mysql/psi/mysql_thread.h:1088
#5  MDL_wait::timed_wait (this=this@entry=0x62b00001c370, owner=0x62b00001c300, abs_timeout=abs_timeout@entry=0x7f998224fca0, set_status_on_timeout=set_status_on_timeout@entry=false, wait_state_name=<optimized out>) at /data/Server/bb-10.6-MDEV-30986B/sql/mdl.cc:1207

The purge coordinator task was waiting for purge worker completion, and the buf_flush_page_cleaner() thread was idle.

A simplified grammar for reproducing this apparent hang would help a lot.

Comment by Matthias Leich [ 2023-06-06 ]

How to replay with the simplified test
======================================
# $RQG_HOME/rqg.pl \
# --duration=300 \
# --queries=10000000 \
# --no_mask \
# --seed=random \
# --max_gd_duration=1200 \
# --short_column_names \
# --rpl_mode=none \
# --gendata=MDEV-31393.zz \
# --engine=InnoDB \
# --mysqld=--log_bin_trust_function_creators=1 \
# --mysqld=--file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
# --mysqld=--wait_timeout=28800 \
# --mysqld=--net_write_timeout=60 \
# --mysqld=--loose-plugin-load-add=provider_lz4.so \
# --mysqld=--innodb_undo_log_truncate=ON \
# --mysqld=--interactive_timeout=28800 \
# --mysqld=--sync-binlog=1 \
# --mysqld=--loose-table_lock_wait_timeout=50 \
# --mysqld=--loose-idle_write_transaction_timeout=0 \
# --mysqld=--loose-idle_readonly_transaction_timeout=0 \
# --mysqld=--innodb_undo_tablespaces=2 \
# --mysqld=--innodb-lock-wait-timeout=50 \
# --mysqld=--log-bin \
# --mysqld=--innodb_max_undo_log_size=10M \
# --mysqld=--net_read_timeout=30 \
# --mysqld=--innodb_adaptive_hash_index=off \
# --mysqld=--loose-innodb_read_only_compressed=OFF \
# --mysqld=--slave_net_timeout=60 \
# --mysqld=--loose-innodb-sync-debug \
# --mysqld=--connect_timeout=60 \
# --mysqld=--lock-wait-timeout=86400 \
# --mysqld=--plugin-load-add=file_key_management.so \
# --mysqld=--loose_innodb_lock_schedule_algorithm=fcfs \
# --mysqld=--innodb-buffer-pool-size=256M \
# --mysqld=--log-output=none \
# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
# --mysqld=--innodb_page_size=8K \
# --mysqld=--loose-max-statement-time=30 \
# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
# --mysqld=--loose-idle_transaction_timeout=0 \
# --reporters=Deadlock,None \
# --validators=None \
# --grammar=MDEV-31393.yy \
# --threads=1 \
# --basedir1=/data/Server_bin/10.6B1_RelWithDebInfo \
# --vardir_type=fast                                                                            \
local.cfg contains the setting
   $rqg_fast_dir           = '/dev/shm/rqg';
which causes the vardir of the DB server to reside below '/dev/shm/rqg'.
 
The "brute force" way  (parallel RQG runs in order to get a faster replay etc.)
./REPLAY_SIMP.sh MDEV-31393_Extreme.cfg  <path to where the binaries are installed like /data/Server_bin/10.6B1_RelWithDebInfo>
The run will exit if either
- some replay was achieved
or
- no replay but some fixed amount of non replays was exceeded.

Comment by Marko Mäkelä [ 2023-06-09 ]

The hang that I am analyzing involves two threads and a metadata lock on the table table200_innodb_int_autoinc. The SQL thread executes FLUSH TABLES /* E_R Thread1 QNO 60 CON_ID 13 */ and an internal thread invokes purge_worker_callback.

ssh pluto
rr replay /data/results/1686228537/MDEV-31393/1/rr/latest-trace

run 112200
break MDL_context::acquire_lock
condition 1 $_streq(mdl_request.key.m_ptr+6,"table200_innodb_int_autoinc")
continue
backtrace
continue
backtrace
thread apply 23 backtrace
finish

10.6 f569e06e03a7efa6050258a8d167de0aaa4e124c

(rr) condition 1 $_streq(mdl_request.key.m_ptr+6,"table200_innodb_int_autoinc")
(rr) continue
Continuing.
2023-06-08 12:49:04 13 [Note] InnoDB: Writing table metadata to './test/table100_innodb_int_autoinc.cfg'
2023-06-08 12:49:04 13 [Note] InnoDB: Table `test`.`table100_innodb_int_autoinc` flushed to disk
[Switching to Thread 985824.987186]
 
Thread 23 hit Breakpoint 1, MDL_context::acquire_lock (this=0x5a2c2c000db0, mdl_request=0x4ddb490dbc80, lock_wait_timeout=86400) at /data/Server/10.6B1/sql/mdl.cc:2276
2276	{
(rr) backtrace
#0  MDL_context::acquire_lock (this=0x5a2c2c000db0, mdl_request=0x4ddb490dbc80, lock_wait_timeout=86400) at /data/Server/10.6B1/sql/mdl.cc:2276
#1  0x0000563c727e7bbc in MDL_context::upgrade_shared_lock (this=this@entry=0x5a2c2c000db0, mdl_ticket=0x32161c47cc20, new_type=new_type@entry=MDL_EXCLUSIVE, lock_wait_timeout=86400) at /data/Server/10.6B1/sql/mdl.cc:2594
#2  0x0000563c72670b4a in wait_while_table_is_used (thd=0x5a2c2c000c58, table=0x32161c0989b8, function=HA_EXTRA_PREPARE_FOR_FORCED_CLOSE) at /data/Server/10.6B1/sql/sql_base.cc:1344
#3  0x0000563c726730cb in close_cached_tables (thd=thd@entry=0x5a2c2c000c58, tables=tables@entry=0x0, wait_for_refresh=wait_for_refresh@entry=true, timeout=86400) at /data/Server/10.6B1/sql/sql_base.cc:391
#4  0x0000563c727fa671 in reload_acl_and_cache (thd=thd@entry=0x5a2c2c000c58, options=4, tables=tables@entry=0x0, write_to_binlog=write_to_binlog@entry=0x4ddb490dc0f0) at /data/Server/10.6B1/sql/sql_reload.cc:340
#5  0x0000563c726dec2d in mysql_execute_command (thd=0x5a2c2c000c58, is_called_from_prepared_stmt=<optimized out>) at /data/Server/10.6B1/sql/sql_parse.cc:5475
#6  0x0000563c726ce5c6 in mysql_parse (rawbuf=<optimized out>, length=<optimized out>, parser_state=<optimized out>, thd=0x5a2c2c000c58) at /data/Server/10.6B1/sql/sql_parse.cc:8036
(rr) continue
Continuing.
2023-06-08 12:49:04 13 [Note] InnoDB: Deleting the meta-data file './test/table100_innodb_int_autoinc.cfg'
2023-06-08 12:49:04 13 [Note] InnoDB: Resuming purge
[Switching to Thread 985824.985912]
 
Thread 1 hit Breakpoint 1, MDL_context::acquire_lock (this=this@entry=0x563c7593c810, mdl_request=mdl_request@entry=0x7fa3f97149e0, lock_wait_timeout=86400) at /data/Server/10.6B1/sql/mdl.cc:2276
2276	{
(rr) backtrace
#0  MDL_context::acquire_lock (this=this@entry=0x563c7593c810, mdl_request=mdl_request@entry=0x7fa3f97149e0, lock_wait_timeout=86400) at /data/Server/10.6B1/sql/mdl.cc:2276
#1  0x0000563c72caaf14 in dict_acquire_mdl_shared<false, true> (table=<optimized out>, thd=thd@entry=0x563c7593c6b8, mdl=mdl@entry=0x563c759d7fa8, table_op=table_op@entry=DICT_TABLE_OP_NORMAL) at /data/Server/10.6B1/sql/mdl.h:610
#2  0x0000563c72cab4c4 in dict_table_open_on_id<true> (table_id=132, dict_locked=dict_locked@entry=false, table_op=table_op@entry=DICT_TABLE_OP_NORMAL, thd=0x563c7593c6b8, mdl=mdl@entry=0x563c759d7fa8) at /usr/include/c++/9/bits/atomic_base.h:539
#3  0x0000563c72c1ea17 in row_purge_parse_undo_rec (updated_extern=0x7fa3f9714ca1, thr=0x563c7599c120, undo_rec=0x563c759e71b8 "", node=0x563c759d7e28) at /data/Server/10.6B1/storage/innobase/row/row0purge.cc:1081
#4  row_purge (thr=<optimized out>, undo_rec=<optimized out>, node=0x563c759d7e28) at /data/Server/10.6B1/storage/innobase/row/row0purge.cc:1259
#5  row_purge_step (thr=thr@entry=0x563c7599c120) at /data/Server/10.6B1/storage/innobase/row/row0purge.cc:1325
#6  0x0000563c72bf12c0 in que_thr_step (thr=0x563c7599c120) at /data/Server/10.6B1/storage/innobase/que/que0que.cc:588
#7  que_run_threads_low (thr=<optimized out>) at /data/Server/10.6B1/storage/innobase/que/que0que.cc:644
#8  que_run_threads (thr=thr@entry=0x563c7599c120) at /data/Server/10.6B1/storage/innobase/que/que0que.cc:664
#9  0x0000563c72c391ad in srv_task_execute () at /data/Server/10.6B1/storage/innobase/srv/srv0srv.cc:1602
#10 purge_worker_callback () at /data/Server/10.6B1/storage/innobase/srv/srv0srv.cc:1850
#11 0x0000563c72d125b0 in tpool::task_group::execute (this=0x563c73f27340 <purge_task_group>, t=0x563c73ef46e0 <purge_worker_task>) at /data/Server/10.6B1/tpool/task_group.cc:55
#12 0x0000563c72d10477 in tpool::thread_pool_generic::worker_main (this=0x563c75951390, thread_var=0x563c75951c40) at /data/Server/10.6B1/tpool/tpool_generic.cc:580
(rr) thread apply 23 backtrace
 
Thread 23 (Thread 985824.987186):
#0  0x0000000070000002 in syscall_traced ()
#11 0x0000563c72d65375 in read (__nbytes=20451, __buf=0x5a2c2c045bb8, __fd=22) at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
#12 my_read (Filedes=Filedes@entry=22, Buffer=Buffer@entry=0x5a2c2c045bb8 ' ' <repeats 200 times>..., Count=Count@entry=20451, MyFlags=MyFlags@entry=16) at /data/Server/10.6B1/mysys/my_read.c:51
#13 0x0000563c727acbde in inline_mysql_file_read (flags=16, count=20451, buffer=0x5a2c2c045bb8 ' ' <repeats 200 times>..., file=22) at /data/Server/10.6B1/include/mysql/psi/mysql_file.h:1146
#14 open_table_def (thd=thd@entry=0x5a2c2c000c58, share=share@entry=0x5a2c2c03c630, flags=flags@entry=11) at /data/Server/10.6B1/sql/table.cc:707
#15 0x0000563c7287b9a0 in tdc_acquire_share (thd=thd@entry=0x5a2c2c000c58, tl=tl@entry=0x5a2c2c05eda0, flags=flags@entry=3, out_table=out_table@entry=0x4ddb490db800) at /data/Server/10.6B1/sql/table_cache.cc:836
#16 0x0000563c72671f5f in open_table (thd=0x5a2c2c000c58, table_list=0x5a2c2c05eda0, ot_ctx=0x4ddb490dbe30) at /data/Server/10.6B1/sql/sql_base.cc:1886
#17 0x0000563c72672de4 in Locked_tables_list::reopen_tables (this=this@entry=0x5a2c2c004a48, thd=thd@entry=0x5a2c2c000c58, need_reopen=need_reopen@entry=false) at /data/Server/10.6B1/sql/sql_base.cc:2708
#18 0x0000563c72673102 in close_cached_tables (thd=thd@entry=0x5a2c2c000c58, tables=tables@entry=0x0, wait_for_refresh=wait_for_refresh@entry=true, timeout=86400) at /data/Server/10.6B1/sql/sql_base.cc:404
#19 0x0000563c727fa671 in reload_acl_and_cache (thd=thd@entry=0x5a2c2c000c58, options=4, tables=tables@entry=0x0, write_to_binlog=write_to_binlog@entry=0x4ddb490dc0f0) at /data/Server/10.6B1/sql/sql_reload.cc:340
#20 0x0000563c726dec2d in mysql_execute_command (thd=0x5a2c2c000c58, is_called_from_prepared_stmt=<optimized out>) at /data/Server/10.6B1/sql/sql_parse.cc:5475
#21 0x0000563c726ce5c6 in mysql_parse (rawbuf=<optimized out>, length=<optimized out>, parser_state=<optimized out>, thd=0x5a2c2c000c58) at /data/Server/10.6B1/sql/sql_parse.cc:8036
(rr) finish
Run till exit from #0  MDL_context::acquire_lock (this=this@entry=0x563c7593c810, mdl_request=mdl_request@entry=0x7fa3f97149e0, lock_wait_timeout=86400) at /data/Server/10.6B1/sql/mdl.cc:2276
2023-06-08 12:49:04 13 [Note] InnoDB: Sync to disk of `test`.`table100_innodb_int_autoinc` started.
2023-06-08 12:49:04 13 [Note] InnoDB: Stopping purge
2023-06-08 13:03:40 12 [Warning] Aborted connection 12 to db: 'mysql' user: 'root' host: 'localhost' (Got an error reading communication packets)
[New Thread 985824.987290]
[New Thread 985824.1153006]
 
Thread 24 received signal SIGABRT, Aborted.

I am not sure if it is the cause of the hang, but resuming purge before UNLOCK TABLES was executed, in the middle of executing a single FLUSH TABLES that affects multiple tables, seems to be a potential source of corruption because a user could try to copy all InnoDB data files after FLUSH TABLES completed, and in this case at least one data file would be unlocked prematurely during the execution of FLUSH TABLES:

2023-06-08 12:49:04 13 [Note] InnoDB: Resuming purge

This was invoked by the following:

10.6 f569e06e03a7efa6050258a8d167de0aaa4e124c

#0  0x0000563c725aa47c in row_quiesce_table_complete (table=0x32161c058458, trx=trx@entry=0x30ed3e522680) at /data/Server/10.6B1/storage/innobase/include/ut0ut.h:327
#1  0x0000563c72b7c8a2 in ha_innobase::external_lock (this=0x5a2c2c0647f0, thd=0x5a2c2c000c58, lock_type=2) at /data/Server/10.6B1/storage/innobase/handler/ha_innodb.cc:16095
#2  0x0000563c72902618 in handler::ha_external_lock (this=0x5a2c2c0647f0, thd=thd@entry=0x5a2c2c000c58, lock_type=lock_type@entry=2) at /data/Server/10.6B1/sql/handler.cc:7128
#3  0x0000563c72a02eff in handler::ha_external_unlock (thd=0x5a2c2c000c58, this=<optimized out>) at /data/Server/10.6B1/sql/handler.h:3549
#4  unlock_external (thd=thd@entry=0x5a2c2c000c58, table=0x5a2c2c0108c0, count=<optimized out>) at /data/Server/10.6B1/sql/lock.cc:730
#5  0x0000563c72a03080 in mysql_unlock_tables (thd=0x5a2c2c000c58, sql_lock=0x5a2c2c0108a0, free_lock=free_lock@entry=false) at /data/Server/10.6B1/sql/lock.cc:435
#6  0x0000563c72a03ee2 in mysql_unlock_some_tables (thd=<optimized out>, table=table@entry=0x4ddb490dbc78, count=count@entry=1, flag=flag@entry=0) at /data/Server/10.6B1/sql/lock.cc:465
#7  0x0000563c72a03f71 in mysql_lock_remove (thd=thd@entry=0x5a2c2c000c58, locked=0x32161c072478, table=<optimized out>, table@entry=0x32161c64b9c8) at /data/Server/10.6B1/sql/lock.cc:564
#8  0x0000563c7267149f in close_all_tables_for_name (thd=thd@entry=0x5a2c2c000c58, share=<optimized out>, extra=extra@entry=HA_EXTRA_NOT_USED, skip_table=skip_table@entry=0x0) at /data/Server/10.6B1/sql/sql_base.cc:769
#9  0x0000563c726730e5 in close_cached_tables (thd=thd@entry=0x5a2c2c000c58, tables=tables@entry=0x0, wait_for_refresh=wait_for_refresh@entry=true, timeout=86400) at /data/Server/10.6B1/sql/sql_base.cc:397
#10 0x0000563c727fa671 in reload_acl_and_cache (thd=thd@entry=0x5a2c2c000c58, options=4, tables=tables@entry=0x0, write_to_binlog=write_to_binlog@entry=0x4ddb490dc0f0) at /data/Server/10.6B1/sql/sql_reload.cc:340
#11 0x0000563c726dec2d in mysql_execute_command (thd=0x5a2c2c000c58, is_called_from_prepared_stmt=<optimized out>) at /data/Server/10.6B1/sql/sql_parse.cc:5475
#12 0x0000563c726ce5c6 in mysql_parse (rawbuf=<optimized out>, length=<optimized out>, parser_state=<optimized out>, thd=0x5a2c2c000c58) at /data/Server/10.6B1/sql/sql_parse.cc:8036
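For context, the externally-driven copy workflow that this quiescing exists to support is the documented FLUSH TABLES … FOR EXPORT procedure; the sketch below uses placeholder table names. If purge is resumed on one table while the overall statement is still running, files copied in the middle step are no longer guaranteed to be consistent.

```sql
-- Documented export workflow (table names are placeholders):
FLUSH TABLES t1, t2 FOR EXPORT;  -- quiesce: stop purge, write t1.cfg, t2.cfg
-- (outside SQL: copy t1.ibd, t1.cfg, t2.ibd, t2.cfg)
UNLOCK TABLES;                   -- complete: delete the .cfg files, resume purge
```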

Some code around the call to close_all_tables_for_name() was refactored in a commit that is part of MDEV-5336.

The original MySQL WL#5522 design introduced the syntax FLUSH TABLES…FOR EXPORT, but it turns out that a plain FLUSH TABLES would do exactly the same, because InnoDB interprets the lock mode F_RDLCK as FOR EXPORT, and lock_external() uses lock_mode=F_RDLCK in the basic case.
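Under that reading of the code (an inference from the traces above, not a claim verified against every version), the plain form and the explicit form drive InnoDB through the same row_quiesce_* path:

```sql
FLUSH TABLES t FOR EXPORT;  -- WL#5522 syntax: InnoDB quiesces t explicitly
UNLOCK TABLES;
FLUSH TABLES;               -- plain form: InnoDB still sees F_RDLCK and quiesces
```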

Comment by Marko Mäkelä [ 2023-06-29 ]

I discussed this with monty earlier this week. The grammar MDEV-31393.yy generates LOCK TABLES statements as well as FLUSH TABLES; (with an empty table list). The preceding LOCK TABLES is somehow interfering with that statement, causing it to do something other than what the documentation says.
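Boiled down, the suspected interference pattern from the grammar would be something like this (table name hypothetical; this exact sequence is not verified to reproduce the hang standalone):

```sql
LOCK TABLES t1 WRITE;  -- session holds an explicit table lock
FLUSH TABLES;          -- empty table list: locked tables are closed and
                       -- reopened, and InnoDB quiesces them as if FOR EXPORT
UNLOCK TABLES;
```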

This should not be a recent regression. monty and svoj simplified the logic of FLUSH TABLES for MariaDB Server 10.6, but this part was missed. Apparently the call to close_cached_tables() should be omitted, something like this (rough idea, totally untested):

diff --git a/sql/sql_reload.cc b/sql/sql_reload.cc
index 3448e157e10..ecf0ca5965c 100644
--- a/sql/sql_reload.cc
+++ b/sql/sql_reload.cc
@@ -331,24 +331,6 @@ bool reload_acl_and_cache(THD *thd, unsigned long long options,
           }
         }
       }
-
-#ifdef WITH_WSREP
-      /* In case of applier thread, do not call flush tables */
-      if (!thd || !thd->wsrep_applier)
-#endif /* WITH_WSREP */
-      {
-        if (close_cached_tables(thd, tables,
-                                ((options & REFRESH_FAST) ?  FALSE : TRUE),
-                                (thd ? thd->variables.lock_wait_timeout :
-                                 LONG_TIMEOUT)))
-        {
-          /*
-            NOTE: my_error() has been already called by reopen_tables() within
-            close_cached_tables().
-          */
-          result= 1;
-        }
-      }
     }
     my_dbopt_cleanup();
   }

Comment by Michael Widenius [ 2023-11-05 ]

The main purpose of FLUSH TABLES is to close all open tables. This is what close_cached_tables() does, so we cannot remove the call.

The documentation says for FLUSH TABLES:
Free memory and file descriptors not in use.

FLUSH TABLES will commit any open transaction but still keep the tables locked.
Locked tables are closed and reopened (protected by an MDL lock to ensure they are not dropped).
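In other words, the documented behaviour under LOCK TABLES would be the following (hypothetical table name):

```sql
LOCK TABLES t1 READ;
FLUSH TABLES;       -- commits any open transaction; t1 is closed and reopened
                    -- under its MDL lock, and the session keeps its table lock
SELECT COUNT(*) FROM t1;  -- still permitted: the lock survived the flush
UNLOCK TABLES;
```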

I do not see anything obviously wrong in either the documentation or the flush code.
I will talk with Marko and see if we can together try to sort this out.

Comment by Michael Widenius [ 2023-11-05 ]

"because a user could try to copy all InnoDB data files after FLUSH TABLES completed, and in this case at least one data file would be unlocked prematurely during the execution of FLUSH TABLES"

The above scenario is impossible.
After FLUSH TABLES is done, all tables will be locked and no other user is using them. There is an MDL lock that prevents anyone else from opening them.

Generated at Thu Feb 08 10:23:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.