[MDEV-26445] innodb_undo_log_truncate is unnecessarily slow Created: 2021-08-20  Updated: 2023-12-01  Resolved: 2021-09-24

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.11, 10.6
Fix Version/s: 10.5.13, 10.6.5, 10.7.1

Type: Bug Priority: Critical
Reporter: martina342 Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance
Environment:

debian 11


Attachments: Text File crash.log     Text File crash2.log     Text File crash3.log     Text File hopeenoughdata.txt     Text File log.txt     Text File mariadb_bt_all_threads.txt     Text File mariadb_bt_all_threads_workload_slowed_down.txt     Text File semaphore without crash.txt     Text File show status.txt     Text File showenginecensored.txt     HTML File stackfreeze     HTML File stackfreeze2     HTML File stackfreeze3     Text File stillstuckmariadb.txt     HTML File stillstuckmariadb3    
Issue Links:
Blocks
is blocked by MDEV-26450 Corruption due to innodb_undo_log_tru... Closed
PartOf
includes MDEV-21952 ibdata1 file size growing in MariaDB Closed
Problem/Incident
causes MDEV-32757 innodb_undo_log_truncate=ON is not cr... Closed
Relates
relates to MDEV-24911 Missing warning before [ERROR] [FATAL... Open
relates to MDEV-26662 buf_read_page_background() is blockin... Closed
relates to MDEV-27530 InnoDB - Performance issues after upg... Closed

 Description   

Even on versions after 10.5.9, which should have fixed this, the server still crashes for me:

--Thread 139751148128000 has waited at btr0pcur.cc line 510 for 435.00 seconds the semaphore:
S-lock on RW-latch at 0x7f3e1c0ca778 created in file buf0buf.cc line 1221
a writer (thread id 139751159285504) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file btr0pcur.cc line 257
2021-08-20  7:33:31 0 [Note] InnoDB: A semaphore wait:
--Thread 139751139219200 has waited at btr0pcur.cc line 510 for 254.00 seconds the semaphore:
S-lock on RW-latch at 0x7f3e1c0ca778 created in file buf0buf.cc line 1221
a writer (thread id 139751159285504) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file btr0pcur.cc line 257
2021-08-20  7:33:31 0 [Note] InnoDB: A semaphore wait:
--Thread 139751134918400 has waited at btr0pcur.cc line 510 for 511.00 seconds the semaphore:
S-lock on RW-latch at 0x7f3e1c0ca778 created in file buf0buf.cc line 1221
a writer (thread id 139751159285504) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file btr0pcur.cc line 257
InnoDB: Pending reads 0, writes 0
2021-08-20  7:33:31 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
210820  7:33:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.5.11-MariaDB-1
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=346
max_threads=402
thread_count=330
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1016043 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
Can't start addr2line
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x555b4482793e]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x555b44333145]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7f40b1317140]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x141)[0x7f40b0e60ce1]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x123)[0x7f40b0e4a537]
/usr/sbin/mariadbd(+0x62fefd)[0x555b44017efd]
/usr/sbin/mariadbd(+0x626b02)[0x555b4400eb02]
/usr/sbin/mariadbd(_ZN5tpool19thread_pool_generic13timer_generic7executeEPv+0x3b)[0x555b447c91fb]
/usr/sbin/mariadbd(_ZN5tpool4task7executeEv+0x32)[0x555b447ca592]
/usr/sbin/mariadbd(_ZN5tpool19thread_pool_generic11worker_mainEPNS_11worker_dataE+0x4f)[0x555b447c8fbf]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xceed0)[0x7f40b11feed0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7)[0x7f40b130bea7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f40b0f22def]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /home/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1030947              1030947              processes 
Max open files            60000                60000                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1030947              1030947              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core
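When triaging a hang like the one above, the useful signals are the longest reported wait and the writer thread that every waiter is blocked on (here, all three waiters are stuck behind thread 139751159285504). A minimal sketch for pulling those out of an error log; this is not part of MariaDB, and the regexes only assume the exact log format shown above:

```python
import re

# Each "--Thread ... has waited at ... for N seconds" line names a waiter;
# the "a writer (thread id ...) has reserved it" line that follows names
# the holder that is blocking it.
WAIT_RE = re.compile(
    r"--Thread (\d+) has waited at (\S+) line (\d+) for ([\d.]+) seconds")
WRITER_RE = re.compile(r"a writer \(thread id (\d+)\)")

def summarize_waits(log_text):
    """Return (longest_wait_seconds, set_of_blocking_writer_ids)."""
    longest = 0.0
    writers = set()
    for line in log_text.splitlines():
        m = WAIT_RE.search(line)
        if m:
            longest = max(longest, float(m.group(4)))
        w = WRITER_RE.search(line)
        if w:
            writers.add(w.group(1))
    return longest, writers
```

Fed the excerpt above, it reports a longest wait of 511 seconds and a single blocking writer, which is consistent with one stuck exclusive latch holder rather than many independent hangs.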

Then after this crash, the undo file is corrupted and MariaDB can't start:

2021-08-20 10:07:23 0 [ERROR] [FATAL] InnoDB: Trying to read 16384 bytes at 884310016 outside the bounds of the file: .//undo002
210820 10:07:23 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.5.11-MariaDB-1
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=402
thread_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1016043 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
??:0(my_print_stacktrace)[0x55a09b42993e]
??:0(handle_fatal_signal)[0x55a09af35145]
sigaction.c:0(__restore_rt)[0x7f9b7a0e0140]
linux/raise.c:51(__GI_raise)[0x7f9b79c29ce1]
stdlib/abort.c:81(__GI_abort)[0x7f9b79c13537]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55a09ac19efd]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55a09ac2d0f2]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55a09ac2d137]
??:0(std::unique_lock<std::mutex>::unlock())[0x55a09b34b247]
??:0(std::unique_lock<std::mutex>::unlock())[0x55a09b34c494]
??:0(std::unique_lock<std::mutex>::unlock())[0x55a09b33adb0]
??:0(std::unique_lock<std::mutex>::unlock())[0x55a09b302398]
??:0(std::unique_lock<std::mutex>::unlock())[0x55a09b2f873a]
??:0(std::unique_lock<std::mutex>::unlock())[0x55a09b2fee6e]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55a09ac13662]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x55a09b213139]
??:0(ha_initialize_handlerton(st_plugin_int*))[0x55a09af380de]
??:0(sys_var_pluginvar::sys_var_pluginvar(sys_var_chain*, char const*, st_plugin_int*, st_mysql_sys_var*))[0x55a09ad4db5a]
??:0(plugin_init(int*, char**, int))[0x55a09ad4ec47]
??:0(unireg_abort)[0x55a09ac7a457]
??:0(mysqld_main(int, char**))[0x55a09ac800e0]
csu/libc-start.c:308(__libc_start_main)[0x7f9b79c14d0a]
??:0(_start)[0x55a09ac7446a]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /home/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             1030947              1030947              processes 
Max open files            60000                60000                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       1030947              1030947              signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core
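The fatal error above reports a raw byte offset (884310016) that lies outside `.//undo002`. Assuming the default `innodb_page_size` of 16 KiB (an assumption worth checking on the affected server), the offset maps cleanly to a page number, which shows which page of the truncated undo tablespace the startup code was still trying to read:

```python
PAGE_SIZE = 16 * 1024  # assumes the default innodb_page_size of 16 KiB

def offset_to_page(offset, page_size=PAGE_SIZE):
    """Map a byte offset from the error message to an InnoDB page number."""
    page, remainder = divmod(offset, page_size)
    # InnoDB reads whole pages, so a valid offset is page-aligned.
    assert remainder == 0, "offset is not page-aligned"
    return page

print(offset_to_page(884310016))  # → 53974
```

So the failed 16384-byte read targets page 53974, i.e. a page beyond the new, truncated end of undo002.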



 Comments   
Comment by Marko Mäkelä [ 2021-08-20 ]

Can you please produce a stack trace of all threads at the time of the hang (from the core dump)? A link to the instructions was included in the output that you posted: https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/

Comment by martina342 [ 2021-08-20 ]

The mariadb binary is stripped, and now the database is corrupted; I don't even know if I can fix it. I set innodb_force_recovery = 6 to start the server, but I don't know whether I can clear the contents of the undo logs, and I don't want to corrupt the database again.

EDIT: force recovery is read-only. Can you please tell me how to discard the undo logs? I tried setting innodb_undo_tablespaces = 0, but then it fails with
InnoDB: File ./ibdata1: 'aio write' returned OS error 209. Cannot continue operation

Comment by martina342 [ 2021-08-22 ]

I installed MariaDB 10.5 from the MariaDB repository along with mariadb-server-core-10.5-dbgsym, then followed the steps for the stack trace, and it crashed again (I noticed afterward that I needed to add the core_file option):

--Thread 139462800680704 has waited at btr0pcur.cc line 510 for 556.00 seconds the semaphore:
S-lock on RW-latch at 0x7ee83c0cdc78 created in file buf0buf.cc line 1227
a writer (thread id 139461606029056) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file row0ins.cc line 2633
2021-08-21 22:49:09 0 [Note] InnoDB: A semaphore wait:
--Thread 139461610944256 has waited at trx0trx.cc line 881 for 606.00 seconds the semaphore:
Mutex at 0x55d0b144f300, Mutex REDO_RSEG created ./storage/innobase/trx/trx0rseg.cc:404, lock var 2
 
2021-08-21 22:49:09 0 [Note] InnoDB: A semaphore wait:
--Thread 139461603878656 has waited at btr0pcur.cc line 510 for 266.00 seconds the semaphore:
S-lock on RW-latch at 0x7ee83c0cdc78 created in file buf0buf.cc line 1227
a writer (thread id 139461606029056) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file row0ins.cc line 2633
2021-08-21 22:49:09 0 [Note] InnoDB: A semaphore wait:
--Thread 139462802523904 has waited at btr0pcur.cc line 510 for 370.00 seconds the semaphore:
S-lock on RW-latch at 0x7ee83c0cdc78 created in file buf0buf.cc line 1227
a writer (thread id 139461606029056) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file row0ins.cc line 2633
2021-08-21 22:49:09 0 [Note] InnoDB: A semaphore wait:
--Thread 139462801295104 has waited at btr0pcur.cc line 510 for 534.00 seconds the semaphore:
S-lock on RW-latch at 0x7ee83c0cdc78 created in file buf0buf.cc line 1227
a writer (thread id 139461606029056) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file row0ins.cc line 2633

2021-08-21 22:49:09 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
210821 22:49:09 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.12-MariaDB-1:10.5.12+maria~bullseye
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=144
max_threads=402
thread_count=74
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1016036 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x55d00358949e]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x55d002fe7305]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7efd7f5a2140]
2021-08-21 22:49:18 0 [Note] InnoDB: Truncated .//undo001
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x141)[0x7efd7f0ebce1]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x123)[0x7efd7f0d5537]
/usr/sbin/mariadbd(+0x63724f)[0x55d002ca624f]
/usr/sbin/mariadbd(+0x62c07d)[0x55d002c9b07d]
/usr/sbin/mariadbd(_ZN5tpool19thread_pool_generic13timer_generic7executeEPv+0x3b)[0x55d00351f60b]
/usr/sbin/mariadbd(_ZN5tpool4task7executeEv+0x32)[0x55d0035204c2]
/usr/sbin/mariadbd(_ZN5tpool19thread_pool_generic11worker_mainEPNS_11worker_dataE+0x4f)[0x55d00351f3cf]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xceed0)[0x7efd7f489ed0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7)[0x7efd7f596ea7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7efd7f1addef]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /home/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1031116              1031116              processes
Max open files            32768                32768                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1031116              1031116              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: core
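The crash reports above print "Max core file size 0", which is why no core dump was produced until the core_file option was added and the limit raised. A small sketch (illustrative only, using Python's standard resource module) to check the core-file limit of the current process before trying to reproduce:

```python
import resource

# Soft/hard limits for core files; a soft limit of 0 suppresses core dumps.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
if soft == 0:
    print("core dumps disabled; raise the soft limit (e.g. ulimit -c unlimited)")
else:
    print("core dumps allowed, soft limit:", soft)
```

The same check from a shell is `ulimit -c`; for the server process itself, the limits the crash handler prints (the "Resource Limits" table above) are authoritative.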

Comment by martina342 [ 2021-08-24 ]

Here is a full stack trace:

GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
 
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/mysqld...
Reading symbols from /usr/lib/debug/.build-id/67/9570f0b96cbccab866cbf085e309ed3a7e8ddc.debug...
 
warning: Can't open file /[aio] (deleted) during file-backed mapping note processing
[New LWP 669786]
[... 337 similar "New LWP" lines omitted ...]
[New LWP 309772]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/sbin/mariadbd'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
56      ../sysdeps/unix/sysv/linux/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f7455ffb700 (LWP 669786))]

Comment by martina342 [ 2021-08-24 ]

The database is now corrupted again:

2021-08-24 11:04:50 0 [ERROR] [FATAL] InnoDB: Trying to read 16384 bytes at 4633837568 outside the bounds of the file: .//undo002
210824 11:04:50 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.12-MariaDB-1:10.5.12+maria~bullseye
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=402
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1016036 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x5625bef0e49e]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5625be96c305]
sigaction.c:0(__restore_rt)[0x7fc93b1b9140]
linux/raise.c:51(__GI_raise)[0x7fc93ad02ce1]
stdlib/abort.c:81(__GI_abort)[0x7fc93acec537]
/usr/sbin/mariadbd(+0x63724f)[0x5625be62b24f]
/usr/sbin/mariadbd(+0x64bd1b)[0x5625be63fd1b]
ut/ut0ut.cc:576(ib::fatal::~fatal() [clone .cold])[0x5625be63fd61]
fil/fil0fil.cc:3224(fil_report_invalid_page_access(char const*, unsigned long, unsigned long, bool))[0x5625bedf4aac]
include/fil0fil.h:535(fil_space_t::acquire())[0x5625bede2054]
buf/buf0rea.cc:344(buf_read_page(page_id_t, unsigned long))[0x5625bed9e9ca]
buf/buf0buf.cc:2960(buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool))[0x5625bed8c7a6]
trx/trx0undo.cc:854(trx_undo_mem_create_at_db_start(trx_rseg_t*, unsigned long, unsigned int, unsigned long&))[0x5625bed97914]
trx/trx0rseg.cc:431(trx_rseg_array_init())[0x5625be622daf]
trx/trx0trx.cc:740(trx_lists_init_at_db_start())[0x5625bec79319]
srv/srv0start.cc:1488(srv_start(bool))[0x5625be96ee9e]
handler/ha_innodb.cc:4025(innodb_init(void*))[0x5625be76ccca]
sql/handler.cc:645(ha_initialize_handlerton(st_plugin_int*))[0x5625be76df4d]
sql/sql_plugin.cc:1459(plugin_initialize(st_mem_root*, st_plugin_int*, int*, char**, bool))[0x5625be68bf9f]
sql/sql_plugin.cc:1751(plugin_init(int*, char**, int))[0x5625be691919]
csu/libc-start.c:308(__libc_start_main)[0x7fc93acedd0a]
/usr/sbin/mariadbd(_start+0x2a)[0x5625be68662a]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /home/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1031116              1031116              processes
Max open files            32768                32768                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1031116              1031116              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: core
 

Comment by martina342 [ 2021-08-25 ]

Here is a crash with the full stack trace of all threads (attached):

2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041117320960 has waited at dict0dict.cc line 1083 for 39.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041117013760 has waited at dict0dict.cc line 1083 for 39.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041116706560 has waited at dict0dict.cc line 1083 for 38.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041116399360 has waited at dict0dict.cc line 1083 for 36.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041116092160 has waited at dict0dict.cc line 1083 for 36.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041115784960 has waited at dict0dict.cc line 1083 for 32.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041115477760 has waited at dict0dict.cc line 1083 for 30.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041115170560 has waited at dict0dict.cc line 1083 for 10.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041114863360 has waited at dict0dict.cc line 1083 for 10.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041114556160 has waited at dict0dict.cc line 1083 for 10.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041114248960 has waited at dict0dict.cc line 1083 for 10.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041113941760 has waited at dict0dict.cc line 1083 for 10.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041113634560 has waited at dict0dict.cc line 1083 for 10.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
2021-08-25  8:04:11 0 [Note] InnoDB: A semaphore wait:
--Thread 140041113327360 has waited at dict0dict.cc line 1083 for 10.00 seconds the semaphore:
Mutex at 0x55c5e83dd040, Mutex DICT_SYS created ./storage/innobase/dict/dict0dict.cc:1027, lock var 2
 
InnoDB: Pending reads 0, writes 0
2021-08-25  8:04:11 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
210825  8:04:11 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.12-MariaDB-1:10.5.12+maria~bullseye
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=347
max_threads=402
thread_count=347
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1016036 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x55c5e7c0549e]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x55c5e7663305]
sigaction.c:0(__restore_rt)[0x7f84317a0140]
linux/raise.c:51(__GI_raise)[0x7f84312e9ce1]
stdlib/abort.c:81(__GI_abort)[0x7f84312d3537]
/usr/sbin/mariadbd(+0x63724f)[0x55c5e732224f]
/usr/sbin/mariadbd(+0x62c07d)[0x55c5e731707d]
ut/ut0ut.cc:576(ib::fatal::~fatal() [clone .cold])[0x55c5e7b9b60b]
srv/srv0srv.cc:1337(srv_monitor_task.cold)[0x55c5e7b9c4c2]
bits/atomic_base.h:404(std::__atomic_base<bool>::store(bool, std::memory_order))[0x55c5e7b9b3cf]
2021-08-25  8:04:14 0 [Note] InnoDB: Truncated .//undo002
??:0(std::error_code::default_error_condition() const)[0x7f8431687ed0]
nptl/pthread_create.c:478(start_thread)[0x7f8431794ea7]
x86_64/clone.S:97(__GI___clone)[0x7f84313abdef]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.

crash.log

Comment by martina342 [ 2021-08-25 ]

The full log at the time of the crash is attached as log.txt

Comment by martina342 [ 2021-08-25 ]

@Marko Mäkelä, do you have all the information needed, or do you need more stack traces?

Comment by Marko Mäkelä [ 2021-08-31 ]

martina342, the InnoDB "semaphore wait" output is insufficient for analyzing hangs.

https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ uses the terms "backtrace" or "bt" to refer to stack traces.

Comment by martina342 [ 2021-08-31 ]

Can you please check crash.log? It is the "backtrace" (bt)

Comment by Marko Mäkelä [ 2021-08-31 ]

Sorry, indeed, crash.log it is. I see page_id=... in many stack traces. The following command should prevent such hiding:

set print frame-arguments all

I see lock=0x7f83dc1c2578 being passed to pfs_rw_lock_s_lock_func() in many threads. Based on the stack traces, that is an index page. Thread 44 is waiting for the exclusive page latch, which will cause any subsequent shared latch requests to wait until the exclusive latch has been granted and released. That purge thread is also, by design, holding dict_sys.latch in shared mode (this was just fixed in MDEV-24258). That dict_sys.latch will in turn block some other thread that is waiting for dict_sys.mutex. But I cannot quickly determine which thread is blocking Thread 44; it might even be impossible based on the available information.

These stack traces are definitely a step in the right direction for analyzing this, but it would be very helpful to see some more data from the core dump. If you still have the core dump from which crash.log was generated, then please provide the output of the following:

print dict_sys.latch
print dict_sys.mutex

If you have to generate a new core dump first, then I would also need the stack traces for that core dump.
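
If it helps, the two print commands above can be run non-interactively against the core dump together with the stack traces, e.g. (a sketch; the binary path and the core file path are examples and need to be adjusted to your system):

```
gdb /usr/sbin/mariadbd /path/to/core \
  --batch \
  -ex "set pagination 0" \
  -ex "set print frame-arguments all" \
  -ex "print dict_sys.latch" \
  -ex "print dict_sys.mutex" \
  -ex "thread apply all bt"
```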

Theoretically, I could go through all 361 stack traces and try to match them against the source code, to see which threads could possibly be holding the exclusive latch, but I would like to save that effort.

Comment by martina342 [ 2021-08-31 ]

I have a new crash log from a new core dump

crash2.log

(gdb) print dict_sys.latch
$1 = {<ilist_node<void>> = {next = 0x55f6b7f887a0, prev = 0x55f6b7f8a850}, lock_word = {m = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 536870912}, <No data fields>}}, waiters = {
    m = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 0}, <No data fields>}}, sx_recursive = 0, writer_thread = 0, event = 0x55f6b7f88e08, wait_ex_event = 0x55f6b7f88e98,
  cfile_name = 0x55f61011a268 "./storage/innobase/dict/dict0dict.cc", last_x_file_name = 0x55f61010fdb8 "./storage/innobase/srv/srv0srv.cc", cline = 1036, is_block_lock = 0, last_x_line = 1538, count_os_wait = 4,
  pfs_psi = 0x0}
(gdb) print dict_sys.mutex
$2 = {m_impl = {m_lock_word = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 0}, <No data fields>}, m_event = 0x55f6b7f88d58, m_policy = {
      m_filename = 0x55f61011a268 "./storage/innobase/dict/dict0dict.cc", m_line = 1027, m_count = {m_spins = 0, m_waits = 0, m_calls = 0, m_enabled = false}, m_id = LATCH_ID_DICT_SYS}}, m_ptr = 0x0}

Comment by Marko Mäkelä [ 2021-09-01 ]

Is crash2.log from MariaDB Server 10.5.12? The line number in srv_monitor_task() suggests so. I only see 3 InnoDB threads being executed: the purge coordinator in Thread 63 (busy, but not waiting for anything), the page cleaner (idle) in Thread 42, and the crashing srv_monitor_task() in Thread 1. That looks like a false alarm; maybe a larger value of innodb_fatal_semaphore_wait_threshold could help avoid that.

In any case, the stack traces are completely different from crash.log, because there is no os_event::wait to be seen.

Comment by martina342 [ 2021-09-01 ]

Yes, it's 10.5.12.

Comment by Marko Mäkelä [ 2021-09-01 ]

martina342, good. I would tend to believe that some fixes in 10.5.12 (such as MDEV-25801) could have removed the original slowness that you reported.

I do not have any idea what could trigger a false alarm in the InnoDB watchdog. The watchdog has remained largely unchanged for several years, and the last time I remember fixing a false alarm was some 10 years ago. In 10.6, I removed the "sync array" and rewrote the watchdog to cover only the dict_sys.mutex or dict_sys.latch.

You might want to ensure that the InnoDB buffer pool and redo log are configured to be large enough for your workload.
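
For example (the values below are only illustrative and must be sized to the actual working set, available RAM, and write rate):

```
[mysqld]
# Rule of thumb: buffer pool around 50-75% of RAM on a dedicated database server
innodb_buffer_pool_size = 8G
# Redo log large enough that checkpoint flushing is not constantly forced
innodb_log_file_size    = 2G
```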

Comment by martina342 [ 2021-09-01 ]

I'm not sure whether crash2 was the first crash after the semaphore lock. It may have been a later crash, after the data was corrupted by the first crash and I tried to restart MariaDB.

I will send another crash log with the correct settings once I have a crash that I'm sure is the first one after the lock.

Comment by martina342 [ 2021-09-04 ]

Hello

it crashed again.

2021-09-04 20:39:02 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 1000 seconds. We intentionally crash the server because it appears to be hung.
210904 20:39:02 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.12-MariaDB-1:10.5.12+maria~bullseye
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=197
max_threads=402
thread_count=197
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1016036 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x559dbb56749e]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x559dbafc5305]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14140)[0x7f4c57031140]
2021-09-04 20:39:11 0 [Note] InnoDB: Truncated .//undo001
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x141)[0x7f4c56b7ace1]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x123)[0x7f4c56b64537]
/usr/sbin/mariadbd(+0x63724f)[0x559dbac8424f]
/usr/sbin/mariadbd(+0x62c07d)[0x559dbac7907d]
/usr/sbin/mariadbd(_ZN5tpool19thread_pool_generic13timer_generic7executeEPv+0x3b)[0x559dbb4fd60b]
/usr/sbin/mariadbd(_ZN5tpool4task7executeEv+0x32)[0x559dbb4fe4c2]
2021-09-04 20:39:18 0 [Note] InnoDB: Truncating .//undo002
/usr/sbin/mariadbd(_ZN5tpool19thread_pool_generic11worker_mainEPNS_11worker_dataE+0x4f)[0x559dbb4fd3cf]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xceed0)[0x7f4c56f18ed0]
2021-09-04 20:39:23 0 [Note] InnoDB: Truncated .//undo002
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7)[0x7f4c57025ea7]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f4c56c3cdef]

(gdb) print dict_sys.latch
$1 = {<ilist_node<void>> = {next = 0x559e6324d840, prev = 0x559e6324f8c0}, lock_word = {m = {<std::__atomic_base<int>> = {static _S_alignment = 4, _M_i = 536870912}, <No data fields>}}, waiters = {
    m = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 0}, <No data fields>}}, sx_recursive = 0, writer_thread = 0, event = 0x559e6324de78, wait_ex_event = 0x559e6324df08,
  cfile_name = 0x559dbb7b6268 "./storage/innobase/dict/dict0dict.cc", last_x_file_name = 0x559dbb7abdb8 "./storage/innobase/srv/srv0srv.cc", cline = 1036, is_block_lock = 0, last_x_line = 1538, count_os_wait = 3872,
  pfs_psi = 0x0}
 
(gdb) print dict_sys.mutex
$2 = {m_impl = {m_lock_word = {<std::__atomic_base<unsigned int>> = {static _S_alignment = 4, _M_i = 0}, <No data fields>}, m_event = 0x559e6324ddc8, m_policy = {
      m_filename = 0x559dbb7b6268 "./storage/innobase/dict/dict0dict.cc", m_line = 1027, m_count = {m_spins = 0, m_waits = 0, m_calls = 0, m_enabled = false}, m_id = LATCH_ID_DICT_SYS}}, m_ptr = 0x0}
 

The backtrace is attached as crash3.log.

Comment by martina342 [ 2021-09-06 ]

hello,

Today I got semaphore wait logs which did not result in a crash.
See the attached semaphore without crash.txt

Comment by martina342 [ 2021-09-06 ]

Update: it is waiting on a semaphore right now again, with all requests frozen on the server.
They are stuck in "statistics" or "Sending data", even though they are very simple requests.

I have a WordPress website, and each visitor generates this request, now stuck in "statistics":
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'

I have tens of these requests in my process list.
The table has 3,853 rows and is about 20 MB.

Edit: it unstuck itself after 15 minutes; all requests were processed.

Comment by Marko Mäkelä [ 2021-09-06 ]

semaphore without crash.txt does not include enough information for analysis. As for crash3.log, it does not look like dict_sys is really locked. I only see 3 threads whose stack frames contain the string storage/innobase. None of them are waiting for any resource inside InnoDB. The monitor task is killing the server, and the 2 other threads are actively executing something.

I think that your system is misconfigured in some way, because such output is not normal. It could be useful to check with sudo perf record, http://poormansprofiler.org, or similar tools what is going on. But I cannot provide free support via this bug reporting system.

Comment by martina342 [ 2021-09-09 ]

I upgraded to MariaDB 10.6, but it is still happening. It still freezes, but no longer crashes.
The "top" command doesn't show much CPU usage (it shows some CPU usage for mariadb).
iotop shows zero I/O for mariadb, and almost zero reads/writes or I/O % for other processes.

The MariaDB datadir doesn't show any modifications since the time the freeze started.
I followed http://poormansprofiler.org/ and executed "gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mariadbd)".
Here is the generated file: stackfreeze

I don't see anything special in the syslog.
mdstat doesn't show any problem with the RAID.
smartctl doesn't show any errors on either of the 2 NVMe drives.

All requests are stuck. For example:
"SHOW STATUS" is stuck on "Filling schema table"
"SHOW ENGINE INNODB STATUS" is stuck on "starting"
Others are stuck on "update" or "statistics"

Testing the disk doesn't show any errors:
dd if=/dev/zero of=/tmp/test1.img bs=1G count=1 oflag=dsync
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.30623 s, 466 MB/s

I have an undo002 file that is 35 GB; at 13h01 MariaDB started truncating it, and at 13h25 it stopped doing anything.
My ib_logfile0 is 37 GB.

Here is my MariaDB configuration:

# Import all .cnf files from configuration directory
!includedir /etc/mysql/conf.d/
!includedir /etc/mysql/mariadb.conf.d/

[mysql]

[mysqld]
datadir = /home/mysql
tmpdir = /home/mysqltmp/
default-time-zone = Europe/Paris
wait_timeout = 43200
max_allowed_packet = 64M
query_cache_size = 0
query_cache_type = 0
innodb_buffer_pool_size = 150G
innodb_log_file_size = 37G
innodb_buffer_pool_instances = 64
innodb_undo_log_truncate = ON
skip-name-resolve = 1
tmp_table_size = 256M
max_heap_table_size = 256M
max_connections = 400
net_read_timeout = 250600
net_write_timeout = 250600
join_buffer_size = 1MB
sql_mode = ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
innodb-defragment = 1
innodb_online_alter_log_max_size= 50000000000
table_definition_cache = 1000
innodb_undo_tablespaces = 2
table_open_cache = 30000
log_error = /var/log/mysql/error.log
innodb_fatal_semaphore_wait_threshold = 2000

[mariadb]
core_file

I noticed that a file ddl_recovery.log was created at 13h21.

I tried to kill some of the oldest requests from the processlist, but they remain in the processlist with the "Killed" command.

I modified the gdb command to add frame arguments:
gdb -ex "set pagination 0" -ex "set print frame-arguments all" -ex "thread apply all bt" --batch -p $(pidof mariadbd)
Here is the output:
stackfreeze2

When I executed the command, it showed "29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.". I don't know if that is relevant.

In my opinion, there is something wrong in MariaDB, not in my server configuration. And even if some of my MariaDB settings are not appropriate for my workload, that shouldn't completely freeze the server.

Here is a third stack trace, in case you would like to compare it with the previous one and see whether MariaDB did anything in between:
stackfreeze3

I have set innodb_fatal_semaphore_wait_threshold = 2000,
but now, 45 minutes after the freeze started (it is still frozen), it has not crashed or generated any logs about the semaphore wait.

Comment by martina342 [ 2021-09-09 ]

Update: it partially unfroze for a moment at 14h51 and let some commands through.
"SHOW ENGINE INNODB STATUS;" worked after waiting for more than an hour.

Here is the output (censored):
showenginecensored.txt

In the MariaDB data folder, the file ib_logfile0 was updated at 14h51; no other file has had its modification time changed since 13h23.

The processlist still shows a lot of requests stuck since the beginning of the freeze. I don't know why some requests worked.

Edit: the MariaDB log shows some aborted sorts; these were the requests in "Creating sort index" that I killed:
2021-09-09 14:51:41 63364 [Warning] Sort aborted, host: localhost, user: sitewordpress, thread: 63364, query: SELECT ***
2021-09-09 14:51:41 58831 [Warning] Sort aborted, host: localhost, user: sitewordpress, thread: 58831, query: SELECT ***
2021-09-09 14:51:41 55718 [Warning] Sort aborted, host: localhost, user: sitewordpress, thread: 55718, query: SELECT ***
2021-09-09 14:51:41 58784 [Warning] Sort aborted, host: localhost, user: sitewordpress, thread: 58784, query: SELECT ***
2021-09-09 14:51:41 45088 [Warning] Sort aborted, host: localhost, user: sitewordpress, thread: 45088, query: SELECT ***
2021-09-09 14:51:41 58798 [Warning] Sort aborted, host: localhost, user: sitewordpress, thread: 58798, query: SELECT ***

Not all the requests that I killed earlier are gone; I still see some of them with the "Killed" command.

Comment by martina342 [ 2021-09-13 ]

I completely cleared the MariaDB server (removed the mysql data folder and recreated everything from zero) and I am importing from mysqldump's .sql files, and it still freezes:

Id      User    Host    db      Command Time    State   Info    Progress
12407   root    localhost       db  Query   535     Update  INSERT INTO `import` VALUES (4339807,1992787),(4339807,71207374),(4339808,27759    0.000
22303   root    localhost       sitewordpress   Query   371     creating table  CREATE TABLE `wbk_appointments` (\n  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,\n  `name` varchar(  0.000
22332   root    localhost       NULL    Killed  347     Filling schema table    SHOW STATUS     0.000
22493   root    localhost       NULL    Killed  187     Filling schema table    show global status like '%dirty%'       0.000
22568   root    localhost       bdd_principale  Sleep   114             NULL    0.000
22684   root    localhost       NULL    Query   0       starting        show processlist        0.000

gdb output: stillstuckmariadb.txt

It unstuck itself after 15 minutes. The server was doing nothing else and wasn't overloaded.

Comment by martina342 [ 2021-09-13 ]

It froze again; one simple insert is stuck again. But this time, "SHOW ENGINE INNODB STATUS;" and "show status" work.

Id User Host db Command Time State Info Progress
12407 root localhost bdd_principale Query 4466 Update INSERT INTO `**` VALUES (2510****RE 0.000
23056 root localhost bdd_principale Query 5802 Waiting for table metadata lock SELECT eff*****, 0.000
25326 root localhost NULL Sleep 50 NULL 0.000
29037 root localhost NULL Query 0 starting show processlist 0.000

The second query is waiting for a table metadata lock because it is on the same table. The insert has been stuck for more than an hour.

2021-09-13 18:16:57 0x7fe7d01f6700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 48 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 0 srv_active, 0 srv_shutdown, 17624 srv_idle
srv_master_thread log flush and writes: 17624
----------
SEMAPHORES
----------
------------
TRANSACTIONS
------------
Trx id counter 90242
Purge done for trx's n:o < 70633 undo n:o < 0 state: running
History list length 9692
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 90241, ACTIVE 4416 sec inserting
mysql tables in use 1, locked 1
1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 342
MariaDB thread id 12407, OS thread handle 140633604970240, query id 83758 localhost root Update
INSERT INTO `****` VALUES (2510******
--------
FILE I/O
--------
Pending flushes (fsync) log: 0; buffer pool: 0
206 OS file reads, 1454819 OS file writes, 89726 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 0, seg size 2, 0 merges
merged operations:
 insert 0, delete mark 0, delete 0
discarded operations:
 insert 0, delete mark 0, delete 0
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 60340282849
Log flushed up to   60340282849
Pages flushed up to 28160947137
Last checkpoint at  28160947137
0 pending log flushes, 0 pending chkp writes
50192 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 161094828032
Dictionary memory allocated 956706280
Buffer pool size   9716400
Free buffers       6346865
Database pages     3369535
Old database pages 1243812
Modified db pages  2030057
Percent of dirty pages(LRU & free pages): 20.893
Max dirty pages percent: 5.000
Pending reads 0
Pending writes: LRU 0, flush list 1025
Pages made young 33628, not young 77
0.00 youngs/s, 0.00 non-youngs/s
Pages read 187, created 3498534, written 1403598
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 3369535, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 read views open inside InnoDB
Process ID=0, Main thread ID=0, state: sleeping
Number of rows inserted 127004126, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Number of system rows inserted 0, updated 0, deleted 0, read 0
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT

Output of show status:
show status.txt

gdb output:
stillstuckmariadb3

Marko, do you need more info?

Comment by Marko Mäkelä [ 2021-09-13 ]

martina342, to which version did you upgrade? Unfortunately, stillstuckmariadb3 is showing ?? for all InnoDB functions. I do not think that a metadata lock wait counts as a server hang. You can try setting lock_wait_timeout to a smaller value if you expect frequent conflicts with DDL operations or statements like FLUSH TABLES WITH READ LOCK.
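
For example (the value is only illustrative; the default is one day, and a suitable value depends on how long you expect DDL or FLUSH operations to legitimately hold metadata locks):

```
SET GLOBAL lock_wait_timeout = 60;
-- or persist it in my.cnf under [mysqld]:
-- lock_wait_timeout = 60
```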

Comment by martina342 [ 2021-09-13 ]

It is version 10.6.4-MariaDB-1:10.6.4+maria~bullseye, mariadb.org binary distribution.

Did you see that I got a freeze lasting 15 minutes, during which any request like "SHOW STATUS" was frozen in "Filling schema table"?

"stillstuckmariadb3 is showing ?? for all InnoDB functions."
What do you recommend to make it show the function names instead of "??"?

Modifying the variable "lock_wait_timeout" would only be a temporary workaround, right?

I don't know if that was about "Waiting for table metadata lock", which is not important here, as it is probably the dump locking the table.

My problem is not that, but the INSERT that was stuck (I killed it after more than 2 hours), while other imports into the same table took much less time and MariaDB was not doing anything.

Comment by martina342 [ 2021-09-14 ]

I started over again by removing the MariaDB data folder and re-importing the mysqldump backups.
This time it froze earlier, on another table:

Id      User    Host    db      Command Time    State   Info    Progress
3793    root    localhost       bdd_principale  Query   15033   Update  INSERT INTO `***` VALUES ('*****    0.000
16167   root    localhost       *****  Query   37960   Sending data    SELECT /*!40001 SQL_NO_CACHE */ `id`,*****    0.000
44841   root    localhost       NULL    Sleep   9064            NULL    0.000
53986   root    localhost       NULL    Query   0       starting        show processlist        0.000

the query "SELECT /*!40001 SQL_NO_CACHE" that kicked in is the automatic backup, but it's stuck too.

The table at which is stuck at, the frm file is about 100MB, but it got stuck at 68MB

I ran mariadb using perf record. It generated a 8GB file perf.data.
Here is the result (the beginning) for "perf report".

Samples: 217M of event 'cycles', Event count (approx.): 156165128856310
Overhead  Command   Shared Object        Symbol
  65.66%  mariadbd  mariadbd             [.] 0x0000000000de75e3
  30.53%  mariadbd  mariadbd             [.] 0x0000000000de75e7
   0.91%  mariadbd  mariadbd             [.] 0x0000000000de75ee
   0.28%  mariadbd  mariadbd             [.] MYSQLparse
   0.10%  mariadbd  mariadbd             [.] 0x0000000000de75de
   0.08%  mariadbd  mariadbd             [.] Lex_input_stream::get_text
   0.07%  mariadbd  mariadbd             [.] 0x0000000000d1c195
   0.06%  mariadbd  mariadbd             [.] Lex_input_stream::lex_one_token
   0.04%  mariadbd  mariadbd             [.] my_numchars_mb
   0.03%  mariadbd  [kernel.kallsyms]    [k] asm_sysvec_apic_timer_interrupt
   0.03%  mariadbd  libc-2.31.so         [.] __memmove_avx_unaligned_erms
   0.03%  mariadbd  mariadbd             [.] 0x0000000000ddb73e
   0.03%  mariadbd  [kernel.kallsyms]    [k] ktime_get_update_offsets_now
   0.02%  mariadbd  mariadbd             [.] fill_record
   0.02%  mariadbd  mariadbd             [.] 0x0000000000f24360
   0.02%  mariadbd  mariadbd             [.] 0x0000000000ddb707
   0.02%  mariadbd  mariadbd             [.] setup_fields
   0.02%  mariadbd  [kernel.kallsyms]    [k] psi_group_change
   0.02%  mariadbd  mariadbd             [.] 0x0000000000de75eb
   0.02%  mariadbd  mariadbd             [.] mysql_insert
   0.02%  mariadbd  [kernel.kallsyms]    [k] native_write_msr
   0.02%  mariadbd  libc-2.31.so         [.] _int_malloc
   0.02%  mariadbd  [kernel.kallsyms]    [k] irq_enter_rcu
   0.02%  mariadbd  mariadbd             [.] alloc_root
   0.02%  mariadbd  libc-2.31.so         [.] __memcmp_avx2_movbe
   0.02%  mariadbd  [kernel.kallsyms]    [k] acct_account_cputime
   0.02%  mariadbd  mariadbd             [.] Query_arena::free_items
   0.02%  mariadbd  mariadbd             [.] 0x0000000000cf8c70
   0.01%  mariadbd  mariadbd             [.] 0x0000000000d1c57b
   0.01%  mariadbd  mariadbd             [.] get_charset
   0.01%  mariadbd  mariadbd             [.] 0x0000000000ddb784
   0.01%  mariadbd  [kernel.kallsyms]    [k] hrtimer_interrupt
   0.01%  mariadbd  mariadbd             [.] Lex_input_stream::unescape
   0.01%  mariadbd  mariadbd             [.] 0x0000000000f29083
   0.01%  mariadbd  [kernel.kallsyms]    [k] cpufreq_this_cpu_can_update
   0.01%  mariadbd  [kernel.kallsyms]    [k] pvclock_gtod_notify
   0.01%  mariadbd  libc-2.31.so         [.] _int_free
   0.01%  mariadbd  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
   0.01%  mariadbd  mariadbd             [.] 0x0000000000f29067
   0.01%  mariadbd  mariadbd             [.] 0x0000000000f2904b                                                                                                                                                                                             â–’
   0.01%  mariadbd  [kernel.kallsyms]    [k] __update_load_avg_se                                                                                                                                                                                           â–’
   0.01%  mariadbd  mariadbd             [.] 0x0000000000ddb7d4                                                                                                                                                                                             â–’
   0.01%  mariadbd  [kernel.kallsyms]    [k] perf_event_task_tick                                                                                                                                                                                           â–’
   0.01%  mariadbd  mariadbd             [.] 0x0000000000f24363                                                                                                                                                                                             â–’
   0.01%  mariadbd  mariadbd             [.] THD::make_string_literal                                                                                                                                                                                       â–’
   0.01%  mariadbd  mariadbd             [.] 0x0000000000d1c573                                                                                                                                                                                             â–’
   0.01%  mariadbd  mariadbd             [.] 0x0000000000d1b13a                                                                                                                                                                                             â–’
   0.01%  mariadbd  [kernel.kallsyms]    [k] native_read_msr                                                                                                                                                                                                â–’
   0.01%  mariadbd  [kernel.kallsyms]    [k] account_user_time                                                                                                                                                                                              â–’
   0.01%  mariadbd  [kernel.kallsyms]    [k] update_curr                                                                                                                                                                                                    â–’
   0.01%  mariadbd  mariadbd             [.] my_convert                                                                                                                                                                                                     â–’
   0.01%  mariadbd  [kernel.kallsyms]    [k] __sysvec_apic_timer_interrupt                            

I suppose I should compile a debug version of MariaDB to show real names instead of "??" or "0x..."?

Comment by martina342 [ 2021-09-14 ]

I compiled a debug version to test, but it imports about 20 times slower; the import will take a week.

Comment by Marko Mäkelä [ 2021-09-15 ]

martina342, I just realized that stillstuckmariadb3 is missing the function parameters as well. This suggests that the separately available "dbgsym" packages were not installed.

I do not know why the non-stripped executable would be missing InnoDB symbols. That is not a new problem, and it seems to affect every system except Microsoft Windows. I suspect that it has something to do with the way the intermediate target libinnobase.a is built and linked into the final executables, but I have not investigated it. The numeric addresses from perf could be due to the same problem.

I think that you will have to install additional packages, with names something like the following:

mariadb-server-10.6-dbgsym_10.6.4+maria~bullseye_amd64.deb
mariadb-server-core-10.6-dbgsym_10.6.4+maria~bullseye_amd64.deb

After installing those, we should see fully resolved stack traces, with function parameter values as well.
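
On a Debian-based system, installation could look something like the following (a sketch, shown as a dry run; the version string is an example and must match the server version actually installed):

```shell
# Sketch (dry run): install the dbgsym packages matching the running server.
# The version string below is an example, not necessarily yours.
ver="10.6.4+maria~bullseye"
for pkg in mariadb-server-10.6-dbgsym mariadb-server-core-10.6-dbgsym; do
  # remove the leading 'echo' to actually install the downloaded .deb files
  echo sudo apt-get install -y "./${pkg}_${ver}_amd64.deb"
done
```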

Note: the debugging information for GDB and similar tools is independent of the additional runtime checks that would be enabled in special debug builds. I do not think that it would be useful to track this problem with a debug build, because the timing characteristics could become completely different.

Comment by Daniel Black [ 2021-09-15 ]

martina342 thanks for helping out marko. I've been getting ready to release the functionality below as a service; it's still really early days, so let me know about any problems.

This uses a build of 10.6, exactly what appears in a release, with the debug symbols installed in the container. It is currently from about 4 days ago (ref: https://quay.io/repository/mariadb-foundation/mariadb-debug?tab=tags).

I usually use podman, but docker works as well, with the same command-line arguments:

Steps:

Copy your cnf file, without datadir, tmpdir, or log_error, to a file in a directory named my.cnf.d.

Run the latest 10.6 version with the same configuration on your current data (or a copy if you want).

run image with debug info

docker run  --cap-add CAP_SYS_PTRACE --security-opt seccomp=unconfined  -v ./my.cnf.d:/etc/mysql/conf.d:z -v /home/mysql:/var/lib/mysql -p 127.0.0.1:3306:3306 -d --name mariadb    quay.io/mariadb-foundation/mariadb-debug:10.6-amd64

Note: seccomp is docker only.

install gdb into container

docker exec -ti mariadb sh -c "apt-get update && apt-get install -y gdb"

When you are at a point and want to take a snapshot:

take snapshot

docker exec -ti mariadb gosu mysql  gdb --batch --eval-command="set print frame-arguments all" --eval-command="thread apply all bt full" --eval-command="print dict_sys.latch" --eval-command="print dict_sys.mutex" -p 1  | tee  mariadb_bt_all_threads.txt

Test this now before applying workload to check all permissions have been relaxed enough to examine workload. Look at logs of the container with:

container logs

docker logs mariadb

Apply workload

To include some metadata like which exact build it was generated from in the same file:

capture metadata

docker exec -ti mariadb cat /manifest.txt | tee  -a mariadb_bt_all_threads.txt

In my test with Marko mariadb_bt_all_threads.txt was enough of a backtrace.
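
For convenience, the steps above can be collected into one place. This is a sketch shown as a dry run: paths and the container name are taken from the commands above and may need adjusting, and Docker requires an absolute host path for -v bind mounts, hence $PWD rather than ./my.cnf.d.

```shell
# Sketch (dry run): the debug-container steps in one script.
# run() only prints each command; remove the 'echo' to execute them.
run() { echo "+ $*"; }

CNF_DIR="$PWD/my.cnf.d"    # Docker bind mounts need an absolute path
DATA_DIR=/home/mysql

run docker run --cap-add CAP_SYS_PTRACE --security-opt seccomp=unconfined \
    -v "$CNF_DIR:/etc/mysql/conf.d:z" -v "$DATA_DIR:/var/lib/mysql" \
    -p 127.0.0.1:3306:3306 -d --name mariadb \
    quay.io/mariadb-foundation/mariadb-debug:10.6-amd64
run docker exec -ti mariadb sh -c 'apt-get update && apt-get install -y gdb'
run docker logs mariadb        # check permissions before applying workload
# ... apply workload, then at the interesting moment take the snapshot:
run docker exec -ti mariadb gosu mysql gdb --batch \
    -ex 'set print frame-arguments all' -ex 'thread apply all bt full' -p 1
run docker exec -ti mariadb cat /manifest.txt    # append build metadata
```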

Comment by martina342 [ 2021-09-15 ]

thanks both of you.

Before your message, Daniel, I had already started over, but with the package mariadb-server-10.6-dbgsym installed, so I let it run until it froze instead of stopping it.

So it looks like it's blocked again:

Id      User    Host    db      Command Time    State   Info    Progress
2717    root    localhost       bdd_principale  Query   3625    Update  INSERT INTO `i***********998    0.000
13685   root    localhost       bdd_principale  Sleep   30              NULL    0.000
13717   root    localhost       NULL    Query   0       starting        show processlist        0.000

I ran this:
gdb -ex "set pagination 0" -ex "set print frame-arguments all" -ex "thread apply all bt full" --batch -p $(pidof mariadbd)

Here is the output:
hopeenoughdata.txt

And this command:
gdb -ex "set pagination 0" -ex "set print frame-arguments all" -ex "print dict_sys.latch" --batch -p $(pidof mariadbd)
[New LWP 2201079]
[New LWP 2201080]
[New LWP 2201092]
[New LWP 2201098]
[New LWP 2201100]
[New LWP 2201104]
[New LWP 2201136]
[New LWP 2202646]
[New LWP 2209615]
[New LWP 2212111]
[New LWP 2228121]
[New LWP 2229924]
[New LWP 2253944]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f2c857913ff in __GI___poll (fds=0x560ff3d83bf8, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
No symbol "dict_sys" in current context.
[Inferior 1 (process 2201077) detached]

And here is the perf record:

  40.05%  mariadbd  mariadbd             [.] 0x0000000000de75e3
  15.19%  mariadbd  mariadbd             [.] 0x0000000000de75e7
   8.74%  mariadbd  mariadbd             [.] MYSQLparse
   1.80%  mariadbd  mariadbd             [.] Lex_input_stream::lex_one_token
   1.71%  mariadbd  mariadbd             [.] 0x0000000000d1c195
   1.46%  mariadbd  mariadbd             [.] Lex_input_stream::get_text
   0.55%  mariadbd  mariadbd             [.] setup_fields
   0.54%  mariadbd  mariadbd             [.] 0x0000000000ddb7d4
   0.53%  mariadbd  mariadbd             [.] my_numchars_mb
   0.48%  mariadbd  mariadbd             [.] fill_record
   0.48%  mariadbd  mariadbd             [.] Query_arena::free_items
   0.45%  mariadbd  libc-2.31.so         [.] __memmove_avx_unaligned_erms
   0.44%  mariadbd  libc-2.31.so         [.] __memcmp_avx2_movbe
   0.42%  mariadbd  mariadbd             [.] alloc_root
   0.38%  mariadbd  mariadbd             [.] 0x0000000000de75ee
   0.37%  mariadbd  mariadbd             [.] 0x0000000000cf8c70
   0.37%  mariadbd  libc-2.31.so         [.] _int_malloc
   0.35%  mariadbd  mariadbd             [.] THD::make_string_literal
   0.34%  mariadbd  mariadbd             [.] 0x0000000000ddb707
   0.33%  mariadbd  mariadbd             [.] 0x0000000000ddb73e
   0.33%  mariadbd  mariadbd             [.] Item::Item
   0.31%  mariadbd  libc-2.31.so         [.] _int_free
   0.30%  mariadbd  mariadbd             [.] 0x0000000000ddb844
   0.28%  mariadbd  mariadbd             [.] get_charset
   0.27%  mariadbd  mariadbd             [.] Lex_input_stream::add_digest_token
   0.27%  mariadbd  mariadbd             [.] 0x0000000000ddbb81
   0.26%  mariadbd  mariadbd             [.] 0x0000000000d1c57b
   0.22%  mariadbd  mariadbd             [.] 0x0000000000ddb784
   0.22%  mariadbd  mariadbd             [.] Lex_input_stream::lex_token
   0.21%  mariadbd  mariadbd             [.] 0x0000000000f24360
   0.21%  mariadbd  mariadbd             [.] 0x0000000000d1c573
   0.21%  mariadbd  mariadbd             [.] mysql_insert
   0.20%  mariadbd  mariadbd             [.] 0x0000000000d1b1ce
   0.18%  mariadbd  mariadbd             [.] 0x0000000000cd9192
   0.18%  mariadbd  mariadbd             [.] 0x0000000000d1b13a
   0.18%  mariadbd  mariadbd             [.] 0x0000000000f2904b
   0.18%  mariadbd  mariadbd             [.] 0x0000000000f29083
   0.18%  mariadbd  mariadbd             [.] 0x0000000000d1b1ef
   0.17%  mariadbd  mariadbd             [.] 0x0000000000f29067
   0.17%  mariadbd  mariadbd             [.] Item::set_name
   0.16%  mariadbd  mariadbd             [.] Item::cleanup
   0.16%  mariadbd  libc-2.31.so         [.] malloc
   0.16%  mariadbd  mariadbd             [.] Field_blob::store
   0.15%  mariadbd  mariadbd             [.] 0x0000000000f24363
   0.14%  mariadbd  mariadbd             [.] 0x0000000000ddb8ff
   0.14%  mariadbd  mariadbd             [.] 0x0000000000d1c259
   0.14%  mariadbd  libc-2.31.so         [.] __memset_avx2_unaligned_erms
   0.14%  mariadbd  mariadbd             [.] Item_string::save_in_field
   0.13%  mariadbd  mariadbd             [.] 0x0000000000ddb8f9
   0.13%  mariadbd  mariadbd             [.] my_convert
   0.13%  mariadbd  mariadbd             [.] 0x0000000000dc55cb
   0.12%  mariadbd  libc-2.31.so         [.] __memmove_avx_unaligned

I will follow the steps for the docker test

edit: I have an error with the first command:

docker run  --cap-add CAP_SYS_PTRACE --security-opt seccomp=unconfined  -v ./my.cnf.d:/etc/mysql/conf.d:z -v /home/mysql:/var/lib/mysql -p 127.0.0.1:3306:3306 -d --name mariadb    quay.io/mariadb-foundation/mariadb-debug:10.6-amd64
Unable to find image 'quay.io/mariadb-foundation/mariadb-debug:10.6-amd64' locally
10.6-amd64: Pulling from mariadb-foundation/mariadb-debug
5297cdb712ed: Pull complete
834628376a35: Pull complete
1a4e5185f03f: Pull complete
Digest: sha256:d86994bf9f22c7bd097445a0d232835ac15e5847acab190f7602501437582fca
Status: Downloaded newer image for quay.io/mariadb-foundation/mariadb-debug:10.6-amd64
docker: Error response from daemon: create ./my.cnf.d: "./my.cnf.d" includes invalid characters for a local volume name, only "[a-zA-Z0-9][a-zA-Z0-9_.-]" are allowed. If you intended to pass a host directory, use absolute path.
See 'docker run --help'.

The folder exists:

ls ./my.cnf.d
my.cnf

Comment by Marko Mäkelä [ 2021-09-15 ]

martina342, it looks like the interesting command thread apply all backtrace was not executed at all. It is also not promising that the debugger did not recognize the symbol dict_sys.latch. Hopefully that was only because the current stack frame was in a shared library and not in the main mariadbd executable. So, that GDB invocation did not get us any further yet.

I do not know whether perf report is supposed to benefit from the dbgsym packages.

If this involves Docker, I understood from earlier discussions with danblack that some extra effort may be needed to make GDB work. He is our Docker expert.

I am sorry that this is so tricky, but for hangs or some forms of performance trouble there really is no substitute for having stack traces of all threads. For some hangs, we would want to see several samples to distinguish genuine deadlocks from mere slowness.
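
Taking several samples could be sketched like this, reusing the gdb invocation already shown in this thread (a dry run; remove the echo and use a longer sleep to run it for real):

```shell
# Sketch (dry run): take a few stack samples some time apart. A thread
# sitting in the same frame in every sample suggests a genuine deadlock
# rather than slow progress.
sample() {
  # remove 'echo' to actually invoke gdb
  echo "gdb --batch -ex 'thread apply all bt full' -p $1 > stacks_$2.txt"
}
pid=$(pidof mariadbd 2>/dev/null || echo '<pid>')
for i in 1 2 3; do
  sample "$pid" "$i"
  # sleep 60   # leave time between samples when running for real
done
```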

Comment by Marko Mäkelä [ 2021-09-15 ]

Sorry, I overlooked the attachment hopeenoughdata.txt again. It seems that we are still missing function call parameter names and values, as well as local variables, for most stack frames. I only see them for some library functions, such as the pthread functions.

There is no occurrence of innobase in the output, but there are ?? function names, some of which must belong to InnoDB threads. One is Thread 9 apparently executing an INSERT statement. One of Thread 4 or Thread 3 should be the buf_flush_page_cleaner(), but I cannot know which one, and I have no idea what the other one could be. Maybe it’d be an Aria checkpointer thread, because there are no Aria symbols either. Thread 13 could be running some InnoDB code. We would really need to know what those threads are doing.

One thing that you might try is to set innodb_stats_persistent=OFF or innodb_stats_auto_recalc=OFF. If this hang is related to the automatic recalculation of persistent statistics (MDEV-15020), it should be fixed in the upcoming 10.6.5 release. But, I’d rather expect Thread 13 to run the InnoDB purge coordinator task than the statistics updates.
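
If you want to try that, a minimal configuration sketch (both are dynamic system variables, so they can also be changed at runtime with SET GLOBAL):

```ini
# Sketch: rule persistent-statistics recalculation (MDEV-15020) in or out
[mariadb]
innodb_stats_persistent = OFF
# or, keeping persistent statistics but disabling automatic recalculation:
# innodb_stats_auto_recalc = OFF
```

Note that a per-table STATS_PERSISTENT option given in CREATE TABLE overrides the global setting.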

Comment by martina342 [ 2021-09-15 ]

I replaced ./my.cnf.d with the full path to fix the error I mentioned.
I also had problems accessing MariaDB in Docker from the server, as only 127.0.0.1 was allowed, so I had to adjust that.

Here is the log file using Docker, without any workload: mariadb_bt_all_threads.txt

Without workload:

docker logs mariadb
2021-09-15 19:56:14+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:10.6.5+maria~focal started.
2021-09-15 19:56:14+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2021-09-15 19:56:14+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:10.6.5+maria~focal started.
Warning: skipping '!includedir /etc/mysql/conf.d/' directive as maximum includerecursion level was reached in file /etc/mysql/conf.d/my.cnf at line 2
Warning: skipping '!includedir /etc/mysql/mariadb.conf.d/' directive as maximum includerecursion level was reached in file /etc/mysql/conf.d/my.cnf at line 3
2021-09-15 19:56:15 0 [Note] mariadbd (server 10.6.5-MariaDB-1:10.6.5+maria~focal as 10.6.5-MariaDB-b873651e3cae854f887c4c4af39206def5992be4) starting as process 1 ...
2021-09-15 19:56:15 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2021-09-15 19:56:15 0 [Note] InnoDB: Number of pools: 1
2021-09-15 19:56:15 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2021-09-15 19:56:15 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2021-09-15 19:56:15 0 [Note] InnoDB: Using Linux native AIO
2021-09-15 19:56:15 0 [Note] InnoDB: Initializing buffer pool, total size = 161061273600, chunk size = 134217728
2021-09-15 19:56:15 0 [Note] InnoDB: Completed initialization of buffer pool
2021-09-15 19:56:15 0 [Note] InnoDB: Opened 2 undo tablespaces
2021-09-15 19:56:15 0 [Note] InnoDB: Opened 2 undo tablespaces
2021-09-15 19:56:15 0 [Note] InnoDB: 128 rollback segments in 2 undo tablespaces are active.
2021-09-15 19:56:15 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2021-09-15 19:56:15 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2021-09-15 19:56:15 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2021-09-15 19:56:15 0 [Note] InnoDB: 10.6.5 started; log sequence number 42906; transaction id 14
2021-09-15 19:56:15 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2021-09-15 19:56:15 0 [Note] Plugin 'FEEDBACK' is disabled.
2021-09-15 19:56:15 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 19:56:15 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 19:56:15 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 19:56:15 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 19:56:15 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 19:56:15 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 19:56:15 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 19:56:15 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 19:56:15 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 19:56:15 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 19:56:15 0 [Warning] You need to use --log-bin to make --expire-logs-days or --binlog-expire-logs-seconds work.
2021-09-15 19:56:15 0 [Note] Server socket created on IP: '0.0.0.0'.
2021-09-15 19:56:15 0 [Note] Server socket created on IP: '::'.
2021-09-15 19:56:15 0 [Note] InnoDB: Buffer pool(s) load completed at 210915 19:56:15
2021-09-15 19:56:15 0 [Warning] 'user' entry '@ns3' ignored in --skip-name-resolve mode.
2021-09-15 19:56:15 0 [Warning] 'proxies_priv' entry '@% root@ns3' ignored in --skip-name-resolve mode.
2021-09-15 19:56:15 0 [Note] mariadbd: ready for connections.
Version: '10.6.5-MariaDB-1:10.6.5+maria~focal' as '10.6.5-MariaDB-b873651e3cae854f887c4c4af39206def5992be4'  socket: '/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution

I'm currently importing data into the container to generate logs under workload, but if the bug is indeed the one fixed in 10.6.5, it should no longer reproduce.

Comment by martina342 [ 2021-09-16 ]

Hello,
The current import from mysqldump with the Docker version of MariaDB (10.6.5-MariaDB-b873651e3cae854f887c4c4af39206def5992be4, mariadb.org binary distribution) seems to work without a complete freeze.
Almost all inserts take less than a second. I have a QPS of 5.8, and I'm running SHOW PROCESSLIST every second.
However, at some point I noticed an insert that took more than 15 minutes, and I took a stack trace during it (I censored some information). The ibd file of the table being inserted into is "only" 332 MB, and not all inserts on this table took this long; even on this table, most took less than a second:
mariadb_bt_all_threads_workload_slowed_down.txt

Here are the MariaDB logs:

2021-09-15 20:09:05+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:10.6.5+maria~focal started.
2021-09-15 20:09:05+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2021-09-15 20:09:05+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:10.6.5+maria~focal started.
Warning: skipping '!includedir /etc/mysql/conf.d/' directive as maximum includerecursion level was reached in file /etc/mysql/conf.d/my.cnf at line 2
Warning: skipping '!includedir /etc/mysql/mariadb.conf.d/' directive as maximum includerecursion level was reached in file /etc/mysql/conf.d/my.cnf at line 3
2021-09-15 20:09:06 0 [Note] mariadbd (server 10.6.5-MariaDB-1:10.6.5+maria~focal as 10.6.5-MariaDB-b873651e3cae854f887c4c4af39206def5992be4) starting as process 1 ...
2021-09-15 20:09:06 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2021-09-15 20:09:06 0 [Note] InnoDB: Number of pools: 1
2021-09-15 20:09:06 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2021-09-15 20:09:06 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2021-09-15 20:09:06 0 [Note] InnoDB: Using Linux native AIO
2021-09-15 20:09:06 0 [Note] InnoDB: Initializing buffer pool, total size = 161061273600, chunk size = 134217728
2021-09-15 20:09:06 0 [Note] InnoDB: Completed initialization of buffer pool
2021-09-15 20:09:06 0 [Note] InnoDB: Opened 2 undo tablespaces
2021-09-15 20:09:06 0 [Note] InnoDB: Opened 2 undo tablespaces
2021-09-15 20:09:06 0 [Note] InnoDB: 128 rollback segments in 2 undo tablespaces are active.
2021-09-15 20:09:06 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2021-09-15 20:09:06 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2021-09-15 20:09:06 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2021-09-15 20:09:06 0 [Note] InnoDB: 10.6.5 started; log sequence number 42918; transaction id 14
2021-09-15 20:09:06 0 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
2021-09-15 20:09:06 0 [Note] Plugin 'FEEDBACK' is disabled.
2021-09-15 20:09:06 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 20:09:06 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 20:09:06 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 20:09:06 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 20:09:06 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 20:09:06 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 20:09:06 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 20:09:06 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 20:09:06 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 20:09:06 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2021-09-15 20:09:06 0 [Warning] You need to use --log-bin to make --expire-logs-days or --binlog-expire-logs-seconds work.
2021-09-15 20:09:06 0 [Note] InnoDB: Buffer pool(s) load completed at 210915 20:09:06
2021-09-15 20:09:06 0 [Note] Server socket created on IP: '0.0.0.0'.
2021-09-15 20:09:06 0 [Note] Server socket created on IP: '::'.
2021-09-15 20:09:06 0 [Warning] 'user' entry '@ns3' ignored in --skip-name-resolve mode.
2021-09-15 20:09:06 0 [Warning] 'proxies_priv' entry '@% root@ns3' ignored in --skip-name-resolve mode.
2021-09-15 20:09:06 0 [Note] mariadbd: ready for connections.
Version: '10.6.5-MariaDB-1:10.6.5+maria~focal' as '10.6.5-MariaDB-b873651e3cae854f887c4c4af39206def5992be4'  socket: '/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution
2021-09-15 20:09:39 3 [Warning] Aborted connection 3 to db: 'unconnected' user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally without authentication)
2021-09-15 20:19:07 21 [Warning] Aborted connection 21 to db: 'unconnected' user: 'unauthenticated' host: '172.17.0.1' (This connection closed normally without authentication)
2021-09-15 20:22:39 24 [Warning] 'user' entry '@ns3' ignored in --skip-name-resolve mode.
2021-09-15 20:22:39 24 [Warning] 'proxies_priv' entry '@% root@ns3' ignored in --skip-name-resolve mode.
2021-09-15 20:23:33 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:23:53 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:23:54 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:24:02 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:24:38 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:24:38 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:24:39 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:24:39 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:25:16 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:25:16 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:25:17 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:25:51 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:26:30 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:26:30 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:26:31 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:26:31 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:27:01 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:31:02 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:31:02 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:31:02 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:31:27 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:34:46 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:34:48 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:35:30 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:35:31 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:35:31 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:36:00 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:36:01 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:36:01 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:36:09 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:42:19 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:42:19 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:42:20 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:42:20 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:42:51 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:42:51 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:42:51 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:42:51 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:43:25 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:45:03 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:45:04 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:48:17 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:48:43 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:48:43 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:48:43 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:48:57 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:49:25 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:49:25 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:49:25 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:49:25 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:49:53 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:49:54 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:49:54 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:49:55 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:50:23 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:52:35 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:52:36 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:53:00 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:55:56 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:55:56 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:56:32 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:56:33 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:56:33 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:56:33 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:57:07 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:57:08 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:57:08 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:57:08 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:58:48 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 20:58:48 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 20:58:49 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 20:58:49 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 20:59:18 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:00:16 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:00:17 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:00:17 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:00:46 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:00:46 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:00:47 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:00:47 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:01:17 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:01:17 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:01:18 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:01:18 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:01:46 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:01:46 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:01:47 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:01:47 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:02:15 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:08:44 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:08:45 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:08:59 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:08:59 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:12:09 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:12:36 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:28:04 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:28:05 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:30:25 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:30:25 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:34:08 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:34:34 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:34:34 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:34:35 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:34:35 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:35:07 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:35:07 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:35:08 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:35:08 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:35:38 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:35:38 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:35:39 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:35:39 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:36:08 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:36:08 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:36:09 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:36:09 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:36:38 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:36:40 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:36:40 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:36:40 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:37:10 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:37:11 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:37:11 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:37:12 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:37:42 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:37:43 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:37:43 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:37:44 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:38:16 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:38:16 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:38:16 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:38:17 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:38:46 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:38:46 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:38:46 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:38:47 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:39:21 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:39:22 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:39:22 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:39:22 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:39:59 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:39:59 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:40:00 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:40:00 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:40:37 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:40:37 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:40:38 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:40:38 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:41:14 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:41:14 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:41:15 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:41:15 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:41:51 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:41:52 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:41:52 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:41:52 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:42:26 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:42:27 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:42:27 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:42:27 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:43:03 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:43:03 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:43:04 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:43:04 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:43:40 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:43:40 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:43:41 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:43:59 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:44:34 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:44:34 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 21:44:35 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:44:35 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:45:13 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:45:13 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:45:13 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 21:45:14 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 23:00:57 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 23:01:01 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 23:01:02 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 23:01:02 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 23:01:41 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 23:35:25 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 23:35:26 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 23:37:30 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 23:38:05 0 [Note] InnoDB: Truncating .//undo001
2021-09-15 23:45:39 0 [Note] InnoDB: Truncated .//undo001
2021-09-15 23:45:40 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 23:45:40 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 23:46:19 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 23:51:57 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 23:51:58 0 [Note] InnoDB: Truncating .//undo001
2021-09-16  0:02:12 0 [Note] InnoDB: Truncated .//undo001

Notice how often it truncates the undo logs; most of the time it is instantaneous, but sometimes it takes much longer:
2021-09-15 21:08:59 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:12:09 0 [Note] InnoDB: Truncated .//undo002
2021-09-15 21:12:36 0 [Note] InnoDB: Truncating .//undo002
2021-09-15 21:28:04 0 [Note] InnoDB: Truncated .//undo002

This matches the moment when I took the stack trace.

So if the behaviour stays the same, the server is only randomly stuck for some minutes instead of blocking completely.

So I think there is still something to check. I guess I need to compile MariaDB myself until the next version is released? Do you have an ETA for it?

Comment by Marko Mäkelä [ 2021-09-16 ]

In mariadb_bt_all_threads_workload_slowed_down.txt I see that undo tablespace truncation is being executed. (Please be aware of MDEV-26450.) There previously was a 10.5 performance regression about that (MDEV-25801), but this is the fixed code.

I also see that the INSERT is waiting for buffer pool LRU eviction. Based on this single snapshot during the workload, I would guess that the data load is being slowed down by ‘eviction flushing’ (writing out least recently used pages so that they can be evicted, to make room for new pages). If it is not possible to make the buffer pool or the redo log bigger, there is nothing that we can do about this. But, let’s wait for the stack trace for the actual hang, to see if there is some other problem. In this stack trace dump, everything still looks normal to me.

How big are the buffer pool and the redo log, and how big would the final table be? Also, what type of storage are you using?

Comment by Marko Mäkelä [ 2021-09-16 ]

martina342, I checked the new attachment and composed my message before you posted the update.

It looks like the problem is related to undo tablespace truncation. I think that we must test its performance as part of fixing MDEV-26450. Until that bug has been fully analyzed and fixed, I would not recommend enabling undo tablespace truncation. It is possible that this correctness bug affects all MariaDB releases since 10.2.

I do not have an estimate when we could fix that, and in my opinion it is not feasible to tweak the performance before the code has been confirmed to be correct. It is covered by a simple kill+restart test in our regression test suite, but that test never fails.

I try to treat any corruption bugs with high priority. This one might miss the next round of point releases. We aim to have releases about 4 times per year.

Comment by martina342 [ 2021-09-16 ]

These are the settings:

wait_timeout 					= 43200
max_allowed_packet				= 64M
query_cache_size 				= 0
query_cache_type 				= 0
innodb_buffer_pool_size 		= 150G
innodb_log_file_size 			= 37G
innodb_buffer_pool_instances	= 64
innodb_undo_log_truncate 		= ON
skip-name-resolve 				= 1
tmp_table_size 					= 256M
max_heap_table_size 			= 256M
max_connections			 		= 400
net_read_timeout 				= 250600
net_write_timeout 				= 250600
join_buffer_size 				= 1MB
sql_mode						= ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
innodb-defragment				= 1
innodb_online_alter_log_max_size= 50000000000
table_definition_cache			= 1000
innodb_undo_tablespaces			= 2
table_open_cache				= 30000
innodb_fatal_semaphore_wait_threshold = 2000
 
[mariadb]
core_file

I can't make innodb_buffer_pool_size larger (it's already 150G); I have innodb_log_file_size at 37G, which was suggested by mysqltuner.

The table that was being inserted into during the dump is only 317 MB.

I'm using only InnoDB tables.

https://jira.mariadb.org/browse/MDEV-26450 : this is what happened to me on version 10.5, when the semaphore wait killed the server and forced me to restore backups. But it also happened on 10.6 during a normal shutdown. So I will disable undo truncation until it's fixed.

Comment by Daniel Black [ 2021-09-16 ]

The slowness in the container may be a result of how Docker handles its storage (overlayfs, as a guess). That is good enough for a few package updates, but the datadir should be in its own volume: "docker volume create mariadbdata", and use that volume with "-v mariadbdata:/var/lib/mysql" when the container is started. This should be a direct binding of a directory, without the complexities of userspace filesystems.
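A minimal sketch of that setup; the image tag and password here are placeholders, not values from this ticket:

```shell
# Sketch only: create a named volume so the datadir bypasses overlayfs,
# then bind it at /var/lib/mysql when starting the container.
docker volume create mariadbdata
docker run -d --name mariadb \
  -v mariadbdata:/var/lib/mysql \
  -e MARIADB_ROOT_PASSWORD=changeme \
  mariadb:10.6
```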

release schedule: https://jira.mariadb.org/projects/MDEV?selectedItem=com.atlassian.jira.jira-projects-plugin:release-page

Comment by Marko Mäkelä [ 2021-09-16 ]

martina342, I see. If both the buffer pool and the data were smaller (so that there wasn’t too much LRU page eviction going on), then I might suggest using a larger redo log file size, so that log checkpoints would be executed less often.

I see that you are not configuring innodb_io_capacity. The default value could be way too small for your storage. (Which type is it? SATA HDD? SSD? NVMe? What is the read and write speed?)

Possibly there is still some room for improvement in our page flushing (MDEV-26055).

In any case, please let me know if the performance is more acceptable when undo tablespace truncation is disabled. That will be helpful when we test the fix of MDEV-26450.

Comment by martina342 [ 2021-09-16 ]

I'm using a software RAID 1 of NVMe drives, which is apparently rated at:
Random Read (4kb) : 430 Kiops
Random Write (4kb) : 40 Kiops
Seq. Read : 3000 Mb/s
Seq. Write : 1900 Mb/s

I can't run a real-world benchmark right now, as I'm still importing the backup.

Thanks for the hint about innodb_io_capacity; apparently I was bottlenecking my server because of that variable.

Globally the performance is acceptable, except when it seems to freeze like it did; that happens rarely, so it is hard to monitor.

Comment by Marko Mäkelä [ 2021-09-16 ]

On 10.6, you could try to adapt your data load for MDEV-515, to avoid unnecessary row-level undo logging when inserting into empty tables. (And insert into each table only from one transaction.)

SET unique_checks=0, foreign_key_checks=0;

This is admittedly a hack, to be compatible with mysqldump files. And you’d better be sure that there are no violations of those checks.
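Spelled out against a hypothetical table (the table and column names below are made up for illustration), the mysqldump-compatible pattern would look roughly like this: disable the checks, load each empty table within a single transaction, then restore the checks.

```sql
-- Hypothetical table; the point is a single transaction inserting into an
-- empty table with the checks disabled, matching what mysqldump emits.
SET unique_checks=0, foreign_key_checks=0;
CREATE TABLE t (id INT PRIMARY KEY, val VARCHAR(10)) ENGINE=InnoDB;
BEGIN;
INSERT INTO t VALUES (1,'a'),(2,'b'),(3,'c');
COMMIT;
SET unique_checks=1, foreign_key_checks=1;
```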

Comment by martina342 [ 2021-09-17 ]

Here is a real-world benchmark using fio:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.25
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][100.0%][r=276MiB/s,w=91.7MiB/s][r=70.6k,w=23.5k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3064140: Fri Sep 17 10:39:57 2021
  read: IOPS=74.9k, BW=293MiB/s (307MB/s)(3070MiB/10487msec)
   bw (  KiB/s): min=278040, max=326000, per=100.00%, avg=305779.20, stdev=15118.69, samples=20
   iops        : min=69510, max=81500, avg=76444.80, stdev=3779.67, samples=20
  write: IOPS=25.0k, BW=97.8MiB/s (103MB/s)(1026MiB/10487msec); 0 zone resets
   bw (  KiB/s): min=91712, max=108600, per=100.00%, avg=102215.60, stdev=5362.53, samples=20
   iops        : min=22928, max=27148, avg=25553.90, stdev=1340.58, samples=20
  cpu          : usr=11.85%, sys=55.60%, ctx=309103, majf=0, minf=134
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64
 
Run status group 0 (all jobs):
   READ: bw=293MiB/s (307MB/s), 293MiB/s-293MiB/s (307MB/s-307MB/s), io=3070MiB (3219MB), run=10487-10487msec
  WRITE: bw=97.8MiB/s (103MB/s), 97.8MiB/s-97.8MiB/s (103MB/s-103MB/s), io=1026MiB (1076MB), run=10487-10487msec
 
Disk stats (read/write):
    md2: ios=779713/260682, merge=0/0, ticks=366080/197908, in_queue=563988, util=99.13%, aggrios=392960/262799, aggrmerge=0/11, aggrticks=185182/181367, aggrin_queue=366550, aggrutil=98.77%
  nvme1n1: ios=388833/262799, merge=0/11, ticks=187345/181869, in_queue=369215, util=98.69%
  nvme0n1: ios=397087/262799, merge=0/11, ticks=183019/180865, in_queue=363885, util=98.77%

So I guess I could set innodb_io_capacity to at least 15000?

Before disabling undo truncation, I saw another request that took 1 hour instead of being instantaneous.

I've disabled undo truncation, but couldn't stay to monitor whether some other requests were taking too long. It finished, though, so I suppose it made things work better.
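As a back-of-envelope check of that guess (the IOPS number is taken from the fio run above; the halving is an assumed safety margin, not official sizing advice):

```shell
# Derive a starting point for innodb_io_capacity from the measured random
# write IOPS, leaving roughly half the device headroom for foreground work.
measured_write_iops=25000   # fio randwrite result from the benchmark above
suggested=$((measured_write_iops / 2))
echo "innodb_io_capacity = $suggested"
```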

Comment by martina342 [ 2021-09-17 ]

DAMN

I've disabled undo truncation at runtime ... and in the my.cnf, but it corrupted the data anyway:

Version: '10.6.5-MariaDB-1:10.6.5+maria~focal' as '10.6.5-MariaDB-b873651e3cae854f887c4c4af39206def5992be4'  socket: '/run/mysqld/mysqld.sock'  port: 3306  mariadb.org binary distribution
2021-09-17 09:09:27 0x7f058b00b700  InnoDB: Assertion failure in file /buildbot/amd64-ubuntu-2004-deb-autobake/build/storage/innobase/trx/trx0purge.cc line 123
InnoDB: Failing assertion: purge_sys.tail.trx_no <= purge_sys.rseg->last_trx_no()
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mariadbd startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
210917  9:09:27 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.6.5-MariaDB-1:10.6.5+maria~focal
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=402
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1016262 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x55df3500f828
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f058b00aca8 thread_stack 0x49000
mariadbd(my_print_stacktrace+0x32)[0x55df33442c32]

Please tell me how to fix that without reimporting everything; my client can't wait any more. It took 1.5 days to reimport the data.

It was working, and once I restarted MariaDB, it crashed.

Apparently it crashed but doesn't want to stop:

docker exec mariadb ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mysql          1  4.6 16.3 165041080 43171172 ?  Ssl  09:09   2:19 mariadbd
mysql         55  0.8  1.5 160747856 4036876 ?   S    09:09   0:23 mariadbd
root         122  0.0  0.0  10896  4248 ?        Ss   09:38   0:00 mysqladmin shutdown
root         269  0.0  0.0   5904  2784 ?        Rs   09:59   0:00 ps aux

and the log ends with this, with nothing else after:

2021-09-17  9:09:53 0 [Note] InnoDB: Buffer pool(s) load completed at 210917  9:09:53
2021-09-17  9:38:44 0 [Note] mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
2021-09-17  9:38:44 0 [Note] InnoDB: FTS optimize thread exiting.

Comment by Marko Mäkelä [ 2021-09-17 ]

martina342, if the undo logs were already corrupted due to MDEV-26450 in some way (you did not restart from scratch after disabling that), then I am sorry, but some (most?) forms of corruption in the InnoDB system and undo tablespaces are not self-healing or easy to fix.

I do not think that we currently have any option to discard and re-create the undo logs. There is only the option innodb_force_recovery=5 that would ignore all undo logs and treat all reads as READ UNCOMMITTED. Technically, it should not be hard to implement such an option, if the user accepts the resulting inconsistency and the potential corruption of secondary indexes for any partially completed row operations. So far, no support customer has requested that.
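For completeness, the option mentioned above is set like any other InnoDB startup parameter. A hedged my.cnf fragment (this ignores all undo logs and makes reads behave as READ UNCOMMITTED, so back up first and remove the setting again after dumping the data):

```
[mariadb]
innodb_force_recovery = 5
```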

Comment by martina342 [ 2021-09-19 ]

Hello
I started over again, this time with undo truncation disabled from the beginning, and I didn't notice any freeze or corruption using my self-compiled 10.6.5.

Comment by Marko Mäkelä [ 2021-09-21 ]

martina342, thank you. Meanwhile, I think I nailed down MDEV-26450.

I think that in 10.5 and 10.6, it should be possible to avoid the buf_flush_list_space() loop in trx_purge_truncate_history() and simply let undo tablespace truncation discard any pages that had not been written before the file was shrunk, thanks to MDEV-15528. We will handle that as part of this ticket, not as part of the correctness issue, which is MDEV-26450.

Comment by Marko Mäkelä [ 2021-09-22 ]

I implemented a fix that avoids writing any pages of the being-truncated undo tablespace. It does not depend on MDEV-15528, but rather on MDEV-25113 and MDEV-15058. A simple 5-minute Sysbench oltp_update_index 8×100,000 rows (on RAM disk) shows the following difference of average throughput, 95th percentile and maximum latency between innodb_undo_tablespaces=0 and innodb_undo_tablespaces=2 innodb_undo_log_truncate=ON:

connections | baseline/tps | baseline 95%/ms | baseline max/ms | truncate/tps | truncate 95%/ms | truncate max/ms
10  |  95285.41 | 0.19 |  77.87 |  97687.20 | 0.18 |  38.54
20  | 157405.77 | 0.18 | 195.21 | 159989.50 | 0.19 | 407.60
40  | 164189.49 | 0.42 | 336.01 | 166069.04 | 0.70 | 333.03
80  | 163296.38 | 1.32 | 773.72 | 161948.80 | 2.07 | 716.47
160 | 146380.27 | 5.00 | 161.18 | 143285.16 | 5.88 | 136.79

Note: Each test run was on the same 10.5-based server code, just changing the 2 configuration parameters. The latency is increasing starting with 40 concurrent connections. However, based on the server error log, the last time the undo logs were truncated was 14 minutes before the end of the test, which must have been while running at 40 concurrent connections. At the start (10 connections), the undo log was being truncated several times per second. So, the latency increase ought to be due to using 2 dedicated undo tablespaces, not due to undo tablespace truncation. The maximum latency is random by nature, but the average throughput and the 95th percentile latency should be reasonably stable.
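For reference, a Sysbench invocation of the shape described above might look like the following. This is a hypothetical reconstruction; the socket path, user, and per-run thread counts are assumptions, not the exact command used in this benchmark.

```shell
# 8 tables x 100,000 rows, 5 minutes per run, varying --threads per run
# (10, 20, 40, 80, 160) between the baseline and truncate configurations.
sysbench oltp_update_index \
  --tables=8 --table-size=100000 \
  --time=300 --threads=40 \
  --mysql-user=root --mysql-socket=/run/mysqld/mysqld.sock \
  run
```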

Comment by Marko Mäkelä [ 2021-09-23 ]

It occurred to me that it is better to not hold any mutex while truncating the file. It suffices to only hold the tablespace latch. That should reduce the latency, by only blocking page allocation operations in the being-truncated tablespace. During the truncation, the undo tablespace should not be accessed at all.

Comment by Marko Mäkelä [ 2021-09-23 ]

I conducted another test on the 10.6-based branch, this time using NVMe instead of RAM disk, and 50GiB redo log and buffer pool. Yesterday, on the 10.5-based branch, I think that the buffer pool was 50GiB.

connections | throughput/tps | 95% latency/ms | max latency/ms
10 | 105850.83 | 0.12 |  19.65
20 | 161065.65 | 0.17 |  23.45
40 | 172238.07 | 0.44 |  38.84
80 | 169323.66 | 0.86 | 108.88

I did not test at 160 connections. We can see that the latency and throughput are much better now, despite the fact that the test runs on real storage instead of RAM disk. Some of it should be explained by the trick to truncate the file while only holding the fil_space_t::latch.

Admittedly, the storage is fast (peak write speed 2300 MB/s, only about 100 MB/s in this benchmark, and the maximum write rate I have seen in my tests is around 800 MB/s). I suppose that it was writing out almost no data pages. During the benchmark, the LSN only grew to 42,006,626,705 (not even to fill the 50 GiB log). So, there should have been no need to create a log checkpoint.

The last undo truncation took place almost 12 minutes before the end of the benchmark. That was while we were serving 20 concurrent connections. I think that to have more chances of undo tablespace truncation, on a write-heavy system, innodb_undo_tablespaces needs to be larger than 2.
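Sketched as configuration (the value 4 below is an illustrative assumption based on the remark above, not a tested recommendation):

```
[mariadb]
innodb_undo_tablespaces = 4
innodb_undo_log_truncate = ON
```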

Comment by Matthias Leich [ 2021-09-23 ]

origin/bb-10.6-MDEV-26450 456479690b9ee38b0b8a912bf4b6217952c5903b 2021-09-23T11:21:51+03:00
which should contain fixes for MDEV-26450 and MDEV-26445 behaved well in RQG testing.

Generated at Thu Feb 08 09:45:22 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.