Thank you for an excellent report, with a useful mariadbd_full_bt_all_threads.txt
to start with. Many threads are waiting for dict_sys.latch, or the caller of a syscall() was invoke with this=dict_sys+72. If I ignore those, I should hopefully find the thread that is blocked while holding an exclusive dict_sys.latch. The first interesting thread is this one:
10.6 07494006dd0887ebfb31564a8fd4c59cf1b299e9
|
Thread 236 (Thread 0x7fea195a2700 (LWP 216099)):
|
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
…
|
#5 srw_lock_impl<true>::wr_lock (line=1435, file=0x561122d9c240 "/home/buildbot/amd64-ubuntu-2004-deb-autobake/build/storage/innobase/trx/trx0undo.cc", this=0x561123344bc8 <trx_sys+16648>) at ./storage/innobase/include/srw_lock.h:528
|
No locals.
|
#6 trx_undo_assign_low<false> (trx=trx@entry=0x7feb52822580, rseg=rseg@entry=0x561123344bc0 <trx_sys+16640>, undo=undo@entry=0x7feb52822ed0, mtr=mtr@entry=0x7fea1959e790, err=err@entry=0x7fea1959e754) at ./storage/innobase/trx/trx0undo.cc:1435
|
block = <optimized out>
|
This is executing an INSERT, so it can’t be holding an exclusive dict_sys.latch. It is waiting for an exclusive rseg->latch so that it can assign the first undo log page for the transaction. We have also Thread 234 and Thread 230 waiting for the same rseg->latch == trx_sys+16648 at the same spot. This should still be normal; the interesting threads would be ones that are waiting for something else while holding dict_sys.latch or the rseg->latch.
In the end, I filtered out these less interesting waits from the output of
grep -B1 -A3 -w syscall mariadbd_full_bt_all_threads.txt
|
using replace-regexp in GNU Emacs and got the following:
10.6 07494006dd0887ebfb31564a8fd4c59cf1b299e9
|
Thread 196 (Thread 0x7fea185a0700 (LWP 215877)):
|
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
No locals.
|
#1 0x00005611229ac0d8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7feb2c090398) at ./storage/innobase/sync/srw_lock.cc:238
|
No locals.
|
--
|
Thread 159 (Thread 0x7fea402d9700 (LWP 1624)):
|
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
No locals.
|
#1 0x00005611229ac028 in srw_mutex_impl<false>::wait (lk=<optimized out>, this=0x7fea00030470) at ./storage/innobase/sync/srw_lock.cc:238
|
No locals.
|
--
|
Thread 149 (Thread 0x7feb52817700 (LWP 57)):
|
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
No locals.
|
#1 0x0000561122aa3b6d in my_getevents (min_nr=1, nr=256, ev=0x7feb52814d60, ctx=<optimized out>) at ./tpool/aio_linux.cc:105
|
saved_errno = <optimized out>
|
--
|
Thread 44 (Thread 0x7fea19188700 (LWP 216113)):
|
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
No locals.
|
#1 0x00005611229ac0d8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7fea9c096658) at ./storage/innobase/sync/srw_lock.cc:238
|
No locals.
|
--
|
We can ignore Thread 149, which is invoking the io_getevents system call; this image does not use the newer io_uring (MDEV-24883). The other three threads are waiting inside btr_cur_t::search_leaf(). This could be a case that was missed in MDEV-29835. The ticket MDEV-31815 had been filed, but because the test case involved ROW_FORMAT=COMPRESSED tables, it hit other bugs too easily.
I will check the mtr_t::m_memo in all threads deeper, to check if this is related to MDEV-31815.
Thank you for an excellent report, with a useful mariadbd_full_bt_all_threads.txt
to start with. Many threads are waiting for dict_sys.latch, or the caller of a syscall() was invoke with this=dict_sys+72. If I ignore those, I should hopefully find the thread that is blocked while holding an exclusive dict_sys.latch. The first interesting thread is this one:
10.6 07494006dd0887ebfb31564a8fd4c59cf1b299e9
Thread 236 (Thread 0x7fea195a2700 (LWP 216099)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
…
#5 srw_lock_impl<true>::wr_lock (line=1435, file=0x561122d9c240 "/home/buildbot/amd64-ubuntu-2004-deb-autobake/build/storage/innobase/trx/trx0undo.cc", this=0x561123344bc8 <trx_sys+16648>) at ./storage/innobase/include/srw_lock.h:528
No locals.
#6 trx_undo_assign_low<false> (trx=trx@entry=0x7feb52822580, rseg=rseg@entry=0x561123344bc0 <trx_sys+16640>, undo=undo@entry=0x7feb52822ed0, mtr=mtr@entry=0x7fea1959e790, err=err@entry=0x7fea1959e754) at ./storage/innobase/trx/trx0undo.cc:1435
block = <optimized out>
This is executing an INSERT, so it can’t be holding an exclusive dict_sys.latch. It is waiting for an exclusive rseg->latch so that it can assign the first undo log page for the transaction. We have also Thread 234 and Thread 230 waiting for the same rseg->latch == trx_sys+16648 at the same spot. This should still be normal; the interesting threads would be ones that are waiting for something else while holding dict_sys.latch or the rseg->latch.
In the end, I filtered out these less interesting waits from the output of
grep -B1 -A3 -w syscall mariadbd_full_bt_all_threads.txt
using replace-regexp in GNU Emacs and got the following:
10.6 07494006dd0887ebfb31564a8fd4c59cf1b299e9
Thread 196 (Thread 0x7fea185a0700 (LWP 215877)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1 0x00005611229ac0d8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7feb2c090398) at ./storage/innobase/sync/srw_lock.cc:238
No locals.
--
Thread 159 (Thread 0x7fea402d9700 (LWP 1624)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1 0x00005611229ac028 in srw_mutex_impl<false>::wait (lk=<optimized out>, this=0x7fea00030470) at ./storage/innobase/sync/srw_lock.cc:238
No locals.
--
Thread 149 (Thread 0x7feb52817700 (LWP 57)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1 0x0000561122aa3b6d in my_getevents (min_nr=1, nr=256, ev=0x7feb52814d60, ctx=<optimized out>) at ./tpool/aio_linux.cc:105
saved_errno = <optimized out>
--
Thread 44 (Thread 0x7fea19188700 (LWP 216113)):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1 0x00005611229ac0d8 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7fea9c096658) at ./storage/innobase/sync/srw_lock.cc:238
No locals.
--
We can ignore Thread 149, which is invoking the io_getevents system call; this image does not use the newer io_uring (
MDEV-24883). The other three threads are waiting inside btr_cur_t::search_leaf(). This could be a case that was missed inMDEV-29835. The ticket MDEV-31815 had been filed, but because the test case involved ROW_FORMAT=COMPRESSED tables, it hit other bugs too easily.I will check the mtr_t::m_memo in all threads deeper, to check if this is related to MDEV-31815.