[MDEV-30799] Replication stuck with "Waiting for room in worker thread event queue"
Created: 2023-03-07  Updated: 2023-10-11  Resolved: 2023-10-11

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.3.36
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Susmeet Khaire
Assignee: Andrei Elkin
Resolution: Duplicate
Votes: 2
Labels: CS0522727


 Description   

Replication keeps getting stuck randomly.

All the information is in the zip file.

To resume replication, STOP SLAVE followed by START SLAVE is needed.



 Comments   
Comment by Daniel Black [ 2023-03-07 ]

How long has it been in this state?

What is in SHOW PROCESSLIST on the replica?

What does SHOW SLAVE STATUS show?

What event(s) are being processed?

Can you obtain a backtrace with debuginfo packages installed? See https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#getting-backtraces-from-a-running-mysqld-process-with-gdb-on-linux

Comment by Susmeet Khaire [ 2023-03-07 ]

I have attached all the information in the zip file. [^Logs_x01giwdb7a.zip]

Comment by Andrei Elkin [ 2023-03-08 ]

susmeet.khaire, in the latest logs
there are a number of threads waiting for an MDL lock. There is most likely nothing replication-specific about it. What needs identifying
is which MDL lock this is and who holds it.
The waiters include a replication worker thread, and its wait is what leaves the other replication threads in their waiting state:

Thread 10 (Thread 0x7f9490178700 (LWP 28410)):
#0  0x00007f94975abde2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000055bf114665e2 in inline_mysql_cond_timedwait (src_file=0x55bf11b68620 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.3.37/sql/mdl.cc", src_line=1094, abstime=0x7f94901770b0, mutex=0x7f8058000a10, that=0x7f8058000a40) at /usr/src/debug/MariaDB-10.3.37/src_0/include/mysql/psi/mysql_thread.h:1222
        result = <optimized out>
#2  MDL_wait::timed_wait (this=this@entry=0x7f8058000a10, owner=0x7f80580009a0, abs_timeout=abs_timeout@entry=0x7f94901770b0, set_status_on_timeout=set_status_on_timeout@entry=false, wait_state_name=<optimized out>) at /usr/src/debug/MariaDB-10.3.37/src_0/sql/mdl.cc:1094
        old_stage = {m_key = 0, m_name = 0x55bf11b4af26 "Opening tables", m_flags = 1476397584}
        result = <optimized out>
        wait_result = 0
#3  0x000055bf11467ba1 in MDL_context::acquire_lock (this=this@entry=0x7f8058000a10, mdl_request=mdl_request@entry=0x7f80589dc788, lock_wait_timeout=<optimized out>) at /usr/src/debug/MariaDB-10.3.37/src_0/sql/mdl.cc:2148
...
#13 0x000055bf114a562b in handle_rpl_parallel_thread (arg=<optimized out>) at /usr/src/debug/MariaDB-10.3.37/src_0/sql/rpl_parallel.cc:1335

as well as normal user threads:

Thread 9 (Thread 0x7f94800c6700 (LWP 29014)):
#0  0x00007f94975abde2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000055bf114665e2 in inline_mysql_cond_timedwait (src_file=0x55bf11b68620 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.3.37/sql/mdl.cc", src_line=1094, abstime=0x7f94800c1640, mutex=0x7f7f40000af0, that=0x7f7f40000b20) at /usr/src/debug/MariaDB-10.3.37/src_0/include/mysql/psi/mysql_thread.h:1222
        result = <optimized out>
#2  MDL_wait::timed_wait (this=this@entry=0x7f7f40000af0, owner=0x7f7f40000a80, abs_timeout=abs_timeout@entry=0x7f948
 
 
Thread 8 (Thread 0x7f80fc774700 (LWP 17426)):
#0  0x00007f94975abde2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x000055bf114665e2 in inline_mysql_cond_timedwait (src_file=0x55bf11b68620 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.3.37/sql/mdl.cc", src_line=1094, abstime=0x7f80fc76f640, mutex=0x7f7eec000af0, that=0x7f7eec000b20) at /usr/src/debug/MariaDB-10.3.37/src_0/include/mysql/psi/mysql_thread.h:1222
        result = <optimized out>
#2  MDL_wait::timed_wait (this=this@entry=0x7f7eec000af0, owner=0x7f7eec000a80, abs_timeout=abs_timeout@entry=0x7f80fc76f640, set_status_on_timeout=set_status_on_timeout@entry=false, wait_state_name=<optimized out>) at /usr/src/debug/MariaDB-10.3.37/src_0/sql/mdl.cc:1094
 
Thread 5 (Thread 0x7f810408c700 (LWP 368)):
#0  0x00007f94975abde2 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
Thread 3 (Thread 0x7f94801ce700 (LWP 30162)):
Thread 2 (Thread 0x7f80fc62a700 (LWP 10272)):
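
The triage above (finding which threads are blocked in MDL_wait::timed_wait and whether any of them is a parallel replication worker) can be sketched as a small script over saved gdb `thread apply all bt` output. This is only an illustration, not something used in this ticket: the function names `parse_threads` and `mdl_waiters` are made up here, and the classification relies on `handle_rpl_parallel_thread` being visible in the stack, so a truncated backtrace can only be reported as "other or unknown".

```python
import re

# Thread header as printed by gdb, e.g. "Thread 10 (Thread 0x7f94... (LWP 28410)):"
THREAD_RE = re.compile(r"^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP (\d+)\)\):")
# Frame line, e.g. "#2  MDL_wait::timed_wait (...)" or "#13 0x... in func (...)"
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?([^\s(]+)")


def parse_threads(backtrace_text):
    """Return {thread_number: {"lwp": int, "frames": [function names]}}."""
    threads = {}
    current = None
    for line in backtrace_text.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            threads[current] = {"lwp": int(header.group(2)), "frames": []}
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            threads[current]["frames"].append(frame.group(2))
    return threads


def mdl_waiters(threads):
    """List (thread, lwp, kind) for every thread blocked in MDL_wait::timed_wait."""
    result = []
    for number, info in sorted(threads.items()):
        if "MDL_wait::timed_wait" not in info["frames"]:
            continue
        # A worker is recognised only if its entry function is in the stack;
        # a truncated stack therefore falls into "other or unknown".
        is_worker = "handle_rpl_parallel_thread" in info["frames"]
        kind = "replication worker" if is_worker else "other or unknown"
        result.append((number, info["lwp"], kind))
    return result
```

With the backtrace saved to a file (the name `backtrace.txt` is hypothetical), `mdl_waiters(parse_threads(open("backtrace.txt").read()))` would list the waiting threads. It says nothing about the lock holder, which still has to be found separately.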

Comment by Andrei Elkin [ 2023-03-08 ]

susmeet.khaire, the presence of user threads hanging

405701	tmtb_tableau_iwork	10.68.22.59:58046	iwork	Query	11800	Waiting for table metadata lock	SELECT TABLE_NAME

the same way as one of the replication workers strongly suggests looking for the MDL lock holder. The holder cannot be a replication thread.

Comment by Andrei Elkin [ 2023-03-13 ]

Most probably this ticket duplicates MDEV-29621, where I followed up with metadata lock info that suggests a deadlock.
