[MDEV-22370] safe_mutex: Trying to lock uninitialized mutex at /data/src/10.4-bug/sql/rpl_parallel.cc, line 470 upon shutdown during FTWRL Created: 2020-04-26  Updated: 2021-05-14  Resolved: 2020-06-16

Status: Closed
Project: MariaDB Server
Component/s: Locking
Affects Version/s: 10.4, 10.5
Fix Version/s: 10.5.4, 10.4.14

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Sachin Setiya (Inactive)
Resolution: Fixed Votes: 0
Labels: sprint-week-18

Attachments: HTML File error22_all_threads    
Issue Links:
Relates
relates to MDEV-14656 safe_mutex: Trying to lock uninitiali... Open

 Description   

10.5 87a7968c

2020-04-14 10:14:49 0 [Note] /ten1/builds/10.5-87a7968c-deb/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
2020-04-14 10:14:49 0 [Note] Event Scheduler: Purging the queue. 14 events
safe_mutex: Trying to lock uninitialized mutex at /home/elenst/src/10.5/sql/rpl_parallel.cc, line 470
200414 10:14:49 [ERROR] mysqld got signal 6 ;
 
#3  <signal handler called>
#4  0x00007f73a1ad5207 in raise () from /lib64/libc.so.6
#5  0x00007f73a1ad68f8 in abort () from /lib64/libc.so.6
#6  0x0000560ad0f8b4cf in safe_mutex_lock (mp=0x560ad1f58d90 <global_rpl_thread_pool+16>, my_flags=0, file=0x560ad11aaf40 "/home/elenst/src/10.5/sql/rpl_parallel.cc", line=470) at /home/elenst/src/10.5/mysys/thr_mutex.c:244
#7  0x0000560ad05107d1 in inline_mysql_mutex_lock (that=0x560ad1f58d90 <global_rpl_thread_pool+16>, src_file=0x560ad11aaf40 "/home/elenst/src/10.5/sql/rpl_parallel.cc", src_line=470) at /home/elenst/src/10.5/include/mysql/psi/mysql_thread.h:763
#8  0x0000560ad051229d in pool_mark_busy (pool=0x560ad1f58d80 <global_rpl_thread_pool>, thd=0x7f7328000b18) at /home/elenst/src/10.5/sql/rpl_parallel.cc:470
#9  0x0000560ad05126f2 in rpl_pause_for_ftwrl (thd=0x7f7328000b18) at /home/elenst/src/10.5/sql/rpl_parallel.cc:563
#10 0x0000560ad02eb534 in mysql_execute_command (thd=0x7f7328000b18) at /home/elenst/src/10.5/sql/sql_parse.cc:5374
#11 0x0000560ad02f3912 in mysql_parse (thd=0x7f7328000b18, rawbuf=0x7f7328012460 "FLUSH /* QNO 1411 CON_ID 9 */ TABLES WITH READ LOCK", length=51, parser_state=0x7f7384acf520, is_com_multi=false, is_next_command=false) at /home/elenst/src/10.5/sql/sql_parse.cc:7953
#12 0x0000560ad02df2c0 in dispatch_command (command=COM_QUERY, thd=0x7f7328000b18, packet=0x7f7328008809 "FLUSH /* QNO 1411 CON_ID 9 */ TABLES WITH READ LOCK", packet_length=51, is_com_multi=false, is_next_command=false) at /home/elenst/src/10.5/sql/sql_parse.cc:1840
#13 0x0000560ad02dd93c in do_command (thd=0x7f7328000b18) at /home/elenst/src/10.5/sql/sql_parse.cc:1359
#14 0x0000560ad0483c6d in do_handle_one_connection (connect=0x560ad51db8f8, put_in_cache=true) at /home/elenst/src/10.5/sql/sql_connect.cc:1422
#15 0x0000560ad0483986 in handle_one_connection (arg=0x560ad51db8f8) at /home/elenst/src/10.5/sql/sql_connect.cc:1319
#16 0x0000560ad09c173e in pfs_spawn_thread (arg=0x560ad52f7448) at /home/elenst/src/10.5/storage/perfschema/pfs.cc:2201
#17 0x00007f73a310bdd5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f73a1b9cead in clone () from /lib64/libc.so.6

The stack trace above is from an original failure in concurrent tests.

The test case below will produce a failure with the line numbers in rpl_parallel.cc shifted by 1, due to the suggested debug injection.
Theoretically it can fail without the injection, but the chances are slim; with the injection it fails for me every time.

injection

index 9344312fe62..e42e6b2ebb5 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -467,6 +467,7 @@ pool_mark_busy(rpl_parallel_thread_pool *pool, THD *thd)
     So we protect the infrequent operations of FLUSH TABLES WITH READ LOCK and
     pool size changes with this condition wait.
   */
+  DBUG_EXECUTE_IF("sleep_mark_busy",my_sleep(1000000););
   mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
   if (thd)
   {

Test case

--connect (con1,localhost,root,,)
SET DEBUG_DBUG='+d,sleep_mark_busy';
--send
  FLUSH TABLES WITH READ LOCK;
 
--connection default
--source include/restart_mysqld.inc

Reproducible on 10.4, 10.5.
Could not reproduce and didn't observe on 10.3.
I have no information on whether the problem has any effect on non-debug builds.



 Comments   
Comment by Andrei Elkin [ 2020-04-27 ]

sachin.setiya.007 I think we can crack this one quickly. Could you check it once you have a break from the ALTER work?

Comment by Elena Stepanova [ 2020-04-27 ]

In concurrent tests, I am also getting what appears to be a closely related problem. It's harder for me to simulate; maybe the race window is narrower, or I just haven't found the right spot, but I hope you'll take care of it along with the reported one.
In all cases that I looked at, the failure in the description happens in rpl_pause_for_ftwrl/pool_mark_busy, while the one below happens in rpl_unpause_after_ftwrl/pool_mark_not_busy. Both occur on FTWRL during server shutdown.

10.5 2655984e

2020-04-23  8:54:58 0 [Note] /ten1/builds/10.5-2655984e-deb/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
2020-04-23  8:54:58 0 [Note] Event Scheduler: Purging the queue. 0 events
Got error 22 when trying to lock mutex LOCK_rpl_thread_pool at /home/elenst/src/10.5/sql/rpl_parallel.cc, line 500
200423  8:54:58 [ERROR] mysqld got signal 6 ;
 
#3  <signal handler called>
#4  0x00007f43283f2207 in raise () from /lib64/libc.so.6
#5  0x00007f43283f38f8 in abort () from /lib64/libc.so.6
#6  0x000056135e2a6474 in safe_mutex_lock (mp=0x56135f273d90 <global_rpl_thread_pool+16>, my_flags=0, file=0x56135e4c6060 "/home/elenst/src/10.5/sql/rpl_parallel.cc", line=500) at /home/elenst/src/10.5/mysys/thr_mutex.c:297
#7  0x000056135d82b3e3 in inline_mysql_mutex_lock (that=0x56135f273d90 <global_rpl_thread_pool+16>, src_file=0x56135e4c6060 "/home/elenst/src/10.5/sql/rpl_parallel.cc", src_line=500) at /home/elenst/src/10.5/include/mysql/psi/mysql_thread.h:763
#8  0x000056135d82d016 in pool_mark_not_busy (pool=0x56135f273d80 <global_rpl_thread_pool>) at /home/elenst/src/10.5/sql/rpl_parallel.cc:500
#9  0x000056135d82d280 in rpl_unpause_after_ftwrl (thd=0x7f42b0000b18) at /home/elenst/src/10.5/sql/rpl_parallel.cc:537
#10 0x000056135d605bb0 in mysql_execute_command (thd=0x7f42b0000b18) at /home/elenst/src/10.5/sql/sql_parse.cc:5433
#11 0x000056135d60dcda in mysql_parse (thd=0x7f42b0000b18, rawbuf=0x7f42b0012580 "FLUSH /* QNO 3372 CON_ID 10 */ TABLES WITH READ LOCK", length=52, parser_state=0x7f43107bd520, is_com_multi=false, is_next_command=false) at /home/elenst/src/10.5/sql/sql_parse.cc:7957
#12 0x000056135d5f962c in dispatch_command (command=COM_QUERY, thd=0x7f42b0000b18, packet=0x7f42b0008929 "FLUSH /* QNO 3372 CON_ID 10 */ TABLES WITH READ LOCK", packet_length=52, is_com_multi=false, is_next_command=false) at /home/elenst/src/10.5/sql/sql_parse.cc:1840
#13 0x000056135d5f7ca8 in do_command (thd=0x7f42b0000b18) at /home/elenst/src/10.5/sql/sql_parse.cc:1359
#14 0x000056135d79e479 in do_handle_one_connection (connect=0x56136289ed68, put_in_cache=true) at /home/elenst/src/10.5/sql/sql_connect.cc:1422
#15 0x000056135d79e192 in handle_one_connection (arg=0x56136289ed68) at /home/elenst/src/10.5/sql/sql_connect.cc:1319
#16 0x000056135dcdc84c in pfs_spawn_thread (arg=0x561362781be8) at /home/elenst/src/10.5/storage/perfschema/pfs.cc:2201
#17 0x00007f4329a28dd5 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f43284b9ead in clone () from /lib64/libc.so.6

A stack trace of all threads is attached: error22_all_threads

Comment by Sachin Setiya (Inactive) [ 2020-04-28 ]

In versions before 10.4, pool_mark_busy is called only once, but from 10.4 onward the slave shutdown path also calls pool_mark_busy:

(gdb) bt
#0  pool_mark_busy (pool=0x5555576f07e0 <global_rpl_thread_pool>, thd=0x0) at /home/sachin/10.4/server/sql/rpl_parallel.cc:455
#1  0x0000555555f67532 in rpl_parallel_change_thread_count (pool=0x5555576f07e0 <global_rpl_thread_pool>, new_count=0, force=true) at /home/sachin/10.4/server/sql/rpl_parallel.cc:1530
#2  0x0000555555f68e88 in rpl_parallel_thread_pool::destroy (this=0x5555576f07e0 <global_rpl_thread_pool>) at /home/sachin/10.4/server/sql/rpl_parallel.cc:2017
#3  0x0000555555c51eb3 in slave_prepare_for_shutdown () at /home/sachin/10.4/server/sql/slave.cc:1448
#4  0x0000555555c1eb89 in close_connections () at /home/sachin/10.4/server/sql/mysqld.cc:1765
#5  0x0000555555c2785b in mysqld_main (argc=131, argv=0x555557f868e0) at /home/sachin/10.4/server/sql/mysqld.cc:5936
#6  0x0000555555c1b77a in main (argc=7, argv=0x7fffffffb2c8) at /home/sachin/10.4/server/sql/main.cc:25
(gdb)

This commit seems responsible: 5cd21ac20225 (https://jira.mariadb.org/browse/MDEV-20821)

Comment by Sachin Setiya (Inactive) [ 2020-04-28 ]

So rpl_parallel_thread_pool::destroy destroys the mutex LOCK_rpl_thread_pool while FTWRL is still waiting to lock it.

Comment by Sachin Setiya (Inactive) [ 2020-04-28 ]

This test case also fails:

--source include/master-slave.inc
 
--connection master
 
--connection slave
stop slave;
SET GLOBAL slave_parallel_threads=10;
set global slave_parallel_mode=optimistic;
change master to master_use_gtid=slave_pos;
start slave;
--connect (con1,127.0.0.1,root,,test,$SLAVE_MYPORT)
SET DEBUG_DBUG='+d,sleep_mark_busy';
--send
  FLUSH TABLES WITH READ LOCK;
 
--connection slave
--source include/rpl_restart_server.inc
 
--connection slave
stop slave;
SET GLOBAL slave_parallel_threads=0;
set global slave_parallel_mode=conservative;
start slave;
--disconnect con1
--source include/rpl_end.inc

Comment by Sachin Setiya (Inactive) [ 2020-04-28 ]

So this patch will not work, because it still fails with the earlier test case:

diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
index 9344312fe62..1dea1caae7e 100644
--- a/sql/rpl_parallel.cc
+++ b/sql/rpl_parallel.cc
@@ -467,6 +467,7 @@ pool_mark_busy(rpl_parallel_thread_pool *pool, THD *thd)
     So we protect the infrequent operations of FLUSH TABLES WITH READ LOCK and
     pool size changes with this condition wait.
   */
+  DBUG_EXECUTE_IF("sleep_mark_busy",my_sleep(1000000););
   mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
   if (thd)
   {
@@ -511,6 +512,8 @@ rpl_unpause_after_ftwrl(THD *thd)
   uint32 i;
   rpl_parallel_thread_pool *pool= &global_rpl_thread_pool;
   DBUG_ENTER("rpl_unpause_after_ftwrl");
+  if (!pool->count)
+    DBUG_VOID_RETURN;
 
   DBUG_ASSERT(pool->busy);
 
@@ -552,6 +555,8 @@ rpl_pause_for_ftwrl(THD *thd)
   int err;
   DBUG_ENTER("rpl_pause_for_ftwrl");
 
+  if (!pool->count)
+    DBUG_RETURN(0);
   /*
     While the count_pending_pause_for_ftwrl counter is non-zero, the pool
     cannot be shutdown/resized, so threads are guaranteed to not disappear.

Comment by Sachin Setiya (Inactive) [ 2020-04-28 ]

Patch branch bb-10.4-sachin

Comment by Andrei Elkin [ 2020-06-15 ]

A couple of cosmetic comments were left on GH; otherwise the patch looks nice.
Thanks!

Generated at Thu Feb 08 09:14:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.