MariaDB Server / MDEV-22370

safe_mutex: Trying to lock uninitialized mutex at /data/src/10.4-bug/sql/rpl_parallel.cc, line 470 upon shutdown during FTWRL

Details

    Description

      10.5 87a7968c

      2020-04-14 10:14:49 0 [Note] /ten1/builds/10.5-87a7968c-deb/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
      2020-04-14 10:14:49 0 [Note] Event Scheduler: Purging the queue. 14 events
      safe_mutex: Trying to lock uninitialized mutex at /home/elenst/src/10.5/sql/rpl_parallel.cc, line 470
      200414 10:14:49 [ERROR] mysqld got signal 6 ;
       
      #3  <signal handler called>
      #4  0x00007f73a1ad5207 in raise () from /lib64/libc.so.6
      #5  0x00007f73a1ad68f8 in abort () from /lib64/libc.so.6
      #6  0x0000560ad0f8b4cf in safe_mutex_lock (mp=0x560ad1f58d90 <global_rpl_thread_pool+16>, my_flags=0, file=0x560ad11aaf40 "/home/elenst/src/10.5/sql/rpl_parallel.cc", line=470) at /home/elenst/src/10.5/mysys/thr_mutex.c:244
      #7  0x0000560ad05107d1 in inline_mysql_mutex_lock (that=0x560ad1f58d90 <global_rpl_thread_pool+16>, src_file=0x560ad11aaf40 "/home/elenst/src/10.5/sql/rpl_parallel.cc", src_line=470) at /home/elenst/src/10.5/include/mysql/psi/mysql_thread.h:763
      #8  0x0000560ad051229d in pool_mark_busy (pool=0x560ad1f58d80 <global_rpl_thread_pool>, thd=0x7f7328000b18) at /home/elenst/src/10.5/sql/rpl_parallel.cc:470
      #9  0x0000560ad05126f2 in rpl_pause_for_ftwrl (thd=0x7f7328000b18) at /home/elenst/src/10.5/sql/rpl_parallel.cc:563
      #10 0x0000560ad02eb534 in mysql_execute_command (thd=0x7f7328000b18) at /home/elenst/src/10.5/sql/sql_parse.cc:5374
      #11 0x0000560ad02f3912 in mysql_parse (thd=0x7f7328000b18, rawbuf=0x7f7328012460 "FLUSH /* QNO 1411 CON_ID 9 */ TABLES WITH READ LOCK", length=51, parser_state=0x7f7384acf520, is_com_multi=false, is_next_command=false) at /home/elenst/src/10.5/sql/sql_parse.cc:7953
      #12 0x0000560ad02df2c0 in dispatch_command (command=COM_QUERY, thd=0x7f7328000b18, packet=0x7f7328008809 "FLUSH /* QNO 1411 CON_ID 9 */ TABLES WITH READ LOCK", packet_length=51, is_com_multi=false, is_next_command=false) at /home/elenst/src/10.5/sql/sql_parse.cc:1840
      #13 0x0000560ad02dd93c in do_command (thd=0x7f7328000b18) at /home/elenst/src/10.5/sql/sql_parse.cc:1359
      #14 0x0000560ad0483c6d in do_handle_one_connection (connect=0x560ad51db8f8, put_in_cache=true) at /home/elenst/src/10.5/sql/sql_connect.cc:1422
      #15 0x0000560ad0483986 in handle_one_connection (arg=0x560ad51db8f8) at /home/elenst/src/10.5/sql/sql_connect.cc:1319
      #16 0x0000560ad09c173e in pfs_spawn_thread (arg=0x560ad52f7448) at /home/elenst/src/10.5/storage/perfschema/pfs.cc:2201
      #17 0x00007f73a310bdd5 in start_thread () from /lib64/libpthread.so.0
      #18 0x00007f73a1b9cead in clone () from /lib64/libc.so.6
      

      The stack trace above is from an original failure in concurrent tests.

      The test case below will produce a failure with line numbers in rpl_parallel.cc shifted by 1, due to the suggested debug injection.
      In theory it can also fail without the injection, but the chances are slim; with the injection it fails for me every time.

      Injection

      diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
      index 9344312fe62..e42e6b2ebb5 100644
      --- a/sql/rpl_parallel.cc
      +++ b/sql/rpl_parallel.cc
      @@ -467,6 +467,7 @@ pool_mark_busy(rpl_parallel_thread_pool *pool, THD *thd)
           So we protect the infrequent operations of FLUSH TABLES WITH READ LOCK and
           pool size changes with this condition wait.
         */
      +  DBUG_EXECUTE_IF("sleep_mark_busy",my_sleep(1000000););
         mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
         if (thd)
         {
      

      Test case

      --connect (con1,localhost,root,,)
      SET DEBUG_DBUG='+d,sleep_mark_busy';
      --send
        FLUSH TABLES WITH READ LOCK;
       
      --connection default
      --source include/restart_mysqld.inc
      

      Reproducible on 10.4 and 10.5.
      Could not reproduce and didn't observe it on 10.3.
      I have no information on whether the problem has any effect on non-debug builds.
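
      As a minimal standalone sketch of the failure mechanism (illustrative only, not MariaDB code): locking a pthread mutex after it has been destroyed is undefined behaviour, and under glibc it typically fails with EINVAL (error 22); MariaDB's safe_mutex wrapper escalates such failures into the aborts shown in this report.

      /*
        Sketch: lock-after-destroy, the pattern behind the crash.
        Build with: g++ -pthread lock_after_destroy.cc
      */
      #include <pthread.h>
      #include <stdio.h>
      #include <string.h>

      int main()
      {
        pthread_mutex_t mtx;
        pthread_mutex_init(&mtx, NULL);
        pthread_mutex_destroy(&mtx);        /* shutdown path destroys the mutex */

        int err= pthread_mutex_lock(&mtx);  /* FTWRL path then tries to lock it */
        if (err)
          fprintf(stderr, "lock failed: %d (%s)\n", err, strerror(err));
        return 0;
      }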

      Attachments

        Issue Links

          Activity

            Elkin Andrei Elkin added a comment -

            sachin.setiya.007 I think we can crack this one quickly. Could you check it once you have a break from ALTER?

            elenst Elena Stepanova added a comment - - edited

            In concurrent tests, I am also getting what appears to be a closely related problem. It's harder for me to simulate; maybe the race window is narrower, or I just don't see the right spot, but I hope you'll take care of it along with the reported one.
            In all cases that I looked at, the one in the description happens in rpl_pause_for_ftwrl/pool_mark_busy, while the one below happens in rpl_unpause_after_ftwrl/pool_mark_not_busy. Both occur on FTWRL during server shutdown.

            10.5 2655984e

            2020-04-23  8:54:58 0 [Note] /ten1/builds/10.5-2655984e-deb/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
            2020-04-23  8:54:58 0 [Note] Event Scheduler: Purging the queue. 0 events
            Got error 22 when trying to lock mutex LOCK_rpl_thread_pool at /home/elenst/src/10.5/sql/rpl_parallel.cc, line 500
            200423  8:54:58 [ERROR] mysqld got signal 6 ;
             
            #3  <signal handler called>
            #4  0x00007f43283f2207 in raise () from /lib64/libc.so.6
            #5  0x00007f43283f38f8 in abort () from /lib64/libc.so.6
            #6  0x000056135e2a6474 in safe_mutex_lock (mp=0x56135f273d90 <global_rpl_thread_pool+16>, my_flags=0, file=0x56135e4c6060 "/home/elenst/src/10.5/sql/rpl_parallel.cc", line=500) at /home/elenst/src/10.5/mysys/thr_mutex.c:297
            #7  0x000056135d82b3e3 in inline_mysql_mutex_lock (that=0x56135f273d90 <global_rpl_thread_pool+16>, src_file=0x56135e4c6060 "/home/elenst/src/10.5/sql/rpl_parallel.cc", src_line=500) at /home/elenst/src/10.5/include/mysql/psi/mysql_thread.h:763
            #8  0x000056135d82d016 in pool_mark_not_busy (pool=0x56135f273d80 <global_rpl_thread_pool>) at /home/elenst/src/10.5/sql/rpl_parallel.cc:500
            #9  0x000056135d82d280 in rpl_unpause_after_ftwrl (thd=0x7f42b0000b18) at /home/elenst/src/10.5/sql/rpl_parallel.cc:537
            #10 0x000056135d605bb0 in mysql_execute_command (thd=0x7f42b0000b18) at /home/elenst/src/10.5/sql/sql_parse.cc:5433
            #11 0x000056135d60dcda in mysql_parse (thd=0x7f42b0000b18, rawbuf=0x7f42b0012580 "FLUSH /* QNO 3372 CON_ID 10 */ TABLES WITH READ LOCK", length=52, parser_state=0x7f43107bd520, is_com_multi=false, is_next_command=false) at /home/elenst/src/10.5/sql/sql_parse.cc:7957
            #12 0x000056135d5f962c in dispatch_command (command=COM_QUERY, thd=0x7f42b0000b18, packet=0x7f42b0008929 "FLUSH /* QNO 3372 CON_ID 10 */ TABLES WITH READ LOCK", packet_length=52, is_com_multi=false, is_next_command=false) at /home/elenst/src/10.5/sql/sql_parse.cc:1840
            #13 0x000056135d5f7ca8 in do_command (thd=0x7f42b0000b18) at /home/elenst/src/10.5/sql/sql_parse.cc:1359
            #14 0x000056135d79e479 in do_handle_one_connection (connect=0x56136289ed68, put_in_cache=true) at /home/elenst/src/10.5/sql/sql_connect.cc:1422
            #15 0x000056135d79e192 in handle_one_connection (arg=0x56136289ed68) at /home/elenst/src/10.5/sql/sql_connect.cc:1319
            #16 0x000056135dcdc84c in pfs_spawn_thread (arg=0x561362781be8) at /home/elenst/src/10.5/storage/perfschema/pfs.cc:2201
            #17 0x00007f4329a28dd5 in start_thread () from /lib64/libpthread.so.0
            #18 0x00007f43284b9ead in clone () from /lib64/libc.so.6
            

            All threads stack trace is attached: error22_all_threads


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            In versions before 10.4, pool_mark_busy is called only once, but from 10.4 onwards the slave shutdown path also calls pool_mark_busy:

            (gdb) bt
            #0  pool_mark_busy (pool=0x5555576f07e0 <global_rpl_thread_pool>, thd=0x0) at /home/sachin/10.4/server/sql/rpl_parallel.cc:455
            #1  0x0000555555f67532 in rpl_parallel_change_thread_count (pool=0x5555576f07e0 <global_rpl_thread_pool>, new_count=0, force=true) at /home/sachin/10.4/server/sql/rpl_parallel.cc:1530
            #2  0x0000555555f68e88 in rpl_parallel_thread_pool::destroy (this=0x5555576f07e0 <global_rpl_thread_pool>) at /home/sachin/10.4/server/sql/rpl_parallel.cc:2017
            #3  0x0000555555c51eb3 in slave_prepare_for_shutdown () at /home/sachin/10.4/server/sql/slave.cc:1448
            #4  0x0000555555c1eb89 in close_connections () at /home/sachin/10.4/server/sql/mysqld.cc:1765
            #5  0x0000555555c2785b in mysqld_main (argc=131, argv=0x555557f868e0) at /home/sachin/10.4/server/sql/mysqld.cc:5936
            #6  0x0000555555c1b77a in main (argc=7, argv=0x7fffffffb2c8) at /home/sachin/10.4/server/sql/main.cc:25
            (gdb)
            

            The commit that seems responsible is 5cd21ac20225 (MDEV-20821, https://jira.mariadb.org/browse/MDEV-20821).


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            So rpl_parallel_thread_pool::destroy destroys the mutex LOCK_rpl_thread_pool while FTWRL is still waiting to lock it.
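
            A hedged sketch of that ordering in plain pthreads (the names, the sleep, and the thread bodies are illustrative stand-ins for pool_mark_busy and rpl_parallel_thread_pool::destroy, not the server's actual code):

            #include <pthread.h>
            #include <unistd.h>
            #include <stdio.h>
            #include <string.h>

            /* stands in for global_rpl_thread_pool's LOCK_rpl_thread_pool */
            static pthread_mutex_t LOCK_rpl_thread_pool;

            static void *ftwrl_thread(void *arg)
            {
              sleep(1);  /* models the sleep_mark_busy debug injection */
              /* models pool_mark_busy() finally taking the pool mutex */
              int err= pthread_mutex_lock(&LOCK_rpl_thread_pool);
              if (err)
                fprintf(stderr, "pool_mark_busy: lock failed: %d (%s)\n",
                        err, strerror(err));
              return NULL;
            }

            static void *shutdown_thread(void *arg)
            {
              /* models rpl_parallel_thread_pool::destroy() run from
                 slave_prepare_for_shutdown() during server shutdown */
              pthread_mutex_destroy(&LOCK_rpl_thread_pool);
              return NULL;
            }

            int main()
            {
              pthread_t t1, t2;
              pthread_mutex_init(&LOCK_rpl_thread_pool, NULL);
              pthread_create(&t1, NULL, ftwrl_thread, NULL);
              pthread_create(&t2, NULL, shutdown_thread, NULL);
              pthread_join(t1, NULL);
              pthread_join(t2, NULL);
              return 0;
            }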


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            This also fails:

            --source include/master-slave.inc
             
            --connection master
             
            --connection slave
            stop slave;
            SET GLOBAL slave_parallel_threads=10;
            set global slave_parallel_mode=optimistic;
            change master to master_use_gtid=slave_pos;
            start slave;
            --connect (con1,127.0.0.1,root,,test,$SLAVE_MYPORT)
            SET DEBUG_DBUG='+d,sleep_mark_busy';
            --send
              FLUSH TABLES WITH READ LOCK;
             
            --connection slave
            --source include/rpl_restart_server.inc
             
            --connection slave
            stop slave;
            SET GLOBAL slave_parallel_threads=0;
            set global slave_parallel_mode=conservative;
            start slave;
            --disconnect con1
            --source include/rpl_end.inc
            
            


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            So this patch will not work, because it still fails with the earlier test case:

            diff --git a/sql/rpl_parallel.cc b/sql/rpl_parallel.cc
            index 9344312fe62..1dea1caae7e 100644
            --- a/sql/rpl_parallel.cc
            +++ b/sql/rpl_parallel.cc
            @@ -467,6 +467,7 @@ pool_mark_busy(rpl_parallel_thread_pool *pool, THD *thd)
                 So we protect the infrequent operations of FLUSH TABLES WITH READ LOCK and
                 pool size changes with this condition wait.
               */
            +  DBUG_EXECUTE_IF("sleep_mark_busy",my_sleep(1000000););
               mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
               if (thd)
               {
            @@ -511,6 +512,8 @@ rpl_unpause_after_ftwrl(THD *thd)
               uint32 i;
               rpl_parallel_thread_pool *pool= &global_rpl_thread_pool;
               DBUG_ENTER("rpl_unpause_after_ftwrl");
            +  if (!pool->count)
            +    DBUG_VOID_RETURN;
             
               DBUG_ASSERT(pool->busy);
             
            @@ -552,6 +555,8 @@ rpl_pause_for_ftwrl(THD *thd)
               int err;
               DBUG_ENTER("rpl_pause_for_ftwrl");
             
            +  if (!pool->count)
            +    DBUG_RETURN(0);
               /*
                 While the count_pending_pause_for_ftwrl counter is non-zero, the pool
                 cannot be shutdown/resized, so threads are guaranteed to not disappear.
            
            


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Patch branch: bb-10.4-sachin

            Elkin Andrei Elkin added a comment -

            A couple of cosmetic comments are left on GH; otherwise the patch looks nice.
            Thanks!


            People

              sachin.setiya.007 Sachin Setiya (Inactive)
              elenst Elena Stepanova

