[MDEV-27039] Trying to lock mutex ... when the mutex was already locked | SIGABRT in safe_mutex_lock | Hangs Created: 2021-11-13  Updated: 2022-03-10  Resolved: 2022-01-03

Status: Closed
Project: MariaDB Server
Component/s: Replication, Storage Engine - Spider
Affects Version/s: 10.6, 10.7, 10.8
Fix Version/s: 10.6.6, 10.7.2

Type: Bug Priority: Critical
Reporter: Roel Van de Paar Assignee: Andrei Elkin
Resolution: Fixed Votes: 0
Labels: cross-mysqld-interaction, hang, regression

Issue Links:
Duplicate
duplicates MDEV-25890 Trying to lock mutex ... when the mut... Closed
Problem/Incident
is caused by MDEV-19371 Implement binlog_expire_logs_seconds ... Closed
Relates

 Description   

Very similar to MDEV-25890 but different versions and different area. Almost definitely related. This testcase too causes mysqld hangs and likely (though not confirmed) cross-mysqld interaction (i.e. one mysqld affecting another, without anything shared). Testcase is sporadic, though usually only 1-2 executions are needed.

# mysqld options required for replay: --log-bin --performance-schema
INSTALL PLUGIN spider SONAME 'ha_spider.so';
SET GLOBAL expire_logs_days=11;
SET GLOBAL binlog_checksum=NONE;

Leads to:

10.8.0 bc57ff7cf79504a198d4752b4748340a4afd886c (Debug)

2021-11-13 14:21:40 0 [Note] /test/MD121121-mariadb-10.8.0-linux-x86_64-dbg/bin/mysqld: ready for connections.
Version: '10.8.0-MariaDB-debug-log'  socket: '/test/MD121121-mariadb-10.8.0-linux-x86_64-dbg/socket.sock'  port: 37855  MariaDB Server
safe_mutex: Trying to lock mutex at /test/10.8_dbg/sql/sql_plugin.cc, line 3093, when the mutex was already locked at /test/10.8_dbg/sql/sys_vars_shared.h, line 56 in thread T@47

10.8.0 bc57ff7cf79504a198d4752b4748340a4afd886c (Debug)

Core was generated by `/test/MD121121-mariadb-10.8.0-linux-x86_64-dbg/bin/mysqld --no-defaults --core-'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
[Current thread is 1 (Thread 0x154df8bed700 (LWP 1629659))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000154e130d3859 in __GI_abort () at abort.c:79
#2  0x00005603c8c83268 in safe_mutex_lock (mp=mp@entry=0x5603c9795b00 <LOCK_global_system_variables>, my_flags=my_flags@entry=0, file=file@entry=0x5603c8e31318 "/test/10.8_dbg/sql/sql_plugin.cc", line=line@entry=3093) at /test/10.8_dbg/mysys/thr_mutex.c:262
#3  0x00005603c8c7d770 in psi_mutex_lock (that=0x5603c9795b00 <LOCK_global_system_variables>, file=file@entry=0x5603c8e31318 "/test/10.8_dbg/sql/sql_plugin.cc", line=line@entry=3093) at /test/10.8_dbg/mysys/my_thr_init.c:487
#4  0x00005603c817f81c in inline_mysql_mutex_lock (src_line=3093, src_file=0x5603c8e31318 "/test/10.8_dbg/sql/sql_plugin.cc", that=<optimized out>) at /test/10.8_dbg/include/mysql/psi/mysql_thread.h:746
#5  sync_dynamic_session_variables (thd=thd@entry=0x154dc0000db8, global_lock=global_lock@entry=true) at /test/10.8_dbg/sql/sql_plugin.cc:3093
#6  0x00005603c817fa9a in intern_sys_var_ptr (thd=0x154dc0000db8, offset=160, global_lock=global_lock@entry=true) at /test/10.8_dbg/sql/sql_plugin.cc:3160
#7  0x00005603c817fb8a in mysql_sys_var_int (thd=<optimized out>, offset=<optimized out>) at /test/10.8_dbg/sql/sql_plugin.cc:3182
#8  0x0000154de91d59bf in spider_param_init_sql_alloc_size (thd=0x154dc0000db8, init_sql_alloc_size=0) at /test/10.8_dbg/storage/spider/spd_param.cc:600
#9  0x0000154de92a0b01 in spider_mbase_handler::init (this=0x154dc005d950) at /test/10.8_dbg/storage/spider/spd_db_mysql.cc:8993
#10 0x0000154de91edf2e in spider_get_trx (thd=thd@entry=0x154dc0000db8, regist_allocated_thds=regist_allocated_thds@entry=true, error_num=error_num@entry=0x154df8bebb1c) at /test/10.8_dbg/storage/spider/spd_trx.cc:1518
#11 0x0000154de9248946 in spider_flush_logs (hton=<optimized out>) at /test/10.8_dbg/storage/spider/spd_table.cc:7040
#12 0x00005603c845bf88 in flush_handlerton (thd=thd@entry=0x0, plugin=<optimized out>, arg=arg@entry=0x0) at /test/10.8_dbg/sql/handler.cc:3011
#13 0x00005603c8180ab0 in plugin_foreach_with_mask (thd=thd@entry=0x0, func=func@entry=0x5603c845bf66 <flush_handlerton(THD*, plugin_ref, void*)>, type=type@entry=1, state_mask=state_mask@entry=8, arg=arg@entry=0x0) at /test/10.8_dbg/sql/sql_plugin.cc:2509
#14 0x00005603c8463419 in ha_flush_logs () at /test/10.8_dbg/sql/handler.cc:3017
#15 0x00005603c85ca5e8 in MYSQL_BIN_LOG::purge (this=this@entry=0x5603c97ca160 <mysql_bin_log>) at /test/10.8_dbg/sql/log.cc:7031
#16 0x00005603c85cf2f9 in MYSQL_BIN_LOG::checkpoint_and_purge (this=this@entry=0x5603c97ca160 <mysql_bin_log>, binlog_id=binlog_id@entry=1) at /test/10.8_dbg/sql/log.cc:7043
#17 0x00005603c85da247 in binlog_checksum_update (thd=<optimized out>, var=<optimized out>, var_ptr=<optimized out>, save=<optimized out>) at /test/10.8_dbg/sql/log.cc:11570
#18 0x00005603c8181bf1 in sys_var_pluginvar::global_update (this=0x5603cbead8b0, thd=0x154dc0000db8, var=0x154dc0014308) at /test/10.8_dbg/sql/sql_plugin.cc:3621
#19 0x00005603c805b6ae in sys_var::update (this=0x5603cbead8b0, thd=0x154dc0000db8, var=0x154dc0014308) at /test/10.8_dbg/sql/set_var.cc:207
#20 0x00005603c805bbdb in set_var::update (this=<optimized out>, thd=<optimized out>) at /test/10.8_dbg/sql/set_var.cc:863
#21 0x00005603c805cf21 in sql_set_variables (thd=thd@entry=0x154dc0000db8, var_list=var_list@entry=0x154dc00060d8, free=free@entry=true) at /test/10.8_dbg/sql/set_var.cc:745
#22 0x00005603c81638f4 in mysql_execute_command (thd=thd@entry=0x154dc0000db8, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /test/10.8_dbg/sql/sql_parse.cc:5034
#23 0x00005603c814ccad in mysql_parse (thd=thd@entry=0x154dc0000db8, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x154df8bec400) at /test/10.8_dbg/sql/sql_parse.cc:8028
#24 0x00005603c815b949 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x154dc0000db8, packet=packet@entry=0x154dc001a239 "SET GLOBAL binlog_checksum=NONE", packet_length=packet_length@entry=31, blocking=blocking@entry=true) at /test/10.8_dbg/sql/sql_class.h:1360
#25 0x00005603c815ed83 in do_command (thd=0x154dc0000db8, blocking=blocking@entry=true) at /test/10.8_dbg/sql/sql_parse.cc:1402
#26 0x00005603c82d8e2a in do_handle_one_connection (connect=<optimized out>, connect@entry=0x5603cc5ccba8, put_in_cache=put_in_cache@entry=true) at /test/10.8_dbg/sql/sql_connect.cc:1418
#27 0x00005603c82d942f in handle_one_connection (arg=arg@entry=0x5603cc5ccba8) at /test/10.8_dbg/sql/sql_connect.cc:1312
#28 0x00005603c87594ce in pfs_spawn_thread (arg=0x5603cc4c3b88) at /test/10.8_dbg/storage/perfschema/pfs.cc:2201
#29 0x0000154e135e2609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#30 0x0000154e131d0293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Bug confirmed present in:
MariaDB: 10.6.6 (dbg), 10.7.2 (dbg), 10.8.0 (dbg)

Bug (or feature/syntax) confirmed not present in (unless sporadicity caused it not to crash/fail):
MariaDB: 10.2.42 (dbg), 10.2.42 (opt), 10.3.33 (dbg), 10.3.33 (opt), 10.4.23 (dbg), 10.4.23 (opt), 10.5.14 (dbg), 10.5.14 (opt), 10.6.6 (opt), 10.7.2 (opt), 10.8.0 (opt)
MySQL: 5.5.62 (dbg), 5.5.62 (opt), 5.6.51 (dbg), 5.6.51 (opt), 5.7.36 (dbg), 5.7.36 (opt), 8.0.27 (dbg), 8.0.27 (opt)



 Comments   
Comment by Nayuta Yanagisawa (Inactive) [ 2021-12-21 ]

The bug is reproducible on 10.6.3 but not on 10.6.0.

Comment by Nayuta Yanagisawa (Inactive) [ 2021-12-21 ]

The result of git-bisect:

49ff2cbff443b7d7e688f97887a05297901601da is the first bad commit
commit 49ff2cbff443b7d7e688f97887a05297901601da
Author: Sujatha <sujatha.sivakumar@mariadb.com>
Date:   Fri May 7 12:57:20 2021 +0530
 
    MDEV-19371: Implement binlog_expire_logs_seconds for purging of binary logs
    
    Part1: Functional changes
    
    Backporting upstream changes.
    commit a7e1ef858ee82493dd8ad9a76bc9c22fe3b8c05b
    Author: Neha Kumari <neha.n.kumari@oracle.com>
    Note:
    From the upstream patch only the new option binlog_expire_logs_seconds
    specific changes are taken.
    
    * Unlike in the upstream patch 'binlog_expire_logs_seconds' does not
      replace the "old" 'expire_logs_days', to preserve backward-compatibility.
    
    * Datatype of 'expire_logs_days' variable is changed to double.
    
    * Default value of 'binlog_expire_logs_seconds=0' similar to
      'expire_logs_days'.
    
    * The purge_time can be specified in days with the micro-day precision.
      Eg:
      expire_logs_days=1 is the same as expire_logs_days=1.000000 to make
        binlog_expire_logs_seconds=86400.
      binlog_expire_logs_seconds=1 is the same as expire_logs_days=0.000012.
    
    * If binary log is disabled and option 'expire_logs_days' or
      'binlog_expire_logs_seconds' used with purge_time > 0 a warning will be
      issued.
 
 sql/log.cc      |  6 ++++--
 sql/mysqld.cc   | 34 +++++++++++++++++++++++++++++-----
 sql/mysqld.h    |  5 ++++-
 sql/sys_vars.cc | 38 +++++++++++++++++++++++++++++++++++---
 4 files changed, 72 insertions(+), 11 deletions(-)

Comment by Nayuta Yanagisawa (Inactive) [ 2021-12-21 ]

Elkin The bug seems to be due to the change in the replication component. Please check it.

Comment by Sergei Golubchik [ 2021-12-21 ]

In particular, this change which is not explained in the commit comment and is not present in the upstream patch.

It comes from a different upstream commit and I suspect it's just wrong there.

Comment by Nayuta Yanagisawa (Inactive) [ 2021-12-21 ]

What happens in the above back trace:

LOCK_global_system_variables is acquired in sys_var::update(). Then, spider_param_init_sql_alloc_size() tries to access the Spider's system variable spider_init_sql_alloc_size and tries to take LOCK_global_system_variables lock (dead lock!).

Comment by Nayuta Yanagisawa (Inactive) [ 2021-12-21 ]

As serg pointed out, removing ha_flush_logs() fixes the problem.

The following makeshift patch also fix the problem (note: this cannot be applied as is):

diff --git a/storage/spider/spd_param.cc b/storage/spider/spd_param.cc
index b23877ca92a..e8d9dd4fca3 100644
--- a/storage/spider/spd_param.cc
+++ b/storage/spider/spd_param.cc
@@ -597,8 +597,7 @@ int spider_param_init_sql_alloc_size(
   int init_sql_alloc_size
 ) {
   DBUG_ENTER("spider_param_init_sql_alloc_size");
-  DBUG_RETURN(THDVAR(thd, init_sql_alloc_size) < 0 ?
-    init_sql_alloc_size : THDVAR(thd, init_sql_alloc_size));
+  DBUG_RETURN(1024);
 }
 
 /*
@@ -1147,7 +1146,7 @@ bool spider_param_use_flash_logs(
   THD *thd
 ) {
   DBUG_ENTER("spider_param_use_flash_logs");
-  DBUG_RETURN(THDVAR(thd, use_flash_logs));
+  DBUG_RETURN(FALSE);
 }

Comment by Nayuta Yanagisawa (Inactive) [ 2021-12-21 ]

Elkin Please check comments made by serg and me.

If the log flushing is necessary for setting the binlog_checksum variable, we may have to change the Spider implementation. However, if it is wrongly there, please remove it.

Comment by Nayuta Yanagisawa (Inactive) [ 2021-12-21 ]

Elkin FYI, MDEV-25890 also seems to be introduced by 49ff2cb.

Comment by Sergei Golubchik [ 2021-12-21 ]

I suspect that ha_flush_logs() there is not needed.
But if it is, it certainly cannot be done under LOCK_global_system_variables mutex.

Comment by Roel Van de Paar [ 2021-12-26 ]

Elkin FYI, some related info (testcase, notes) in MDEV-25890.

Comment by Andrei Elkin [ 2021-12-27 ]

Indeed, ha_flush_logs() is unnecessary. Unlike the upstream, where it arrived from, Mariadb exploits different
mechanisms for not letting PURGE or RESET-MASTER to trouble transaction recovery. That is in case should a trx
be prepared but its binlog file gone, the trx then is committed on disk too.
I am reverting the culprit to submit a commit for review.

Comment by Andrei Elkin [ 2021-12-27 ]

Howdy, Sergei.

I'm inviting you to check 'cos you've been looking into anyway.

Thank you.

/ndrei.

Comment by Sergei Golubchik [ 2021-12-27 ]

the commit de548dbb0f2089 is, of course, ok, as discussed above.
WIll you add a test case?

Comment by Andrei Elkin [ 2021-12-27 ]

serg, no need to add more tests than those that the commit message points. They test PURGE and more.

Comment by Sergei Golubchik [ 2021-12-27 ]

what about the test for MDEV-27039? The commit specifically mentions this MDEV in the subject, but there is no test that the bug — the crash — went away.

Comment by Andrei Elkin [ 2021-12-28 ]

serg, You're right. My focus here was too much replication-centric. I adding the description based one to spider 's own suite.

Comment by Andrei Elkin [ 2021-12-29 ]

A regression test is added as a separate commit (to be fixed up at future pushing).
It actually makes clear about the necessary condition to summon the bug.

Cheers.

Comment by Sergei Golubchik [ 2022-01-02 ]

2611abb3562 is to to push too, thanks!

Generated at Thu Feb 08 09:49:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.