We have a Windows server that has been running MariaDB 10.1 successfully for over a year. The server remains mostly idle for long periods with some read access, but occasionally there are transactions that add data (about 500k rows per commit). Up to 10 such transactions (one per database) can run at the same time, and during those times the server is under considerable load (the code processing the data resides on the same server as the database).
After switching to MariaDB 10.3, the database crashes during those load periods and logs "[FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung."
I suspect it could be related to MariaDB switching from XtraDB to InnoDB; on the mailing list it was suggested that I file a bug.
I'm attaching the error log. I cannot attach a minidump as-is because, upon inspection, it contained sensitive data that I'm not allowed to disclose. If absolutely necessary, I can try to reproduce the problem with random test data, but this might take a few days.
Tom Deory added a comment:
Additional info: The error log starts from a clean state (i.e. a MariaDB 10.3 instance with no data previously in it). It contains the whole log, from starting the database up to the crash.
Vladislav Vaintroub added a comment (edited):
It is strange that you got sensitive data from those minidumps; they contain hardly anything apart from call stacks. But anyway, we need the call stacks of all threads, not the single-thread call stack that the error log provides.
There are different ways to get them, using the mysqld.dmp that you have created.
One way is to execute
cdb.exe -y "C:\Program Files\MariaDB 10.3\bin" -i "C:\Program Files\MariaDB 10.3\bin" -lines -z C:\path\to\mysqld.dmp -c "~*kc;q!"
(You might need to fix the -y and -i paths so that they point to the directory containing mysqld.pdb and mysqld.exe. The path given after -z should be the full path to mysqld.dmp.)
cdb.exe is part of Debugging Tools for Windows, which is part of the Windows SDK. After I installed it, cdb.exe is located under
C:\Program Files (x86)\Windows Kits\10\Debuggers\x64
More info on Debugging Tools for Windows:
https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/
Another way to dump all threads' call stacks is to open the minidump in Visual Studio, click the "debug with native only" button and then execute
Debug.ListCallStack /AllThreads
in the "Command Window".
So, could you please try to get the call stacks of all threads and attach them, if you are not comfortable sharing the minidump?
Jens Van Deynse added a comment (edited):
Hello
We experience the exact same issue:
We updated one of our database servers (1 of 25+) from MariaDB 10.1 to MariaDB 10.3.
For the first few days everything went as planned.
This weekend MariaDB intentionally crashed itself twice (the night between Friday and Saturday, and the night between Sunday and Monday; see the attached error_mysqld.log).
The crashes happened at exactly the same moment (00:20).
After analyzing our server (which hosts a lot of databases) we saw that 105 (small) events are scheduled at 00:00 each night, and the logs make it obvious that something goes wrong around that time: the first semaphore wait indication is at 00:04:03, with a 241-second wait.
In our opinion the events have something to do with the crash, but this was not an issue in MariaDB 10.1.
Our environment, on the other hand, is completely different:
Virtualised
Linux server -> CentOS 7.5.1804 (Core)
32GB RAM
8 CPUs
SSD
18GB innodb_buffer_pool_size
I attached our logs in the hope that you can find some useful information.
If you need any more information, please just ask and I'll see if I can find it.
Regards
Jens
Scott Klasing added a comment:
We have the identical problem on MariaDB 10.2.14.
The deadlocks occur only during slave replication, due to competing identical INSERT ... ON DUPLICATE KEY UPDATE statements for the same database. This never occurred in 10.1.18 or 10.1.22.
Let me know if I need to file a separate bug, since this report is based on 10.3 and ours is on 10.2.14.
binlog_wait_commit on the master was 2 and has since been set to 0, and the problem still occurs on the slave. On the slave side we lowered the slave parallel worker threads from 16 to 8, and now to 4 (see the sketch below).
The servers are 40 CPUs, 384 GB RAM, 260 GB buffer pools, 33 TB of disk. These servers handle extremely high-volume mass INSERT ... ON DUPLICATE KEY UPDATE with NO problems on the master. We saw occasional deadlocks on the slave even in 10.1.22, but MariaDB always handled those deadlocks correctly. Since the problems were not occurring on the master, but only on the slave during parallel replication, we consider this a MariaDB issue rather than an issue in our code. Since upgrading to 10.2.14 this problem continues to occur, now about every 4 days.
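As a sketch of the slave-side change mentioned above: it assumes the MariaDB variable slave_parallel_threads is what "slave_parallel_workers ... threads" refers to, and uses the final value of 4 from the comment.
-- Minimal sketch; slave_parallel_threads can only be changed while the slave is stopped.
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 4;
START SLAVE;
-- Verify the new setting:
SHOW GLOBAL VARIABLES LIKE 'slave_parallel%';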
Scott Klasing added a comment:
I should also have reported that we are on CentOS 6.9.x.
On one of our other servers we are also seeing slave INSERT ... ON DUPLICATE KEY UPDATE deadlocks on the same database, likewise from the parallel processing, but it has yet to cause a semaphore wait > 900 seconds. Instead the server load goes above 1000, the thread count climbs towards 2k, and the only SHOW ENGINE INNODB STATUS output of significance is the following message, indicating that a very high-level mutex is jammed:
--Thread 140079041885952 has waited at srv0srv.cc line 952 for 0.00 seconds the semaphore:
Mutex at 0x7f9c38aadd10, Mutex SRV_SYS created srv0srv.cc:1032, lock var 0
Vladislav Vaintroub added a comment (edited):
I'm afraid "#me-too" does not help much in those cases.
If something deadlocks, taking stack traces from all threads and attaching them to the bug helps. If you're on Windows, you can also add core-file under [mysqld], and attach the resulting mysqld.dmp to the bug report.
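For illustration, a minimal sketch of the option-file fragment being described; the location of the option file (my.ini / my.cnf) depends on the installation.
[mysqld]
# Produce a crash dump (mysqld.dmp on Windows) when the server crashes
core-file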
Scott Klasing added a comment:
Thanks, I will set up for the stack traces from all threads and attach them when the problem reoccurs. I only have the mysql-error.log at this point in time.
Scott Klasing added a comment:
FYI, everything is set up for the thread stack traces; now waiting for the crash, which is usually about 3-4 days out. We bumped up the slave parallel threads to see if we can trigger the problem sooner.
Scott Klasing added a comment:
Noting that since we set the database up for the core dump of all threads, the problem has not reoccurred. One thought is that the core-dump configuration slows the database down just enough to handle deadlocks properly. As mentioned earlier, we have replication configured for optimistic processing; it appears to still be too aggressive, since the problem only occurs with parallel slave processing. Setting it to conservative causes replication to immediately fall behind.
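For reference, a minimal sketch of switching between the optimistic and conservative modes discussed above; it assumes MariaDB's slave_parallel_mode variable is the setting in question.
-- Changing the mode also requires the slave SQL thread to be stopped.
STOP SLAVE;
SET GLOBAL slave_parallel_mode = 'conservative';  -- or 'optimistic'
START SLAVE;
SHOW GLOBAL VARIABLES LIKE 'slave_parallel_mode';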
Scott Klasing added a comment:
We finally have a core dump, all 360GB of it compressed, along with the mysql-error.log, SHOW VARIABLES output, etc. The plan is to send it to ftp.askmonty.org/private. Does MariaDB have an established NDA process? Our company requires one before I can send the core dump.
Vladislav Vaintroub added a comment:
sklasing, usually there is no need to send the whole dump; a backtrace of all the stacks should normally be enough.
To produce it, start gdb with the core dump, execute "thread apply all bt", and attach the output to the bug report (a sketch follows below).
See also https://mariadb.com/kb/en/library/how-to-produce-a-full-stack-trace-for-mysqld/#using-the-core-file
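A minimal sketch of that interactive session; the binary and core paths are placeholders and should be adjusted to the actual installation.
# Placeholder paths; point them at the actual mysqld binary and core file.
gdb /usr/sbin/mysqld /path/to/core
# Then, at the (gdb) prompt:
#   set pagination off
#   set logging file all_threads_bt.txt
#   set logging on
#   thread apply all bt
#   set logging off
#   quit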
Scott Klasing added a comment:
Thanks Vlad,
I have attached the backtrace outputs of two core dumps that occurred on Sunday/Monday this week:
gdb_core.64996.txt gdb_core.180561.txt
Scott Klasing added a comment:
Also noting that after the last crash, when the system restarted, it immediately reported:
2018-07-10 8:44:40 139680525514496 [Note] Detected table cache mutex contention at instance 1: 21% waits. Additional table cache instance activated. Number of instances after activation: 2.
We noted that the table cache hit rate was only 27%, so we have since quadrupled the cache size to 64k, and our current hit rate is now 79%. There are at least 16k databases on this archive server, with approximately 6 tables each.
Since then there have been no crashes, but it is a bit too early to correlate that with the change.
Vladislav Vaintroub added a comment:
Hmm, are those complete stack traces? How did you collect them?
The reason I'm asking is that I'm missing a couple of expected threads: the thread that does "poll" to handle new connections, the InnoDB background threads, and some other background threads, and most interestingly, I am missing the thread that holds the mutex. It is not possible that all threads are waiting for the same mutex; there must be one that holds it.
Elena Stepanova added a comment (edited):
For all threads' full stack traces:
gdb --batch --eval-command="thread apply all bt full" <path to the binary> <path to the coredump> > <path-to-the-resulting-file>
It will create a text file at the <path-to-the-resulting-file> location. Even with full stacks and 8600 threads it shouldn't be very big, but if it turns out to be more than 10 MB, you'll need to compress it before attaching it to the issue.
Scott Klasing added a comment:
Sorry for the delay, folks. Even the --batch command stream supplied above would not write the log file. I ended up repeating the original approach with "set pagination off", which appeared to do the trick. Let me know if you need anything else, and I will try to be more responsive. gdb.180561.txt.gz
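For reference, the batch command above can be combined with the pagination setting that finally worked here; a sketch with placeholder paths (gdb's -ex is equivalent to --eval-command).
gdb --batch -ex "set pagination off" -ex "thread apply all bt full" /path/to/mysqld /path/to/core > all_threads_bt.txt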
Elena Stepanova added a comment:
Thanks.
I've also attached gdb_short.txt – it's extracted from the archive above, for the initial quick look without unpacking.
Vladislav Vaintroub added a comment:
tdcf, we have not heard from you for a while now. Your case could be a completely different deadlock than the one sklasing is experiencing.
Vladislav Vaintroub added a comment:
Added https://jira.mariadb.org/secure/attachment/45892/uniqstack.txt with stack aggregation, so it is easier to look at.
There is high contention on the LOCK_thread_count mutex (_L_lock_995): 8412 new connections would like to acquire it outside of the thread cache, 144 new connections from the thread cache, 4 from "show processlist", and 1 replication thread that is in DeadlockChecker::print.
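For anyone wanting to do a similar aggregation by hand from a plain "thread apply all bt" dump, a rough sketch that counts threads by their innermost frame; the file name matches the earlier attachment, and the sed patterns assume gdb's usual frame format.
grep '^#0 ' gdb.180561.txt | sed -e 's/^#0 *//' -e 's/^0x[0-9a-f]* in //' -e 's/ (.*//' | sort | uniq -c | sort -rn | head -20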
Scott Klasing added a comment:
Folks, we have not had a crash for over a week now.
After the last crash, the error log displayed this message AFTER it restarted:
2018-07-10 8:44:40 139680525514496 [Note] Detected table cache mutex contention at instance 1: 21% waits. Additional table cache instance activated. Number of instances after activation: 2.
As a result we checked the open table cache hit rate, which was 27%. We then raised open_table_cache from 16k to 64k (see the sketch below), since we easily have 150k tables / 15k databases on this instance.
My theory is that when the competing inserts on duplicate key updates deadlock during slave replication, MariaDB has to clear the cache for the new tables, and perhaps that caused the 900-second semaphore delay.
Hope this helps.
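A minimal sketch of checking the hit rate and enlarging the cache at runtime; it assumes the standard MariaDB variable table_open_cache is what is called open_table_cache above, with 65536 standing in for "64k".
-- Hit rate can be estimated as hits / (hits + misses) from these counters:
SHOW GLOBAL STATUS LIKE 'Table_open_cache%';
-- Assumed to be the variable meant by "open_table_cache" above:
SET GLOBAL table_open_cache = 65536;
-- To make the change persistent, also set table_open_cache in the [mysqld] section of the option file.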
Vladislav Vaintroub added a comment (edited):
There is no contention on the table cache mutex in the log.
The message
[Note] Detected table cache mutex contention at instance 1: 21% waits. Additional table cache instance activated. Number of instances after activation: 2.
is not a warning, it is a note. A heuristic is used to automatically size the table cache instances, and this message comes from that heuristic.
Marko Mäkelä added a comment:
In uniqstack.txt I see that ibuf_delete_rec() is unnecessarily calling lock_update_delete(). But that should not be the reason for the hang.
Marko Mäkelä added a comment:
JensVanDeynse, while I did not see any sign of MDEV-14637 in uniqstack.txt, it is possible that you are hitting that issue. It has been observed as a hang between two purge threads. It might also be possible during a ROLLBACK, but as I have not created a test case, I cannot say for sure.
Jens Van Deynse added a comment:
@Marko Mäkelä, this could be the case.
We tried observing what was wrong by using gdb, and although we couldn't see much, we saw that it was hung between multiple DROP commands.
This was because of a script that drops and recreates the databases. We have since changed the timing of these scripts, which fixed the issue for some time, but this morning it crashed again after two weeks. We can't keep gdb running, since that eventually also causes the server to crash.
Sergey Vojtovich added a comment:
Looking at gdb_short.txt (which is by far the most useful attachment), the culprit is:
Thread 61 (Thread 0x7f5a100d0700 (LWP 184312)):
#0 0x00007f5a174c1334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f5a174bc60e in _L_lock_995 () from /lib64/libpthread.so.0
#2 0x00007f5a174bc576 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007f5a17def3fd in inline_mysql_mutex_lock (that=0x7f5a18dd9680, src_line=Unhandled dwarf expression opcode 0xf3
#4 inline_mysql_mutex_lock (that=0x7f5a18dd9680, src_line=Unhandled dwarf expression opcode 0xf3
#5 0x00007f5a17df1773 in thd_get_error_context_description (thd=0x7f13f00009a8, buffer=0x7f5a100caf80 "\240\260\f\020Z\177", length=1024, max_query_len=3000) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/sql_class.cc:540
#6 0x00007f5a18141cf4 in innobase_mysql_print_thd (f=0x7f5ba7906950, thd=Unhandled dwarf expression opcode 0xf3
#7 0x00007f5a181846be in DeadlockChecker::print (trx=0x7f59823329c0, max_query_len=3000) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/lock/lock0lock.cc:7238
#8 0x00007f5a18185b7f in DeadlockChecker::notify (this=0x7f5a100cb460, lock=0x7f13e4025b28) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/lock/lock0lock.cc:7384
#9 0x00007f5a18186815 in DeadlockChecker::search (this=0x7f5a100cb460) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/lock/lock0lock.cc:7500
#10 0x00007f5a18188a35 in DeadlockChecker::check_and_resolve (lock=0x7f13e4025c80, trx=0x7f59823356d0) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/lock/lock0lock.cc:7635
#11 0x00007f5a1818a535 in lock_rec_enqueue_waiting (c_lock=0x7f5a1a3de938, type_mode=2563, block=Unhandled dwarf expression opcode 0xf3
#12 0x00007f5a1818aca7 in lock_rec_insert_check_and_lock (flags=Unhandled dwarf expression opcode 0xf3
#13 0x00007f5a182b5de0 in btr_cur_ins_lock_and_undo (flags=0, cursor=0x7f5a100cbb60, entry=0x7f11501973e0, thr=0x7f04587d12f0, mtr=Unhandled dwarf expression opcode 0xf3
#14 0x00007f5a182b8844 in btr_cur_optimistic_insert (flags=0, cursor=0x7f5a100cbb60, offsets=0x7f5a100cbb28, heap=0x7f5a100cbac0, entry=0x7f11501973e0, rec=0x7f5a100cbb20, big_rec=0x7f5a100cbab0, n_ext=0, thr=0x7f04587d12f0, mtr=0x7f5a100cc660) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/btr/btr0cur.cc:3031
#15 0x00007f5a181fb3e3 in row_ins_clust_index_entry_low (flags=0, mode=2, index=0x7f1090a8e740, n_uniq=Unhandled dwarf expression opcode 0xf3
#16 0x00007f5a181fcfde in row_ins_clust_index_entry (index=0x7f1090a8e740, entry=0x7f11501973e0, thr=0x7f04587d12f0, n_ext=0, dup_chk_only=Unhandled dwarf expression opcode 0xf3
#17 0x00007f5a181ff38a in row_ins_index_entry (thr=Unhandled dwarf expression opcode 0xf3
#18 row_ins_index_entry_step (thr=Unhandled dwarf expression opcode 0xf3
#19 row_ins (thr=Unhandled dwarf expression opcode 0xf3
#20 row_ins_step (thr=Unhandled dwarf expression opcode 0xf3
#21 0x00007f5a1820e0de in row_insert_for_mysql (mysql_rec=0x7f1091cdb150 "0\220\375\221\020\177", prebuilt=0x7f04587c3cf0) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/row/row0mysql.cc:1404
#22 0x00007f5a1814fb4f in ha_innobase::write_row (this=0x7f045ac412f0, record=0x7f045ac450e0 "\016") at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/handler/ha_innodb.cc:8295
#23 0x00007f5a17fab8bf in handler::ha_write_row (this=0x7f045ac412f0, buf=0x7f045ac450e0 "\016") at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/handler.cc:5971
#24 0x00007f5a184a650a in ha_partition::write_row (this=0x7f045ac3a7f0, buf=0x7f045ac450e0 "\016") at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/ha_partition.cc:4189
#25 0x00007f5a17fab8bf in handler::ha_write_row (this=0x7f045ac3a7f0, buf=0x7f045ac450e0 "\016") at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/handler.cc:5971
#26 0x00007f5a17e07aa0 in write_record (thd=0x7f14080009a8, table=0x7f045ac3d958, info=0x7f5a100cdcc0) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/sql_insert.cc:1654
#27 0x00007f5a17e0db7c in mysql_insert (thd=0x7f14080009a8, table_list=Unhandled dwarf expression opcode 0xf3
#28 0x00007f5a17e21e39 in mysql_execute_command (thd=0x7f14080009a8) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/sql_parse.cc:4430
#29 0x00007f5a17e277ca in mysql_parse (thd=0x7f14080009a8, rawbuf=Unhandled dwarf expression opcode 0xf3
#30 0x00007f5a18076734 in Query_log_event::do_apply_event (this=0x7f13dc9ee968, rgi=0x7f14202019a0, query_arg=0x7f14221178c0 "INSERT INTO app_12289_archive.rollup_hourly (month, hrl_hour, hrl_hash, hrl_tracker, hrl_network, hrl_site, hrl_creative, hrl_geo_country, hrl_campaign, hrl_install_count, hrl_install_matched_to_imp_"..., q_len_arg=Unhandled dwarf expression opcode 0xf3
#31 0x00007f5a17db89a3 in Log_event::apply_event (this=0x7f13dc9ee968, rgi=0x7f14202019a0) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/log_event.h:1452
#32 0x00007f5a17dabc83 in apply_event_and_update_pos_apply (ev=0x7f13dc9ee968, thd=0x7f14080009a8, rgi=0x7f14202019a0, reason=Unhandled dwarf expression opcode 0xf3
#33 0x00007f5a17f2834e in rpt_handle_event (qev=0x7f1421dd4da8, rpt=Unhandled dwarf expression opcode 0xfa
#34 0x00007f5a17f2bb5c in handle_rpl_parallel_thread (arg=0x7f1420007eb0) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/rpl_parallel.cc:1300
#35 0x00007f5a174baaa1 in start_thread () from /lib64/libpthread.so.0
#36 0x00007f5a15bd2bcd in clone () from /lib64/libc.so.6
This should have been fixed as of 10.3.3 by
commit c2118a08b144c95cd4d88a080eaa70abd49f3740
Author: Monty <monty@mariadb.org>
Date: Thu Dec 7 21:28:00 2017 +0200
Move all kill mutex protection to LOCK_thd_kill
LOCK_thd_data was used to protect both THD data and
ensure that the THD is not deleted while it was in use
This patch moves the THD delete protection to LOCK_thd_kill,
which already protects the THD for kill.
The benefits are:
- More well defined what LOCK_thd_data protects
- LOCK_thd_data usage is now much simpler and easier to verify
- Less chance of deadlocks in SHOW PROCESS LIST as there is less
chance of interactions between mutexes
- Remove not needed LOCK_thread_count from
thd_get_error_context_description()
- Fewer mutex taken for thd->awake()
Other things:
- Don't take mysys->var mutex in show processlist to check if thread
is kill marked
- thd->awake() now automatically takes the LOCK_thd_kill mutex
(Simplifies code)
- Apc uses LOCK_thd_kill instead of LOCK_thd_data
Specifically by "- Remove not needed LOCK_thread_count from thd_get_error_context_description()"
10.2 is missing this fix, but it can be easily backported.
Unfortunately there's not enough data to analyze the 10.3 deadlock; a dump of all threads' call stacks is missing.
Marko Mäkelä added a comment:
The original report was about something specific to 10.3.7.
The backport of the cleanup from 10.3.3 to 10.1.41 and 10.2.25 effectively reverted the attempted fix of MDEV-11896, which caused the hang in 10.0.32, 10.1.26, 10.2.8, 10.3.1. A deeper analysis of this is in MDEV-13983.