[MDEV-16467] MariaDB crashes because of "long semaphore wait" after migrating from 10.1 to 10.3 Created: 2018-06-11 Updated: 2021-09-30 Resolved: 2019-06-12 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.3.7 |
| Fix Version/s: | 10.2.25, 10.1.41 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Tom Deory | Assignee: | Sergey Vojtovich |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | crash, innodb, semaphore | ||
| Environment: |
Windows 10, 32 GB RAM, SSD |
||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
We have a Windows server that has been running MariaDB 10.1 successfully for over a year. The server remains mostly idle for long periods with some read access, but occasionally there are transactions that add data (about 500k rows per commit). There can be up to 10 such transactions (one per database) at the same time, and during those times the server is under quite some load (the code processing the data resides on the same server as the database). When trying to use MariaDB 10.3, during those load times the database crashes and logs "[FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung." I suspect it could be related to MariaDB switching from XtraDB to InnoDB; on the mailing list it was suggested to file a bug. I'm attaching the error log. I cannot attach a minidump as-is because upon inspection it contained sensitive data that I'm not allowed to disclose. If absolutely necessary, I can try and reproduce the problem with random test data, but this might take a few days. The mailing list entry mentioned above can be found here: https://lists.launchpad.net/maria-discuss/msg05139.html Regards, |
| Comments |
| Comment by Tom Deory [ 2018-06-11 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Additional info: The error log is starting from a clean state (i.e. a MariaDB 10.3 with no data previously in it). It contains the whole log from starting the database up to the crash. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2018-06-18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
It is strange that you got sensitive data from those minidumps; they contain hardly anything apart from call stacks. But anyway, we need the call stacks of all threads, not the single thread's call stack that the error log provides. There are different ways to get them using the mysqld.dmp that you have created. One way is to execute cdb.exe -y "C:\Program Files\MariaDB 10.3\bin" -i "C:\Program Files\MariaDB 10.3\bin" -lines -z C:\path\to\mysqld.dmp -c "~*kc;q" (you might need to fix the -y and -i paths so they point to the directory containing mysqld.pdb and mysqld.exe; the path given after -z should be the full path to mysqld.dmp). cdb.exe is part of Debugging Tools for Windows, which is part of the Windows SDK. After I installed it, cdb.exe is located under more info on debugging tools for Windows : Another way to dump all threads' call stacks is to open the minidump in Visual Studio, click the "debug with native only" button, and then execute Debug.ListCallStack /AllThreads in the "Command Window". So, could you please try to get all threads' call stacks and attach that, if you are not comfortable sharing the minidump | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
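A minimal sketch of the cdb invocation described in the comment above, with the same hypothetical install and dump paths; the block only assembles and prints the command line rather than running cdb.exe:

```shell
# Sketch of the cdb.exe invocation from the comment above (paths are assumptions).
# -y / -i point at the directory holding mysqld.exe and mysqld.pdb;
# -z takes the full path to the minidump; "~*kc;q" prints every thread's
# call stack and then quits the debugger.
BIN='C:\Program Files\MariaDB 10.3\bin'
DUMP='C:\path\to\mysqld.dmp'
CDB_CMD="cdb.exe -y \"$BIN\" -i \"$BIN\" -lines -z \"$DUMP\" -c \"~*kc;q\""
echo "$CDB_CMD"
```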
| Comment by Jens Van Deynse [ 2018-06-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello, we experience the exact same issue:
Our environment on the other hand is completely different:
I attached our logs in the hope that you could find some useful information. Regards | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Scott Klasing [ 2018-06-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We have the identical problem on MDB 10.2.14. LMK if I need to file a separate bug, since this report is based on 10.3 and ours is on 10.2.14. binlog_wait_commit on the master was 2 and has since been set to 0, and the problem still occurs on the slave. On the slave side we lowered the slave_parallel_workers thread count from 16 to 8 and now to 4. Servers are 40 CPU, 384 GB RAM, buffer pools 260 GB, disks 33 TB. These servers handle extremely high-volume mass inserts on duplicate update with NO problems on the master. We saw occasional deadlocks on the slave even in 10.1.22, but MDB always correctly handled the deadlocks. Since the problems were not occurring on the master, but only on the slave during parallelism, this is considered an MDB issue as opposed to a code issue. Once upgraded to 10.2.14, this problem continues to occur, now about every 4 days. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Scott Klasing [ 2018-06-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Should have also reported that we are on CentOS 6.9.x. On one of our other servers we are also seeing slave insert-on-duplicate-update deadlocks on the same database, also from the parallel processing, but it has yet to cause a semaphore wait > 900 seconds. Instead the server load and thread counts spike: load > 1000, threads upwards of 2k, and the only SHOW ENGINE INNODB STATUS output of significance is the following message indicating that a very high-level mutex is jammed. --Thread 140079041885952 has waited at srv0srv.cc line 952 for 0.00 seconds the semaphore: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2018-06-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I'm afraid #me-too does not help much in those cases | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Scott Klasing [ 2018-06-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks, will set up for the stack traces from all threads and attach them when the problem reoccurs. We only have the mysql-error.log at this point in time. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Scott Klasing [ 2018-06-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
fyi, everything is set up for the thread stack traces; now waiting for the crash which is usually about 3-4 days out. We bumped up the slave parallel threads to see if we can trigger the problem sooner. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Scott Klasing [ 2018-07-05 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Noting that since we set the db up for the core dump of all threads, the problem has not reoccurred; one thought is that the core dump configuration slows down the db just enough to properly handle deadlocks. As noted earlier, we have replication configured for optimistic processing; it appears to still be too aggressive, since the problem only occurs with parallel slave processing. Setting it to conservative causes replication to immediately fall behind. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
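The replication settings discussed in this thread can be summarized as a my.cnf fragment; the values below come from the comments (optimistic mode, worker count lowered to 4) and are illustrative, not a recommendation:

```ini
# my.cnf sketch of the slave settings discussed above (values from the
# comments, for illustration only).  "optimistic" applies transactions in
# parallel and rolls back on conflict; "conservative" is the safer default.
[mysqld]
slave_parallel_mode    = optimistic
slave_parallel_threads = 4
```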
| Comment by Scott Klasing [ 2018-07-09 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We finally have a core dump, all 360 GB of it compressed, along with the mysql-error.log, SHOW VARIABLES output, etc. The plan is to send it to ftp.askmonty.org/private. Does MariaDB have an established NDA process? Our company requires one prior to my sending the core dump. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2018-07-13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
sklasing, usually there is no need to send the whole dump; a backtrace of all threads should normally be enough. To produce it, start gdb with the core dump, execute "thread apply all bt", and attach the output to the bug report | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
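The gdb steps described above can be sketched as a single command line; the binary and core file paths are assumptions for illustration, so the block only assembles and prints the command rather than executing it:

```shell
# Sketch of the gdb invocation described above (paths are assumptions).
# "thread apply all bt" prints a backtrace for every thread in the core;
# "-ex quit" exits gdb when done.
GDB_CMD='gdb /usr/sbin/mysqld /var/lib/mysql/core -ex "thread apply all bt" -ex quit'
echo "$GDB_CMD"
```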
| Comment by Scott Klasing [ 2018-07-13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks Vlad, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Scott Klasing [ 2018-07-13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Also noting that after the last crash, when the system restarted, it immediately reported: 2018-07-10 8:44:40 139680525514496 [Note] Detected table cache mutex contention at instance 1: 21% waits. Additional table cache instance activated. Number of instances after activation: 2. We noted the cache hit rate was only 27%, so we have since quadrupled the cache size to 64k, and our current hit rate is now 79%. There are at least 16k databases on this archive server with approx 6 tables each. Since then no crashes, but it is a bit too early to correlate with the change. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2018-07-13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hmm, are those complete stack traces? How did you collect them? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2018-07-13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
For all threads' full stack trace:
It will create a text file at the <path-to-the-resulting-file> location. Even with full stacks and 8600 threads it shouldn't be very big, but if it turns out to be more than 10 MB, you'll need to compress it before attaching it to the issue. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
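A batch-mode variant of the gdb run described above, writing all threads' backtraces straight to a text file; paths and the output file name are assumptions, and "set pagination off" is included because a later comment reports the batch run would not write the log without it. The block only prints the command line:

```shell
# Sketch of a non-interactive gdb run that dumps all threads' backtraces
# to a file (paths and file name are assumptions).  "set pagination off"
# prevents gdb from pausing at pager prompts, which reportedly broke the
# first --batch attempt in this thread.
GDB_BATCH='gdb --batch -ex "set pagination off" -ex "thread apply all bt full" /usr/sbin/mysqld /path/to/core > all_threads_bt.txt'
echo "$GDB_BATCH"
```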
| Comment by Scott Klasing [ 2018-07-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sorry for the delay, folks. Even the above-supplied --batch command stream would not write the log file. I ended up repeating the original commands with "set pagination off", which appeared to do the trick. LMK if you need anything else; I will try to be more responsive. gdb.180561.txt.gz | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2018-07-18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2018-07-18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
tdcf, we have not heard from you for a while now. Your case could be a completely different deadlock than the one sklasing is experiencing | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2018-07-18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Added https://jira.mariadb.org/secure/attachment/45892/uniqstack.txt with stack aggregation, so it is easier to look at. There is high contention on the LOCK_thread_count mutex (_L_lock_995): 8412 new connections would like to acquire it outside of the thread cache, 144 new connections from the thread cache, 4 from "show processlist", and 1 replication thread that is in DeadlockChecker::print | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
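The kind of stack aggregation behind uniqstack.txt can be sketched with standard shell tools: collapse identical frames from a "thread apply all bt" dump and count how many threads share each one. The sample input below is fabricated for illustration; a real dump would come from gdb:

```shell
# Create a tiny sample resembling gdb "thread apply all bt" output
# (fabricated for illustration).
cat > /tmp/bt_sample.txt <<'EOF'
Thread 3 (Thread 0x7f1):
#0  __lll_lock_wait ()
#1  pthread_mutex_lock ()
Thread 2 (Thread 0x7f2):
#0  __lll_lock_wait ()
#1  pthread_mutex_lock ()
Thread 1 (Thread 0x7f3):
#0  poll ()
EOF
# Threads blocked in the same top frame collapse into one counted line,
# so the dominant wait stands out immediately.
grep '^#0' /tmp/bt_sample.txt | sort | uniq -c | sort -rn
```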
| Comment by Scott Klasing [ 2018-07-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Folks, we have not had a crash for over a week now. After the last crash, the error log displayed this message AFTER the restart. As a result we checked the open table cache hit rate, which was 27%. We then raised table_open_cache from 16k to 64k, since we easily have 150k tables / 15k databases on this instance. My theory is that when the competing insert-on-duplicate-key updates deadlock during slave replication, MDB has to evict cached entries to make room for the new tables, and perhaps that caused the 900-second semaphore delay. Hope this helps. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
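The hit-rate check mentioned above can be sketched from MariaDB's Table_open_cache_hits / Table_open_cache_misses status counters; the numbers below are hypothetical placeholders chosen to reproduce the 27% figure from the comment:

```shell
# Sketch: compute the table cache hit rate from SHOW GLOBAL STATUS counters.
# On a live server the two values would come from:
#   mysql -e "SHOW GLOBAL STATUS LIKE 'Table_open_cache_%'"
# The numbers below are hypothetical, picked to match the 27% mentioned above.
hits=27000
misses=73000
rate=$(( 100 * hits / (hits + misses) ))
echo "table_open_cache hit rate: ${rate}%"
```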
| Comment by Vladislav Vaintroub [ 2018-07-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
There is no contention on the table cache mutex in the log. The message quoted above is not a warning, it is a note: a heuristic is used to automatically size table cache instances, and the message comes from that heuristic. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-07-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In uniqstack.txt | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2018-07-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
JensVanDeynse, while I did not see any sign of | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jens Van Deynse [ 2018-07-31 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
@Marko Mäkelä, this could be the case | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2019-06-11 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Looking at the gdb_short.txt (which is by far the most useful attachment), the culprit is:
This should've been fixed since 10.3.3 by
Specifically by "- Remove not needed LOCK_thread_count from thd_get_error_context_description()". 10.2 is missing this fix, but it can be easily backported. Unfortunately there is not enough data to analyze the 10.3 deadlock: a dump of all threads' call stacks is missing. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2019-06-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Given that original 10.3.7 mariadb_clean.err contains ADD FOREIGN KEY:
It feels like a similar problem was fixed by | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2019-06-12 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The 10.3 issue should've been fixed already | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-01-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The original report was something specific to 10.3.7. |