[MDEV-29809] MariaDB: node crash and recovery - Semaphore wait has lasted > 600 seconds. We intentionally crash.... Created: 2022-10-17  Updated: 2022-11-23  Resolved: 2022-11-23

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: 10.3.27
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Wenwen Jing Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: crash, lock, recovery
Environment:

CentOS 7.9, 16C32G2000G
MariaDB 10.3.27


Attachments: PNG File 11.PNG     PNG File 12.PNG     PNG File 13.PNG     PNG File 14.PNG    
Issue Links:
Relates
relates to MDEV-25048 semaphore has too many locks Closed

 Description   

Several days ago, a node of our MM cluster crashed and then recovered by itself. A heavy load-in migration task was running at the time. The related logs are attached as screenshots. 11.PNG shows that the server hung and then crashed, raising signal 6. 12.PNG shows that the server then raised signal 11 and aborted, which is why no backtrace file was generated. 13.PNG and 14.PNG show the information from analyzing the signal wait.
If you need any other information to help solve the problem, please contact us. Thank you!



 Comments   
Comment by Daniel Black [ 2022-10-17 ]

Was a core file generated? Is it possible to install the debug info packages and obtain a backtrace from the core (as text)?

A count of 91M in the reservation array (13.png) seems like a lot. What configuration are you running? Can you share the kinds of queries used by the load-in migration and their tables?

I removed the link to MDEV-24294 because that issue concerns 10.5+ and Galera, so it is not your problem. This could be similar to MDEV-25048, though as with that one, much more information is needed to even approach this problem. For private uploads, push data to https://mariadb.com/kb/en/meta/mariadb-ftp-server/; it will not be public and will only be used to resolve this issue.

Comment by Marko Mäkelä [ 2022-10-17 ]

By default, due to MDEV-10814, core dumps will not include a copy of the buffer pool or the buffer page descriptor. For debugging this hang, we might need access to them. A dict_index_t::lock covers all non-leaf pages in the corresponding index B-tree. It could be good to attach GDB to the server before it hangs and let the execution continue until the fatal signal is delivered.
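
The suggestion to attach GDB ahead of the hang could be sketched roughly as follows. This is only an illustration: the helper name `gdb_attach_command` is hypothetical, and it assumes `gdb` is installed and that the caller has permission to trace the server process. It relies on GDB's standard `-p`, `-batch` and `-ex` options to attach, resume execution, and print all thread stacks once the process stops on a signal.

```python
import shlex


def gdb_attach_command(pid):
    """Build a gdb argv that attaches to a running process, resumes it,
    and prints a backtrace of every thread once it stops on a signal.

    Hypothetical helper for illustration; it only builds the command
    and does not run gdb itself.
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        "gdb",
        "-p", str(pid),                      # attach to the running server
        "-batch",                            # run the commands below, then exit
        "-ex", "continue",                   # resume until a signal stops it
        "-ex", "thread apply all backtrace"  # dump the stack of every thread
    ]


if __name__ == "__main__":
    # 12345 is a placeholder PID, not a value taken from this report.
    print(" ".join(shlex.quote(arg) for arg in gdb_attach_command(12345)))
```

Redirecting the output of the resulting command to a file would give a text backtrace that can be attached to the ticket.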

It would be better to attach output as text instead of bitmaps. In the server error log, were there any reports of corrupted pages? I believe that before the fix of MDEV-13542, InnoDB could hang in some cases when it is trying to read a corrupted page.
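
One way to answer the question about corrupted pages, and to provide the log as text, is a small scan of the error log. The sketch below is an assumption-laden example: the function name and the two search patterns are illustrative guesses (a case-insensitive match on "corrupt" and on "semaphore wait"), not an exhaustive list of InnoDB messages, and the log path has to be supplied by the user.

```python
import re
import sys

# Assumed, deliberately broad patterns; adjust to the messages actually
# present in your error log.
PATTERNS = {
    "corruption": re.compile(r"corrupt", re.IGNORECASE),
    "semaphore": re.compile(r"semaphore wait", re.IGNORECASE),
}


def scan_error_log(lines):
    """Return (line_number, label, text) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip("\n")))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_log.py /path/to/error.log
    with open(sys.argv[1], errors="replace") as handle:
        for number, label, text in scan_error_log(handle):
            print(f"{number}: [{label}] {text}")
```

The printed lines can be pasted into the ticket as text instead of screenshots.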

Is this hang reproducible with MariaDB Server 10.6.10 or a later version?
