[MDEV-29809] MariaDB: node crash and recovery - Semaphore wait has lasted > 600 seconds. We intentionally crash.... - Jira

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Incomplete
Affects Version/s: 10.3.27
Fix Version/s: N/A
Component/s: Server, Storage Engine - InnoDB
Labels:
- crash
- lock
- recovery
Environment:
CentOS 7.9, 16C32G2000G,
MariaDB 10.3.27

Description

Several days ago we hit a problem: a node of our MM cluster crashed and then recovered by itself. A heavy load-in migration task was running at the time. The related log is in the attachments. 11.PNG shows that the server hung and crashed, raising signal 6. 12.PNG shows that the server raised signal 11 and aborted, which is why no backtrace file was generated. 13.PNG and 14.PNG show the information gathered when analyzing the signal waits.
If you need any other information to help solve the problem, please contact us. Thank you!

Attachments

- 11.PNG (9 kB, 2022-10-17 05:23)
- 12.PNG (49 kB, 2022-10-17 05:23)
- 13.PNG (11 kB, 2022-10-17 05:23)
- 14.PNG (6 kB, 2022-10-17 05:23)

Issue Links

relates to

MDEV-25048 semaphore has too many locks

Closed

Activity

Daniel Black added a comment - 2022-10-17 05:57

Was a core file generated? Is it possible to install the debug info packages and obtain a backtrace from the core (as text)?

A count of 91M in the reservation array (13.png) seems like a lot. What configuration are you running? Can you share the kinds of queries used for the load-in migration, and their tables?

I removed MDEV-24294, as that was 10.5+ and Galera, so not your problem. It could be similar to MDEV-25048, though as with that one, much more information is needed to even approach this problem. For private uploads, push data to https://mariadb.com/kb/en/meta/mariadb-ftp-server/; it won't be public and will only be used to resolve this issue.
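
For readers following along, the request for a text backtrace from a core file could be met along these lines. This is only a sketch: the binary path and core file path are hypothetical placeholders, not values from this report, and the helper name `gdb_backtrace_argv` is made up for illustration. It only builds a `gdb` batch command line; it does not run it.

```python
import shlex


def gdb_backtrace_argv(binary, core):
    """Build a gdb command line that dumps backtraces of all threads.

    `binary` is the server executable and `core` is the core file.
    Both are supplied by the caller; nothing here comes from the report.
    """
    return [
        "gdb",
        "--batch",                               # run the commands, then exit
        "-ex", "set pagination off",             # do not stop at each screenful
        "-ex", "thread apply all backtrace full",  # every thread, with locals
        binary,
        core,
    ]


if __name__ == "__main__":
    # Hypothetical paths; substitute the real server binary and core file.
    argv = gdb_backtrace_argv("/usr/sbin/mysqld", "/path/to/core")
    print(shlex.join(argv))
```

Redirecting the output of the printed command to a file gives a backtrace that can be attached as text. Without the matching debug info packages installed, the frames will mostly lack symbol names.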

Marko Mäkelä added a comment - 2022-10-17 06:08

By default, due to MDEV-10814, core dumps will not include a copy of the buffer pool or the buffer page descriptor. For debugging this hang, we might need access to them. A dict_index_t::lock covers all non-leaf pages in the corresponding index B-tree. It could be good to attach GDB to the server before it hangs and let the execution continue until the fatal signal is delivered.

It would be better to attach output as text instead of bitmaps. In the server error log, were there any reports of corrupted pages? I believe that before the fix of MDEV-13542, InnoDB could hang in some cases when it is trying to read a corrupted page.

Is this hang reproducible with MariaDB Server 10.6.10 or a later version?
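
As a rough aid for answering the error-log question above in text form, the log could be scanned for lines that mention corruption or semaphore waits. This is a minimal sketch: the keyword pattern is an assumption chosen for illustration, not an exhaustive list of InnoDB messages, and the function name and log path are hypothetical.

```python
import re

# Assumed keywords; adjust to whatever the actual error log contains.
SUSPECT = re.compile(r"corrupt|semaphore wait", re.IGNORECASE)


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines matching SUSPECT."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if SUSPECT.search(line)
    ]


if __name__ == "__main__":
    # Hypothetical log location; use the server's configured error log.
    with open("/path/to/error.log", encoding="utf-8", errors="replace") as log:
        for number, text in find_suspect_lines(log):
            print(f"{number}: {text}")
```

The matching lines, pasted as text, are easier to search and quote than screenshots.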

People

Assignee: Unassigned
Reporter: Wenwen Jing
Votes: 0
Watchers: 3

Dates

Created: 2022-10-17 05:23
Updated: 2022-11-23 15:57
Resolved: 2022-11-23 15:57