[MDEV-22340] Server hangs on ibuf_merge_or_delete_for_page during shutdown with innodb_fast_shutdown=0 Created: 2020-04-22 Updated: 2023-05-30 Resolved: 2023-05-30 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Server, Storage Engine - InnoDB |
| Affects Version/s: | 10.3.20 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Bernardo Perez | Assignee: | Marko Mäkelä |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | None | ||
| Environment: |
Linux |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Description |
|
Initially I thought I was hitting I seem to have a case that matches this behavior on one of the versions that in theory fixes this. I've attached the stack. Here is some extra info: Threads 4 and 5 both seem to be in ibuf_merge_or_delete_for_page. Thread 4 seems to be a thread dumping the buffer pool. They seem to be working on a clustered index from the change buffer.
I've printed dictionary operation locks and sys mutexes:
Locks of the two threads: thread 4
thread 5
|
| Comments |
| Comment by Bernardo Perez [ 2020-04-22 ] | ||
|
The additional information requested has been added to extra_info.txt | ||
| Comment by Marko Mäkelä [ 2020-04-29 ] | ||
|
Bernardo Perez, thank you. Can you double-check that buf_page_io_complete() was really invoked for the same page number in both threads? Both threads are thankfully invoking buf_read_page_low() with sync=true, so that the completion function will be called in the same thread. This suggests that there could be a bug in buf_read_page_background() or buf_page_get_gen(), causing the hang if both threads are invoked on the same page number. The bug might not be limited to loading buffer pool dumps. The function buf_read_page_background() is also invoked by read-ahead operations. Can you repeat this bug easily? | ||
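One way to carry out the check requested above on a saved gdb dump is to group the threads by the functions that appear in their stacks. The following is a minimal sketch, not a tool used in this report: it assumes the dump was produced with gdb's `thread apply all backtrace`, the helper name `threads_by_function` is hypothetical, and the sample stacks are made up for illustration (they are not the attached traces). It only shows which threads share a function; the page numbers still have to be compared by hand in the matching frames.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Header line of each thread in "thread apply all backtrace" output.
THREAD_RE = re.compile(r"^Thread (\d+) \(")
# A frame line: "#N  [0xADDR in ]function_name (args...) ..."
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?([A-Za-z_][\w:]*)\s*\(")


def threads_by_function(dump: str) -> Dict[str, List[int]]:
    """Map each function name to the sorted gdb thread numbers whose stack contains it."""
    seen = defaultdict(set)
    current = None
    for line in dump.splitlines():
        thread = THREAD_RE.match(line)
        if thread:
            current = int(thread.group(1))
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            seen[frame.group(2)].add(current)
    return {name: sorted(ids) for name, ids in seen.items()}


# Made-up sample in gdb's usual layout; addresses and arguments are placeholders.
SAMPLE = """\
Thread 5 (Thread 0x7f0000000005 (LWP 105)):
#0  0x00007f0000001000 in pthread_cond_wait () from /lib64/libpthread.so.0
#1  0x0000550000002000 in ibuf_merge_or_delete_for_page (block=0x1, page_id=...)
#2  0x0000550000003000 in buf_page_io_complete (bpage=0x1)
#3  0x0000550000004000 in buf_read_page_low (sync=true)

Thread 4 (Thread 0x7f0000000004 (LWP 104)):
#0  0x00007f0000001000 in pthread_cond_wait () from /lib64/libpthread.so.0
#1  0x0000550000002000 in ibuf_merge_or_delete_for_page (block=0x2, page_id=...)
#2  0x0000550000003000 in buf_page_io_complete (bpage=0x2)
#3  0x0000550000004000 in buf_read_page_low (sync=true)
#4  0x0000550000005000 in buf_read_page_background (page_id=...)

Thread 1 (Thread 0x7f0000000001 (LWP 101)):
#0  0x00007f0000006000 in poll () from /lib64/libc.so.6
"""

if __name__ == "__main__":
    stacks = threads_by_function(SAMPLE)
    for name in ("ibuf_merge_or_delete_for_page", "buf_page_io_complete"):
        print(name, stacks.get(name, []))
```

Threads that show up under both `ibuf_merge_or_delete_for_page` and `buf_page_io_complete` are the candidates whose frame arguments are worth comparing.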
| Comment by Marko Mäkelä [ 2020-04-29 ] | ||
|
Bernardo Perez, while you detect the hang on shutdown, I suspect that these two threads must have been hung all the way since startup. If you attempted to access the page 8865 from SQL, the thread for handling that connection should hang as well. | ||
| Comment by Bernardo Perez [ 2020-05-01 ] | ||
|
Hello Marko, Interestingly enough, this happened tonight (I am based in the USA now) on 6 different systems: two sets of 3 replicas from 2 different masters, all of them on 10.3.20. I was able to gather some information from 3 of them, and I've been able to leave one up "stuck" (stack 3) in case you want me to extract something. But we can't keep it up in this state for long, so if you could reply fast it would be great. The 3 stacks differ from each other and from the first one. What I can observe now is that all of them wait on the same mutex, and when looking into the cursor in the function btr_cur_search_to_nth_level_func, all of them are accessing the change buffer in the system tablespace and page 4. Interestingly enough, the one we could keep up does not even seem to have a contending thread. There is only 1 active thread, and it seems to be stuck. Let me know your thoughts. stack1.txt | ||
| Comment by Marko Mäkelä [ 2020-06-01 ] | ||
|
Bernardo Perez, sorry, I have been busy working on 10.5. Meanwhile, I learned a new gdb trick:
In that format, we should see proper values instead of the obscured page_id=.... Could you please post new stack traces with that setting? | ||
| Comment by Bernardo Perez [ 2020-06-01 ] | ||
|
Hello Marko Mäkelä, no problem at all. We had to terminate those instances. If or when we encounter the same issue, we will extract the data as you suggested, and I will come back once we have the information. Regards, | ||
| Comment by Marko Mäkelä [ 2021-03-05 ] | ||
|
It is really difficult to tell, but it might be the case that the system tablespace (or the change buffer structures) had been corrupted due to | ||
| Comment by Marko Mäkelä [ 2022-11-10 ] | ||
|
For the root cause of the hang | ||
| Comment by Marko Mäkelä [ 2023-02-17 ] | ||
|
Bernardo Perez, we were finally able to reproduce a hang scenario similar to Can you experience this with MariaDB Server 10.5.19 or later? | ||
| Comment by Sergei Golubchik [ 2023-05-30 ] | ||
|
I'm going to close it now. If feedback comes, we'll reopen it.