[MDEV-22340] Server hangs on ibuf_merge_or_delete_for_page during shutdown with innodb_fast_shutdown=0 Created: 2020-04-22  Updated: 2023-05-30  Resolved: 2023-05-30

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: 10.3.20
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Bernardo Perez Assignee: Marko Mäkelä
Resolution: Incomplete Votes: 1
Labels: None
Environment:

Linux


Attachments: Text File extra_info.txt     Text File stack.txt     Text File stack1.txt     Text File stack2.txt     Text File stack3.txt    
Issue Links:
Relates
relates to MDEV-30009 InnoDB shutdown hangs when the change... Closed
relates to MDEV-26871 Shutdown hangs after upgrade to 10.5.12 Closed

 Description   

Initially I thought I was hitting MDEV-20934, but I was asked to file a new bug.

I seem to have a case that matches this behavior on one of the versions that, in theory, fixes it.

I've attached the stack.

Here is some extra info:

Thread 4 and 5 seem to be both on ibuf_merge_or_delete_for_page.

Thread 4 seems to be a thread dumping the buffer pool.
Thread 5 seems to be in the state you describe on this JIRA.

They seem to be working on a clustered index from the change buffer.

(gdb) p cursor->index->table->space->name
$25 = 0x2b29f7036230 "innodb_system"
 
(gdb) p index->table->name
$26 = {m_name = 0x2b2a367ff340 "innodb_change_buffer", static part_suffix = "#P#"}
 
(gdb) p index->name
$27 = {m_name = 0x2b2a087db270 "CLUST_IND"}

I've printed dictionary operation locks and sys mutexes:

(gdb) print dict_operation_lock
$36 = {lock_word = 536870912, waiters = 0, sx_recursive = 0, writer_is_wait_ex = false, writer_thread = 0, event = 0x2b2c5deec8d0, wait_ex_event = 0x2b2c5deec940,
  cfile_name = 0x55f9dbc4d608 "/local/p4clients/pkgbuild-9zvR5/workspace/src/RDSMariaDB/storage/innobase/dict/dict0dict.cc",
  last_x_file_name = 0x55f9dbc41fe0 "/local/p4clients/pkgbuild-9zvR5/workspace/src/RDSMariaDB/storage/innobase/srv/srv0srv.cc",
  cline = 1097, is_block_lock = 0, last_x_line = 2025, count_os_wait = 0,
  list = {prev = 0x2b29f7027a78, next = 0x2b2a2c3f4120}, pfs_psi = 0x2b2a0800d680}
 
(gdb) print dict_sys.mutex
$37 = {m_impl = {m_lock_word = 0, m_event = 0x2b2c5deec860,
  m_policy = {m_count = {m_spins = 0, m_waits = 0, m_calls = 0, m_enabled = false}, m_id = LATCH_ID_DICT_SYS}},
  m_ptr = 0x2b29f70bca80}
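As an aside, the raw latch values above can be sanity-checked. A hedged decoding sketch (the X_LOCK_DECR constant is an assumption based on the InnoDB rw-lock implementation of this era, not something taken from the trace):

```python
# Hedged sketch: in InnoDB's rw_lock_t an uncontended latch has
# lock_word == X_LOCK_DECR, and in ib_mutex_t an uncontended mutex has
# m_lock_word == 0. X_LOCK_DECR here is assumed from sync0rw.h of this
# era; verify against the exact source tree before relying on it.
X_LOCK_DECR = 0x20000000

dict_operation_lock_word = 536870912  # $36 above
dict_sys_mutex_lock_word = 0          # $37 above

print(dict_operation_lock_word == X_LOCK_DECR)  # True -> rw-lock free
print(dict_sys_mutex_lock_word == 0)            # True -> mutex free
```

If that decoding holds, neither dict_operation_lock nor dict_sys.mutex was held at the time, which is consistent with waiters = 0 in the dump above.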

Locks of the two threads:

thread 4

(gdb) p lock->waiters
$59 = 1
 
(gdb) p lock->event
$56 = (os_event_t) 0x2b2ac905a9b0
 
(gdb) p lock->wait_ex_event
$57 = (os_event_t) 0x2b2ac905aa20

thread 5

(gdb) p lock->waiters
$59 = 1
 
(gdb) p lock->event
$54 = (os_event_t) 0x2b2ac905a9b0
 
(gdb) p lock->wait_ex_event
$55 = (os_event_t) 0x2b2ac905aa20



 Comments   
Comment by Bernardo Perez [ 2020-04-22 ]

The additional information requested has been added to extra_info.txt. Also, the server shows literally no I/O activity, so it is not I/O bound. We have information on the state of I/O activity when this happened, and it does not show any resource contention on I/O.

Comment by Marko Mäkelä [ 2020-04-29 ]

Bernardo Perez, thank you.
extra_info.txt suggests that there is some sort of a race condition on page 8865 in the system tablespace, for which buffered changes exist in the change buffer (also in the system tablespace). The only interesting waits in stack.txt are in Threads 4 and 5.

Can you double-check that buf_page_io_complete() was really invoked for the same page number in both threads? Both threads are thankfully invoking buf_read_page_low() with sync=true, so that the completion function will be called in the same thread.
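To illustrate the point about sync=true, here is a minimal sketch in Python. It is purely illustrative, NOT InnoDB code: read_page() stands in for buf_read_page_low() and on_complete() for buf_page_io_complete().

```python
# Illustrative sketch only: shows why a synchronous read runs its
# completion callback in the requesting thread, so both the read and
# the completion logic appear in that thread's backtrace.
import threading

completed_in = []  # records (requesting thread, completing thread)

def on_complete(requester, data):
    # stand-in for buf_page_io_complete()
    completed_in.append((requester, threading.current_thread().name))

def read_page(page_id, sync):
    # stand-in for buf_read_page_low(); the "read" is just a string here
    data = "page-%d" % page_id
    if sync:
        # sync=true: the completion function runs right here, in the
        # very thread that requested the read
        on_complete(threading.current_thread().name, data)
    else:
        # async: completion runs in a separate I/O thread instead
        t = threading.Thread(target=on_complete,
                             args=(threading.current_thread().name, data))
        t.start()
        t.join()

read_page(8865, sync=True)
req, comp = completed_in[0]
print(req == comp)  # True: a sync read completes in the requesting thread
```

This is why a single thread's stack trace is enough to see both the read request and the completion handling for a synchronous read.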

This suggests that there could be a bug in buf_read_page_background() or buf_page_get_gen(), causing the hang if both threads are invoked on the same page number.

The bug might not be limited to loading buffer pool dumps. The function buf_read_page_background() is also invoked by read-ahead operations.

Can you repeat this bug easily?

Comment by Marko Mäkelä [ 2020-04-29 ]

Bernardo Perez, while you detected the hang on shutdown, I suspect that these two threads must have been hung ever since startup. If you attempted to access page 8865 from SQL, the thread handling that connection should hang as well.

Comment by Bernardo Perez [ 2020-05-01 ]

Hello Marko,

Interestingly enough, this happened tonight (I am based in the USA now) on 6 different systems: two sets of 3 replicas from 2 different masters.

All of them are on 10.3.20. I was able to gather some information from 3 of them, and I've been able to leave one up "stuck" (stack 3) in case you want me to extract something. But we can't keep it in this state for long, so if you could reply fast it would be great.

The 3 stacks differ from each other and from the first one.

What I can observe now is that all of them wait on the same mutex, and when looking into the cursor in btr_cur_search_to_nth_level_func, all of them are accessing the change buffer in the system tablespace, at page 4.

Interestingly enough, the one we could keep up does not even seem to have a contending thread; there is only 1 active thread, which seems to be stuck.

Let me know your thoughts. stack1.txt stack2.txt stack3.txt

Comment by Marko Mäkelä [ 2020-06-01 ]

Bernardo Perez, sorry, I have been busy working on 10.5. Meanwhile, I learned a new gdb trick:

set print frame-arguments all
thread apply all backtrace

In that format, we should see proper values instead of the obscured page_id=.... Could you please post new stack traces with that setting?
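For later reference, the same trick can be run non-interactively. A possible gdb command file for this (the file name and redirection are illustrative, not from the original comment):

```gdb
# commands.gdb — run with: gdb -p <server-pid> -x commands.gdb -batch > stack_full.txt
set print frame-arguments all
thread apply all backtrace
```

Running it in batch mode detaches from the server as soon as the backtraces are collected, minimizing the time the process is stopped.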

Comment by Bernardo Perez [ 2020-06-01 ]

Hello Marko Mäkelä no problem at all. We had to terminate those instances. If/when we encounter the same issue we will follow your request to extract the data as you suggested.

I will come back once we get the information.

Regards,

Comment by Marko Mäkelä [ 2021-03-05 ]

It is really difficult to tell, but it might be the case that the system tablespace (or the change buffer structures) had been corrupted due to MDEV-24449. Without more data, it is hard to draw any conclusions.

Comment by Marko Mäkelä [ 2022-11-10 ]

MDEV-27734 disabled the change buffer by default because of corruption like MDEV-27765.

For the root cause of the hang MDEV-20934, we now have a better hypothesis.

Comment by Marko Mäkelä [ 2023-02-17 ]

Bernardo Perez, we were finally able to reproduce a hang scenario similar to MDEV-20934 in our internal testing, in MDEV-30009. I can’t be sure if this is exactly what you experienced, but I think that it is plausible.

Do you still experience this with MariaDB Server 10.5.19 or later?

Comment by Sergei Golubchik [ 2023-05-30 ]

I'm going to close it now. If feedback comes, we'll reopen.

Generated at Thu Feb 08 09:13:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.