[MDEV-32861] InnoDB hangs when running out of I/O slots Created: 2023-11-22 Updated: 2023-12-05 Resolved: 2023-11-22 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5, 10.6, 10.11, 11.0, 11.1, 11.2, 11.3 |
| Fix Version/s: | 10.5.24, 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3, 11.3.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | affects-tests, hang, regression | ||
| Issue Links: |
|
| Description |
|
To reduce the rate of the infamous 'fake hangs' in rr replay I experimented with the following patch:
With this patch, the following invocation would hang on bootstrap:
The first reason turns out to be missing signaling of a condition variable:
As far as I can tell, the above condition was broken in
Even with the above patch applied, the bootstrap would hang. The remaining reason is
A further problem would be that many encryption related tests would end up in an infinite loop in buf_pool_t::io_buf_t::reserve(), which is something that had been last changed in
When this function is called from buf_page_decrypt_after_read(), we must pass wait_for_reads=false, because the current thread is itself executing in a read completion callback. The os_aio_wait_until_no_pending_reads(); after the continue; would then wait for the very read that this thread is completing, and hang. In any case, we must schedule a write of a possibly partial doublewrite batch so that os_aio_wait_until_no_pending_writes(); has a chance of finishing. With these fixes, a run with OS_AIO_N_PENDING_IOS_PER_THREAD=1 and
completes fine in both 10.5 and 10.6, with the exception of the test sys_vars.innodb_read_io_threads_basic. The motivation for setting these parameters is to give rr record fewer chances to schedule threads unfairly. Such unfair scheduling often leads to 'fake hangs' where write_slots->m_cache.m_pos (or less often read_slots->m_cache.m_pos) is nonzero and some InnoDB threads wait for an extremely long time for a buffer page latch. |
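As an illustration of the first problem described above (a waiter that is never woken because the condition variable is not signaled when a slot is returned), here is a minimal C++ sketch of the general pattern. It is not the tpool::cache code and not the patch from this ticket: SlotCache, its members, and run_demo() are hypothetical names invented for this example, and it uses std::condition_variable rather than the server's own primitives.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical fixed-size slot cache; NOT the real tpool::cache.
class SlotCache {
  std::mutex m_mtx;
  std::condition_variable m_cv;
  std::vector<int> m_free;   // indexes of currently free slots
  std::size_t m_waiters = 0; // threads currently inside get(), possibly blocked

public:
  explicit SlotCache(int n_slots) {
    for (int i = 0; i < n_slots; i++)
      m_free.push_back(i);
  }

  // Take a free slot, blocking while none is available.
  int get() {
    std::unique_lock<std::mutex> lk(m_mtx);
    m_waiters++;
    m_cv.wait(lk, [this] { return !m_free.empty(); });
    m_waiters--;
    int slot = m_free.back();
    m_free.pop_back();
    return slot;
  }

  // Return a slot. Whenever a waiter exists it must be woken up;
  // if this notification were skipped, a thread blocked in get()
  // could sleep forever once all slots had been in use.
  void put(int slot) {
    std::lock_guard<std::mutex> lk(m_mtx);
    m_free.push_back(slot);
    if (m_waiters)
      m_cv.notify_one();
  }
};

// Run n_threads workers that each take and return a slot `rounds` times,
// all sharing a cache with a single slot (the most contended case).
// Returns the total number of completed rounds.
int run_demo(int n_threads, int rounds) {
  SlotCache cache(1);
  std::mutex count_mtx;
  int completed = 0;
  std::vector<std::thread> workers;
  for (int t = 0; t < n_threads; t++)
    workers.emplace_back([&] {
      for (int r = 0; r < rounds; r++) {
        int slot = cache.get();
        {
          std::lock_guard<std::mutex> lk(count_mtx);
          completed++;
        }
        cache.put(slot);
      }
    });
  for (auto &w : workers)
    w.join();
  return completed;
}
```

Calling run_demo(4, 100) exercises the single-slot case, which is the situation that a setting like OS_AIO_N_PENDING_IOS_PER_THREAD=1 makes common; with the notification in put() in place, every worker eventually gets the slot and the call returns.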
| Comments |
| Comment by Vladislav Vaintroub [ 2023-11-22 ] |
|
Good catch in tpool::cache bug. Seems good to push for me. |