Details
-
Bug
-
Status: Closed (View Workflow)
-
Blocker
-
Resolution: Fixed
-
10.5.7
Description
As part of MDEV-23855, when we reduced fil_system.mutex contention, we introduced a race condition between two fil_space_t::flush_low(). I analyzed this in an rr record session of an AddressSanitizer build.
We have two threads executing fil_space_t::flush_low() on the same tablespace, like this:
10.5 5eb539555b36a60944eefeb84d5d6d436ba61e63 |
Thread 153 hit Hardware watchpoint 8: -location fil_system.unflushed_spaces.size_
|
|
Old value = 1
|
New value = 0
|
sized_ilist<fil_space_t, unflushed_spaces_tag_t>::erase (pos=...,
|
this=0x565394181298 <fil_system+216>)
|
at /Server/bb-10.5-MDEV-24369/include/ilist.h:252
|
252 return BASE::erase(pos);
|
4: /x $space.n_pending._M_i = 0x20000001
|
5: $space.is_in_unflushed_spaces = false
|
6: fil_system.unflushed_spaces = {<ilist<fil_space_t, unflushed_spaces_tag_t>> = {sentinel_ = {next = 0x615000032318, prev = 0x615000032318}}, size_ = 0}
|
(rr) when
|
Current event: 782346
|
(rr) bt
|
#0 sized_ilist<fil_space_t, unflushed_spaces_tag_t>::erase (pos=...,
|
this=0x565394181298 <fil_system+216>)
|
at /Server/bb-10.5-MDEV-24369/include/ilist.h:252
|
#1 sized_ilist<fil_space_t, unflushed_spaces_tag_t>::remove (value=...,
|
this=0x565394181298 <fil_system+216>)
|
at /Server/bb-10.5-MDEV-24369/include/ilist.h:261
|
#2 fil_space_t::flush_low (this=this@entry=0x615000032318)
|
at /Server/bb-10.5-MDEV-24369/storage/innobase/fil/fil0fil.cc:532
|
#3 0x00005653926e01cd in fil_flush_file_spaces ()
|
at /Server/bb-10.5-MDEV-24369/storage/innobase/fil/fil0fil.cc:3520
|
#4 0x0000565392563f23 in buf_dblwr_t::write_completed (
|
this=0x565394c02020 <buf_dblwr>)
|
at /Server/bb-10.5-MDEV-24369/storage/innobase/buf/buf0dblwr.cc:486
|
#5 0x000056539259066b in buf_page_write_complete (request=...)
|
at /Server/bb-10.5-MDEV-24369/storage/innobase/buf/buf0flu.cc:382
|
#6 0x00005653926f87cb in fil_aio_callback (request=...)
|
at /Server/bb-10.5-MDEV-24369/storage/innobase/fil/fil0fil.cc:3474
|
#7 0x00005653920279b7 in io_callback (cb=<optimized out>)
|
at /Server/bb-10.5-MDEV-24369/storage/innobase/os/os0file.cc:3895
|
#8 0x00005653928a43ab in tpool::simulated_aio::simulated_aio_callback (
|
param=0x64000179ef80)
|
at /Server/bb-10.5-MDEV-24369/tpool/aio_simulated.cc:162
|
…
|
(rr) thr a 5 bt
|
|
Thread 5 (Thread 1662862.1663094):
|
#0 __pthread_mutex_unlock_usercnt (decr=1, mutex=0x60d000005010) at pthread_mutex_unlock.c:58
|
#1 __GI___pthread_mutex_unlock (mutex=0x60d000005010) at pthread_mutex_unlock.c:357
|
#2 0x0000565392029012 in OSMutex::exit (this=0x60d000005008) at /Server/bb-10.5-MDEV-24369/storage/innobase/include/sync0types.h:384
|
#3 os_event::set (this=0x60d000004ff8) at /Server/bb-10.5-MDEV-24369/storage/innobase/os/os0event.cc:68
|
#4 os_event_set (event=0x60d000004ff8) at /Server/bb-10.5-MDEV-24369/storage/innobase/os/os0event.cc:451
|
#5 0x0000565391dc9d7e in TTASEventMutex<GenericPolicy>::exit (this=0x5653941811e0 <fil_system+32>) at /Server/bb-10.5-MDEV-24369/storage/innobase/include/sync0types.h:385
|
#6 PolicyMutex<TTASEventMutex<GenericPolicy> >::exit (this=0x5653941811e0 <fil_system+32>) at /Server/bb-10.5-MDEV-24369/storage/innobase/include/ib0mutex.h:564
|
#7 0x00005653926e01c5 in fil_flush_file_spaces () at /Server/bb-10.5-MDEV-24369/storage/innobase/fil/fil0fil.cc:3519
|
#8 0x0000565392583f13 in log_checkpoint () at /Server/bb-10.5-MDEV-24369/storage/innobase/buf/buf0flu.cc:1692
|
#9 0x00005653925a0337 in buf_flush_page_cleaner ()
|
When Thread 5 was checking, the flag was still set:
rescan:
|
mutex_enter(&fil_system.mutex);
|
|
for (fil_space_t &space : fil_system.unflushed_spaces) |
{
|
if (space.needs_flush_not_stopping()) |
{
|
mutex_exit(&fil_system.mutex);
|
space.flush_low();
|
goto rescan; |
}
|
}
|
But, while that thread is releasing the mutex, Thread 153 had cleared the flag, giving permission for fil_delete_tablespace() to proceed a little later. Then, once the fil_space_t object had been freed, ASAN would complain about heap-use-after-free when Thread 5 gets a chance to invoke space.flush_low().
A possible fix might be to ensure that as long as the page cleaner thread is running, it will be the only caller of fil_flush_file_spaces().
Attachments
Issue Links
- is caused by
-
MDEV-23855 InnoDB log checkpointing causes regression for write-heavy OLTP
- Closed
- relates to
-
MDEV-23416 innodb.innodb_mutexes failed in buildbot with extra warning
- Closed