[MDEV-24391] heap-use-after-free in fil_space_t::flush_low() Created: 2020-12-10  Updated: 2020-12-20  Resolved: 2020-12-11

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.7
Fix Version/s: 10.5.9

Type: Bug Priority: Blocker
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: ASAN, race, regression, rr-profile-analyzed

Issue Links:
Problem/Incident
is caused by MDEV-23855 InnoDB log checkpointing causes regre... Closed
Relates
relates to MDEV-23416 innodb.innodb_mutexes failed in build... Closed

 Description   

As part of MDEV-23855, when we reduced fil_system.mutex contention, we introduced a race condition between two fil_space_t::flush_low(). I analyzed this in an rr record session of an AddressSanitizer build.

We have two threads executing fil_space_t::flush_low() on the same tablespace, like this:

10.5 5eb539555b36a60944eefeb84d5d6d436ba61e63

Thread 153 hit Hardware watchpoint 8: -location fil_system.unflushed_spaces.size_
 
Old value = 1
New value = 0
sized_ilist<fil_space_t, unflushed_spaces_tag_t>::erase (pos=..., 
    this=0x565394181298 <fil_system+216>)
    at /Server/bb-10.5-MDEV-24369/include/ilist.h:252
252	    return BASE::erase(pos);
4: /x $space.n_pending._M_i = 0x20000001
5: $space.is_in_unflushed_spaces = false
6: fil_system.unflushed_spaces = {<ilist<fil_space_t, unflushed_spaces_tag_t>> = {sentinel_ = {next = 0x615000032318, prev = 0x615000032318}}, size_ = 0}
(rr) when
Current event: 782346
(rr) bt
#0  sized_ilist<fil_space_t, unflushed_spaces_tag_t>::erase (pos=..., 
    this=0x565394181298 <fil_system+216>)
    at /Server/bb-10.5-MDEV-24369/include/ilist.h:252
#1  sized_ilist<fil_space_t, unflushed_spaces_tag_t>::remove (value=..., 
    this=0x565394181298 <fil_system+216>)
    at /Server/bb-10.5-MDEV-24369/include/ilist.h:261
#2  fil_space_t::flush_low (this=this@entry=0x615000032318)
    at /Server/bb-10.5-MDEV-24369/storage/innobase/fil/fil0fil.cc:532
#3  0x00005653926e01cd in fil_flush_file_spaces ()
    at /Server/bb-10.5-MDEV-24369/storage/innobase/fil/fil0fil.cc:3520
#4  0x0000565392563f23 in buf_dblwr_t::write_completed (
    this=0x565394c02020 <buf_dblwr>)
    at /Server/bb-10.5-MDEV-24369/storage/innobase/buf/buf0dblwr.cc:486
#5  0x000056539259066b in buf_page_write_complete (request=...)
    at /Server/bb-10.5-MDEV-24369/storage/innobase/buf/buf0flu.cc:382
#6  0x00005653926f87cb in fil_aio_callback (request=...)
    at /Server/bb-10.5-MDEV-24369/storage/innobase/fil/fil0fil.cc:3474
#7  0x00005653920279b7 in io_callback (cb=<optimized out>)
    at /Server/bb-10.5-MDEV-24369/storage/innobase/os/os0file.cc:3895
#8  0x00005653928a43ab in tpool::simulated_aio::simulated_aio_callback (
    param=0x64000179ef80)
    at /Server/bb-10.5-MDEV-24369/tpool/aio_simulated.cc:162
(rr) thr a 5 bt
 
Thread 5 (Thread 1662862.1663094):
#0  __pthread_mutex_unlock_usercnt (decr=1, mutex=0x60d000005010) at pthread_mutex_unlock.c:58
#1  __GI___pthread_mutex_unlock (mutex=0x60d000005010) at pthread_mutex_unlock.c:357
#2  0x0000565392029012 in OSMutex::exit (this=0x60d000005008) at /Server/bb-10.5-MDEV-24369/storage/innobase/include/sync0types.h:384
#3  os_event::set (this=0x60d000004ff8) at /Server/bb-10.5-MDEV-24369/storage/innobase/os/os0event.cc:68
#4  os_event_set (event=0x60d000004ff8) at /Server/bb-10.5-MDEV-24369/storage/innobase/os/os0event.cc:451
#5  0x0000565391dc9d7e in TTASEventMutex<GenericPolicy>::exit (this=0x5653941811e0 <fil_system+32>) at /Server/bb-10.5-MDEV-24369/storage/innobase/include/sync0types.h:385
#6  PolicyMutex<TTASEventMutex<GenericPolicy> >::exit (this=0x5653941811e0 <fil_system+32>) at /Server/bb-10.5-MDEV-24369/storage/innobase/include/ib0mutex.h:564
#7  0x00005653926e01c5 in fil_flush_file_spaces () at /Server/bb-10.5-MDEV-24369/storage/innobase/fil/fil0fil.cc:3519
#8  0x0000565392583f13 in log_checkpoint () at /Server/bb-10.5-MDEV-24369/storage/innobase/buf/buf0flu.cc:1692
#9  0x00005653925a0337 in buf_flush_page_cleaner ()

When Thread 5 was checking, the flag was still set:

rescan:
  mutex_enter(&fil_system.mutex);
 
  for (fil_space_t &space : fil_system.unflushed_spaces)
  {
    if (space.needs_flush_not_stopping())
    {
      mutex_exit(&fil_system.mutex);
      space.flush_low();
      goto rescan;
    }
  }

But, while that thread is releasing the mutex, Thread 153 had cleared the flag, giving permission for fil_delete_tablespace() to proceed a little later. Then, once the fil_space_t object had been freed, ASAN would complain about heap-use-after-free when Thread 5 gets a chance to invoke space.flush_low().

A possible fix might be to ensure that as long as the page cleaner thread is running, it will be the only caller of fil_flush_file_spaces().



 Comments   
Comment by Matthias Leich [ 2020-12-11 ]

There were two issues during a test campaign  on 10.6
 
#         ERROR: AddressSanitizer: heap-use-after-free
# 2020-12-10T20:34:45 [1437773] |     #1 0x55653a27a67e in fil_space_t::acquire_low() /Server/bb-10.6-MDEV-21452/storage/innobase/include/fil0fil.h:522
# 2020-12-10T20:34:45 [1437773] |     #2 0x55653a3caeef in fil_space_t::flush_low() /Server/bb-10.6-MDEV-21452/storage/innobase/fil/fil0fil.cc:506
# 2020-12-10T20:34:45 [1437773] |     #3 0x55653a3e20ae in fil_flush_file_spaces() /Server/bb-10.6-MDEV-21452/storage/innobase/fil/fil0fil.cc:3400
# 2020-12-10T20:34:45 [1437773] |     #4 0x55653a2fb8b0 in log_checkpoint /Server/bb-10.6-MDEV-21452/storage/innobase/buf/buf0flu.cc:1686
# 2020-12-10T20:34:45 [1437773] |     #5 0x55653a2ffc2d in buf_flush_page_cleaner /Server/bb-10.6-MDEV-21452/storage/innobase/buf/buf0flu.cc:2171
 
#         ERROR: AddressSanitizer: heap-use-after-free
# 2020-12-10T18:26:05 [2964288] |     #1 0x55fae1749e3c in fil_space_t::flush_low() /Server/bb-10.6-MDEV-21452/storage/innobase/fil/fil0fil.cc:498
# 2020-12-10T18:26:05 [2964288] |     #2 0x55fae17610ae in fil_flush_file_spaces() /Server/bb-10.6-MDEV-21452/storage/innobase/fil/fil0fil.cc:3400
# 2020-12-10T18:26:05 [2964288] |     #3 0x55fae167a8b0 in log_checkpoint /Server/bb-10.6-MDEV-21452/storage/innobase/buf/buf0flu.cc:1686
# 2020-12-10T18:26:05 [2964288] |     #4 0x55fae167ec2d in buf_flush_page_cleaner /Server/bb-10.6-MDEV-21452/storage/innobase/buf/buf0flu.cc:2171
# 2020-12-10T18:26:05 [2964288] |     #5 0x7fa0005da608 in start_thread /build/glibc-ZN95T4/glibc-2.31/nptl/pthread_create.c:477
# 2020-12-10T18:26:05 [2964288] |     #6 0x7fa0001ae292 in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x122292)
 
Both seem to belong to the current bug.

Comment by Matthias Leich [ 2020-12-11 ]

Results of RQG testing on bb-10.5-MDEV-24391
- tests specialized on replaying the current bug or the trace sequences from the comment above: No replay
- tests with broad functional coverage: No new/unknown issues

Generated at Thu Feb 08 09:29:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.