[MDEV-32511] Race condition between page write completion and log checkpoint Created: 2023-10-18  Updated: 2023-10-18  Resolved: 2023-10-18

Status: Closed
Project: MariaDB Server
Component/s: Backup, Storage Engine - InnoDB
Affects Version/s: 10.5, 10.6, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2
Fix Version/s: 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3, 11.2.2

Type: Bug Priority: Blocker
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: MSAN, corruption, race

Issue Links:
Relates
relates to MDEV-26827 Make page flushing even faster Closed

 Description   

MDEV-26827 removed duplicated bookkeeping for io_slots::pending_io_count(). The merge of that change to 10.8 introduced a debug assertion in mariadb-backup, which would fail rather often in a MemorySanitizer build.

For some reason, I did not manage to get a core dump for this problem, even after I removed calls to sigaction(2). Finally, I was able to figure out the cause by applying the following patch:

diff --git a/extra/mariabackup/xtrabackup.cc b/extra/mariabackup/xtrabackup.cc
index a9891444eaf..3329be7a6df 100644
--- a/extra/mariabackup/xtrabackup.cc
+++ b/extra/mariabackup/xtrabackup.cc
@@ -2396,7 +2396,7 @@ static bool innodb_init()
   buf_flush_sync();
   recv_sys.debug_free();
   ut_ad(!os_aio_pending_reads());
-  ut_ad(!os_aio_pending_writes());
+  if (os_aio_pending_writes()) kill(0, SIGSTOP);
   ut_d(mysql_mutex_lock(&buf_pool.flush_list_mutex));
   ut_ad(!buf_pool.get_oldest_modification(0));
   ut_d(mysql_mutex_unlock(&buf_pool.flush_list_mutex));

With this patch and the following invocation:

LD_LIBRARY_PATH=/home/marko/libmsan-15 MSAN_OPTIONS=abort_on_error=1 MSAN_SYMBOLIZER_PATH=/home/marko/bin/llvm-symbolizer-msan nice ./mtr --repeat=10 --parallel=90 mariabackup.alter_copy_race{,,,,,,,,,}{,,,,,,,,}

I got 4 hung mariabackup processes near the end of the run. I attached gdb to one of them. Sure enough, one of the threads was executing the kill(). We had write_slots.m_cache.m_pos=1, corresponding to the following thread:

10.11 2ecc0443ec62b64cdbf9867e560ac3b40341a570

Thread 5 (Thread 0x7fc442bec6c0 (LWP 1295375) "mariabackup"):
#0  0x000055d34d0ee935 in __sanitizer::internal_mmap(void*, unsigned long, int, int, int, unsigned long long) ()
#1  0x000055d34d0f09f9 in __sanitizer::MmapFixedImpl(unsigned long, unsigned long, bool, char const*) ()
#2  0x000055d34d110144 in __sanitizer::SizeClassAllocator64<__msan::AP64>::PopulateFreeArray(__sanitizer::AllocatorStats*, unsigned long, __sanitizer::SizeClassAllocator64<__msan::AP64>::RegionInfo*, unsigned long) ()
#3  0x000055d34d10ff81 in __sanitizer::SizeClassAllocator64<__msan::AP64>::GetFromAllocator(__sanitizer::AllocatorStats*, unsigned long, unsigned int*, unsigned long) ()
#4  0x000055d34d10fb1a in __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__msan::AP64> >::Refill(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__msan::AP64> >::PerClass*, __sanitizer::SizeClassAllocator64<__msan::AP64>*, unsigned long) ()
#5  0x000055d34d10f7f8 in __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__msan::AP64>, __sanitizer::LargeMmapAllocatorPtrArrayDynamic>::Allocate(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__msan::AP64> >*, unsigned long, unsigned long) ()
#6  0x000055d34d10ebf9 in __msan::MsanAllocate(__sanitizer::StackTrace*, unsigned long, unsigned long, bool) ()
#7  0x000055d34d10eb5a in __msan::msan_malloc(unsigned long, __sanitizer::StackTrace*) ()
#8  0x000055d34d119023 in malloc ()
#9  0x000055d3505522b7 in DbugMalloc (size=568) at /mariadb/10.11/dbug/dbug.c:2074
#10 PushState (cs=0x704000008000) at /mariadb/10.11/dbug/dbug.c:1599
#11 0x000055d350555989 in _db_push_ (control=0x55d34c788b7d "") at /mariadb/10.11/dbug/dbug.c:867
#12 0x000055d3504e02ef in safe_mutex_lock (mp=0x55d3533a8fd8 <fil_system+40>, my_flags=0, file=0x55d34c784541 "/mariadb/10.11/storage/innobase/fil/fil0fil.cc", line=2682) at /mariadb/10.11/mysys/thr_mutex.c:236
#13 0x000055d34f76acc6 in inline_mysql_mutex_lock (src_line=2682, that=<optimized out>, src_file=<optimized out>) at /mariadb/10.11/include/mysql/psi/mysql_thread.h:750
#14 fil_node_t::complete_write (this=0x707000000328) at /mariadb/10.11/storage/innobase/fil/fil0fil.cc:2682
#15 0x000055d34f75917f in IORequest::write_complete (this=<optimized out>, io_error=<optimized out>) at /mariadb/10.11/storage/innobase/fil/fil0fil.cc:2819
#16 0x000055d34fcfb791 in write_io_callback (c=0x7fc44d450000) at /mariadb/10.11/storage/innobase/os/os0file.cc:3465
#17 0x000055d350267d46 in tpool::task_group::execute (this=0x7130000002e0, t=0x7fc44d450048) at /mariadb/10.11/tpool/task_group.cc:70
#18 0x000055d3502580ef in tpool::thread_pool_generic::worker_main (this=0x718000000000, thread_var=0x730000010080) at /mariadb/10.11/tpool/tpool_generic.cc:583

This thread had already released the page latch, which allowed the buf_flush_page_cleaner() to finish. It had not yet invoked the following:

  node->complete_write();
  node->space->release();

This means that during normal operation, a log checkpoint could fail to invoke fdatasync() or fsync() to make the last writes to a file durable. The two calls should probably be made part of buf_page_write_complete(), right before the page latch is released.

The assertion expression would keep failing until write_slots->release(cb) had been called in write_io_callback(). Perhaps we should replace the debug assertions with wait loops.



 Comments   
Comment by Marko Mäkelä [ 2023-10-18 ]

I think that the following fixes the race condition between log checkpoint and page write completion:

diff --git a/storage/innobase/fil/fil0fil.cc b/storage/innobase/fil/fil0fil.cc
index 27f9a35fb4c..bcd18e73926 100644
--- a/storage/innobase/fil/fil0fil.cc
+++ b/storage/innobase/fil/fil0fil.cc
@@ -2804,6 +2804,7 @@ void IORequest::write_complete(int io_error) const
   ut_ad(fil_validate_skip());
   ut_ad(node);
   ut_ad(is_write());
+  node->complete_write();
 
   if (!bpage)
   {
@@ -2816,7 +2817,6 @@ void IORequest::write_complete(int io_error) const
   else
     buf_page_write_complete(*this, io_error);
 
-  node->complete_write();
   node->space->release();
 }
 

The debug assertions in mariabackup as well as in create_log_file() and srv_start() are best replaced with calls to os_aio_wait_until_no_pending_writes(false), so that a debug build would hang (rather than crash) in case there is some problem. In non-debug builds, the buf_page_t::lock should be held until the write has been properly registered.

Comment by Marko Mäkelä [ 2023-10-18 ]

The race condition on write completion affects 10.5 as well. The debug assertion is only present in 10.6 onwards, related to MDEV-26827.

Generated at Thu Feb 08 10:31:55 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.