Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 10.5, 10.6, 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2
Description
MDEV-26827 removed duplicated bookkeeping for io_slots::pending_io_count(). The merge of that change to 10.8 introduced a debug assertion in mariadb-backup that would fail rather often in a MemorySanitizer build.
For some reason, I did not manage to get a core dump for this problem, even after removing calls to sigaction(2). Finally, I was able to figure out the cause by applying the following patch:
diff --git a/extra/mariabackup/xtrabackup.cc b/extra/mariabackup/xtrabackup.cc
index a9891444eaf..3329be7a6df 100644
--- a/extra/mariabackup/xtrabackup.cc
+++ b/extra/mariabackup/xtrabackup.cc
@@ -2396,7 +2396,7 @@ static bool innodb_init()
   buf_flush_sync();
   recv_sys.debug_free();
   ut_ad(!os_aio_pending_reads());
-  ut_ad(!os_aio_pending_writes());
+  if (os_aio_pending_writes()) kill(0, SIGSTOP);
   ut_d(mysql_mutex_lock(&buf_pool.flush_list_mutex));
   ut_ad(!buf_pool.get_oldest_modification(0));
   ut_d(mysql_mutex_unlock(&buf_pool.flush_list_mutex));
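The kill(0, SIGSTOP) replacement works because kill() with pid 0 signals every process in the caller's process group, and SIGSTOP cannot be caught or ignored, so the process freezes with its state intact until a debugger is attached. A minimal standalone sketch of the trick (the function name and condition here are illustrative, not MariaDB code):

#include <signal.h> // POSIX kill() and SIGSTOP

/* Freeze the whole process group when an "impossible" condition is
   observed, so that gdb can be attached to the live state instead of
   relying on a core dump (which was not being produced here). */
static void freeze_for_debugger(bool broken_invariant)
{
  if (broken_invariant)
    kill(0, SIGSTOP); /* pid 0 = every process in our process group */
}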
With this patch and the following invocation:
LD_LIBRARY_PATH=/home/marko/libmsan-15 MSAN_OPTIONS=abort_on_error=1 MSAN_SYMBOLIZER_PATH=/home/marko/bin/llvm-symbolizer-msan nice ./mtr --repeat=10 --parallel=90 mariabackup.alter_copy_race{,,,,,,,,,}{,,,,,,,,}
(the shell brace expansion repeats the test name 10×9 = 90 times, one copy for each of the --parallel=90 workers)
I got 4 hung mariabackup processes near the end of the run. I attached gdb to one of them. Sure enough, one of the threads was executing the kill(). We had write_slots.m_cache.m_pos=1, corresponding to the following thread:
10.11 2ecc0443ec62b64cdbf9867e560ac3b40341a570
Thread 5 (Thread 0x7fc442bec6c0 (LWP 1295375) "mariabackup"):
#0  0x000055d34d0ee935 in __sanitizer::internal_mmap(void*, unsigned long, int, int, int, unsigned long long) ()
#1  0x000055d34d0f09f9 in __sanitizer::MmapFixedImpl(unsigned long, unsigned long, bool, char const*) ()
#2  0x000055d34d110144 in __sanitizer::SizeClassAllocator64<__msan::AP64>::PopulateFreeArray(__sanitizer::AllocatorStats*, unsigned long, __sanitizer::SizeClassAllocator64<__msan::AP64>::RegionInfo*, unsigned long) ()
#3  0x000055d34d10ff81 in __sanitizer::SizeClassAllocator64<__msan::AP64>::GetFromAllocator(__sanitizer::AllocatorStats*, unsigned long, unsigned int*, unsigned long) ()
#4  0x000055d34d10fb1a in __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__msan::AP64> >::Refill(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__msan::AP64> >::PerClass*, __sanitizer::SizeClassAllocator64<__msan::AP64>*, unsigned long) ()
#5  0x000055d34d10f7f8 in __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__msan::AP64>, __sanitizer::LargeMmapAllocatorPtrArrayDynamic>::Allocate(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__msan::AP64> >*, unsigned long, unsigned long) ()
#6  0x000055d34d10ebf9 in __msan::MsanAllocate(__sanitizer::StackTrace*, unsigned long, unsigned long, bool) ()
#7  0x000055d34d10eb5a in __msan::msan_malloc(unsigned long, __sanitizer::StackTrace*) ()
#8  0x000055d34d119023 in malloc ()
#9  0x000055d3505522b7 in DbugMalloc (size=568) at /mariadb/10.11/dbug/dbug.c:2074
#10 PushState (cs=0x704000008000) at /mariadb/10.11/dbug/dbug.c:1599
#11 0x000055d350555989 in _db_push_ (control=0x55d34c788b7d "") at /mariadb/10.11/dbug/dbug.c:867
#12 0x000055d3504e02ef in safe_mutex_lock (mp=0x55d3533a8fd8 <fil_system+40>, my_flags=0, file=0x55d34c784541 "/mariadb/10.11/storage/innobase/fil/fil0fil.cc", line=2682) at /mariadb/10.11/mysys/thr_mutex.c:236
#13 0x000055d34f76acc6 in inline_mysql_mutex_lock (src_line=2682, that=<optimized out>, src_file=<optimized out>) at /mariadb/10.11/include/mysql/psi/mysql_thread.h:750
#14 fil_node_t::complete_write (this=0x707000000328) at /mariadb/10.11/storage/innobase/fil/fil0fil.cc:2682
#15 0x000055d34f75917f in IORequest::write_complete (this=<optimized out>, io_error=<optimized out>) at /mariadb/10.11/storage/innobase/fil/fil0fil.cc:2819
#16 0x000055d34fcfb791 in write_io_callback (c=0x7fc44d450000) at /mariadb/10.11/storage/innobase/os/os0file.cc:3465
#17 0x000055d350267d46 in tpool::task_group::execute (this=0x7130000002e0, t=0x7fc44d450048) at /mariadb/10.11/tpool/task_group.cc:70
#18 0x000055d3502580ef in tpool::thread_pool_generic::worker_main (this=0x718000000000, thread_var=0x730000010080) at /mariadb/10.11/tpool/tpool_generic.cc:583
This thread had already released the page latch, which allowed buf_flush_page_cleaner() to finish. It had not yet invoked the following:
node->complete_write();
node->space->release();
This could mean that during normal operation, a log checkpoint would fail to invoke fdatasync() or fsync() to make the last writes to a file durable. The two calls should probably be made part of buf_page_write_complete(), right before the page latch is released.
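If the fix takes that shape, it might look roughly like this. This is a hypothetical sketch only: the real buf_page_write_complete() has a different signature, and release_page_latch() stands in for the actual latch release.

/* Hypothetical sketch of the suggested reordering; not actual MariaDB
   code. */
static void page_write_complete_sketch(fil_node_t *node, buf_page_t *page)
{
  /* Do the fil_node_t bookkeeping while the page latch still blocks
     buf_flush_page_cleaner(), so that a log checkpoint cannot slip in
     between and skip the fdatasync()/fsync() of the last writes. */
  node->complete_write();
  node->space->release();
  /* Only now release the page latch. */
  release_page_latch(page); /* hypothetical stand-in */
}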
The assertion expression would fail until write_slots->release(cb) was called in write_io_callback(). Perhaps we should replace the debug assertions with some wait loops.
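A minimal sketch of such a wait loop, assuming simple polling (the eventual fix may well use a condition variable or the existing I/O wait machinery instead):

#include <chrono>
#include <thread>

/* Sketch: instead of asserting ut_ad(!os_aio_pending_writes()), wait
   until the write completion callbacks have invoked
   write_slots->release(cb) and the pending-write count drains to 0. */
static void wait_for_pending_writes()
{
  while (os_aio_pending_writes())
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}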
Issue Links
- relates to: MDEV-26827 Make page flushing even faster (Closed)