[MDEV-27466] Deadlock when purge thread races the flush thread for page lock under InnoDB redo log space pressure Created: 2022-01-11  Updated: 2022-03-24  Resolved: 2022-03-15

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.4.20
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Sasha Pachev Assignee: Marko Mäkelä
Resolution: Won't Fix Votes: 0
Labels: hang, not-10.5

Attachments: tpcc1.pdf (PDF), tpcc2.pdf (PDF)
Issue Links:
Relates
relates to MDEV-14550 Error log flood : "InnoDB: page_clea... Closed
relates to MDEV-23399 10.5 performance regression with IO-b... Closed

 Description   

This is not a full bug report yet, though I do have a reproducible scenario on a test server. I am working on a smaller test case.

To cause the problem, we do this:

  1. Set innodb_log_file_size to 4M to quickly simulate log space shortage.
  2. Use a table with 72M rows.
  3. Alter it with the INSTANT algorithm, adding a few columns.
  4. Set innodb_buffer_pool_size to 1G.
  5. Run an update modifying one of the instant-added columns (everything goes well; the query takes 3 minutes).
  6. Set innodb_buffer_pool_size to 512M and wait for the dust to settle so that it actually shows as 512M.
  7. Re-run the above query.
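Sketched as a script (illustrative only: t1 and the column names are placeholders, and innodb_log_file_size is not dynamic, so it and the initial 1G buffer pool must be set in the server configuration before startup):

```sql
-- my.cnf before startup (illustrative): innodb_log_file_size=4M,
--                                       innodb_buffer_pool_size=1G
ALTER TABLE t1 ADD COLUMN c1 INT DEFAULT 0, ADD COLUMN c2 INT DEFAULT 0,
      ALGORITHM=INSTANT;                 -- instant ADD COLUMN on the 72M-row table
UPDATE t1 SET c1 = 1;                    -- completes normally, ~3 minutes
SET GLOBAL innodb_buffer_pool_size = 536870912;  -- 512M; wait for the resize
                                                 -- to finish before continuing
UPDATE t1 SET c1 = 2;                    -- now runs into the deadlock timeout
```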

This time we go into the deadlock timeout, and from the gdb trace we can see this race:

Thread 23 (Thread 0x7f39f4ff5700 (LWP 38128)):
#0  0x00007f3a20687a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000560a685f3b89 in wait (this=0x7f3a0c429bb0)
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/include/sync0types.h:463
#2  os_event::wait_low(long) () at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/os/os0event.cc:325
#3  0x0000560a685f3e65 in os_event_wait_low (event=<optimized out>, reset_sig_count=<optimized out>)
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/os/os0event.cc:502
#4  0x0000560a68670c21 in sync_array_wait_event(sync_array_t*, sync_cell_t*&) ()
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/include/ib0mutex.h:496
#5  0x0000560a6867350a in rw_lock_sx_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int) ()
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/sync/sync0rw.cc:833
#6  0x0000560a686ea9c9 in pfs_rw_lock_sx_lock_func (pass=2, 
    file_name=0x560a68cd2a38 "/home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc", line=1186, 
    lock=0x7f39d88fb248) at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/include/sync0rw.ic:674
#7  buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool) ()
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1186
#8  0x0000560a686eb201 in buf_flush_try_neighbors (n_to_flush=<optimized out>, n_flushed=<optimized out>, 
    flush_type=BUF_FLUSH_LIST, page_id=...)
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1431
#9  buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, unsigned long, unsigned long*) [clone .isra.0] ()
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1505
#10 0x0000560a686ec109 in buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long) ()
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1763
#11 0x0000560a686edef9 in buf_flush_batch (n=0x7f39f4ff4c50, lsn_limit=1446564044560, min_n=18446744073709551614, 
    flush_type=BUF_FLUSH_LIST, buf_pool=0x560a6b20c620)
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1834
#12 buf_flush_do_batch (n=0x7f39f4ff4c50, lsn_limit=1446564044560, min_n=18446744073709551614, type=BUF_FLUSH_LIST, 
    buf_pool=0x560a6b20c620) at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1992
#13 buf_flush_do_batch (type=BUF_FLUSH_LIST, n=0x7f39f4ff4c50, lsn_limit=<optimized out>, min_n=<optimized out>, 
    buf_pool=0x560a6b20c620) at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1975
#14 pc_flush_slot () at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:2791
#15 0x0000560a680b1d54 in buf_flush_page_cleaner_coordinator ()
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:3160
#16 0x00007f3a20683ea5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f3a1f11396d in clone () from /lib64/libc.so.6

The flush thread wants a page lock that is held by another thread.
Following the writer_thread member of the rw_lock, we find that it is held by the purge thread, with this trace:

Thread 16 (Thread 0x7f39ec7f8700 (LWP 38147)):
#0  0x00007f3a2068ae9d in nanosleep () from /lib64/libpthread.so.0
#1  0x0000560a685f401c in os_thread_sleep (tm=tm@entry=100000)
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/os/os0thread.cc:231
#2  0x0000560a685d96ca in log_margin_checkpoint_age(unsigned long) ()
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/log/log0log.cc:283
#3  0x0000560a685eb4f3 in prepare_write (this=0x7f39ec7f6c50)
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/mtr/mtr0mtr.cc:699
#4  mtr_t::commit() () at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/mtr/mtr0mtr.cc:439
#5  0x0000560a6869c445 in trx_undo_truncate_start(trx_rseg_t*, unsigned long, unsigned long, unsigned long) ()
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/trx/trx0undo.cc:1018
#6  0x0000560a6867cae4 in trx_purge_truncate_rseg_history (limit=..., rseg=...)
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/trx/trx0purge.cc:465
#7  trx_purge_truncate_history() () at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/trx/trx0purge.cc:564
#8  0x0000560a6867f9e5 in trx_purge(unsigned long, bool) ()
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/trx/trx0purge.cc:1329
#9  0x0000560a6866cbad in srv_do_purge (n_total_purged=<synthetic pointer>)
    at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/srv/srv0srv.cc:2613
#10 srv_purge_coordinator_thread () at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/srv/srv0srv.cc:2752
#11 0x00007f3a20683ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f3a1f11396d in clone () from /lib64/libc.so.6

So apparently the purge thread is waiting for log space to become available, but that will never happen: it is waiting while holding the lock on the very page that the flush thread must flush in order to make that space available.

We are working on isolating this into something manageable that we can post, but since we are aware of this bug, I figured I would open an issue and hopefully get some tips on the code in question. In particular, I would like to know how to get a call to trx_undo_truncate_start() to do something meaningful - I can get a call to happen, but it returns immediately because the limit argument is 0.



 Comments   
Comment by Marko Mäkelä [ 2022-01-11 ]

Thank you for the report!

I think that this problem has been fixed by MDEV-23399 in MariaDB Server 10.5.7. Before that change, the page cleaner threads would execute 2 types of flushing batches: eviction flushing (writing out least recently modified pages so that they can be evicted from the buffer pool) and checkpoint flushing.

In MariaDB Server 10.5, there is only a single buf_flush_page_cleaner() thread that executes only checkpoint flushing. Furthermore, only that thread is updating checkpoints, after each page flushing batch. In earlier versions, there are multiple page cleaner threads, and they are not updating checkpoints themselves. Also, MDEV-25113 in 10.5.12 introduced mtr_t::PAGE_FLUSH_ASYNC to reduce the possibility of an intrusive wait during mini-transaction commit.

A problem with the old style of flushing is that it blocks while waiting for a page latch. I believe this is also the cause of the messages that MDEV-14550 was filed about.

You could also be affected by MDEV-23929 on 10.4, and you could reduce the probability of this hang by specifying the non-default setting innodb_flush_neighbors=OFF if the database is stored on SSD.

In 10.5 or later, such checkpoint hangs should not be possible. I recently ran some tests for MDEV-27414 and MDEV-27416 with io_uring.

It would be very interesting to know if you can repeat such a hang on 10.5 or 10.6.

Comment by Sasha Pachev [ 2022-01-11 ]

Marko - I was able to reproduce the problem on 10.4.20 with innodb_flush_neighbors set to OFF, producing the same trace. I did a double-take, as the call to buf_flush_neighbors() was in the trace. But upon checking the code, I saw that buf_flush_neighbors() still runs even if innodb_flush_neighbors is off; it just does less work.

While I have a good amount of confidence that the fixes in 10.5 and 10.6 will likely avoid this problem, it is a priority for us to make 10.4 stable against this type of race with the smallest and safest code change possible. So I will continue to work on a simple reproducible test. Do you have any suggestions on how to make trx_undo_truncate_start() get called with a non-zero value for the limit argument?

I am also able to try custom patches, as well as examine the variables in gdb during the deadlock if you need me to.

Comment by Marko Mäkelä [ 2022-01-12 ]

spachev, could you provide a definite answer that the hang is not reproducible for the latest 10.5 or 10.6?

To answer your question: I think that you will want to set the following, in order to make undo tablespace truncation occur more frequently.

SET GLOBAL innodb_purge_rseg_truncate_frequency = 1;

That is what I am using in tests that wait for everything to be purged. There is also a parameter innodb_purge_batch_size that could play a role here.

You might also want to check the tests innodb.undo_truncate and innodb.undo_truncate_recover.

The entry point for innodb_undo_log_truncate=ON (which is effective with innodb_undo_tablespaces>2) is trx_purge_truncate_history(). It invokes a dedicated function mtr_t::commit_shrink() for shrinking the undo tablespace. The algorithm is:

  1. Lock any pages that will no longer belong to the shrunk tablespace, so that buf_flush_page() cannot write them out.
  2. Durably write log for updating the FSP_SIZE field in the first page of the tablespace, so that recovery will be able to shrink the file.
  3. Remove the locked "orphan" pages from the buf_pool_t::flush_list.
  4. Release the locks held by the mini-transaction.

This allows us to rebuild undo tablespaces by using just the redo log. Last time I checked, MySQL starting with 5.7 uses a separate log file for truncating undo tablespaces, and therefore it will have to execute a log checkpoint (write out all dirty pages from the buffer pool) on the main redo log.

One thing that got better in 10.4 compared to 10.2 and 10.3 is that we have a MLOG_MEMSET record, which helps reduce the amount of the redo log that needs to be written for reinitializing the undo tablespace.

I did not expect setting innodb_flush_neighbors=OFF to make this hang go away. I only thought that it could reduce the probability. A possible fix could be something like this:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index 2cfca67ddd1..ac3a31bc2a7 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1143,12 +1143,7 @@ buf_flush_page(
 		flush = FALSE;
 	} else {
 		rw_lock = &reinterpret_cast<buf_block_t*>(bpage)->lock;
-		if (flush_type != BUF_FLUSH_LIST) {
-			flush = rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE);
-		} else {
-			/* Will SX lock later */
-			flush = TRUE;
-		}
+		flush = rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE);
 	}
 
 	if (flush) {
@@ -1170,22 +1165,6 @@ buf_flush_page(
 
 		buf_pool_mutex_exit(buf_pool);
 
-		if (flush_type == BUF_FLUSH_LIST
-		    && is_uncompressed
-		    && !rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE)) {
-
-			if (!fsp_is_system_temporary(bpage->id.space())) {
-				/* avoiding deadlock possibility involves
-				doublewrite buffer, should flush it, because
-				it might hold the another block->lock. */
-				buf_dblwr_flush_buffered_writes();
-			} else {
-				buf_dblwr_sync_datafiles();
-			}
-
-			rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE);
-		}
-
 		/* If there is an observer that want to know if the asynchronous
 		flushing was sent then notify it.
 		Note: we set flush observer to a page with x-latch, so we can

Note: I did not check if this compiles or behaves correctly. Even if it passed correctness tests, it might still result in a performance regression. With the multiple buffer pool instances and page cleaner threads that we have before 10.5, things can behave in a chaotic fashion, and it is difficult to predict the performance impact.

Comment by Sasha Pachev [ 2022-01-19 ]

I can confirm that the issue does not happen on 10.6. I had to play with the size of the buffer pool to make it a fair comparison. With a 512 MB buffer pool, 10.4 was deadlocking, but 10.6 would give me "The total number of locks exceeds the lock table size". So I had to experimentally find a size that was big enough for 10.6 to do something different, but small enough for 10.4 to deadlock. At 805306368 bytes (768 MB), 10.6 was able to complete the test successfully while 10.4 deadlocked.

It may be of significance that the column being updated was added by INSTANT ALTER TABLE algorithm. At least we have not been able to reproduce it on other columns.

Comment by Sasha Pachev [ 2022-01-21 ]

Quick update. With the patch, the deadlock does not happen; the query completes with an error. However, it takes a very long time. There is a period of what appears to be flush starvation: the update sits in buf_flush_wait_flushed() for a very long time, and there is no flushing activity for a while. However, the situation eventually resolves itself.

I still have not fully understood what is happening. I will once I have. However, I thought I would post what I have for the MariaDB team to have something to work with.

Comment by Sasha Pachev [ 2022-01-21 ]

Another quick update. During the flush stall phase mentioned earlier, I am looking at buf_flush_lists() called from buf_flush_page_cleaner_coordinator() and seeing a pattern of trickle flushing: 2-4 pages followed by 0, in an alternating pattern. A round of 0 causes a 10-second wait before the next flush, because the flush thread thinks there is nothing to do. But in the meantime we are running short on log space, so the update execution as well as the purge thread are waiting for log space to become available.

Comment by Marko Mäkelä [ 2022-01-21 ]

spachev, thank you. Possibly somewhat related to this, in MDEV-27461 I found out that buf_flush_LRU_list_batch() will not write out anything if the buffer pool (or before 10.5, buffer pool instance) contains less than 256 pages. In MDEV-26784, to fix a 10.5+ recovery hang with a very small buffer pool, we tweaked the function so that while crash recovery is active, it will attempt to write out all dirty pages, so that they can be evicted.

Instant ADD COLUMN (MDEV-11369) would create variable-length records. Any update between the default value of the instantly added column and some other value will change the size of the record. This could cause more frequent calls to btr_page_reorganize_low() or page splits and merges. All those can cause explicit lock data structures to be copied. Those structures are allocated from the buffer pool. In 10.6.0, I removed LOCK_PAGE_BITMAP_MARGIN, and 10.6 also contains other lock_sys refactoring that might reduce the memory footprint.

Before I worked on MDEV-23399 and MDEV-23855, I was not too familiar with the page flushing subsystem. I probably have a biased view, but I feel that the old multi-threaded page cleaner could perform in chaotic ways, depending on how scheduling conflicts are resolved. A major performance bottleneck of the 10.5 single buf_flush_page_cleaner() thread was found to be the fil_system.mutex. I changed some fil_space_t and fil_node_t members to use std::atomic, so that we can avoid acquiring the fil_system.mutex at least 3 times for each flushed page (in 10.5 it could be 1 time for several initiated page writes). I would expect that the multiple page cleaner threads before 10.5 will conflict on the fil_system.mutex as well as log_sys.mutex, but perhaps not the buf_pool->mutex if each buffer pool instance is being processed by its own page cleaner thread. On the kernel side, fdatasync() might cause stalls of synchronous page writes. Since MDEV-23855, also the doublewrite buffer will use asynchronous writes.

In 10.6 and later, the buf_pool.mutex and log_sys.mutex remain the main bottlenecks in write-heavy workloads. In 10.5, also lock_sys.mutex was hot, but that one should not interfere with page flushing.

We can run correctness tests on any patch that you would submit, but I am afraid that for tweaking the performance of 10.4, you are mostly on your own. MDEV-23475 is an example where we had to revert a change from 10.2, 10.3, 10.4 because it caused a performance regression in some cases. It was related to page flushing in a convoluted way.

Comment by Sasha Pachev [ 2022-01-21 ]

In this particular case, when the stall happens, buf_flush_LRU_list_batch() is not being called at all for a while. Then after some period of time, maybe half an hour, all of a sudden it starts getting called; the LRU list size is 39K, so we are not dealing with the BUF_LRU_MIN_LEN restriction at all; the flushes happen, and the rows start getting updated. Will continue debugging...

Comment by Marko Mäkelä [ 2022-01-24 ]

Before MDEV-23399 and MDEV-23855 in 10.5, the LRU flushing could be initiated by the page cleaner threads, but there also was a last-ditch effort "single page flushing" which could be initiated by user threads. Are you seeing that?

I no longer remember how the LRU eviction logic worked before my changes. I think that the idea of LRU eviction by the page cleaner threads was to "pre-evict" some pages so that user threads would almost never have to wait for page writes in case all pages are dirty and something needs to be evicted to make room for another page.

My changes adopted the Linux kernel’s early idea of "try to retain buffer caches as long as possible", and in 10.5 only the user threads that are trying to allocate a block may initiate LRU eviction flushing (with no distinction between single-page flushing and "LRU batch" flushing). One thread runs the LRU flush, others that might concurrently try to allocate a page will wait for that batch to complete. On write completion, the pages will actually be freed.

Comment by Sasha Pachev [ 2022-01-25 ]

Update. We are able to avoid the stall with the following patch:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index 2cfca67ddd1..db584a3e71f 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1143,12 +1143,7 @@ buf_flush_page(
 		flush = FALSE;
 	} else {
 		rw_lock = &reinterpret_cast<buf_block_t*>(bpage)->lock;
-		if (flush_type != BUF_FLUSH_LIST) {
-			flush = rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE);
-		} else {
-			/* Will SX lock later */
-			flush = TRUE;
-		}
+		flush = rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE);
 	}
 
 	if (flush) {
@@ -1170,22 +1165,6 @@ buf_flush_page(
 
 		buf_pool_mutex_exit(buf_pool);
 
-		if (flush_type == BUF_FLUSH_LIST
-		    && is_uncompressed
-		    && !rw_lock_sx_lock_nowait(rw_lock, BUF_IO_WRITE)) {
-
-			if (!fsp_is_system_temporary(bpage->id.space())) {
-				/* avoiding deadlock possibility involves
-				doublewrite buffer, should flush it, because
-				it might hold the another block->lock. */
-				buf_dblwr_flush_buffered_writes();
-			} else {
-				buf_dblwr_sync_datafiles();
-			}
-
-			rw_lock_sx_lock_gen(rw_lock, BUF_IO_WRITE);
-		}
-
 		/* If there is an observer that want to know if the asynchronous
 		flushing was sent then notify it.
 		Note: we set flush observer to a page with x-latch, so we can
@@ -3181,7 +3160,7 @@ DECLARE_THREAD(buf_flush_page_cleaner_coordinator)(void*)
 
 			n_flushed = n_flushed_lru + n_flushed_list;
 
-		} else if (srv_check_activity(last_activity)) {
+		} else if (srv_check_activity(last_activity) || log_is_in_distress()) {
 			ulint	n_to_flush;
 			lsn_t	lsn_limit = 0;
 
diff --git a/storage/innobase/include/log0log.h b/storage/innobase/include/log0log.h
index 368d912f07f..cd938d5a82e 100644
--- a/storage/innobase/include/log0log.h
+++ b/storage/innobase/include/log0log.h
@@ -67,6 +67,11 @@ log_reserve_and_write_fast(
 	const void*	str,
 	ulint		len,
 	lsn_t*		start_lsn);
+
+UNIV_INLINE
+bool
+log_is_in_distress();
+
 /***********************************************************************//**
 Checks if there is need for a log buffer flush or a new checkpoint, and does
 this if yes. Any database operation should call this when it has modified
diff --git a/storage/innobase/include/log0log.ic b/storage/innobase/include/log0log.ic
index 8dfd86d3078..c65c91b941a 100644
--- a/storage/innobase/include/log0log.ic
+++ b/storage/innobase/include/log0log.ic
@@ -51,6 +51,17 @@ log_block_get_flush_bit(
 	return(FALSE);
 }
 
+UNIV_INLINE
+bool log_is_in_distress()
+{
+	bool res;
+	log_mutex_enter();
+	res = 3*(log_sys.lsn - log_sys.last_checkpoint_lsn)/2 > log_sys.log_group_capacity;
+	log_mutex_exit();
+	return res;
+}
+
+
 /************************************************************//**
 Sets the log block flush bit. */
 UNIV_INLINE
diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
index 241905601cd..03eff3a8936 100644
--- a/storage/innobase/trx/trx0purge.cc
+++ b/storage/innobase/trx/trx0purge.cc
@@ -41,6 +41,7 @@ Created 3/26/1996 Heikki Tuuri
 #include "trx0roll.h"
 #include "trx0rseg.h"
 #include "trx0trx.h"
+#include "log0log.h"
 #include <mysql/service_wsrep.h>
 
 /** Maximum allowable purge history length.  <=0 means 'infinite'. */
@@ -437,6 +438,9 @@ trx_purge_truncate_rseg_history(
 	mtr_t		mtr;
 	trx_id_t	undo_trx_no;
 
+	if (log_is_in_distress())
+		return;
+
 	mtr.start();
 	ut_ad(rseg.is_persistent());
 	mutex_enter(&rseg.mutex);
@@ -446,7 +450,7 @@ trx_purge_truncate_rseg_history(
 	hdr_addr = trx_purge_get_log_from_hist(
 		flst_get_last(rseg_hdr + TRX_RSEG_HISTORY, &mtr));
 loop:
-	if (hdr_addr.page == FIL_NULL) {
+	if (hdr_addr.page == FIL_NULL || log_is_in_distress()) {
 func_exit:
 		mutex_exit(&rseg.mutex);
 		mtr.commit();
diff --git a/storage/innobase/trx/trx0undo.cc b/storage/innobase/trx/trx0undo.cc
index 6128044bbc9..53b1054001c 100644
--- a/storage/innobase/trx/trx0undo.cc
+++ b/storage/innobase/trx/trx0undo.cc
@@ -33,6 +33,7 @@ Created 3/26/1996 Heikki Tuuri
 #include "srv0start.h"
 #include "trx0purge.h"
 #include "trx0rec.h"
+#include "log0log.h"
 #include "trx0rseg.h"
 
 /* How should the old versions in the history list be managed?
@@ -967,9 +968,10 @@ trx_undo_truncate_start(
 
 	ut_ad(mutex_own(&(rseg->mutex)));
 
-	if (!limit) {
+	if (!limit || log_is_in_distress()) {
 		return;
 	}
+
 loop:
 	mtr_start(&mtr);
 
@@ -991,7 +993,7 @@ trx_undo_truncate_start(
 
 	last_rec = trx_undo_page_get_last_rec(undo_page, hdr_page_no,
 					      hdr_offset);
-	if (trx_undo_rec_get_undo_no(last_rec) >= limit) {
+	if (trx_undo_rec_get_undo_no(last_rec) >= limit || log_is_in_distress()) {
 
 		mtr_commit(&mtr);
 

I have not yet tested, but I think the simple variant that only has

if (trx_undo_rec_get_undo_no(last_rec) >= limit || log_is_in_distress()) {

will work as well. I observed that trx_undo_truncate_start() could start when there is plenty of space in the log, but while freeing the pages it manages to push the checkpoint age to the limit - just one call to it. I have not yet fully understood the mechanism, but I am guessing it has to do with taking some page locks, which results in blocking the flush activity (with Marko's patch from this issue) or deadlocking (without the patch).

I am not suggesting this patch is necessarily the correct solution or that it may not cause other problems. I am just posting this to demonstrate that reducing purge truncation activity when the log space is short works around the problem.

Comment by Marko Mäkelä [ 2022-02-11 ]

spachev, thank you for the update. This reminds me of https://github.com/MariaDB/server/pull/1889 (MDEV-26356) in 10.6.

Can you submit a pull request for this?

Comment by Sasha Pachev [ 2022-02-17 ]

marko
https://github.com/MariaDB/server/pull/2021

Comment by Marko Mäkelä [ 2022-02-17 ]

Thank you! Can you please rebase that on the latest 10.4?

Comment by Sasha Pachev [ 2022-02-17 ]

marko - done, also added the safety explanation comment as you requested.

Comment by VAROQUI Stephane [ 2022-02-18 ]

Can 10.3 be affected as well? I hit a similar long semaphore wait (600 s) on a read-only replica with conservative and multi-source GTID replication.
The purge thread was waiting for another thread doing a long CREATE OR REPLACE:

create or replace table SANDBOX.T1
(PRIMARY KEY (ID,DATE,ID_ORDER)
,KEY `FKIndex1` (ID_ORDER))
 
SELECT ...

Unfortunately the stack trace did not get written to disk, but from monitoring we get:

=====================================
2022-02-16 19:55:19 0x7efdd17fc700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 20 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 35904 srv_active, 0 srv_shutdown, 18927 srv_idle
srv_master_thread log flush and writes: 54830
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 15411906
 
 
--Thread 139628552120064 has waited at row0purge.cc line 1067 for 536.00 seconds the semaphore:
S-lock on RW-latch at 0x562746c6a9e0 created in file dict0dict.cc line 920
a writer (thread id 139628598245120) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file dict0stats.cc line 2488
 
--Thread 139628633913088 has waited at srv0srv.cc line 2023 for 939.00 seconds the semaphore:
X-lock on RW-latch at 0x562746c6a9e0 created in file dict0dict.cc line 920
a writer (thread id 139628598245120) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file dict0stats.cc line 2488
 
 
--Thread 139628539541248 has waited at row0purge.cc line 1067 for 536.00 seconds the semaphore:
S-lock on RW-latch at 0x562746c6a9e0 created in file dict0dict.cc line 920
a writer (thread id 139628598245120) has reserved it in mode  exclusive
 
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file dict0stats.cc line 2488
OS WAIT ARRAY INFO: signal count 19724996
RW-shared spins 18501644, rounds 75969116, OS waits 1318974
RW-excl spins 37487656, rounds 126621842, OS waits 1242171
RW-sx spins 334085, rounds 5011848, OS waits 47900
Spin rounds per wait: 4.11 RW-shared, 3.38 RW-excl, 15.00 RW-sx
------------
TRANSACTIONS
------------
Trx id counter 21851754826
Purge done for trx's n:o < 21851754824 undo n:o < 0 state: running
History list length 2
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 421161563984184, COMMITTED IN MEMORY flushing log
0 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 8
---TRANSACTION 21851754821, ACTIVE (PREPARED) 958 sec
2 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 3
MySQL thread id 31, OS thread handle 139628805043968, query id 6098577 Init
COMMIT
 
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 6 state: waiting for i/o request (write thread)
....
I/O thread 9 state: waiting for i/o request (write thread)
Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
 ibuf aio reads:, log i/o's:, sync i/o's:
Pending flushes (fsync) log: 0; buffer pool: 1
829392 OS file reads, 7730749 OS file writes, 426108 OS fsyncs
0 pending reads, 1 pending writes
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 5, seg size 7, 53437 merges
merged operations:
 insert 75980, delete mark 61216, delete 1244
discarded operations:
 insert 0, delete mark 0, delete 0
Hash table size 6375037, node heap has 13856 buffer(s)
Hash table size 6375037, node heap has 7638 buffer(s)
Hash table size 6375037, node heap has 5463 buffer(s)
Hash table size 6375037, node heap has 13341 buffer(s)
Hash table size 6375037, node heap has 36025 buffer(s)
Hash table size 6375037, node heap has 52283 buffer(s)
Hash table size 6375037, node heap has 55725 buffer(s)
Hash table size 6375037, node heap has 69834 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 12370514365421
Log flushed up to   12370514363873
Pages flushed up to 12370509062381
Last checkpoint at  12370509062381
0 pending log flushes, 0 pending chkp writes
2118121 log i/o's done, 0.00 log i/o's/second
 
----------------------------
END OF INNODB MONITOR OUTPUT
============================
InnoDB: ###### Diagnostic info printed to the standard error stream
2022-02-16 19:55:32 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
220216 19:55:32 [ERROR] mysqld got signal 6 ;
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5627464767ce]
/usr/sbin/mysqld(handle_fatal_signal+0x54d)[0x562745f879bd]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f0b5a7ae730]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f0b5a6127bb]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f0b5a5fd535]
/usr/sbin/mysqld(+0x4eac5b)[0x562745cbac5b]
/usr/sbin/mysqld(+0x4e1022)[0x562745cb1022]

Comment by Sasha Pachev [ 2022-02-23 ]

stephane@skysql.com - I think your case is different. From the status we see that the difference between "Log sequence number" and "Last checkpoint at" is only a bit over 5 MB. Even if the log is very small, the smallest log capacity (2 logs, 4 MB each) is 8 MB, and even if it is that low, you still have 3 MB of log space left. But it looks like something might have gone wrong due to a slow or stuck log flush, which would produce similar symptoms.

Comment by Axel Schwenke [ 2022-03-14 ]

Added performance test results for PR#2021. For sysbench-tpcc we even see increased performance when the data set fits in the buffer pool (tpcc1.pdf). Unfortunately the effect reverses when the data set is bigger than the buffer pool (i.e. IO-bound; tpcc2.pdf).

Not shown: for standard sysbench workloads no change in performance was observed.

To avoid negative effects for customers who are still on 10.4, I'd prefer offering the patched MariaDB 10.4 as a custom build to interested parties.

Generated at Thu Feb 08 09:53:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.