[MDEV-27499] tps regression with 10.8 (w.r.t to 10.6.5) Created: 2022-01-14  Updated: 2022-01-25  Resolved: 2022-01-18

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: N/A
Fix Version/s: 10.5.14, 10.6.6, 10.7.2, 10.8.1

Type: Bug Priority: Blocker
Reporter: Krunal Bauskar Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance

Attachments: PNG File 10.6.5-tps and 10.8-tps - x86 (uniform, update-non-index, 1024 scalbility).png     PNG File 10.6.5-tps and 10.8.-tps - ARM (uniform, update-non-index, 1024 scalability).png     PNG File arm.png     PNG File on touching threshold steep drop in tps from X to 0. Even though page cleaner continue to flush pages and fuzzy checkpoint take place the said threads wait for the complete checkpoint.png     PNG File x86.png    
Issue Links:
Problem/Incident
is caused by MDEV-27416 InnoDB hang in buf_flush_wait_flushed... Closed

 Description   
  • While benchmarking the 10.8 (#4b14874c2809a8b0f2d04a969132e062fb22b145) work-in-progress release, I observed a regression with the update-non-index, uniform workload (1024 scalability).
  • TPS often drops to 0.


 Comments   
Comment by Krunal Bauskar [ 2022-01-14 ]
  • Refer to the TPS graphs attached above.
  • The test runs 300 seconds of workload -> 60 seconds of active sleep -> 300 seconds of the workload repeated.
  • TPS repeatedly touches 0 and then auto-recovers in some cases (the ARM case is slightly better than x86).
  • On x86, once TPS drops to 0 it remains at 0 until the 60 seconds of active sleep; on the next run, TPS is back at its original level.

configuration details: https://github.com/mysqlonarm/benchmark-suites/blob/master/mysql-sbench/conf/mdb.cnf/100tx3m_106_cpubound.cnf + skip-log-bin + thread-pool enabled.

Comment by Marko Mäkelä [ 2022-01-14 ]

Thank you! As of now, there is no difference whatsoever in storage/innobase between 10.7 and 10.8. Between the current 10.6 and 10.7 there are more differences, but I cannot think of anything that could explain your observation.

The periods of 0 transactions per second would suggest that all threads are waiting for a log checkpoint to occur.

I recently fixed MDEV-27416, a checkpoint hang that I think would only affect scenarios where a single thread was writing and subsequently waiting for a checkpoint. That could perhaps have caused a shutdown hang. But the revision that you tested already includes this fix.

We have not seen anything like this in our internal testing, so we would rely on you to debug this further. In your configuration, I do not see anything special.

  • Can you get the stack traces of all threads during the hangs?
  • Could you please also test the latest 10.6 development version? If something was broken since 10.6.5, the most likely culprits (which are also in the 10.6 branch) would be MDEV-27058 and MDEV-27416.
Comment by Krunal Bauskar [ 2022-01-14 ]

Marko,

You are right. I see the regression even with 10.6-trunk (#e6a0611)

Comment by Krunal Bauskar [ 2022-01-17 ]

I did some more investigation and found that the issue likely stems from the following commit:

commit 4c3ad24413f234599eda27f4958dd3ff21df3203
MDEV-27416 InnoDB hang in buf_flush_wait_flushed(), on log checkpoint

That commit moved the logic that waits for dirty pages up to a given LSN into a new function, buf_flush_wait(). In addition to waiting for dirty pages, this function also waits for the checkpoint to complete, that is, for buf_flush_sync_lsn to be reset. Until then, all threads are blocked in this function, causing TPS to drop to 0.
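[Editorial note] The stall mechanism described here can be modelled with a small stand-alone sketch. This is an illustration only, not server code: the names mimic the InnoDB globals (the real code uses my_cond_wait() on buf_pool.done_flush_list), and the timing values are invented.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Editorial stand-ins for the InnoDB names discussed above.
std::mutex flush_list_mutex;
std::condition_variable done_flush_list;
uint64_t buf_flush_sync_lsn = 100;  // nonzero: sync flush + checkpoint pending
int committed = 0;                  // transactions completed by user threads

// Analogue of the second loop in buf_flush_wait(): every user thread
// blocks here until the page cleaner resets buf_flush_sync_lsn to 0,
// i.e. until the whole checkpoint completes -- hence TPS = 0 meanwhile.
void user_thread()
{
  std::unique_lock<std::mutex> lk(flush_list_mutex);
  while (buf_flush_sync_lsn)
    done_flush_list.wait(lk);
  ++committed;  // only now can the transaction make progress
}

// The page cleaner finishes flushing and the checkpoint, then wakes all.
void page_cleaner()
{
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  {
    std::lock_guard<std::mutex> lk(flush_list_mutex);
    buf_flush_sync_lsn = 0;  // checkpoint done, leave 'furious' mode
  }
  done_flush_list.notify_all();
}

int run_simulation()
{
  std::vector<std::thread> users;
  for (int i = 0; i < 4; ++i)
    users.emplace_back(user_thread);
  std::thread cleaner(page_cleaner);
  for (auto &t : users)
    t.join();
  cleaner.join();
  return committed;  // all commits happen only after the checkpoint
}
```

In this model, no user thread can commit until the page cleaner clears buf_flush_sync_lsn, matching the observed flat-zero TPS while the sync flush is in progress.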

The issue is hit when the checkpoint age touches the maximum threshold (Innodb_checkpoint_max_age) and a sync flush starts.

After commenting out the snippet below, TPS stayed above 0, though it did drop from its original level on hitting the threshold.

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index dec0292a088..6b3dc5f0767 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1798,10 +1798,12 @@ static void buf_flush_wait(lsn_t lsn)
                  &buf_pool.flush_list_mutex.m_mutex);
   }
 
+#if 0
   /* Wait for the checkpoint. */
   while (buf_flush_sync_lsn)
     my_cond_wait(&buf_pool.done_flush_list,
                  &buf_pool.flush_list_mutex.m_mutex);
+#endif
 }

Marko,
I am not sure why this condition was added (is it for some other change expected in the near future?).

Comment by Marko Mäkelä [ 2022-01-17 ]

krunalbauskar, thank you for narrowing it down. I think that the wait condition is too aggressive, and threads could end up waiting for longer than necessary if the page cleaner remains in ‘furious flushing’ mode. Can you try the following patch?

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index dec0292a088..ecbc65b15dd 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1799,7 +1799,7 @@ static void buf_flush_wait(lsn_t lsn)
   }
 
   /* Wait for the checkpoint. */
-  while (buf_flush_sync_lsn)
+  while (buf_flush_sync_lsn >= lsn)
     my_cond_wait(&buf_pool.done_flush_list,
                  &buf_pool.flush_list_mutex.m_mutex);
 }
 

Comment by Krunal Bauskar [ 2022-01-17 ]

Marko,

1. The said patch doesn't help.
2. The condition is constant until buf_flush_sync_lsn is reset to 0; lsn does not change.
3. I am still not sure why we need that condition there.

Comment by Marko Mäkelä [ 2022-01-17 ]

krunalbauskar, in case you wonder why there are two wait loops in the function buf_flush_wait(): The first one waits for the minimum oldest_modification in the buffer pool to be large enough. That is the prerequisite for a log checkpoint. The oldest_modification of each modified page would be cleared in buf_page_write_complete() in I/O callback threads.

The final loop waits for the checkpoint to actually be advanced by buf_flush_page_cleaner(). As currently written, it waits for the page cleaner to completely exit the ‘furious flushing’ mode before returning.

I now realize that my initially suggested change cannot possibly help, because whenever buf_flush_sync_lsn is nonzero, it would be guaranteed to be at least lsn, by design. What we really might need is to wait for the log checkpoint to have advanced enough. Something like this:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index dec0292a088..7480e0cf673 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -1799,7 +1799,7 @@ static void buf_flush_wait(lsn_t lsn)
   }
 
   /* Wait for the checkpoint. */
-  while (buf_flush_sync_lsn)
+  while (buf_flush_sync_lsn && log_sys.last_checkpoint_lsn < lsn)
     my_cond_wait(&buf_pool.done_flush_list,
                  &buf_pool.flush_list_mutex.m_mutex);
 }

I will have to review whether that second loop is needed at all. It should be unnecessary for the correctness of log_checkpoint_margin(), but I have to consider other callers as well.
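[Editorial note] The two exit conditions Marko describes, including the relaxed second condition from the patch above, can be expressed as simple predicates. This is an illustrative model only: the struct and values are invented; the field names mirror the InnoDB globals.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the two wait loops in
// buf_flush_wait(lsn); names mirror the InnoDB globals discussed above.
struct FlushState {
  uint64_t oldest_modification;  // minimum oldest_modification in buf_pool
  uint64_t buf_flush_sync_lsn;   // nonzero while 'furious flushing' is active
  uint64_t last_checkpoint_lsn;  // log_sys.last_checkpoint_lsn
};

// First loop: wait until every page modified before lsn has been written
// out. This is the prerequisite for a log checkpoint.
bool first_loop_done(const FlushState &s, uint64_t lsn)
{
  return s.oldest_modification >= lsn;
}

// Second loop, as relaxed by the patch above: stop waiting as soon as the
// checkpoint itself has advanced past lsn, even if the page cleaner is
// still in 'furious flushing' mode for some later target.
bool second_loop_done(const FlushState &s, uint64_t lsn)
{
  return !s.buf_flush_sync_lsn || s.last_checkpoint_lsn >= lsn;
}
```

The unpatched second loop corresponds to `!s.buf_flush_sync_lsn` alone, which keeps every waiter blocked for the entire duration of furious flushing, not just until its own lsn is covered by a checkpoint.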

Comment by Marko Mäkelä [ 2022-01-17 ]

I believe that the second loop of buf_flush_wait() (waiting for the checkpoint) that was added in MDEV-27416 might only be needed for the buf_flush_sync() call in srv_prepare_to_delete_redo_log_file(). Other calls of buf_flush_sync() are in debug code, so it is fine for them to wait for the checkpoint. But, the calls in log_checkpoint_margin() definitely do not need to wait for a checkpoint; the checkpoint age will explicitly be checked after buf_flush_wait_flushed().

Edit: I think that the loop is completely unnecessary. For log file resizing, we explicitly disable checkpoints, so we should only care that the buffer pool is clean, not whether the page cleaner exited the ‘furious flushing’ mode. I have submitted the removal of the loop for testing.

Thank you for catching this, krunalbauskar. For some reason, the condition failed to reproduce in our performance tests. Perhaps axel did not test with a small innodb_log_file_size lately? I think that I did run some performance tests like that during the MDEV-14425 development, but I did not notice this.

Comment by Matthias Leich [ 2022-01-18 ]

The tree
origin/bb-10.5-MDEV-27499 6c66c66ef6c6a6e4a99102fcc465877174e44757 2022-01-17T16:56:07+02:00
behaved well in RQG testing.
The bad effects observed are already known, i.e. they occur in other trees too.

Comment by Marko Mäkelä [ 2022-01-19 ]

The second wait loop turns out to be necessary for log file resizing after all. We got a few failures of the test innodb.log_file_size, like this:

innodb.log_file_size '8k,innodb'         w5 [ fail ]
…/srv/srv0start.cc:854: lsn_t srv_prepare_to_delete_redo_log_file(): Assertion `!buf_pool.any_io_pending()' failed.

I am unable to reproduce this failure locally.

It might be acceptable to replace this assertion with a similar one (!buf_pool.get_oldest_modification(0)), but then we might get into a race condition where the page cleaner would attempt to write a checkpoint to the newly created log file. I will add that wait loop to buf_flush_sync(), like I originally planned to. Other callers of buf_flush_sync() are in debug code.
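[Editorial note] The final shape described here, with the checkpoint wait kept only where the redo log is about to be replaced, can be sketched as two predicates. This is an editorial model of the design, not the actual patch; names mirror the InnoDB identifiers, values are invented.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the final split: the checkpoint wait leaves
// buf_flush_wait() and moves into buf_flush_sync().
struct PoolState {
  uint64_t oldest_modification;  // all pages below this LSN are clean
  uint64_t buf_flush_sync_lsn;   // nonzero while 'furious flushing' is active
};

// buf_flush_wait()/buf_flush_wait_flushed(): callers such as
// log_checkpoint_margin() only need the pages up to lsn to be written;
// the checkpoint age is re-checked by the caller afterwards.
bool flush_wait_satisfied(const PoolState &s, uint64_t lsn)
{
  return s.oldest_modification >= lsn;
}

// buf_flush_sync(): srv_prepare_to_delete_redo_log_file() additionally
// needs the page cleaner to have finished its checkpoint, so that no
// checkpoint write can race with the creation of the new log file.
bool flush_sync_satisfied(const PoolState &s, uint64_t lsn)
{
  return flush_wait_satisfied(s, lsn) && !s.buf_flush_sync_lsn;
}
```

Under this split, the fast path (user threads throttled by checkpoint age) no longer blocks on the full checkpoint, while log file resizing keeps the stronger guarantee that the assertion in srv_prepare_to_delete_redo_log_file() relies on.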

Generated at Thu Feb 08 09:53:22 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.