[MDEV-31343] Another server hang with innodb_undo_log_truncate=ON Created: 2023-05-25  Updated: 2023-12-13  Resolved: 2023-05-26

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1
Fix Version/s: 11.1.1, 11.0.2, 10.6.14, 10.9.7, 10.10.5, 10.11.4

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Matthias Leich
Resolution: Fixed Votes: 0
Labels: hang

Attachments: PNG File timeseries_tpcc_64.png    
Issue Links:
Relates
relates to MDEV-33009 Server hangs for a long time with inn... Closed
relates to MDEV-27058 Buffer page descriptors are too large Closed
relates to MDEV-27414 Server may hang when innodb_undo_log_... Closed
relates to MDEV-30180 Server hang with innodb_undo_log_trun... Closed
relates to MDEV-31234 InnoDB does not free UNDO after the f... Closed

 Description   

axel reproduced one more hang related to innodb_undo_log_truncate=ON, similar to MDEV-30180. Here is a description of a hang that was reproduced with innodb_use_native_aio=0:

  1. trx_purge_truncate_history() writes the message InnoDB: Truncating and is about to truncate an undo log tablespace.
  2. trx_purge_truncate_history() is busy-looping in a scan of buf_pool.flush_list because one of the pages belonging to the undo tablespace is write-fixed.
  3. Each time trx_purge_truncate_history() releases and immediately re-acquires buf_pool.flush_list_mutex, the other threads waiting for the mutex fail to acquire it with this version of GNU libc. This is similar to MDEV-30180, which could only be reproduced in the same particular environment.
  4. buf_dblwr_t::flush_buffered_writes_completed() was waiting for log_sys.mutex in log_write_up_to(), while trying to write the block that trx_purge_truncate_history() is trying to lock.
  5. log_sys.mutex was held by buf_flush_page_cleaner(), which was waiting for buf_pool.flush_list_mutex.
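
The steps above can be read as a wait-for cycle between three threads. The following is only an illustrative sketch, not InnoDB code: the helper names find_cycle() and hang_wait_for_graph() are hypothetical, and the graph simplifies step 3, where the mutex is briefly released but the waiters never manage to acquire it (starvation rather than a strict deadlock).

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each waiting thread to the thread that holds what it is waiting for.
using WaitForGraph = std::map<std::string, std::string>;

// Hypothetical model of the hang described above (one edge per waiter).
WaitForGraph hang_wait_for_graph()
{
  return {
    // waits for the write-fixed page to be released by the writer
    {"trx_purge_truncate_history",
     "buf_dblwr_t::flush_buffered_writes_completed"},
    // waits for log_sys.mutex in log_write_up_to()
    {"buf_dblwr_t::flush_buffered_writes_completed",
     "buf_flush_page_cleaner"},
    // waits for buf_pool.flush_list_mutex, which the truncating thread
    // keeps re-acquiring before anyone else can grab it
    {"buf_flush_page_cleaner", "trx_purge_truncate_history"},
  };
}

// Follows the wait-for edges from 'start' and returns the threads that form
// a cycle, or an empty vector if the chain ends without repeating a thread.
std::vector<std::string> find_cycle(const WaitForGraph &waits_for,
                                    const std::string &start)
{
  assert(!waits_for.empty());  // precondition: there is something to inspect
  std::vector<std::string> path;
  std::set<std::string> seen;
  std::string cur = start;
  while (seen.insert(cur).second)
  {
    path.push_back(cur);
    auto it = waits_for.find(cur);
    if (it == waits_for.end())
      return {};               // this thread is not waiting: no cycle
    cur = it->second;
  }
  // 'cur' is the first repeated thread; the cycle starts there.
  auto first = std::find(path.begin(), path.end(), cur);
  return std::vector<std::string>(first, path.end());
}
```

Calling find_cycle(hang_wait_for_graph(), "trx_purge_truncate_history") returns all three threads, which is why none of them can make progress until the truncating thread stops monopolizing buf_pool.flush_list_mutex.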

A possible fix would be for trx_purge_truncate_history() to buffer-fix the block, release buf_pool.flush_list_mutex, wait for an exclusive latch on the block, and finally reacquire buf_pool.flush_list_mutex. In that way, the blocking of other threads is minimized. The buffer-fix will prevent the eviction or relocation of the block in the buffer pool while no mutex is held by trx_purge_truncate_history().
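
A minimal sketch of that locking order, using standard C++ primitives instead of the InnoDB ones. Everything here is a hypothetical stand-in: Block is not buf_page_t, a plain std::mutex plays the role of the exclusive page latch, and the function names are made up for illustration. It is not the committed fix.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a buffer pool block.
struct Block
{
  std::atomic<unsigned> fix_count{0}; // nonzero: must not be evicted/relocated
  std::mutex latch;                   // stand-in for the exclusive page latch
};

// Hypothetical stand-in for buf_pool.flush_list_mutex.
std::mutex flush_list_mutex;

// Precondition: 'flush_list_lock' owns flush_list_mutex.
// Postcondition: the caller owns flush_list_mutex again, holds block.latch,
// and still holds one buffer-fix on the block.
void latch_block_outside_mutex(Block &block,
                               std::unique_lock<std::mutex> &flush_list_lock)
{
  assert(flush_list_lock.owns_lock());
  block.fix_count.fetch_add(1); // 1. buffer-fix: pin the block in place
  flush_list_lock.unlock();     // 2. let other threads use the flush list
  block.latch.lock();           // 3. wait for the writer without spinning
  flush_list_lock.lock();       // 4. reacquire before touching the list again
}

// Releases the latch and the buffer-fix taken by latch_block_outside_mutex().
void release_block(Block &block)
{
  block.latch.unlock();
  block.fix_count.fetch_sub(1);
}
```

The point of the ordering is that the potentially long wait in step 3 happens while buf_pool.flush_list_mutex is free, so threads such as the page cleaner are not starved. The caller must assume that the flush list may have changed between steps 2 and 4.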



 Comments   
Comment by Axel Schwenke [ 2023-05-26 ]

Commit f410444a76b from the bb-10.6-MDEV-31343 branch survived a 1-hour run of sysbench-tpcc. So the fix is most probably complete (without it, the server hung within a few minutes).

It still has a severe impact on performance. Enabling innodb_undo_log_truncate=ON led to a 50% performance loss (~3000 tps vs. ~6000 tps) in my benchmark.

Comment by Marko Mäkelä [ 2023-05-26 ]

Also in MDEV-29401 and MDEV-30628 we concluded that something needs to be done about the history list length. One possibility would be that any thread that is acquiring an exclusive latch on an index page for other reasons will attempt to remove any purgeable history, so that the actual purge threads will have less work to do.

Comment by Marko Mäkelä [ 2023-12-13 ]

The history list length issue was greatly improved by MDEV-32050.

We experienced a similar problem again in this same environment, in MDEV-33009.
