[MDEV-31953] madvise(frame, srv_page_size, MADV_FREE) is causing a performance regression Created: 2023-08-18  Updated: 2023-11-20  Resolved: 2023-11-18

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.11, 11.0, 11.1, 11.2, 11.3
Fix Version/s: 10.11.7, 11.0.5, 11.1.4, 11.2.3

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Daniel Black
Resolution: Fixed Votes: 1
Labels: performance, regression
Environment:

GNU/Linux, various multi socket Intel CPU systems (NUMA)


Attachments: PNG File baseline-timeseries_tpcc_32.png     PNG File madvise_timeseries_tpcc_32.png    
Issue Links:
Blocks
is blocked by MDEV-24670 avoid OOM by linux kernel co-operativ... Closed
Problem/Incident
is caused by MDEV-25341 innodb buffer pool soft decommit of m... Closed

 Description   

In MDEV-25341 (MariaDB Server 10.11.0), some calls to madvise() with MADV_FREE were added.

If the call in buf_LRU_block_free_non_file_page() is committed out, the throughput in an I/O bound HammerDB benchmark would improve by 6% to 18%, depending on the CPU microarchitecture.

It could be best to revert this code for now. A better solution might be doable as part of a larger refactoring (MDEV-25340, MDEV-29445). Instead of invoking madvise(MADV_FREE) on individual buffer pages, we might invoke it on larger chunks at a time. There could also be an explicit SQL statement for defragmenting or freeing the buffer pool.



 Comments   
Comment by Marko Mäkelä [ 2023-09-04 ]

axel has confirmed a regression that the following patch would address:

diff --git a/storage/innobase/include/buf0buf.h b/storage/innobase/include/buf0buf.h
index 2aaf748d7ae..a31653583d5 100644
--- a/storage/innobase/include/buf0buf.h
+++ b/storage/innobase/include/buf0buf.h
@@ -644,9 +644,6 @@ class buf_page_t
   void set_os_unused()
   {
     MEM_NOACCESS(frame, srv_page_size);
-#ifdef MADV_FREE
-    madvise(frame, srv_page_size, MADV_FREE);
-#endif
   }
 
   void set_os_used() const

Comment by Axel Schwenke [ 2023-09-12 ]

I tested the patch with TPC-C workload and 3 different sets of UNDO log specific settings. Baseline was MariaDB 10.11 @ commit 725bd568346

no undo log settings

  baseline MDEV-31953
tps 4587.9 4651.4

innodb_max_purge_lag = 500000
innodb_max_purge_lag_delay = 5000

  baseline MDEV-31953
tps 3992.4 4183.4

innodb_undo_log_truncate = ON

  baseline MDEV-31953
tps 3674.7 3749.1

So it was faster with the patch in any case. The UNDO log growth was not affected by this patch, nor was the general "jumpiness" of the troughput-over-time. For another workload, INSERT w/ AUTOCOMMIT the patch showed no effect.

Comment by Axel Schwenke [ 2023-09-12 ]

The time series for the second scenario (where the performance difference was greatest):

Comment by Vladislav Vaintroub [ 2023-09-15 ]

https://github.com/MariaDB/server/commit/c1fd082e9c7369f4511eb5a52e58cb15489caa74
removed "Windows madvise", because it turned out to be expensive. As the comment says, that this code runs under global mutex, which did not sound like a good idea.

Comment by Marko Mäkelä [ 2023-09-15 ]

I think that in any case, it could be useful to have some interface to ask InnoDB to collect any garbage from the buffer pool. Garbage would be left behind in particular by DDL operations. The thinking is that the LRU replacement mechanism will reuse those garbage pages later. This "laziness" allows operations like DROP, TRUNCATE or any table rebuild (OPTIMIZE, ALTER TABLE) to finish faster. But, it generally leads to "InnoDB buffer pool cannot shrink", which MDEV-25341 aimed to mitigate.

Possibly, we can reuse existing user interface for this:

SET GLOBAL innodb_buffer_pool_size=@@GLOBAL.innodb_buffer_pool_size;

That is, when you execute a "null" buffer pool resizing, InnoDB would scan the buf_pool.LRU list, evict any pages that belong to not-longer-existing tablespaces, and invoke madvise(MADV_FREE) or its Microsoft Windows equivalent on all page frame addresses that got freed. To keep it simple, this could be similar to the current approach (individual calls for each page frame, even when they are adjacent). During normal operation, there would be no madvise(MADV_FREE) calls.

A refinement of this could be to "defragment" the buffer pool and invoke madvise(MADV_FREE) on the larger areas that got freed. This might be easier to do after the MDEV-29445 refactoring.

Comment by Marko Mäkelä [ 2023-10-23 ]

Note that the logic of the added function buf_pool_t::garbage_collect() will need to be revised based on MDEV-32552.

Comment by Vladislav Vaintroub [ 2023-10-23 ]

Hi marko. Maybe in the released versions we just remove the madvise?
For a future version, maybe we can have a new boolean variable innodb_discard_unused_memory_now (or other descriptive name) to do the garbage collection, to decouple it from innodb_buffer_pool_size . Changing innodb_buffer_pool_size and freeing unused memory are unrelated, except both use the buffer pool.

Comment by Marko Mäkelä [ 2023-10-23 ]

wlad, I would prefer not to introduce new parameters whenever it can be avoided, because it is hard or impossible to remove parameters when they become obsolete. We may consider adding a new interface for triggering this garbage collection later, once we have implemented MDEV-29445. It might become feasible to trigger the garbage collection automatically as well.

Comment by Vladislav Vaintroub [ 2023-10-23 ]

Well, until we figured out the interface , maybe we should not do that at all, and introduce it in when we know what we want to do with it.
Currently, set innodb_buffer_pool_size=@@innodb_buffer_pool_size is a no-op, and someone might already rely that it is a no-op, but it becomes a heavy undocumented op in the new version. So we better not change the existing behavior.

Comment by Marko Mäkelä [ 2023-10-24 ]

wlad, are you proposing to essentially revert the new 10.11 feature MDEV-25341 entirely without any replacement?

I believe that SET GLOBAL already requires a privileged user. The new behaviour of setting the parameter to the old value can be documented. The old behaviour is not that lightweight either: it would always spawn buf_resize_task and wait for it to complete.

Comment by Vladislav Vaintroub [ 2023-10-24 ]

marko, revert without replacement, yep, that's the proposal. If we determine that a feature was determined to be not good enough, performance-wise or otherwise, why not? It does not have visible interface. The server returns to being as impolite as any other program. Maybe Linux kernel would learn to shrink process working set, instead of killing processes one day.

Comment by Marko Mäkelä [ 2023-10-24 ]

danblack suggested a better solution: Make MDEV-24670 an InnoDB bug fix, so that the buf_pool_t::garbage_collect() that I added would be automatically invoked when some memory pressure is reported by the operating system. This could avoid an OOM kill without actually changing any virtual address mappings for the buffer pool.

Let us do that, and leave the logic around SET GLOBAL innodb_buffer_pool_size alone. This will miss the scheduled quarterly releases, but will hopefully be finished in the next quarter.

Comment by Daniel Black [ 2023-11-13 ]

Needs advice on improving reliability of the test case, but otherwise complete.

(Hope I didn't miss anything).

Comment by Marko Mäkelä [ 2023-11-17 ]

OK to push after addressing my review comment, which includes a more stable version of the test innodb.mem_pressure.

Generated at Thu Feb 08 10:27:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.