[MDEV-33340] fix for MDEV-24670 causes a performance regression Created: 2024-01-31  Updated: 2024-02-01  Resolved: 2024-02-01

Status: Closed
Project: MariaDB Server
Component/s: N/A
Affects Version/s: N/A
Fix Version/s: N/A

Type: Bug Priority: Blocker
Reporter: Axel Schwenke Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 1
Labels: regression

Attachments: File 25c627885a2-good.svg     File a057a6e41f2-bad.svg     PNG File big_ranges_tp_off.png     PNG File big_ranges_tp_off2.png     PNG File big_ranges_tp_on.png     PNG File big_ranges_tp_on2.png     Text File errlog-bad.txt     Text File errlog-good.txt     File pmp_raw_1706708923.txt.gz     File report-no-children.txt.gz     File report.txt.gz    
Issue Links:
Problem/Incident
is caused by MDEV-24670 avoid OOM by linux kernel co-operativ... Closed

 Description   

The regression test suite found a regression for the t_threadpool* tests. It turned out to be a regression for sorting rows. Further analysis in TODO-4510 traced it to commit a057a6e41f2 for MDEV-24670.

The regression grows with the number of rows to be sorted and is on the order of 3% for 1000 rows. Additionally, query execution times fluctuate more than normal.



 Comments   
Comment by Axel Schwenke [ 2024-01-31 ]

The performance test results for the OLTP order ranges test:

Comment by Marko Mäkelä [ 2024-01-31 ]

MDEV-24670 was joint work between me and danblack. I wrote buf_pool_t::garbage_collect(). Every invocation of it should also execute the following:

  sql_print_information("InnoDB: Memory pressure event freed %zu pages",
                        freed);

Because there are no such messages in the server error log, the problem should not be in this code (which would more likely cause a severe regression later in any performance test, by forcing pages to be read back into the buffer pool). Instead, the problem should be in some code that danblack wrote, such as mem_pressure::setup() or mem_pressure::pressure_routine().

Comment by Daniel Black [ 2024-02-01 ]

[a057a6e41f2](https://github.com/MariaDB/server/commit/a057a6e41f2) is a simple condition in the buffer pool init so that it doesn't get activated for MariaDB-backup.

As both commits log the same "Failed to initialize memory pressure: No such file or directory" message, they both took the same path through this code. (Note: this error message was removed later.)

Because of this error, there isn't even a background thread running (confirmed by PMP). There would have been a little extra processing during init, but no extra CPU time during the run. Even if it had been Ubuntu 22.04, or another OS with cgroups2, the background thread would just be waiting on a poll for an event that shouldn't occur without memory pressure.

Looking at the flame graphs, there's a 0.08 percentage-point difference at start_thread. By the time it gets up to JOIN::exec the difference is 0.25, and going up further to join_init_read_record it is 0.83. Clearly none of this touches any code added by the memory-pressure change.

Given that 7f11fad85a885d148254ca05f508125e3b94339c shows the same performance, there is still a regression somewhere.

Did reverting a057a6e41f2 show the improvement come back?

Yes, there's a regression, but with nothing related to MDEV-24670 showing in the CPU profile, and on an OS platform that (as the logs confirm) doesn't support the functionality MDEV-24670 adds, it's not the problem.

Comment by Axel Schwenke [ 2024-02-01 ]

I retested commit 7f11fad85a8 (the original bad release candidate) with the supposed bad commit a057a6e41f2 reverted. Result:


For threadpool=off this looks like a057a6e41f2 could be the culprit, but for threadpool=on it does not. I also noticed high fluctuations in throughput, meaning the test used for bisecting could have returned bogus numbers.

I will close this ticket and reopen TODO-4510, then bisect again in a different branch, maybe with a better (more stable) test case.

Comment by Axel Schwenke [ 2024-02-01 ]

Looks like this was a false alarm.

Generated at Thu Feb 08 10:38:11 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.