  MariaDB Server: MDEV-33340

fix for MDEV-24670 causes a performance regression

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Cannot Reproduce

    Description

      The regression test suite found a regression for the t_threadpool* tests. It turned out to be a regression for sorting rows. Further analysis in TODO-4510 traced it to commit a057a6e41f2 for MDEV-24670.

      The regression grows as more rows have to be sorted and is on the order of 3% for 1000 rows. Additionally, query execution times fluctuate more than normal.
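
      A regression of roughly 3% is easy to confuse with run-to-run noise when execution times fluctuate. Below is a minimal C++ sketch (using made-up throughput numbers, not values from this test) of comparing mean and standard deviation over repeated runs before trusting such a difference:

        #include <cmath>
        #include <cstdio>
        #include <vector>

        // Mean and sample standard deviation of a set of benchmark results.
        static void stats(const std::vector<double>& v, double& mean, double& sd)
        {
          mean = 0;
          for (double x : v) mean += x;
          mean /= v.size();
          double var = 0;
          for (double x : v) var += (x - mean) * (x - mean);
          sd = std::sqrt(var / (v.size() - 1));
        }

        int main()
        {
          // Hypothetical queries-per-second figures for repeated runs of the
          // same sort-heavy workload on two builds; purely illustrative.
          std::vector<double> good = {10120, 10240, 10080, 10310, 10150};
          std::vector<double> bad  = {9830, 10190, 9760, 10050, 9900};
          double m_good, sd_good, m_bad, sd_bad;
          stats(good, m_good, sd_good);
          stats(bad, m_bad, sd_bad);
          // If the delta is within the spread of either series, the
          // "regression" may just be noise.
          std::printf("good: %.0f +/- %.0f  bad: %.0f +/- %.0f  delta: %.1f%%\n",
                      m_good, sd_good, m_bad, sd_bad,
                      100.0 * (m_bad - m_good) / m_good);
          return 0;
        }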

      Attachments

        1. 25c627885a2-good.svg
          129 kB
          Axel Schwenke
        2. a057a6e41f2-bad.svg
          127 kB
          Axel Schwenke
        3. big_ranges_tp_off.png
          7 kB
          Axel Schwenke
        4. big_ranges_tp_off2.png
          7 kB
          Axel Schwenke
        5. big_ranges_tp_on.png
          7 kB
          Axel Schwenke
        6. big_ranges_tp_on2.png
          6 kB
          Axel Schwenke
        7. errlog-bad.txt
          3 kB
          Axel Schwenke
        8. errlog-good.txt
          3 kB
          Axel Schwenke
        9. pmp_raw_1706708923.txt.gz
          12 kB
          Axel Schwenke
        10. report.txt.gz
          38 kB
          Axel Schwenke
        11. report-no-children.txt.gz
          15 kB
          Axel Schwenke

        Issue Links

          Activity

            axel Axel Schwenke added a comment -

            The performance test results for the OLTP order ranges test:


            marko Marko Mäkelä added a comment -

            MDEV-24670 was joint work of myself and danblack. I wrote the buf_pool_t::garbage_collect(). Every invocation of it should also invoke the following:

              sql_print_information("InnoDB: Memory pressure event freed %zu pages",
                                    freed);
            

            Because there are no such messages in the server error log, the problem should not be in this code (which would likely have caused a more severe regression later during any performance test, by forcing pages to be read back into the buffer pool). Instead, the problem should be in some code that danblack wrote, such as mem_pressure::setup() or mem_pressure::pressure_routine().
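
            That check can be automated against the attached error logs (errlog-good.txt and errlog-bad.txt). A minimal sketch, assuming the logs are plain text and contain the message in the format quoted above:

              #include <fstream>
              #include <iostream>
              #include <string>

              int main(int argc, char** argv)
              {
                // Count "Memory pressure event freed" lines in an error log,
                // e.g. the attached errlog-good.txt / errlog-bad.txt.
                const char* path = argc > 1 ? argv[1] : "errlog-bad.txt";
                std::ifstream in(path);
                std::string line;
                unsigned long hits = 0;
                while (std::getline(in, line))
                  if (line.find("Memory pressure event freed") != std::string::npos)
                    ++hits;
                // Zero hits means buf_pool_t::garbage_collect() never ran.
                std::cout << path << ": " << hits << " memory pressure events\n";
                return 0;
              }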

            danblack Daniel Black added a comment -

            [a057a6e41f2](https://github.com/MariaDB/server/commit/a057a6e41f2) is a simple condition in the buffer pool init so that it doesn't get activated for MariaDB-backup.
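
            Roughly the shape of the guard being described, as a standalone illustration only; every identifier below (buffer_pool_init_sketch, running_as_backup, mem_pressure_init) is a placeholder, not an actual name from the commit.

              #include <iostream>

              // Placeholder for the memory-pressure listener setup; in the
              // server this is where the background thread discussed below
              // would be started.
              static void mem_pressure_init()
              {
                std::cout << "memory pressure listener started\n";
              }

              // Hypothetical sketch of a buffer pool init that skips the
              // memory-pressure setup when invoked from a backup tool.
              static void buffer_pool_init_sketch(bool running_as_backup)
              {
                // ... regular buffer pool initialisation would happen here ...
                if (!running_as_backup)
                  mem_pressure_init();
              }

              int main()
              {
                buffer_pool_init_sketch(true);   // mariadb-backup: no listener
                buffer_pool_init_sketch(false);  // server: listener starts
                return 0;
              }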

            As both commits' error logs contain the same "Failed to initialize memory pressure: No such file or directory" message, both took the same path through this code. (Note that this error message was removed later.)

            Because of this error, there isn't even a background thread running (confirmed by PMP). There would have been a little extra processing in init, but no extra CPU time during the run. Even if it had been Ubuntu 22.04, or an OS with cgroups v2, the background thread would just be waiting on a poll for an event that shouldn't happen without memory pressure.
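
            For reference, this is the kind of wait the listener performs. A minimal standalone sketch of the Linux PSI trigger mechanism, written here against /proc/pressure/memory; under cgroup v2 the same interface is exposed per cgroup as a memory.pressure file, which is where the cgroups2 dependency mentioned above comes from:

              #include <fcntl.h>
              #include <poll.h>
              #include <stdio.h>
              #include <string.h>
              #include <unistd.h>

              int main()
              {
                // Register a PSI trigger: notify when tasks stall on memory
                // ("some") for >= 150 ms within a 1 s window.
                const char *path = "/proc/pressure/memory";
                int fd = open(path, O_RDWR | O_NONBLOCK);
                if (fd < 0)
                {
                  // Without PSI support the open fails with ENOENT, i.e.
                  // "No such file or directory", matching the message above.
                  perror(path);
                  return 1;
                }
                const char trig[] = "some 150000 1000000";
                if (write(fd, trig, strlen(trig) + 1) < 0)
                {
                  perror("write trigger");
                  close(fd);
                  return 1;
                }
                struct pollfd pfd;
                pfd.fd = fd;
                pfd.events = POLLPRI;
                pfd.revents = 0;
                // The listener spends its life blocked here; with no memory
                // pressure the poll never fires and costs no CPU.
                if (poll(&pfd, 1, 10000 /* ms */) > 0 && (pfd.revents & POLLPRI))
                  puts("memory pressure event");
                close(fd);
                return 0;
              }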

            Looking at the flame graphs, there's a 0.08 percentage-point difference at start_thread. By the time it gets up to JOIN::exec the difference is 0.25 percentage points, and going up further to join_init_read_record it is 0.83. Obviously we haven't touched any of the code added for memory pressure.

            Obviously with 7f11fad85a885d148254ca05f508125e3b94339c showing the same performance, there's still a regression there.

            Did reverting a057a6e41f2 show the improvement come back?

            Yes, there's a regression, but with nothing related to MDEV-24670 showing in the CPU profile, on an OS platform that doesn't support the functionality that MDEV-24670 adds (confirmed by the logs), it's not the problem.

            axel Axel Schwenke added a comment -

            I retested commit 7f11fad85a8 (the original bad release candidate) with the supposedly bad commit a057a6e41f2 reverted. Result:


            For threadpool=off this looks like a057a6e41f2 could be the culprit, but for threadpool=on it does not. I also noticed high fluctuations in throughput, meaning the test used for bisecting could have returned bogus numbers.

            I will close this ticket and reopen TODO-4510, then bisect again in a different branch, maybe with a better (more stable) test case.

            axel Axel Schwenke added a comment -

            Looks like this was a false alarm.


            People

              Assignee: Unassigned
              Reporter: Axel Schwenke
              Votes: 1
              Watchers: 8

