[MDEV-23017] range query performance regression in 10.5.4 Created: 2020-06-25 Updated: 2020-07-02 Resolved: 2020-07-02
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5.4 |
| Fix Version/s: | 10.5.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Axel Schwenke | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: | MDEV-23017.pdf |
|
| Issue Links: |
|
| Description |
|
The regression test suite reports rather severe performance regressions in MariaDB 10.5.4 vs. 10.5.3. The effect appears to be genuine and hits the sysbench OLTP range queries. Example:
|
| Comments |
| Comment by Axel Schwenke [ 2020-06-25 ] |
|
The regression does not exist with multiple (8) tables.
| Comment by Axel Schwenke [ 2020-06-30 ] |
|
It turns out that the distinct range query from sysbench is the one that is least affected. The other range queries show a much larger regression, even the simplistic SELECT c FROM table WHERE id BETWEEN const AND const (with c a CHAR(120) column and id an INT primary key).
| Comment by Axel Schwenke [ 2020-06-30 ] |
|
I'll bisect the commits between 10.5.3 and 10.5.4. Test case: simple range queries at 32 threads. It turns out that this test does not reproduce the problem.
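The bisection described here can be driven by git's built-in bisect machinery. This is a hypothetical sketch, not the harness actually used: the helper script `bisect-check.sh` (build the server, run the range-query benchmark, exit by throughput threshold) is an assumed placeholder, and the whole thing requires a MariaDB source tree to run.

```
# Hypothetical bisection sketch; bisect-check.sh is a placeholder
# script that builds the checked-out commit, runs the benchmark,
# and exits 0 ("good") or 1 ("bad") based on the measured tps.
cd mariadb-server
git bisect start mariadb-10.5.4 mariadb-10.5.3   # <bad> <good>
git bisect run ./bisect-check.sh
git bisect reset                                  # return to the original branch
```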
| Comment by Axel Schwenke [ 2020-07-01 ] |
|
More tests give more negative feedback:
Next idea: could it be related to loading the tables fresh vs. using an existing datadir?
| Comment by Axel Schwenke [ 2020-07-01 ] |
|
Bingo! The regression only exists when one loads the database tables and runs the benchmark immediately afterwards, without restarting the server in between. I am demoting this issue from Critical to Major, since it is not very likely that a customer will hit this problem.
| Comment by Vladislav Vaintroub [ 2020-07-01 ] |
|
It is likely purge activity that interferes with the workload. Once purge is over, the workload becomes truly "read-only".
| Comment by Marko Mäkelä [ 2020-07-01 ] |
|
| Comment by Axel Schwenke [ 2020-07-01 ] |
|
I don't believe purge is the culprit. In any case, I force a complete InnoDB checkpoint after filling the tables and wait for innodb_buffer_pool_pages_dirty to drop below 100. Also, purge would end sooner or later (that benchmark is read-only), but even after 10 or 15 minutes of runtime I don't see any change in performance.
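The checkpoint-then-wait step could be scripted roughly as below. This is a sketch under assumptions: the report does not say how the checkpoint was forced, so flushing via innodb_max_dirty_pages_pct is my choice, connection options are omitted, and the 100-page threshold is taken from the comment. It requires a running MariaDB server.

```
# Make InnoDB flush aggressively, then poll the dirty-page counter
# until fewer than 100 dirty pages remain in the buffer pool.
mysql -e "SET GLOBAL innodb_max_dirty_pages_pct = 0;"
while true; do
  dirty=$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';" | awk '{print $2}')
  [ "$dirty" -lt 100 ] && break
  sleep 1
done
mysql -e "SET GLOBAL innodb_max_dirty_pages_pct = DEFAULT;"
```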
| Comment by Vladislav Vaintroub [ 2020-07-01 ] |
|
Hmm, what is the culprit then? Apart from purge I can't think of anything else at the moment. Is there any perf data from a slow run?
| Comment by Axel Schwenke [ 2020-07-01 ] |
|
I verified that it is not purge. To do so, I started a dummy transaction with START TRANSACTION WITH CONSISTENT SNAPSHOT even before loading the tables. Then I ran the benchmark and finally killed the client holding the open transaction. This should reliably stop all purge activity during the benchmark.
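The purge-blocking trick can be reproduced with a background client that holds the snapshot open. A minimal sketch, assuming a running server and a `test` database; the sleep duration is arbitrary:

```
# Open a read view before loading the tables; while this client lives,
# purge cannot remove history created after the snapshot was taken.
mysql test <<'SQL' &
START TRANSACTION WITH CONSISTENT SNAPSHOT;
DO SLEEP(100000); -- keep the connection (and the snapshot) open
SQL
SNAPSHOT_PID=$!

# ... load the tables and run the read-only benchmark here ...

kill $SNAPSHOT_PID   # release the snapshot so purge can resume
```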
| Comment by Axel Schwenke [ 2020-07-01 ] |
|
I traced the regression through the commits between tags mariadb-10.5.3 and mariadb-10.5.4. The first bad commit is b1ab211dee599eabd9a5b886fafa3adea29ae041. Steps to reproduce: load the sysbench tables:
run the benchmark:
No special my.cnf is needed. A good commit gives ~20000 tps, a bad one ~3500 tps. If the server is stopped and restarted with the existing datadir, a bad commit gives the same performance as a good one.
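The elided sysbench invocations were presumably along these lines; the exact OLTP script, table size, and credentials are guesses, not taken from the report. The essential point from the comments is that `prepare` and `run` happen against the same server instance with no restart in between.

```
# Load one sysbench table, then immediately run the read-only
# workload at 32 threads against the still-running server.
sysbench oltp_read_only --mysql-user=sbtest --mysql-password=sbtest \
  --mysql-db=sbtest --tables=1 --table_size=1000000 prepare
sysbench oltp_read_only --mysql-user=sbtest --mysql-password=sbtest \
  --mysql-db=sbtest --tables=1 --table_size=1000000 \
  --threads=32 --time=60 run
```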
| Comment by Vladislav Vaintroub [ 2020-07-01 ] |
|
Oh, that's an elephant-sized commit.
| Comment by Marko Mäkelä [ 2020-07-01 ] |
|
Based on some perf record traces, the problem is in the following:
The condition is actually negated! And I do not think that buf_page_optimistic_get() ever needs to initiate read-ahead. axel, can you please test the following patch:
| Comment by Axel Schwenke [ 2020-07-02 ] |
|
That looks good. I have rerun all the tests from the regression test suite that showed a regression, and now all numbers are back to normal. See attachment MDEV-23017.pdf.