[MDEV-32351] Significant slowdown for query with many outer joins Created: 2023-10-04 Updated: 2023-11-01 Resolved: 2023-10-28 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Optimizer, Storage Engine - InnoDB |
| Affects Version/s: | 10.2.38, 10.3.29, 10.4.19, 10.5.10, 10.6.1 |
| Fix Version/s: | 10.4.32, 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cyprien Devillez | Assignee: | Igor Babaev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | performance, regression | ||
| Environment: |
Ubuntu 22.04, Docker 24.0.6 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
SummaryWith MariaDB 10.2.38,10.3.29,10.4.19,10.5.10,10.6.1 and later version, this request (https://github.com/cypx/mariadb-issue/blob/main/request.sql) launched on this database (https://github.com/cypx/mariadb-issue/blob/main/magentodb.sql anonymized tables from a partial export of a Magento 2 database) need 20x more time to finish. ReproduceOn a Linux system with an up-to-date docker install you could clone the following repository https://github.com/cypx/mariadb-issue and launch the ./bench.sh script followed by the MariaDB you want to test (ex: ./bench.sh 10.2.37 ) Example on my laptop (CPU: Intel i7-1065G7, Memory: 16 GB)
|
| Comments |
| Comment by Marko Mäkelä [ 2023-10-04 ] | ||||||||||||||||||||||
|
I think that to find the source of the regression, flame graphs produced from perf record samples would be most useful. Would it be possible for you to upload the flame graphs of the benchmark on MariaDB 10.2.37 and 10.2.38? Another option might be to compare MariaDB 10.4.18 and 10.4.19 (assuming that the same regression is observable between these releases) and to enable the optimizer trace ( | ||||||||||||||||||||||
| Comment by Cyprien Devillez [ 2023-10-05 ] | ||||||||||||||||||||||
|
I added FlameGraph to the ticket. | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-05 ] | ||||||||||||||||||||||
|
cypx, thank you for trying it. One more thing that could be helpful would be the EXPLAIN output for the SELECT statement in each environment. danblack, can you help with the debug symbols? | ||||||||||||||||||||||
| Comment by Cyprien Devillez [ 2023-10-06 ] | ||||||||||||||||||||||
|
Trace and explain added! | ||||||||||||||||||||||
| Comment by Daniel Black [ 2023-10-07 ] | ||||||||||||||||||||||
|
https://github.com/cypx/mariadb-issue/pull/1 rough explaination. | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-08 ] | ||||||||||||||||||||||
|
cypx, thank you. I do not see a huge difference between the optimizer trace and the EXPLAIN SELECT output between 10.4.18 and 10.4.19. Did you reproduce the 20× performance difference between those 2 versions? | ||||||||||||||||||||||
| Comment by Cyprien Devillez [ 2023-10-08 ] | ||||||||||||||||||||||
|
@danblack thank for the explanation, I think understand, I will retry tomorrow. @marko Yes I reproduce, It took less than 5mn to get the 1.4.18 trace and more than one and half hour to get the 1.4.19 one. Also, I realize I don't have say that the issue have been initially spotted on baremetal servers running Debian (so it's not related to docker and/or virtualization side effect) | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-09 ] | ||||||||||||||||||||||
|
cypx, the difference between the end of trace_10.4.18.json I am not an expert at reading this output, but the difference between the numbers of rows or the costs seems to be rather small. I hope that the flame graphs will shed some light on this. I can’t yet rule out InnoDB (my area of expertise) as the culprit of this regression. | ||||||||||||||||||||||
| Comment by Cyprien Devillez [ 2023-10-09 ] | ||||||||||||||||||||||
|
I gave up trying to do perf report with symbols on docker because despite Daniel Black advice and some others documentation found on the web it's not working on my environment. | ||||||||||||||||||||||
| Comment by Daniel Black [ 2023-10-09 ] | ||||||||||||||||||||||
|
__nss_database_lookup stands out, though this appears to be a [false lookup| Only a little innodb shows up in 10.4.18 (index read). I failed to find a significant sample in 10.4.19. | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-10 ] | ||||||||||||||||||||||
|
The function JOIN_CACHE::join_matching_records() stands out in the call graphs. I do not see any significant time being spent in InnoDB. In perf_full_10.4.18_8mn03.svg A search
identified this change that was supposed to fix Would reverting that change fix the regression? | ||||||||||||||||||||||
| Comment by Cyprien Devillez [ 2023-10-10 ] | ||||||||||||||||||||||
|
@marko Good catch! I confirm reverting 90780bb5a949d5e27b39df4adaa1d7b1ad98948e on 10.4.19 fix the regression. On the same VM, it's take 7mn52s to run the query with a reverted 10.4.19 build. | ||||||||||||||||||||||
| Comment by Daniel Black [ 2023-10-10 ] | ||||||||||||||||||||||
|
Raising to blocker priority as the level of regression in the query time is of a very large magnitude and occurring in such a common application like Magento 2 should be given the highest priority possible, especially as the causing commit has been identified. | ||||||||||||||||||||||
| Comment by Igor Babaev [ 2023-10-10 ] | ||||||||||||||||||||||
|
The patch 90780bb5a949d5e27b39df4adaa1d7b1ad98948e fixed an algorithmic bug in the implementation of BNL for outer joins. That bug led to wrong result set for a class of queries. The cause of the bug were missing actions needed for BNL to be valid for any dataset used by an affected query. It may happens that for other datasets that actions are not needed to get the right result set. | ||||||||||||||||||||||
| Comment by Igor Babaev [ 2023-10-11 ] | ||||||||||||||||||||||
|
cypx, please could you confirm that the following piece of your query is as it was intended to be:
If it's so then there is no conditions imposed on eao3. Is it what should be? | ||||||||||||||||||||||
| Comment by Cyprien Devillez [ 2023-10-11 ] | ||||||||||||||||||||||
|
@igor I'm not a Magento developper but as far as I understand it's seems intended and the result match the expectation. | ||||||||||||||||||||||
| Comment by Daniel Black [ 2023-10-13 ] | ||||||||||||||||||||||
|
alice, can I trouble you for a minimised test case of https://github.com/cypx/mariadb-issue | ||||||||||||||||||||||
| Comment by Alice Sherepa [ 2023-10-16 ] | ||||||||||||||||||||||
|
I was afraid to oversimplify the test and to loose the problem, so please also check the initial test case.
on debug build:
90780bb5a949d5e27b39df4adaa1d7b1ad98948e
10.4 0c7af6a2a19343cb9d4fedbd7165b8
| ||||||||||||||||||||||
| Comment by Alice Sherepa [ 2023-10-17 ] | ||||||||||||||||||||||
|
| ||||||||||||||||||||||
| Comment by Daniel Black [ 2023-10-18 ] | ||||||||||||||||||||||
|
Folks, igor has done a first proof of concept fix. This hasn't been reviewed yet, but if you are able to test this on a non-production environment we welcome the feedback. The container image: quay.io/mariadb-foundation/mariadb-devel:10.4-mdev-32351-pkgtest (arm64/amd64) Various packages: https://ci.mariadb.org/39481/ Debian repos can use a repo line:
Adjust arch/distro/distro version as need. RPM autobake directories have a repo file in them that can be used. Feedback welcome. | ||||||||||||||||||||||
| Comment by Cyprien Devillez [ 2023-10-18 ] | ||||||||||||||||||||||
|
With 10.4.32 build, the query takes 08mn 24s on the same VM used for previous testing (8mn 03s with 10.4.18 and 3hours 48mn 28s with 10.4.19). | ||||||||||||||||||||||
| Comment by Daniel Black [ 2023-10-18 ] | ||||||||||||||||||||||
|
Thanks for testing cypx. Good to know the positive case worked. As we arrived at this bug however its making sure all the other queries also perform correctly and fast as well. | ||||||||||||||||||||||
| Comment by Daniel Black [ 2023-10-20 ] | ||||||||||||||||||||||
|
cypx, I should have also asked, the result was correct for the query? | ||||||||||||||||||||||
| Comment by Cyprien Devillez [ 2023-10-23 ] | ||||||||||||||||||||||
|
Daniel Black, The result was correct and it was the same between 10.4.18 & 10.4.32 (diff & md5 match). | ||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2023-10-26 ] | ||||||||||||||||||||||
|
igor, it's a problem that there's no visible test coverage for "Don't unpack the buffer if we have MATCH_IMPOSSIBLE" functionality. I've backported r_unpack_time_ms patch from 10.6: https://github.com/MariaDB/server/commit/a9cf22654e99b6b40d7eedd6e7443c1a2813da17 After that, one can do with a very small patch to add tests for this: https://github.com/MariaDB/server/commit/5745468d17e6f48d136935329a55cc97090024e1 Both patches are in bb-10.4-mdev32351-tests, on top of your fix. What do you think? | ||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2023-10-26 ] | ||||||||||||||||||||||
|
... The fix itself is ok, although it's difficult to understand. |