[MDEV-11284] Extremely high cpu usage and slow performance after upgrade from 5.5.52 to 10.1.19 Created: 2016-11-15 Updated: 2017-09-13 Resolved: 2017-09-13 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | OTHER |
| Affects Version/s: | 10.1.19 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Michael Graf | Assignee: | Axel Schwenke |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | 10.1, 5.5, cpu, mariadb, performance, usage |
| Environment: | Windows Server 2008 R2 Enterprise |
| Attachments: | |
| Description |
|
After the upgrade from MariaDB 5.5.52 to 10.1.19 the performance of our application is about 5 times slower than before. It is noticeable that MariaDB 10.1.19 has a CPU usage of 95%. I appended 2 pictures which show the difference between these 2 versions. The settings of my.ini are appended. |
| Comments |
| Comment by Axel Schwenke [ 2016-11-17 ] |
|
Hello Michael, this could be anything. Maybe some query from your application is using a different execution plan in 10.1. From your my.ini I see that you have enabled the slow query log. Please check the slow query log and/or attach it to this issue for inspection by us. |
| Comment by Michael Graf [ 2016-11-18 ] |
|
Hello Axel, instead of the slow query log I have measured the times of all SQL statements which are used to generate 1 event; the results for 10.1.19 and 5.5.52 are in the attached "TimeMeasurement for SQLStatements" files. The database server consists of several databases. I have appended the SQL statements which are used to generate 1 event. The text files also contain the "show create table" statements of the used tables. Maybe the low performance of 10.1.19 has to do with the large number of tables in database eventmonitor. |
| Comment by Axel Schwenke [ 2016-11-22 ] |
|
Hi Michael, if you indeed have 10.000 active tables, then your table cache is much too small. I also notice a mismatch in the configuration of the two related caches: table_open_cache=869 vs. table_definition_cache=388202. Normally the table_open_cache has to be bigger than the table_definition_cache. The latter cache contains table definitions for open(ed) tables; there cannot be more such definitions than you have tables. The table_open_cache, however, needs one entry for each open(ed) instance of each table. I.e. if you have a query that joins a table with itself, two instances of that table are opened and two slots in the table_open_cache are used. As a rule of thumb, the size of the table_open_cache should be: number of active tables * average number of tables in a join. So you shouldn't go below 10.000 for the table_open_cache; I would even suggest 40.000. Can you please test whether increasing the table_open_cache makes a difference? For one thing, opening tables can be costly. Secondly, the table_open_cache is an area that has seen big changes in MariaDB 10.1. |
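Applied to my.ini, the sizing rule above would look roughly like this (a sketch; only the values 869 and 388202 come from this report, the 40000 is Axel's suggested starting point):

```ini
[mysqld]
# rule of thumb: table_open_cache >= active tables * avg. tables per join
# (was 869, far too small for ~10.000 active tables)
table_open_cache=40000
# one slot per distinct table definition is enough
table_definition_cache=388202
```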
| Comment by Michael Graf [ 2016-11-24 ] |
|
Hi Axel, the value of table_open_cache was determined a few years ago when Windows only allowed 2000 opened files. I think the value of table_definition_cache=388202 is correct, because there really are that many tables.
Now to the results of the different values for table_open_cache and table_definition_cache:
With table_open_cache=40000 and table_definition_cache=40000 the start of the application took 8m20s.
With table_open_cache=100000 and table_definition_cache=100000 the start of the application took too long.
With table_open_cache=200000 and table_definition_cache=200000 the start of the application took 4m30s.
With table_open_cache=400000 and table_definition_cache=400000 the start of the application took 4m30s.
Summary: with MariaDB 5.5 the start time with 160000 tables is 4m30s, and this with a value of only 869 for table_open_cache. |
| Comment by Axel Schwenke [ 2016-11-29 ] |
|
Michael, thank you for doing those tests. It seems clear to me that the problem is not the execution of your SQL queries, but the table open operation. I will try to come up with a reproducible test case and hand the issue over to our developers.
As for setting the cache sizes: you can and should size those caches independently. If you found that you need the table definition cache to have 380.000 slots, then it is safe to stay with it. You just need to set the size of the table open cache to something that works for you; in general, bigger numbers should be better.
When I try to summarize your numbers above, they don't seem conclusive though:
caches=40.000
caches=100.000
caches=200.000
caches=400.000
And performance in the synthetic test (10K tables) was good in all cases. Did I summarize this correctly? What bugs me is that going from 200.000 to 400.000 actually decreases performance, where I had expected an increase. Finally, I'd like to say that having that many tables in a single database instance is very unusual and I wonder if your data model is correct. |
| Comment by Michael Graf [ 2016-11-30 ] |
|
Hi Axel,
> Did I summarize this correctly?
The performance from 200.000 to 400.000 doesn't decrease for the performance test. The performance which I mean goes down is within our client application, which we use to configure our application. [timing figures: MariaDB 5.5.52, MariaDB 10.1.19, MariaDB 10.1.19] I know that the values are inconsistent, but that is what I can observe. I have attached the CPU usage of the startup time and performance test for all 3 variations. |
| Comment by Axel Schwenke [ 2016-11-30 ] |
|
Michael, thanks for the clarification. It's funny that even with a 400K table open cache you still see stalls. There certainly is some problem here. I have two more questions:
1. what engine are your tables using? I guess InnoDB with file_per_table=false?
2. can you please collect server status variables like so
Please do that at a time when your application runs under normal load. I'm on vacation until next weekend (12th). So don't worry if I don't answer. |
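The concrete command after "like so" did not survive the export. A typical way to sample the table-cache-related status variables (my assumption, not necessarily the exact statement from the original comment) would be:

```sql
-- hypothetical example; these status/variable names exist in MariaDB 5.5/10.1
SHOW GLOBAL STATUS LIKE 'Open%tables';            -- Open_tables, Opened_tables
SHOW GLOBAL STATUS LIKE 'Open%table_definitions'; -- Open(ed)_table_definitions
SHOW GLOBAL VARIABLES LIKE 'table_%cache%';       -- configured cache sizes
```

A steadily climbing Opened_tables under normal load is the usual sign that the table cache is too small and tables are being reopened constantly.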
| Comment by Sergey Vojtovich [ 2016-11-30 ] |
|
It is likely that table eviction is the culprit in both cases (startup and benchmark). That is, when we run over the table_open_cache limit, we have to evict the least recently used table from the cache and put the new table in instead. In 5.5 this was reasonably fast: we had a pointer to the LRU table. But in 10.0/10.1 we don't maintain this LRU list and have to go through all tables, which is very expensive taking into account the number of tables you have. You may try reducing both table_open_cache and table_definition_cache down to e.g. 100; I won't be surprised if you get better performance this way. This problem was fixed in 10.2, where performance should be similar to 5.5. |
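A minimal sketch of the difference Sergey describes (my illustration, not the server's actual code): with a maintained LRU list the eviction victim is found in O(1), while without it every cached table must be scanned on each eviction, i.e. O(n) work per table open once the cache is full.

```python
from collections import OrderedDict

class TableCacheLRU:
    """Cache with a maintained usage order (like 5.5): eviction is O(1)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # least recently used entry is first

    def open_table(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)        # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict LRU entry: O(1)
            self.cache[name] = True

class TableCacheScan:
    """Cache without an LRU list (like 10.0/10.1): eviction scans everything."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}  # table name -> last-use counter
        self.clock = 0

    def open_table(self, name):
        self.clock += 1
        if name not in self.cache and len(self.cache) >= self.capacity:
            # find the least recently used table by scanning all entries: O(n)
            victim = min(self.cache, key=self.cache.get)
            del self.cache[victim]
        self.cache[name] = self.clock
```

With ~160000 tables and a full cache, the O(n) scan runs on almost every table open, which matches the high CPU usage reported above.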
| Comment by Michael Graf [ 2016-12-01 ] |
|
Hi Axel, the machine I did the tests on was upgraded from Windows Server 2008 R2 Enterprise to Windows Server 2012 R2 Standard because we had to do some tests for a customer. Since the upgrade to Windows Server 2012 R2 Standard I can no longer reproduce the problems with the click performance.
> 1. what engine are your tables using? I guess InnoDB with file_per_table=false?
> 2. can you please collect server status variables like so |
| Comment by Michael Graf [ 2016-12-01 ] |
|
Hi Sergey, here are the results for MariaDB 10.1.19: the CPU usage is low, as you supposed, and the startup is between the best 4m30s and the worst 8m20s. The performance test gives the best value with 1800. I couldn't test MariaDB 10.2.2 because it crashed during startup. |
| Comment by Sergey Vojtovich [ 2016-12-01 ] |
|
migra, did you try lowering table_definition_cache as well? I guess you won't see a big difference with a low table_open_cache alone. Do you have any trace for that crash? We have fixed many issues since 10.2.2, so it would be nice to check whether you're hitting an existing issue or not. |
| Comment by Michael Graf [ 2016-12-02 ] |
|
Hi Sergey,
> did you try to lower down table_definition_cache as well?
I made 2 minidumps of the crash. |
| Comment by Sergey Vojtovich [ 2016-12-02 ] |
|
wlad, could you extract traces from the above dumps? |
| Comment by Vladislav Vaintroub [ 2016-12-02 ] |
|
As far as I can tell, one of those was a manually taken dump, where one thread holds an MDL lock (most likely FLUSH TABLES WITH READ LOCK) and the others wait. This one: https://jira.mariadb.org/secure/attachment/42950/callstacks_10_1_19.txt |
| Comment by Sergey Vojtovich [ 2016-12-02 ] |
|
So was that a 10.1->10.2 binary upgrade? It indeed didn't work with 10.2.2, but I hope it will be fixed in 10.2.3. In fact I vaguely remember we had some pushes regarding that recently. |
| Comment by Michael Graf [ 2016-12-02 ] |
|
mysqld.exe.2160.dmp was generated with WER. mysqld.exe_161128_064117.dmp was generated with procdump.exe. |
| Comment by Michael Graf [ 2016-12-02 ] |
|
> So was that 10.1->10.2 binary upgrade? |
| Comment by Michael Graf [ 2016-12-02 ] |
|
The Minidump (Plus) created with procdump.exe says INT_DIVIDE_BY_ZERO.
d:\leutek\mariadb\bin>c:\uti\sysinternals\procdump.exe -mp -e 1 -f "" -x d:\minidumps mysqld.exe --defaults-file=d:\leutek\mariadb\my.ini --console
I attached the whole output of procdump.exe and the minidump (Plus) mysqld.exe_161202_180716.001. |
| Comment by Vladislav Vaintroub [ 2016-12-02 ] |
|
Right, this is tracked in another bug filed by you. |
| Comment by Vladislav Vaintroub [ 2016-12-02 ] |
|
Also, unless it is a big deal on your side, it is somewhat easier for us when dump files are compressed (zip, 7zip, whatever) rather than split. A minidump would usually suffice for crashes. |
| Comment by Michael Graf [ 2016-12-02 ] |
|
Next time I will zip it. |
| Comment by Axel Schwenke [ 2017-09-13 ] |
|
The problem was never reproduced. It could have been the same as |