[MDEV-11284] Extremely high cpu usage and slow performance after upgrade from 5.5.52 to 10.1.19 Created: 2016-11-15  Updated: 2017-09-13  Resolved: 2017-09-13

Status: Closed
Project: MariaDB Server
Component/s: OTHER
Affects Version/s: 10.1.19
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Michael Graf Assignee: Axel Schwenke
Resolution: Cannot Reproduce Votes: 0
Labels: 10.1, 5.5, cpu, mariadb, performance, usage
Environment:

Windows Server 2008 R2 Enterprise


Attachments: PNG File MariaDB 10.1.19 caches 40000.png     PNG File MariaDB 10.1.19 caches 400000.png     PNG File MariaDB_5.5.52.png     PNG File Performance of MariaDB 10.1.19.png     PNG File Performance of MariaDB 5.5.52.png     Text File TimeMeasurement for SQLStatements of MariaDB 10.1.19.txt     File TimeMeasurement for SQLStatements of MariaDB 5.5.52     Text File callstacks_10_1_19.txt     File my.ini     File mysqld.exe.2160.001     File mysqld.exe.2160.002     File mysqld.exe.2160.crc     File mysqld.exe_161128_064117.001     File mysqld.exe_161128_064117.002     File mysqld.exe_161128_064117.crc     File mysqld.exe_161202_180716.001     File mysqld.exe_161202_180716.002     File mysqld.exe_161202_180716.003     File mysqld.exe_161202_180716.004     File mysqld.exe_161202_180716.005     File mysqld.exe_161202_180716.006     File mysqld.exe_161202_180716.007     File mysqld.exe_161202_180716.008     File mysqld.exe_161202_180716.009     File mysqld.exe_161202_180716.010     File mysqld.exe_161202_180716.011     File mysqld.exe_161202_180716.012     File mysqld.exe_161202_180716.013     File mysqld.exe_161202_180716.014     File mysqld.exe_161202_180716.015     File mysqld.exe_161202_180716.crc     Text File procdump_output.txt     Text File status.txt    

 Description   

After upgrading from MariaDB 5.5.52 to 10.1.19, the performance of our application is about 5 times slower than before.

It is noticeable that MariaDB 10.1.19 has a cpu usage of 95%.
With MariaDB 5.5.52 the cpu usage was only about 5%.

I have attached 2 pictures which show the difference between these 2 versions.
For both versions we used the same server with the same test application.
The only difference is the upgraded MariaDB.

The settings of my.ini are appended.



 Comments   
Comment by Axel Schwenke [ 2016-11-17 ]

Hello Michael,

This could be anything. Maybe some query from your application is using a different execution plan in 10.1. From your my.ini I see that you have enabled the slow query log. Please check the slow query log and/or attach it to this issue so we can inspect it.
Hopefully you also have the slow query log from MariaDB 5.5. Then you (or we) could check whether specific queries run slow in 10.1 but not in 5.5.
Do you have control over the SQL queries that your application sends? In many cases the easiest solution is to rewrite a query that runs slow.

Comment by Michael Graf [ 2016-11-18 ]

Hello Axel,

Instead of the slow query log, I have measured the times of all SQL statements that are used to generate one event.

Here are the results:

SQL statement              10.1.19   5.5.52
---------------------------------------------------
SQL statement 1 - SELECT   0.137s    0.001s
SQL statement 2 - SELECT   0.146s    0.001s
SQL statement 3 - SELECT   0.059s    0.000s (<1ms)
SQL statement 4 - SELECT   0.060s    0.000s (<1ms)
SQL statement 5 - BEGIN    0.002s    0.000s (<1ms)
SQL statement 6 - UPDATE   0.084s    0.000s (<1ms)
SQL statement 7 - SELECT   0.127s    0.000s (<1ms)
SQL statement 8 - DELETE   0.001s    0.000s (<1ms)
SQL statement 9 - COMMIT   0.001s    0.000s (<1ms)

The database server consists of several databases.
For the event test, 2 databases are used: zissystem and eventmonitor.
The database zissystem contains the configuration data of our application.
The database eventmonitor contains the tables for the events.
It contains 320,000 tables (EUStatus00000100000_Pending/History - EUStatus_00000260000_Pending/History), of which 10,000 tables are written by the test.
Each eventmonitor table holds only !!! 1 !!! event/row, which is updated.
Only the "Pending" tables are written to (UPDATE); the "History" tables are not used.
50 threads generate events in parallel. Each thread writes one event after another into 200 tables.
Each thread writes only to "its" 200 tables and not to the tables of the other threads.
Once a new event has been written (UPDATE) into all 200 tables, the thread starts again.
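The write pattern described above can be sketched roughly as follows (hypothetical Python, not the reporter's actual test code: the `execute()` stub, the zero-padded table names, and the elided UPDATE body are illustrative assumptions based on the description):

```python
import threading

# Sketch of the benchmark: 50 threads, each cycling over its own disjoint
# set of 200 "Pending" tables and writing one event (UPDATE) per table.
# execute() is a stand-in for sending the SQL to the MariaDB server.

THREADS = 50
TABLES_PER_THREAD = 200

def execute(sql, log):
    log.append(sql)  # placeholder for a real client call

def worker(thread_id, rounds, log):
    base = thread_id * TABLES_PER_THREAD
    for _ in range(rounds):
        # each thread touches only "its" 200 tables
        for i in range(TABLES_PER_THREAD):
            table = f"EUStatus_{100000 + base + i:011d}_Pending"
            execute(f"UPDATE {table} SET ... WHERE ...", log)

logs = []
threads = []
for t in range(THREADS):
    log = []
    logs.append(log)
    th = threading.Thread(target=worker, args=(t, 1, log))
    threads.append(th)
    th.start()
for th in threads:
    th.join()
```

The point of the sketch is the access pattern: with 10,000 distinct tables being opened and written concurrently, the server's table cache behavior dominates, which becomes relevant in the comments below.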

I have attached the SQL statements which are used to generate 1 event.

The text files also contain the "show create table" statements of the used tables.

Maybe the low performance of 10.1.19 has to do with the large number of tables in the database eventmonitor.
The my.ini files used for 10.1.19 and 5.5.52 are the same; only the MariaDB binaries were changed.

TimeMeasurement for SQLStatements of MariaDB 5.5.52 TimeMeasurement for SQLStatements of MariaDB 10.1.19.txt

Comment by Axel Schwenke [ 2016-11-22 ]

Hi Michael,

If you indeed have 10,000 active tables, then your table cache is much too small. I also notice a mismatch in the configuration of the two related caches:

table_open_cache=869
table_definition_cache=388202

Normally the table_open_cache has to be bigger than the table_definition_cache. The latter cache contains table definitions for opened tables; there cannot be more such definitions than you have tables. The table_open_cache, however, needs one entry for each opened instance of each table. I.e. if you have a query that joins a table with itself, two instances of that table are opened and two slots in the table_open_cache are used.

As a rule of thumb, the size of the table_open_cache should be (number of active tables) * (average number of tables in a join). So you shouldn't go below 10,000 for the table_open_cache. I would even suggest 40,000.
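The rule of thumb can be expressed as a small calculation (a sketch; the helper function is mine, not a MariaDB tool):

```python
# Hypothetical helper expressing the rule of thumb above:
# table_open_cache ~ active tables * average number of tables per join.
def recommended_table_open_cache(active_tables, avg_tables_per_join):
    return active_tables * avg_tables_per_join

# 10,000 active tables, queries touching ~4 table instances on average:
print(recommended_table_open_cache(10_000, 4))  # prints 40000
```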

Can you please test whether increasing the table_open_cache makes a difference? For one thing, opening tables can be costly. For another, the table_open_cache is an area that has seen big changes in MariaDB 10.1.

Comment by Michael Graf [ 2016-11-24 ]

Hi Axel,

The value of table_open_cache was determined a few years ago, when Windows only allowed 2,000 open files.
Because the performance of MariaDB 5.5 was good, I never looked at this value again.
But with MariaDB 10.1 this parameter now seems to be very critical.

I think the value of table_definition_cache=388202 is correct, because there really are that many tables, and
on a production system all of these tables are used. Only the performance test, which we use to get a quick
performance figure, uses 10,000 tables. But when our application starts, and before the performance test
is started, all 160,000 EUStatus_Pending tables are read. Ten minutes after our application has started,
the 160,000 EUStatus_History tables are also read.

Now to the results of the different values for table_open_cache and table_definition_cache.

With table_open_cache=40000 and table_definition_cache=40000, the start of the application took 8m20s.
The start with MariaDB 10.1 (until all 160,000 EUStatus_Pending tables are read) takes almost twice as long as with MariaDB 5.5.
With MariaDB 5.5 the start takes about 4m30s.
The performance test is as fast as with MariaDB 5.5.
--> The start time is much too long

With table_open_cache=400000 and table_definition_cache=400000, the start of the application took 4m30s.
The performance test is as fast as with MariaDB 5.5.
But when I click on different groups in our client application, the response time of MariaDB is bad.
The time until I get a response to a single query is 10s and longer.
With MariaDB 5.5 the response time is within milliseconds.
--> The response time in our client application is far too slow

With table_open_cache=200000 and table_definition_cache=200000, the start of the application took 4m30s.
The performance test is as fast as with MariaDB 5.5.
But when I click on different groups in our client application, the response time of MariaDB is bad.
The time until I get a response to a single query is 1s and longer.
With MariaDB 5.5 the response time is within milliseconds.
--> The response time in our client application is still too slow

With table_open_cache=100000 and table_definition_cache=100000, the start of the application took too long.
--> I cancelled the start after 10 minutes

Summary:
It seems that table_open_cache must be set > 160000 to get start times around 4m30s.
This means the value must be at least as high as the number of tables that are read during the start phase to get good startup performance.
But when the value is that high, the response times for single query requests get worse.
Instead of a few milliseconds, the response time is >1s while the performance test is running (which is the normal case at our customers: the EUStatus_Pending/History tables are constantly written and read).
If I set table_open_cache < 160000 I get bad start times of > 8 minutes (almost twice as long as with MariaDB 5.5).

With MariaDB 5.5 the start time with 160000 tables is 4m30s, and that with table_open_cache set to only 869.
Why is the performance of MariaDB 10.1 now so dependent on table_open_cache?
To what value do we have to set table_open_cache (or other parameters) to get the same performance as with MariaDB 5.5?

Comment by Axel Schwenke [ 2016-11-29 ]

Michael,

Thank you for running those tests. It seems clear to me that the problem is not the execution of your SQL queries, but the table open operation. I will try to come up with a reproducible test case and hand the issue over to our developers.

As for setting the cache sizes: you can and should size those caches independently. If you found that you need the table definition cache to have 380,000 slots, then it is safe to stay with that. You just need to set the size of the table open cache to something that works for you. In general, bigger should be better.

When I try to summarize your numbers above, they don't seem to be conclusive though:

caches=40.000

  • startup 8:20
  • performance ~10s

caches=100.000

  • startup takes too long

caches=200.000

  • startup time ok (4:30)
  • performance ~ 1s

caches=400.000

  • startup time ok (4:30)
  • performance ~ 10s

And performance in the synthetic test (10K tables) was good in all cases. Did I summarize this correctly?

What bites me is that going from 200,000 to 400,000 actually decreases performance, where I had expected an increase.

Finally I'd like to say that having that many tables in a single database instance is very unusual and I wonder if your data model is correct.

Comment by Michael Graf [ 2016-11-30 ]

Hi Axel,

>Did I summarize this correctly?
Yes.

The performance from 200,000 to 400,000 doesn't decrease for the performance test.
But this test only measures the number of events/s.

The performance that goes down is within our client application, which we use to configure our application.
There are several views, and for each view a simple SELECT is executed against the same MariaDB server on which the performance test is running.
When I click in the client application and switch between different views (while the performance test is running), I have to wait noticeably for the query response.
The client response time gets worse the higher the value of table_open_cache is.

MariaDB 5.5.52
table_open_cache=869
startup time: 4m30
cpu usage of MariaDB during startup: 5%
cpu usage of MariaDB during performance test: 13%
events/s: 1800
client click performance: < 1s (sporadic single SQL queries)
--> all ok

MariaDB 10.1.19
caches=40000
startup time: 8m20s
cpu usage of MariaDB during startup: 40%
cpu usage of MariaDB during performance test: 15%
events/s: 1800
client click performance: < 1s (sporadic single SQL queries)
--> startup time and cpu usage are not good

MariaDB 10.1.19
caches=400000
startup time: 4m30
cpu usage during startup: 7%
cpu usage during performance test: 12%
events/s: 1800
client click performance: 0-10s (sporadic single SQL queries)
--> the response time of sporadic SQL queries is not good
We have to dig deeper to understand why the response time is so "high"

I know that the values are inconsistent but that is what I can observe.

I have attached the cpu usage of the startup time and performance test for all 3 variations.

Comment by Axel Schwenke [ 2016-11-30 ]

Michael,

thanks for the clarification. It's funny that even with 400K table open cache you still see stalls. There certainly is some problem here. I have two more questions:

  1. what engine are your tables using? I guess InnoDB with file_per_table=false?
  2. can you please collect server status variables like so

mysql> tee status.txt
mysql> show global variables;
mysql> show global status;
mysql> do sleep(60);
mysql> show global status;
mysql> notee

Please do that at a time when your application runs under normal load.

I'm on vacation until next weekend (12th). So don't worry if I don't answer.

Comment by Sergey Vojtovich [ 2016-11-30 ]

It is likely that table eviction is the culprit in both cases (startup and benchmark). That is, when we run over the table_open_cache limit, we have to evict the least recently used table from the cache and put the new table in its place. In 5.5 this was reasonably fast: we had a pointer to the LRU table. But in 10.0/10.1 we don't maintain this LRU list and have to go through all tables. That is very expensive given the number of tables you have.

You may try reducing both table_open_cache and table_definition_cache down to e.g. 100; I wouldn't be surprised if you got better performance this way.

This problem was fixed in 10.2; performance there should be similar to 5.5.
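The eviction-cost difference Sergey describes can be illustrated with a toy sketch (Python, purely illustrative, not MariaDB source code): with an LRU list the victim is found in constant time, while without it every cached entry must be scanned on each eviction.

```python
from collections import OrderedDict

# Illustrative sketch of the two eviction strategies.
# With an ordered structure the least-recently-used victim is found in O(1);
# without it, each eviction scans all cached entries (O(n) per eviction),
# which is expensive with hundreds of thousands of cached tables.

class TableCacheLRU:
    """5.5-style sketch: maintains LRU order, O(1) eviction."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # insertion order == LRU order

    def open_table(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)        # mark most recently used
        elif len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)      # evict LRU entry in O(1)
        self.cache[name] = True


class TableCacheScan:
    """10.0/10.1-style sketch: no LRU list, eviction scans every entry."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}   # name -> last-use counter
        self.clock = 0

    def open_table(self, name):
        self.clock += 1
        if name not in self.cache and len(self.cache) >= self.capacity:
            victim = min(self.cache, key=self.cache.get)  # O(n) scan
            del self.cache[victim]
        self.cache[name] = self.clock
```

Both variants evict the same victim; only the cost per eviction differs, which is why the slowdown scales with the number of tables rather than with the query complexity.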

Comment by Michael Graf [ 2016-12-01 ]

Hi Axel,

The machine I did the tests on was updated from Windows Server 2008 R2 Enterprise to Windows Server 2012 R2 Standard because we had to do some tests for a customer. Since the upgrade to Windows Server 2012 R2 Standard I can no longer reproduce the problems with the click performance.

>1. what engine are your tables using? I guess InnoDB with file_per_table=false?
Yes.

>2. can you please collect server status variables like so
The file is attached.

status.txt

Comment by Michael Graf [ 2016-12-01 ]

Hi Sergey,

here are the results:

MariaDB 10.1.19
table_open_cache=100
table_definition_cache=400000
startup time: 6m15s
cpu usage of MariaDB during startup: 3%
cpu usage of MariaDB during performance test: 12%
events/s: 1800
client click performance: < 1s (sporadic single SQL queries)

The cpu usage is low, as you expected, and the startup time is between the best (4m30s) and the worst (8m20s). The performance test gives the best value, 1800 events/s.

I couldn't test MariaDB 10.2.2 because it crashed during startup.

Comment by Sergey Vojtovich [ 2016-12-01 ]

migra, did you try lowering table_definition_cache as well? I guess you won't see a big difference with a low table_open_cache alone.

Do you have any trace for that crash? We fixed many issues since 10.2.2, so it would be nice to check if you're hitting existing issue or not.

Comment by Michael Graf [ 2016-12-02 ]

Hi Sergey,

>did you try to lower down table_definition_cache as well?
No. I will try this next week because the test machine is currently occupied by another test.

I made 2 minidumps of the crash.
mysqld.exe.2160.001 mysqld.exe.2160.002 mysqld.exe.2160.crc mysqld.exe_161128_064117.001 mysqld.exe_161128_064117.002 mysqld.exe_161128_064117.crc

Comment by Sergey Vojtovich [ 2016-12-02 ]

wlad, could you extract traces from the above dumps?

Comment by Vladislav Vaintroub [ 2016-12-02 ]

As far as I can tell, one of those was a manually taken dump, where one thread holds an MDL lock (most likely FLUSH TABLES WITH READ LOCK) and the others wait. This one: https://jira.mariadb.org/secure/attachment/42950/callstacks_10_1_19.txt

Comment by Sergey Vojtovich [ 2016-12-02 ]

So was that a 10.1->10.2 binary upgrade? That indeed didn't work with 10.2.2. But I hope it will be fixed in 10.2.3. In fact, I vaguely remember we had some pushes regarding that recently.

Comment by Michael Graf [ 2016-12-02 ]

mysqld.exe.2160.dmp was generated with WER.

mysqld.exe_161128_064117.dmp was generated with procdump.exe.
Maybe I made a mistake using procdump.

Comment by Michael Graf [ 2016-12-02 ]

>So was that 10.1->10.2 binary upgrade?
Yes. Upgrade from 10.1.19 to 10.2.2.

Comment by Michael Graf [ 2016-12-02 ]

The Minidump (Plus) created with procdump.exe says INT_DIVIDE_BY_ZERO.

d:\leutek\mariadb\bin>c:\uti\sysinternals\procdump.exe -mp -e 1 -f "" -x d:\minidumps mysqld.exe --defaults-file=d:\leutek\mariadb\my.ini --console

I attached the whole output of procdump.exe and the minidump (Plus) mysqld.exe_161202_180716.001 mysqld.exe_161202_180716.002 mysqld.exe_161202_180716.003 mysqld.exe_161202_180716.004 mysqld.exe_161202_180716.005 mysqld.exe_161202_180716.006 mysqld.exe_161202_180716.007 mysqld.exe_161202_180716.008 mysqld.exe_161202_180716.009 mysqld.exe_161202_180716.010 mysqld.exe_161202_180716.011 mysqld.exe_161202_180716.012 mysqld.exe_161202_180716.013 mysqld.exe_161202_180716.014 mysqld.exe_161202_180716.015 mysqld.exe_161202_180716.crc .

procdump_output.txt

Comment by Vladislav Vaintroub [ 2016-12-02 ]

Right, this is tracked in another bug filed by you, MDEV-11435. Let's keep this one for performance issues only.

Comment by Vladislav Vaintroub [ 2016-12-02 ]

Also, unless it is a big deal on your side, it is somewhat easier for us to handle dump files that are compressed (zip, 7zip, whatever) rather than split. A minidump would usually suffice for crashes.

Comment by Michael Graf [ 2016-12-02 ]

Next time I will zip it.

Comment by Axel Schwenke [ 2017-09-13 ]

The problem was never reproduced. It could have been the same as MDEV-10409.

Generated at Thu Feb 08 07:48:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.