[MDEV-10499] Extreme memory use when rebuilding a large table Created: 2016-08-05 Updated: 2016-10-07 Resolved: 2016-10-06
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | OTHER |
| Affects Version/s: | 10.1.16 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Mark Wadham | Assignee: | Unassigned |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | need_feedback |
| Description |
When rebuilding a very large table as a result of executing DDL, eg:
it gets to this point with very little memory use:
At which point it begins to consume all available RAM, then all available swap, and if there isn't enough it gets OOM-killed by the kernel. We have run this test many times with a 160 GB table on nodes that have 64 GB of RAM and no swap; it exhausted all of it and was OOM-killed by the kernel. We succeeded once with the InnoDB buffer pool lowered from 50 GB to 16 GB and a 16 GB swap volume added, but attempts to repeat that failed. It does succeed if we add a very large swap volume, for example one the size of the table's data length, which is around ~90 GB. It therefore seems as if rebuilds of the primary index (which includes all of the table data) are being done purely in memory rather than on disk.
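The reporter's exact DDL is not preserved above. For context, any of the following statements (with an illustrative table name, not the reporter's) would force InnoDB in 10.1 to rebuild the clustered (primary) index, which contains all row data:

```sql
-- Hypothetical table name; the original statement is not shown in this report.
ALTER TABLE big_table ENGINE=InnoDB;   -- "null" ALTER: full table rebuild
OPTIMIZE TABLE big_table;              -- mapped to a table rebuild for InnoDB
```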
| Comments |
| Comment by Elena Stepanova [ 2016-08-07 ] |

Could you please paste the output of SHOW CREATE TABLE and attach your cnf file(s)? I tried to reproduce on a smaller table (the ibd file is ~11 GB) with an 8 GB buffer pool, but I'm not seeing memory growth above the buffer pool on an arbitrary table with the default configuration (apart from the increased buffer pool).
| Comment by Mark Wadham [ 2016-08-09 ] |

| Comment by Mark Wadham [ 2016-08-09 ] |

| Comment by Elena Stepanova [ 2016-09-06 ] |

No matter what I do, I'm not getting the described memory consumption. The only way I can make the server eat up memory on this operation is to map the datadir where the temporary file is created to RAM, but I assume that if that were the case here, it would have been mentioned.

- Which environment are you working in (Linux distribution/version, etc.)?
- Your cnf file includes the /etc/my.cnf.d directory; is there anything in it?
- The names of your binary logs mention galera, but there is no other Galera-related configuration. Is the server a Galera node, by any chance?
- When the problem happens, does top show the corresponding figures for mysqld? Could you please paste the top output for the process at a point where it has clearly overgrown innodb_buffer_pool_size but hasn't been killed by the OOM watchdog yet?
- Can you take a stack trace from the running server at that time? `gdb --batch --eval-command="thread apply all bt" <mysqld path> <pid>`
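A minimal sketch of how the requested memory figures could be captured: sample mysqld's resident set size while the ALTER runs, so that growth past innodb_buffer_pool_size is recorded even if the process is later OOM-killed. The pid below is a placeholder; on the affected node it would come from `pidof mysqld`.

```shell
# Placeholder pid; on the affected node use: pid=$(pidof mysqld)
pid=1
# ps reports RSS (resident set size) in kilobytes; run this in a loop
# (e.g. under `watch -n 5`) while the ALTER TABLE is in progress.
ps -o rss= -p "$pid"
```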
| Comment by Elena Stepanova [ 2016-10-06 ] |

Please comment to re-open the issue if you have more information.
| Comment by Mark Wadham [ 2016-10-07 ] |

Hi Elena,

Sorry for my slow response. It is a Galera node. Unfortunately I don't have time to do any further testing, and the test environment has been destroyed now anyway. We have worked around the problem with operational procedures. The Galera cluster was running on AWS; I don't know why that would affect things in this way, but I guess it's possible.

Thanks,