[MCOL-553] "Too many open files" errors during DBT3 performance test Created: 2017-02-07 Updated: 2017-03-02 Resolved: 2017-03-02 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | PrimProc |
| Affects Version/s: | None |
| Fix Version/s: | 1.1.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Daniel Lee (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | relnote | ||
| Environment: |
2-node combo configuration on AWS, using the d2.8xlarge instance type |
||
| Sprint: | 2017-3, 2017-4, 2017-5 |
| Description |
|
Build tested: 1.0.7-1 AMI. The DBT3 test failed with many missing data file errors. I checked the log files and noticed a "too many open files" error when the connection to ExeMgr is lost. Soon after, PrimProc was restarted. Does ColumnStore close or reuse the file handle when it hits a missing data file error?
Feb 7 15:40:49 ip-172-30-0-236 PrimProc[66666]: 49.330422 |0|0|0| C 28 CAL0053: PrimProc could not open file for OID 3053; /000.dir/000.dir/011.dir/237.dir/000.dir/FILE001.cdf:No such file or directory |
| Comments |
| Comment by Daniel Lee (Inactive) [ 2017-02-07 ] |
|
The DBT3 test that caused this error was run on a 1TB database.
1) Run cpimport to import the 1TB source files.
From the timestamps, it is either query #3 or #4 that caused this error.
Query #3 select
Query #4 [mariadb-user@ip-172-30-0-236 1000g]$ cat /data/qa/autopilot/performance/dbt3/sql/1000g/4.sql select |
| Comment by Daniel Lee (Inactive) [ 2017-02-07 ] |
|
Additional info: There is logic to keep file handles around for performance reasons. Hopefully my understanding of the logic is not outdated. The logic should be as follows: if the number of open file handles goes over <MaxOpenFiles>, ColumnStore will close <DecreaseOpenFilesCount> file handles before continuing. Parameters and default values in Columnstore.xml: <MaxOpenFiles>2K</MaxOpenFiles> |
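The logic described above can be sketched roughly as follows. This is a minimal illustration and not ColumnStore's actual code: the class name `FileHandleCache`, its methods, and the choice to close the least recently used handles first are all assumptions, and the small values used for `max_open_files` and `decrease_count` are for demonstration only (the comment gives 2K for MaxOpenFiles and does not show the DecreaseOpenFilesCount default).

```python
import io
from collections import OrderedDict


class FileHandleCache:
    """Hypothetical cache: keeps handles open, trims when over the cap."""

    def __init__(self, max_open_files, decrease_count, opener=open):
        self.max_open_files = max_open_files    # plays the role of <MaxOpenFiles>
        self.decrease_count = decrease_count    # plays the role of <DecreaseOpenFilesCount>
        self._opener = opener                   # injectable so the sketch runs without real files
        self._handles = OrderedDict()           # path -> handle, oldest first

    def get(self, path):
        handle = self._handles.get(path)
        if handle is not None:
            # Cache hit: mark as most recently used and reuse the handle.
            self._handles.move_to_end(path)
            return handle
        handle = self._opener(path)
        self._handles[path] = handle
        if len(self._handles) > self.max_open_files:
            self._close_oldest()
        return handle

    def _close_oldest(self):
        # Close up to decrease_count handles, never the one just opened.
        for _ in range(min(self.decrease_count, len(self._handles) - 1)):
            _, old = self._handles.popitem(last=False)
            old.close()

    def __len__(self):
        return len(self._handles)


# Demonstration with in-memory stand-ins for data files.
cache = FileHandleCache(max_open_files=4, decrease_count=2,
                        opener=lambda path: io.StringIO(path))
handles = [cache.get(f"FILE{i:03d}.cdf") for i in range(5)]
print(len(cache), [h.closed for h in handles])
```

Opening the fifth file pushes the cache over its cap of four, so the two oldest handles are closed and three remain open.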
| Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ] |
|
So, for starters, MaxOpenFiles and DecreaseOpenFilesCount only trigger when extents are marked as not in use, and a single DBT3 query could use all the extents simultaneously. ColumnStore sets the OS file limit to 65536. This won't work for non-root installs, which will be stuck at the default, typically 1024. The limit is also shared across all processes run by that user, so mysqld's temp and table files consume it too, as do log files, etc. Based on some very rough maths, with 8bn rows in lineitem this will take a minimum of 18,000 extent files. On top of that you have the extent files for all the other tables; for example, the orders table will be 15 files per 8M rows (assuming full extents). The 65K file limit we have set is far too small.
We have a couple of options here:
1. Remove the 64K file limit in the code completely and document that the user should set this; we will need to do this for non-root anyway
To solve this bug we need to:
1. Resolve the limit issue above.
As an optional point 4, maybe look at increasing the number of rows per extent to reduce the number of files used for TB-range data sets. I suggest we do part 1 of this ASAP, as this will affect people in the TB range of data. |
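The rough maths above can be reproduced with a small estimator. This is only a sketch under stated assumptions: "8M rows" is taken to mean 8 × 1024 × 1024 rows per extent, every extent is assumed to live in its own file as the comment implies, and the figure of 19 files per 8M rows of lineitem is a hypothetical value chosen for illustration. Only the 8bn row count and the 65536 limit come from the ticket.

```python
ROWS_PER_EXTENT = 8 * 1024 * 1024   # assumption: "8M rows" per extent
OS_FILE_LIMIT = 65536               # limit quoted in the comment above


def estimate_extent_files(rows, files_per_extent_range):
    """Files needed if each block of ROWS_PER_EXTENT rows costs
    files_per_extent_range files (one per column, plus any extras)."""
    extent_ranges = -(-rows // ROWS_PER_EXTENT)   # ceiling division
    return extent_ranges * files_per_extent_range


# Hypothetical: 19 files per 8M rows of lineitem.
lineitem_files = estimate_extent_files(8_000_000_000, 19)
print(lineitem_files)                              # 18126
print(f"{lineitem_files / OS_FILE_LIMIT:.0%} of the limit for one table")
```

Under these assumptions a single table already takes over a quarter of the limit before other tables, temp files, logs, and sockets are counted.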
| Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ] |
|
It should also be noted that, due to the way Linux works, a TCP/IP socket connection also counts as a file. There are a few cases where we make a lot of these for a single query. Fixing |
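The point that sockets count against the open-file limit can be checked with a short experiment. This is an illustrative sketch, not something from the ColumnStore code base: the helper names are made up, and it counts descriptors by listing `/proc/self/fd` (falling back to `/dev/fd`), so it assumes a Linux or similar POSIX system.

```python
import os
import socket

# Directory that lists this process's open descriptors (Linux; /dev/fd elsewhere).
FD_DIR = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"


def open_fd_count():
    """Number of file descriptors currently open in this process."""
    return len(os.listdir(FD_DIR))


def fds_used_by_sockets(n):
    """Create n unconnected TCP sockets and report how many descriptors they took."""
    before = open_fd_count()
    socks = [socket.socket(socket.AF_INET, socket.SOCK_STREAM) for _ in range(n)]
    try:
        return open_fd_count() - before
    finally:
        for s in socks:
            s.close()


print(fds_used_by_sockets(5))
```

Each socket takes one descriptor from the same per-process budget as data files, so a query that opens many connections reduces the headroom left for extent files.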
| Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ] |
|
The pull requests are for develop and develop-1.0. They are only for part 1 of the fix and this ticket should be reopened afterwards. |
| Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ] |
|
In addition to all of the above, non-root has a 1K soft and 4K hard cap. The ColumnStore binaries try to change this and fail silently. We need proper error handling there, plus documentation. |
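The kind of error handling asked for here might look like the following. This is a sketch using Python's standard `resource` module rather than the actual ColumnStore patch (which is not shown in this ticket); the function name `raise_nofile_limit` and its return convention are assumptions, and 65536 is the value the earlier comment says ColumnStore requests.

```python
import resource

REQUIRED_NOFILE = 65536  # value the ticket says ColumnStore asks for


def raise_nofile_limit(required=REQUIRED_NOFILE):
    """Try to raise the soft open-file limit; return (ok, message) instead of failing silently."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= required:
        return True, f"soft limit {soft} is already sufficient"
    # An unprivileged process may raise its soft limit only up to the hard limit.
    if hard != resource.RLIM_INFINITY and hard < required:
        return False, (f"hard limit {hard} is below the required {required}; "
                       "it must be raised by an administrator")
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (required, hard))
    except (ValueError, OSError) as exc:
        return False, f"could not set file limit to {required}: {exc}"
    return True, f"soft limit raised from {soft} to {required}"


ok, message = raise_nofile_limit()
print(ok, message)
```

Returning the failure to the caller lets a start-up routine log the problem and stop, rather than continuing with a limit that is too low.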
| Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ] |
|
Reopening so I can add error handling. It doesn't look like my original commits are needed. |
| Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ] |
|
New pull request adds error handling to setting the file limit. Documentation additions added at: Please keep this open after review for the rest. |
| Comment by Andrew Hutchings (Inactive) [ 2017-02-17 ] |
|
Hall has reviewed and merged. Keeping open for points 2&3 in my earlier comment. |
| Comment by Andrew Hutchings (Inactive) [ 2017-02-21 ] |
|
Patch needs updating to report the error up to OAM rather than to stderr. |
| Comment by Andrew Hutchings (Inactive) [ 2017-02-21 ] |
|
Pull request to also send the error to the error log and stop postConfigure if we cannot set the file limit high enough. |
| Comment by Daniel Lee (Inactive) [ 2017-03-02 ] |
|
Build tested: Github source
[root@localhost columnstore]# cd mariadb-columnstore-server/
Merge pull request #31 from jbfavre/fix_deb_package_dependency
[root@localhost mariadb-columnstore-server]# cd mariadb-columnstore-engine/
change the check for prompt back to the previous code
Did both root and non-root tests for a 1um4pm installation. For non-root, the stack did not come up when /etc/security/limits.conf was not set up correctly. Once it was set up, the stack came up fine. The following message appears in crit.log if limits.conf is not set up correctly for the non-root user:
Mar 1 15:09:17 localhost PrimProc[9345]: 17.194857 |0|0|0| C 28 CAL0000: Error setting file limits, please see non-root install documentation |