[MCOL-553] "Too many open files" errors during DBT3 performance test Created: 2017-02-07  Updated: 2017-03-02  Resolved: 2017-03-02

Status: Closed
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: None
Fix Version/s: 1.1.0

Type: Bug Priority: Minor
Reporter: Daniel Lee (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: relnote
Environment:

2-node combo configuration on AWS, using d2.8xlarge instance type


Sprint: 2017-3, 2017-4, 2017-5

 Description   

Build tested: 1.0.7-1 AMI.

The DBT3 test failed with many missing data file errors. I checked the log files and noticed a "too many open files" error around the time the connection to ExeMgr was lost. Soon after, PrimProc was restarted.

Does ColumnStore close or reuse the file handle when it hits a missing data file error?

Feb 7 15:40:49 ip-172-30-0-236 PrimProc[66666]: 49.330422 |0|0|0| C 28 CAL0053: PrimProc could not open file for OID 3053; /000.dir/000.dir/011.dir/237.dir/000.dir/FILE001.cdf:No such file or directory
Feb 7 17:41:45 ip-172-30-0-236 joblist[124597]: 45.262103 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
Feb 7 17:41:45 ip-172-30-0-236 joblist[124597]: 45.262157 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 146 /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp: Could not get a connection to a ExeMgr
Feb 7 21:39:13 ip-172-30-0-236 PrimProc[8608]: 13.252067 |0|0|0| C 28 CAL0053: PrimProc could not open file for OID 3352; /home/mariadb-user/mariadb/columnstore/data1/000.dir/000.dir/013.dir/024.dir/058.dir/FILE002.cdf:Too many open files
Feb 7 21:39:15 ip-172-30-0-236 joblist[9083]: 15.690711 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to 172.30.0.232



 Comments   
Comment by Daniel Lee (Inactive) [ 2017-02-07 ]

The DBT3 test that caused this error was run on a 1TB database.

1) Run cpimport to import the 1TB source files.
2) Execute the DBT3 queries for the 1TB database. Please note that the queries are dataset size specific.

From the time stamps, it is either query #3 or #4 that caused this error.

Query #3
[mariadb-user@ip-172-30-0-236 1000g]$ cat /data/qa/autopilot/performance/dbt3/sql/1000g/3.sql
-- using 2313685168 as a seed to the RNG

select
l_orderkey,
sum(l_extendedprice * (1 - l_discount)) as revenue,
o_orderdate,
o_shippriority
from
customer,
orders,
lineitem
where
c_mktsegment = 'FURNITURE'
and c_custkey = o_custkey
and l_orderkey = o_orderkey
and o_orderdate < '1995-03-07'
and l_shipdate > '1995-03-07'
group by
l_orderkey,
o_orderdate,
o_shippriority
order by
revenue desc,
o_orderdate
LIMIT 10;

Query #4

[mariadb-user@ip-172-30-0-236 1000g]$ cat /data/qa/autopilot/performance/dbt3/sql/1000g/4.sql
-- using 3780981089 as a seed to the RNG

select
o_orderpriority,
count(*) as order_count
from
orders
where
o_orderdate >= date '1995-06-01'
and o_orderdate < date_add( '1995-06-01', interval 3 month)
and exists (
select
*
from
lineitem
where
l_orderkey = o_orderkey
and l_commitdate < l_receiptdate
)
group by
o_orderpriority
order by
o_orderpriority;

Comment by Daniel Lee (Inactive) [ 2017-02-07 ]

Additional info:

There is logic to keep file handles around for performance reasons. Hopefully my understanding of the logic is not outdated. It should work as follows.

If the number of open file handles goes over <MaxOpenFiles>, ColumnStore will close <DecreaseOpenFilesCount> file handles before continuing.

Parameters and default values in Columnstore.xml:

<MaxOpenFiles>2K</MaxOpenFiles>
<DecreaseOpenFilesCount>200</DecreaseOpenFilesCount>
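
The behaviour described above can be sketched roughly as follows. This is only an illustration of the "close a batch of handles once the cap is exceeded" idea, not ColumnStore's actual code: the class name, the choice to evict the least recently used handles, and reading "2K" as 2048 are all assumptions.

```python
from collections import OrderedDict


class FileHandleCache:
    """Hypothetical handle cache: keeps files open, closes a batch when over the cap."""

    def __init__(self, max_open_files=2048, decrease_count=200):
        self.max_open_files = max_open_files   # stands in for <MaxOpenFiles>
        self.decrease_count = decrease_count   # stands in for <DecreaseOpenFilesCount>
        self._handles = OrderedDict()          # path -> open file, oldest use first

    def get(self, path):
        handle = self._handles.get(path)
        if handle is not None:
            # Cache hit: mark as most recently used.
            self._handles.move_to_end(path)
            return handle
        handle = open(path, "rb")
        self._handles[path] = handle
        if len(self._handles) > self.max_open_files:
            self._evict(keep=path)
        return handle

    def _evict(self, keep):
        # Assumption: evict the least recently used handles, never the one just opened.
        victims = [p for p in self._handles if p != keep][: self.decrease_count]
        for p in victims:
            self._handles.pop(p).close()

    def open_count(self):
        return len(self._handles)

    def close_all(self):
        while self._handles:
            _, handle = self._handles.popitem()
            handle.close()
```

Note that a cache like this only bounds the handles it is allowed to close. If every cached file is still in use, nothing can be evicted and the process can still reach the OS limit.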

Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ]

So, for starters, MaxOpenFiles and DecreaseOpenFilesCount only trigger when the extents are marked as not in use, and a single DBT3 query could use all the extents simultaneously. ColumnStore sets the OS file limit to 65536. This won't work for non-root installs, which will be stuck at the default, typically 1024.

Now, the limit is shared across all processes by that user, so mysqld's temp and table files also consume that limit, as do log files, etc. Based on some very rough maths, with 8bn rows in lineitem this will take a minimum of 18,000 extent files. On top of that you have the extent files for all the other tables; for example, the orders table will be 15 files per 8M rows (assuming full extents). The 65K file limit we have set is far too small.
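
The rough maths can be reproduced with a small helper. This is a back-of-the-envelope sketch only: the figure of 18 files per 8M rows for lineitem is a hypothetical value picked so that the estimate matches the 18,000 quoted above, "8M" is taken as 8,000,000, and the orders row count is an assumed value for a 1TB data set.

```python
ROWS_PER_EXTENT = 8_000_000  # assumption: "8M rows" read as 8,000,000


def estimate_extent_files(rows, files_per_extent_range, rows_per_extent=ROWS_PER_EXTENT):
    """Estimate data files for a table, assuming full extents."""
    extent_ranges = -(-rows // rows_per_extent)  # integer ceiling division
    return extent_ranges * files_per_extent_range


# lineitem: 8bn rows (from the comment), 18 files per 8M rows (hypothetical)
lineitem_files = estimate_extent_files(8_000_000_000, 18)

# orders: 15 files per 8M rows (from the comment), 1.5bn rows (assumed)
orders_files = estimate_extent_files(1_500_000_000, 15)

total = lineitem_files + orders_files
for limit in (1024, 4096, 65536):
    print(f"limit {limit}: needs {total}, headroom {limit - total}")
```

With these inputs the two tables alone need about 20,800 files, far beyond the 1024 and 4096 non-root limits, and that is before counting the remaining tables, mysqld's own files, logs and sockets.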

We have a couple of options here:

1. Remove the 64K file limit in the code completely and document that the user should set this. We will need to do this for non-root anyway.
2. Increase this, by a lot! (1M?)

To solve this bug we need to:

1. Resolve the limit issue above.
2. Investigate to see if we are leaking FDs anywhere
3. Handle hitting the limits better.

As an optional point 4, maybe look at increasing the number of rows per extent to reduce the number of files used for TB range data sets.

I suggest we do part 1 of this ASAP, as this will affect people in the TB range of data.
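
For point 3, one possible shape for "handling the limit better" is to treat "Too many open files" differently from "No such file or directory": release some cached handles and retry once for the former, and report the latter immediately. The sketch below is an assumption about how that could look, not the ColumnStore implementation; `open_with_fd_retry`, `release_handles` and the `opener` parameter are hypothetical names.

```python
import errno


def open_with_fd_retry(path, release_handles, opener=open):
    """Open a file; on descriptor exhaustion, free handles and retry once.

    release_handles is a hypothetical callback that closes some cached handles.
    Any other error, including a missing file (ENOENT), propagates unchanged.
    """
    try:
        return opener(path, "rb")
    except OSError as exc:
        # EMFILE: per-process limit reached; ENFILE: system-wide table is full.
        if exc.errno not in (errno.EMFILE, errno.ENFILE):
            raise
        release_handles()
        return opener(path, "rb")
```

If the retry also fails, the second exception propagates to the caller, so the error is still reported rather than swallowed.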

Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ]

It should also be noted that, due to the way Linux works, a TCP/IP socket connection also counts as a file. There are a few cases where we make a lot of these for a single query. Fixing MCOL-529 will help there.
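
A quick way to see the effect of sockets on the descriptor count is to compare the number of entries in /proc/self/fd before and after creating some. This is a Linux-only illustration; the helper names are made up for the example.

```python
import os
import socket


def open_fd_count():
    """Count descriptors open in this process (Linux-only: reads /proc/self/fd)."""
    return len(os.listdir("/proc/self/fd"))


def fds_used_by_sockets(n):
    """Create n TCP sockets and return how many descriptors they consumed."""
    before = open_fd_count()
    socks = [socket.socket(socket.AF_INET, socket.SOCK_STREAM) for _ in range(n)]
    try:
        return open_fd_count() - before
    finally:
        for s in socks:
            s.close()
```

Each socket takes one slot from the same per-process table as regular files, even before it is connected.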

Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ]

The pull requests are for develop and develop-1.0. They are only for part 1 of the fix and this ticket should be reopened afterwards.

Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ]

In addition to all of the above, non-root has a 1K soft, 4K hard cap. The ColumnStore binaries try to change this and fail silently. We need proper error handling there, and documentation.
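
As an illustration of what non-silent handling might look like, the sketch below checks the current limits, tries to raise the soft limit, and returns an explicit error message on failure. It uses Python's standard `resource` module rather than the C++ in the ColumnStore binaries, and the function name and message wording are assumptions.

```python
import resource


def ensure_nofile_limit(required):
    """Try to make the soft open-file limit at least `required`.

    Returns (ok, message) instead of failing silently.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= required:
        return True, f"open-file limit already sufficient (soft={soft}, hard={hard})"
    # An unprivileged process can raise its soft limit only up to the hard limit;
    # raising the hard limit itself normally needs root.
    if hard == resource.RLIM_INFINITY or hard >= required:
        new_hard = hard
    else:
        new_hard = required
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (required, new_hard))
    except (ValueError, OSError) as exc:
        return False, (
            f"Error setting file limits: wanted {required}, "
            f"have soft={soft}, hard={hard} ({exc})"
        )
    return True, f"open-file limit raised to {required}"
```

A caller would log the message and stop start-up when `ok` is false, rather than continuing with the lower limit.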

Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ]

Reopening so I can add error handling. It doesn't look like my original commits are needed.

Comment by Andrew Hutchings (Inactive) [ 2017-02-08 ]

The new pull request adds error handling to setting the file limit. Documentation additions are at:

https://mariadb.com/kb/en/mariadb/preparing-for-columnstore-installation/#set-the-user-file-limits-by-root-user

Please keep this open after review for the rest.

Comment by Andrew Hutchings (Inactive) [ 2017-02-17 ]

Hall has reviewed and merged. Keeping open for points 2&3 in my earlier comment.

Comment by Andrew Hutchings (Inactive) [ 2017-02-21 ]

The patch needs updating to report the error up to OAM rather than to stderr.

Comment by Andrew Hutchings (Inactive) [ 2017-02-21 ]

Pull request to also send the error to the error log and stop postConfigure if we cannot set the file limit high enough.

Comment by Daniel Lee (Inactive) [ 2017-03-02 ]

Build tested: Github source

[root@localhost columnstore]# cd mariadb-columnstore-server/
[root@localhost mariadb-columnstore-server]# git show
commit 3da188e5c8a2630019ea810fb8c1bd3ece5e058b
Merge: 5d9686c 53c1df7
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Fri Feb 10 15:07:31 2017 +0000

Merge pull request #31 from jbfavre/fix_deb_package_dependency

MCOL-562 Fix Debian package dependencies

[root@localhost mariadb-columnstore-server]# cd mariadb-columnstore-engine/
[root@localhost mariadb-columnstore-engine]# git show
commit 16cef50caedd9ec7585b04c096996a9441bdf2d5
Author: David Hill <david.hill@mariadb.com>
Date: Wed Mar 1 10:39:11 2017 -0600

change the check for prompt back to the previous code

Ran both root and non-root tests for a 1um4pm installation.

For non-root, the stack did not come up when /etc/security/limits.conf was not set up correctly. Once it was set up, the stack came up fine.

The following messages appear in crit.log if limits.conf is not set up correctly for the non-root user:

Mar 1 15:09:17 localhost PrimProc[9345]: 17.194857 |0|0|0| C 28 CAL0000: Error setting file limits, please see non-root install documentation
Mar 1 15:09:18 localhost ProcessMonitor[8809]: 18.814887 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: PrimProc, old PID = 9345
Mar 1 15:09:19 localhost ProcessManager[8895]: 19.870695 |0|0|0| C 17 CAL0000: startMgrProcessThread Exit with a failure, error returned from startSystemThread
Mar 1 15:09:20 localhost PrimProc[9405]: 20.896971 |0|0|0| C 28 CAL0000: Error setting file limits, please see non-root install documentation
Mar 1 15:14:44 localhost controllernode[8547]: 44.498221 |0|0|0| C 29 CAL0000: ExtentMap::save(): got request to save an empty BRM
Mar 1 15:17:09 localhost PrimProc[10638]: 09.564554 |0|0|0| C 28 CAL0000: Error setting file limits, please see non-root install documentation
Mar 1 15:17:11 localhost ProcessMonitor[10101]: 11.165440 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: PrimProc, old PID = 10638
Mar 1 15:17:12 localhost ProcessManager[10187]: 12.352456 |0|0|0| C 17 CAL0000: startMgrProcessThread Exit with a failure, error returned from startSystemThread
Mar 1 15:17:13 localhost PrimProc[10702]: 13.223764 |0|0|0| C 28 CAL0000: Error setting file limits, please see non-root install documentation
