[MDEV-18027] Running out of file descriptors and eventual crash Created: 2018-12-18  Updated: 2023-10-10  Resolved: 2020-02-05

Status: Closed
Project: MariaDB Server
Component/s: Configuration, Server
Affects Version/s: 10.2.12, 10.2, 10.3, 10.4
Fix Version/s: 10.2.32, 10.3.23, 10.4.13

Type: Bug Priority: Critical
Reporter: David Crimmins Assignee: Oleksandr Byelkin
Resolution: Fixed Votes: 2
Labels: regression, regression-10.2
Environment:

Linux OL7


Attachments: File my_eye.cnf     File mysqld.error.log.truncated    

 Description   

Server generates errors and eventually crashes due to exceeding the limit on the number of open file descriptors.

This occurs when additional table cache instances (table_open_cache_instances) are created. The calculation of open_files_limit does not account for the fact that there may be multiple instances.

I expect (but have not proven) that the problem could be avoided by adjusting settings in the config file to limit the number of table_open_cache_instances or to increase open_files_limit. Currently neither of these is set in our config.
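A minimal config sketch of the two workarounds mentioned above (the values are illustrative assumptions, not verified fixes for this bug):

```ini
[mysqld]
# Option 1: keep a single table cache instance so the automatic
# open_files_limit calculation matches actual usage.
table_open_cache_instances = 1

# Option 2: raise open_files_limit high enough to cover all
# instances (requires a sufficient OS hard limit, see ulimit -aH).
#open_files_limit = 8431
```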



 Comments   
Comment by Elena Stepanova [ 2018-12-28 ]

Could you please paste or attach

  • the exact messages that you see – warnings upon the server startup about adjusting values, errors and the crash report which you are getting later;
  • the output of

    select @@max_connections, @@open_files_limit, @@table_open_cache, @@table_open_cache_instances;
    

  • the output of

    ulimit -a
    ulimit -aH
    

  • your server config file(s) and command-line options if you use any.

Very importantly, it has to be a consistent set of data, all from the same single run.

Comment by David Crimmins [ 2019-01-03 ]

Config and log files attached. NB: the log file has been truncated as it was too big.

Further information as requested plus limits for running db process:

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 10.2.12-MariaDB-log MariaDB Server
 
Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> select @@max_connections, @@open_files_limit, @@table_open_cache, @@table_open_cache_instances;
+-------------------+--------------------+--------------------+------------------------------+
| @@max_connections | @@open_files_limit | @@table_open_cache | @@table_open_cache_instances |
+-------------------+--------------------+--------------------+------------------------------+
|               400 |               2005 |                500 |                            8 |
+-------------------+--------------------+--------------------+------------------------------+
1 row in set (0.00 sec)

bri-lin7 Entuity # ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 10908
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 10908
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

bri-lin7 Entuity # ulimit -aH
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 10908
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 10908
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

bri-lin7 Entuity # cat /proc/`pgrep mysqld`/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             10908                10908                processes
Max open files            2005                 2005                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       10908                10908                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Comment by Elena Stepanova [ 2019-01-21 ]

Thanks for the information.

Indeed, with MariaDB's implementation of table_open_cache_instances, the total table open cache size might grow much higher than configured, and neither the calculation of the number of open files to request from the system nor the auto-adjustment of max_connections and table_open_cache takes this into account.

In this case, with only 1024 open files as the initial value (raised to 2005 based on the configuration), and with two detected occurrences of contention, which make the total table open cache jump to 1500, the problem is inevitable. (I would expect "Too many open files" rather than "Bad file descriptor"; maybe it depends on the Linux flavor.)

Instead, it should have requested not 2005 open files ((max_connections + extra_max_connections) * 5) but 8431 (extra_files + max_connections + extra_max_connections + tc_size * 2 * tc_instances). Of course it wouldn't succeed with the hard limit of 4096, so the auto-sized values would have to be recalculated. max_connections would probably stay the same, although barely, but table_open_cache would drop to ~230.
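The two calculations above can be checked with a short sketch (assuming the default values extra_files = 30 and extra_max_connections = 1, which make the numbers come out as quoted):

```python
# Reproduce the two open-files calculations from the comment above.
# extra_files = 30 and extra_max_connections = 1 are assumed defaults.
max_connections = 400
extra_max_connections = 1
extra_files = 30
tc_size = 500            # table_open_cache
tc_instances = 8         # table_open_cache_instances

# Current request: ignores table cache instances entirely.
requested = (max_connections + extra_max_connections) * 5
print(requested)         # 2005

# What should be requested: room for every instance's table cache.
needed = (extra_files + max_connections + extra_max_connections
          + tc_size * 2 * tc_instances)
print(needed)            # 8431
```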

Comment by Valerii Kravchuk [ 2019-04-26 ]

Note that since table_open_cache_instances was introduced in 10.2.2 and defaults to 8, users upgrading from 10.1, for example, may start to get "Too many open files" errors under a load that worked well in 10.1. It's a regression of a kind.

Comment by Sergey Vojtovich [ 2019-05-15 ]

One possible way of fixing this is to set the hard limit according to table_open_cache_instances. The initial soft limit should stay low. When the number of table cache instances goes up, raise the soft limit accordingly. Don't let the number of table cache instances go up if the soft limit cannot be raised.
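A minimal sketch of that policy (hypothetical function and parameter names, not the server's actual code): grow only if the soft limit can be raised, within the hard limit, by enough descriptors for one more instance.

```python
def can_grow_instances(soft, hard, tc_size):
    """Decide whether one more table cache instance fits.

    The soft RLIMIT_NOFILE must be raisable (within the hard limit)
    by tc_size * 2 descriptors for the new instance.
    Returns (ok, new_soft_limit). Hypothetical sketch.
    """
    needed = soft + tc_size * 2
    return (needed <= hard, min(needed, hard))

# Against the reported limits: soft 2005, hard 4096, table cache 500.
print(can_grow_instances(2005, 4096, 500))  # (True, 3005)
print(can_grow_instances(4005, 4096, 500))  # (False, 4096)
```

With the reporter's limits this admits two extra instances and then refuses further growth instead of running out of descriptors.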

Comment by Oleksandr Byelkin [ 2019-07-10 ]

My concern is whether it really should go to 10.2, because the next complaint from support will be that a user upgraded and now the number of connections and the table cache have decreased because there are not enough file handles...

Comment by Oleksandr Byelkin [ 2019-07-10 ]

commit fb27ed99a79f7e9b6c4e838d8a788a4685cfbee4 (HEAD -> bb-10.2-MDEV-18027, origin/bb-10.2-MDEV-18027)
Author: Oleksandr Byelkin <sanja@mariadb.com>
Date: Wed Jul 10 13:40:54 2019 +0200

MDEV-18027: Running out of file descriptors and eventual crash

For automatic number of opened files limit take into account number of table instances for table cache

Comment by Sergey Vojtovich [ 2019-07-10 ]

My concern is whether it really should go to 10.2, because the next complaint from support will be that a user upgraded and now the number of connections and the table cache have decreased because there are not enough file handles...

It wouldn't be the case if it were implemented as I suggested on May 15: don't let the number of table cache instances go up if the soft limit cannot be raised.

Comment by Sergei Golubchik [ 2019-07-15 ]

I think svoj suggested a better fix than fb27ed99a79f7e9b6c4e838d8a788a4685cfbee4.

New cache instances are created when contention is too high, which normally means there is some hot table accessed by many connections concurrently.

There are many common workloads without a hot table; in these cases there will be only one table cache instance. Your fix in fb27ed99a79f7e9b6c4e838d8a788a4685cfbee4 would unnecessarily penalize these workloads: they'll have a smaller table cache for no good reason. I'd suggest auto-reducing tc_instances instead.
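To illustrate that alternative, a minimal sketch (hypothetical helper, not the server's code) that shrinks the instance count until it fits the open-files budget, instead of shrinking the per-instance table cache:

```python
def fit_instances(open_files_limit, max_connections, extra_files,
                  tc_size, wanted_instances):
    """Reduce the table cache instance count until the open-files
    budget fits, keeping tc_size intact. Hypothetical sketch."""
    budget = open_files_limit - extra_files - max_connections
    instances = wanted_instances
    while instances > 1 and tc_size * 2 * instances > budget:
        instances -= 1
    return instances

# With the reported limit of 2005 there is only room for 1 instance;
# with 8431 all 8 default instances fit.
print(fit_instances(2005, 400, 30, 500, 8))  # 1
print(fit_instances(8431, 400, 30, 500, 8))  # 8
```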

Comment by Oleksandr Byelkin [ 2019-07-18 ]

I still do not understand how playing with the soft limit can solve the problem described above.

We promise a big number of instances and a big cache; if in the middle of the game we say no, we cannot open more, how is that different from what we have now (inability to open files and a crash)?

Comment by Sergey Vojtovich [ 2019-07-18 ]

Don't increment the number of instances if raising the soft limit fails. Then everything is under control, right?

Comment by Oleksandr Byelkin [ 2019-11-05 ]

commit edc9059c31bddfaa5294423dafc6adfd5a3eabc0 (HEAD -> bb-10.2-MDEV-18027, origin/bb-10.2-MDEV-18027)
Author: Oleksandr Byelkin <sanja@mariadb.com>
Date: Wed Jul 10 13:40:54 2019 +0200

MDEV-18027: Running out of file descriptors and eventual crash

For automatic number of opened files limit take into account number of table instances for table cache

Generated at Thu Feb 08 08:40:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.