[MDEV-31139] MariaDB hangs with multiple threads open Created: 2023-04-27  Updated: 2024-01-05  Resolved: 2023-09-05

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.11.2
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Erlandas Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: Connection, speed
Environment:

Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-70-generic x86_64)


Attachments: PNG File download.png     Text File mariadbd_full_bt_all_threads.txt     Text File mariadbd_full_bt_all_threads_2.txt     JPEG File master_db_recovery.jpg     Text File master_mariadb_full_bt_all_threads.txt     Text File process_list.txt     Text File process_list2.txt    
Issue Links:
Duplicate
duplicates MDEV-29835 Partial server freeze Closed
Relates
relates to MDEV-33185 Connection timeout/slowdown upon incr... Needs Feedback

 Description   

It started happening soon after upgrade to 10.11.2 version from 10.10 I believe. The problem starts when PHP scripts starts using 250-350 threads for INSERT statements mostly (SELECT statements are frequent too). On average MariaDB receives around 6000 queries per second, but problem comes even when it is just 2000 queries per second.

There is 1 master and 4 replicas (each using 16 parallel slave threads for replication).

At some point server (sometimes one sometimes two) - either master or replica stop processing any INSERT or SELECT statements. Process list then shows several hundred threads connected and they show time in many minutes or hours. Stopping any further queries to that affected server does not resolve anything - nothing changes for hours.

When this happens - a simple CONNECT to MariaDB is measured and it usually shows connection time of 60-460 ms instead of normal 0.5-1.1ms.

When I try to kill all threads one by one - they are killed, process list shows them as killed but present on the list.

If then (or before) I try to use systemctl stop mariadb - server is not stopped for hours (same with restart). The only option is to find any mariadb process and issue KILL -9 and then start Mariadb again which then starts in recovery mode. All tables are InnoDB in the database.

Servers, which do not receive many SELECT statements (the last 2 on the image list) - they never get such a problem thus I believe replication is not an issue. Replicas are READ ONLY.

This happens about once or twice a day at different times but when number of queries is on a high side. There is no shortage of RAM or disk.



 Comments   
Comment by Marko Mäkelä [ 2023-04-27 ]

Can you please try to produce stack traces of all threads during the hang? Specifically, the output of the GDB command thread apply all backtrace full would allow me to confirm if this is a duplicate of MDEV-29835.

Comment by Daniel Black [ 2023-04-27 ]

with so many threads, using gcore (gdb package) to capture it and then take the backtrace from the core file might be required.

Comment by Erlandas [ 2023-04-27 ]

process_list2.txt I have noticed that a bulk UPDATE somehow frequently is at the top of the list of hanging queries.

I'll read how to make stack traces and next time I have a hangup - I'll make it.

Comment by Erlandas [ 2023-04-27 ]

Yes, I'll be able to use GDB on next hangup and will try producing output. Do I need any core debug modules installed for MariaDB to get meaningful data?

Comment by Erlandas [ 2023-04-27 ]

Is mariadb-server-core-10.11-dbgsym the right package for me? It comes up as unavailable:

sudo add-apt-repository 'deb [arch=amd64,arm64,ppc64el,s390x] https://ftp.osuosl.org/pub/mariadb/repo/10.11/ubuntu jammy main/debug'
Repository: 'deb [arch=amd64,arm64,ppc64el,s390x] https://ftp.osuosl.org/pub/mariadb/repo/10.11/ubuntu jammy main/debug'
Description:
Archive for codename: jammy components: main/debug
More info: https://ftp.osuosl.org/pub/mariadb/repo/10.11/ubuntu
Adding repository.
Press [ENTER] to continue or Ctrl-c to cancel.
Adding deb entry to /etc/apt/sources.list.d/archive_uri-https_ftp_osuosl_org_pub_mariadb_repo_10_11_ubuntu-jammy.list
Adding disabled deb-src entry to /etc/apt/sources.list.d/archive_uri-https_ftp_osuosl_org_pub_mariadb_repo_10_11_ubuntu-jammy.list
Reading package lists... Done

apt-get update && apt-get install -y mariadb-server-core-10.11-dbgsym
Hit:1 https://ftp.osuosl.org/pub/mariadb/repo/10.11/ubuntu jammy InRelease
Hit:2 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy InRelease
Hit:4 https://mirrors.gigenet.com/mariadb/repo/10.11/ubuntu jammy InRelease
Hit:5 http://repo.manticoresearch.com/repository/manticoresearch_jammy_dev jammy InRelease
Hit:6 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-backports InRelease
Get:8 http://ddebs.ubuntu.com jammy InRelease [41.3 kB]
Hit:9 http://ddebs.ubuntu.com jammy-updates InRelease
Get:10 http://ddebs.ubuntu.com jammy-proposed InRelease [41.4 kB]
Fetched 82.7 kB in 1s (101 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to locate package mariadb-server-core-10.11-dbgsym
E: Couldn't find any package by glob 'mariadb-server-core-10.11-dbgsym'
E: Couldn't find any package by regex 'mariadb-server-core-10.11-dbgsym'

Comment by Erlandas [ 2023-04-28 ]

I have installed debug symbols just now with:

sudo apt install mariadb-server-dbgsym

But I see one replica is already having a problem and is hanging with a large number of threads and is not processing any queries anymore.

I have run this command:

sudo gdb --batch --eval-command="set print frame-arguments all" --eval-command="thread apply all bt full" /usr/sbin/mariadbd $(pgrep -xn mariadbd) > mariadbd_full_bt_all_threads.txt

and got an output file but since MariaDB was not restarted after debug symbols were installed yet - it does not show local info . I hope the output is of some value anyway, but I'll restart servers to produce debug symbols when they hang up next time.

mariadbd_full_bt_all_threads.txt

Comment by Erlandas [ 2023-04-29 ]

Today got 2 servers hangup - 1 replica and master itself.

Does it hold sufficient data to find the problem or should I do something additionally?

Uploaded 2 files.

Comment by Marko Mäkelä [ 2023-05-01 ]

I checked the file master_mariadb_full_bt_all_threads.txt. When it was generated, only the debugging symbols for GNU libc appear to have been available. The function names for any InnoDB code are unresolved, and for some code outside InnoDB, only the function names are available, but not the function parameters nor any local variables. Based on this information, it is impossible to make any analysis. The depth of the call stack of Thread 32 (with 14 unresolved function names) does possibly match MDEV-23855.

Comment by Erlandas [ 2023-05-01 ]

Thank you for your reply Marko,

What needs to be installed additionally to make a more meaningful output when the problem happens again?

I have done this already but it is obviously not enough:

sudo apt install mariadb-server-dbgsym

Comment by Erlandas [ 2023-05-03 ]

One notice after monitoring Master connection speed (replicas as well being monitored but not having this so frequent) - simple connection frequently drops from 0.5-1ms to 200-400ms when using mysqldump or when Writing/Reading to Master DB goes to 300-400 threads (max connections is 2000, 96 CPU, 500 GB RAM), where 90% of queries are reading queries.

Comment by Erlandas [ 2023-05-06 ]

We have implemented the following procedure on our system:

  1. Normally connection takes under 2 ms, normally write test takes under 2 ms too. We consider Replica and Master DB not readable if connection to it takes more than 10 ms. System stops using (for SELECT statements) DB which has local network connection speed over 10 ms.
  2. Write test is made once a second to see how fast writing is, if write takes more than 10ms to master on local network 5 times in a row - we stop writing to master for 2 minutes.
  3. Upon detecting slow connection on either of the replicas or Master DB - that server is not getting any selects for about 20 seconds

This gives some sort of relief for replicas and to master upon detecting slow connection or slow write and subsequently - there was no stalling/hangup on any Replicas and on Master DB in the last few days.

I believe that some SELECT or INSERT/UPDATE takes much longer than it should and then other queries just keep on building which basically hang up server completely and it does not recover for many hours even when no more queries are being sent after full hangup.

Stopping within a second after detecting an issue - does not allow the problem to be escalated further and server is able to recover. I am attaching an image showing 3 steps (basically around 3 seconds) of connection times and number of connections (which is fairly moderate).

Comment by Marko Mäkelä [ 2023-05-08 ]

The command

file /usr/sbin/mariadbd

should produce a BuildID snippet that includes some hexadecimal digits in the form XXYY…. If you have the corresponding debug symbol package installed, a file /usr/lib/debug/.build-id/XX/YY….debug should exist. If you invoke the command

gdb /usr/sbin/mariadbd

it should report that the symbols were read:

Reading symbols from /usr/lib/debug/.build-id/XX/YY….debug...

Without those symbols being available, it will be impossible to conclude for sure from the stack traces if this is a duplicate of MDEV-29835. I think that it is very likely, but I cannot say it for sure. That bug would cause the affected InnoDB threads to be hung forever. The only way out is to kill the mariadbd process and to rely on the InnoDB crash recovery.

Comment by Erlandas [ 2023-05-09 ]

file /usr/sbin/mariadbd

/usr/sbin/mariadbd: ELF 64-bit LSB pie executable, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=a020a168d7cbdfc16daa85aba218e1047c5afa9a, for GNU/Linux 3.2.0, stripped

gdb /usr/sbin/mariadbd

GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/mariadbd...
(No debugging symbols found in /usr/sbin/mariadbd)

Checking if debug symbols installed:

sudo apt-get install software-properties-common

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
software-properties-common is already the newest version (0.99.22.6).
0 upgraded, 0 newly installed, 0 to remove and 10 not upgraded.

sudo apt-get install mariadb-server-dbgsym

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
mariadb-server-dbgsym is already the newest version (1:10.11.2+maria~ubu2204).
0 upgraded, 0 newly installed, 0 to remove and 10 not upgraded.

Debug symbols installed as seen from the above messages but GDB is not reading those symbols. What am I missing?

Comment by Sergei Golubchik [ 2023-05-30 ]

danblack any debian-specific suggestions about how to get symbols to show?

Comment by Daniel Black [ 2023-05-31 ]

I pulled up a 22.04 container and the mariadb-server-core-dbgsym package on both 10.11.2 and 10.11.3 and could disassemble the innodb functions on both the executable and running instance with gdb.

Sorry can't see anything wrong or what could cause the lack of resolution of these. The build id is corresponding to MariaDB's build of 10.11.2.

Comment by Vladislav Vaintroub [ 2023-05-31 ]

To the mystery of the decreased connection speed - you're using pool-of-threads, and perhaps threadpool thinks the system is overloaded and throttles thread creations. If you want to workaround this, you can decrease thread_pool_stall_limit from its default 500ms to say 10ms. Or you can also increase thread_pool_oversubscribe from its default value of 3 to say value of 8. Dumping information_schema.thread_pool_queues ( https://mariadb.com/kb/en/information-schema-thread_pool_queues-table ) could also be helpful for diagnostics,

Threadpool is not always informed about waits, i.e not all longish waits are instrumented with thd_wait_begin/thd_wait_end in the code, thus its estimates about system load may be off, this can lead to thread creation throttling, which in turn can lead to increased query times even for trivial queries.

That said, mostly likely the real problem here is a different, perhaps a hang somewhere in Innodb.

Comment by Erlandas [ 2023-06-17 ]

Should I do dumping information_schema.thread_pool_queues before changing pool-of-threads and thread_pool_oversubscribe settings or afterwards?

Currently problem is "nicely controlled" by stopping writes to database when connection to MariaDB is getting slower than 10 ms (instead of normally taking 0.5 ms to connect) and stopping writes for about 1-2 minutes. The problem solves itself and everything is fast again. Happens about once a day, sometimes 2-3 times a day. It seems to happen due to some insert which triggers that as this happens during intensive writing and not during reading only times (with just seldom writes)

Comment by Erlandas [ 2023-06-27 ]

We have code in place now to dump information_schema.thread_pool_queues when we detect Master DB is starting to give slow connection on writes.

But stop using DB server if connection to it slows down after 5 attempts (normal speed is 0.5 ms, we stop using it after it is slower than 10 ms (but usually 20+ms))

A query was run on the last 2 occasions of slowdown but it returned:

MariaDB [information_schema]> SELECT * FROM THREAD_POOL_QUEUES;
Empty set (0.000 sec)

Possibly it did not build the queue yet. We already know that stopping using a server (be is Replica or a Master) after connection to it slows down , does solve the problem and server recovers to normal levels within 20-60 seconds.

We could allow (at expense of a whole website becoming slower) - to let it build the queries and eventually hang up, by not stopping using it when it shows a sign of slowdown. I believe it does start building the queue then as it takes 30-60 minutes for the server to finally hang up after connection starts giving slower response. At first is slows down to 20ms, then 160ms and when it reaches 200-400ms - server is non responsive anymore to read or write queries.

Comment by Vladislav Vaintroub [ 2023-06-27 ]

the most accurate queueing information is returned if you use thread_pool_dedicated_listener=1 and thread_pool_exact_stats=1 .Unfortunately, both come at some cost, with thread_pool_dedicated_listener server is guaranteed to continuously peek events from network, at the expense of more context switches if server is lightly loaded. thread_pool_exact_stats=1 gives accurate queueing time, but queries the system clock more often.

Comment by Erlandas [ 2023-06-27 ]

How badly can it affect the server if I enable both options?

It is having Avg. qps = 2074 with peak being at around 8000 qps. It has 96 vCPU cores, which are loaded at 30% on average and 90% at peak usage.

Comment by Erlandas [ 2023-07-10 ]

I have implemented (as noted in previous message) dumping thread pool queues, but in the last 2 weeks there was just empty dumps every time app sensed DB slowdown.

I have decided to try setting thread_pool_stall_limit and thread_pool_oversubscribe as per your suggestion to see if this stops periodic slowdown in server responses (which fix themselves as soon as you stop using server for a minute):

MariaDB [(none)]> SHOW GLOBAL STATUS LIKE '%thread%';
+------------------------------------------+-------+
| Variable_name                            | Value |
+------------------------------------------+-------+
| Delayed_insert_threads                   | 0     |
| Innodb_master_thread_active_loops        | 0     |
| Innodb_master_thread_idle_loops          | 40306 |
| Performance_schema_thread_classes_lost   | 0     |
| Performance_schema_thread_instances_lost | 0     |
| Slow_launch_threads                      | 0     |
| Threadpool_idle_threads                  | 197   |
| Threadpool_threads                       | 198   |
| Threads_cached                           | 0     |
| Threads_connected                        | 65    |
| Threads_created                          | 4149  |
| Threads_running                          | 5     |
| wsrep_applier_thread_count               | 0     |
| wsrep_rollbacker_thread_count            | 0     |
| wsrep_thread_count                       | 0     |
+------------------------------------------+-------+
15 rows in set (0.000 sec)
 
MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%thread%';
 
+-----------------------------------------+-----------------+
| Variable_name                           | Value           |
+-----------------------------------------+-----------------+
| aria_repair_threads                     | 1               |
| binlog_optimize_thread_scheduling       | ON              |
| debug_no_thread_alarm                   | OFF             |
| innodb_encryption_threads               | 0               |
| innodb_purge_threads                    | 4               |
| innodb_read_io_threads                  | 32              |
| innodb_write_io_threads                 | 32              |
| max_delayed_threads                     | 20              |
| max_insert_delayed_threads              | 20              |
| myisam_repair_threads                   | 1               |
| performance_schema_max_thread_classes   | 50              |
| performance_schema_max_thread_instances | -1              |
| slave_domain_parallel_threads           | 0               |
| slave_parallel_threads                  | 0               |
| thread_cache_size                       | 256             |
| thread_handling                         | pool-of-threads |
| thread_pool_dedicated_listener          | OFF             |
| thread_pool_exact_stats                 | OFF             |
| thread_pool_idle_timeout                | 60              |
| thread_pool_max_threads                 | 65536           |
| thread_pool_oversubscribe               | 3               |
| thread_pool_prio_kickup_timer           | 1000            |
| thread_pool_priority                    | auto            |
| thread_pool_size                        | 96              |
| thread_pool_stall_limit                 | 500             |
| thread_stack                            | 299008          |
| wsrep_slave_threads                     | 1               |
+-----------------------------------------+-----------------+
27 rows in set (0.001 sec)
 
MariaDB [(none)]> SET GLOBAL thread_pool_oversubscribe=8;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [(none)]> SET GLOBAL thread_pool_stall_limit=10;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%thread%';
+-----------------------------------------+-----------------+
| Variable_name                           | Value           |
+-----------------------------------------+-----------------+
| aria_repair_threads                     | 1               |
| binlog_optimize_thread_scheduling       | ON              |
| debug_no_thread_alarm                   | OFF             |
| innodb_encryption_threads               | 0               |
| innodb_purge_threads                    | 4               |
| innodb_read_io_threads                  | 32              |
| innodb_write_io_threads                 | 32              |
| max_delayed_threads                     | 20              |
| max_insert_delayed_threads              | 20              |
| myisam_repair_threads                   | 1               |
| performance_schema_max_thread_classes   | 50              |
| performance_schema_max_thread_instances | -1              |
| slave_domain_parallel_threads           | 0               |
| slave_parallel_threads                  | 0               |
| thread_cache_size                       | 256             |
| thread_handling                         | pool-of-threads |
| thread_pool_dedicated_listener          | OFF             |
| thread_pool_exact_stats                 | OFF             |
| thread_pool_idle_timeout                | 60              |
| thread_pool_max_threads                 | 65536           |
| thread_pool_oversubscribe               | 8               |
| thread_pool_prio_kickup_timer           | 1000            |
| thread_pool_priority                    | auto            |
| thread_pool_size                        | 96              |
| thread_pool_stall_limit                 | 10              |
| thread_stack                            | 299008          |
| wsrep_slave_threads                     | 1               |
+-----------------------------------------+-----------------+
27 rows in set (0.000 sec)

Comment by Erlandas [ 2023-07-13 ]

I can confirm that the changes above made no difference. Slowdowns still happen (and a complete Hangup would happen if application were not to stop using server on slowdown detection).

The changes did not make situation better or worse - frequency is about the same of the issue.

Comment by Vladislav Vaintroub [ 2023-07-13 ]

The queues are accurate, only when thread_pool_dedicated_listener=1 . Otherwise, it can happen that queueing occurs on different level , inside OS, where it is difficult to trace. With thread_pool_dedicated_listener=1, connections with readable socket (i.e with data coming from client) will be immediately pushed into the queue and visible in the information schema, while thread_pool_dedicated_listener=0, sockets can be readable/ready, but it not guaranteed that they are visible or handled immediately., because epoll_wait() can be delayed for up to threadpool_stall_limit milliseconds.

Comment by Erlandas [ 2023-08-31 ]

After this fix - https://jira.mariadb.org/browse/MDEV-31443 I was able to upgrade. And having 11.0.3 version seem to have solved the hangup problem too as I am not seeing server reporting slowdown in writes or reads for the last week or so.

Comment by Daniel Black [ 2023-09-05 ]

Thanks for the feedback.

Marking as dup of MDEV-29835 on general assumptions so far.

Generated at Thu Feb 08 10:21:34 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.