[MDEV-31139] MariaDB hangs with multiple threads open Created: 2023-04-27 Updated: 2024-01-05 Resolved: 2023-09-05 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.11.2 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Erlandas | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | Connection, speed |
| Environment: | Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-70-generic x86_64) |
| Attachments: | |
| Issue Links: | |
| Description |
|
It started happening soon after the upgrade from 10.10 to 10.11.2, I believe. The problem starts when PHP scripts use 250-350 threads, mostly for INSERT statements (SELECT statements are frequent too). On average MariaDB receives around 6000 queries per second, but the problem occurs even at just 2000 queries per second. There is 1 master and 4 replicas (each using 16 parallel slave threads for replication). At some point a server (sometimes one, sometimes two), either the master or a replica, stops processing any INSERT or SELECT statements. The process list then shows several hundred connected threads with times of many minutes or hours. Stopping any further queries to the affected server does not resolve anything; nothing changes for hours. When this happens, a simple CONNECT to MariaDB is measured at 60-460 ms instead of the normal 0.5-1.1 ms. When I try to kill all threads one by one, they are killed; the process list shows them as killed but still present on the list. If I then (or before) try systemctl stop mariadb, the server does not stop for hours (same with restart). The only option is to find a mariadbd process, issue kill -9, and start MariaDB again, which then starts in recovery mode. All tables in the database are InnoDB. Servers that do not receive many SELECT statements (the last 2 on the image list) never get this problem, so I believe replication is not the issue. Replicas are read-only. This happens about once or twice a day at different times, but when the number of queries is on the high side. There is no shortage of RAM or disk. |
| Comments |
| Comment by Marko Mäkelä [ 2023-04-27 ] |
|
Can you please try to produce stack traces of all threads during the hang? Specifically, the output of the GDB command thread apply all backtrace full would allow me to confirm whether this is a duplicate of |
| Comment by Daniel Black [ 2023-04-27 ] |
|
With so many threads, you may need to use gcore (from the gdb package) to capture a core file and then take the backtrace from that core. |
| Comment by Erlandas [ 2023-04-27 ] |
|
process_list2.txt I'll read up on how to produce stack traces, and the next time I have a hang, I'll capture them. |
| Comment by Erlandas [ 2023-04-27 ] |
|
Yes, I'll be able to use GDB on the next hang and will try to produce output. Do I need any debug modules installed for MariaDB to get meaningful data? |
| Comment by Erlandas [ 2023-04-27 ] |
|
Is mariadb-server-core-10.11-dbgsym the right package for me? It comes up as unavailable:
sudo add-apt-repository 'deb [arch=amd64,arm64,ppc64el,s390x] https://ftp.osuosl.org/pub/mariadb/repo/10.11/ubuntu jammy main/debug'
apt-get update && apt-get install -y mariadb-server-core-10.11-dbgsym |
| Comment by Erlandas [ 2023-04-28 ] |
|
I have just installed the debug symbols with:
sudo apt install mariadb-server-dbgsym
But I see one replica is already having the problem: it is hanging with a large number of threads and is no longer processing any queries. I ran this command:
sudo gdb --batch --eval-command="set print frame-arguments all" --eval-command="thread apply all bt full" /usr/sbin/mariadbd $(pgrep -xn mariadbd) > mariadbd_full_bt_all_threads.txt
and got an output file, but since MariaDB had not been restarted after the debug symbols were installed, it does not show local info. I hope the output is of some value anyway; I'll restart the servers so the debug symbols are picked up the next time they hang. |
| Comment by Erlandas [ 2023-04-29 ] |
|
Today 2 servers hung: 1 replica and the master itself. Do the traces hold sufficient data to find the problem, or should I do something additional? Uploaded 2 files. |
| Comment by Marko Mäkelä [ 2023-05-01 ] |
|
I checked the file master_mariadb_full_bt_all_threads.txt |
| Comment by Erlandas [ 2023-05-01 ] |
|
Thank you for your reply, Marko. What needs to be installed additionally to produce more meaningful output when the problem happens again? I have done this already, but it is obviously not enough:
sudo apt install mariadb-server-dbgsym |
| Comment by Erlandas [ 2023-05-03 ] |
|
One observation after monitoring master connection speed (replicas are monitored as well, but hit this less frequently): a simple connection frequently drops from 0.5-1 ms to 200-400 ms when using mysqldump, or when writing/reading to the master DB reaches 300-400 threads (max_connections is 2000, 96 CPUs, 500 GB RAM), where 90% of queries are reads. |
| Comment by Erlandas [ 2023-05-06 ] |
|
We have implemented the following procedure on our system:
This gives some relief to the replicas and the master upon detecting a slow connection or slow write; subsequently, there has been no stalling/hangup on any replica or on the master DB in the last few days. I believe some SELECT or INSERT/UPDATE takes much longer than it should, and then other queries keep building up, which essentially hangs the server completely; it does not recover for many hours, even when no more queries are sent after the full hangup. Stopping within a second of detecting an issue prevents the problem from escalating, and the server is able to recover. I am attaching an image showing 3 steps (roughly 3 seconds) of connection times and number of connections (which is fairly moderate). |
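The back-off rule described in this thread can be sketched as a small shell check. This is a minimal sketch, not the reporter's actual code: the 10 ms threshold and the 5-attempt window come from later comments, and the probe values below are stand-ins for real connect-time measurements.

```shell
# Sketch of the mitigation: pause writes after 5 consecutive slow connect probes.
# The probe values are hard-coded stand-ins; a real monitor would time an
# actual client connection in milliseconds on each iteration.
threshold_ms=10
slow_count=0
for probe_ms in 20 30 40 50 60; do      # pretend 5 consecutive measurements
  if [ "$probe_ms" -gt "$threshold_ms" ]; then
    slow_count=$((slow_count + 1))
  else
    slow_count=0                        # a fast probe resets the window
  fi
done
if [ "$slow_count" -ge 5 ]; then
  echo "pause writes"                   # the reporter pauses writes for 1-2 minutes
fi
```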
| Comment by Marko Mäkelä [ 2023-05-08 ] |
|
The command file /usr/sbin/mariadbd should produce a BuildID snippet that includes some hexadecimal digits in the form XXYY…. If you have the corresponding debug symbol package installed, a file /usr/lib/debug/.build-id/XX/YY….debug should exist. If you invoke the command gdb /usr/sbin/mariadbd, it should report that the symbols were read. Without those symbols available, it will be impossible to conclude for sure from the stack traces whether this is a duplicate of |
| Comment by Erlandas [ 2023-05-09 ] |
|
file /usr/sbin/mariadbd
/usr/sbin/mariadbd: ELF 64-bit LSB pie executable, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=a020a168d7cbdfc16daa85aba218e1047c5afa9a, for GNU/Linux 3.2.0, stripped
gdb /usr/sbin/mariadbd
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
For help, type "help".
Checking if the debug symbols are installed:
sudo apt-get install software-properties-common
Reading package lists... Done
sudo apt-get install mariadb-server-dbgsym
Reading package lists... Done
The debug symbols are installed, as seen from the above messages, but GDB is not reading them. What am I missing? |
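For reference, gdb locates detached debug symbols via the BuildID path scheme mentioned earlier in the thread (first two hex digits as a directory, the rest as the file name). With the BuildID shown above, the expected debug file path can be derived like this (a sketch; the /usr/lib/debug/.build-id layout is gdb's standard separate-debug-file scheme):

```shell
# Derive the detached-debug-file path that gdb looks for, given a BuildID.
# The BuildID is taken verbatim from the `file /usr/sbin/mariadbd` output above.
build_id=a020a168d7cbdfc16daa85aba218e1047c5afa9a
debug_file="/usr/lib/debug/.build-id/${build_id:0:2}/${build_id:2}.debug"
echo "$debug_file"
# If this file does not exist, the installed -dbgsym package does not match the binary.
```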
| Comment by Sergei Golubchik [ 2023-05-30 ] |
|
danblack, any Debian-specific suggestions on how to get the symbols to show? |
| Comment by Daniel Black [ 2023-05-31 ] |
|
I pulled up a 22.04 container with the mariadb-server-core-dbgsym package on both 10.11.2 and 10.11.3, and could disassemble the InnoDB functions on both the executable and a running instance with gdb. Sorry, I can't see anything wrong or what could cause the lack of symbol resolution. The build ID corresponds to MariaDB's build of 10.11.2. |
| Comment by Vladislav Vaintroub [ 2023-05-31 ] |
|
As for the mystery of the decreased connection speed: you're using pool-of-threads, and perhaps the threadpool thinks the system is overloaded and throttles thread creation. If you want to work around this, you can decrease thread_pool_stall_limit from its default 500 ms to, say, 10 ms, or increase thread_pool_oversubscribe from its default value of 3 to, say, 8. Dumping information_schema.thread_pool_queues ( https://mariadb.com/kb/en/information-schema-thread_pool_queues-table ) could also be helpful for diagnostics. The threadpool is not always informed about waits, i.e. not all longish waits are instrumented with thd_wait_begin/thd_wait_end in the code, so its estimate of the system load may be off; this can lead to thread creation throttling, which in turn can lead to increased query times even for trivial queries. That said, most likely the real problem here is different, perhaps a hang somewhere in InnoDB. |
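The two workaround knobs suggested here would look like this in the server configuration, as a sketch using the example values from the comment (the section name and placement may differ between installations):

```ini
# Sketch: threadpool workaround values suggested in the comment above
# (example values, not a tuning recommendation).
[mariadbd]
thread_pool_stall_limit   = 10   # default 500 (ms); lower delays throttling less
thread_pool_oversubscribe = 8    # default 3; allows more active threads per group
```

Both variables can alternatively be changed at runtime with SET GLOBAL, which is how one would test them without a restart.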
| Comment by Erlandas [ 2023-06-17 ] |
|
Should I dump information_schema.thread_pool_queues before changing the thread_pool_stall_limit and thread_pool_oversubscribe settings, or afterwards? Currently the problem is "nicely controlled" by stopping writes to the database when connecting to MariaDB gets slower than 10 ms (instead of the normal 0.5 ms) and pausing writes for about 1-2 minutes. The problem then solves itself and everything is fast again. This happens about once a day, sometimes 2-3 times a day. It seems to be triggered by some insert, as it happens during intensive writing and not during read-mostly periods (with only seldom writes). |
| Comment by Erlandas [ 2023-06-27 ] |
|
We now have code in place to dump information_schema.thread_pool_queues when we detect the master DB starting to give slow connections on writes, and to stop using a DB server if connecting to it stays slow after 5 attempts (normal speed is 0.5 ms; we stop using it once it is slower than 10 ms, though usually 20+ ms). The query was run on the last 2 occasions of slowdown, but it returned:
Possibly it had not built up the queue yet. We already know that ceasing to use a server (be it a replica or the master) once connections to it slow down does solve the problem, and the server recovers to normal levels within 20-60 seconds. We could allow (at the expense of the whole website becoming slower) the queries to build up and eventually hang the server, by not backing off at the first sign of slowdown. I believe the queue does start building then, as it takes 30-60 minutes for the server to finally hang after connections start responding more slowly. At first it slows to 20 ms, then 160 ms, and by the time it reaches 200-400 ms the server no longer responds to read or write queries. |
| Comment by Vladislav Vaintroub [ 2023-06-27 ] |
|
The most accurate queueing information is returned if you use thread_pool_dedicated_listener=1 and thread_pool_exact_stats=1. Unfortunately, both come at some cost: with thread_pool_dedicated_listener, the server is guaranteed to continuously pick up events from the network, at the expense of more context switches when the server is lightly loaded; thread_pool_exact_stats=1 gives accurate queueing times, but queries the system clock more often. |
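Enabling the two accuracy options described above would look like this, as a configuration sketch (per this comment, both carry some runtime overhead, so they are diagnostic settings rather than defaults):

```ini
# Sketch: options for accurate thread_pool_queues reporting (some overhead).
[mariadbd]
thread_pool_dedicated_listener = 1   # a listener thread only polls the network
thread_pool_exact_stats        = 1   # precise queueing timestamps
```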
| Comment by Erlandas [ 2023-06-27 ] |
|
How badly can it affect the server if I enable both options? It averages 2074 qps, with a peak of around 8000 qps. It has 96 vCPU cores, which are loaded at 30% on average and 90% at peak usage. |
| Comment by Erlandas [ 2023-07-10 ] |
|
I have implemented (as noted in a previous message) dumping of the thread pool queues, but in the last 2 weeks there were just empty dumps every time the app sensed a DB slowdown. I have decided to try setting thread_pool_stall_limit and thread_pool_oversubscribe as per your suggestion, to see if this stops the periodic slowdown in server responses (which fixes itself as soon as you stop using the server for a minute): |
| Comment by Erlandas [ 2023-07-13 ] |
|
I can confirm that the changes above made no difference. Slowdowns still happen (and a complete hangup would still happen if the application did not stop using the server on slowdown detection). The changes made the situation neither better nor worse; the frequency of the issue is about the same. |
| Comment by Vladislav Vaintroub [ 2023-07-13 ] |
|
The queues are accurate only when thread_pool_dedicated_listener=1. Otherwise, queueing can occur at a different level, inside the OS, where it is difficult to trace. With thread_pool_dedicated_listener=1, connections with a readable socket (i.e. with data coming from the client) are immediately pushed into the queue and visible in the information schema, while with thread_pool_dedicated_listener=0, sockets can be readable/ready but are not guaranteed to be visible or handled immediately, because epoll_wait() can be delayed for up to thread_pool_stall_limit milliseconds. |
| Comment by Erlandas [ 2023-08-31 ] |
|
After this fix - https://jira.mariadb.org/browse/MDEV-31443 - I was able to upgrade, and version 11.0.3 seems to have solved the hangup problem too, as I have not seen the server reporting slowdowns in writes or reads for the last week or so. |
| Comment by Daniel Black [ 2023-09-05 ] |
|
Thanks for the feedback. Marking as dup of |