[MDEV-26873] Partial server hang when using many threads Created: 2021-10-21 Updated: 2021-10-29 Resolved: 2021-10-29 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Locking |
| Affects Version/s: | 10.2, 10.3, 10.4, 10.5, 10.6, 10.7 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Roel Van de Paar | Assignee: | Vladislav Vaintroub |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | hang | ||
| Attachments: | |
| Issue Links: | |
| Description |
|
Split from MDEV-26381. Logging a simplified overview here, with easy reproducibility, though there are likely more aspects to these hangs, some described in that ticket.

Execute the attached hang.sql. After a few minutes, even on optimized builds, partial hang issues start to show. SHOW FULL PROCESSLIST output is attached as show_full_processlist.txt.

When logging errors (like ERROR 1146 (42S02) at line 1: Table 'test.t2' doesn't exist) to the screen, it is easy to see when the server starts locking up after 1-5 minutes: the error rate either stops abruptly or slows down significantly. The server then stays in that semi-hung state for 30+ minutes, sometimes unlocking partially, with some threads continuing to process transactions whilst others remain hung.

The machine is not out of memory, not out of storage, and not busy (nothing else running), and is not challenged by the 10k threads (low load average in htop). In other words, as far as I can tell this is not related to server hardware or capacity in any way.

Tested version/revision was 10.7.1 b4911f5a34f8dcfb642c6f14535bc9d5d97ade44 (Optimized) |
| Comments |
| Comment by Roel Van de Paar [ 2021-10-21 ] |
|
MDEV-26873_SHOW_FULL_PROCESSLIST_10.6_OPT.txt |
| Comment by Roel Van de Paar [ 2021-10-21 ] |
|
10.5 also looks to be affected (queries queuing up), but I am not sure it hangs as consistently as 10.6-10.7. |
| Comment by Roel Van de Paar [ 2021-10-21 ] |
|
10.2-10.4 also affected. When shutting down the server, we see:
|
| Comment by Marko Mäkelä [ 2021-10-22 ] |
|
In MDEV-26873_GDB_10.6_OPT.txt.tar.xz
InnoDB cannot possibly be the culprit here. I see no occurrence of "part" in that file either, so partitioning is unlikely to be the cause as well. Of the 9979 threads, 3767 are executing MDL_context::try_acquire_lock_impl() and 8694 are executing open_table_get_mdl_lock(). That is why I set the component to Locking. |
| Comment by Vladislav Vaintroub [ 2021-10-22 ] |
|
I removed "Critical". It can't be critical. It is probably also not "Major". |
| Comment by Vladislav Vaintroub [ 2021-10-22 ] |
|
marko comments that there is no partitioning involved. So, could you please verify that, and file a non-partitioned, maximally simplified example, ideally with 2-3 threads, or with as few threads as possible. |
| Comment by Vladislav Vaintroub [ 2021-10-22 ] |
|
Roel, also, what is a "partial hang"? It is either a hang or not a hang. There can't be a partial hang, just like there is no way to be only a little bit pregnant. |
| Comment by Roel Van de Paar [ 2021-10-22 ] |
|
wlad Please feel free to call or describe it as you please. The best term I could find for the behavior I am seeing is a partial server hang: some operations (like CLI client access, executing SHOW PROCESSLIST, etc.) are still possible, while at the same time many of the threads are in some form of long-duration (30 min+) hung or locked state, with at times some threads continuing to process only a few transactions. That, to me, seems most accurately described as a partial server hang, as opposed to a full server hang where, for example, nothing is possible (no CLI access and no thread interaction of any kind after the hang is established), or a deadlock situation which times out, etc. Also, IMHO, any issue which has the potential to partially hang the server in release builds (and that with simple SQL) is definitely critical. |
| Comment by Roel Van de Paar [ 2021-10-22 ] |
|
Confirmed: the issue is reproducible without partitioning. Updated title, removed tag. |
| Comment by Roel Van de Paar [ 2021-10-22 ] |
|
The issue is not reproducible with only 100, 500, or 2000 threads. Sidenote: a 4 to 5 second pause in output flow (like error messages) is regularly observed. It is possible that some mechanism or background process/thread (of approximately this duration) is firing, and that this may be related to the issue. |
| Comment by Roel Van de Paar [ 2021-10-22 ] |
|
The issue can be produced without even creating a single table. Only DROP TABLE IF EXISTS, INSERT, and SELECT are necessary, in 10k threads. At least two table names for those statements are necessary (even though the tables never exist). With only a single table name, the issue does not reproduce. |
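A sketch of the workload pattern described above (the actual attached hang.sql is not reproduced here; the table names t1/t2 and statement counts are placeholders). Each of the 10k client threads loops over statements such as:

```sql
-- Hypothetical sketch, not the attached testcase: the tables never
-- exist, yet every statement still takes a metadata lock on its name.
DROP TABLE IF EXISTS t1;
DROP TABLE IF EXISTS t2;
INSERT INTO t1 VALUES (1);  -- fails with ERROR 1146, table doesn't exist
INSERT INTO t2 VALUES (1);  -- likewise; repeated many times per table
SELECT * FROM t1;
SELECT * FROM t2;
```

Presumably each failing statement still contends on the MDL namespace for its table name, which would explain why two names reproduce the issue and one does not.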
| Comment by Roel Van de Paar [ 2021-10-22 ] |
|
The issue does not reproduce with only 13 INSERT statements per (nonexistent) table; 23 is sufficient. It is likely that more INSERT statements make the issue easier to reproduce. Whilst the threads are running, there is a build-up of the number of active threads, and there seem to be two stages to this: first a quick build-up to around 95-120 active threads, then some sort of trigger event (5-60 seconds later) with a subsequent quick but gradual build-up to 6-10k active threads in SHOW PROCESSLIST. |
| Comment by Roel Van de Paar [ 2021-10-22 ] |
|
Improved and final testcase. Run, continually looping, in random order, against 10k threads.
I consider this issue critical. |
| Comment by Roel Van de Paar [ 2021-10-22 ] |
|
Note: whilst the testcase in the last comment is able to reproduce the issue, for example:
Adding the following to the SQL mix:
produces the issue more clearly, and faster. |
| Comment by Vladislav Vaintroub [ 2021-10-25 ] |
|
I set its priority to "Minor"; I consider that fair. There are some problems with this bug report, especially "how to reproduce". There is no user bug associated with it. Not everything can be Critical. |
| Comment by Vladislav Vaintroub [ 2021-10-25 ] |
|
The gdb setting that allows suspending threads, to take a point-in-time "thread apply all bt", is "set scheduler-locking on",
according to https://stackoverflow.com/a/5697181/547065. It seems to be the only viable way to get an accurate picture of what's happening. |
| Comment by Roel Van de Paar [ 2021-10-26 ] |
|
Responding to each numbered point: |
| Comment by Roel Van de Paar [ 2021-10-26 ] |
|
Re: 4) It seems the 'set scheduler-locking on' gdb command is not directly useful (for QA) without stepping/debugging. |
| Comment by Vladislav Vaintroub [ 2021-10-26 ] |
|
I'm quite positive about "gdb thread apply all bt" not stopping the threads. And we have seen obviously "impossible" stack traces in the past, like code executing under a mutex in multiple threads seemingly concurrently, although I have forgotten which bug it was. Maybe marko remembers, as I also recall his proposal to first issue "kill -SIGSTOP" or similar prior to the gdb invocation. The gdb documentation says that a breakpoint will stop all threads, but otherwise they are not stopped, and in the situation we have here there are no breakpoints. There is also a "non-stop mode" and an "all-stop mode" (https://sourceware.org/gdb/current/onlinedocs/gdb/All_002dStop-Mode.html), although I can't extract much sense from that documentation, being only an occasional user of that debugger. |
| Comment by Marko Mäkelä [ 2021-10-26 ] |
|
wlad, I think that you are referring to scheduler-locking. I would think that the main purpose of set scheduler-locking is to control commands like step, stepi, next, nexti, continue, finish. I typically do not use scheduler-locking, so I do not know for sure. I would expect that when the process is paused in a debugger, all threads will be paused, and I would expect that to be the case whenever the debugger is ready to execute commands, such as thread apply all backtrace. So, I do not believe that thread apply all backtrace would display an impossible snapshot of thread stacks.

In any case, I think that it would be much better to try to repeat this with fewer threads (say, fewer than 100 or 50). I have a hard time imagining that a deadlock or livelock (partial hang) would involve more than 5 threads. Usually the root cause of any concurrency problem involves 2 or 3 threads. (Some more threads may be necessary to set up the right conditions.)

My suggestion to Roel: study the stack traces and inject some code around the MDL acquisition (sleeps or sched_yield()) to make the hang more likely with fewer threads. Also, try some other tricks, such as compiling a RelWithDebInfo build with ASAN and UBSAN to affect the timing. |
| Comment by Vladislav Vaintroub [ 2021-10-26 ] |
|
There is no hang in there. Whenever I execute the sequence, 1 CPU is always busy, doing MDL deadlock detection, or something in the table definition cache. So a snapshot that shows everything locked up, when not everything is locked up, is fake. I do trust gdb to produce fake snapshots; that's my experience. |
| Comment by Marko Mäkelä [ 2021-10-26 ] |
|
I did not investigate whether any of the attached stack traces in this ticket expose genuine hangs. My only objective was to prove that no storage engine is the culprit.

I agree that one snapshot of a running server is not sufficient to prove a hang. It would require multiple samples (say, multiple thread apply all backtrace) and some (preferably automated) analysis to highlight which threads do not seem to have progressed. (If the backtrace or backtrace full of a thread is identical between samples, or differs at most by some timestamp values, then the thread has likely not made any progress. But we could get an identical stack trace if the exact same command is being executed over and over again.)

With a larger number of threads it is more plausible (although I think improbable) that the operating system scheduler is unfair, and some theoretically runnable thread could have been starved of CPU resources. That we have witnessed many times with rr replay traces that show a 'fake' hang. |
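The sampling analysis described above can be sketched as follows (an illustrative helper, not an existing tool; the snapshot format, a mapping from thread id to a tuple of frame names, is an assumption):

```python
def suspected_stuck_threads(samples):
    """Given several point-in-time snapshots, each a dict mapping a
    thread id to its backtrace (a tuple of frame names), return the
    thread ids whose stack is identical in every snapshot.  An
    identical stack suggests no progress, but it can also mean the
    same statement is being re-executed, so this is a heuristic only."""
    if not samples:
        return set()
    first = samples[0]
    stuck = set(first)
    for snapshot in samples[1:]:
        # Drop threads that changed stacks or exited between samples.
        stuck = {tid for tid in stuck if snapshot.get(tid) == first[tid]}
    return stuck


# Example with two synthetic samples: thread 2 never leaves its MDL wait.
s1 = {1: ("poll", "vio_read"),
      2: ("MDL_wait::timed_wait", "MDL_context::acquire_lock")}
s2 = {1: ("do_command", "dispatch_command"),
      2: ("MDL_wait::timed_wait", "MDL_context::acquire_lock")}
print(suspected_stuck_threads([s1, s2]))  # {2}
```

With real data, the snapshots would come from repeated "thread apply all backtrace" runs parsed into this shape.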
| Comment by Vladislav Vaintroub [ 2021-10-26 ] |
|
So, as perf reports, there is no hang; instead the MDL subsystem is busy trying to find a deadlock, in a system with 10K participants:
73.53% mariadbd [.] MDL_lock::visit_subgraph |
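A toy model of why this dominates (illustrative only; the real algorithm and data structures of MDL_lock::visit_subgraph differ): every thread that starts waiting runs a search over the wait-for graph, and with thousands of waiters contending on the same few names each search can touch every other waiter, so total work grows roughly quadratically with the number of threads.

```python
def find_cycle(waits_for, start):
    """Depth-first search from `start` over a wait-for graph given as
    {waiter: [blockers]}; returns (cycle_found, edges_visited)."""
    seen, stack, edges = {start}, [start], 0
    while stack:
        node = stack.pop()
        for nxt in waits_for.get(node, ()):
            edges += 1
            if nxt == start:        # found a path back: deadlock
                return True, edges
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False, edges


def total_detection_cost(num_threads):
    """Toy cost model: the k-th thread to block scans the k-1 waiters
    already queued on the lock, so total work is ~n*(n-1)/2."""
    return sum(k for k in range(num_threads))


print(find_cycle({1: [2], 2: [3], 3: [1]}, 1))   # (True, 3)
print(total_detection_cost(10_000))              # 49995000
```

At 10K participants the quadratic term reaches tens of millions of graph visits, which is consistent with one CPU being permanently busy in the deadlock detector.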
| Comment by Vladislav Vaintroub [ 2021-10-26 ] |
|
I attached a quick and dirty program which demonstrates the allegedly "very easy to reproduce" issue. |
| Comment by Vladislav Vaintroub [ 2021-10-26 ] |
|
So, the resolution to this problem is currently to add the threadpool settings.
There is currently a cosmetic problem: the threadpool will reach its maximum (it does so when contention is detected) and then write the following into the error log:
[ERROR] should definitely be a warning; I'll change that, back to 10.2. Maybe it should be a [Note], I am not sure. |
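A sketch of the workaround configuration discussed in this thread (the option names are real MariaDB options; the 1000 cap is one value mentioned in this ticket and should be tuned per workload):

```ini
# my.cnf sketch: use the thread pool instead of one-thread-per-connection
# and cap the number of OS threads, so the MDL deadlock detector never
# has to traverse a 10K-participant wait-for graph.
[mariadbd]
thread_handling = pool-of-threads
thread_pool_max_threads = 1000
```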
| Comment by Roel Van de Paar [ 2021-10-29 ] |
|
Thank you wlad for the analysis & input thus far, as well as for writing a test script from your end. Very cool. Observations from my side:
Question:
Notes:
|
| Comment by Vladislav Vaintroub [ 2021-10-29 ] |
|
"Using --thread_pool_max_threads=1000 - whilst it does not produce the issue - is just masking the issue as the number of threads is limited to 1000 in that case. 1000 threads would not be sufficient for customers with busy websites etc."

This is one-thread-per-connection thinking. You do not need as many threads as there are connections. The thread pool's idea is to mask those issues, to fix the OS scheduler problems that Linux has, and to reduce contention. If I run my program on a different OS, the scheduler there turns out to be smarter, as seen in the output of my program. Run it on Windows 10 with a quad-core, and it runs much faster than on a 32-core Linux machine, even with its non-default one-thread-per-connection mode. So if you like, maybe you can tune your Linux scheduler; I can't do that, I have no clue.

And yes, if one sets thread_pool_max_threads to 10K, the number of threads will grow until it degenerates into basically one-thread-per-connection, because high contention is a bad scenario for the threadpool (and for everyone else). Once the contention is over, the additional threads disappear.

Now, take a look at how many threads, say, SQL Server runs: https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/configure-the-max-worker-threads-server-configuration-option?view=sql-server-ver15. They have at most 16 per core on machines with fewer than 64 cores, and 32 on machines with more than 64 cores. To reach 10K OS threads, as you suggest, one would need 300 or so cores.

To know how much improvement a setting like "thread-handling=pool-of-threads" with "thread-pool-max-threads" gives, measure QPS; with that, it is no longer "remains to be seen". My program, which I shared, measures QPS. You can script sysbench otherwise, and if you manage to run it with 10K connections (because I can't), it gives a better overview, not only of QPS but also of latency.

The table cache automatically opens another instance when it measures contention. There is contention on the table cache (no surprise), therefore there are many instances.

I would only trust gdb to be point-in-time if it worked on a crash dump. This is my experience from analyzing several thousands-of-threads dumps over the years; I'm not really new to this. So we assume that gdb "thread apply all" is not really point-in-time (why else would it have all these all-stop and non-stop modes, really?), or that there are some problems with the Linux scheduler and whatnot that make it appear as if it were not point-in-time. There is a workaround for this: use the wonderful PMP tool and take several probes, and the result will be at least statistically correct. And the abbreviated stack traces in PMP are much, much better to read than dumps with all variables and function parameters; those dumps take longer to read than both volumes of War and Peace. For SHOW PROCESSLIST, I even happen to know it is not point-in-time, because I understand its implementation.

Anyway, I suggest closing this one, because there is a workaround with the threadpool, "more or less", much better than without. I'm closing it in favor of the newly created task MDEV-26935. I doubt we can do anything for already released versions. |
| Comment by Vladislav Vaintroub [ 2021-10-29 ] |
|
Closing, as the MDL deadlock detector cannot be fixed in already released versions to scale to 10K OS threads. There is a workaround: use the threadpool and cap thread_pool_max_threads at a reasonable value; if you find 1K unreasonable, try 2K or 4K. I created MDEV-26935 as a task to try to make MDL scale to 10K concurrent OS threads. It is a task, not a bug, and it requires considerable time even for investigation and experimentation alone.