[MDEV-26823] provide a way to monitor thread_pool_max_threads status Created: 2021-09-28  Updated: 2024-02-05

Status: Needs Feedback
Project: MariaDB Server
Component/s: None
Fix Version/s: N/A

Type: New Feature Priority: Major
Reporter: Allen Lee (Inactive) Assignee: Ben Stillman
Resolution: Unresolved Votes: 0
Labels: None


 Description   

A customer is requesting a feature to monitor thread usage against thread_pool_max_threads.
There are tables related to the thread pool in information_schema in 10.5+:

  • thread_pool_groups
  • thread_pool_stats
  • thread_pool_queues
  • thread_pool_waits


 Comments   
Comment by Vladislav Vaintroub [ 2021-10-27 ]

First, thread_pool_max_threads does not grow. It is a system variable, and it does not change unless a user with sufficient privileges changes its value with "SET GLOBAL thread_pool_max_threads".

In the following, I assume that the variable in question is "threadpool_threads", and that people want to know how many threads there are. The answer is: you monitor it like any other global status variable. To know whether the maximum number of threads was reached, compare the status variable "threadpool_threads" with the system variable "thread_pool_max_threads".

SELECT
IF(@@thread_pool_max_threads > VARIABLE_VALUE, 
   "max threads not reached", "max threads reached") 
AS good_to_know 
FROM INFORMATION_SCHEMA.GLOBAL_STATUS
WHERE VARIABLE_NAME='threadpool_threads';

The 'statistic' variable is called "threadpool_threads". There is no forecast, because under contention the number of threads will grow rather rapidly (this was necessary to fix MDEV-2978: when there is no active thread in a group, and they all block or wait on something, a new thread is created without throttling). The reasons why threads are created, and statistics about wakes, waits and stalls, are given in information_schema.thread_pool_stats and information_schema.thread_pool_waits.

I suggest describing the actual problem people have; it is hard to make sense of the request otherwise.
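
As a minimal sketch of where to look, the two tables named above can be read directly. Their column layouts are not given in this thread, so the queries below select all columns rather than assume specific names.

```sql
-- Counters on thread creations, wakes and stalls (10.5+).
SELECT * FROM information_schema.thread_pool_stats;

-- Statistics on what the pool threads waited for (10.5+).
SELECT * FROM information_schema.thread_pool_waits;
```

Sampling these periodically and comparing successive snapshots would show which counters move during a surge.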

Comment by Chris Calender (Inactive) [ 2021-10-28 ]

Hi wlad!

Many thanks for the feedback!

Yes, my apologies about thread_pool_max_threads. Yes, that is the system variable they set.

So your assumption is correct in that they want to monitor the status variable of the max threads being used at any point in time.

While threadpool_threads sounds like it should do what they want, they are not finding that to be the case. Perhaps we simply do not know how the calculation is performed, or maybe it is not tracked 100% correctly.

I say that because the customer has the following threadpool-related variables set:

thread_handling=pool-of-threads
thread_pool_size=512
thread_pool_max_threads=4096

While the server is running, they actively monitor both threadpool_threads and threadpool_idle_threads.

Just before the exhaustion error:

2021-09-01 0:00:23 0 [ERROR] Threadpool could not create additional thread to handle queries, because the number of allowed threads was reached. Increasing 'thread_pool_max_threads' parameter can help in this situation.

They saw reported values of (~1 minute before the crash):

threadpool_threads: 1044
threadpool_idle_threads: 1026

So we're unsure how these 2 variables relate to the 4096 they have set for thread_pool_max_threads.

I would expect threadpool_threads to be close to the 4096? Or at least threadpool_threads + threadpool_idle_threads = 4096? Perhaps even slightly higher based on the docs:

"Number of threads in the thread pool. In rare cases, this can be slightly higher than thread_pool_max_threads, because each thread group needs at least two threads (i.e. at least one worker thread and at least one listener thread) to prevent deadlocks. "

Am I missing something obvious in the calculation? If I take ((threadpool_threads + threadpool_idle_threads) * 2), then I approach 4096, but I think I'm just grasping at straws on that idea...

It would be terrific news if they can indeed simply monitor threadpool_threads and threadpool_idle_threads...

In the end, they just want to be able to monitor it, so they can try to avoid threadpool exhaustion on their side.

Hope this helps explain.

Comment by Vladislav Vaintroub [ 2021-10-28 ]

I think the calculations are incorrect. The thing is:

  • threadpool_threads is the overall number of threads in the pool. It is capped by thread_pool_max_threads; in exceptional situations it can be higher than thread_pool_max_threads, by something like 1 additional thread per group. The number of groups is thread_pool_size. The counter is a global variable, and there are no specific calculations for it: it is incremented after a thread is created and decremented before a thread exits.
  • threadpool_idle_threads is the number of inactive threads. As the documentation states, idle does not necessarily mean "waiting for a new request", but also "blocked on a mutex, IO" etc. Also, SELECT SLEEP(1) would make a thread "idle". The counter is aggregated over all thread groups (keeping it global does not make sense).

In addition

  • (threadpool_threads - threadpool_idle_threads) gives the number of "active" threads running "on CPU" (not always accurate, an approximation, because the threadpool is not always informed about all waits).
  • Threads_running - active_threads, or Threads_running - (threadpool_threads - threadpool_idle_threads), gives the number of currently running queries that are blocking on something, in 10.5.
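
The two expressions above can be sketched as a single query. This is only an illustration: the column aliases are made up for the example, and the three counters are read without any synchronization, so the derived values are approximate.

```sql
-- Derive "active" threads and "running but blocked" queries
-- from the global status counters.
SELECT
  s.pool_threads,
  s.idle_threads,
  s.pool_threads - s.idle_threads AS active_threads,
  s.threads_running - (s.pool_threads - s.idle_threads) AS running_but_blocked,
  @@thread_pool_max_threads AS max_threads
FROM (
  SELECT
    MAX(IF(VARIABLE_NAME = 'THREADPOOL_THREADS',
           CAST(VARIABLE_VALUE AS SIGNED), NULL)) AS pool_threads,
    MAX(IF(VARIABLE_NAME = 'THREADPOOL_IDLE_THREADS',
           CAST(VARIABLE_VALUE AS SIGNED), NULL)) AS idle_threads,
    MAX(IF(VARIABLE_NAME = 'THREADS_RUNNING',
           CAST(VARIABLE_VALUE AS SIGNED), NULL)) AS threads_running
  FROM INFORMATION_SCHEMA.GLOBAL_STATUS
  WHERE VARIABLE_NAME IN
    ('THREADPOOL_THREADS', 'THREADPOOL_IDLE_THREADS', 'THREADS_RUNNING')
) AS s;
```

With the values reported earlier (threadpool_threads: 1044, threadpool_idle_threads: 1026), active_threads would be 18.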

The assumption that the threadpool can't create 3K threads in a minute is incorrect. In your case it did, and it can create 3K threads in a couple of milliseconds if all workers block on something. I guess all your workers are blocking on something; a wild guess is that something like FLUSH TABLES WITH READ LOCK is running. You could try a couple of thousand "SELECT SLEEP(1)" in parallel to experience a surge in thread counts, too.

So, you can monitor both status variables, and in addition you can monitor the "Threads_running" status variable, which tells you how many queries are currently executing (they could block though, contributing to the "idle" thread count). Just don't assume that throttling of thread creation always applies. The throttling rules usually apply, unless there is contention on some global resource, or something else that makes all threads block (as I explained, SLEEP(1) will do that as well).

BTW, unless their box has something like 128 CPUs, thread_pool_size=512 is an exaggeration. It should usually work OK with default settings. If it is a workaround for something, I'd like to know for what. If you want to avoid those messages, do not set thread_pool_max_threads to 4096. Set it larger, or much larger.
To me, it seems like the threadpool is quite misconfigured.

You mentioned a crash. If there was a crash, it is a different thing from this (often harmless) "thread pool blocked" message. I have now changed the severity of the "thread pool blocked" message, so it emits a warning rather than [ERROR]. Nothing is blocked as long as queries are executed. Even if no queries are executed, it could also be just a consequence of global locks.

In the end, I think people should start with default settings for the threadpool, and if they observe a surge in the number of threads, they could correlate it with contention. They could try to ignore the warning; it does not flood the log, and it is written at most once since startup.

And of course, they could use the admin connection via --extra-port to modify thread_pool_max_threads or "recover from this situation". The admin connection does not participate in the threadpool, so it won't experience queuing or anything like that.
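
A minimal sketch of that recovery step, assuming the session is connected through the port configured with --extra-port. The new limit of 16384 is purely illustrative, not a recommendation from this thread.

```sql
-- Confirm which extra port is configured and what the current limit is.
SHOW GLOBAL VARIABLES LIKE 'extra_port';
SHOW GLOBAL VARIABLES LIKE 'thread_pool_max_threads';

-- Raise the limit at runtime (requires sufficient privileges).
SET GLOBAL thread_pool_max_threads = 16384;  -- illustrative value
```

A change made with SET GLOBAL does not survive a restart, so the same value would also need to go into the server configuration.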

Comment by Vladislav Vaintroub [ 2021-10-28 ]

BTW, what should probably be done in the threadpool to avoid surges in thread counts is to add the ability "not" to yield a worker thread if the current thread holds some lock (FTWRL, backup lock, table lock, user lock, or even if it is inside a transaction, possibly holding row locks), plus a minimal throttling interval between thread creations. In many cases, immediate creation of threads is done in the hope that a newly created thread will serve a query that releases some global lock that everyone is waiting for, e.g. UNLOCK TABLES.

Generated at Thu Feb 08 09:48:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.