hholzgra, on the general question of why the server does not write a last goodbye message to the peer when closing the connection: in many cases, such as pooled connections, the client won't be actively reading. If the server sends a message and then closes the socket, the unread data on that connection means the client side does not close it either, so the connection persists in TCP, tying up its resources and the client's ephemeral ports.
Another example:
The server sends a large result set and the client is too slow reading it: the socket send buffer fills up and the server's send() runs into a timeout. At that point, sending anything more to the client, e.g. an error packet saying "I'm closing you because I'm running into a send() timeout", makes no sense; in fact, the server would just run into the send() timeout again while trying to send it.
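The effect is easy to reproduce. A minimal sketch (not MariaDB code; Python's settimeout() plays the role of the server's send timeout): once the slow client has stopped draining the kernel send buffer, every further send() stalls and then times out.

```python
import socket

# "server" writes, "client" never reads -- simulating a slow client.
server, client = socket.socketpair()
server.settimeout(0.5)  # analogous to the server-side send timeout

sent = 0
timed_out = False
try:
    while True:  # client is not reading, so this must eventually stall
        sent += server.send(b"x" * 4096)
except socket.timeout:
    # Sending one more "goodbye" packet here would hit the same timeout.
    timed_out = True
finally:
    server.close()
    client.close()

print(f"send() timed out after buffering {sent} bytes")
```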
So the correct thing to do, in most cases where the server closes the connection first, is, I think, just to close the connection and do it "forcefully", with a TCP reset. https://jira.mariadb.org/browse/MDEV-14113 discusses that in some more detail. Closing the connection gracefully only results in the client reading 0 bytes, a TCP ephemeral port shortage on the client, and other TCP resources tied up on the server due to the TIME_WAIT state.
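For reference, the standard way to get a reset-on-close on a TCP socket is SO_LINGER with l_onoff=1 and l_linger=0: close() then discards unsent data and emits RST instead of the FIN handshake, so the closing side skips TIME_WAIT. A minimal sketch on a loopback connection (illustrative only, not the server's actual code):

```python
import socket
import struct

def close_with_reset(sock):
    # l_onoff=1, l_linger=0: close() drops any unsent data and sends a
    # TCP RST instead of FIN, so this side does not enter TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    sock.close()

# Demonstration: the peer observes the reset as ECONNRESET, not EOF.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()

close_with_reset(server_side)
try:
    client.recv(1024)       # a graceful close would return b"" here
    reset_seen = False
except ConnectionResetError:
    reset_seen = True

client.close()
listener.close()
print("client saw RST:", reset_seen)
```

Note the peer gets a hard error rather than a clean EOF, which is exactly the intended behavior here.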
There are some exceptions to this, when the client connection is active, i.e. a query is running and the connection is being killed. A notable case where the server should send a message is problems during connection setup: "Too many connections", "user blocked", "authentication error", and such. The client is actively reading the server's response in those cases.
But otherwise, I guess logging the error server-side and killing the TCP connection without the last goodbye would be the correct thing to do.
Finally, there used to be some problems with "out-of-band" error messages being misinterpreted by multiple client drivers as protocol breakage; see MDEV-19893 for example.
Also, the well-meant "connection was killed" message turned out pretty badly for client connection pools, see https://github.com/sqlalchemy/sqlalchemy/issues/4945. A TCP close with reset would be much, much better.
What settings do you use, so that queueing time actually matters? 30 seconds of queueing, as in your example, is a long time.
The responsibility for the inactivity can lie on the server side, due to misconfiguration, or the network can be blamed, too. The server cannot accurately count the client's idle time: it is not the same process, and is often on a separate machine.
We do not know for sure when the client request arrives either. It can sit in the epoll queue until it is picked up by the listener. And the listener might temporarily assume the "worker" role, so that a thread group has no listener at all during that time. Usually this period is short, but it can of course be misconfigured with a huge stall_limit.
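The gap between "request is ready in the kernel" and "server notices it" can be sketched with a toy poll loop (using Python's selectors as a stand-in for the thread pool's epoll listener; the sleep simulates the listener being busy doing worker duty):

```python
import selectors
import socket
import time

sel = selectors.DefaultSelector()
a, b = socket.socketpair()
a.setblocking(False)
sel.register(a, selectors.EVENT_READ)

b.send(b"query")               # the request "arrives" now, queued in the kernel
time.sleep(0.2)                # ...but no listener is polling during this time

t0 = time.monotonic()
events = sel.select(timeout=0)  # listener finally polls; event is ready at once
picked_up = len(events) == 1

sel.close()
a.close()
b.close()
print("picked up after listener delay:", picked_up)
```

The server only observes the readiness event at the select() call, so any time spent before that is invisible to it, which is the point about not knowing the true arrival time.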