MariaDB Server / MDEV-21101

Clients run into wait_timeout while waiting for a free thread with pool-of-threads

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.3.20, 10.4.10
    • Fix Version/s: 10.3.24, 10.4.14, 10.5.5
    • Component/s: Server
    • Labels: None
    • Environment: Linux

    Description

      When using pool-of-threads, clients whose latest query is still in the thread pool queue, waiting for a free execution thread, do not have that query shown in the SHOW PROCESSLIST output; instead they show up with State: Sleep.

      Apparently they are also considered idle by the server itself: after a query has been waiting in the thread pool queue for more than wait_timeout seconds, the connection gets terminated, with a message in the server's error log reporting a read timeout (if log_warnings > 2).

      The client just receives an unspecific "Lost connection to MySQL server during query" error, without any actual information about why the query and the connection were terminated.

      Even worse: sending the query to the server does not even reset the idle timer. With wait_timeout=60 and a client that has already been idle for 30 seconds, the query and the connection are terminated after only another 30 seconds.

      IMHO this is wrong, as the client is not actually idle and the responsibility for the inactivity is entirely on the server side. Even if we do consider terminating a query after a certain wait time in the pool-of-threads queue, there should be a more meaningful error message for it, and a separately configurable timeout.
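
      A minimal reproduction sketch: the settings below are illustrative assumptions chosen to make queueing easy to trigger with a one-thread pool, not values taken from this report, and extra_port provides a pool-bypassing connection for observation:

          -- server started with something like:
          --   thread_handling=pool-of-threads
          --   thread_pool_size=1
          --   thread_pool_max_threads=1
          --   wait_timeout=60
          --   log_warnings=3
          --   extra_port=3307

          -- Sessions 1 and 2 are connected up front. Session 1 then
          -- occupies the only worker thread:
          SELECT SLEEP(600);

          -- Session 2, ~30 seconds after its last activity, sends a
          -- query; it queues waiting for a free worker:
          SELECT 1;

          -- Session 3, connected via extra_port: the queued client is
          -- shown as idle rather than as executing its query.
          SHOW PROCESSLIST;

          -- After wait_timeout expires (counted from session 2's previous
          -- activity, so only ~30 seconds after it sent the query),
          -- session 2 gets:
          --   ERROR 2013 (HY000): Lost connection to MySQL server during query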


          Activity

            wlad Vladislav Vaintroub added a comment - edited

            What settings do you use, so that queueing time actually matters? 30 seconds of queueing, as in your example, is a long time.
            The responsibility for inactivity can be on the server side, due to misconfiguration, or the network can be to blame, too. The server cannot accurately count the client's idle time; it is not the same process, and it is often on a separate machine.
            We do not know for sure when the client request arrives, either. It can stay in the epoll queue until it is picked up by the listener. And the listener might temporarily assume "worker" responsibility, so that a thread group has no listener at all during that time. Usually this period is short, but it can of course be misconfigured with a huge stall_limit.
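
            For reference, the thread-pool knobs involved here can be inspected at runtime; a quick sketch (MariaDB variable names):

                -- thread_pool_stall_limit (milliseconds) bounds how long a
                -- worker, including a listener that picked up work itself,
                -- may run before the group is declared stalled and another
                -- thread is woken; an oversized value lengthens the window
                -- in which a group has no active listener.
                SHOW GLOBAL VARIABLES LIKE 'thread_pool%';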

            wlad Vladislav Vaintroub added a comment - edited

            hholzgra, on the general question of why the server does not write a last goodbye message to the peer when closing the connection: in many cases, such as pooled connections, the client won't be actively reading. The server sends a message and closes the socket; since there is unread data on that connection and the client does not close it, the connection will continue to persist in TCP, using up its resources and the client's ephemeral ports.

            Another example:
            The server sends a large result set and the client is too slow reading it. The socket send buffer fills up and the server's send() runs into a timeout. Here, sending more to the client, e.g. an error packet saying "I'm closing you because I'm running into a send() timeout", does not make any sense; in fact, the server will run into the send() timeout once again while trying to send it.
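
            A side note: the send() timeout in this scenario corresponds to the server's net_write_timeout setting (the number of seconds to wait for a block to be written to a connection before aborting the write); a sketch of inspecting and raising it for slow readers, with an illustrative value:

                -- net_write_timeout defaults to 60 seconds; a client that
                -- stalls reading a large result set for longer trips it.
                SHOW GLOBAL VARIABLES LIKE 'net_write_timeout';
                SET GLOBAL net_write_timeout = 120;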

            So the correct thing to do in most cases where the server closes the connection first is, I think, just to close the connection and do it "forcefully", with a TCP reset. https://jira.mariadb.org/browse/MDEV-14113 discusses that in some more detail. Closing the connection gracefully will only result in the client reading 0 bytes, a TCP ephemeral port shortage on the client, and other TCP resources held on the server due to the TIME_WAIT state.

            There are some exceptions to this, when the client connection is active, i.e. a query is running and the connection is being killed. Notable cases where the server should send a message are problems during connection setup, such as "Too many connections", "user blocked", "authentication error", and the like; the client is actively reading the server response in this case.

            But otherwise, I guess logging the error server-side and killing the TCP connection without the last goodbye would be the correct thing to do.

            Finally, there used to be some problems with "out-of-band" error messages being misinterpreted by multiple client drivers as protocol breakage; see MDEV-19893 for an example.

            Also, the well-meant "connection was killed" message turned out pretty badly for client connection pools (https://github.com/sqlalchemy/sqlalchemy/issues/4945). A TCP close with reset would be much, much better.


            hholzgra Hartmut Holzgraefe added a comment

            Well, in this case the client is actively waiting: it has sent a query and is waiting for the results. So it should not run into wait_timeout, as it did not let the server wait for new client activity for longer than wait_timeout; it is actually the server that has been starving it for more than wait_timeout seconds ...


            wlad Vladislav Vaintroub added a comment

            Yes, this case is fixed. I was commenting on the general rant about the server being quiet about why it closed the client.


            People

              Assignee: wlad Vladislav Vaintroub
              Reporter: hholzgra Hartmut Holzgraefe
              Votes: 0
              Watchers: 4
