Details
Type: New Feature
Status: In Review
Priority: Critical
Resolution: Unresolved
Attachments
Issue Links
- blocks
  - MDEV-26318 update P_S to 8.0.x (Open)
  - MDEV-32355 Show user variables in performance_schema (Open)
- is part of
  - MDEV-6114 merge new P_S instrumentation and tables from 5.7 (Closed)
Activity
I have pushed unfinished work to bb-10.6-MDEV-16440
The task should be rewritten using the APC interface (see THD::apc_target & Co).
So far, the solution has been polished and patches d1a925b...53478dc were reviewed by Sergei.
However, the APC issue with inaccessible sleeping threads is still unresolved. It turned out to be non-trivial, because LOCK_thd_kill can't be held while retrieving system variables.
We now have two competing solutions:
1. Mine: add a THD::awake flag, protected by a mutex and a condvar. Set awake = true before a statement runs, and awake = false after. Then, on the requestor side, the usage is:
if (!thd->apc_target.is_enabled())
{
  lock(&thd->LOCK_thd_awake);
  thd->COND_awake.wait([thd]{ return !thd->awake; });
  unlock(&thd->LOCK_thd_kill);
  (this->*func)(param, update_plugin);
  unlock(&thd->LOCK_thd_awake);
}
2. Wlad's: add queue_apc(void (*f)(THD*)) into scheduler_functions, and make an implementation on the scheduler side to safely process APCs while the THD (connection) is inactive.
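A hedged sketch of how such a hook might look (the exact signature and placement are my guess, not an actual patch; the existing members of scheduler_functions are elided):

struct THD;                    /* server type, declared elsewhere */

struct scheduler_functions
{
  /* ... existing callbacks (init, add_connection, end_thread, ...) ... */

  /* Proposed: run f(thd) for the target connection at a point where the
     scheduler knows the connection is not executing a query (idle in the
     pool, waiting on the socket, etc.). Signature is assumed.            */
  void (*queue_apc)(THD *thd, void (*f)(THD *));
};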
Unfortunately, the first solution will not work: there can be a race condition. Guarding statement processing with some lock would also be an incorrect solution (there is a deadlock possibility).
So the only known way is to schedule the APC on the target thread's side.
There are two problems:
1. In thread-per-connection mode, the thread can be waiting for socket data.
2. In tpool mode, the THD can be inactive.
We have to resolve both problems; this is why scheduler_functions is very convenient here.
While the tpool solution is more or less straightforward (we can just send the task to the work queue dedicated to a particular THD), the thread-per-connection mode requires some thinking.
Usually a THD is poked out of waiting by destroying the socket: this wakes up the thread, but the connection is then terminated. It is done this way because, so far, the only reason for poking the thread was to gracefully KILL it.
Waiting on recv()/poll() can be interrupted by a signal on *nix, and by calling WSACancelBlockingCall/CancelIoEx on Windows. We can miss the proper timing for calling it, though, and then the cancellation is lost.
A. We can temporarily set the timeout to some negligible value during signal handling, and restore it after the APC is processed. But what if another APC request comes after the APC is processed and before the timeout is restored? We can then first restore the timeout and only then process the APC; this should work.
B. Another solution is to hook into the poll() used to wait for input events in vio_io_wait (see vio_read: it actually never waits in recv() and ought to always wait in vio_io_wait, i.e. poll). We can add a function to the Vio API, like setup_wakeup(socketfd* sock, callback*), which adds an additional socket to wait on in the mentioned poll() call. I think setting up the callback is also a good idea: then vio_read() behavior will not be changed.
EDIT: we have to extend the Vio and Net APIs in both cases.
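A minimal sketch of idea (B), assuming the hypothetical setup_wakeup()-style extension: the wait polls an extra wakeup descriptor alongside the client socket and runs the registered callback when it fires, so vio_read() itself stays unchanged.

#include <poll.h>

/* sd        - the client socket the Vio normally waits on
   wakeup_fd - the extra descriptor registered via the hypothetical
               setup_wakeup(); written to by the requestor thread
   callback  - e.g. "process pending APCs"                            */
int io_wait_with_wakeup(int sd, int wakeup_fd,
                        void (*callback)(void *), void *arg,
                        int timeout_ms)
{
  struct pollfd pfd[2];
  pfd[0].fd= sd;        pfd[0].events= POLLIN;
  pfd[1].fd= wakeup_fd; pfd[1].events= POLLIN;

  int rc= poll(pfd, 2, timeout_ms);
  if (rc > 0 && (pfd[1].revents & POLLIN))
  {
    callback(arg);   /* a real implementation would also drain wakeup_fd */
  }
  /* report readiness of the client socket exactly as the normal wait would */
  return (rc > 0) ? ((pfd[0].revents & POLLIN) ? 1 : 0) : rc;
}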
So far, both algorithms are ready.
First, the thread-per-connection algorithm has been implemented:
https://github.com/MariaDB/server/commit/0cc9df282a51ff6d0c73c0ec23e3bb7e3299e4ff
The key points there are:
- Signal handler setup in setup_connection_thread_globals
- The function thread_scheduler_notify_apc(), which:
  - On Linux, sends a signal to the processing thread (pthread_kill). For now, the signal chosen is SIGURG. The signal handler sets the net timeout to a minimal value (1 ms): if the signal wakes the thread up from poll (see vio_io_wait), then the APC will be processed on the retry; if the signal arrived before the wait, the thread will wake up by timeout very soon.
  - While writing this, I found a small window where the signal can still have no effect: if the value was already read from vio->read_timeout but poll was not yet called. I think some retry loop may be required there to make sure the handshake happens.
  - On Windows, QueueUserAPC2 is used. It does not wake the thread up from IO by itself, so CancelSynchronousIo is called additionally.
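Roughly, the Linux side of this can be pictured as follows (the TLS slot and the timeout setter are illustrative names, not the actual patch; the real handler works on vio->read_timeout):

#include <signal.h>
#include <pthread.h>

struct THD;                                           /* server type, declared elsewhere */
extern void set_net_read_timeout_ms(THD*, unsigned);  /* hypothetical helper             */
static __thread THD *signal_thd;                      /* illustrative TLS slot, set by the
                                                         connection thread at startup     */

static void apc_wakeup_handler(int)
{
  /* Shrink the read timeout to ~1 ms: if the signal arrived before poll()
     was entered, the next wait still ends almost immediately; if it
     interrupted poll() (EINTR), the retry will process the APC.          */
  set_net_read_timeout_ms(signal_thd, 1);
}

void notify_apc_sketch(pthread_t connection_thread)   /* cf. thread_scheduler_notify_apc() */
{
  pthread_kill(connection_thread, SIGURG);            /* interrupts poll() with EINTR      */
}

void install_apc_handler()                            /* cf. setup_connection_thread_globals */
{
  struct sigaction sa= {};
  sa.sa_handler= apc_wakeup_handler;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags= 0;                                     /* no SA_RESTART: let poll() see EINTR */
  sigaction(SIGURG, &sa, NULL);
}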
Second, the threadpool algorithm is done:
https://github.com/MariaDB/server/commit/9eff530067954877a743c34f8cc961840370f0ff
- This implementation involves one additional unconditional LOCK_thd_kill lock before c->start_io. I liked it because it happens at a safe time, when there are no more immediate requests to process. However, that didn't satisfy Wlad, the main threadpool author.
- Then I made another implementation, which I call epoch-based. It doesn't add any unconditional locks to the task lifetime, but it costs THD 16 bytes and two atomic increments:
https://github.com/MariaDB/server/commit/eec4ee24be4b445debbd4c233124f631ed186306
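For illustration only (I am guessing at the shape purely from the stated cost of 16 bytes in THD and two atomic increments per request; this is not the actual commit): two monotonically increasing counters let a requestor tell, without any lock, whether the worker is currently inside a request.

#include <atomic>
#include <cstdint>

struct request_epoch                   /* would live inside THD: 2 x 8 bytes */
{
  std::atomic<uint64_t> entered{0};    /* bumped right before a request is handled */
  std::atomic<uint64_t> left{0};       /* bumped right after it is finished        */

  void begin_request() { entered.fetch_add(1, std::memory_order_release); }
  void end_request()   { left.fetch_add(1, std::memory_order_release); }

  /* Requestor side: true if the worker is between begin_request() and
     end_request(), i.e. it will notice the APC by itself.               */
  bool inside_request() const
  {
    return left.load(std::memory_order_acquire) !=
           entered.load(std::memory_order_acquire);
  }
};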
Both algorithms assume that *nix IO polls are built similarly: removal from the poll is done atomically, and a failure (a missed shot) is reported as ENOENT in errno (see the sketch after this list).
- Probably, the same will be done for the thread-per-connection case with regard to the race just found.
- On Windows, on the contrary, we cannot check whether the removal was successful or not. Instead, the same can be achieved with CancelIo, which wakes up the thread / enqueues the IO work into the pool.
- The Windows implementation is adapted to this, but is not yet tested:
https://github.com/MariaDB/server/commit/a9afa00784265eaa15efacc04fd60ba1c3bd1f91
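To illustrate the *nix-side claim above: with epoll, removing the descriptor is a single atomic call, and a "missed shot" (the worker already picked the connection up, so the fd is no longer registered) surfaces as ENOENT.

#include <sys/epoll.h>
#include <errno.h>
#include <stdbool.h>

/* Returns true if the requestor managed to pull the fd out of the poll set
   itself (the connection was idle in the pool). A false return with
   errno == ENOENT means the worker got there first, so the APC has to be
   handed over to it instead.                                              */
bool try_remove_from_poll(int epoll_fd, int client_fd)
{
  return epoll_ctl(epoll_fd, EPOLL_CTL_DEL, client_fd, NULL) == 0;
}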
***
During all this research, a race condition in the APC was also found. The thing is, we can't destroy the mutex that was involved in cond_wait (it is reacquired afterwards, and then, oops):
https://github.com/MariaDB/server/commit/87e60d900ca348d556862819752f1a95a2aa94b6
The thread-per-connection algorithm was redesigned, see the commit message:
https://github.com/MariaDB/server/commit/621d04a327caae38b9d518119b49018ee1d50aee
wlad, this task is ready for your review. The actual code is now located at bb-10.10-nikita; the current head is here.
Nikita, as before, the locking strategy in the threadpool is questionable and is not safe, unless the APIs we use guarantee that safety on all platforms the server runs on.
The problem is that pthread_mutex_unlock() must not access any structures of the lock after giving up ownership (allowing another thread to proceed), because that structure can be freed by a thread that is doing pthread_mutex_destroy() in parallel. That guarantee is stronger than the mutual exclusion we usually need from a mutex, and it is not the same as "It shall be safe to destroy an initialized mutex that is unlocked" from the Single Unix Spec, which basically says that "destroying a locked mutex is bad". The guarantee needed here is stronger than that.
This guarantee is apparently explicitly not given on Windows, where the documentation for DeleteCriticalSection says the following: "The caller is responsible for ensuring that the critical section object is unowned and the specified CRITICAL_SECTION structure is not being accessed by any critical section functions called by other threads in the process." (emphasis mine). I can see DeleteCriticalSection and LeaveCriticalSection running at the same time here; there is nothing to prevent it.
One way to play it safe is to use locks that are not destroyed until the very end of the server, for example.
More, this lock and unlock once per every single command seems too expensive for an interface that was supposed to be free. I do not think we should accept work that adds this amount of locking without extensive benchmarking first. I suggest benchmarking a highly concurrent "DO 1", or protocol-level COM_PING.
>The problem is that pthread_mutex_unlock() must not access any structures of the lock after giving up ownership (allowing another thread to proceed), because that structure can be freed by a thread that is doing pthread_mutex_destroy() in parallel
You are right; it's not safe to destroy the mutex in parallel with it being held. POSIX also classifies it as UB.
The guarantee is given by the ~THD destructor:

  /*
    Other threads may have a lock on LOCK_thd_kill to ensure that this
    THD is not deleted while they access it. The following mutex_lock
    ensures that no one else is using this THD and it's now safe to delete
  */
  mysql_mutex_lock(&LOCK_thd_kill);
  mysql_mutex_unlock(&LOCK_thd_kill);
This way, if the worker resumes execution after c->start_io and reaches threadpool_remove_connection() before LOCK_thd_kill is unlocked, it will wait until it is unlocked there. This means it is safe to unlock LOCK_thd_kill after c->start_io(), if it was locked before.
A similar guarantee applies when LOCK_thd_kill is released by requestor threads, or by killer threads. It is again that code in ~THD, but what makes it work there is THD_list_iterator::iterate (actually it's just server_threads) and Find_THD_variable::operator(); that is, LOCK_thd_kill is locked before server_threads.lock is unlocked.
On the connection side, you may see that first unlink_thd(thd) is called (that is, server_threads.erase(thd), so server_threads.lock is reacquired), and only then LOCK_thd_kill is reacquired in ~THD.
This means that, in general, we cannot acquire LOCK_thd_kill without server_threads.lock. But we can do so when we guarantee that the connection is not executing and is not in the IO pool.
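A schematic view of that ordering (not the actual code of THD_list_iterator::iterate / Find_THD_variable; the iteration helpers here are made up): LOCK_thd_kill is acquired while server_threads.lock is still held, so the target cannot pass the lock/unlock barrier in ~THD before the requestor is done with it.

THD *find_and_pin_thd(my_thread_id id)       /* schematic, illustrative helpers */
{
  THD *found= NULL;
  mysql_mutex_lock(&server_threads.lock);
  for (THD *thd= thd_list_first(); thd; thd= thd_list_next(thd))
  {
    if (thd->thread_id == id)
    {
      mysql_mutex_lock(&thd->LOCK_thd_kill); /* taken under server_threads.lock  */
      found= thd;
      break;
    }
  }
  mysql_mutex_unlock(&server_threads.lock);  /* the list lock is released last   */
  return found;                              /* caller must unlock LOCK_thd_kill */
}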
> More, this lock and unlock once per every single command seems too expensive
I also didn't like having it this way. That's why I've put significant effort into making this lock happen not before the command, but after it. By the way, it is not a lock per command, but rather a lock per available request data, and only when no requests are left! So it should be quite nice performance-wise. The context switch was costly anyway: we acquire some locks and set mysys_vars with pthread_setspecific (at least twice). See thread_attach and Worker_thread_context::save for a clue. And so on. I guess we also check for killed quite regularly, which implies another LOCK_thd_kill acquisition.
You may also like the obstruction-free approach more, see the next commit. There, only two atomic increments are the overhead. But one increment happens before any command is processed, so it's about 1 ns worse. It also implies weaker guarantees for a requestor thread, but that should be OK, given that it is a very specific request.
There is no guarantee given by the THD destructor. LeaveCriticalSection and DeleteCriticalSection can still run in parallel, and LeaveCriticalSection can finish after DeleteCriticalSection (I'm using the Windows example because the docs support my point). The lock/unlock "trick" does not work if data of the critical section/mutex is accessed after another thread is allowed to run. Those assumptions were always wrong, yet luckily we did not have many outside callers of LOCK_thd_kill before your patch.
We also discussed this earlier.
Here is an example:

Thread 1:
  LeaveCriticalSection
  ... the point where it allows other threads to run

Thread 2:
  EnterCriticalSection
  LeaveCriticalSection
  DeleteCriticalSection

Thread 1:
  continues LeaveCriticalSection, e.g. updates statistics (accessing freed memory)
  CRASH
There is no mention in the Microsoft docs that the critical section data is not accessed after another thread is allowed to run. There is instead a mention, in the DeleteCriticalSection documentation: "The caller is responsible for ensuring that ... the specified CRITICAL_SECTION structure is not being accessed by any critical section functions called by other threads in the process." Did the lock/unlock trick ensure that? No. DeleteCriticalSection runs at the same time as LeaveCriticalSection, as the example shows. This is what you should not do, according to the MS documentation. This is UB, if you will.
Thus, the code in the patch puts the server in danger of a crash during any (i.e. also an ordinary) client disconnect, which I'd think is not acceptable. A possible workaround is not to release but to reuse those critical sections, and really delete them only at the very end of the server process.
Also the "context switching is costly, thus we make it most costly" is not a very good argument. If it is costly, and measured to be costly, we need to make it cheaper.
I have revised the documentation related to CRITICAL_SECTION. Indeed, it gives no clue about the access order of the CRITICAL_SECTION structure. While I think it is misleading to put it this way (no sane implementation should access the mutex structure after ownership is released), let's see what the developers say: https://github.com/MicrosoftDocs/sdk-api/pull/1212.
Anyway, if it turns out that WinAPI's CRITICAL_SECTION violates such guarantees, I think we should just stop using it and implement a mutex ourselves, for example one based on a spinlock + WaitOnAddress (see the sketch below).
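For reference, a bare-bones sketch of what a spinlock + WaitOnAddress mutex could look like (illustrative only; note that the unlock path still dereferences the lock's memory in WakeByAddressSingle, so the destroy-while-unlocking question needs the same care):

#include <windows.h>   /* WaitOnAddress / WakeByAddressSingle, link Synchronization.lib */
#include <atomic>

class futex_like_mutex
{
  std::atomic<ULONG> state{0};          /* 0 = free, 1 = locked */
public:
  void lock()
  {
    ULONG expected= 0;
    for (int i= 0; i < 100; i++)        /* short spin first */
    {
      if (state.compare_exchange_weak(expected, 1, std::memory_order_acquire))
        return;
      expected= 0;
      YieldProcessor();
    }
    while (!state.compare_exchange_weak(expected, 1, std::memory_order_acquire))
    {
      ULONG locked= 1;
      WaitOnAddress(&state, &locked, sizeof(state), INFINITE);  /* park while locked */
      expected= 0;
    }
  }
  void unlock()
  {
    state.store(0, std::memory_order_release);
    WakeByAddressSingle(&state);        /* still touches &state after releasing ownership */
  }
};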
There is no need to implement a new mutex. There are no guarantees in POSIX either; it is just that the MS documentation is more explicit. One can't just free a structure while it can be in use by any other function; that is just common sense. That the functions in this case happen to be pthread_mutex_lock/unlock does not eliminate the common sense.
POSIX does:
> A mutex can be destroyed immediately after it is unlocked.
The current version of the pthread_cond_destroy Single Unix Spec documentation says that the mutex should not be referenced by other threads, and removes the older code example suggesting that the code you were using is safe.
So, nikitamalyavin, I fixed the Windows and threadpool implementations of the "wake" to address the following limitations of your previous implementation:
- Windows, thread-per-connection: rather than the unconventional QueueUserAPC + self-pipe trick with its own implementation of socketpair, use CancelIoEx to cancel the IO, which is what was originally intended (see the sketch after this list). The own socketpair is costly in that it needs 2 extra ports in the ephemeral range on the server, which reduces the maximum possible number of connections by a factor of 3.
- Threadpool: move the common logic into threadpool_common and introduce a new TP_STATE, TP_INTERRUPTED, to mark that the connection just needs to check for the APC rather than read and execute a query. Unfortunately, Linux needs yet another workaround: unlike Windows, where we can cancel IO, or other platforms, where we can disable monitoring of the fd safely, Linux offers nothing like that for epoll, and attempts to fix this with EPOLL_CTL_CLOSE were abandoned.
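For illustration, the Windows-side cancellation mentioned in the first bullet boils down to something like this (error handling simplified; the function name here is mine):

#include <winsock2.h>
#include <windows.h>

/* Wake a thread blocked in a socket read by cancelling its pending IO.
   ERROR_NOT_FOUND just means there was nothing to cancel, i.e. the reader
   was not inside an IO call at that moment and will notice the APC on its
   next check.                                                            */
void cancel_socket_io(SOCKET s)
{
  if (!CancelIoEx((HANDLE) s, NULL) && GetLastError() != ERROR_NOT_FOUND)
  {
    /* unexpected failure: log it */
  }
}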
With that, the implementation of the "wake" looks more or less OK to me. All this work is now in bb-10.10-MDEV-16440.
There needs to be much more investment into testing, because there are a lot of subtle scenarios that might go wrong.
Apparently interesting cases that are not covered by your testing just yet are:
- interrupting named pipe connections on Windows
- interrupting long-running queries, in both thread-per-connection and pool-of-threads, such as:
  - SELECT SLEEP(a lot)
  - SELECT SLEEP(a lot) executed against a remote table (FEDERATED and/or CONNECT)
  - a long operation (reading binlogs) in replication scenarios
  - queries stuck in some kind of lock: user lock (GET_LOCK), TABLE LOCK, FTWRL
You also need a basic test for a non-debug build, as it is tested much more often on the buildbots.
https://dev.mysql.com/doc/refman/5.7/en/performance-schema-system-variable-tables.html