[MDEV-16440] merge 5.7 P_S sysvars instrumentation and tables Created: 2018-06-08 Updated: 2023-12-07

| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Performance Schema |
| Fix Version/s: | 11.5 |
| Type: | New Feature | Priority: | Critical |
| Reporter: | Sergei Golubchik | Assignee: | Nikita Malyavin |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | Compatibility |
| Issue Links: | |
| Comments |
| Comment by Oleksandr Byelkin [ 2020-09-08 ] |

https://dev.mysql.com/doc/refman/5.7/en/performance-schema-system-variable-tables.html
| Comment by Oleksandr Byelkin [ 2021-07-23 ] |

I have pushed unfinished work to bb-10.6-MDEV-16440. The task should be rewritten using the APC interface (see THD::apc_target & Co).
| Comment by Nikita Malyavin [ 2021-09-14 ] |

So far, the solution has been polished, and patches d1a925b...53478dc were reviewed by Sergei. We now have two competing solutions:
2. Wlad's: add queue_apc(void (*f)(THD*)) into scheduler_functions, and make an implementation on the scheduler side to safely process APCs while the thd (connection) is inactive.
| Comment by Nikita Malyavin [ 2022-02-08 ] |

Unfortunately, the first solution will not work: there can be a race condition. Guarding statement processing with some lock would also be an incorrect solution (there is a deadlock possibility). So the only known way is to schedule the APC on the target thread's side.

2. In tpool mode, THD can be inactive.

We have to resolve both problems, which is why scheduler_functions is very convenient here. The tpool solution is more or less straightforward: we can just send the task to the work queue dedicated to a particular THD. In thread-per-connection mode, a THD is usually poked out of waiting by destroying the socket; this wakes up the thread, but the connection will then be refused. It is done so because the only reason for poking the thread used to be a graceful KILL.

Waiting on recv()/poll() can be interrupted by a signal on *nix, and by calling WSACancelBlockingCall/CancelIoEx on Windows. We can miss the proper timing for calling it, though; then the cancellation will be wasted.

A. We can temporarily set the timeout to some negligible value during signal handling, and restore it after the APC is processed. But what if another APC request comes after the APC is processed and before the timeout is reset? We can then first reset the timeout and then process the APC; this should work.

B. Another solution is to hook into the poll() used to wait for input events in vio_wait (see vio_read: it actually never waits on recv(), and ought to always wait in vio_wait, i.e. poll). We can add a function to the Vio API, like setup_wakeup(socketfd* sock, callback*), which adds an additional socket to wait on in the mentioned poll() call. I think setting up the callback is also a good idea: then vio_read() behavior will not be changed.

EDIT: we have to extend the Vio and Net APIs in both cases.
| Comment by Nikita Malyavin [ 2022-03-01 ] |

So far, both algorithms are ready.
First, the thread-per-connection algorithm has been implemented: https://github.com/MariaDB/server/commit/0cc9df282a51ff6d0c73c0ec23e3bb7e3299e4ff
Second, the threadpool algorithm was done: https://github.com/MariaDB/server/commit/9eff530067954877a743c34f8cc961840370f0ff

During all this research, a race condition in APC was also found. The thing is, we can't destroy the mutex that was involved in cond_wait (it is reacquired afterwards, and, then, oops):
| Comment by Nikita Malyavin [ 2022-03-03 ] |

The thread-per-connection algorithm was redesigned, see the commit message:
| Comment by Nikita Malyavin [ 2022-06-19 ] |

wlad, this task is ready for your review. The actual code is now located in bb-10.10-nikita; the current head is here.
| Comment by Vladislav Vaintroub [ 2022-06-19 ] |

Nikita, as before, the locking strategy in the threadpool is questionable, and is not safe unless the APIs we use guarantee that safety on all platforms the server runs on.

The problem is that pthread_mutex_unlock() must not access any structures of the lock after giving up ownership (allowing another thread to access), because that structure can be freed by a thread that is doing pthread_mutex_destroy() in parallel. That guarantee is stronger than the mutual exclusion we usually need from a mutex. And it is not the same as "It shall be safe to destroy an initialized mutex that is unlocked" in the Single Unix Spec, which basically says that "destroying a locked mutex is bad". The guarantee needed is stronger than that.

This guarantee is apparently explicitly not given on Windows, where the documentation for DeleteCriticalSection says the following: "The caller is responsible for ensuring that the critical section object is unowned and the specified CRITICAL_SECTION structure is not being accessed by any critical section functions called by other threads in the process" (emphasis mine). I can see DeleteCriticalSection and LeaveCriticalSection running at the same time here; there is no prevention. One way to play safe is to use locks that are not destroyed until the very end of the server, for example.

More, this lock and unlock once per every single command seems too expensive for an interface that was supposed to be free. I do not think we should accept work that adds this amount of locking without extensive benchmarking first. I suggest benchmarking a "very multiuser" "DO 1", or protocol-level COM_PING.
| Comment by Nikita Malyavin [ 2022-06-20 ] |

> The problem is that pthread_unlock() must not access any structures of the lock, after giving up the ownership (allowing another thread to access), because that structure can be freed by a thread that is doing pthread_mutex_destroy() in parallel

You are right, it's not safe to destroy the mutex in parallel with it being held. POSIX also classifies it as UB.

Thus, if the worker resumes execution after c->start_io and reaches threadpool_remove_connection() before LOCK_thd_kill is unlocked, it will wait there until it is unlocked. This means it is safe to unlock LOCK_thd_kill after c->start_io(), if it was locked before. A similar guarantee holds when LOCK_thd_kill is released by requestor threads or by killer threads. The same applies to the code in ~THD; what makes it work there is THD_list_iterator::iterate (actually it's just server_threads) and Find_THD_variable::operator(): that is, LOCK_thd_kill is locked before server_threads.lock is unlocked. On the connection side, you may see that first unlink_thd(thd) is called (that is, server_threads.erase(thd), so server_threads.lock was reacquired), and only then is LOCK_thd_kill reacquired in ~THD.

> More, this lock and unlock once per every single command seems too expensive

You may like an obstruction-free approach more, see the next commit. There, the overhead is only two atomic increments. One increment happens before any command is processed, so it's 1ns worse. It also implies weaker guarantees for a requestor thread, but that should be OK, given that it is a very specific request.
| Comment by Vladislav Vaintroub [ 2022-06-21 ] |

There is no such guarantee given by the THD destructor; we also discussed this earlier. Here is an example:

There is no mention in the Microsoft docs that the critical section data is not accessed after another thread is allowed to run. There is instead a mention, in the DeleteCriticalSection documentation: "The caller is responsible for ensuring that ... the specified CRITICAL_SECTION structure is not being accessed by any critical section functions called by other threads in the process." Did the lock/unlock trick ensure that? No. DeleteCriticalSection runs at the same time as LeaveCriticalSection, as the example shows. This is what you should not do, according to the MS documentation. This is UB, if you want it. Thus, the code in the patch puts the server in danger of a crash during any (i.e. also ordinary) client disconnect, which I'd think is not acceptable.

A possible workaround is not to release those critical sections, but to reuse them, and really delete them only at the very end of the server process.

Also, "context switching is costly, thus we make it even more costly" is not a very good argument. If it is costly, and measured to be costly, we need to make it cheaper.
| Comment by Nikita Malyavin [ 2022-06-21 ] |

I have revised the documentation related to CriticalSection. Indeed, it gives no clue about the CRITICAL_SECTION structure access order. While I think it's misleading to put it this way (no sane implementation should access a mutex structure after ownership is released), let's see what the developers say: https://github.com/MicrosoftDocs/sdk-api/pull/1212. Anyway, if it turns out that winapi's CriticalSection violates such guarantees, I think we should just stop using it and implement a mutex ourselves, for example based on a spinlock + WaitOnAddress.
| Comment by Vladislav Vaintroub [ 2022-06-21 ] |

There is no need to implement a new mutex. There are no guarantees in POSIX either; it is just that the MS documentation is more explicit. One can't just free a structure while it can be in use by any other function; it is just common sense. That the functions in this case happen to be pthread_mutex_lock/unlock does not eliminate the common sense.
| Comment by Nikita Malyavin [ 2022-06-21 ] |

POSIX does:
| Comment by Vladislav Vaintroub [ 2022-06-21 ] |

The current version of the Single Unix Spec for pthread_cond_destroy
| Comment by Vladislav Vaintroub [ 2022-08-23 ] |

So, nikitamalyavin, I fixed the Windows and threadpool implementations of the "wake" to address the following limitation of your previous implementation:

With that, the implementation of "wake" looks more or less OK to me. All this work is now in bb-10.10-MDEV-16440. There needs to be much more investment into testing, because there are a lot of subtle scenarios that might go wrong. Apparently interesting cases that are not covered by your testing just yet are: