MariaDB Server / MDEV-16440

merge 5.7 P_S sysvars instrumentation and tables

Activity

          serg Sergei Golubchik created issue -
          serg Sergei Golubchik made changes -
          Field Original Value New Value
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 10.5 [ 23123 ]
          Fix Version/s 10.4 [ 22408 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Labels RM_106_CANDIDATE
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 10.6 [ 24028 ]
          Fix Version/s 10.5 [ 23123 ]
          serg Sergei Golubchik made changes -
          Labels RM_106_CANDIDATE
          serg Sergei Golubchik made changes -
          Assignee Sergei Golubchik [ serg ] Oleksandr Byelkin [ sanja ]
          serg Sergei Golubchik made changes -
          Rank Ranked higher
          sanja Oleksandr Byelkin made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          sanja Oleksandr Byelkin added a comment - https://dev.mysql.com/doc/refman/5.7/en/performance-schema-system-variable-tables.html
          sanja Oleksandr Byelkin made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          serg Sergei Golubchik made changes -
          Rank Ranked higher
          serg Sergei Golubchik made changes -
          julien.fritsch Julien Fritsch made changes -
          Priority Major [ 3 ] Critical [ 2 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Component/s Performance Schema [ 15627 ]
          serg Sergei Golubchik made changes -
          serg Sergei Golubchik made changes -
          julien.fritsch Julien Fritsch made changes -
          Priority Critical [ 2 ] Blocker [ 1 ]
          serg Sergei Golubchik made changes -
          Priority Blocker [ 1 ] Critical [ 2 ]
          julien.fritsch Julien Fritsch made changes -
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 10.7 [ 24805 ]
          Fix Version/s 10.6 [ 24028 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Labels Compatibility
          ralf.gebhardt Ralf Gebhardt made changes -
          Due Date 2021-09-14

          sanja Oleksandr Byelkin added a comment -

          I have pushed unfinished work to bb-10.6-MDEV-16440

          The task should be rewritten using the APC interface (see THD::apc_target & Co).
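          A minimal sketch, from memory, of how the existing APC interface is meant to be used (the Apc_call / make_apc_call shape from my_apc.h, as used by SHOW EXPLAIN; the exact signatures may differ slightly, and Get_sysvars_request / request_sysvars are only illustrative names):

            // Requestor side: package the work as an Apc_call and ask the target THD
            // to run it in its own context the next time it processes APC requests.
            class Get_sysvars_request : public Apc_target::Apc_call
            {
            public:
              void call_in_target_thread() override
              {
                // Runs inside the target thread: safe to read its system variables here.
              }
            };

            bool request_sysvars(THD *caller_thd, THD *target_thd)
            {
              Get_sysvars_request req;
              bool timed_out= false;
              // Blocks until the target runs the call or the timeout expires.
              return target_thd->apc_target.make_apc_call(caller_thd, &req, 1 /* sec */, &timed_out);
            }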

          sanja Oleksandr Byelkin made changes -
          Assignee Oleksandr Byelkin [ sanja ] Nikita Malyavin [ nikitamalyavin ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          serg Sergei Golubchik made changes -
          serg Sergei Golubchik made changes -
          Comment [ A comment with security level 'Developers' was removed. ]
          nikitamalyavin Nikita Malyavin made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]

          nikitamalyavin Nikita Malyavin added a comment -

          So far, the solution was polished and patches d1a925b...53478dc were reviewed by Sergei.
          However, the APC issue with inaccessible sleeping threads is still unresolved. It turned out to be non-trivial, because LOCK_thd_kill can't be held while retrieving system variables.

          We now have two competing solutions:
          1. Mine: add a THD::awake flag, protected by a mutex and condvar. Set awake = true before a statement runs, and awake = false after (a sketch of that target-thread side follows after this list). Then, on the requestor side, the usage:

          if (!thd->apc_target.is_enabled())
          {
            lock(&thd->LOCK_thd_awake);
            thd->COND_awake.wait([thd]{ return !thd->awake; });
            unlock(&thd->LOCK_thd_kill);

            (this->*func)(param, update_plugin);

            unlock(&thd->LOCK_thd_awake);
          }
          

          2. Wlad's: add queue_apc(void (*f)(THD*)) into scheduler_functions, and make an implementation on the scheduler side to safely process APCs while the thd (connection) is inactive.
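          For completeness, a minimal sketch of the target-thread side that solution 1 assumes (THD::awake, LOCK_thd_awake and COND_awake are the hypothetical new members; as noted in a later comment, this approach was eventually rejected as racy):

            // Target (connection) thread: mark the THD as awake for the duration of
            // statement processing, so that requestors only touch it while it sleeps.
            void run_statement_with_awake_flag(THD *thd)
            {
              mysql_mutex_lock(&thd->LOCK_thd_awake);
              thd->awake= true;                        // a statement is about to run
              mysql_mutex_unlock(&thd->LOCK_thd_awake);

              process_one_statement(thd);              // placeholder for the real dispatch

              mysql_mutex_lock(&thd->LOCK_thd_awake);
              thd->awake= false;                       // idle again: requestors may proceed
              mysql_cond_broadcast(&thd->COND_awake);  // wake up any waiting requestor
              mysql_mutex_unlock(&thd->LOCK_thd_awake);
            }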

          serg Sergei Golubchik made changes -
          Priority Critical [ 2 ] Major [ 3 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Due Date 2021-09-14
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 10.8 [ 26121 ]
          Fix Version/s 10.7 [ 24805 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          serg Sergei Golubchik made changes -
          Workflow MariaDB v3 [ 87753 ] MariaDB v4 [ 131693 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 10.9 [ 26905 ]
          Fix Version/s 10.8 [ 26121 ]
          serg Sergei Golubchik made changes -
          Priority Major [ 3 ] Critical [ 2 ]
          nikitamalyavin Nikita Malyavin added a comment - - edited

          Unfortunately, the first solution will not work: there can be a race condition. Guarding statement processing with some lock would also be an incorrect solution (there's a deadlock possibility).

          So the only known way is to schedule the APC on the target thread's side.
          There are two problems:
          1. In thread-per-connection mode, the thread can be waiting for socket data.

          2. In tpool mode, the THD can be inactive.

          We have to resolve both problems; this is why scheduler_functions is very convenient here.

          While the tpool solution is more or less straightforward (we can just send the task to the work queue dedicated to a particular THD), the thread-per-connection mode problem requires some thinking.

          Usually a THD is poked out of waiting by destroying the socket: this wakes up the thread, but the connection will be refused. It is done this way because the only reason for poking the thread was to gracefully KILL it.

          Waiting on recv()/poll() can be interrupted by a signal on *nix, and by calling WSACancelBlockingCall/CancelIoEx on Windows. We can miss the proper timing for calling it, though, and then the cancellation will be lost.

          A. We can temporarily set the timeout to some negligible value during signal handling, and restore it after the APC is processed. But what if another APC request comes after the APC is processed and before the timeout is reset? We can then first reset the timeout and then process the APC; this should work.

          B. Another solution is to hack into the poll() used to wait for input events in vio_wait (see vio_read: it actually never waits on recv(), and ought to always wait in vio_wait, i.e. poll). We can add a function to the Vio API, like setup_wakeup(socketfd* sock, callback*), which adds an additional socket to wait on in the mentioned poll() call. I think setting up the callback can also be a good idea: then vio_read() behavior will not be changed.
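          A minimal sketch of option B, assuming a hypothetical wakeup channel (e.g. one end of a pipe) registered next to the connection socket; setup_wakeup and the names below are illustrative, not the actual Vio API:

            #include <poll.h>
            #include <unistd.h>

            struct wakeup_channel
            {
              int read_fd;   // added to the poll set alongside the connection socket
              int write_fd;  // a requestor writes one byte here to wake the thread up
            };

            static int vio_wait_with_wakeup(int sock_fd, const wakeup_channel &wc, int timeout_ms)
            {
              struct pollfd fds[2]= {
                { sock_fd,    POLLIN, 0 },
                { wc.read_fd, POLLIN, 0 },
              };
              int rc= poll(fds, 2, timeout_ms);
              if (rc > 0 && (fds[1].revents & POLLIN))
              {
                char c;
                (void) read(wc.read_fd, &c, 1);  // drain the wakeup byte
                // the caller would process pending APCs here before retrying the socket wait
              }
              return rc;
            }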

          EDIT: we have to extend the Vio and Net APIs in both cases.

          nikitamalyavin Nikita Malyavin added a comment - - edited

          So far, both algorithms are ready.

          First, the thread-per-connection algorithm has been implemented:

          https://github.com/MariaDB/server/commit/0cc9df282a51ff6d0c73c0ec23e3bb7e3299e4ff
          The key points there are:

          • Signal handler setup in setup_connection_thread_globals
          • Function thread_scheduler_notify_apc() which
            • On Linux, sends the signal to the processing thread (pthread_kill). For now, the signal chosen is SIGURG. The signal handler sets the net timeout to a minimal value (1ms): if the signal wakes the thread up from poll (see vio_io_wait), then the APC will be processed on retry; if the signal arrived before the wait, the thread will wake up by timeout very soon (a minimal sketch of the signal plumbing follows after this list).
              • While I was writing that, I found a small window where the signal can still have no effect: if the value was already read from vio->read_timeout but poll was not yet called. I think some retry loop may be required there to make sure the handshake happens.
            • On Windows, QueueUserAPC2 is used. It does not wake the thread up from io by itself, so additionally CancelSynchronousIo is called.
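          A minimal sketch of the signal plumbing described in the list above: a no-op SIGURG handler installed without SA_RESTART, so that pthread_kill() makes a blocking poll() in the target thread return EINTR. The actual handler in the commit additionally shrinks the net timeout as described; the function names here are illustrative:

            #include <signal.h>
            #include <pthread.h>

            static void apc_wakeup_handler(int)
            {
              // No-op: the handler only exists so the signal interrupts poll() with EINTR.
            }

            static void setup_apc_signal()
            {
              struct sigaction sa= {};
              sa.sa_handler= apc_wakeup_handler;
              sigemptyset(&sa.sa_mask);
              sa.sa_flags= 0;                    // no SA_RESTART: let poll() fail with EINTR
              sigaction(SIGURG, &sa, nullptr);
            }

            // Requestor side: poke the connection thread that owns the target THD.
            static void notify_apc(pthread_t target_thread)
            {
              pthread_kill(target_thread, SIGURG);
            }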

          Second, the threadpool algorithm was done

          https://github.com/MariaDB/server/commit/9eff530067954877a743c34f8cc961840370f0ff

          • This was an implementation that involves one additional unconditional LOCK_thd_kill lock before c->start_io. I liked it because it happens at a safe time, when there are no more immediate requests to process. However, that didn't satisfy Wlad, the main threadpool author.
          • Then I made another implementation, which I call epoch-based; it doesn't add any unconditional locks into the task lifetime, but costs THD 16 bytes and two atomic increments:
            https://github.com/MariaDB/server/commit/eec4ee24be4b445debbd4c233124f631ed186306
            • Probably, the same will be done for a thread-per-connection case with regard to the race just found
              Both algorithms assume that *nix io polls are built similarly: removal from the poll is done atomically, and a failure (a missed shot) is reported as ENOENT in errno.
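            A minimal illustration (not the actual patch) of the "missed shot" detection this relies on: removing an fd that epoll no longer monitors fails with ENOENT:

              #include <sys/epoll.h>
              #include <errno.h>

              // Returns true if the fd was still parked in epoll and we removed it
              // atomically (we "won" the race and own the wakeup); a failure with
              // errno == ENOENT means the worker already picked the connection up,
              // i.e. a missed shot.
              static bool try_remove_from_poll(int epoll_fd, int sock_fd)
              {
                if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, sock_fd, nullptr) == 0)
                  return true;
                return false;  // typically errno == ENOENT here
              }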

          On Windows, on the contrary, we cannot check whether the removal was successful or not. Instead, that can be done with CancelIo, which wakes up the thread/enqueues the io work into the pool. The Windows implementation is adapted to this, but is not yet tested:
          https://github.com/MariaDB/server/commit/a9afa00784265eaa15efacc04fd60ba1c3bd1f91

          ***

          During all this research, a race condition in APC was also found. The thing is, we can't destroy a mutex that was involved in cond_wait (it is reacquired afterwards, and then, oops):
          https://github.com/MariaDB/server/commit/87e60d900ca348d556862819752f1a95a2aa94b6


          nikitamalyavin Nikita Malyavin added a comment -

          The thread-per-connection algorithm was redesigned, see the commit message:
          https://github.com/MariaDB/server/commit/621d04a327caae38b9d518119b49018ee1d50aee

          serg Sergei Golubchik made changes -
          Fix Version/s 10.10 [ 27530 ]
          Fix Version/s 10.9 [ 26905 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 10.11 [ 27614 ]
          Fix Version/s 10.10 [ 27530 ]
          nikitamalyavin Nikita Malyavin made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Vladislav Vaintroub [ wlad ]
          Status In Progress [ 3 ] In Review [ 10002 ]

          nikitamalyavin Nikita Malyavin added a comment -

          wlad, this task is ready for your review. The actual code is now located at bb-10.10-nikita, current head is here.

          wlad Vladislav Vaintroub added a comment - - edited

          Nikita, as before, the locking strategy in the threadpool is questionable and is not safe, unless the APIs we use guarantee that safety on all platforms the server runs on.

          The problem is that pthread_unlock() must not access any structures of the lock after giving up the ownership (allowing another thread to access it), because that structure can be freed by a thread that is doing pthread_mutex_destroy() in parallel. That guarantee is stronger than the mutual exclusion that we usually need from a mutex. And it is not the same as "It shall be safe to destroy an initialized mutex that is unlocked" in the Single Unix Spec, which basically says that "destroying a locked mutex is bad". The guarantee is stronger than that. This guarantee is apparently explicitly not given on Windows, where the documentation for DeleteCriticalSection says the following: "The caller is responsible for ensuring that the critical section object is unowned and the specified CRITICAL_SECTION structure is not being accessed by any critical section functions called by other threads in the process." (emphasis mine). I can see DeleteCriticalSection and LeaveCriticalSection running at the same time here; there is nothing preventing it.

          One way to play it safe is to use locks that are not destroyed until the very end of the server, for example.

          Moreover, this lock and unlock once per every single command seems too expensive for an interface that was supposed to be free. I do not think we should accept work that adds this amount of locking without extensive benchmarking first. I suggest benchmarking a "very multiuser" "DO 1", or protocol-level COM_PING.

          wlad Vladislav Vaintroub made changes -
          Assignee Vladislav Vaintroub [ wlad ] Nikita Malyavin [ nikitamalyavin ]
          wlad Vladislav Vaintroub made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Vladislav Vaintroub [ wlad ]
          wlad Vladislav Vaintroub made changes -
          Assignee Vladislav Vaintroub [ wlad ] Nikita Malyavin [ nikitamalyavin ]
          Status In Review [ 10002 ] Stalled [ 10000 ]

          nikitamalyavin Nikita Malyavin added a comment -

          > The problem is that pthread_unlock() must not access any structures of the lock, after giving up the ownership (allowing another thread to access), because that structure can be freed by a thread that is doing pthread_mutex_destroy() in parallel

          You are right, it's not safe to destroy the mutex in parallel with it being held; POSIX also classifies it as UB.
          The guarantee is given by the ~THD destructor:

            /*
              Other threads may have a lock on LOCK_thd_kill to ensure that this
              THD is not deleted while they access it. The following mutex_lock
              ensures that no one else is using this THD and it's now safe to delete
            */
            mysql_mutex_lock(&LOCK_thd_kill);
            mysql_mutex_unlock(&LOCK_thd_kill);
          

          Thus, if the worker has resumed execution after c->start_io and reaches threadpool_remove_connection() before LOCK_thd_kill is unlocked, it will wait until it is unlocked there. This means that it is safe to unlock LOCK_thd_kill after c->start_io(), if it was locked before.

          A similar guarantee applies when LOCK_thd_kill is released by requestor threads or by killer threads. It is again that code in ~THD, but what makes it work there is THD_list_iterator::iterate (actually it's just server_threads) and Find_THD_variable::operator(): that is, LOCK_thd_kill is locked before server_threads.lock is unlocked.

          On the connection side, you may see that first unlink_thd(thd) is called (that is, server_threads.erase(thd), so server_threads.lock was reacquired), and then LOCK_thd_kill is reacquired in ~THD.
          This means that we cannot acquire LOCK_thd_kill without server_threads.lock in general. But we can do so when we guarantee that the connection is not executing and not in the io pool.

          > More, this lock and unlock once per every single command seems too expensive
          I also didn't like having it this way. That's why I've put significant effort into making this lock happen not before the command, but after it. By the way, it is not a lock per command, but rather a lock per available request data, and only when no requests are left! So it should be quite nice performance-wise. The context switching was costly anyway: we acquired some locks and set mysys_vars with pthread_setspecific (at least twice). See thread_attach and Worker_thread_context::save for a clue. I guess we also check for killed quite regularly, which implies another LOCK_thd_kill acquisition.

          You may also prefer the obstruction-free approach, see the next commit. There, only two atomic increments are the overhead. But one increment is before any command is processed, so it's about 1ns worse. Also, it implies weaker guarantees for the requestor thread, but that should be OK, given that it is a very specific request.

          nikitamalyavin Nikita Malyavin made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Vladislav Vaintroub [ wlad ]
          Status Stalled [ 10000 ] In Review [ 10002 ]
          wlad Vladislav Vaintroub added a comment - - edited

          There is no guarantee given by the THD destructor:
          LeaveCriticalSection and DeleteCriticalSection can still run in parallel, and LeaveCriticalSection can finish after DeleteCriticalSection (I'm using the Windows example because the docs support my point). The lock/unlock "trick" does not work if data of the critical section/mutex is accessed after another thread is allowed to run. Those assumptions were always wrong, yet luckily we did not have many outside callers of LOCK_thd_kill before your patch.

          We also discussed this earlier.

          Here is an example

          Thread 1
            LeaveCriticalSection
            ... part where it allows other thread to run
          Thread 2
            EnterCriticalSection
            LeaveCriticalSection
            DeleteCriticalSection
          Thread 1
            continues LeaveCriticalSection, e.g update statistics (accessing freed memory)
            CRASH
          

          There is no mention in the Microsoft docs that critical section data is not accessed after another thread is allowed to run. There is instead a mention, in the DeleteCriticalSection documentation: "The caller is responsible for ensuring that ... the specified CRITICAL_SECTION structure is not being accessed by any critical section functions called by other threads in the process." Did the lock/unlock trick ensure that? No. DeleteCriticalSection runs at the same time as LeaveCriticalSection, as the example shows. This is what you should not do, according to the MS documentation. This is UB, if you want to call it that.

          Thus, the code in the patch puts the server in danger of a crash during any (i.e. also ordinary) client disconnect, which I think is not acceptable. A possible workaround is not to release but to reuse those critical sections, and really delete them only at the very end of the server process.

          Also, "context switching is costly, thus we make it even more costly" is not a very good argument. If it is costly, and measured to be costly, we need to make it cheaper.

          wlad Vladislav Vaintroub made changes -
          Assignee Vladislav Vaintroub [ wlad ] Nikita Malyavin [ nikitamalyavin ]
          wlad Vladislav Vaintroub made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Vladislav Vaintroub [ wlad ]
          wlad Vladislav Vaintroub made changes -
          Assignee Vladislav Vaintroub [ wlad ] Nikita Malyavin [ nikitamalyavin ]
          Status In Review [ 10002 ] Stalled [ 10000 ]
          nikitamalyavin Nikita Malyavin made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Vladislav Vaintroub [ wlad ]
          Status Stalled [ 10000 ] In Review [ 10002 ]

          nikitamalyavin Nikita Malyavin added a comment -

          I have revised the documentation related to CriticalSection. Indeed, it does not give any clue about the CRITICAL_SECTION structure access order. While I think it's dubious to put it this way (no sane implementation should make any accesses to a mutex structure after the ownership is released), let's see what the developers can say: https://github.com/MicrosoftDocs/sdk-api/pull/1212.

          Anyway, if it turns out that winapi's CriticalSection violates such guarantees, I think we should just stop using it and implement a mutex ourselves, for example one based on a spinlock + WaitOnAddress.
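          A minimal sketch (not a proposal for actual server code) of that kind of mutex: a short spin, falling back to WaitOnAddress, with WakeByAddressSingle on unlock (Windows-only, link with Synchronization.lib; error handling omitted):

            #include <windows.h>
            #include <synchapi.h>  // WaitOnAddress / WakeByAddressSingle

            struct addr_mutex
            {
              LONG locked= 0;      // 0 = free, 1 = held

              void lock()
              {
                for (int spin= 0; spin < 100; spin++)  // short spin phase
                  if (InterlockedCompareExchange(&locked, 1, 0) == 0)
                    return;
                while (InterlockedCompareExchange(&locked, 1, 0) != 0)
                {
                  LONG expected= 1;  // sleep until 'locked' changes from 1
                  WaitOnAddress(&locked, &expected, sizeof(locked), INFINITE);
                }
              }

              void unlock()
              {
                InterlockedExchange(&locked, 0);
                WakeByAddressSingle(&locked);
              }
            };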


          wlad Vladislav Vaintroub added a comment -

          There is no need to implement a new mutex. There are no guarantees in POSIX either; it is just that the MS documentation is more explicit. One can't just free a structure when it can be in use by another function, it is just common sense. That the functions in this case happen to be pthread_mutex_lock/unlock does not eliminate the common sense.


          nikitamalyavin Nikita Malyavin added a comment -

          POSIX does:
          > A mutex can be destroyed immediately after it is unlocked.


          wlad Vladislav Vaintroub added a comment -

          The current version of the pthread_cond_destroy Single Unix Spec documentation says that the mutex should not be referenced by other threads, and it removes the older code example suggesting that the code you were using is safe.


          wlad Vladislav Vaintroub added a comment -

          So, nikitamalyavin, I fixed the Windows and threadpool implementations of the "wake" to address the following limitations of your previous implementation:

          1. Windows, thread-per-connection - rather than an unconventional QueueUserAPC + self-pipe trick with its own implementation of socketpair, use CancelIoEx to cancel an IO, which is what was originally intended (a minimal sketch follows after this list). Our own socketpair is costly in that it needs 2 extra ports in the ephemeral range on the server, which reduces the maximum possible number of connections by a factor of 3.
          2. Threadpool - move the common logic into threadpool_common, introduce a new TP_STATE TP_INTERRUPTED, to mark that the connection just needs to check for the APC, rather than reading and executing a query. Unfortunately, Linux needs yet another workaround, because unlike on Windows, where we can cancel IO, or elsewhere, where we can disable monitoring of an fd safely, Linux offers nothing like that for epoll, and attempts to fix it with EPOLL_CTL_CLOSE were abandoned.
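          A minimal sketch (illustrative, not the actual patch) of the CancelIoEx-based wake from point 1: cancelling the pending IO on the connection socket makes the blocked read on the target thread complete with ERROR_OPERATION_ABORTED, giving it a chance to process queued APCs before re-issuing the read:

            #include <winsock2.h>
            #include <windows.h>

            // Requestor side: cancel all pending IO issued on the connection's socket.
            // Unlike CancelIo, CancelIoEx may be called from a different thread than
            // the one that started the IO.
            static void wake_connection(SOCKET connection_socket)
            {
              CancelIoEx((HANDLE) connection_socket, /* lpOverlapped= */ NULL);
            }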

          With that, the implementation of "wake" looks more or less OK to me. All this work is now in bb-10.10-MDEV-16440.

          There needs to be much more investment into testing, because there are a lot of subtle scenarios that might go wrong.

          Apparently, interesting cases that are not covered by your testing just yet are:

          • interrupting named pipe connections on Windows
          • Interrupting long-running queries, in both thread-per-connection and pool-of-threads modes, such as:
            SELECT SLEEP(a lot)
            SELECT SLEEP(a lot) done for a remote table (federated and/or connect)
            a long operation (reading binlogs) in replication scenarios
            being stuck in some kind of lock: a user lock (GET_LOCK), TABLE LOCK, FTWRL
          • You also need a basic test for a non-debug build, as it is tested much more often on buildbots.
          wlad Vladislav Vaintroub made changes -
          Assignee Vladislav Vaintroub [ wlad ] Nikita Malyavin [ nikitamalyavin ]
          Status In Review [ 10002 ] Stalled [ 10000 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 10.12 [ 28320 ]
          Fix Version/s 10.11 [ 27614 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.1 [ 28549 ]
          Fix Version/s 11.0 [ 28320 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 11.2 [ 28603 ]
          Fix Version/s 11.1 [ 28549 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 11.3 [ 28565 ]
          Fix Version/s 11.2 [ 28603 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.4 [ 29301 ]
          Fix Version/s 11.3 [ 28565 ]
          serg Sergei Golubchik made changes -
          serg Sergei Golubchik made changes -
          Fix Version/s 11.5 [ 29506 ]
          Fix Version/s 11.4 [ 29301 ]
          julien.fritsch Julien Fritsch made changes -
          Issue Type Task [ 3 ] New Feature [ 2 ]
          julien.fritsch Julien Fritsch made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.6 [ 29515 ]
          Fix Version/s 11.5 [ 29506 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          serg Sergei Golubchik made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.7 [ 29815 ]
          Fix Version/s 11.6 [ 29515 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Sergei Golubchik [ serg ]
          mariadb-jira-automation Jira Automation (IT) made changes -
          Zendesk Related Tickets 201777
          Zendesk active tickets 201777
          mariadb-jira-automation Jira Automation (IT) made changes -
          Zendesk Related Tickets 201777 201777 204705
          Zendesk active tickets 201777 201777 204705
          mariadb-jira-automation Jira Automation (IT) made changes -
          Zendesk active tickets 201777 204705 201777
          serg Sergei Golubchik made changes -
          Status Stalled [ 10000 ] In Review [ 10002 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.8 [ 29921 ]
          Fix Version/s 11.7 [ 29815 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.9 [ 29945 ]
          Fix Version/s 11.8 [ 29921 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 12.1 [ 29992 ]
          Fix Version/s 12.0 [ 29945 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 12.3 [ 30147 ]
          Fix Version/s 12.1 [ 29992 ]

          People

            serg Sergei Golubchik
            serg Sergei Golubchik
            Votes: 1
            Watchers: 10

