MariaDB Server / MDEV-16440

merge 5.7 P_S sysvars instrumentation and tables

Activity

          serg Sergei Golubchik created issue -
          serg Sergei Golubchik made changes -
          Field Original Value New Value
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 10.5 [ 23123 ]
          Fix Version/s 10.4 [ 22408 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Labels RM_106_CANDIDATE
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 10.6 [ 24028 ]
          Fix Version/s 10.5 [ 23123 ]
          serg Sergei Golubchik made changes -
          Labels RM_106_CANDIDATE
          serg Sergei Golubchik made changes -
          Assignee Sergei Golubchik [ serg ] Oleksandr Byelkin [ sanja ]
          serg Sergei Golubchik made changes -
          Rank Ranked higher
          sanja Oleksandr Byelkin made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          sanja Oleksandr Byelkin added a comment - https://dev.mysql.com/doc/refman/5.7/en/performance-schema-system-variable-tables.html
          sanja Oleksandr Byelkin made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          serg Sergei Golubchik made changes -
          Rank Ranked higher
          serg Sergei Golubchik made changes -
          julien.fritsch Julien Fritsch made changes -
          Priority Major [ 3 ] Critical [ 2 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Component/s Performance Schema [ 15627 ]
          serg Sergei Golubchik made changes -
          serg Sergei Golubchik made changes -
          julien.fritsch Julien Fritsch made changes -
          Priority Critical [ 2 ] Blocker [ 1 ]
          serg Sergei Golubchik made changes -
          Priority Blocker [ 1 ] Critical [ 2 ]
          julien.fritsch Julien Fritsch made changes -
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 10.7 [ 24805 ]
          Fix Version/s 10.6 [ 24028 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Labels Compatibility
          ralf.gebhardt Ralf Gebhardt made changes -
          Due Date 2021-09-14

          sanja Oleksandr Byelkin added a comment -

          I have pushed unfinished work to bb-10.6-MDEV-16440

          The task should be rewritten using the APC interface (see THD::apc_target & Co).
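          A minimal sketch, from memory, of how the existing APC interface is meant to be used (the Apc_call / make_apc_call shape from my_apc.h, as used by SHOW EXPLAIN; the exact signatures may differ slightly, and Get_sysvars_request / request_sysvars are only illustrative names):

            // Requestor side: package the work as an Apc_call and ask the target THD
            // to run it in its own context the next time it processes APC requests.
            class Get_sysvars_request : public Apc_target::Apc_call
            {
            public:
              void call_in_target_thread() override
              {
                // Runs inside the target thread: safe to read its system variables here.
              }
            };

            bool request_sysvars(THD *caller_thd, THD *target_thd)
            {
              Get_sysvars_request req;
              bool timed_out= false;
              // Blocks until the target runs the call or the timeout expires.
              return target_thd->apc_target.make_apc_call(caller_thd, &req, 1 /* sec */, &timed_out);
            }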

          sanja Oleksandr Byelkin made changes -
          Assignee Oleksandr Byelkin [ sanja ] Nikita Malyavin [ nikitamalyavin ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          serg Sergei Golubchik made changes -
          serg Sergei Golubchik made changes -
          Comment [ A comment with security level 'Developers' was removed. ]
          nikitamalyavin Nikita Malyavin made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]

          nikitamalyavin Nikita Malyavin added a comment -

          So far, the solution was polished and patches d1a925b...53478dc were reviewed by Sergei.
          However, the APC issue with inaccessible sleeping threads is still unresolved. It turned out to be non-trivial, because LOCK_thd_kill can't be held while retrieving system variables.

          We now have two competing solutions:
          1. Mine: add a THD::awake flag, protected by a mutex and condvar. Set awake = true before a statement runs, and awake = false after (a sketch of that target-thread side follows after this list). Then, on the requestor side, the usage:

          if (!thd->apc_target.is_enabled())
          {
            lock(&thd->LOCK_thd_awake);
            thd->COND_awake.wait([thd]{ return !thd->awake; });
            unlock(&thd->LOCK_thd_kill);

            (this->*func)(param, update_plugin);

            unlock(&thd->LOCK_thd_awake);
          }
          

          2. Wlad's: add queue_apc(void (*f)(THD*)) into scheduler_functions, and make an implementation on the scheduler side to safely process APCs while the thd (connection) is inactive.
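          For completeness, a minimal sketch of the target-thread side that solution 1 assumes (THD::awake, LOCK_thd_awake and COND_awake are the hypothetical new members; as noted in a later comment, this approach was eventually rejected as racy):

            // Target (connection) thread: mark the THD as awake for the duration of
            // statement processing, so that requestors only touch it while it sleeps.
            void run_statement_with_awake_flag(THD *thd)
            {
              mysql_mutex_lock(&thd->LOCK_thd_awake);
              thd->awake= true;                        // a statement is about to run
              mysql_mutex_unlock(&thd->LOCK_thd_awake);

              process_one_statement(thd);              // placeholder for the real dispatch

              mysql_mutex_lock(&thd->LOCK_thd_awake);
              thd->awake= false;                       // idle again: requestors may proceed
              mysql_cond_broadcast(&thd->COND_awake);  // wake up any waiting requestor
              mysql_mutex_unlock(&thd->LOCK_thd_awake);
            }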

          serg Sergei Golubchik made changes -
          Priority Critical [ 2 ] Major [ 3 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Due Date 2021-09-14
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 10.8 [ 26121 ]
          Fix Version/s 10.7 [ 24805 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          serg Sergei Golubchik made changes -
          Workflow MariaDB v3 [ 87753 ] MariaDB v4 [ 131693 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 10.9 [ 26905 ]
          Fix Version/s 10.8 [ 26121 ]
          serg Sergei Golubchik made changes -
          Priority Major [ 3 ] Critical [ 2 ]
          nikitamalyavin Nikita Malyavin added a comment - - edited

          Unfortunately, the first solution will not work: there can be a race condition. Guarding statement processing with some lock would also be an incorrect solution (there's a deadlock possibility).

          So the only known way is to schedule the APC on the target thread's side.
          There are two problems:
          1. In thread-per-connection mode, the thread can be waiting for socket data.

          2. In tpool mode, the THD can be inactive.

          We have to resolve both problems; this is why scheduler_functions is very convenient here.

          While the tpool solution is more or less straightforward (we can just send the task to the work queue dedicated to a particular THD), the thread-per-connection mode problem requires some thinking.

          Usually a THD is poked out of waiting by destroying the socket: this wakes up the thread, but the connection will be refused. It is done this way because the only reason for poking the thread was to gracefully KILL it.

          Waiting on recv()/poll() can be interrupted by a signal on *nix, and by calling WSACancelBlockingCall/CancelIoEx on Windows. We can miss the proper timing for calling it, though, and then the cancellation will be lost.

          A. We can temporarily set the timeout to some negligible value during signal handling, and restore it after the APC is processed. But what if another APC request comes after the APC is processed and before the timeout is reset? We can then first reset the timeout and then process the APC; this should work.

          B. Another solution is to hack into the poll() used to wait for input events in vio_wait (see vio_read: it actually never waits on recv(), and ought to always wait in vio_wait, i.e. poll). We can add a function to the Vio API, like setup_wakeup(socketfd* sock, callback*), which adds an additional socket to wait on in the mentioned poll() call. I think setting up the callback can also be a good idea: then vio_read() behavior will not be changed.
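          A minimal sketch of option B, assuming a hypothetical wakeup channel (e.g. one end of a pipe) registered next to the connection socket; setup_wakeup and the names below are illustrative, not the actual Vio API:

            #include <poll.h>
            #include <unistd.h>

            struct wakeup_channel
            {
              int read_fd;   // added to the poll set alongside the connection socket
              int write_fd;  // a requestor writes one byte here to wake the thread up
            };

            static int vio_wait_with_wakeup(int sock_fd, const wakeup_channel &wc, int timeout_ms)
            {
              struct pollfd fds[2]= {
                { sock_fd,    POLLIN, 0 },
                { wc.read_fd, POLLIN, 0 },
              };
              int rc= poll(fds, 2, timeout_ms);
              if (rc > 0 && (fds[1].revents & POLLIN))
              {
                char c;
                (void) read(wc.read_fd, &c, 1);  // drain the wakeup byte
                // the caller would process pending APCs here before retrying the socket wait
              }
              return rc;
            }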

          EDIT: we have to extend the Vio and Net APIs in both cases.

          nikitamalyavin Nikita Malyavin added a comment - - edited

          So far, both algorithms are ready.

          First, the thread-per-connection algorithm has been implemented:

          https://github.com/MariaDB/server/commit/0cc9df282a51ff6d0c73c0ec23e3bb7e3299e4ff
          The key points there are:

          • Signal handler setup in setup_connection_thread_globals
          • Function thread_scheduler_notify_apc() which
            • On Linux, sends the signal to the processing thread (pthread_kill). For now, the signal chosen is SIGURG. The signal handler sets the net timeout to a minimal value (1ms): if the signal wakes the thread up from poll (see vio_io_wait), then the APC will be processed on retry; if the signal arrived before the wait, the thread will wake up by timeout very soon (a minimal sketch of the signal plumbing follows after this list).
              • While I was writing that, I found a small window where the signal can still have no effect: if the value was already read from vio->read_timeout but poll was not yet called. I think some retry loop may be required there to make sure the handshake happens.
            • On Windows, QueueUserAPC2 is used. It does not wake the thread up from io by itself, so additionally CancelSynchronousIo is called.
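          A minimal sketch of the signal plumbing described in the list above: a no-op SIGURG handler installed without SA_RESTART, so that pthread_kill() makes a blocking poll() in the target thread return EINTR. The actual handler in the commit additionally shrinks the net timeout as described; the function names here are illustrative:

            #include <signal.h>
            #include <pthread.h>

            static void apc_wakeup_handler(int)
            {
              // No-op: the handler only exists so the signal interrupts poll() with EINTR.
            }

            static void setup_apc_signal()
            {
              struct sigaction sa= {};
              sa.sa_handler= apc_wakeup_handler;
              sigemptyset(&sa.sa_mask);
              sa.sa_flags= 0;                    // no SA_RESTART: let poll() fail with EINTR
              sigaction(SIGURG, &sa, nullptr);
            }

            // Requestor side: poke the connection thread that owns the target THD.
            static void notify_apc(pthread_t target_thread)
            {
              pthread_kill(target_thread, SIGURG);
            }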

          Second, the threadpool algorithm was done

          https://github.com/MariaDB/server/commit/9eff530067954877a743c34f8cc961840370f0ff

          • This was an implementation that involves one additional unconditional LOCK_thd_kill lock before c->start_io. I liked it because it happens at a safe time, when there are no more immediate requests to process. However, that didn't satisfy Wlad, the main threadpool author.
          • Then I made another implementation, which I call epoch-based; it doesn't add any unconditional locks into the task lifetime, but costs THD 16 bytes and two atomic increments:
            https://github.com/MariaDB/server/commit/eec4ee24be4b445debbd4c233124f631ed186306
            • Probably, the same will be done for a thread-per-connection case with regard to the race just found
              Both algorithms assume that *nix io polls are built similarly: removal from the poll is done atomically, and a failure (a missed shot) is reported as ENOENT in errno.
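            A minimal illustration (not the actual patch) of the "missed shot" detection this relies on: removing an fd that epoll no longer monitors fails with ENOENT:

              #include <sys/epoll.h>
              #include <errno.h>

              // Returns true if the fd was still parked in epoll and we removed it
              // atomically (we "won" the race and own the wakeup); a failure with
              // errno == ENOENT means the worker already picked the connection up,
              // i.e. a missed shot.
              static bool try_remove_from_poll(int epoll_fd, int sock_fd)
              {
                if (epoll_ctl(epoll_fd, EPOLL_CTL_DEL, sock_fd, nullptr) == 0)
                  return true;
                return false;  // typically errno == ENOENT here
              }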

          On Windows, on the contrary, we cannot check whether the removal was successful or not. Instead, that can be done with CancelIo, which wakes up the thread/enqueues the io work into the pool. The Windows implementation is adapted to this, but is not yet tested:
          https://github.com/MariaDB/server/commit/a9afa00784265eaa15efacc04fd60ba1c3bd1f91

          ***

          During all this research, a race condition in APC was also found. The thing is, we can't destroy a mutex that was involved in cond_wait (it is reacquired afterwards, and then, oops):
          https://github.com/MariaDB/server/commit/87e60d900ca348d556862819752f1a95a2aa94b6


          nikitamalyavin Nikita Malyavin added a comment -

          The thread-per-connection algorithm was redesigned, see the commit message:
          https://github.com/MariaDB/server/commit/621d04a327caae38b9d518119b49018ee1d50aee

          serg Sergei Golubchik made changes -
          Fix Version/s 10.10 [ 27530 ]
          Fix Version/s 10.9 [ 26905 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 10.11 [ 27614 ]
          Fix Version/s 10.10 [ 27530 ]
          nikitamalyavin Nikita Malyavin made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Vladislav Vaintroub [ wlad ]
          Status In Progress [ 3 ] In Review [ 10002 ]

          nikitamalyavin Nikita Malyavin added a comment -

          wlad, this task is ready for your review. The actual code is now located at bb-10.10-nikita, current head is here.

          wlad Vladislav Vaintroub added a comment - - edited

          Nikita, as before, the locking strategy in the threadpool is questionable and is not safe, unless the APIs we use guarantee that safety on all platforms the server runs on.

          The problem is that pthread_unlock() must not access any structures of the lock after giving up the ownership (allowing another thread to access it), because that structure can be freed by a thread that is doing pthread_mutex_destroy() in parallel. That guarantee is stronger than the mutual exclusion that we usually need from a mutex. And it is not the same as "It shall be safe to destroy an initialized mutex that is unlocked" in the Single Unix Spec, which basically says that "destroying a locked mutex is bad". The guarantee is stronger than that. This guarantee is apparently explicitly not given on Windows, where the documentation for DeleteCriticalSection says the following: "The caller is responsible for ensuring that the critical section object is unowned and the specified CRITICAL_SECTION structure is not being accessed by any critical section functions called by other threads in the process." (emphasis mine). I can see DeleteCriticalSection and LeaveCriticalSection running at the same time here; there is nothing preventing it.

          One way to play it safe is to use locks that are not destroyed until the very end of the server, for example.

          Moreover, this lock and unlock once per every single command seems too expensive for an interface that was supposed to be free. I do not think we should accept work that adds this amount of locking without extensive benchmarking first. I suggest benchmarking a "very multiuser" "DO 1", or protocol-level COM_PING.

          wlad Vladislav Vaintroub made changes -
          Assignee Vladislav Vaintroub [ wlad ] Nikita Malyavin [ nikitamalyavin ]
          wlad Vladislav Vaintroub made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Vladislav Vaintroub [ wlad ]
          wlad Vladislav Vaintroub made changes -
          Assignee Vladislav Vaintroub [ wlad ] Nikita Malyavin [ nikitamalyavin ]
          Status In Review [ 10002 ] Stalled [ 10000 ]

          nikitamalyavin Nikita Malyavin added a comment -

          > The problem is that pthread_unlock() must not access any structures of the lock, after giving up the ownership (allowing another thread to access), because that structure can be freed by a thread that is doing pthread_mutex_destroy() in parallel

          You are right, it's not safe to destroy the mutex in parallel with it being held; POSIX also classifies it as UB.
          The guarantee is given by the ~THD destructor:

            /*
              Other threads may have a lock on LOCK_thd_kill to ensure that this
              THD is not deleted while they access it. The following mutex_lock
              ensures that no one else is using this THD and it's now safe to delete
            */
            mysql_mutex_lock(&LOCK_thd_kill);
            mysql_mutex_unlock(&LOCK_thd_kill);
          

          Thus, if the worker has resumed execution after c->start_io and reaches threadpool_remove_connection() before LOCK_thd_kill is unlocked, it will wait until it is unlocked there. This means that it is safe to unlock LOCK_thd_kill after c->start_io(), if it was locked before.

          A similar guarantee applies when LOCK_thd_kill is released by requestor threads or by killer threads. It is again that code in ~THD, but what makes it work there is THD_list_iterator::iterate (actually it's just server_threads) and Find_THD_variable::operator(): that is, LOCK_thd_kill is locked before server_threads.lock is unlocked.

          On the connection side, you may see that first unlink_thd(thd) is called (that is, server_threads.erase(thd), so server_threads.lock was reacquired), and then LOCK_thd_kill is reacquired in ~THD.
          This means that we cannot acquire LOCK_thd_kill without server_threads.lock in general. But we can do so when we guarantee that the connection is not executing and not in the io pool.

          > More, this lock and unlock once per every single command seems too expensive
          I also didn't like having it this way. That's why I've put significant effort into making this lock happen not before the command, but after it. By the way, it is not a lock per command, but rather a lock per available request data, and only when no requests are left! So it should be quite nice performance-wise. The context switching was costly anyway: we acquired some locks and set mysys_vars with pthread_setspecific (at least twice). See thread_attach and Worker_thread_context::save for a clue. I guess we also check for killed quite regularly, which implies another LOCK_thd_kill acquisition.

          You may also prefer the obstruction-free approach, see the next commit. There, only two atomic increments are the overhead. But one increment is before any command is processed, so it's about 1ns worse. Also, it implies weaker guarantees for the requestor thread, but that should be OK, given that it is a very specific request.

          nikitamalyavin Nikita Malyavin made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Vladislav Vaintroub [ wlad ]
          Status Stalled [ 10000 ] In Review [ 10002 ]
          wlad Vladislav Vaintroub added a comment - - edited

          There is no guarantee given by the THD destructor:
          LeaveCriticalSection and DeleteCriticalSection can still run in parallel, and LeaveCriticalSection can finish after DeleteCriticalSection (I'm using the Windows example because the docs support my point). The lock/unlock "trick" does not work if data of the critical section/mutex is accessed after another thread is allowed to run. Those assumptions were always wrong, yet luckily we did not have many outside callers of LOCK_thd_kill before your patch.

          We also discussed this earlier.

          Here is an example

          Thread 1
            LeaveCriticalSection
            ... part where it allows other thread to run
          Thread 2
            EnterCriticalSection
            LeaveCriticalSection
            DeleteCriticalSection
          Thread 1
            continues LeaveCriticalSection, e.g update statistics (accessing freed memory)
            CRASH
          

          There is no mention in the Microsoft docs that critical section data is not accessed after another thread is allowed to run. There is instead a mention, in the DeleteCriticalSection documentation: "The caller is responsible for ensuring that ... the specified CRITICAL_SECTION structure is not being accessed by any critical section functions called by other threads in the process." Did the lock/unlock trick ensure that? No. DeleteCriticalSection runs at the same time as LeaveCriticalSection, as the example shows. This is what you should not do, according to the MS documentation. This is UB, if you want to call it that.

          Thus, the code in the patch puts the server in danger of a crash during any (i.e. also ordinary) client disconnect, which I think is not acceptable. A possible workaround is not to release but to reuse those critical sections, and really delete them only at the very end of the server process.

          Also, "context switching is costly, thus we make it even more costly" is not a very good argument. If it is costly, and measured to be costly, we need to make it cheaper.

          wlad Vladislav Vaintroub made changes -
          Assignee Vladislav Vaintroub [ wlad ] Nikita Malyavin [ nikitamalyavin ]
          wlad Vladislav Vaintroub made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Vladislav Vaintroub [ wlad ]
          wlad Vladislav Vaintroub made changes -
          Assignee Vladislav Vaintroub [ wlad ] Nikita Malyavin [ nikitamalyavin ]
          Status In Review [ 10002 ] Stalled [ 10000 ]
          nikitamalyavin Nikita Malyavin made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Vladislav Vaintroub [ wlad ]
          Status Stalled [ 10000 ] In Review [ 10002 ]

          nikitamalyavin Nikita Malyavin added a comment -

          I have revised the documentation related to CriticalSection. Indeed, it does not give any clue about the CRITICAL_SECTION structure access order. While I think it's dubious to put it this way (no sane implementation should make any accesses to a mutex structure after the ownership is released), let's see what the developers can say: https://github.com/MicrosoftDocs/sdk-api/pull/1212.

          Anyway, if it turns out that winapi's CriticalSection violates such guarantees, I think we should just stop using it and implement a mutex ourselves, for example one based on a spinlock + WaitOnAddress.
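          A minimal sketch (not a proposal for actual server code) of that kind of mutex: a short spin, falling back to WaitOnAddress, with WakeByAddressSingle on unlock (Windows-only, link with Synchronization.lib; error handling omitted):

            #include <windows.h>
            #include <synchapi.h>  // WaitOnAddress / WakeByAddressSingle

            struct addr_mutex
            {
              LONG locked= 0;      // 0 = free, 1 = held

              void lock()
              {
                for (int spin= 0; spin < 100; spin++)  // short spin phase
                  if (InterlockedCompareExchange(&locked, 1, 0) == 0)
                    return;
                while (InterlockedCompareExchange(&locked, 1, 0) != 0)
                {
                  LONG expected= 1;  // sleep until 'locked' changes from 1
                  WaitOnAddress(&locked, &expected, sizeof(locked), INFINITE);
                }
              }

              void unlock()
              {
                InterlockedExchange(&locked, 0);
                WakeByAddressSingle(&locked);
              }
            };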


          wlad Vladislav Vaintroub added a comment -

          There is no need to implement a new mutex. There are no guarantees in POSIX either; it is just that the MS documentation is more explicit. One can't just free a structure when it can be in use by another function, it is just common sense. That the functions in this case happen to be pthread_mutex_lock/unlock does not eliminate the common sense.


          nikitamalyavin Nikita Malyavin added a comment -

          POSIX does:
          > A mutex can be destroyed immediately after it is unlocked.


          wlad Vladislav Vaintroub added a comment -

          The current version of the pthread_cond_destroy Single Unix Spec documentation says that the mutex should not be referenced by other threads, and it removes the older code example suggesting that the code you were using is safe.


          wlad Vladislav Vaintroub added a comment -

          So, nikitamalyavin, I fixed the Windows and threadpool implementations of the "wake" to address the following limitations of your previous implementation:

          1. Windows, thread-per-connection - rather than an unconventional QueueUserAPC + self-pipe trick with its own implementation of socketpair, use CancelIoEx to cancel an IO, which is what was originally intended (a minimal sketch follows after this list). Our own socketpair is costly in that it needs 2 extra ports in the ephemeral range on the server, which reduces the maximum possible number of connections by a factor of 3.
          2. Threadpool - move the common logic into threadpool_common, introduce a new TP_STATE TP_INTERRUPTED, to mark that the connection just needs to check for the APC, rather than reading and executing a query. Unfortunately, Linux needs yet another workaround, because unlike on Windows, where we can cancel IO, or elsewhere, where we can disable monitoring of an fd safely, Linux offers nothing like that for epoll, and attempts to fix it with EPOLL_CTL_CLOSE were abandoned.
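          A minimal sketch (illustrative, not the actual patch) of the CancelIoEx-based wake from point 1: cancelling the pending IO on the connection socket makes the blocked read on the target thread complete with ERROR_OPERATION_ABORTED, giving it a chance to process queued APCs before re-issuing the read:

            #include <winsock2.h>
            #include <windows.h>

            // Requestor side: cancel all pending IO issued on the connection's socket.
            // Unlike CancelIo, CancelIoEx may be called from a different thread than
            // the one that started the IO.
            static void wake_connection(SOCKET connection_socket)
            {
              CancelIoEx((HANDLE) connection_socket, /* lpOverlapped= */ NULL);
            }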

          With that, the implementation of "wake" looks more or less OK to me. All this work is now in bb-10.10-MDEV-16440.

          There needs to be much more investment into testing, because there are a lot of subtle scenarios that might go wrong.

          Apparently, interesting cases that are not covered by your testing just yet are:

          • interrupting named pipe connections on Windows
          • Interrupting long-running queries, in both thread-per-connection and pool-of-threads modes, such as:
            SELECT SLEEP(a lot)
            SELECT SLEEP(a lot) done for a remote table (federated and/or connect)
            a long operation (reading binlogs) in replication scenarios
            being stuck in some kind of lock: a user lock (GET_LOCK), TABLE LOCK, FTWRL
          • You also need a basic test for a non-debug build, as it is tested much more often on buildbots.
          wlad Vladislav Vaintroub made changes -
          Assignee Vladislav Vaintroub [ wlad ] Nikita Malyavin [ nikitamalyavin ]
          Status In Review [ 10002 ] Stalled [ 10000 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 10.12 [ 28320 ]
          Fix Version/s 10.11 [ 27614 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.1 [ 28549 ]
          Fix Version/s 11.0 [ 28320 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 11.2 [ 28603 ]
          Fix Version/s 11.1 [ 28549 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 11.3 [ 28565 ]
          Fix Version/s 11.2 [ 28603 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.4 [ 29301 ]
          Fix Version/s 11.3 [ 28565 ]
          serg Sergei Golubchik made changes -
          serg Sergei Golubchik made changes -
          Fix Version/s 11.5 [ 29506 ]
          Fix Version/s 11.4 [ 29301 ]
          julien.fritsch Julien Fritsch made changes -
          Issue Type Task [ 3 ] New Feature [ 2 ]
          julien.fritsch Julien Fritsch made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.6 [ 29515 ]
          Fix Version/s 11.5 [ 29506 ]
          nikitamalyavin Nikita Malyavin made changes -
          Status Stalled [ 10000 ] In Progress [ 3 ]
          serg Sergei Golubchik made changes -
          Status In Progress [ 3 ] Stalled [ 10000 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.7 [ 29815 ]
          Fix Version/s 11.6 [ 29515 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Assignee Nikita Malyavin [ nikitamalyavin ] Sergei Golubchik [ serg ]
          mariadb-jira-automation Jira Automation (IT) made changes -
          Zendesk Related Tickets 201777
          Zendesk active tickets 201777
          mariadb-jira-automation Jira Automation (IT) made changes -
          Zendesk Related Tickets 201777 201777 204705
          Zendesk active tickets 201777 201777 204705
          mariadb-jira-automation Jira Automation (IT) made changes -
          Zendesk active tickets 201777 204705 201777
          serg Sergei Golubchik made changes -
          Status Stalled [ 10000 ] In Review [ 10002 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.8 [ 29921 ]
          Fix Version/s 11.7 [ 29815 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 11.9 [ 29945 ]
          Fix Version/s 11.8 [ 29921 ]
          serg Sergei Golubchik made changes -
          Fix Version/s 12.1 [ 29992 ]
          Fix Version/s 12.0 [ 29945 ]
          ralf.gebhardt Ralf Gebhardt made changes -
          Fix Version/s 12.3 [ 30147 ]
          Fix Version/s 12.1 [ 29992 ]

          People

            serg Sergei Golubchik
            serg Sergei Golubchik
            Votes: 1
            Watchers: 10

