The idea is to initiate the flush to disk asynchronously, and to delay waiting for the transaction to become persistent for as long as possible. In the best case there is no waiting at all in the client (foreground) thread.
If pool-of-threads is used, do_command() can complete without sending the "OK packet" to the client, freeing the worker thread to serve another client. The rest of the command is done later, once the data is persistent.
If thread-per-connection is used, the benefits are somewhat smaller, but we can still delay waiting for durability until the very end of do_command (or dispatch_command).
This is an improvement upon the existing group commit logic.
Threads do not need to wait for disk IO or for the flush to complete, and can do other work required to wrap up the query
(closing tables, etc). If we're lucky, by the time the foreground thread has finished the query, the InnoDB data is already flushed
to the redo log and there is no waiting at all. If we're less lucky, the time spent waiting is still reduced in one-thread-per-connection. In the threadpool, the worker thread can handle a different client instead of waiting for the redo log flush.
Introduce THD::async_state to track operations that must be finished before the server sends a reply to the client. Since we promise durability, modified transaction data needs to be durably stored before the client receives an OK reply.
Change log_write_up_to() so that it supports asynchronous waiting (i.e. initiate the write and flush rather than wait for completion). The underlying group_commit_lock needs to support on-completion callbacks (which change the THD::async_state above), in addition to locks. The group commit logic needs to change a little, but the idea remains that there is one group commit leader which does the operation synchronously. Other threads that call log_write_up_to do not wait, and completion is signalled to the server via the callback.
Sometimes a blocking wait is still required, so log_write_up_to allows either blocking or non-blocking waits.
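The leader/callback split described above can be sketched roughly as follows. This is a simplified, single-threaded model and not the actual group_commit_lock: the class name GroupCommitSketch, its Status values and both method signatures are hypothetical, and real code would guard the state with a mutex and offer a blocking mode as well.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model of a group commit lock with completion callbacks.
// Not thread safe: real code would protect this state with a mutex.
class GroupCommitSketch {
public:
  enum class Status { ALREADY_DURABLE, LEADER, CALLBACK_QUEUED };
  using Callback = std::function<void()>;

  // Request durability up to `lsn`.
  // ALREADY_DURABLE: nothing to do, the callback is not used.
  // LEADER: the caller must write and flush, then call release().
  // CALLBACK_QUEUED: the caller continues, on_durable runs later.
  Status acquire(std::uint64_t lsn, Callback on_durable) {
    if (lsn <= durable_lsn_) return Status::ALREADY_DURABLE;
    if (!leader_active_) {
      leader_active_ = true;
      return Status::LEADER;
    }
    waiters_.push_back(Waiter{lsn, std::move(on_durable)});
    return Status::CALLBACK_QUEUED;
  }

  // Called by the leader after flushing up to `flushed_lsn`.
  // Runs the callbacks of all satisfied waiters and returns the highest
  // LSN still waiting (0 if none), so the caller can run another round.
  std::uint64_t release(std::uint64_t flushed_lsn) {
    durable_lsn_ = std::max(durable_lsn_, flushed_lsn);
    leader_active_ = false;
    std::vector<Waiter> current;
    current.swap(waiters_);  // callbacks may safely call acquire() again
    std::uint64_t max_pending = 0;
    for (Waiter &w : current) {
      if (w.lsn <= durable_lsn_) {
        if (w.on_durable) w.on_durable();
      } else {
        max_pending = std::max(max_pending, w.lsn);
        waiters_.push_back(std::move(w));
      }
    }
    return max_pending;
  }

private:
  struct Waiter {
    std::uint64_t lsn;
    Callback on_durable;
  };
  std::uint64_t durable_lsn_ = 0;
  bool leader_active_ = false;
  std::vector<Waiter> waiters_;
};
```

In this model only the thread that gets LEADER ever performs IO. Every other caller either returns immediately or leaves a callback behind, which is where the THD::async_state counter would be decremented.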
The server can block when it writes the response to the client
(net_real_write waits for the async operation count to go down to 0).
If the async operation counter != 0 at the end of dispatch_command(), the current execution state
is saved and the THD is "suspended", which means that dispatch_command()/do_command()
finishes before net_real_write. Since do_command() finishes, the worker thread in the pool is able to do something else, e.g. handle another client's query.
When the InnoDB group commit leader (the background thread that does log_write_up_to) flushes the corresponding LSN to disk, it decrements the async operation count for the THD. If the count goes to zero and the leader finds that the THD was suspended, it resumes the THD by submitting a corresponding task to the threadpool. This task continues do_command() from where it was suspended, which is basically just net_real_write and some end-of-query cleanup.
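The handoff between the suspending worker and the completing leader has to be race free: exactly one side must end up responsible for sending the reply. A minimal sketch of one way to get that, assuming the pending-operation count and a "suspended" flag are packed into a single atomic word, is shown here. The class AsyncStateSketch and all its method names are hypothetical and are not taken from the actual THD::async_state.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: operation counter and suspended flag in one word,
// so that "count reached zero" and "THD is suspended" are observed together.
class AsyncStateSketch {
  static constexpr std::uint32_t SUSPENDED = 1u << 31;
  std::atomic<std::uint32_t> word_{0};

public:
  // Foreground thread: an async operation (e.g. a redo log flush) was started.
  void begin_op() { word_.fetch_add(1); }

  // Completion callback: one operation finished.
  // Returns true if this was the last operation of a suspended THD,
  // in which case the caller is responsible for resuming it.
  bool end_op() {
    std::uint32_t prev = word_.fetch_sub(1);
    return prev == (SUSPENDED | 1u);
  }

  // Worker thread at the end of the command.
  // Returns true if operations are pending and the THD is now suspended;
  // returns false if nothing is pending and the reply can be sent right away.
  bool try_suspend() {
    std::uint32_t cur = word_.load();
    while (cur != 0) {
      // On failure `cur` is reloaded, so the zero check is repeated.
      if (word_.compare_exchange_weak(cur, cur | SUSPENDED)) return true;
    }
    return false;
  }

  // Resuming task: clear the flag before continuing the command.
  void clear_suspended() { word_.fetch_and(~SUSPENDED); }

  std::uint32_t pending() const { return word_.load() & ~SUSPENDED; }
};
```

With this layout, either try_suspend() sees a zero count and the worker sends the reply itself, or the last end_op() sees the flag and returns true so that its caller can submit the resume task. Both cannot happen for the same command.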
Using fibers/coroutines for suspend-resume would be tempting, but it is not really possible cross-platform (already tried), and we would need an alternate stack for every THD, which makes it expensive.
Instead, we save the minimal state manually during suspend, and resume by using "goto" to jump to the place where we previously left off. Luckily this is just a single place at the end of dispatch_command(), and most of the things we need are in the THD already.
Note that while the THD is suspended, not much can happen to it. It won't read new queries from the client, and it can't be killed (it can be marked for kill, and will notice the kill flag once resumed). We expect only a very short suspend duration, so this is not a problem.
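As an illustration of the manual save-and-goto approach, the following sketch shows a function with a single suspension point near its end. Everything in it is hypothetical: ThdSketch, dispatch_command_sketch and the log strings only stand in for the real THD, dispatch_command() and net_real_write.

```cpp
#include <string>
#include <vector>

enum class Result { DONE, SUSPENDED };

// Hypothetical stand-in for the parts of THD that matter here.
struct ThdSketch {
  int pending_async_ops = 0;     // set by the caller in this sketch
  bool suspended = false;        // true between suspend and resume
  int saved_status = 0;          // minimal state saved across the suspend
  std::vector<std::string> log;  // records what the "server" did
};

Result dispatch_command_sketch(ThdSketch &thd, const std::string &query) {
  int status;  // no initializer, assigned on both paths before use

  if (thd.suspended) {
    // Second entry: restore the saved state and jump to the tail.
    thd.suspended = false;
    status = thd.saved_status;
    goto resume;
  }

  thd.log.push_back("execute " + query);
  status = 0;

  if (thd.pending_async_ops != 0) {
    // Durability not yet reached: save state and return without replying.
    thd.saved_status = status;
    thd.suspended = true;
    return Result::SUSPENDED;
  }

resume:
  thd.log.push_back(status == 0 ? "send OK" : "send ERR");
  thd.log.push_back("cleanup");
  return Result::DONE;
}
```

The first call returns SUSPENDED without replying when operations are pending. A second call on the same ThdSketch, made once the count has dropped to zero, skips execution and only sends the reply and cleans up.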