[MDEV-24341] Innodb - do not block in foreground thread in log_write_up_to() Created: 2020-12-03  Updated: 2023-09-06  Resolved: 2021-02-15

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: None
Fix Version/s: 10.6.0

Type: Bug Priority: Major
Reporter: Vladislav Vaintroub Assignee: Vladislav Vaintroub
Resolution: Fixed Votes: 0
Labels: performance

Issue Links:
Problem/Incident
causes MDEV-29843 Server hang in thd_decrement_pending_... Closed
Relates
relates to MDEV-26007 Rollback unnecessarily initiates redo... Closed
relates to MDEV-18959 Engine transaction recovery through p... Stalled
relates to MDEV-32103 InnoDB ALTER TABLE is not crash safe Closed

 Description   

The idea is to initiate the flush to disk asynchronously, and to delay waiting for the transaction to become persistent for as long as possible. In the best case there will be no waiting in the client (foreground) thread.

If pool-of-threads is used, do_command() can complete without sending the "OK packet" to the client, freeing the worker thread to serve another client. The rest of the command will be done later, when the data is persistent.

If thread-per-connection is used, the benefits are somewhat smaller, but we still can delay waiting for durability until the very end of do_command (or dispatch_command).

This is an improvement upon existing group commit logic.

What is the improvement

Threads do not need to wait for disk IO or for the flush to complete, and can do other work required to wrap up the query (closing tables, etc). If we're lucky, by the time the foreground thread has finished the query, the InnoDB data is already flushed to the redo log, and there was no waiting at all. If we're less lucky, the time spent waiting is still reduced in one-thread-per-connection. In the threadpool, the worker thread can handle a different client instead of waiting for the redo log flush.

How it is implemented

Server side

Introduce THD::async_state to track operations that must be finished before the server sends a reply to the client. Since we promise durability, modified transaction data needs to be durably stored before the client receives an OK reply.
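A minimal sketch of what such a tracker could look like, assuming it boils down to a counter of pending operations. The class name AsyncState and its methods are hypothetical and are not taken from the actual patch.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for THD::async_state: counts asynchronous
// operations that must finish before the reply may be sent.
class AsyncState {
public:
  // Called when an asynchronous operation (e.g. a redo log flush) starts.
  void inc_pending() { pending_.fetch_add(1, std::memory_order_relaxed); }

  // Called from the completion callback. Returns true if this was the
  // last pending operation, i.e. the reply may now be sent.
  bool dec_pending() {
    int prev = pending_.fetch_sub(1, std::memory_order_acq_rel);
    assert(prev > 0);
    return prev == 1;
  }

  // Number of operations still in flight.
  int pending() const { return pending_.load(std::memory_order_acquire); }

private:
  std::atomic<int> pending_{0};
};
```

Returning "was this the last one" from the decrement lets the completing thread decide, without a second read of the counter, whether it is the one that has to wake up or resume the connection.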

Innodb side

Change log_write_up_to() so that it supports asynchronous waiting (i.e. initiate the write and flush rather than wait for completion). The underlying group_commit_lock would need to support on-completion callbacks (which change the THD::async_state above), in addition to locks. The group commit logic will need to change a little, but the idea remains that there is one group commit leader, which does the operation synchronously. Other threads that call log_write_up_to won't wait; completion is signalled to the server via the callback.

Sometimes a blocking wait is required, so log_write_up_to will allow either blocking or non-blocking waits.
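The blocking and non-blocking variants can be illustrated with a simplified notifier. This is only a sketch under assumptions: FlushNotifier, wait_async, wait_sync and release are hypothetical names, and the real group_commit_lock also elects the group commit leader, which is left out here.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

using lsn_t = std::uint64_t;

// Hypothetical, simplified model of "wake up waiters or run callbacks
// once the log is durable up to a given LSN".
class FlushNotifier {
public:
  // Non-blocking wait: returns true if lsn is already durable; otherwise
  // registers the callback and returns false without waiting.
  bool wait_async(lsn_t lsn, std::function<void()> on_durable) {
    std::lock_guard<std::mutex> g(mu_);
    if (lsn <= flushed_)
      return true;
    waiters_.emplace(lsn, std::move(on_durable));
    return false;
  }

  // Blocking wait: returns only once lsn is durable.
  void wait_sync(lsn_t lsn) {
    std::unique_lock<std::mutex> g(mu_);
    cv_.wait(g, [&] { return lsn <= flushed_; });
  }

  // Called by the leader after it has flushed the log up to 'flushed'.
  void release(lsn_t flushed) {
    std::vector<std::function<void()>> ready;
    {
      std::lock_guard<std::mutex> g(mu_);
      if (flushed > flushed_)
        flushed_ = flushed;
      // Collect every callback whose LSN is now durable.
      auto end = waiters_.upper_bound(flushed_);
      for (auto it = waiters_.begin(); it != end; ++it)
        ready.push_back(std::move(it->second));
      waiters_.erase(waiters_.begin(), end);
    }
    cv_.notify_all();       // wake blocking waiters
    for (auto &cb : ready)  // run callbacks outside the mutex
      cb();
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  lsn_t flushed_ = 0;
  std::multimap<lsn_t, std::function<void()>> waiters_;
};
```

In this sketch the callbacks run outside the mutex, so a callback that touches connection state cannot deadlock against a thread that is registering a new waiter.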

One-thread-per-connection scheduler handling (wait for async operations to finish)

The server can block when it writes the response to the client (net_real_write will wait for the async operation count to go down to 0).
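As a rough illustration of the blocking variant, assuming the counter is protected by a mutex and paired with a condition variable: the PendingOps class and the send_reply function below are hypothetical and only model the "wait until the count is 0, then write" step.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical per-connection counter of outstanding async operations.
class PendingOps {
public:
  // An async operation (e.g. a log flush) was started.
  void begin() {
    std::lock_guard<std::mutex> g(mu_);
    ++count_;
  }

  // An async operation completed; wake the connection thread on zero.
  void end() {
    std::lock_guard<std::mutex> g(mu_);
    assert(count_ > 0);
    if (--count_ == 0)
      cv_.notify_all();
  }

  // Block the calling (connection) thread until nothing is pending.
  void wait_until_idle() {
    std::unique_lock<std::mutex> g(mu_);
    cv_.wait(g, [&] { return count_ == 0; });
  }

  int count() {
    std::lock_guard<std::mutex> g(mu_);
    return count_;
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  int count_ = 0;
};

// Hypothetical reply path: the wait is delayed until the very last step.
bool send_reply(PendingOps &ops) {
  ops.wait_until_idle();  // returns at once if the flush already finished
  return true;            // the actual network write would happen here
}
```

If the flush has already completed by the time the reply is written, wait_until_idle() returns immediately, which is the "no waiting" best case described above.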

Threadpool (don't wait for async operations to finish, instead suspend/resume THD)

If the async operation counter != 0 at the end of dispatch_command(), the current execution state will be saved and the THD is "suspended", meaning that dispatch_command()/do_command() will finish before net_real_write. Since do_command() finishes, the worker thread in the pool is able to do something else, e.g. handle another client's query.

When the InnoDB group commit leader (the background thread that does log_write_up_to) has flushed the corresponding LSN to disk, it decrements the async operation count for the THD. If the count goes to zero and the leader finds that the THD was suspended, it resumes the THD by submitting a corresponding task to the threadpool. This task continues do_command() from where it was suspended, which is basically just net_real_write and some end-of-query cleanup.

How suspend/resume is implemented

Using fibers/coroutines for suspend/resume would be tempting, but it is not really possible cross-platform (already tried), and we'd need an alternate stack for every THD, which makes it expensive.
We just save the minimal state manually during suspend, and resume by using "goto" to the place where we previously left off, which luckily is just a single place at the end of dispatch_command(), and most of the things we need are in the THD already.
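The manual suspend/resume can be sketched as follows. Everything here is a hypothetical model, not the patch itself: Connection stands in for THD, dispatch for dispatch_command(), and on_flush_complete for the completion callback. In the design described above the resume is submitted as a threadpool task, while this sketch calls it directly to stay self-contained.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class DispatchResult { DONE, SUSPENDED };

// Hypothetical stand-in for THD: holds everything needed to resume.
struct Connection {
  int pending_async = 0;          // outstanding durability waits
  bool suspended = false;
  std::string saved_reply;        // minimal state saved across suspend
  std::vector<std::string> sent;  // replies "written" to the client
};

DispatchResult dispatch(Connection &c, const std::string &query, bool resume) {
  if (resume)
    goto resume_point;  // skip execution, continue at the tail

  // Normal path: "execute" the query. Assume the commit started one
  // asynchronous log flush that has not completed yet.
  c.saved_reply = "OK: " + query;
  c.pending_async = 1;

  if (c.pending_async != 0) {
    // Not durable yet: save state and return, freeing the worker thread.
    c.suspended = true;
    return DispatchResult::SUSPENDED;
  }

resume_point:
  // Tail of the command: send the reply and do end-of-query cleanup.
  c.suspended = false;
  c.sent.push_back(c.saved_reply);
  return DispatchResult::DONE;
}

// Completion callback. Returns true if it resumed a suspended connection.
bool on_flush_complete(Connection &c) {
  if (--c.pending_async == 0 && c.suspended) {
    dispatch(c, "", true);  // the real design submits a threadpool task
    return true;
  }
  return false;
}
```

Because all state lives in the Connection object, no local variable is declared between the goto and its label, which C++ requires for such a jump to be legal.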

Note that while the THD is suspended, not much can happen to it. It won't read new queries from the client, and it can't be killed (it can be marked for kill, and it will notice the kill flag once resumed). We expect only a very short suspend duration, so this is not a problem.



 Comments   
Comment by Marko Mäkelä [ 2020-12-03 ]

Currently, the log uses synchronous writes. Could we change that to asynchronous I/O as part of this task?

Comment by Vladislav Vaintroub [ 2020-12-11 ]

marko It probably makes sense to try it after this task, which is relatively large. For simplicity, I did not change the existing log_write_up_to, and only changed group_commit_lock somewhat, so that after a flush it can signal LSN waiters using callbacks, in addition to wakeups. The log_write_up_to is still synchronous, and it needs to be synchronous in some situations.

At first glance, it seems that splitting log_write_up_to into a part that is executed by the foreground thread and a part executed on IO completion is not trivial, mainly due to the weird write loop inside log_write_buf(). I'd rather just do a single AIO, not many.

Comment by Oleksandr Byelkin [ 2021-02-12 ]

c4d15d677a981cec07d3313ae7a444a08f36dfbb OK to push after fixing comment and the name of dispatch return as we agreed.
