[MDEV-24341] Innodb - do not block in foreground thread in log_write_up_to() - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 10.6.0
Component/s: Server, Storage Engine - InnoDB
Labels:
- performance

Description

The idea is to initiate the flush to disk asynchronously and to postpone waiting for the transaction to become persistent for as long as possible. In the best case, the client (foreground) thread does not wait at all.

If pool-of-threads is used, do_command() can complete without sending the "OK packet" to the client, freeing the worker thread to serve another client. The rest of the command is done later, once the data is persistent.

If thread-per-connection is used, the benefits are somewhat smaller, but we can still delay waiting for durability until the very end of do_command() (or dispatch_command()).

This is an improvement upon the existing group commit logic.

What is the improvement

Threads do not need to wait for disk I/O or for the flush to complete, and can do other work required to wrap up the query (closing tables, etc.). If we're lucky, by the time the foreground thread has finished the query, the InnoDB data has already been flushed to the redo log, and there is zero waiting. If we're less lucky, the waiting time is still reduced with one-thread-per-connection. In the threadpool, the worker thread can handle a different client instead of waiting for the redo log flush.

How it is implemented

Server side

Introduce THD::async_state to track operations that must finish before the server sends a reply to the client. Since we promise durability, modified transaction data needs to be durably stored before the client receives an OK reply.
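A minimal sketch of such a per-connection counter follows. The class name AsyncStateSketch and its methods are hypothetical and chosen for illustration only; the actual layout of THD::async_state is not described in this ticket.

```cpp
#include <atomic>

// Hypothetical stand-in for the proposed THD::async_state.
// It only counts operations that must finish before the reply is sent.
class AsyncStateSketch {
public:
  // An engine calls this when it starts an asynchronous operation
  // (for example a redo log flush) on behalf of this connection.
  void inc_pending_ops() { pending_.fetch_add(1, std::memory_order_relaxed); }

  // Called from the completion callback. Returns how many operations
  // are still outstanding; the caller that sees 0 completed the last one.
  int dec_pending_ops() {
    return pending_.fetch_sub(1, std::memory_order_acq_rel) - 1;
  }

  // The server may send the OK packet only when nothing is pending.
  bool reply_allowed() const {
    return pending_.load(std::memory_order_acquire) == 0;
  }

private:
  std::atomic<int> pending_{0};
};
```

Returning the remaining count from the decrement lets exactly one caller observe the transition to zero, which is the natural point to release or resume the connection.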

Innodb side

Change log_write_up_to() so that it supports asynchronous waiting (i.e. it initiates the write and flush rather than waiting for completion). The underlying group_commit_lock would need to support on-completion callbacks (which change the THD::async_state above) in addition to locks. The group commit logic will need to change a little, but the idea remains that there is one group commit leader that performs the operation synchronously. Other threads that call log_write_up_to() won't wait; instead, completion is signaled to the server via the callback.

Sometimes a blocking wait is required, so log_write_up_to() will allow either blocking or non-blocking waits.
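The callback idea can be sketched as follows. This is not the real group_commit_lock interface: the class GroupCommitSketch, its request() and complete() methods, and the Result enum are assumptions made for illustration. A follower registers a callback for its LSN and returns immediately, and the leader runs all satisfied callbacks once the flush is done.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

using lsn_t = std::uint64_t;
using Callback = std::function<void()>;

// Hypothetical sketch of a group commit lock with completion callbacks.
class GroupCommitSketch {
public:
  enum class Result { ALREADY_DONE, CALLBACK_QUEUED };

  // Non-blocking request: if lsn is already durable, say so;
  // otherwise remember the callback and return at once.
  Result request(lsn_t lsn, Callback cb) {
    std::lock_guard<std::mutex> guard(mutex_);
    if (lsn <= durable_lsn_) return Result::ALREADY_DONE;
    waiters_.push_back({lsn, std::move(cb)});
    return Result::CALLBACK_QUEUED;
  }

  // Called by the leader after it has written and flushed up to lsn.
  void complete(lsn_t lsn) {
    std::vector<Callback> ready;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      if (lsn > durable_lsn_) durable_lsn_ = lsn;
      for (auto it = waiters_.begin(); it != waiters_.end();) {
        if (it->lsn <= durable_lsn_) {
          ready.push_back(std::move(it->cb));
          it = waiters_.erase(it);
        } else {
          ++it;
        }
      }
    }
    // Run callbacks outside the mutex so they may safely call back in.
    for (auto &cb : ready) cb();
  }

private:
  struct Waiter { lsn_t lsn; Callback cb; };
  std::mutex mutex_;
  lsn_t durable_lsn_ = 0;
  std::vector<Waiter> waiters_;
};
```

In this sketch a blocking wait could be layered on top by passing a callback that signals a condition variable, which is one way to offer both modes through a single entry point.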

One-thread-per-connection scheduler handling (wait for async operations to finish)

The server can block when it writes the response to the client (net_real_write will wait for the async operation count to drop to 0).
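A blocking wait of this kind might look like the sketch below. PendingOps and write_reply() are hypothetical names; write_reply() merely stands in for the point in net_real_write where the server would block, and the "wire" string stands in for the client socket.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>

// Hypothetical per-connection counter with a blocking wait.
class PendingOps {
public:
  void begin() {
    std::lock_guard<std::mutex> guard(mutex_);
    ++count_;
  }

  // Called by the completion callback, typically from another thread.
  void end() {
    std::lock_guard<std::mutex> guard(mutex_);
    if (--count_ == 0) idle_.notify_all();
  }

  // Blocks the connection thread until all operations have completed.
  void wait_until_idle() {
    std::unique_lock<std::mutex> guard(mutex_);
    idle_.wait(guard, [this] { return count_ == 0; });
  }

  int pending() {
    std::lock_guard<std::mutex> guard(mutex_);
    return count_;
  }

private:
  std::mutex mutex_;
  std::condition_variable idle_;
  int count_ = 0;
};

// Stand-in for the reply path: nothing reaches the client until the
// pending count has dropped to zero.
void write_reply(PendingOps &ops, std::string &wire, const std::string &packet) {
  ops.wait_until_idle();
  wire += packet;
}
```

If the flush finished while the thread was still closing tables, wait_until_idle() returns immediately, which is the "zero waiting" case described earlier.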

Threadpool (don't wait for async operations to finish, instead suspend/resume THD)

If the async operation counter != 0 at the end of dispatch_command(), the current execution state is saved and the THD is "suspended", meaning that dispatch_command()/do_command() will finish before net_real_write. Since do_command() finishes, the worker thread in the pool is able to do something else, e.g. handle another client's query.

When the InnoDB group commit leader (the background thread that does log_write_up_to) flushes the corresponding LSN to disk, it decrements the async operation count for the THD. If the count goes to zero and the leader finds that the THD was suspended, it resumes the THD by submitting a corresponding task to the threadpool. This task continues do_command() from where it was suspended; basically, that is just net_real_write and some end-of-query cleanup.
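The handshake between the worker thread and the group commit leader can be sketched as below. All names here (TaskQueue, PooledSession, end_of_command, on_log_durable) are hypothetical, and the "pool" is just a queue drained by hand so that the example stays deterministic. The mutex is there because the suspend decision and the final decrement may race.

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the threadpool: tasks are queued, then run.
struct TaskQueue {
  std::deque<std::function<void()>> tasks;
  void submit(std::function<void()> task) { tasks.push_back(std::move(task)); }
  void run_all() {
    while (!tasks.empty()) {
      auto task = std::move(tasks.front());
      tasks.pop_front();
      task();
    }
  }
};

// Hypothetical stand-in for the parts of THD relevant here.
struct PooledSession {
  std::mutex mutex;
  int pending_ops = 0;
  bool suspended = false;
  std::vector<std::string> log;  // records what happened, for illustration
};

// Stand-in for net_real_write plus end-of-query cleanup.
void finish_command(PooledSession &s) { s.log.push_back("reply sent"); }

// End of dispatch: returns false if the session was suspended instead.
bool end_of_command(PooledSession &s) {
  {
    std::lock_guard<std::mutex> guard(s.mutex);
    if (s.pending_ops != 0) {
      s.suspended = true;
      s.log.push_back("suspended");
      return false;  // worker thread is now free for another client
    }
  }
  finish_command(s);
  return true;
}

// Completion callback run by the group commit leader.
void on_log_durable(PooledSession &s, TaskQueue &pool) {
  bool resume = false;
  {
    std::lock_guard<std::mutex> guard(s.mutex);
    if (--s.pending_ops == 0 && s.suspended) {
      s.suspended = false;
      resume = true;
    }
  }
  if (resume) pool.submit([&s] { finish_command(s); });
}
```

Because both sides take the same mutex, either the worker sees a zero count and replies itself, or the leader sees the suspended flag and schedules the resume; the reply is never lost and never sent twice.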

How suspend/resume is implemented

Using fibers/coroutines for suspend/resume would be tempting, but it is not really possible cross-platform (already tried), and we'd need an alternate stack for every THD, which makes it expensive.
Instead, we save the minimal state manually during suspend and resume by using "goto" to jump to the place where execution previously left off, which is luckily just a single place at the end of dispatch_command(); most of the things we need are in the THD already.

Note that while the THD is suspended, not much can happen to it. It won't read new queries from the client, and it can't be killed (it can be marked for kill, and it will notice the kill flag once resumed). We expect only a very short suspend duration, so this is not a problem.
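The manual save-and-goto approach can be illustrated with a toy dispatcher. This is only a sketch under assumptions: ResumableSession, dispatch() and the is_resume flag are hypothetical, and the real dispatch_command() is far larger. The point is that a single resume label near the end of the function is enough, with no separate stack.

```cpp
#include <string>
#include <vector>

enum class Outcome { DONE, SUSPENDED };

// Hypothetical stand-in for THD: the state needed after resume lives here.
struct ResumableSession {
  int pending_ops = 0;
  std::vector<std::string> trace;  // records the steps taken
};

// Toy dispatcher. On a resume call it jumps straight to the single
// resume point instead of executing the command again.
Outcome dispatch(ResumableSession &s, bool is_resume) {
  if (is_resume) goto resume_point;

  s.trace.push_back("execute");
  if (s.pending_ops != 0) {
    // Everything needed later is already stored in the session,
    // so returning here is all that "suspend" requires.
    s.trace.push_back("suspend");
    return Outcome::SUSPENDED;
  }

resume_point:
  s.trace.push_back("reply");  // stand-in for net_real_write and cleanup
  return Outcome::DONE;
}
```

A first call with pending operations returns SUSPENDED; once the count reaches zero, a second call with is_resume set to true performs only the reply step.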

Issue Links

causes
- MDEV-29843 Server hang in thd_decrement_pending_ops/pthread_cond_signal (Closed)

relates to
- MDEV-26007 Rollback unnecessarily initiates redo log write (Closed)
- MDEV-18959 Engine transaction recovery through persistent binlog (Stalled)
- MDEV-32103 InnoDB ALTER TABLE is not crash safe (Closed)

People

Assignee: Vladislav Vaintroub

Reporter: Vladislav Vaintroub

Votes: 0

Watchers: 5

Dates

Created: 2020-12-03 14:52

Updated: 2023-09-06 15:03

Resolved: 2021-02-15 08:15
