MariaDB Server
MDEV-38776

[ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up.


Details

    • Can result in hang or crash

    Description

      This has been seen in the wild for at least a couple of years, but is very elusive and not easily repeatable.

      It happens in parallel replication, where the slave will start to hang (make no progress) for a while and then eventually fail with:

      [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
      

      A worker thread gets repeated innodb_lock_wait_timeout errors until it eventually gives up after slave_transaction_retries attempts and stops the slave with the error. With the default settings of 10 retries and a 50-second lock wait timeout, this happens after roughly 500 seconds.
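
      For reference, the relevant settings can be inspected on the slave with something like the following (a diagnostic sketch; the values in the comments are the defaults):

      SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retries';  -- default 10
      SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';   -- default 50 (seconds)
      -- Each retry can wait up to innodb_lock_wait_timeout seconds before timing
      -- out, so with the defaults the slave gives up after about 10 * 50 = 500 s.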

      In one instance it was observed that, during the hang and just prior to the error, another worker was in this state:

      Waiting for parallel replication deadlock handling to complete
      

      This state occurs when a conflict between two in-order transactions T1 and T2 has been detected. T1 is determined to be the priority transaction, and a kill of transaction T2 is queued. If T2 is about to commit at just that point, it will notice the pending kill and instead go wait for the kill to complete before rolling back and retrying.

      Thus, this state is only supposed to exist for a very short time. If a worker gets stuck in it, however, it can block replication indefinitely, so this state can explain the resulting error.
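
      A worker stuck in this state can be spotted from another connection with a query along these lines (a diagnostic sketch; the state text must match the wait state quoted above exactly):

      SELECT ID, COMMAND, TIME, STATE
        FROM information_schema.PROCESSLIST
       WHERE STATE = 'Waiting for parallel replication deadlock handling to complete';
      -- A large TIME value means a worker has been sitting in this supposedly
      -- short-lived wait for that many seconds.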

      EDIT: The culprit seems to be this commit:

      commit 34f11b06e6aa5e8d4e648c04b5a4049179b66cfd
      Author: Kristian Nielsen <knielsen@knielsen-hq.org>
      Date:   Sun Oct 14 20:41:49 2018 +0200
       
          Move deletion of old GTID rows to slave background thread
      

      This introduced an additional job into the slave background thread, which is also the thread responsible for deadlock handling/kill. If this unrelated processing gets stuck in an InnoDB row lock wait, it can block the critical deadlock processing from running. It has been confirmed from user observations that the problem can occur while the deletion of old GTID rows is stuck on an InnoDB row lock.
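
      The old GTID rows live in mysql.gtid_slave_pos (or per-engine variants such as gtid_slave_pos_InnoDB, where used). Whether something is currently blocked on a row lock on that table can be checked with a sketch like the following (availability of these tables depends on version; trx_query may be NULL for the background delete):

      SELECT w.requesting_trx_id, w.blocking_trx_id, l.lock_table,
             b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query
        FROM information_schema.INNODB_LOCK_WAITS w
        JOIN information_schema.INNODB_LOCKS l ON l.lock_id = w.requested_lock_id
        JOIN information_schema.INNODB_TRX b  ON b.trx_id  = w.blocking_trx_id
       WHERE l.lock_table LIKE '%gtid_slave_pos%';
      -- A row here shows a transaction waiting for a row lock on the GTID
      -- position table, together with the transaction that is blocking it.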

      Later, this patch moved the processing into the manager thread, where there is potentially even more contended processing happening:

      commit 6a1cb449feb1b77e5ec94904c228d7c5477f528a
      Author: Sergei Golubchik <serg@mariadb.org>
      Date:   Mon Jan 18 18:02:16 2021 +0100
       
          cleanup: remove slave background thread, use handle_manager thread instead
      

      The queueing and completion of the kill during parallel replication deadlock handling is a critical operation. It must happen as quickly as possible, and without taking any locks or waiting on other dependencies, otherwise the whole replication may end up deadlocked. If the manager thread ends up waiting for some other thread (say on an InnoDB row lock or a mutex), this may block deadlock processing, which in turn may block whatever the manager thread is waiting for.

      (Even apart from the deadlock possibility, a pending deadlock kill means that the entire parallel replication is momentarily blocked. Resolving it needs a dedicated thread that acts as quickly as possible; it should never have to wait for unrelated activity to be processed first.)

      For example, the manager thread handles bg_gtid_delete_pending, which can take InnoDB row locks. Even worse, it runs tc_purge() while holding LOCK_manager, which can end up blocking slave_background_kill_request(), a function that is called while holding the InnoDB lock_sys mutex.

      Thus, to fix this regression, the above commit must be reverted as far as it removes the dedicated thread handling deadlock kills, while bg_gtid_delete_pending should remain in the manager thread, where it cannot block parallel replication deadlock handling.

      EDIT: I got additional info from an affected user. This shows definitively (gdb stacktrace) that the manager thread is running bg_gtid_delete_pending() and is blocked on an InnoDB row lock inside ha_innobase::delete_row().

      Meanwhile one worker thread is stuck in the state "Waiting for parallel replication deadlock handling to complete", because the manager is stuck and cannot process the queued kill to resolve the wait. Thus, replication is now deadlocked and hung.

      The new information shows that the problem is in fact much worse than the rare occurrence of the slave stopping with "retried transaction 10 time(s) in vain, giving up". The stuck state "Waiting for parallel replication deadlock handling to complete" was observed multiple times within a short period, hanging replication for some interval but eventually resolving itself, presumably on a lock wait timeout or similar. Thus, this bug seems to cause frequent smaller slave stalls, which make small amounts of replication lag build up regularly and are very hard to diagnose.
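
      To catch these transient stalls in the wild, one could sample the processlist periodically, for example with a sketch like this (hypothetical helper; it assumes event_scheduler=ON and a scratch schema named monitor):

      CREATE TABLE IF NOT EXISTS monitor.deadlock_wait_log (
        sampled_at       DATETIME,
        thread_id        BIGINT,
        seconds_in_state BIGINT
      );

      CREATE EVENT IF NOT EXISTS monitor.sample_deadlock_wait
        ON SCHEDULE EVERY 1 MINUTE DO
        INSERT INTO monitor.deadlock_wait_log
        SELECT NOW(), ID, TIME
          FROM information_schema.PROCESSLIST
         WHERE STATE = 'Waiting for parallel replication deadlock handling to complete';
      -- Rows accumulating in monitor.deadlock_wait_log indicate stalls that
      -- resolved before anyone could observe them interactively.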


People

  Assignee: Kristian Nielsen
  Reporter: Kristian Nielsen
