[MDEV-38003] Intermittent Replication Failure on a multi-table DELETE - Jira

XML

Word

Printable

Details

Type: Bug
Status: Stalled (View Workflow)
Priority: Critical
Resolution: Unresolved
Affects Version/s: 11.8.3
Fix Version/s: 11.8
Component/s: Optimizer, Replication
Labels:
Environment:
Debian GNU/Linux 12 (bookworm) using mariadb 11.8.3 from mariadb repos

Bug Category:
Can result in data loss

Description

I upgraded from 11.4.x last week. After an overnight batch process, I got a replication failure on several tables. I determined the cause was a single, multi-table DELETE that cleans up some old rows.

The statement works fine on the MASTER but does not replicate to the SLAVE. It doesn't even appear in the MASTER's binlog and so never even gets sent to the SLAVE I don't think. I use MIXED replication. Here's the kicker. The second time I run the statement, when there are 0 rows to delete on the MASTER, then the statement is logged and replicated and the SLAVE gets back into sync. This HAS TO BE A REGRESSION.

This is a routine function that has run nightly for over 10 years and never caused a single problem until I upgraded to 11.8. I can get nothing from the error logs about it. It issues no warnings when I try SHOW WARNINGS. I even enabled performance_schema logging and this is the result:

THREAD_ID: 2152574
EVENT_ID: 6
END_EVENT_ID: 6
EVENT_NAME: statement/sql/delete_multi
SOURCE:
TIMER_START: 384545772677685000
TIMER_END: 384545783917225000
TIMER_WAIT: 11239540000
LOCK_TIME: 157000000
SQL_TEXT: DELETE pc, pcru, cl, cs
FROM pc
LEFT JOIN pcru ON pcru.pc_id=pc.pc_id
LEFT JOIN cl ON cl.pc_id=pc.pc_id
LEFT JOIN cs ON cs.pc_id=pc.pc_id
WHERE pc.ru_id='262'
AND pc.time<'2025-08-31 00:00:00'
DIGEST: a5fae46dc7cf0c4a2b40c1cc8d78260d
CURRENT_SCHEMA: production
OBJECT_TYPE: NULL
OBJECT_SCHEMA: NULL
OBJECT_NAME: NULL
OBJECT_INSTANCE_BEGIN: NULL
MYSQL_ERRNO: 0
RETURNED_SQLSTATE: 00000
MESSAGE_TEXT: NULL
ERRORS: 0
WARNINGS: 0
ROWS_AFFECTED: 144
ROWS_SENT: 0
ROWS_EXAMINED: 1710
CREATED_TMP_DISK_TABLES: 0
CREATED_TMP_TABLES: 4
SELECT_FULL_JOIN: 0
SELECT_FULL_RANGE_JOIN: 0
SELECT_RANGE: 0
SELECT_RANGE_CHECK: 0
SELECT_SCAN: 0
SORT_MERGE_PASSES: 0
SORT_RANGE: 0
SORT_ROWS: 0
SORT_SCAN: 0
NO_INDEX_USED: 0
NO_GOOD_INDEX_USED: 0
NESTING_EVENT_ID: NULL
NESTING_EVENT_TYPE: NULL
NESTING_EVENT_LEVEL: 0

So this shows ROWS AFFECTED: 144 – this is the first time. The rows are successfully deleted on the MASTER without any issue but nothing happens on the SLAVE. Here is what it looks like the 2nd time the command is issued and where the SLAVE does delete its rows and "catches up" to the MASTER:

THREAD_ID: 2157652
EVENT_ID: 6
END_EVENT_ID: 6
EVENT_NAME: statement/sql/delete_multi
SOURCE:
TIMER_START: 386046338379228000
TIMER_END: 386046344241938000
TIMER_WAIT: 5862710000
LOCK_TIME: 163000000
SQL_TEXT: DELETE pc, pcru, cl, cs
FROM pc
LEFT JOIN pcru ON pcru.pc_id=pc.pc_id
LEFT JOIN cl ON cl.pc_id=pc.pc_id
LEFT JOIN cs ON cs.pc_id=pc.pc_id
WHERE pc.ru_id='262'
AND pc.time<'2025-08-31 00:00:00'
DIGEST: a5fae46dc7cf0c4a2b40c1cc8d78260d
CURRENT_SCHEMA: production
OBJECT_TYPE: NULL
OBJECT_SCHEMA: NULL
OBJECT_NAME: NULL
OBJECT_INSTANCE_BEGIN: NULL
MYSQL_ERRNO: 0
RETURNED_SQLSTATE: 00000
MESSAGE_TEXT: NULL
ERRORS: 0
WARNINGS: 0
ROWS_AFFECTED: 0
ROWS_SENT: 0
ROWS_EXAMINED: 1447
CREATED_TMP_DISK_TABLES: 0
CREATED_TMP_TABLES: 4
SELECT_FULL_JOIN: 0
SELECT_FULL_RANGE_JOIN: 0
SELECT_RANGE: 0
SELECT_RANGE_CHECK: 0
SELECT_SCAN: 0
SORT_MERGE_PASSES: 0
SORT_RANGE: 0
SORT_ROWS: 0
SORT_SCAN: 0
NO_INDEX_USED: 0
NO_GOOD_INDEX_USED: 0
NESTING_EVENT_ID: NULL
NESTING_EVENT_TYPE: NULL
NESTING_EVENT_LEVEL: 0

Notice the 2nd time it says ROWS AFFECTED: 0 . That is the only clue I have in this entire, miserable experience.

All tables are innodb. It's not a complicated query. It uses the pc table's PRIMARY KEY, which contains an INDEXED column in all the subordinate tables. The DELETE takes miliseconds to perform. It has run everyday for over 10 years without ever causing a single issue – until I upgraded to 11.8 .

I don't see how the query could be considered non-deterministic for replication, and as I said already, I'm using MIXED mode, so if a "STATEMENT" were considered bad, it SHOULD switch to ROW based replication. If nothing else, it should issue warnings or log something to the error log (I set log_warnings to 3), and yet nothing.

There are any number of workarounds. Right now I just run the DELETE statement twice in a row. But beyond the inexplicable mystery of this, I have to be concerned that other queries will fall prey to this same regression. I use multi-table DELETES throughout several code bases. Fortunately none of the others are triggering this bug.

Attachments

Activity

People

Assignee:: Dave Gosselin

Reporter:: Eric X

Votes:: 1 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 2025-10-31 21:00

Updated:: 2 days ago 11:28

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.