[MDEV-25607] Auto-generated DELETE from HEAP table can break replication Created: 2021-05-05 Updated: 2023-12-15 |
|
| Status: | In Review |
| Project: | MariaDB Server |
| Component/s: | Replication, Storage Engine - Memory |
| Affects Version/s: | 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9 |
| Fix Version/s: | 10.4, 10.5, 10.6 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Elena Stepanova | Assignee: | Andrei Elkin |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Description |
|
After a server restart, a DELETE query is written into the binary log for every HEAP table to reflect that the restart emptied it. It is written unconditionally, regardless of whether it is actually executable. If it is not, it causes replication to abort. In the example test case below, the DELETE causes an error because the table has a DELETE trigger which refers to a non-existent table.
|
| Comments |
| Comment by Brandon Nesterenko [ 2021-06-07 ] |
|
Conceptually, the fix for this would depend on what we determine to be the underlying cause. What would be the contributing "issue(s)" here (as opposed to allowable input/behavior)? Each problem would then have a different fix: My thought would be just #1. Elkin |
| Comment by Andrei Elkin [ 2021-06-08 ] |
|
bnestere: There's no crashing really, rather the slave stops with an error. But that's an inconsistency with the master in the very setup that is required. |
| Comment by Andrei Elkin [ 2021-06-08 ] |
|
elenst: Finally I understood the "magic". Needed to have the 2nd chance |
| Comment by Elena Stepanova [ 2021-06-08 ] |
|
That's exactly right. The essence of this report is that a DELETE which, had it been executed, would have caused an error on the master and thus would never have been written into the binary log, gets written there directly by the special HEAP logic, gets replicated to the slave, expectedly ends up with an error on the slave, and thus causes replication to abort due to the error code mismatch (0 vs non-zero). There is no previous discrepancy between master and slave, besides the inevitable loss of the memory table's contents after the master restart. Hypothetically, it is not in any way limited to triggers; it can be any error which satisfies the conditions above. Triggers were just the only example I came up with for a DELETE failure on a memory table. Surely, as bnestere describes, there can be numerous reasons for error code mismatches causing replication to abort. They would be different problems though, or non-bugs if, for example, they are caused by a previous schema/data discrepancy. The problem is, I don't see a way out of it. Even if you don't write the DELETE on the master, or if you ignore the event failure on the slave – we will still end up with a discrepancy in the table data, which will very likely make replication abort anyway, just later. Somehow the table on the slave has to be emptied to preserve consistency. |
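The error code mismatch described above can be illustrated with a toy model. This is only a sketch, not MariaDB code: `BinlogEvent`, `apply_on_slave`, `slave_execute`, and the table name `t1` are hypothetical names made up for illustration. The one real-world value is 1146, the server error code for ER_NO_SUCH_TABLE. The model assumes the auto-generated DELETE is logged with expected error 0 because it was never actually executed on the master.

```python
from dataclasses import dataclass


@dataclass
class BinlogEvent:
    query: str
    expected_errno: int  # error code recorded with the event (0 = success)


def apply_on_slave(event, execute):
    """Run the event and compare the actual error code with the recorded one."""
    actual_errno = execute(event.query)
    if actual_errno != event.expected_errno:
        return (f"replication stopped: expected error "
                f"{event.expected_errno}, got {actual_errno}")
    return "applied"


def slave_execute(query):
    # Hypothetical slave: a DELETE on t1 fails because the table's DELETE
    # trigger refers to a table that does not exist.
    if query.startswith("DELETE FROM t1"):
        return 1146  # ER_NO_SUCH_TABLE
    return 0


# The DELETE generated after restart is logged as if it had succeeded.
auto_delete = BinlogEvent("DELETE FROM t1", expected_errno=0)
print(apply_on_slave(auto_delete, slave_execute))
```

In this model the slave stops on the auto-generated event even though master and slave had identical schemas beforehand, which is the situation the report describes.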
| Comment by Andrei Elkin [ 2021-06-09 ] |
|
elenst, (re "The problem is") indeed! So we should still binlog it, but with a hint to the slave(s) not to call triggers - bnestere, right? And the best hint, to my knowledge, would be logging TRUNCATE instead of DELETE.
| Comment by Brandon Nesterenko [ 2021-06-09 ] |
|
Elkin Nice. Just did some testing/debugging with truncate, and I agree, I think truncate would work well here. |
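Why TRUNCATE sidesteps the failure can be sketched with a second toy model. It is a simplification under stated assumptions, not the actual patch: `MemoryTable`, `broken_trigger`, and `t_missing` are hypothetical names, and the per-row trigger call only approximates how the server handles triggers. The property it relies on is that TRUNCATE TABLE does not invoke DELETE triggers.

```python
class NoSuchTable(Exception):
    """Stands in for ER_NO_SUCH_TABLE raised from inside a trigger."""


class MemoryTable:
    def __init__(self, rows, delete_trigger=None):
        self.rows = list(rows)
        self.delete_trigger = delete_trigger

    def delete_all(self):
        # DELETE semantics (simplified): the DELETE trigger runs for each row.
        for row in list(self.rows):
            if self.delete_trigger is not None:
                self.delete_trigger(row)
            self.rows.remove(row)

    def truncate(self):
        # TRUNCATE semantics: rows are dropped without invoking DELETE triggers.
        self.rows.clear()


def broken_trigger(row):
    # Hypothetical trigger body that refers to a table which does not exist.
    raise NoSuchTable("t_missing")


slave_table = MemoryTable(rows=[1, 2, 3], delete_trigger=broken_trigger)

try:
    slave_table.delete_all()
    delete_result = "ok"
except NoSuchTable:
    delete_result = "error"

rows_after_delete = len(slave_table.rows)    # rows survive the failed DELETE
slave_table.truncate()
rows_after_truncate = len(slave_table.rows)  # table is emptied by TRUNCATE
print(delete_result, rows_after_delete, rows_after_truncate)
```

In the model, the DELETE path fails and leaves the slave's rows in place, while the TRUNCATE path empties the table, which is what consistency with the restarted master requires.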
| Comment by Brandon Nesterenko [ 2021-06-09 ] |
|
Hi Andrei, The fix is ready for review. |
| Comment by Andrei Elkin [ 2023-12-15 ] |
|
Status confirmed. The review should be completed so that the patch makes it into the upcoming CS release. |