[MDEV-29796] Auto-generated DELETE for MEMORY/HEAP table can break GTID-based replication Created: 2022-10-14  Updated: 2023-11-28

Status: Open
Project: MariaDB Server
Component/s: Replication, Storage Engine - Memory
Affects Version/s: 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9
Fix Version/s: 10.4, 10.5, 10.6

Type: Bug Priority: Major
Reporter: Valerii Kravchuk Assignee: Sergei Golubchik
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Blocks

 Description   

It's known (see MDEV-25607) even if not properly documented (see MDEV-29795) that After server restart, a DELETE query is written into the binary log for every HEAP table, at the moment when it is accessed in any way, to reflect the restart emptying them for replicas. It is written unconditionally, and if we restart a replica with, say, server_id = 2 in the domain_id = 0, this extra transaction (logged nowhere else) generates out of order GTID for the domain and then causes replication break in strict GTID mode. Moreover, replication can be broken any time later after restart, by simple SELECT from low-privileged user (like SELECT from information_schema.tables that touches any MEMORY table), even on slave with read_only = ON.

While it seems to be by design, this behaviour breaks simple single level replication setups with strict GTIDs. For the end user this looks like a bug hard to prevent and troubleshoot. We have to think about some other solution, maybe do NOT write these DELETEs to the binary log if read_only = ON (and the table is accessed by the user without SUPER privilege?), use some special separate domain_id, something else?



 Comments   
Comment by Andrei Elkin [ 2022-10-18 ]

valerii, I agree with the description. We should study how to block the unconditional DELETE. The read-only server may be always exempted from doing that. Let me redirect the question to serg (thanks in advance!).
Alternatively the DELETE could be deferred until the table has experienced the very first update operation on the restarted server. If the restarted server is demoted into slave, obviously no DELETE would be binlogged.

Comment by Andrei Elkin [ 2022-10-18 ]

Just a "binary" question is asked.

Comment by Sergei Golubchik [ 2022-10-18 ]

What about this logic: if auto-DELETE on MEMORY tables won't be replicated, replication will go out of sync. It will not go out of sync only if these tables are not replicated (--replicate-ignore-tables or something). So, if MEMORY table is ignored, its cleanup should not be binlogged, otherwise it should be.

Comment by Andrei Elkin [ 2022-10-19 ]

valerii, arguably the DELETE action is a kind of SUPER-user one. And it also makes sense to register in binlog any data change. So an isolated gtid domain for local updates deems the only way forward.

How to make it error-free?
If one just specifies those domains in ignore_domain_ids of any Change-Master, would not it be enough?

Also notice a separate gtid domain may need to be assigned for updates over the MEMORY tables.
After all, the read-only server may require its slave service configuration to not accept the MEMORY table replication events.
(Alternatively Sergei's mentioning of replicate-ignore-tables would do the same).
Consequently, ignore_domain_ids = memory_table_domain_id would be necessary on all read-only 'intermediate masters'.

Comment by Sergei Golubchik [ 2022-10-20 ]

--read-only cannot affect it. If the server is read-only, one shouldn't be able to create slave-only MEMORY tables in the first place. If CREATE TABLE was allowed, then DELETE should be too.

If those tables are slave-side only and should not be replicated, they should be ignored, and auto-DELETE should respect it.

new domain-id makes sense if these changes need to be replicated. If they don't — they need to be ignored, and then they don't need a domain-id

Comment by Andrei Elkin [ 2022-11-09 ]

serg, valerii: it must be clear the current DELETE binlogging policy makes sense only for restarted masters. (A pure) Slave should not log DELETE on its own. After all were it to receive a replication event on a post-restart empty table it would just stop with a typical ER_KEY_NOT_FOUND or such. When Slave also serves a master role then the user would have to organize gtid domains in her replication chain. Sure this measure alone won't help if a memory table is replicated throughout the chain while a slave-master server in the middle gets restarted so ultimately the restarter should receive a partial memory-table backup at the handshake with master (I have not tried to find one, but we should have a separate ticket for that).

--read-only then indeed is irrelevant to the pure slave which is supposed to never execute updates over replicated data/domains or if it does, at least it *must* not binlog them with gtids of the replication domains.
So that's a requirement to the user.
As to the server the pure slave should have read-access to empty tables without logging DELETE events.
That must be a solution.

Practically then, instead of deciding whether a server is the pure slave, why won't we extend the read-access-causes-no-DELETE over any server and only binlog DELETE when a server is about to write something into binlog (attempting to update a memory table or reads from it to update other tables)?

Assigning Sergei to reply to this.

Comment by Sergei Golubchik [ 2022-11-11 ]

I'm not sure I understand. Of course, DELETE is written to binlog only when a server is supposed to write to binlog. If it should never write anything to binlog, it has binlogging disabled, and then it won't write DELETE either.

Comment by Andrei Elkin [ 2022-11-11 ]

serg, I verified the description that claims a mere SELECT from a MEMORY table causes
DELETE event into binlog, e.g of the slave running with `--log-bin`. That's a legacy of a 2004 commit that coded open_table_entry_fini() uninterested in
whether the read or write access the table is going to provide.

My suggestion is to binlog DELETE before the very first write access, on both master and slave.

Comment by Sergei Golubchik [ 2022-11-17 ]

What use case will it solve? An empty MEMORY table that is only selected from but is never written to?

Comment by Andrei Elkin [ 2022-11-18 ]

(After self-reviewing to improve on few points)

serg, first not just 'selected from' but merely touched, like with COM_FIELD_LIST.

I have only one case in my mind. A listener/learned/slave process is subscribed for changes in the table.
(So it does not care what SELECT would do)
Clearly the reported past-crash-restart DELETE logging on such process could at least wait until the first write operation has arrived at it. If that one would be from a local connection DELETE would be logged. In other words, DELETE gets binlogged only at doing local writes.
Otherwise (it's from replication) it should not.
The first replicated event is recommended (to the user application) to be DELETE which
Valerii pointed to

MEMORY table is not empty on the master that the data are originally from (*or DELETEs are logged there already and will be replicated*),

So if it does so this replication setup is safe, else (the first event after the restart is not the DELETE) slave would error out volunteerly at once with a message like "Replicated event to the restarted slave is not an expected DELETE" (nothing gets binlogged of course for the table).

Comment by Andrei Elkin [ 2023-01-09 ]

valerii,serg, (sorry for couple of typos, corrected) while the docs are made to discourage Memory table replication, we could do more to enforce
that. E.g with automatic @@skip_replication = ON for the duration of a statement that creates Memory table create/modify events.
This measure would at least not punish those who did not really mean this kind's replication but are currently punished with a "mystique" gtid OOO error, for unreading the docs.

As there might exist users who would insist to replicate it, @@skip_replication could be converted into an enum with the default automatic or smth value with its meaning to automatically evaluate to OFF. The taking risk users would have to set it ON explicitly.

Comment by Andrei Elkin [ 2023-01-10 ]

maxmether, MDEV-25607 has a master side flavor. It's fine to auto-delete from Memory table on the restarting master, but the deletion can't be a query that may trigger a trigger (so TRUNCATE is safe).

Comment by Sergei Golubchik [ 2023-02-05 ]

We've established that if the MEMORY table on the slave is cleared because the server was restarted — this table cannot be part of the replication set, must've been ignored by the replication. Otherwise the replication will break because table's content on the slave will differ from the master's.

But this slave might be a master for some other slave or at least can write its own binlog (for fail-over or backup). In this case it can put the auto-delete into the next existing GTID that would've used this table anyway. Like INSERT...SELECT in SBR or any row event that uses this table in RBR.

Meaning the auto-delete binlog write would happen not on the first MEMORY table access (like, SELECT) but on the first binlog-relevant table access. And it would not need a GTID

Generated at Thu Feb 08 10:11:21 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.