Details

    • Type: Technical task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Component/s: Replication
    • Sprint: Q3/2025 Maintenance, Q4/2025 Server Maintenance

    Description

      • knielsen said that GTID replication does not purge the relay logs if only one of the IO and SQL threads is stopped.
        Then why isn't it already capable of resuming from those retained relay logs? (See the thread stop/start sketch after this list.)
        • Because the stateless code must assume that the shutdown that stopped the threads may not have been graceful (i.e., that it came from a crash), whereas it knows the server has not crashed as long as one of the threads is still running.
      • Are all GTID config changes between thread STOP/START consistent (persistence aside)?
        • (Non-key) CHANGE MASTER TO
          • The IO thread applies DO_DOMAIN_IDS/IGNORE_DOMAIN_IDS in the same way it processes IGNORE_SERVER_IDS.
          • That pair and MASTER_USE_GTID are the only CHANGE MASTER TO options that are exclusive to GTID mode (see the CHANGE MASTER TO sketch after this list).
            (They are also the only MariaDB-exclusive options, as MySQL instead has an unrelated transaction-UUID system.)
        • @@GLOBAL vars that apply to all connections
          • For @@gtid_ignore_duplicates, the “allow connecting at a nonexistent GTID position” part belongs to the IO thread’s connection to the primary.
            The SQL thread invokes the core functionality of ignoring event groups whose GTIDs already exist.
          • Replication filters
          • gtid_slave_pos and gtid_pos_auto_engines inherently determine the GTID-mode SQL thread position and its per-engine partitioning.
      • Given that the proposed @@GLOBAL.gtid_io_pos would not be @@GLOBAL.gtid_slave_pos, what would it be?
        • Something in SHOW SLAVE STATUS (SSS)? Gtid_IO_Pos?
          • This is it, but it is only an in-memory C++ variable and is not persisted anywhere (see the positions sketch after this list).
        • Something in the master info or the relay log info?
          • They only record non-GTID positions (file-position). GTID mode stores its SQL thread position in the mysql.gtid_slave_pos table.
        • Must it scan the relay log, especially for the crash recovery case?
          • Binary logs (relay logs included) have an index file that lists the log files. However, the log class is not designed for random access, so after fetching the latest file name from the index, the scan must be driven manually (e.g., starting with open_binlog()).
        • Similarly, @@gtid_binlog_pos (and by extension @@gtid_current_pos) is also an in-memory state cache, but one that is additionally written to a log-bin.state file.
          When that state file is unavailable or possibly out of date, the recovery process rebuilds the @@gtid_binlog_pos state from the latest binary log.
        • Additionally, when only the IO thread is restarted, GTID mode already keeps a counter so the IO thread can skip the events of an event group that it has already written to the relay log.
          This means the full form of a GTID-based position also carries an event sub-index.
          We can build on this mechanism to avoid truncating a partial transaction whose events are all complete.
          Of course, partial (torn) events, e.g. from severe power or hardware failures, are still a concern.
          • In both file-position and GTID modes, the SQL thread updates the relay log info file (@@relay_log_info_file) when it switches which log file it reads, so it does not need to scan the log directory.
          • What are the contents of MDEV-4991’s GTID indices, and could they help here?
            • This is a sparse (non-exhaustive) index that maps GTIDs to positions per file.
              It is also not (at least originally) saved with durability in mind, as the plan is to fall back to linear scanning whenever this persistent cache is unavailable.
      • The primary can locate replica-requested GTIDs in the binary logs even without the GTID indices.
        How can the replica leverage this procedure to find where it left off in the relay logs?
        • This is a hard-coded loop that skips events in groups and per domain ID.
          Refactoring it for reuse is only feasible as a feature-sized patch, and such a patch might as well also DRY up mariadb-binlog.
      • Why is --gtid-strict-mode a concern? Isn't out-of-order GTID always an error?
        • The infamous repeated-events problem applies to GTID as well.
          Therefore, the new “@@GLOBAL.gtid_io_pos” must always refer to the last event group, whether that means including the partial transaction (with complete, readable events) at the end or truncating it.
      • What pattern do the GTIDs of a partial transaction have?
        • The GTID_EVENT is the first event of the transaction’s event group and represents the START TRANSACTION query (see the event-group sketch after this list).
      • How concerning is the mentioned MDEV-33268?
        • If the IO thread resumes mid-transaction (as opposed to the partial transaction being truncated from the relay log), it must not count relay-log-exclusive (i.e., IO-thread-generated) events when dropping re-replicated binlog events.
          From there, MDEV-33268 becomes an independent concern.
        • The IO thread counts the mid-transaction events so that it can skip enqueuing them after it re-ignores any redownloaded events.
          This means the skip-enqueue count stays consistent with the mid-transaction event count even if the IO thread is unable to write_ignored_events_info_to_relay_log().
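
      A minimal sketch of the thread stop/start scenario from the first item above (the thread stop/start sketch), assuming a default, unnamed replication connection; stopping a single thread leaves the other one running, which is how the server knows it has not crashed.

        -- Stop only the SQL thread; the IO thread keeps running, so (per the
        -- notes above) the server is known not to have crashed and the relay
        -- logs are retained.
        STOP SLAVE SQL_THREAD;
        -- ... do whatever required the pause ...
        START SLAVE SQL_THREAD;

        -- Stopping both threads is the case where today's code cannot tell a
        -- graceful stop from a crash on the next start.
        STOP SLAVE;
        START SLAVE;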
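
      A sketch of the GTID-only settings named above (the CHANGE MASTER TO sketch); the domain IDs, GTID value, and engine list are placeholders, not values from this ticket.

        STOP SLAVE;

        -- The only CHANGE MASTER TO options exclusive to GTID mode:
        CHANGE MASTER TO
          MASTER_USE_GTID = slave_pos,
          DO_DOMAIN_IDS = (1, 2),   -- applied by the IO thread, like IGNORE_SERVER_IDS
          IGNORE_DOMAIN_IDS = ();   -- only one of the two lists may be non-empty

        -- @@GLOBAL variables that apply to all connections:
        SET GLOBAL gtid_ignore_duplicates = ON;
        SET GLOBAL gtid_slave_pos        = '1-11-100';  -- GTID-mode SQL thread position
        SET GLOBAL gtid_pos_auto_engines = 'InnoDB';    -- per-engine gtid_slave_pos tables

        START SLAVE;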
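
      A sketch of where each position discussed above can be observed (the positions sketch); the variable, column, and table names are existing MariaDB ones, and the comments restate the notes above.

        -- IO-thread GTID position: only the in-memory Gtid_IO_Pos field.
        SHOW SLAVE STATUS\G

        -- SQL-thread GTID position: cached in a variable, persisted in a table.
        SELECT @@GLOBAL.gtid_slave_pos;
        SELECT * FROM mysql.gtid_slave_pos;

        -- Binlog GTID state: cached in variables, also written to the binlog
        -- state file, and rebuilt from the newest binary log during recovery.
        SELECT @@GLOBAL.gtid_binlog_pos, @@GLOBAL.gtid_current_pos;

        -- The master info and relay log info repositories hold only the
        -- old-style file + offset positions; the relay log itself can be
        -- inspected manually.
        SHOW RELAYLOG EVENTS LIMIT 5;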
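
      To make the shape of an event group concrete (the event-group sketch), a hedged example of listing a relay log; the log file name is a placeholder, and the commented output is only the typical pattern, not captured output.

        SHOW RELAYLOG EVENTS IN 'mariadb-relay-bin.000002' LIMIT 10;
        -- A transactional event group usually looks like:
        --   Gtid (e.g. 0-1-42)       -- first event; stands in for BEGIN/START TRANSACTION
        --   Query / Annotate_rows / Table_map / Write_rows_v1 ...  -- the payload
        --   Xid (or a COMMIT Query)  -- last event of a complete group
        -- A partial transaction at the end of the relay log is such a group
        -- whose closing Xid/COMMIT event has not been written yet.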

    People

      Assignee: ParadoxV5 Jimmy Hú
      Reporter: ParadoxV5 Jimmy Hú