MariaDB Server: MDEV-4991

GTID binlog indexing




      The current GTID code needs to scan one binlog file from the beginning when a
      slave connects, to find the place to start replicating. If slave reconnects
      are frequent, this can be a performance regression, as in earlier versions
      slaves could immediately seek to the supplied file offset.

      To fix this problem, the binlog files should be indexed, allowing any GTID in
      a binlog file to be located quickly. As an added bonus, this would make it
      possible to detect when old-style replication tries to connect at an
      incorrect file offset (eg. in the middle of an event), avoiding sending
      potentially corrupted events.
      The index could be an extra file master-bin.000001.idx written in parallel
      with the binlog file. There is no need to flush or sync the file at every
      binlog write, as it can be recovered easily in case of a crash, or the code
      can fall back to scanning the corresponding binlog file.

      The index would be page-based, allowing a connecting slave to do binary search
      to find the desired location in the binlog to start replication. The file
      would contain an ordered sequence of GTID binlog states with their
      corresponding start offset into the associated binlog file.

      The connecting slave would then binary-search for its start position in the
      index, and this way be able to jump directly to the right start position in
      the binlog file, without needing to scan that binlog from the start. This
      would greatly improve performance when many slaves connect simultaneously.
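      The lookup described above can be sketched as follows. This is a simplified
      illustration, not the server's actual code: a single integer sequence number
      stands in for the full GTID binlog state, and the names (IndexRecord,
      lookup_start_offset) are made up for the example.

      ```cpp
      #include <cstdint>
      #include <vector>

      // Hypothetical, simplified index record: the real index stores a full
      // GTID binlog state per entry; a single seq_no stands in for that here.
      struct IndexRecord {
        uint64_t seq_no;        // stand-in for the GTID binlog state
        uint64_t binlog_offset; // byte offset into the binlog file
      };

      // Binary-search the ordered records for the last entry whose state is
      // <= the slave's requested position. The slave then scans forward from
      // the returned offset; with enough index records the scan is short.
      uint64_t lookup_start_offset(const std::vector<IndexRecord> &index,
                                   uint64_t wanted_seq_no) {
        size_t lo = 0, hi = index.size();
        while (lo < hi) {
          size_t mid = lo + (hi - lo) / 2;
          if (index[mid].seq_no <= wanted_seq_no)
            lo = mid + 1;
          else
            hi = mid;
        }
        // lo is the first record past the wanted position; start at the
        // record before it, or at offset 0 if every record is larger.
        return lo == 0 ? 0 : index[lo - 1].binlog_offset;
      }
      ```

      If the wanted position falls between two index records, the slave lands on
      the earlier one and scans forward sequentially from there.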
      A B-tree-like structure is more efficient for disk-based searching than
      plain binary search. Since we write the index records in order, we can
      actually build a B-tree append-only in the index file. At the end, the root
      node will be the last page in the file.
      There is no need to include every position in the index. We can write, say,
      one in every 10 transactions into the index; a connecting slave will then
      look up the closest matching position in the index and at most need to skip
      over 10 transactions in the binlog. In general, we can keep track of the
      amount of binlog written and index written, and write only a fraction of
      transactions into the index to ensure that the ratio of index size to binlog
      size does not exceed some appropriate number (eg. 2% or so).
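      One possible sampling policy (a sketch only; the class and field names are
      hypothetical, and the real implementation may choose differently): track the
      bytes written to binlog and index, and emit an index record only when doing
      so keeps the size ratio at or below the target.

      ```cpp
      #include <cstdint>

      // Hypothetical sampling policy: write an index record for a transaction
      // only when doing so keeps index_bytes / binlog_bytes <= max_ratio.
      struct IndexSampler {
        uint64_t binlog_bytes = 0;
        uint64_t index_bytes = 0;
        double max_ratio;

        explicit IndexSampler(double ratio) : max_ratio(ratio) {}

        // Called for each committed transaction. Returns true if an index
        // record of record_size bytes should be written for it.
        bool should_index(uint64_t event_size, uint64_t record_size) {
          binlog_bytes += event_size;
          if ((double)(index_bytes + record_size) <=
              max_ratio * (double)binlog_bytes) {
            index_bytes += record_size;
            return true;
          }
          return false;
        }
      };
      ```

      With a 2% target, 1000-byte transactions, and 40-byte index records, this
      indexes roughly every other transaction, keeping the ratio at the cap.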

      To further reduce the index size, it could be "compressed" by omitting from
      each entry the (domain_id, server_id) combinations that did not change since
      the previous entry. Typically, there can be many distinct such values in a
      binlog file, but only a few of them are likely to change within one given
      file.
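      The delta compression could look roughly like this (a sketch with made-up
      names; a GTID state is modelled as a map from (domain_id, server_id) to the
      last seq_no, which matches the concept but not the server's actual data
      structures):

      ```cpp
      #include <cstdint>
      #include <map>
      #include <utility>

      // A GTID binlog state: last seq_no seen per (domain_id, server_id).
      using GtidState = std::map<std::pair<uint32_t, uint32_t>, uint64_t>;

      // Hypothetical delta compression: keep only the (domain_id, server_id)
      // entries whose seq_no changed relative to the previous index record.
      GtidState delta_compress(const GtidState &prev, const GtidState &cur) {
        GtidState delta;
        for (const auto &e : cur) {
          auto it = prev.find(e.first);
          if (it == prev.end() || it->second != e.second)
            delta.insert(e);
        }
        return delta;
      }

      // Reconstruct the full state by applying a delta on top of the previous
      // full state (what a reader does while scanning index records). Entries
      // never disappear from a binlog state, so insert-or-overwrite suffices.
      GtidState apply_delta(GtidState prev, const GtidState &delta) {
        for (const auto &e : delta)
          prev[e.first] = e.second;
        return prev;
      }
      ```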

      A work-in-progress high-level design description:

        This implements an on-disk index for each binlog file to speed up access to
        the binlog at a specific offset or GTID position. This is primarily used when
        a slave connects to the master, but also by a user calling BINLOG_GTID_POS().
        A connecting slave first scans the binlog files to find the last one with an
        initial GTID_LIST event that lies before the starting GTID position. Then a
        sequential scan of the binlog file is done until the requested GTID position
        is found.
        The binlog index conceptually extends this using index records corresponding
        to different offsets within one binlog file. Each record functions as if it
        were the initial GTID_LIST event of a new binlog file, allowing the
        sequential scan to start from the corresponding position. By having
        sufficiently many index records, the scan will be fast.
        The code has a performance-critical "sync" path, which is called while
        holding LOCK_log whenever a new GTID is added to a binlog file, and a less
        critical "async" path, which runs in the binlog background thread and does
        most of the processing. The "sync" and "async" paths each run single
        threaded, but can execute in parallel with each other.
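        As a rough illustration of this split (hypothetical types and names, not
        the actual server code), the sync path can do nothing more than append to
        a mutex-protected queue, which the background thread later drains:

        ```cpp
        #include <cstdint>
        #include <deque>
        #include <mutex>
        #include <vector>

        struct PendingGtid {
          uint64_t seq_no;  // stand-in for the GTID added to the binlog
          uint64_t offset;  // its offset in the binlog file
        };

        class GtidIndexWriter {
          std::mutex mtx;
          std::deque<PendingGtid> pending;

        public:
          std::vector<PendingGtid> written;  // stand-in for on-disk records

          // "Sync" path, called under LOCK_log: cheap, just enqueue.
          void record_gtid(uint64_t seq_no, uint64_t offset) {
            std::lock_guard<std::mutex> g(mtx);
            pending.push_back({seq_no, offset});
          }

          // "Async" path, binlog background thread: drain the queue and do
          // the heavy lifting. Each path is single-threaded, but the two may
          // run concurrently, hence the mutex around the shared queue.
          size_t process_pending() {
            std::deque<PendingGtid> batch;
            {
              std::lock_guard<std::mutex> g(mtx);
              batch.swap(pending);
            }
            for (const auto &p : batch)
              written.push_back(p);  // real code appends index records here
            return batch.size();
          }
        };
        ```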
        The index file is written incrementally together with the binlog file.
        However, no fsync()s of the index file are needed while writing. A
        partially written index left by a crashing server will be re-written during
        binlog recovery. A reader is allowed to use the index as it is being written
        (for the "hot" binlog file); such access is protected by a mutex.
        In case of lost or corrupt index, fallback to full sequential scan is done
        (so performance will be affected but not correct functionality).
        The index file is structured like a B+-tree. The index is append-only, so
        also resembles a log-structured merge-tree, but with no merging of levels
        needed as it covers a single fixed-size binlog file. This makes the building
        of the tree relatively simple.
        Keys in the tree consist of a GTID state (corresponding to a GTID_LIST
        event) and the associated binlog file offset. All keys (except the first key
        in each level of the tree) are delta-compressed to save space, holding only
        the (domain_id, server_id) pairs that differ from the previous record.
        The file is page-based. The first page contains the leftmost leaf node, and
        the root node is at the end of the file. An incompletely written index file
        can be detected by the last page in the file not being a root node page.
        Nodes in the B+-tree usually fit in one page, but a node can be split across
        multiple pages if GTID states are very large.
        ToDo: Document the page/index file format.
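        Since the on-disk page format is not yet documented here, the following
        only illustrates the recovery check described above, with an invented
        page flag: a complete index must end with a root-node page.

        ```cpp
        #include <cstdint>
        #include <vector>

        // Hypothetical page flags; the real format is still to be documented.
        constexpr uint8_t PAGE_FLAG_ROOT = 0x01;
        constexpr uint8_t PAGE_FLAG_LEAF = 0x02;

        struct Page {
          uint8_t flags = 0;
          // ... keys, child pointers, GTID state data ...
        };

        // An index file is complete iff it is non-empty and its last page is
        // a root-node page; otherwise a crash interrupted the writer and the
        // index must be rebuilt (or the reader falls back to a sequential
        // scan of the binlog file).
        bool index_is_complete(const std::vector<Page> &pages) {
          return !pages.empty() && (pages.back().flags & PAGE_FLAG_ROOT);
        }
        ```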
        Here is an example index file in schematic form:
             S0 D1 D2    D3 D4 D5    D6 D7 D8    D9 D10 D11
          A(S0 D1 D2) B(D3 D4 D5) C(D6 D7 D8) E(D9 D10) F(D11)
              D(A <S3> B <D4+D5+D6> C)   G(E <D10+D11> F)
                              H(D <S9> G)
        S0 is the full initial GTID state at the start of the file.
        D1-D11 are the differential GTID states in the binlog file; eg. they could
            be the individual GTIDs in the binlog file if a record is written for
            each GTID.
        S3 is the full GTID state corresponding to D3, ie. S3=S0+D1+D2+D3.
        A(), B(), ..., H() are the nodes in the binlog index. H is the root.
        A(S0 D1 D2) is a leaf node containing records S0, D1, and D2.
        G(E <D10+D11> F) is an interior node with key <D10+D11> and child pointers to
            E and F.
        To find eg. S4, we start from the root H. S4<S9, so we follow the left child
        pointer to D. S3<S4<S6, so we follow the middle child pointer to leaf node B.
        Here are the operations that occur while writing the example index file:
          S0  A(A) R(A,S0)
          D1       R(A,D1)
          D2       R(A,D2)
          D3  W(A) I(D) P(D,A) A(B) R(B,D3) R(D,S3)
          D4       R(B,D4)
          D5       R(B,D5)
          D6  W(B) P(D,B) A(C) R(C,D6) R(D,D4+D5+D6)
          D7       R(C,D7)
          D8       R(C,D8)
          D9  W(C) P(D,C) A(E) R(E,D9) W(D) I(H) P(H,D) R(H,S9)
          D10      R(E,D10)
          D11 W(E) I(G) P(G,E) A(F) R(F,D11) R(G,D10+D11)
          <EOF> W(F) P(G,F) W(G) P(H,G) W(H)
          A(x)   -> allocate leaf node x.
          R(x,k) -> insert an index record containing key k in node x.
          W(x)   -> write node x to the index file.
          I(y)   -> allocate interior node y.
          P(y,x) -> insert a child pointer to x in node y.
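        The operation sequence above can be reproduced, in spirit, by a small
        append-only bulk-load. The sketch below is simplified and hypothetical
        (integer keys instead of GTID states, a fixed fanout instead of
        page-size-based splitting, and invented names): leaves are written as
        soon as they fill, interior nodes are buffered in memory and written
        once full, and finish() flushes the partial node on each level bottom-up
        so the root ends up as the last page of the file.

        ```cpp
        #include <cstdint>
        #include <vector>

        struct Node {
          int id;
          bool leaf;
          uint64_t first_key = 0;      // smallest key reachable under this node
          std::vector<uint64_t> keys;  // records (leaf) or separator keys
          std::vector<int> children;   // child node ids (interior only)
        };

        class AppendOnlyBuilder {
          size_t fanout;               // max records per leaf / children per node
          int next_id = 0;
          std::vector<Node> levels;    // levels[0] = current leaf, above = interior

          Node fresh(bool leaf) { return Node{next_id++, leaf}; }

          // Link a finished (already written) child into the buffered parent
          // at `lvl`; write the parent and cascade upward when it fills.
          void promote(size_t lvl, int child_id, uint64_t child_first_key) {
            if (lvl == levels.size())
              levels.push_back(fresh(false));
            Node &n = levels[lvl];
            if (n.children.empty())
              n.first_key = child_first_key;
            else
              n.keys.push_back(child_first_key);  // separator = new child's first key
            n.children.push_back(child_id);
            if (n.children.size() == fanout) {    // full: append to file
              Node done = n;
              levels[lvl] = fresh(false);
              written.push_back(done);
              promote(lvl + 1, done.id, done.first_key);
            }
          }

        public:
          std::vector<Node> written;   // stand-in for pages appended to the file

          explicit AppendOnlyBuilder(size_t f) : fanout(f) {}

          void add_record(uint64_t key) {
            if (levels.empty())
              levels.push_back(fresh(true));
            Node &leaf = levels[0];
            if (leaf.keys.empty())
              leaf.first_key = key;
            leaf.keys.push_back(key);
            if (leaf.keys.size() == fanout) {     // leaf full: write it out
              Node done = leaf;
              levels[0] = fresh(true);
              written.push_back(done);
              promote(1, done.id, done.first_key);
            }
          }

          // At end of the binlog file: flush the partial node on each level
          // bottom-up. Returns the id of the root (the last page written).
          int finish() {
            int child = -1;
            uint64_t child_key = 0;
            for (size_t lvl = 0; lvl < levels.size(); lvl++) {
              Node &n = levels[lvl];
              if (child >= 0) {                   // attach node flushed below
                if (n.children.empty())
                  n.first_key = child_key;
                else
                  n.keys.push_back(child_key);
                n.children.push_back(child);
              }
              if (n.keys.empty() && n.children.empty())
                continue;                         // nothing buffered here
              if (lvl + 1 == levels.size() && !n.leaf && n.children.size() == 1) {
                child = n.children[0];            // redundant single-child top
                continue;
              }
              written.push_back(n);
              child = n.id;
              child_key = n.first_key;
            }
            return child;
          }
        };
        ```

        With fanout 3 and 12 records this yields four leaves, two interior nodes
        and a root, with every node written only after all of its children.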


              Roel Van de Paar
              Kristian Nielsen


