Details
- Type: New Feature
- Status: Closed
- Priority: Critical
- Resolution: Fixed
Description
The current GTID code needs to scan one binlog file from the beginning when a
slave connects, to find the place to start replicating. If slave reconnects
are frequent, this can be a performance regression, as in earlier versions
slaves could immediately seek to the supplied file offset.
To fix this problem, the binlog files should be indexed, allowing any GTID in
a binlog file to be located quickly. As an added bonus, this would make it
possible to detect when old-style replication tries to connect at an invalid
file offset (e.g. in the middle of an event), avoiding sending potentially
corrupted events.
The index could be an extra file master-bin.000001.idx written in parallel
with the binlog file. There is no need to flush or sync this file at every
binlog write, as it can easily be recovered after a crash, or the code can
fall back to scanning the corresponding binlog file.
The index would be page-based, allowing a connecting slave to do binary search
to find the desired location in the binlog to start replication. The file
would contain an ordered sequence of GTID binlog states with their
corresponding start offset into the associated binlog file.
The connecting slave would then binary-search the index for its start position, jumping directly to the right place in the binlog file without scanning that binlog from the start. This would greatly improve performance when many slaves connect simultaneously.
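As a rough sketch of what such a lookup could look like (the names, types and the simplified one-number-per-domain GTID representation below are invented for illustration, not the actual server code): because the binlog GTID state only grows with the file offset, the predicate "this record's state is already contained in the slave's position" is monotone over the records, so the last matching record can be found with an ordinary binary search.

  // Sketch of the lookup: binary-search the (GTID state, offset) records for the
  // last state the connecting slave has already replicated. Names and the
  // simplified GTID representation (domain_id -> seq_no only) are hypothetical.
  #include <cstddef>
  #include <cstdint>
  #include <cstdio>
  #include <map>
  #include <vector>

  using GtidState = std::map<uint32_t, uint64_t>;   // domain_id -> highest seq_no

  struct IndexRecord {
    GtidState state;    // binlog GTID state at this point of the file
    uint64_t offset;    // byte offset into the binlog file
  };

  // True if everything recorded in `state` is already contained in `slave_pos`.
  bool contained_in(const GtidState &state, const GtidState &slave_pos)
  {
    for (const auto &e : state) {
      auto it = slave_pos.find(e.first);
      if (it == slave_pos.end() || it->second < e.second)
        return false;
    }
    return true;
  }

  // The binlog state only grows with the offset, so "contained in slave_pos" is
  // monotone over the records and the last matching one can be binary-searched.
  uint64_t find_start_offset(const std::vector<IndexRecord> &index,
                             const GtidState &slave_pos)
  {
    std::size_t lo = 0, hi = index.size();   // invariant: records [0, lo) match
    while (lo < hi) {
      std::size_t mid = lo + (hi - lo) / 2;
      if (contained_in(index[mid].state, slave_pos))
        lo = mid + 1;
      else
        hi = mid;
    }
    return lo ? index[lo - 1].offset : 0;    // 0: scan from the start of the file
  }

  int main()
  {
    std::vector<IndexRecord> index = {
        {{{0, 10}}, 4096}, {{{0, 20}}, 8192}, {{{0, 30}}, 12288}};
    GtidState slave_pos = {{0, 25}};         // slave already has seq_no 25 in domain 0
    std::printf("start at offset %llu\n",    // prints 8192 for this data
                (unsigned long long) find_start_offset(index, slave_pos));
    return 0;
  }

The returned offset is only a lower bound; the server would still scan forward from there to the exact GTID position.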
A B-tree-like structure is more efficient for disk-based searching than a
plain binary search. Since the index records are written out in order, the
B-tree can actually be built append-only into the index file; at the end, the
root node will be the last page in the file.
There is no need to include every position in the index. We could write, say,
one in every 10 transactions into the index; a connecting slave will then look
up the closest matching position in the index and at most need to skip over 10
transactions in the binlog. In general, we can keep track of how much binlog
and how much index has been written, and write only a fraction of transactions
into the index, ensuring that the ratio of index size to binlog size does not
exceed some appropriate number (e.g. 2% or so).
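A minimal sketch of such a size-ratio policy (the struct, function names and byte counts are made up for illustration, not the actual implementation): track the bytes written to the binlog and to the index, and emit an index record only when doing so keeps the index under the target fraction of the binlog size.

  // Sketch: decide whether to write an index record for the next GTID, so that
  // the index stays below roughly `target_ratio` of the binlog size.
  // Names and the exact policy are hypothetical.
  #include <cstdint>
  #include <cstdio>

  struct SparseIndexPolicy {
    uint64_t binlog_bytes = 0;   // bytes written to the binlog file so far
    uint64_t index_bytes = 0;    // bytes written to the index file so far
    double target_ratio = 0.02;  // keep the index around 2% of the binlog size

    // Called for every GTID; `record_size` is the size the index record would have.
    bool should_index(uint64_t record_size)
    {
      if (index_bytes + record_size <= target_ratio * binlog_bytes) {
        index_bytes += record_size;
        return true;   // write this record into the index
      }
      return false;    // skip; the slave will scan at most a few transactions extra
    }
  };

  int main()
  {
    SparseIndexPolicy policy;
    uint64_t pos = 0;
    for (int i = 0; i < 20; i++) {
      pos += 1000;                     // pretend each transaction is ~1000 bytes
      policy.binlog_bytes = pos;
      if (policy.should_index(60))     // ~60-byte index record
        std::printf("index record at binlog offset %llu\n",
                    (unsigned long long) pos);
    }
    return 0;
  }

With ~1000-byte transactions and ~60-byte records, this ends up indexing roughly every third transaction, i.e. about 2% of the binlog volume.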
To further reduce the index size, entries could be "compressed" by omitting the (domain_id, server_id) combinations that did not change since the previous entry. Typically, there can be many distinct such values in a binlog file, but only a few of them are likely to change within any given file.
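A sketch of that delta idea, using an invented in-memory representation rather than the on-disk format: each entry stores only the (domain_id, server_id) -> seq_no pairs that changed since the previous entry, and the full state at any entry is recovered by applying the deltas to the last full state.

  // Sketch: delta-compress a sequence of GTID states by keeping, per entry,
  // only the (domain_id, server_id) -> seq_no pairs that changed.
  // The representation is hypothetical, not the on-disk format.
  #include <cstdint>
  #include <map>
  #include <utility>

  using GtidState = std::map<std::pair<uint32_t, uint32_t>, uint64_t>; // (domain, server) -> seq

  // Entries whose value differs from (or is missing in) `prev`.
  GtidState delta(const GtidState &prev, const GtidState &cur)
  {
    GtidState d;
    for (const auto &e : cur) {
      auto it = prev.find(e.first);
      if (it == prev.end() || it->second != e.second)
        d.insert(e);
    }
    return d;
  }

  // Reconstruct the full state: start from a full state and apply deltas in order.
  GtidState apply_delta(GtidState base, const GtidState &d)
  {
    for (const auto &e : d)
      base[e.first] = e.second;   // overwrite or add the changed pair
    return base;
  }

  int main()
  {
    GtidState s0 = {{{0, 1}, 100}, {{1, 2}, 50}};  // full initial state
    GtidState s1 = s0;
    s1[{0, 1}] = 101;                              // only domain 0 advanced
    GtidState d1 = delta(s0, s1);                  // contains just {(0,1) -> 101}
    GtidState back = apply_delta(s0, d1);          // reconstructs s1
    return back == s1 ? 0 : 1;
  }
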
A work-in-progress high-level design description:
This implements an on-disk index for each binlog file to speed up access to
the binlog at a specific offset or GTID position. This is primarily used when
a slave connects to the master, but also by a user calling BINLOG_GTID_POS().

A connecting slave first scans the binlog files to find the last one with an
initial GTID_LIST event that lies before the starting GTID position. Then a
sequential scan of the binlog file is done until the requested GTID position
is found.

The binlog index conceptually extends this using index records corresponding
to different offsets within one binlog file. Each record functions as if it
was the initial GTID_LIST event of a new binlog file, allowing the sequential
scan to start from the corresponding position. By having sufficiently many
index records, the scan will be fast.

The code has a performance-critical "sync" path which is called while holding
LOCK_log whenever a new GTID is added to a binlog file, and a less critical
"async" path which runs in the binlog background thread and does most of the
processing. The "sync" and "async" paths each run single-threaded, but can
execute in parallel with each other.

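Schematically, the split could look like the following (all names are invented; this only illustrates the threading model, not the real server code): the path running under LOCK_log just appends the new offset/GTID to an in-memory queue, and a single background thread drains the queue and does the actual index writing. Each side runs single-threaded, but they overlap with each other.

  // Sketch of the sync/async split: the binlog-writer side only enqueues work;
  // a background thread does the index processing. All names are hypothetical.
  #include <condition_variable>
  #include <cstdint>
  #include <cstdio>
  #include <mutex>
  #include <queue>
  #include <thread>

  struct PendingRecord { uint64_t offset; uint64_t seq_no; };

  std::mutex mtx;
  std::condition_variable cv;
  std::queue<PendingRecord> pending;
  bool done = false;

  // "Sync" path: in the real server this would run while holding LOCK_log,
  // so it only queues the work and returns.
  void note_gtid_written(uint64_t offset, uint64_t seq_no)
  {
    std::lock_guard<std::mutex> lk(mtx);
    pending.push({offset, seq_no});
    cv.notify_one();
  }

  // "Async" path: stands in for the binlog background thread; it does the
  // expensive part (building pages, writing the index file).
  void background_thread()
  {
    for (;;) {
      std::unique_lock<std::mutex> lk(mtx);
      cv.wait(lk, [] { return !pending.empty() || done; });
      if (pending.empty())
        return;                          // done and nothing left to process
      PendingRecord r = pending.front();
      pending.pop();
      lk.unlock();
      // ... the real code would append an index record / B+-tree pages here ...
      std::printf("indexed seq_no %llu at offset %llu\n",
                  (unsigned long long) r.seq_no, (unsigned long long) r.offset);
    }
  }

  int main()
  {
    std::thread worker(background_thread);
    for (uint64_t i = 1; i <= 5; i++)
      note_gtid_written(i * 1000, i);    // as if one GTID per ~1000 binlog bytes
    {
      std::lock_guard<std::mutex> lk(mtx);
      done = true;
    }
    cv.notify_one();
    worker.join();
    return 0;
  }
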
The index file is written incrementally together with the binlog file.
However, no fsync() of the index file is needed while writing. A partially
written index left by a crashing server will be re-written during binlog
recovery. A reader is allowed to use the index while it is being written
(for the "hot" binlog file); such access is protected by a mutex.

In case of a lost or corrupt index, a fallback to full sequential scan is done
(so performance is affected, but not correctness).

The index file is structured like a B+-tree. The index is append-only, so it
also resembles a log-structured merge-tree, but with no merging of levels
needed as it covers a single fixed-size binlog file. This makes the building
of the tree relatively simple.

Keys in the tree consist of a GTID state (corresponding to a GTID_LIST
event) and the associated binlog file offset. All keys (except the first key
in each level of the tree) are delta-compressed to save space, holding only
the (domain_id, server_id) pairs that differ from the previous record.

The file is page-based. The first page contains the leftmost leaf node, and
the root node is at the end of the file. An incompletely written index file
can be detected by the last page in the file not being a root node page.
Nodes in the B+-tree usually fit in one page, but a node can be split across
multiple pages if GTID states are very large.

ToDo: Document the page / index file format.

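Until that format is documented, the following is only an illustration of the recovery check described above, with an entirely invented page layout (4 KB pages, a root flag in the first byte of each page): after a crash, reading just the last page tells whether the index is complete or must be discarded and rebuilt.

  // Sketch: an index file is complete only if its last page is a root-node page.
  // The page size and flag layout are invented; the real on-disk format is a
  // ToDo above.
  #include <cstdio>

  static const long PAGE_SIZE = 4096;                 // invented page size
  static const unsigned char PAGE_FLAG_ROOT = 0x01;   // invented flag in byte 0

  // An index file is complete only if it ends on a page boundary and that last
  // page is a root-node page; otherwise the writer crashed mid-index and the
  // file should be rebuilt (or ignored) during binlog recovery.
  bool index_file_is_complete(const char *path)
  {
    std::FILE *f = std::fopen(path, "rb");
    if (!f)
      return false;
    bool ok = false;
    if (std::fseek(f, 0, SEEK_END) == 0) {
      long size = std::ftell(f);
      if (size >= PAGE_SIZE && size % PAGE_SIZE == 0 &&
          std::fseek(f, size - PAGE_SIZE, SEEK_SET) == 0) {
        unsigned char flags = 0;
        if (std::fread(&flags, 1, 1, f) == 1)
          ok = (flags & PAGE_FLAG_ROOT) != 0;   // last page must be the root
      }
    }
    std::fclose(f);
    return ok;
  }

  int main(int argc, char **argv)
  {
    if (argc > 1)
      std::printf("%s: %s\n", argv[1],
                  index_file_is_complete(argv[1]) ? "complete" : "rebuild/ignore");
    return 0;
  }
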
Here is an example index file in schematic form:

  S0 D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11

  A(S0 D1 D2) B(D3 D4 D5) C(D6 D7 D8) E(D9 D10) F(D11)
        D(A <S3> B <D4+D5+D6> C)   G(E <D10+D11> F)
                H(D <S9> G)

S0 is the full initial GTID state at the start of the file.
D1-D11 are the differential GTID states in the binlog file; e.g. they could
be the individual GTIDs in the binlog file if a record is written for
each GTID.
S3 is the full GTID state corresponding to D3, i.e. S3=S0+D1+D2+D3.
A(), B(), ..., H() are the nodes in the binlog index. H is the root.
A(S0 D1 D2) is a leaf node containing records S0, D1, and D2.
G(E <D10+D11> F) is an interior node with key <D10+D11> and child pointers to
E and F.

To find e.g. S4, we start from the root H. S4 < S9, so we follow the left
child pointer to D. S4 > S3 but S4 does not exceed S3+D4+D5+D6, so we follow
the middle child pointer to leaf node B, which contains the record for D4.

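The same walkthrough as a small sketch, with GTID states reduced to a single integer position (S3 -> 3, S9 -> 9, and so on) purely for illustration; the node contents mirror the example tree above.

  // Sketch of the search walkthrough on the example tree, with states as ints.
  #include <cstddef>
  #include <cstdio>
  #include <vector>

  struct BNode {
    std::vector<int> keys;               // separator keys (upper bound of each child)
    std::vector<const BNode*> children;  // keys.size() + 1 children; empty for leaves
    std::vector<int> records;            // leaf records (0 stands for S0, 1 for D1, ...)
  };

  // Descend from the root to the leaf that covers `target`.
  const BNode *find_leaf(const BNode *node, int target)
  {
    while (!node->children.empty()) {            // descend until we hit a leaf
      std::size_t i = 0;
      while (i < node->keys.size() && target >= node->keys[i])
        i++;                                     // child i covers states below keys[i]
      node = node->children[i];
    }
    return node;
  }

  int main()
  {
    // The example tree: leaves A, B, C, E, F; interior D, G; root H.
    BNode A{{}, {}, {0, 1, 2}}, B{{}, {}, {3, 4, 5}}, C{{}, {}, {6, 7, 8}};
    BNode E{{}, {}, {9, 10}}, F{{}, {}, {11}};
    BNode D{{3, 6}, {&A, &B, &C}, {}};   // keys S3 and S3+D4+D5+D6
    BNode G{{11}, {&E, &F}, {}};         // key D10+D11 (i.e. the state at D11)
    BNode H{{9}, {&D, &G}, {}};          // key S9
    const BNode *leaf = find_leaf(&H, 4);          // looking for S4
    std::printf("leaf records:");
    for (int r : leaf->records) std::printf(" D%d", r);
    std::printf("\n");                   // prints the leaf containing D4 (node B)
    return 0;
  }
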
Here are the operations that occur while writing the example index file:

  S0    A(A) R(A,S0)
  D1    R(A,D1)
  D2    R(A,D2)
  D3    W(A) I(D) P(D,A) A(B) R(B,D3) R(D,S3)
  D4    R(B,D4)
  D5    R(B,D5)
  D6    W(B) P(D,B) A(C) R(C,D6) R(D,D4+D5+D6)
  D7    R(C,D7)
  D8    R(C,D8)
  D9    W(C) P(D,C) A(E) R(E,D9) W(D) I(H) P(H,D) R(H,S9)
  D10   R(E,D10)
  D11   W(E) I(G) P(G,E) A(F) R(F,D11) R(G,D10+D11)
  <EOF> W(F) P(G,F) W(G) P(H,G) W(H)

A(x)   -> allocate leaf node x.
R(x,k) -> insert an index record containing key k in node x.
W(x)   -> write node x to the index file.
I(y)   -> allocate interior node y.
P(y,x) -> insert a child pointer to x in interior node y.

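The trace corresponds to a simple append-only build: keep one open node per tree level; when a node fills up, write it out and push the new key up as a separator into the (possibly newly allocated) parent; at end-of-file, flush the open nodes bottom-up so the root is the last node written. A sketch follows; note that the real code fills pages by byte size and delta-compresses the separator keys (S3, D4+D5+D6 above), whereas this sketch uses a fixed fan-out of three plain string keys and omits child pointers, so it produces a flatter tree than the example.

  // Sketch of the append-only build shown in the trace above. One open node is
  // kept per tree level; the root is always the last node written.
  #include <cstddef>
  #include <cstdio>
  #include <string>
  #include <vector>

  static const std::size_t FANOUT = 3;   // stand-in for "page is full"

  struct Node { std::vector<std::string> keys; };

  std::vector<Node> open_nodes;           // open_nodes[0] = current leaf

  void write_node(std::size_t level, Node &n)   // append the node to the index file
  {
    std::printf("W(level %zu:", level);
    for (const auto &k : n.keys) std::printf(" %s", k.c_str());
    std::printf(")\n");
    n.keys.clear();                       // reuse the slot as the next node on this level
  }

  void push_separator(std::size_t level, const std::string &key)
  {
    if (level >= open_nodes.size())
      open_nodes.push_back(Node{});       // I(): first node on a new level
    Node &n = open_nodes[level];
    if (n.keys.size() == FANOUT) {        // parent full: write it, push key higher up
      write_node(level, n);
      push_separator(level + 1, key);
      return;
    }
    n.keys.push_back(key);                // R() into the interior node
  }

  void add_record(const std::string &key)
  {
    if (open_nodes.empty())
      open_nodes.push_back(Node{});       // A(): first leaf
    Node &leaf = open_nodes[0];
    if (leaf.keys.size() == FANOUT) {     // leaf full: W() it; the new record both
      write_node(0, leaf);                // starts a fresh leaf and becomes a
      push_separator(1, key);             // separator key in the parent
    }
    open_nodes[0].keys.push_back(key);    // R() into the (possibly fresh) leaf
  }

  void finish()                           // <EOF>: flush open nodes bottom-up
  {
    for (std::size_t level = 0; level < open_nodes.size(); level++)
      if (!open_nodes[level].keys.empty())
        write_node(level, open_nodes[level]);   // the last node written is the root
  }

  int main()
  {
    const char *keys[] = {"S0", "D1", "D2", "D3", "D4", "D5",
                          "D6", "D7", "D8", "D9", "D10", "D11"};
    for (const char *k : keys)
      add_record(k);
    finish();
    return 0;
  }
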
Attachments
Issue Links
- blocks
  - MDEV-25764 SELECT binlog_gtid_pos takes a very long time when binary log is encrypted (Closed)
- includes
  - MDEV-25392 IO thread reporting yes despite failing to fetch GTID (Open)
- relates to
  - MDEV-33426 Assertion `status_var.local_memory_used == 0 || !debug_assert_on_not_freed_memory' failed in THD::~THD from handle_slave_sql on slave upon INSERT to TEMPORARY Aria table, Memory not freed: -50616 (In Testing)