Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4991

GTID binlog indexing

    XMLWordPrintable

Details

    Description

      Current GTID code needs to scan one binlog file from the beginning when a
      slave connects, to find the place to start replicating. If slave reconnects
      are frequent, this can be a performance regression, as in earlier versions
      slaves could immediate seek to the supplied file offset.

      To fix this problem, indexing should be done on the binlog files, allowing to
      quickly locate any GTID in the binlog file. As an added bonus, this would
      allow to detect if old-style replication tries to connect at an incorrect file
      offset (eg. in the middle of an event), avoiding sending potentially corrupted
      events.

      The index could be an extra file master-bin.000001.idx written in parallel
      with the binlog file. There is no need to flush or sync the file at every
      binlog write, as it can be recovered easily in case of crash or code can fall
      back to scanning the corresponding binlog file.

      The index would be page-based, allowing a connecting slave to do binary search
      to find the desired location in the binlog to start replication. The file
      would contain an ordered sequence of GTID binlog states with their
      corresponding start offset into the associated binlog file.

      The connecting slave would then binary-search for its start position in the index, and this way be able to jump directly to the right start position in the binlog file, without needing to scan that binlog from the start. This would greatly improve performance when many slaves connect simultaneously.
      A B-tree like structure is more efficient for disk-based searching than binary search.
      Since we write out the index record in order, we can actually build a B-tree
      append-only to the index file. At the end, the root node will be the last page in the
      file.

      There is no need to include every position in the index. We can write say one
      in every 10 transactions into the index; a connecting slave will then lookup
      the closest matching position in the index and at most need to skip over 10
      transactions in the binlog. In general, we can keep track of the size of
      binlog written and index written, and write only a fraction of transactions
      into the index to ensure that the ratio of index size to binlog size does not
      exceed some appropriate number (eg. 2% or something).

      To further reduce the index size, it could be "compressed" by omitting from entries those (domain_id, server_id) combinations that do not change. Typically, there can be many distinct such values in a binlog file, but only a few of them are likely to change within one given file.

      A work-in-progress high-level design description:

        This implements an on-disk index for each binlog file to speed up access to
        the binlog at a specific offset or GTID position. This is primarily used when
        a slave connects to the master, but also by user calling BINLOG_GTID_POS().
       
        A connecting slave first scans the binlog files to find the last one with an
        initial GTID_LIST event that lies before the starting GTID position. Then a
        sequential scan of the binlog file is done until the requested GTID position
        is found.
       
        The binlog index conceptually extends this using index records corresponding
        to different offset within one binlog file. Each record functions as if it
        was the initial GTID_LIST event of a new binlog file, allowing the
        sequential scan to start from the corresponding position. By having
        sufficiently many index records, the scan will be fast.
       
        The code has a performance-critical "sync" path which is called while holding
        LOCK_log whenever a new GTID is added to a binlog file. And a less critical
        "async" path which runs in the binlog background thread and does most of the
        processing. The "sync" and "async" paths each run single threaded, but can
        execute in parallel with each other.
       
        The index file is written incrementally together with the binlog file.
        However there is no fsync()'s of the index file needed while writing. A
        partially written index left by a crashing server will be re-written during
        binlog recovery. A reader is allowed to use the index as it is begin written
        (for the "hot" binlog file); such access is protected by mutex.
       
        In case of lost or corrupt index, fallback to full sequential scan is done
        (so performance will be affected but not correct functionality).
       
        The index file is structured like a B+-tree. The index is append-only, so
        also resembles a log-structured merge-tree, but with no merging of levels
        needed as it covers a single fixed-size binlog file. This makes the building
        of the tree relatively simple.
       
        Keys in the tree consist of a GTID state (corresponding to a GTID_LIST
        event) and the associated binlog file offset. All keys (except the first key
        in each level of the tree) are delta-compressed to save space, holding only
        the (domain_id, server_id) pairs that differ from the previous record.
       
        The file is page-based. The first page contains the leftmost leaf node, and
        the root node is at the end of the file. An incompletely written index file
        can be detected by the last page in the file not being a root node page.
        Nodes in the B+-tree usually fit in one page, but a node can be split across
        multiple pages if GTID states are very large.
       
        ToDo: Document the page /indexfile format.
       
        Here is an example index file in schematic form:
       
             S0 D1 D2    D3 D4 D5    D6 D7 D8    D9 D10 D11
          A(S0 D1 D2) B(D3 D4 D5) C(D6 D7 D8) E(D9 D10) F(D11)
              D(A <S3> B <D4+D5+D6> C)   G(E <D10+D11> F)
                              H(D <S9> G)
       
        S0 is the full initial GTID state at the start of the file.
        D1-D11 are the differential GTID states in the binlog file; eg. they could
            be the individual GTIDs in the binlog file if a record is writte for
            each GTID.
        S3 is the full GTID state corresponding to D3, ie. S3=S0+D1+D2+D3.
        A(), B(), ..., H() are the nodes in the binlog index. H is the root.
        A(S0 D1 D2) is a leaf node containing records S0, D1, and D2.
        G(E <D10+D11> F) is an interior node with key <D10+D11> and child pointers to
            E and F.
       
        To find eg. S4, we start from the root H. S4<S9, so we follow the left child
        pointer to D. S4>S3, so we follow the child pointer to leaf node C.
       
        Here are the operations that occur while writing the example index file:
       
          S0  A(A) R(A,S0)
          D1       R(A,D1)
          D2       R(A,D2)
          D3  W(A) I(D) P(D,A) A(B) R(B,D3) R(D,S3)
          D4       R(A,D4)
          D5       R(A,D5)
          D6  W(B) P(D,B) A(C) R(C,D6) R(D,D4+D5+D6)
          D7       R(C,D7)
          D8       R(C,D8)
          D9  W(C) P(D,C) A(E) R(E,D9) W(D) I(H) P(H,D) R(H,S9)
          D10      R(E,D10)
          D11 W(E) I(G) P(G,E) A(F) R(F,S10) R(G,D10+D11)
          <EOF> W(F) P(G,F) W(G) P(H,G) W(H)
       
          A(x)   -> allocate leaf node x.
          R(x,k) -> insert an index record containing key k in node x.
          W(x)   -> write node x to the index file.
          I(y)   -> allocate interior node y.
          P(y,x) -> insert a child pointer to y in x.
      

      Attachments

        Issue Links

          Activity

            People

              Roel Roel Van de Paar
              knielsen Kristian Nielsen
              Votes:
              5 Vote for this issue
              Watchers:
              22 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.