Details

    • Task
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 10.0.4
    • None
    • None

    Description

      Global transaction ID

      The purpose of Global transaction ID (GTID) is to make slave position
      independent of internal details of master's binlog (file name, file
      offset). This allows a simpler switch of a slave to a new master, as the
      current slave position is valid on the new master as well as the old.

      The GTID is a tag that is attached to every transaction. It must be
      globally unique, it must go into the binary log together with the
      transaction itself, it must be replicated to slaves, and it must be
      preserved when a transaction is applied on the slave. That is, no matter
      how many slaves the transaction was replicated through (in a complex
      replication graph), the transaction should have the same GTID on every
      slave and in all binary logs on slaves.

      As part of this task, we will also make the slave replication state crash safe
      (can be recovered after a crash in a transactionally safe way).

      Gtid_log_event

      A global transaction ID is a pair (server_id, seq_no). The server_id already
      exists to identify events originating at different servers. The seq_no is
      new: it is a 64-bit unsigned integer that increases monotonically (not
      necessarily without holes) at each commit on the master.

      Every binlog event group (e.g. transaction, DDL, non-transactional statement)
      is annotated with its global transaction ID. Most event groups are already
      bracketed with BEGIN/COMMIT events. We replace the BEGIN event with a new
      Gtid_log_event. This event contains the seq_no of the global transaction
      ID for the following event group (the server_id is stored in the event header
      of every event). We also include some flag bits.

      On the slave, the Gtid_log_event is applied like the BEGIN query
      event. However, the seq_no from the event is remembered, and preserved when
      the replicated event is binlogged, just like server_id currently is. A session
      variable pseudo_seq_no is introduced (requires SUPER to change) to similarly
      override seq_no; mysqlbinlog uses it to preserve the GTID for
      mysqlbinlog|mysql style binlog apply.

      For event groups that currently have no BEGIN/COMMIT around them, a
      Gtid_log_event is added before the group. This event has a flag set to let
      the slave know that no COMMIT will follow, so it applies only to the
      following event group. This is used for e.g. DDL and certain out-of-band
      events like TRUNCATE of a HEAP table after master restart.

      Old MariaDB slaves (or MySQL slaves) will not understand the new
      Gtid_log_event. We handle this using the existing mechanism for replacing
      events that old slaves cannot handle. For Gtid_log_event with no corresponding
      COMMIT we can just omit it or replace with a dummy event as appropriate. For
      the case with a corresponding COMMIT, we replace the Gtid_log_event with a
      normal BEGIN query event.

      To be able to do this, we make the Gtid_log_event 38 bytes long (19 bytes
      header + 19 bytes body, with a few unused bytes reserved for future
      expansion), as this is the minimum size for a BEGIN query event. Incidentally,
      the current BEGIN event takes 68 bytes due to redundant information, so
      introducing GTID still saves 30 bytes for every event group.
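
      The size arithmetic can be checked with a small sketch. The 19-byte header
      below follows the common binlog event header (timestamp, type code,
      server_id, event length, next position, flags). The body layout (8-byte
      seq_no, 1 flag byte, 10 reserved bytes), the type code value, the flag
      name and the function name are assumptions made for illustration only;
      they are not specified in this description.

```python
import struct

# Common binlog event header: timestamp(4) type(1) server_id(4)
# event_length(4) next_position(4) flags(2) = 19 bytes, little-endian.
HEADER = struct.Struct("<IBIIIH")
# Hypothetical body layout: seq_no(8) flags(1) reserved(10) = 19 bytes.
BODY = struct.Struct("<QB10x")

GTID_EVENT_TYPE = 0xA2   # placeholder type code, assumed for this sketch
FLAG_STANDALONE = 0x01   # hypothetical flag: no COMMIT will follow


def pack_gtid_event(timestamp, server_id, seq_no, flags, offset):
    """Serialize a fixed-size Gtid_log_event that starts at binlog offset."""
    size = HEADER.size + BODY.size
    header = HEADER.pack(timestamp, GTID_EVENT_TYPE, server_id,
                         size, offset + size, 0)
    return header + BODY.pack(seq_no, flags)


event = pack_gtid_event(0, server_id=1, seq_no=42,
                        flags=FLAG_STANDALONE, offset=4)
print(len(event))        # 38
print(68 - len(event))   # 30 bytes saved against the 68-byte BEGIN event
```

      Keeping the event at exactly the size of a minimal BEGIN query event is
      what lets the master rewrite it in place for old slaves.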

      GTID in binlog

      Whenever we write a new event group to the binlog, we allocate the next seq_no
      and put it in a Gtid_log_event at the start of the group. When we write
      through a cache, we put 0 there and back-patch the proper seq_no when we write
      the cache to the binlog, so that seq_no ordering follows commit order and
      binlog order (per server-id).
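
      The back-patching step could look roughly like the following sketch. It
      assumes, hypothetically, that seq_no occupies the first 8 bytes of the
      event body right after the 19-byte header; the class and function names
      are invented for the example.

```python
import struct

HEADER_SIZE = 19                 # common event header
GTID_EVENT_SIZE = 38             # fixed Gtid_log_event size
SEQ_NO = struct.Struct("<Q")     # assumed: seq_no is the first 8 body bytes


def new_cache(payload):
    """Start a cached event group: Gtid_log_event with seq_no left as 0."""
    return bytearray(GTID_EVENT_SIZE) + payload


class Binlog:
    """Toy binlog that allocates seq_no only when a cache is written out."""

    def __init__(self, last_seq_no=0):
        self.last_seq_no = last_seq_no
        self.data = bytearray()

    def commit_cache(self, cache):
        # Allocate and write in one step, so seq_no order equals binlog order.
        self.last_seq_no += 1
        SEQ_NO.pack_into(cache, HEADER_SIZE, self.last_seq_no)
        self.data += cache
        return self.last_seq_no


log = Binlog()
cache_a = new_cache(b"A")        # transaction A starts first ...
cache_b = new_cache(b"B")
first = log.commit_cache(cache_b)    # ... but B commits first
second = log.commit_cache(cache_a)
print(first, second)             # 1 2
```

      Because the number is assigned at commit time rather than when the
      transaction starts, a transaction that starts early but commits late
      cannot end up with a seq_no out of binlog order.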

      When the server shuts down, we write (and fsync()) the current seq_no to a
      file gtid.info in the data directory. When we start up, we read it back in to
      resume from the correct number. We fsync() the gtid.info file before marking
      the last binlog file as cleanly closed. If at startup we detect that the last
      binlog was not cleanly closed, we do the normal binlog crash recovery, and as
      part of that extract the last used seq_no from the events seen there, instead
      of relying on the (probably corrupt) gtid.info.

      The server remembers (eg. in a hash) the last seq_no seen for every
      server_id. When a new binlog file is written, this list of GTIDs is written
      out at the start of the binlog as a new Gtid_list_log_event.

      This makes it possible to find the location in the server binlogs of any
      given GTID (server_id, seq_no): start from the last binlog file, and scan
      backwards. For each binlog file, read the Gtid_list_log_event at the start.
      If server_id is found with a lower-or-equal seq_no, then the GTID is found in
      this file, and we can scan forward until we find it. If server_id is found
      with a bigger seq_no, then the GTID exists in an earlier binlog file (or has
      been purged if this is true for every binlog file). If server_id is not found
      in the Gtid_list_log_event, then the GTID was never seen by this server.
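
      A minimal sketch of this backwards search, with each binlog file modelled
      as a (name, gtid_list) pair where gtid_list maps server_id to the seq_no
      recorded in that file's Gtid_list_log_event. The function name and data
      model are assumptions. One interpretation is added: if a server_id is
      missing from a file's list but present in a newer file's list, its events
      are taken to begin in that file. The sketch does not scan inside files, so
      like the rule above it reports "never seen" based on the lists alone.

```python
def find_binlog_file(binlogs, server_id, seq_no):
    """Return the binlog file to scan forward in for (server_id, seq_no).

    binlogs is ordered oldest first. Returns None if the server_id was never
    seen, and raises LookupError if the GTID has been purged.
    """
    seen_in_newer = False
    for name, gtid_list in reversed(binlogs):
        last = gtid_list.get(server_id)
        if last is None:
            # Not listed at the start of this file: either the server_id
            # first appears here, or it was never seen at all.
            return name if seen_in_newer else None
        if last <= seq_no:
            return name              # the GTID is in this file
        seen_in_newer = True         # bigger seq_no: look in an earlier file
    raise LookupError("GTID has been purged from the binlogs")


binlogs = [
    ("binlog.000001", {1: 10}),
    ("binlog.000002", {1: 20, 2: 5}),
    ("binlog.000003", {1: 30, 2: 9}),
]
print(find_binlog_file(binlogs, 1, 25))   # binlog.000002
print(find_binlog_file(binlogs, 7, 1))    # None
```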

      Slave replication state

      Currently, the slave replication state is the master binlog file name and
      file offset of the last event group applied on a slave. This is needed when
      reconnecting a slave to a master to resume replication at the correct point
      in the binlog event stream and not lose or duplicate any events.

      When GTIDs are used, the slave replication state is instead the GTID of the
      last event group applied on the slave. However, because of multi-master
      replication (and later perhaps parallel replication that re-orders events),
      the replication state becomes a set of GTIDs last applied.

      The slave remembers the GTID with the maximal seq_no for every server_id value
      of every event group applied. If an event is received with lower-or-equal
      seq_no than already applied for that server_id, then the event is ignored
      (same as currently when receiving an event with the slave's own server_id).
      This avoids duplicating events in circular replication topologies.
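
      The filtering rule can be sketched as a small state object that keeps the
      maximal seq_no per server_id. The class and method names are hypothetical.

```python
class SlaveState:
    """Replication state: maximal seq_no applied for each server_id."""

    def __init__(self):
        self.max_seq_no = {}

    def apply(self, server_id, seq_no):
        """Record the event group and return True, or return False if it
        must be ignored because it was already applied."""
        last = self.max_seq_no.get(server_id)
        if last is not None and seq_no <= last:
            return False
        self.max_seq_no[server_id] = seq_no
        return True


state = SlaveState()
# In a circular topology the group (1, 5) can arrive a second time.
incoming = [(1, 5), (2, 3), (1, 5), (1, 6)]
results = [state.apply(sid, seq) for sid, seq in incoming]
print(results)            # [True, True, False, True]
print(state.max_seq_no)   # {1: 6, 2: 3}
```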

      The slave replication state (list of GTIDs) is written to gtid.info at
      shutdown, and also written at the start of every slave binlog file. This makes
      it possible to recover after a slave crash, during the scan of the last
      binlog, in the same way as for seq_no on a master. We can in addition store it
      in relay-log.info if --log-slave-updates=OFF, so slaves can run with binlog
      disabled. However, this will be as crash-unsafe as current replication.

      When the slave connects to the master with GTID enabled, it no longer needs to
      send master binlog file name and file offset. Instead it sends its replication
      state as a list of GTIDs. The master will search back through its binlogs as
      described above for the earliest of these GTIDs, and start sending events from
      that point on back to the slave. Note that if the server_id of a GTID has
      never been seen by a master, it can be ignored; however if it is seen in some
      Gtid_list_log_event but has been purged, then it is an error and slave connect
      fails (same as currently if slave asks for a binlog file that has been
      purged).

      When a slave connects using the old style (binlog file name and file offset),
      the master will send back the replication state corresponding to this
      position. This allows automatic migration to GTID: the next time the
      slave reconnects, it can use the replication state obtained from the previous
      connect. It also makes it possible to provision a new slave from a backup made
      with mysqldump --master-data or XtraBackup. These provide the old-style binlog
      file name and position to use, and after the first connect to the master, the
      slave can automatically switch over to use global transaction IDs.

      Note that the list of GTIDs kept on master and on slave is in fact identical,
      even though it is used for different purposes on master and slave. Of course,
      it is possible for a server to be both a master and a slave, and use the
      information for both purposes.

      User interface

      The SHOW SLAVE STATUS command needs to be extended to also show the
      GTID-enabled replication state, which is the set of GTIDs with maximal seq_no
      per server-id applied on the slave. The SHOW MASTER STATUS should also show
      it.

      The CHANGE MASTER TO command can be used as before. If a GTID replication
      state is available, and no explicit filename/offset is given, and master
      supports GTID, then GTID will be used to automatically start from the correct
      event group.

      CHANGE MASTER should also be extended to allow specifying the current
      replication state. This should not normally be needed, but can be useful to
      experiment or recover from fatal server loss or corruption, etc. Specifying a
      GTID sets the maximal seq_no for the given server_id. Specifying NULL for
      seq_no in the GTID removes the server_id from the GTID list, leaving things as
      if that server_id was never seen before on the server.

      The slave_skip_counter can be used as before. It still records the GTID of any
      event group skipped.

      START SLAVE UNTIL is extended in syntax to take a GTID. It stops the slave
      when a GTID with the same server_id and greater-or-equal seq_no is reached (if
      equal, the event is applied on the slave before stopping). If given a list of
      GTIDs, it stops when any of them is reached. (Stopping when all GTIDs in a
      list have been reached can be achieved by a sequence of START SLAVE UNTIL
      commands).
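
      The stop condition can be sketched as follows. The function name is
      hypothetical, and the list of target GTIDs is modelled as a dict with at
      most one seq_no per server_id, which is an assumption of this example.

```python
def apply_until(events, until):
    """Return the event groups applied before the slave stops.

    events: iterable of (server_id, seq_no) in arrival order.
    until:  dict mapping server_id to the target seq_no.
    """
    applied = []
    for server_id, seq_no in events:
        limit = until.get(server_id)
        if limit is not None and seq_no > limit:
            break                    # past the target: stop without applying
        applied.append((server_id, seq_no))
        if limit is not None and seq_no == limit:
            break                    # the target itself is applied, then stop
    return applied


events = [(1, 1), (2, 1), (1, 2), (1, 3)]
print(apply_until(events, {1: 2}))           # [(1, 1), (2, 1), (1, 2)]
# seq_no may have holes: target 3 does not exist, so stop before (1, 4).
print(apply_until([(1, 1), (1, 4)], {1: 3}))  # [(1, 1)]
```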

      Switching to a new master

      Suppose we have a number of slaves replicating off of one master (or several
      masters with multi-source replication). GTIDs make it simpler to switch to
      using one of the slaves as a new master (because the original master died or
      is taken down for maintenance or whatever).

      In the simple case where there is no multi-source replication (and no parallel
      replication that reorders transactions, if that is later implemented), then
      the event stream is completely linear, and with the same sequence in every
      slave at any point in the replication hierarchy. Then if one runs SHOW SLAVE
      STATUS on each slave, there will be one slave that has greater-or-equal seq_no
      for each server_id than any other slave.

      Now we can simply promote that slave (there may be several equal to choose
      from) as the new master. All the other slaves can simply do CHANGE MASTER TO,
      specifying the connection details of the new master, and GTID will ensure that
      they continue at the correct position.
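
      Choosing the slave to promote can be sketched as below, with each slave's
      replication state (as it would be read from SHOW SLAVE STATUS) modelled as
      a dict from server_id to seq_no. The function names and the data model
      are assumptions of this example.

```python
def dominates(a, b):
    """True if state a has greater-or-equal seq_no for every server_id in b."""
    return all(a.get(sid, -1) >= seq for sid, seq in b.items())


def pick_new_master(states):
    """Return the name of a slave whose state is greater-or-equal to that of
    every other slave, or None if no such slave exists."""
    for name, state in states.items():
        if all(dominates(state, other) for other in states.values()):
            return name
    return None


linear = {"s1": {1: 100}, "s2": {1: 120}, "s3": {1: 110}}
print(pick_new_master(linear))    # s2

# With multi-source replication the states may not be comparable.
mixed = {"a": {1: 5, 2: 3}, "b": {1: 4, 2: 6}}
print(pick_new_master(mixed))     # None
```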

      Of course if the master can be stopped gracefully while switching slave, we
      can just let all slaves run until they have all replicated everything from the
      master. Then any slave can be promoted as the new master.

      If multi-source replication is used (or parallel replication, if that is
      later implemented), then matters can be more complex. Remember, though, that
      this is not the common case: it arises only with multi-source replication,
      where such a switch of master is probably uncommon, and with any later
      implemented parallel replication.

      Let us say that for two GTIDs we have

      (sid1,seq_no1) <= (sid2,seq_no2) iff sid1==sid2 && seq_no1 <= seq_no2

      It is possible that if the old master disappears abruptly (i.e. crashes), then
      for every pair of slaves S1, S2, the replication state of S1 and S2 may be
      incommensurable: there is a GTID1 on S1 with no greater GTID on S2, and
      likewise a GTID2 on S2 with no greater GTID on S1.

      In this case no slave is immediately ready to take over as master without some
      event groups being lost. However, we can still do a reliable master
      switch, as follows:

      First, arbitrarily pick any slave as the new master. Obtain the replication
      state (SHOW SLAVE STATUS) of all other slaves. The idea is that we will
      replicate all missing changes from every other slave to the newly selected
      master to make sure it has everything needed to fulfill the master role.

      For each slave S, we do a CHANGE MASTER TO on the server selected as new
      master. Then we START SLAVE UNTIL <gtid> for every gtid in the replication
      state of S. This ensures that we have every event group seen by S on the new
      master. By repeating for every slave server, we end up with a new master
      server that will have a replication state that is a superset of all the
      remaining slaves. We can then simply CHANGE MASTER TO on all the slaves, and
      continue.
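
      The procedure above can be sketched as a function that only plans the
      statements, without connecting to any server. The function name and data
      model are hypothetical, and the statements are left as placeholders in
      angle brackets because the exact GTID syntax is not fixed in this
      description.

```python
def plan_master_switch(new_master, slave_states):
    """Return an ordered list of (server, statement) pairs.

    slave_states maps each remaining slave name to its replication state,
    a dict from server_id to seq_no.
    """
    steps = []
    # Pull every event group seen by any slave onto the new master.
    for slave, state in slave_states.items():
        steps.append(
            (new_master, f"CHANGE MASTER TO <connection details of {slave}>"))
        for server_id, seq_no in sorted(state.items()):
            steps.append(
                (new_master, f"START SLAVE UNTIL <gtid ({server_id},{seq_no})>"))
    # Then repoint all remaining slaves at the new master.
    for slave in slave_states:
        steps.append(
            (slave, f"CHANGE MASTER TO <connection details of {new_master}>"))
    return steps


steps = plan_master_switch("s1", {"s2": {1: 10, 2: 7}, "s3": {1: 12}})
for server, statement in steps:
    print(server, statement)
```

      A real script would have to wait for each START SLAVE UNTIL to finish
      before issuing the next statement.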

      As part of this worklog, I will write a script that reads a list of connection
      strings to the set of slave servers, and goes through the above procedure,
      resulting in the first server on the list being promoted as the new master and
      every other slave on the list changed to replicate from the new master.

      Comparison with MySQL 5.6 global transaction ID

      The main motivation for this design is my dislike for the MySQL 5.6 global
      transaction ID design.

      Despite all of its flaws (mainly lack of robustness), MySQL replication has
      been extremely successful. I believe the reason is that it is transparent, in
      the sense that the way it works is conceptually simple to understand, and thus
      possible for users to tweak and manipulate. Replication consists simply of
      sending a stream of the changes done on a master to the slaves to be
      repeated. The slave replication state is simply the position in the stream.

      The design explained here preserves this conceptual simplicity. The
      replication state is still just the position in the stream. We just have a
      universal way to refer to that position that works on all servers across the
      replication topology. It is still possible to tweak and manipulate that
      position (slave_skip_counter, START SLAVE UNTIL, etc.).

      Multi-source replication makes the state more complex, since we now have a set
      of positions, but it is still conceptually sane.

      The MySQL 5.6 design, in contrast, loses the simple concept of replication
      state. The state is now an abstract set of all GTIDs ever applied on a slave,
      which is harder to grasp and manipulate. The concept of position in a
      replication stream becomes meaningless, as parallel replication can
      arbitrarily re-order events in the binlogs at different levels of the
      replication topology.

      (See below for plans on how to extend the design presented here to handle
      MySQL-style parallel replication in a way that tries to preserve the nice
      properties).

      Comparison with the Google transaction ID patch

      I think this approach is rather similar in concept to the Google transaction
      ID patch. The main differences are probably:

      • The use of the binlog for persisting the replication state in a crash-safe
        manner. I believe this is a better approach than what the Google patch
        does.
      • The extensions to handle multi-source replication, which is necessary as
        this feature has been introduced to MariaDB.

      Future expansion for parallel replication

      Parallel replication is not a part of this design. This task can be fully
      implemented as described here in a self-consistent way. However, parallel
      replication is an important feature, and this section describes how to plan
      for being able to extend GTIDs later in a nice way to handle parallel
      replication.

      Now, parallel replication can come in two variants. One is where transactions
      are run in parallel, but still committed in the same order on the slave as on
      the master. This in-order parallelism has no conflicts with the GTID design
      described here, and can be implemented independently. MWL#184 is an example of
      this.

      The other, out-of-order variant is what MySQL 5.6 does. Transactions are
      committed on a slave in a different order than on the master, so the slave
      binlog has transactions in a different order than the master. But this kind of
      parallelism can potentially achieve a higher degree of parallelism on the
      slave, so it can be desirable.

      Multi-source replication is related to out-of-order parallelism. On a
      multi-source slave, event groups are applied in parallel, but written
      interleaved with one another in some arbitrary way in a single
      binlog. So a multi-source slave S1 could itself be a master for a deeper-down
      slave S2, but currently events on S2 would have to be replicated in-order with
      no parallelism, which can be prohibitively expensive.

      With multi-source we have extra information available on S1: We know that
      events from two different upstream masters have no predefined ordering, while
      events from a single upstream master must be applied in the order given. If we
      record that information in the binlog of S1, then S2 can use that information
      to know that it is safe for it to also apply events from distinct upstream
      masters in parallel, but keep the ordering of events belonging to a single
      upstream master.

      If we do it this way, then we retain the nice conceptual understanding of a
      GTID as a well-defined position in a replication stream. If two GTIDs
      originate from the same upstream master, they have a well-defined ordering
      which will be preserved all across the replication topology.

      In effect we now have multiple replication streams in each binlog. Between
      different replication streams, there is no ordering implied. Within a
      single replication stream, however, GTIDs uniquely and consistently define a
      simple linear order.

      We can extend this by allowing multiple user-defined streams originating at a
      single server, with the application having the responsibility of ensuring that
      different streams are really independent. For example, we could create N
      streams based on a hash of the used database, to get different databases
      replicated in parallel just like the MySQL 5.6 MTS (multi-threaded slave)
      feature. But we still retain the concept of GTID as position in a linear
      stream, just with multiple streams possible.

      The full design of this will be written up in a different task. But we can
      prepare the global transaction ID design to be more easily extensible towards
      multiple binlog streams, without much extra effort.

      Whenever we store a global transaction ID (server_id, seq_no), we also store a
      replication stream ID. So we store this in Gtid_log_event and
      Gtid_list_log_event, write it in gtid.info (and relay-log.info), and remember
      it as part of the slave replication state. We can use a 32-bit unsigned
      integer as replication stream id.

      For the implementation of GTID, this will always be zero. Later, we can
      let multi-source replication assign a different replication
      stream ID to events from different upstream masters. The idea is that two
      event groups with distinct replication stream IDs can be replicated in
      parallel and committed in any order to the binlog. This will allow a
      downstream slave S2 to also parallelise events from different upstream
      masters.
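
      A minimal sketch of the extended identifier and of how a downstream slave
      might split an interleaved binlog into independent streams. The type and
      function names are assumptions of this example; only the three fields and
      the ordering rule come from the text above.

```python
from collections import defaultdict
from typing import NamedTuple


class Gtid(NamedTuple):
    stream_id: int   # 32-bit unsigned; always 0 in the initial implementation
    server_id: int
    seq_no: int      # 64-bit unsigned


def split_streams(events):
    """Partition event groups into per-stream queues.

    Order is preserved inside each queue; no order is implied between queues,
    so each queue could be handed to a separate applier.
    """
    queues = defaultdict(list)
    for gtid in events:
        queues[gtid.stream_id].append(gtid)
    return dict(queues)


binlog = [Gtid(1, 10, 1), Gtid(2, 20, 1), Gtid(1, 10, 2), Gtid(2, 20, 2)]
queues = split_streams(binlog)
print(sorted(queues))                    # [1, 2]
print([g.seq_no for g in queues[1]])     # [1, 2]
```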

      Similarly, by allowing applications to annotate different transactions with
      different stream IDs, we can achieve the same kind of parallel replication as
      MySQL 5.6, in a way that is both simpler and more flexible.

          Activity

            ratzpo Rasmus Johansson (Inactive) created issue -
            ratzpo Rasmus Johansson (Inactive) made changes -
            Field Original Value New Value
            Issue Type New Feature [ 2 ] Story [ 6 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            Issue Type Story [ 6 ] Task [ 3 ]
            serg Sergei Golubchik made changes -
            Workflow jira [ 10115 ] defaullt [ 10651 ]
            colin Colin Charles made changes -
            ratzpo Rasmus Johansson (Inactive) made changes -
            Assignee Rasmus Johansson [ ratzpo ] Kristian Nielsen [ knielsen ]
            serg Sergei Golubchik made changes -
            Description From MySQL 5.6
            so the slave state becomes recoverable, and facilitate automatic moving a slave to a new master across multi-level hierarchies.

            The *gtrid* is some kind of a tag, that is attached to every transaction. The point is - it must be globally unique, it must go into the binary log together with the transaction itself, it must be replicated to slaves, and it must be preserved when a transaction is applied on the slave. That is no matter how many slaves the transaction was replicated through (in a complex replication graph), on every slave and in all binary logs on slaves the transaction should have the same *gtrid*.

            there is a google patch that we might probably use.
            knielsen Kristian Nielsen made changes -
            Priority Major [ 3 ] Minor [ 4 ]
            serg Sergei Golubchik made changes -
            Priority Minor [ 4 ] Critical [ 2 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.0.1 [ 11400 ]
            Fix Version/s 10.0.0 [ 10000 ]
            knielsen Kristian Nielsen made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            knielsen Kristian Nielsen made changes -
            Description

            See also:

                http://askmonty.org/worklog/Server-Sprint/?tid=175
                http://askmonty.org/worklog/Server-RawIdeaBin/?tid=170
            knielsen Kristian Nielsen made changes -
            Description From MySQL 5.6
            so the slave state becomes recoverable, and facilitate automatic moving a slave to a new master across multi-level hierarchies.

            The *gtrid* is some kind of a tag, that is attached to every transaction. The point is - it must be globally unique, it must go into the binary log together with the transaction itself, it must be replicated to slaves, and it must be preserved when a transaction is applied on the slave. That is no matter how many slaves the transaction was replicated through (in a complex replication graph), on every slave and in all binary logs on slaves the transaction should have the same *gtrid*.

            there is a google patch that we might probably use.

            See also:

                http://askmonty.org/worklog/Server-Sprint/?tid=175
                http://askmonty.org/worklog/Server-RawIdeaBin/?tid=170
            Global transaction ID

            The purpose of Global transaction ID (GTID) is to make slave position
            independent of internal details of master's binlog (file name, file
            offset). This allows a simpler switch of a slave to a new master, as the
            current slave position is valid on the new master as well as the old.

            The gtrid is some kind of a tag, that is attached to every transaction. The
            point is - it must be globally unique, it must go into the binary log together
            with the transaction itself, it must be replicated to slaves, and it must be
            preserved when a transaction is applied on the slave. That is no matter how
            many slaves the transaction was replicated through (in a complex replication
            graph), on every slave and in all binary logs on slaves the transaction should
            have the same gtrid.

            As part of this task, we will also make the slave replication state crash safe
            (can be recovered after a crash in a transactionally safe way).


            Gtid_log_event

            A global trasaction id is a pair (server_id, seq_no). server_id already exists
            to identify events originating at different servers. The seq_no is new, it is
            a 64-bit unsigned integer that increases monotonically (not necessarily
            without holes) at each commit on the master.

            Every binlog event group (eg. transaction, DDL, non-transactional statement)
            is annotated with its global transaction ID. Most event groups are already
            bracketed with BEGIN/COMMIT events. We replace the BEGIN event with a new
            Gtid_Log_Event event. This event contains the seq_no of the global transaction
            id for the following event group (the server_id is stored in the event header
            of every event). We also include some flag bits.

            On the slave, the Gtid_Log_Event is applied like the BEGIN query
            event. However, the seq_no from the event is remembered, and preserved when
            the replicated event is binlogged, just like server_id currently is. A session
            variable pseudo_seq_no is introduced (requires SUPER to change) to similarly
            override seq_no, and used by mysqlbinlog to preserve gtid for
            mysqlbinlog|mysql style binlog apply.

            For event groups that currently have no BEGIN/COMMIT around them, a
            Gtid_Log_event is added before it. This event has a flag set to let the slave
            know that no COMMIT will follow, so it just applies to the following event
            group. This is for eg. DDL and certain out-of-band events like TRUNCATE of
            HEAP table after master restart.

            Old MariaDB slaves (or MySQL slaves) will not understand the new
            Gtid_log_event. We handle this using the existing mechanism for replacing
            events that old slaves cannot handle. For Gtid_log_event with no corresponding
            COMMIT we can just omit it or replace with a dummy event as appropriate. For
            the case with a corresponding COMMIT, we replace the Gtid_log_event with a
            normal BEGIN query event.

            To be able to do this, we make the Gtid_log_event be 38 bytes (19 bytes header
            + 19 bytes body), reserving a couple unused bytes for future expansin, as this
            is the minimum size for a BEGIN query event. Incidentally the current BEGIN
            event takes 68 bytes due to redundant information, so we still get 30 bytes
            space saved for every event group by introducing GTID.


            GTID in binlog

            Whenever we write a new event group to the binlog, we allocate the next seq_no
            and put it in a Gtid_log_event at the start of the group. When we write
            through a cache, we put 0 there and back-patch the proper seq_no when we write
            the cache to the binlog, so that seq_no ordering follows commit order and
            binlog order (per server-id).

            When the server shuts down, we write (and fsync()) the current seq_no to a
            file gtid.info in the data directory. When we startup, we read it back in to
            resume from the correct number. We fsync() the gtid.info file before marking
            the last binlog file as cleanly closed. And if at startup we detect that last
            binlog was not cleanly closed, we do the normal binlog crash recovery, and as
            part of that extract the last used seq_no from the events seen there, instead
            of relying on the (probably corrupt) gtid.info.

            The server remembers (eg. in a hash) the last seq_no seen for every
            server_id. When a new binlog file is written, this list of GTIDs is written
            out at the start of the binlog as a new Gtid_list_log_event.

            This allows to find the location in the server binlogs of any given GTID (sid,
            seq_no): start from the last binlog file, and scan backwards. For each binlog
            file, read the Gtid_list_log_event at the start. If sid is found with a
            lower-or-equal seq_no, then the GTID is found in this file, and we can scan
            forward until we find it. If sid is found with a bigger seq_no, then the GTID
            exists in an earlier binlog file (or has been purged if this is true for every
            binlog file). If sid is not found in the Gtid_list_log_event, then the GTID
            was never seen by this server.


            Slave replication state

            By "slave replication state", we denote what currently is the master binlog
            file name and file offset of the last event group applied on a slave. This is
            needed when reconnecting a slave to a master to resume replication at the
            correct point in the binlog event stream and not lose or duplicate any events.

            When GTIDs are used, the slave replication state is instead the GTID of the
            last event group applied on the slave. However, because of multi-master
            replication (and later perhaps parallel replication that re-orders events),
            the replication state becomes a _set_ of GTIDs last applied.

            The slave remembers the GTID with the maximal seq_no for every server_id value
            of every event group applied. If an event is received with lower-or-equal
            seq_no than already applied for that server_id, then the event is ignored
            (same as currently when receiving an event with own server_id). This avoids
            duplicating events in circular replication topologies.

            The slave replication state (list of GTIDs) is written to gtid.info at
            shutdown, and also written at the start of every slave binlog file. This makes
            it possible to recover after a slave crash, during the scan of the last
            binlog, in the same way as for seq_no on a master. In addition, we can store
            it in relay-log.info if --log-slave-updates=OFF, so that slaves can run with
            the binlog disabled; however, this will be as crash-unsafe as current
            replication.

            When the slave connects to the master with GTID enabled, it no longer needs to
            send master binlog file name and file offset. Instead it sends its replication
            state as a list of GTIDs. The master will search back through its binlogs as
            described above for the earliest of these GTIDs, and start sending events to
            the slave from that point on. Note that if the server_id of a GTID has
            never been seen by the master, it can be ignored; however, if it is seen in
            some Gtid_list_log_event but has been purged, then it is an error and the
            slave connect fails (same as currently if a slave asks for a binlog file that
            has been purged).

            When a slave connects using the old style (binlog file name and file offset),
            the master will send back the replication state corresponding to this
            position. This allows automatic migration to GTID; the next time the
            slave reconnects, it can use the replication state obtained from the previous
            connect. It also makes it possible to provision a new slave from a backup made
            with mysqldump --master-data or XtraBackup. These provide the old-style binlog
            file name and position to use, and after the first connect to the master, the
            slave can automatically switch over to using global transaction IDs.

            Note that the list of GTIDs kept on master and on slave is in fact identical,
            even though it is used for different purposes on master and slave. Of course,
            it is possible for a server to be both a master and a slave, and use the
            information for both purposes.


            User interface

            The SHOW SLAVE STATUS command needs to be extended to also show the
            GTID-enabled replication state, which is the set of GTIDs with maximal seq_no
            per server-id applied on the slave. The SHOW MASTER STATUS should also show
            it.

            The CHANGE MASTER TO command can be used as before. If a GTID replication
            state is available, and no explicit filename/offset is given, and master
            supports GTID, then GTID will be used to automatically start from the correct
            event group.

            CHANGE MASTER should also be extended to allow specifying the current
            replication state. This should not normally be needed, but can be useful to
            experiment or recover from fatal server loss or corruption, etc. Specifying a
            GTID sets the maximal seq_no for the given server_id. Specifying NULL for
            seq_no in the GTID removes the server_id from the GTID list, leaving things as
            if that server_id was never seen before on the server.

            The slave_skip_counter can be used as before. It still records the GTID of any
            event group skipped.

            START SLAVE UNTIL is extended in syntax to take a GTID. It stops the slave
            when a GTID with same server_id and greater-or-equal seq_no is reached (if
            equal, the event is applied on the slave before stopping). If given a list of
            GTIDs, stops when any of them is reached. (Stopping when all GTIDs in a list
            have been reached can be achieved by a sequence of START SLAVE UNTIL
            commands).
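
            The stop condition can be illustrated with a small sketch. The function
            run_until and its list-based model of the event stream are assumptions made
            for this example only; they are not part of any server interface.

```python
from typing import Iterable, List, Tuple

Gtid = Tuple[int, int]  # (server_id, seq_no)


def run_until(events: Iterable[Gtid], until: Iterable[Gtid]) -> List[Gtid]:
    """Apply event groups in order and stop when any target is reached.

    Returns the list of event groups that were applied before stopping.
    """
    targets = dict(until)  # server_id -> seq_no to stop at
    applied: List[Gtid] = []
    for server_id, seq_no in events:
        target = targets.get(server_id)
        if target is not None and seq_no > target:
            break  # already past the target: stop without applying
        applied.append((server_id, seq_no))
        if target is not None and seq_no == target:
            break  # exactly the target: apply it, then stop
    return applied
```

            If the list contains several GTIDs with the same server_id, this sketch keeps
            only the last one given.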


            Switching to a new master

            Suppose we have a number of slaves replicating off of one master (or several
            masters with multi-source replication). GTIDs make it simpler to switch to
            using one of the slaves as a new master (because the original master died or
            is taken down for maintenance or whatever).

            In the simple case where there is no multi-source replication (and no parallel
            replication that reorders transactions, if that is later implemented), then
            the event stream is completely linear, and with the same sequence in every
            slave at any point in the replication hierarchy. Then if one runs SHOW SLAVE
            STATUS on each slave, there will be one slave that has greater-or-equal seq_no
            for each server_id than any other slave.

            Now we can simply promote that slave (there may be several equal to choose
            from) as the new master. All the other slaves can simply do CHANGE MASTER TO,
            specifying the connection details of the new master, and GTID will ensure that
            they continue at the correct position.
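
            Choosing the slave to promote can be sketched like this, assuming the
            replication state of each slave has already been collected (for example from
            SHOW SLAVE STATUS) into a map from server_id to seq_no. The function names
            and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional

State = Dict[int, int]  # server_id -> maximal seq_no applied


def dominates(a: State, b: State) -> bool:
    """True if state `a` has greater-or-equal seq_no than `b` for every
    server_id present in `b`."""
    return all(sid in a and a[sid] >= seq_no for sid, seq_no in b.items())


def pick_new_master(states: Dict[str, State]) -> Optional[str]:
    """Return the name of a slave whose state dominates all the others,
    or None if no such slave exists."""
    for name, state in states.items():
        if all(dominates(state, other) for other in states.values()):
            return name
    return None
```

            When several slaves are equally far ahead, the first one in iteration order
            is returned; any of them would do.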

            Of course, if the master can be stopped gracefully during the switch, we
            can just let all slaves run until they have all replicated everything from the
            master. Then any slave can be promoted as the new master.

            If using multi-source replication (or parallel replication, if that is later
            implemented), then matters can be more complex. But remember that this is not
            the common case: it only arises with multi-source replication, where such a
            switch of master is probably uncommon, and with any later implemented parallel
            replication.

            Let us say that for two GTIDs we have

                (sid1,seq_no1) <= (sid2,seq_no2) iff sid1==sid2 && seq_no1 <= seq_no2

            It is possible that if the old master disappears abruptly (ie. crashes), then
            for every pair of slaves S1, S2, the replication state of S1 and S2 may be
            incommensurable: there is a GTID1 on S1 with no greater GTID on S2, and
            likewise a GTID2 on S2 with no greater GTID on S1.

            In this case no slave is immediately ready to take over as master without some
            event groups going missing. However, we can still do a reliable master
            switch, as follows:

            First, arbitrarily pick any slave as the new master. Obtain the replication
            state (SHOW SLAVE STATUS) of all other slaves. The idea is that we will
            replicate all missing changes from every other slave to the newly selected
            master to make sure it has everything needed to fulfill the master role.

            For each slave S, we do a CHANGE MASTER TO on the server selected as new
            master. Then we START SLAVE UNTIL <gtid> for every gtid in the replication
            state of S. This ensures that we have every event group seen by S on the new
            master. By repeating for every slave server, we end up with a new master
            server whose replication state is a superset of the states of all the
            remaining slaves. We can then simply CHANGE MASTER TO on all the slaves, and
            continue.
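
            The catch-up procedure can be modelled with a short sketch. It works on
            replication states only (maps from server_id to seq_no) and does not talk to
            any server; promotion_plan is a hypothetical helper. As a simplification, it
            lists only the GTIDs on which the new master is actually behind, whereas the
            procedure above issues START SLAVE UNTIL for every GTID in the state of S.

```python
from typing import Dict, List, Tuple

State = Dict[int, int]  # server_id -> maximal seq_no applied
Gtid = Tuple[int, int]  # (server_id, seq_no)


def promotion_plan(
    new_master: str, states: Dict[str, State]
) -> Tuple[List[Tuple[str, List[Gtid]]], State]:
    """Return the catch-up steps and the resulting state of the new master.

    Each step is (slave name, GTIDs the new master must replicate up to
    from that slave).
    """
    state = dict(states[new_master])
    plan: List[Tuple[str, List[Gtid]]] = []
    for name, slave_state in states.items():
        if name == new_master:
            continue
        missing = [
            (sid, seq_no)
            for sid, seq_no in sorted(slave_state.items())
            if sid not in state or state[sid] < seq_no
        ]
        if missing:
            plan.append((name, missing))
        for sid, seq_no in missing:
            state[sid] = seq_no  # new master has now caught up on sid
    return plan, state
```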

            As part of this worklog, I will write a script that reads a list of connection
            strings to the set of slave servers, and goes through the above procedure,
            resulting in the first server on the list being promoted as the new master and
            every other slave on the list changed to replicate from the new master.


            Comparison with MySQL 5.6 global transaction ID

            The main motivation for this design is my dislike for the MySQL 5.6 global
            transaction ID design.

            Despite all of its flaws (mainly lack of robustness), MySQL replication has
            been extremely successful. I believe the reason is that it is transparent, in the
            sense that the way it works is conceptually simple to understand, and thus
            possible to tweak and manipulate by users. Replication consists simply of
            sending a stream of the changes done on a master to the slaves to be
            repeated. The slave replication state is simply the position in the stream.

            The design explained here preserves this conceptual simplicity. The
            replication state is still just the position in the stream. We just have a
            universal way to refer to that position that works on all servers across the
            replication topology. It is still possible to tweak and manipulate that
            position (slave_skip_counter, START SLAVE UNTIL, etc.).

            Multi-source replication makes the state more complex, since we now have a set
            of positions, but it is still conceptually sane.

            The MySQL 5.6 design, in contrast, loses the simple concept of replication
            state. The state is now an abstract set of all GTIDs ever applied on a slave,
            which is harder to grasp and manipulate. The concept of position in a
            replication stream becomes meaningless, as parallel replication can
            arbitrarily re-order events in the binlogs at different levels of the
            replication topology.

            (See below for plans on how to extend the design presented here to handle
            MySQL-style parallel replication in a way that tries to preserve the nice
            properties).


            Comparison with the Google transaction ID patch

            I think this approach is rather similar in concept to the Google transaction
            ID patch. The main differences are probably:

             - The use of the binlog for persisting the replication state in a crash-safe
               manner. I believe this is a better approach than what the Google patch
               does.

             - The extensions to handle multi-source replication, which is necessary as
               this feature has been introduced to MariaDB.


            Future expansion for parallel replication

            Parallel replication is not a part of this design. This task can be fully
            implemented as described here in a self-consistent way. However, parallel
            replication _is_ an important feature, and this section describes how to plan
            for being able to extend GTIDs later in a nice way to handle parallel
            replication.

            Now, parallel replication can come in two variants. One is where transactions
            are run in parallel, but still committed in the same order on the slave as on
            the master. This in-order parallelism has no conflicts with the GTID design
            described here, and can be implemented independently. MWL#184 is an example of
            this.

            The other out-of-order variant is what MySQL 5.6 does. Transactions are
            committed on a slave in a different order than on the master, so the slave
            binlog has transactions in a different order than the master. But this kind of
            parallelism can potentially achieve a higher degree of parallelism on the
            slave, so it can be desirable.

            Multi-source replication is related to out-of-order parallelism. On a
            multi-source slave, event groups are applied in parallel, but written
            interleaved with one another in some arbitrary way in a single
            binlog. So a multi-source slave S1 could itself be a master for a deeper-down
            slave S2, but currently events on S2 would have to be replicated in-order with
            no parallelism, which can be prohibitively expensive.

            With multi-source we have extra information available on S1: We know that
            events from two different upstream masters have no predefined ordering, while
            events from a single upstream master must be applied in the order given. If we
            record that information in the binlog of S1, then S2 can use that information
            to know that it is safe for it to also apply events from distinct upstream
            masters in parallel, but keep the ordering of events belonging to a single
            upstream master.

            If we do it this way, then we retain the nice conceptual understanding of a
            GTID as a well-defined position in a replication stream. If two GTIDs
            originate from the same upstream master, they have a well-defined ordering
            which will be preserved all across the replication topology.

            In effect we now have multiple replication streams in each binlog. Between
            different replication streams, there is no ordering implied, however within a
            single replication stream GTIDs uniquely and consistently define a simple
            linear order.

            We can extend this by allowing multiple user-defined streams originating at a
            single server, with the application having the responsibility of ensuring that
            different streams are really independent. For example, we could create N
            streams based on a hash of the used database, to get different databases
            replicated in parallel, just as with the MySQL 5.6 MTS (multi-threaded slave)
            feature. But we still retain the concept of GTID as position in a linear
            stream - just with multiple streams possible.
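
            As a rough illustration of the hash-based assignment, the following sketch
            maps a database name to one of N streams. The function name and the choice of
            CRC32 are assumptions made for the example; the design does not prescribe a
            hash function. A stable hash is used because Python's built-in hash() of a
            string varies between processes.

```python
import zlib


def stream_for_database(db_name: str, n_streams: int) -> int:
    """Map a database name to a stream number in the range [0, n_streams)."""
    if n_streams <= 0:
        raise ValueError("n_streams must be positive")
    # CRC32 gives the same value on every server for the same name.
    return zlib.crc32(db_name.encode("utf-8")) % n_streams
```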

            The full design of this will be written up in a different task. But we can
            prepare the global transaction ID design to be more easily extensible towards
            multiple binlog streams, without much extra effort.

            Whenever we store a global transaction ID (server_id, seq_no), we also store a
            _replication stream ID_. So we store this in Gtid_log_event and
            Gtid_list_log_event, write it in gtid.info (and relay-log.info), and remember
            it as part of the slave replication state. We can use a 32-bit unsigned
            integer as replication stream id.

            For the implementation of GTID, this will always be zero. Later, we can
            let multi-source replication assign a different replication stream ID to
            events from different upstream masters. The idea is that two event groups
            with distinct replication stream IDs can be replicated in parallel and
            committed in any order to the binlog. This will allow a downstream slave S2
            to also parallelise events from different upstream masters.
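
            How a downstream slave might use the stream ID can be sketched as below. The
            binlog is modelled as a list of (stream_id, server_id, seq_no) tuples and
            split_into_streams is a hypothetical helper. Each resulting list must be
            applied in order, while different lists are independent of one another.

```python
from typing import Dict, Iterable, List, Tuple

Gtid = Tuple[int, int]  # (server_id, seq_no)


def split_into_streams(
    binlog: Iterable[Tuple[int, int, int]]
) -> Dict[int, List[Gtid]]:
    """Group event groups by stream_id, keeping binlog order per stream."""
    streams: Dict[int, List[Gtid]] = {}
    for stream_id, server_id, seq_no in binlog:
        streams.setdefault(stream_id, []).append((server_id, seq_no))
    return streams
```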

            Similarly, by allowing applications to annotate different transactions with
            different stream IDs, we can achieve the same kind of parallel replication as
            MySQL 5.6, in a way that is both simpler and more flexible.

            We can simplify the design somewhat if we put some extra requirements on the
            use of multi-source replication with global transaction ID.

            We might require that if global transaction ID will be used with multi-source,
            then each of the multiple masters must configure a different stream_id. This
            way, it gets explicitly declared in the binlogs that the streams are
            replicated independently.

            Then there is no longer a need for the replication state to remember the
            last seq_no applied for every server_id ever seen. Instead, we only need to
            remember seq_no for each stream_id seen.

            This makes the replication state much simpler, especially when multi-source
            replication is not used.

            For example, in the common setup with just one master and (N-1) slaves, the
            replication state is just a single global transaction ID, and the binlogs
            on all the servers always have transactions in the same order. This makes it
            conceptually much easier to understand where a slave is currently replicating,
            and to manually provision a new slave.

            knielsen Kristian Nielsen added the comment above.
            serg Sergei Golubchik made changes -
            Description Global transaction ID

            The purpose of Global transaction ID (GTID) is to make slave position
            independent of internal details of master's binlog (file name, file
            offset). This allows a simpler switch of a slave to a new master, as the
            current slave position is valid on the new master as well as the old.

            The gtrid is some kind of a tag, that is attached to every transaction. The
            point is - it must be globally unique, it must go into the binary log together
            with the transaction itself, it must be replicated to slaves, and it must be
            preserved when a transaction is applied on the slave. That is no matter how
            many slaves the transaction was replicated through (in a complex replication
            graph), on every slave and in all binary logs on slaves the transaction should
            have the same gtrid.

            As part of this task, we will also make the slave replication state crash safe
            (can be recovered after a crash in a transactionally safe way).


            Gtid_log_event

            A global trasaction id is a pair (server_id, seq_no). server_id already exists
            to identify events originating at different servers. The seq_no is new, it is
            a 64-bit unsigned integer that increases monotonically (not necessarily
            without holes) at each commit on the master.

            Every binlog event group (eg. transaction, DDL, non-transactional statement)
            is annotated with its global transaction ID. Most event groups are already
            bracketed with BEGIN/COMMIT events. We replace the BEGIN event with a new
            Gtid_Log_Event event. This event contains the seq_no of the global transaction
            id for the following event group (the server_id is stored in the event header
            of every event). We also include some flag bits.

            On the slave, the Gtid_Log_Event is applied like the BEGIN query
            event. However, the seq_no from the event is remembered, and preserved when
            the replicated event is binlogged, just like server_id currently is. A session
            variable pseudo_seq_no is introduced (requires SUPER to change) to similarly
            override seq_no, and used by mysqlbinlog to preserve gtid for
            mysqlbinlog|mysql style binlog apply.

            For event groups that currently have no BEGIN/COMMIT around them, a
            Gtid_Log_event is added before it. This event has a flag set to let the slave
            know that no COMMIT will follow, so it just applies to the following event
            group. This is for eg. DDL and certain out-of-band events like TRUNCATE of
            HEAP table after master restart.

            Old MariaDB slaves (or MySQL slaves) will not understand the new
            Gtid_log_event. We handle this using the existing mechanism for replacing
            events that old slaves cannot handle. For Gtid_log_event with no corresponding
            COMMIT we can just omit it or replace with a dummy event as appropriate. For
            the case with a corresponding COMMIT, we replace the Gtid_log_event with a
            normal BEGIN query event.

            To be able to do this, we make the Gtid_log_event be 38 bytes (19 bytes header
            + 19 bytes body), reserving a couple unused bytes for future expansin, as this
            is the minimum size for a BEGIN query event. Incidentally the current BEGIN
            event takes 68 bytes due to redundant information, so we still get 30 bytes
            space saved for every event group by introducing GTID.


            GTID in binlog

            Whenever we write a new event group to the binlog, we allocate the next seq_no
            and put it in a Gtid_log_event at the start of the group. When we write
            through a cache, we put 0 there and back-patch the proper seq_no when we write
            the cache to the binlog, so that seq_no ordering follows commit order and
            binlog order (per server-id).

            When the server shuts down, we write (and fsync()) the current seq_no to a
            file gtid.info in the data directory. When we startup, we read it back in to
            resume from the correct number. We fsync() the gtid.info file before marking
            the last binlog file as cleanly closed. And if at startup we detect that last
            binlog was not cleanly closed, we do the normal binlog crash recovery, and as
            part of that extract the last used seq_no from the events seen there, instead
            of relying on the (probably corrupt) gtid.info.

            The server remembers (eg. in a hash) the last seq_no seen for every
            server_id. When a new binlog file is written, this list of GTIDs is written
            out at the start of the binlog as a new Gtid_list_log_event.

            This allows to find the location in the server binlogs of any given GTID (sid,
            seq_no): start from the last binlog file, and scan backwards. For each binlog
            file, read the Gtid_list_log_event at the start. If sid is found with a
            lower-or-equal seq_no, then the GTID is found in this file, and we can scan
            forward until we find it. If sid is found with a bigger seq_no, then the GTID
            exists in an earlier binlog file (or has been purged if this is true for every
            binlog file). If sid is not found in the Gtid_list_log_event, then the GTID
            was never seen by this server.


            Slave replication state

            By "slave replication state", we denote what currently is the master binlog
            file name and file offset of the last event group applied on a slave. This is
            needed when reconnecting a slave to a master to resume replication at the
            correct point in the binlog event stream and not lose or duplicate any events.

            When GTIDs are used, the slave replication state is instead the GTID of the
            last event group applied on the slave. However, because of multi-master
            replication (and later perhaps parallel replication that re-orders events),
            the replication state becomes a _set_ of GTIDs last applied.

            The slave remembers the GTID with the maximal seq_no for every server_id value
            of every event group applied. If an event is received with lower-or-equal
            seq_no than already applied for that server_id, then the event is ignored
            (same as currently when receiving an event with own server_id). This avoids
            duplicating events in circular replication topologies.

            The slave replication state (list of GTIDs) is written to gtid.info at
            shutdown, and also written at the start of every slave binlog file. This makes
            it possible to recover after a slave crash, during the scan of the last
            binlog, same way as for seq_no on a master. We can in addition store it in
            relay-log.info if --log-slave-updates=OFF, so slaves can run with binlog
            disabled, however this will be as crash-unsafe as current replication.

            When the slave connects to the master with GTID enabled, it no longer needs to
            send master binlog file name and file offset. Instead it sends its replication
            state as a list of GTIDs. The master will search back through its binlogs as
            described above for the earliest of these GTIDs, and start sending events from
            that point on back to the slave. Note that if the server_id of a GTID has
            never been seen by a master, it can be ignored; however if it is seen in some
            Gtid_list_log_event but has been purged, then it is an error and slave connect
            fails (same as currently if slave asks for a binlog file that has been
            purged).

            When a slave connects using the old style (binlog file name and file offset),
            the master will send back the replication state corresponding to this
            position. This allows to automatically migrate to GTID; the next time the
            slave reconnects, it can use the replication state obtained from previous
            connect. It also allows to provision a new slave from a backup made with
            mysqldump --master-data or XtraDB. These provide the old-style binlog file
            name and positions to use - and after first connect to the master, the slave
            can automatically switch over to use global transaction IDs.

            Note that the list of GTIDs kept on master and on slave is in fact identical,
            even though it is used for different purposes on master and slave. Of course,
            it is possible for a server to be both a master and a slave, and use the
            information for both purposes.


            User interface

            The SHOW SLAVE STATUS command needs to be extended to also show the
            GTID-enabled replication state, which is the set of GTIDs with maximal seq_no
            per server-id applied on the slave. The SHOW MASTER STATUS should also show
            it.

            The CHANGE MASTER TO command can be used as before. If a GTID replication
            state is available, and no explicit filename/offset is given, and master
            supports GTID, then GTID will be used to automatically start from the correct
            event group.

            CHANGE MASTER should also be extended to allow to specify the current
            replication state. This should not normally be needed, but can be useful to
            experiment or recover from fatal server loss or corruption, etc. Specifying a
            GTID sets the maximal seq_no for the given server id. Specifying NULL for
            set_no in the GTID removes the server_id from the GTID list, leaving things as
            if that server_id was never seen before on the server.

            The slave_skip_counter can be used as before. It still records the GTID of any
            event group skipped.

            START SLAVE UNTIL is extended in syntax to take a GTID. It stops the slave
            when a GTID with same server_id and greater-or-equal seq_no is reached (if
            equal, the event is applied on the slave before stopping). If given a list of
            GTIDs, stops when any of them is reached. (Stopping when all GTIDs in a list
            have been reached can be achieved by a sequence of START SLAVE UNTIL
            commands).


            Switching to a new slave

            Suppose we have a number of slaves replicating off of one master (or several
            masters with multi-source replication). GTIDs make it simpler to switch to
            using one of the slaves as a new master (because the original master died or
            is taken down for maintenance or whatever).

            In the simple case where there is no multi-source replication (and no parallel
            replication that reorders transactions, if that is later implemented), then
            the event stream is completely linear, and with the same sequence in every
            slave at any point in the replication hierarchy. Then if one runs SHOW SLAVE
            STATUS on each slave, there will be one slave that has greater-or-equal seq_no
            for each server_id than any other slave.

            Now we can simply promote that slave (there may be several equal to choose
            from) as the new master. All the other slaves can simply do CHANGE MASTER TO,
            specifying the connection details of the new master, and GTID will ensure that
            they continue at the correct position.

            Of course if the master can be stopped gracefully while switching slave, we
            can just let all slaves run until they have all replicated everything from the
            master. Then any slave can be promoted as the new master.

            If using multi-source (or possibly later implemented parallel replication),
            then matters can be more complex (but remember that this is not the common
            case - only multi-source replication, where such switch of master is probably
            uncommon, and any later implemented parallel replication).

            Let us say that for two GTIDs we have

                (sid1,seq_no1) <= (sid2,seq_no2) iff sid1==sid2 && seq_no1 <= seq_no2

            It is possible that if the old master disappears abruptly (ie. crashes), then
            for every pair of slaves S1, S2, the replication state of S1 and S2 may be
            in-commensurable: There is a GTID1 on S1 with no greater GTID on S2, and
            likewise a GTID2 on S2 with no greater GITD on S1.

            In this case no slave is immediately ready to take over as master without some
            event groups getting missing. However, we can still do a reliable master
            switch, as follows:

            First, arbitrarily pick any slave as the new master. Obtain the replication
            state (SHOW SLAVE STATUS) of all other slaves. The idea is that we will
            replicate all missing changes from every other slave to the newly selected
            master to make sure it has everything needed to fullfill the master role.

            For each slave S, we do a CHANGE MASTER TO on the server selected as new
            master. Then we START SLAVE UNTIL <gtid> for every gtid in the replication
            state of S. This ensures that we have every event group seen by S on the new
            master. By repeating for every slave server, we end up with a new master
            server that will have a replication state that is a superset of all the
            remaining slaves. We can then simply CHANGE MASTER TO on all the slaves, and
            continue.

            As part of this worklog, I will write a script that reads a list of connection
            strings to the set of slave servers, and goes through the above procedure,
            resulting in the first server on the list being promoted as the new master and
            every other slave on the list changed to replicate from the new master.


            Comparison with MySQL 5.6 global transaction ID

            The main motivation for this design is my dislike for the MySQL 5.6 global
            transaction ID design.

            Despite all of its flaws (mainly lack of robustness), MySQL replication has
            been extremely successful. I believe the reason is that is transparent, in the
            sense that the way it works is conceptually simple to understand, and thus
            possible to tweak and manipulate by users. Replication consists simply of
            sending a stream of the changes done on a master to the slaves to be
            repeated. The slave replication state is simply the position in the stream.

            The design explained here preserves this conceptual simplicity. The
            replication state is still just the position in the stream. We just have a
            universal way to refer to that position that works on all servers across the
            replication topology. It is still possible to tweak and manipulate that
            position (slave_skip_counter, START SLAVE UNTIL, etc.)

            Multi-source replication makes the state more complex, since we now have a set
            of positions, but it is still conceptually sane.

            The MySQL 5.6 design, in contrast, loses the simple concept of replication
            state. The state is now an abstract set of all GTIDs ever applied on a slave,
            which is harder to grasp and manipulate. The concept of position in a
            replication stream becomes meaningless, as parallel replication can
            arbitrarily re-order events in the binlogs at different levels of the
            replication topology.

            (See below for plans on how to extend the design presented here to handle
            MySQL-style parallel replication in a way that tries to preserve the nice
            properties).


            Comparison with the Google transaction ID patch

            I think this approach is rather similar in concept to the Google transaction
            ID patch. The main differences are probably:

             - The use of the binlog for persisting the replication state in a crash-safe
               manner. I believe this is a better approach than what the Google patch
               does.

             - The extensions to handle multi-source replication, which is necessary as
               this feature has been introduced to MariaDB.


            Future expansion for parallel replication

            Parallel replication is not a part of this design. This task can be fully
            implemented as described here in a self-consistent way. However, parallel
            replication _is_ an important feature, and this section describes how to plan
            for being able to extend GTIDs later in a nice way to handle parallel
            replication.

            Now, parallel replication can come in two variants. One is where transactions
            are run in parallel, but still committed in the same order on the slave as on
            the master. This in-order parallelism has no conflicts with the GTID design
            described here, and can be implemented independently. MWL#184 is an example of
            this.

            The other out-of-order variant is what MySQL 5.6 does. Transactions are
            committed on a slave in different order than on the master, so the slave
            binlog has transactions in a different order than the master. But this kind of
            parallelism can potentially achieve a higher degree of concurrency on the
            slave, so it can be desirable.

            Multi-source replication is related to out-of-order parallelism. On a
            multi-source slave, event groups are applied in parallel, but written
            interleaved with one another in some arbitrary way in a single
            binlog. So a multi-source slave S1 could itself be a master for a deeper-down
            slave S2, but currently events on S2 would have to be replicated in-order with
            no parallelism, which can be prohibitively expensive.

            With multi-source we have extra information available on S1: We know that
            events from two different upstream masters have no predefined ordering, while
            events from a single upstream master must be applied in the order given. If we
            record that information in the binlog of S1, then S2 can use that information
            to know that it is safe for it to also apply events from distinct upstream
            masters in parallel, but keep the ordering of events belonging to a single
            upstream master.

            If we do it this way, then we retain the nice conceptual understanding of a
            GTID as a well-defined position in a replication stream. If two GTIDs
            originate from the same upstream master, they have a well-defined ordering
            which will be preserved all across the replication topology.

            In effect we now have multiple replication streams in each binlog. Between
            different replication streams, there is no ordering implied; however, within a
            single replication stream GTIDs uniquely and consistently define a simple
            linear order.
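
            As a small illustration of this property, the sketch below checks that a
            binlog keeps the order within each stream while allowing any interleaving
            between streams. It assumes an event group can be reduced to a pair
            (stream, seq_no); the function name preserves_stream_order is hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple


def preserves_stream_order(binlog: Iterable[Tuple[Hashable, int]]) -> bool:
    """Check that seq_no strictly increases within every stream.

    Event groups from different streams may be interleaved in any way.
    """
    last_seen: Dict[Hashable, int] = {}
    for stream, seq_no in binlog:
        if seq_no <= last_seen.get(stream, -1):
            return False
        last_seen[stream] = seq_no
    return True


# The same event groups, interleaved differently on two servers.
s1_binlog = [("A", 1), ("B", 1), ("A", 2), ("B", 2), ("A", 3)]
s2_binlog = [("B", 1), ("B", 2), ("A", 1), ("A", 2), ("A", 3)]
# Two event groups of stream A swapped: not a valid binlog in this model.
bad_binlog = [("A", 2), ("A", 1), ("B", 1)]
```

            Both s1_binlog and s2_binlog pass the check even though their global order
            differs; bad_binlog fails because it reorders stream A.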

            We can extend this by allowing multiple user-defined streams originating at a
            single server, with the application having the responsibility of ensuring that
            different streams are really independent. For example, we could create N
            streams based on a hash of the used database, to get different databases
            replicated in parallel just like the MySQL 5.6 MTS (multi-threaded slave)
            feature. But we still retain the concept of GTID as position in a linear
            stream - just with multiple streams possible.
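
            One possible way to derive a stream from the used database is sketched
            below. The function stream_for_database and the choice of CRC32 as hash are
            assumptions for illustration only; the design does not fix a hash function.

```python
import zlib


def stream_for_database(database: str, n_streams: int) -> int:
    """Map a database name to one of n_streams streams.

    CRC32 is used because it is stable across processes and restarts;
    any hash with that property would work for this sketch.
    """
    if n_streams <= 0:
        raise ValueError("n_streams must be positive")
    return zlib.crc32(database.encode("utf-8")) % n_streams
```

            All transactions using the same database land in the same stream and so
            keep their relative order, while transactions in different databases may
            end up in different streams.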

            The full design of this will be written up in a different task. But we can
            prepare the global transaction ID design to be more easily extensible towards
            multiple binlog streams, without much extra effort.

            Whenever we store a global transaction ID (server_id, seq_no), we also store a
            _replication stream ID_. So we store this in Gtid_log_event and
            Gtid_list_log_event, write it in gtid.info (and relay-log.info), and remember
            it as part of the slave replication state. We can use a 32-bit unsigned
            integer as replication stream id.
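
            A minimal sketch of such an extended GTID is shown below. The field widths
            for the stream ID (32-bit) and seq_no (64-bit) follow the text; the 32-bit
            server_id, the field order and the little-endian packing are assumptions
            and do not describe the actual event or file layout.

```python
import struct
from typing import NamedTuple

# Hypothetical layout: stream_id (32-bit), server_id (32-bit), seq_no (64-bit),
# all unsigned, little-endian, no padding.
_GTID_FORMAT = "<IIQ"


class Gtid(NamedTuple):
    stream_id: int
    server_id: int
    seq_no: int

    def pack(self) -> bytes:
        """Serialize the GTID; raises struct.error if a field is out of range."""
        return struct.pack(_GTID_FORMAT, self.stream_id, self.server_id, self.seq_no)

    @classmethod
    def unpack(cls, data: bytes) -> "Gtid":
        """Inverse of pack()."""
        return cls(*struct.unpack(_GTID_FORMAT, data))


# With a single stream, the stream ID is simply zero.
gtid = Gtid(stream_id=0, server_id=1, seq_no=42)
restored = Gtid.unpack(gtid.pack())
```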

            For the implementation of GTID, this will always be zero. Later, we can let
            multi-source replication assign a different replication stream ID to events
            from each upstream master. The idea is that two event groups with distinct
            replication stream IDs can be replicated in parallel and committed in any
            order to the binlog. This will allow a downstream slave S2 to also
            parallelise events from different upstream masters.
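
            On a downstream slave, this could amount to splitting the incoming binlog
            into one queue per stream ID, as in the sketch below. The tuple layout
            (stream_id, server_id, seq_no) and the name split_by_stream are
            hypothetical; actual scheduling of worker threads is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed shape of an event group header: (stream_id, server_id, seq_no).
EventGroup = Tuple[int, int, int]


def split_by_stream(binlog: Iterable[EventGroup]) -> Dict[int, List[EventGroup]]:
    """Split a binlog into per-stream queues.

    Order is kept within each queue, so each queue can be applied by its own
    worker while the queues run in parallel with one another.
    """
    queues: Dict[int, List[EventGroup]] = defaultdict(list)
    for group in binlog:
        queues[group[0]].append(group)
    return dict(queues)


queues = split_by_stream([(1, 10, 1), (2, 20, 1), (1, 10, 2), (2, 20, 2)])
```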

            Similarly, by allowing applications to annotate different transactions with
            different stream IDs, we can achieve the same kind of parallel replication as
            MySQL 5.6, in a way that is both simpler and more flexible.

            People

              knielsen Kristian Nielsen
              ratzpo Rasmus Johansson (Inactive)