MariaDB Server / MDEV-30386

flush binary logs delete_domain_id not deleting specific non present domain_id

Details

    Description

      We are trying to clean up the output of gtid_slave_pos so we can use GTID to move slaves around with Orchestrator without them complaining about positions not existing on the binary logs (as they might be leftovers from previous masters that no longer exist in those replication chains).

      While cleaning them up via flush binary logs delete_domain_id we are running into errors claiming that some of those GTID domains are currently present in the binary logs, which is not the case.

      For this example we have
      db1195 - master
      db1117 - slave

      db1117 is the one we want to clean up, so that it is left replicating with only db1195's domain id.

      root@db1117.eqiad.wmnet[(none)]> select @@gtid_domain_id;
      +------------------+
      | @@gtid_domain_id |
      +------------------+
      |        171966479 |
      +------------------+
      1 row in set (0.001 sec)
       
      root@db1117.eqiad.wmnet[(none)]>  select @@gtid_slave_pos\G
      *************************** 1. row ***************************
      @@gtid_slave_pos: 0-171966484-2731336216,171966484-171966484-7582228474,171966512-171966512-1959889139,171966556-171966556-1824632116,171966562-171966562-962004828,171970746-171970746-808478946,171974733-171974733-2008457625,171974884-171974884-9104192396,171978763-171978763-83528410,172001292-172001292-1428114291
      1 row in set (0.001 sec)
       
      root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171966484);
      ERROR 1076 (HY000): Could not delete gtid domain. Reason: binlog files may contain gtids from the domain ('171966484') being deleted. Make sure to first purge those files.
      root@db1117.eqiad.wmnet[(none)]>
      

      This host has one master (db1195) but also has log-slave-updates enabled for backup purposes.

      root@db1117.eqiad.wmnet[(none)]> show binary logs;
      +-------------------+------------+
      | Log_name          | File_size  |
      +-------------------+------------+
      | db1117-bin.016856 |    9845469 |
      | db1117-bin.016857 | 1048576295 |
      | db1117-bin.016858 | 1048781927 |
      | db1117-bin.016859 | 1048659990 |
      | db1117-bin.016860 |  125787283 |
      +-------------------+------------+
      5 rows in set (0.001 sec)
      

      This is its replication data:

      root@db1117.eqiad.wmnet[(none)]> show slave status\G
      *************************** 1. row ***************************
                      Slave_IO_State: Waiting for master to send event
                         Master_Host: db1195.eqiad.wmnet
                         Master_User: repl
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: db1195-bin.001570
                 Read_Master_Log_Pos: 339370363
                      Relay_Log_File: db1117-relay-bin.000010
                       Relay_Log_Pos: 339370663
               Relay_Master_Log_File: db1195-bin.001570
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 339370363
                     Relay_Log_Space: 339371021
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: Yes
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: 0
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error:
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 172001292
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos: 0-171966484-2731336216,171966484-171966484-7582228474,171974733-171974733-2008457625,171966562-171966562-962004828,171970746-171970746-808478946,171966512-171966512-1959889139,171974884-171974884-9104192396,171966556-171966556-1824632116,171978763-171978763-83528410,172001292-172001292-1428128537
             Replicate_Do_Domain_Ids:
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 21
      Slave_Non_Transactional_Groups: 111
          Slave_Transactional_Groups: 1216438
      1 row in set (0.001 sec)
      

      If we scan the binlogs (and the relay logs of this slave, just in case) we can see that the domain_id isn't present in any of them.
      We use MIXED as binlog_format.

       
      root@db1117:/srv/sqldata.m1# ls -lh | egrep "bin"
      -rw-rw---- 1 mysql mysql  9.4M Jan 11 20:31 db1117-bin.016856
      -rw-rw---- 1 mysql mysql 1001M Jan 11 22:31 db1117-bin.016857
      -rw-rw---- 1 mysql mysql 1001M Jan 11 22:52 db1117-bin.016858
      -rw-rw---- 1 mysql mysql 1001M Jan 11 23:20 db1117-bin.016859
      -rw-rw---- 1 mysql mysql  213M Jan 11 23:24 db1117-bin.016860
      -rw-rw---- 1 mysql mysql   100 Jan 11 23:20 db1117-bin.index
      -rw-rw---- 1 mysql mysql   358 Jan 11 23:16 db1117-relay-bin.000009
      -rw-rw---- 1 mysql mysql  380M Jan 11 23:24 db1117-relay-bin.000010
      -rw-rw---- 1 mysql mysql    52 Jan 11 23:16 db1117-relay-bin.index
       
      root@db1117:/srv/sqldata.m1# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done
      db1117-bin.016856
      db1117-bin.016857
      db1117-bin.016858
      db1117-bin.016859
      db1117-bin.016860
      db1117-relay-bin.000009
      db1117-relay-bin.000010
      

      The same result on its master

      root@db1195.eqiad.wmnet[(none)]> select @@gtid_domain_id;
      +------------------+
      | @@gtid_domain_id |
      +------------------+
      |        172001292 |
      +------------------+
      1 row in set (0.001 sec)
       
       
      root@db1195:/srv/sqldata#  ls | grep bin | grep -v index
      db1195-bin.001565
      db1195-bin.001566
      db1195-bin.001567
      db1195-bin.001568
      db1195-bin.001569
      db1195-bin.001570
       
      root@db1195:/srv/sqldata# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done
      db1195-bin.001565
      db1195-bin.001566
      db1195-bin.001567
      db1195-bin.001568
      db1195-bin.001569
      db1195-bin.001570
      

      Logs have been flushed/rotated just in case on both hosts.
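
      For reference, a minimal sketch of what the flush/rotate step above would typically look like; the report does not state the exact statements used, and the file name in PURGE is only assumed from the SHOW BINARY LOGS listing earlier:

      -- Sketch only: rotate to a new binlog file, then purge everything older
      -- than the newest file (file name assumed from the listing above).
      FLUSH BINARY LOGS;
      PURGE BINARY LOGS TO 'db1117-bin.016860';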

      In sql/rpl_gtid.cc we can see this piece of code:

          for (not_match= true, k= 0; k < elem->hash.records; k++)
          {
            rpl_gtid *d_gtid= (rpl_gtid *)my_hash_element(&elem->hash, k);
            for (ulong l= 0; l < glev->count && not_match; l++)
              not_match= !(*d_gtid == glev->list[l]);
          }
       
          if (not_match)
          {
            sprintf(errbuf, "binlog files may contain gtids from the domain ('%u') "
                    "being deleted. Make sure to first purge those files",
                    *ptr_domain_id);
            errmsg= errbuf;
            goto end;
          }
      

      Which, per the following comment, seems to be getting the list of domain_ids from the GTID_LIST_EVENT (https://mariadb.com/kb/en/gtid_list_event/):

          Gtid list is supposed to come from a binlog's Gtid_list event and
          therefore should be a subset of the current binlog state. That is
          for every domain in the list the binlog state contains a gtid with
          sequence number not less than that of the list.
          Exceptions of this inclusion rule are:
            A. the list may still refer to gtids from already deleted domains.
               Files containing them must have been purged whereas the file
               with the list is not yet.
            B. out of order groups were injected
            C. manually build list of binlog files violating the inclusion
               constraint.
          While A is a normal case (not necessarily distinguishable from C though),
          B and C may require the user's attention so any (incl the A's suspected)
          inconsistency is diagnosed and *warned*.
        */
      

      Going by that documentation page, I would have expected that domain_id to be present in at least one of the existing binlogs, but as can be seen above, it is not.
      In fact, that gtid_domain_id doesn't belong to any of the hosts present in this replication topology.
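
      For completeness, the Gtid_list event that this check relies on can also be inspected from SQL; a minimal sketch, with the binlog file name taken from the SHOW BINARY LOGS output above:

      -- The Gtid_list event is written at the start of every binlog file; listing
      -- the first few events of the oldest file shows which domains the server
      -- still considers covered by the binlogs.
      SHOW BINLOG EVENTS IN 'db1117-bin.016856' LIMIT 5;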

      Some other domain_ids do get cleaned up fine on that same host:

      root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171974884);
      Query OK, 0 rows affected (0.002 sec)
       
      root@db1117.eqiad.wmnet[(none)]>
      

      The same happens when trying to delete the default gtid_domain_id 0, which has never been present on this system, as we set gtid_domain_id to a different value when we set up the hosts.

      Attachments

        Activity

          knielsen Kristian Nielsen added a comment (edited) -

          Thanks, Manuel Arostegui, for an excellent write-up of the problem.

          This part of the code you refer to looks obviously wrong:

              for (ulong l= 0; l < glev->count && not_match; l++)
                not_match= !(*d_gtid == glev->list[l]);
          

          The test is reversed. If the domain to be deleted has multiple entries in the GTID_LIST event (two or more different server_ids), then obviously all but one will not match.

          So this would make it impossible to delete any domain which was served by two different masters. When I get time, I'll try to write a test for this.

          Some extra information would be useful: The value of @@GTID_BINLOG_STATE, and the mysqlbinlog dump of the GTID_LIST event at the start of the first binlog file on the slave. And also the output of SHOW WARNINGS after the failing command, if any. I know this is an old bug report, so I understand if this is no longer possible to obtain.

          Also, I wonder if there is some confusion here. The DELETE_DOMAIN_ID is for removing domains from the @@GTID_BINLOG_STATE. But the report mentions deleting from the @@GTID_SLAVE_POS. The latter is done simply by setting @@GTID_SLAVE_POS to the new desired value without the unwanted domains. However, the observed behaviour still looks like a bug to me...
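
          A minimal sketch of the two points above, assuming the slave threads can briefly be stopped; the gtid_slave_pos value is illustrative and keeps only the current master's domain, using the seq_no visible earlier in this report:

              -- Diagnostics requested above; SHOW WARNINGS should be run right after
              -- the failing FLUSH BINARY LOGS DELETE_DOMAIN_ID statement.
              SELECT @@gtid_binlog_state;
              SHOW WARNINGS;

              -- Cleaning @@gtid_slave_pos (as opposed to the binlog state) is done by
              -- setting it explicitly while replication is stopped; the value below is
              -- illustrative and keeps only the db1195 domain.
              STOP SLAVE;
              SET GLOBAL gtid_slave_pos = '172001292-172001292-1428114291';
              START SLAVE;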


          knielsen Kristian Nielsen added a comment -

          > The test is reversed. If the domain to be deleted has multiple entries in the GTID_LIST event (two or more different server_ids), then obviously all but one will not match.

          Sorry about that, I got confused. The code loops until not_match becomes false, i.e. until it finds a matching GTID. So the test is not reversed.

          Still trying to understand how the original behaviour could occur.

          - Kristian.

          marostegui Manuel Arostegui added a comment -

          Yeah, sorry about the confusion with GTID_BINLOG_STATE and GTID_SLAVE_POS.
          Essentially we were also expecting Gtid_IO_Pos from show slave status\G to get cleaned up once we removed the stale domains with DELETE_DOMAIN_ID.


          knielsen Kristian Nielsen added a comment -

          To conclude:

          From checking the code, it looks like this behaviour can only occur if the
          domain 171966484 to be deleted is present in the gtid_binlog_state. Then, if
          the domain 171966484 is missing from the GTID_LIST event in the first binlog
          file, this error will be thrown.

          This situation represents a corruption of the database state, as the
          gtid_binlog_state is supposed to always reflect the contents of the binlog.
          Without the value of the @@gtid_binlog_state at the time the problem occurs,
          we can only assume this was the case. It is thus hard to be sure how such
          corruption could have occurred.

          The gtid_binlog_state is saved between server restarts in the file
          db1117-bin.state. If the binlog files were manually modified in an
          inconsistent way (for example db1117-bin.NNNNNN and db1117-bin.index files
          deleted or restored from backup while leaving db1117-bin.state in place),
          the observed behaviour could occur. Is this something that could have
          happened?

          - Kristian.
          knielsen Kristian Nielsen added a comment -

          Thanks for the answer. It is impossible to be 100% sure, but I am 99.9% sure we've not done any of that. It is very very unlikely we've done so. We always remove binlogs (if we have to) via purge binary logs to

          This is of course too late now but leaving it here:

          mysql:root@localhost [(none)]> select @@gtid_binlog_state;
          +-------------------------------------------------------------------------------------------------------------------------------------------------------+
          | @@gtid_binlog_state                                                                                                                                   |
          +-------------------------------------------------------------------------------------------------------------------------------------------------------+
          | 171966607-171966607-267605286,171970746-171970746-1813679613,171974854-171974854-64140767,171978763-171978763-83528410,172001292-172001292-1616193628 |
          +-------------------------------------------------------------------------------------------------------------------------------------------------------+
          1 row in set (0.000 sec)
          

          In any case, what is the recommended workaround if that had been the case? Is it just issuing reset master on the primary master and then rebuilding the whole replication topology? If that is the case, that's pretty much impossible in most production environments.
          Is there a way to implement a solution for this? Because I don't think it is too unusual to see in other environments (people deleting binlogs manually rather than via the mariadb CLI).

          Would a reset slave all clean up all the weird states? That would be a lot easier to issue in production.
          Maybe an implementation of a way to force those deletions without checking what was present on the first binlog, which is unlikely to be kept in very busy production environments?

          Thanks Kristian!

          marostegui Manuel Arostegui added a comment -

          Ok, a number of points here, I'll try to answer all of them:

          > Would a reset slave all clean up all the weird states? That would be a lot
          > easier to issue in production.

          So there's a concrete issue here about a wrong or misleading error message
          on FLUSH BINARY LOGS DELETE_DOMAIN_ID. And there's a more general issue
          about "wierd states". Let's take the general issue first.

          From the given slave positions and binlog states, it looks like every
          replication server is given its own gtid_domain_id, identical to server_id.
          Why is it done like this? This is usually the wrong way to do it. Different
          domain_id should be needed only for master-master replication (with both
          masters actively receiving updates), multi-source, or out-of-order parallel
          replication. And in these cases the domain_id would not be unique per server
          unless all servers are involved in master-master replication and actively
          being updated.

          Having 10 different domains in the replication position is a very
          complicated setup. The best way to avoid having unwanted domain_id in the
          slave position or binlog state is to not introduce them in the first place.
          Well, I don't know your setup, from the large sequence numbers it does seem
          to be a very busy system, so maybe it's for a reason. Just pointing out that
          there's absolutely no requirement to have different domain id per server,
          and often it's wrong to do so.

          You mention needing to clean up gtid_slave_pos, otherwise the slave cannot
          be pointed to a different master. This sounds like exactly the kind of
          problem that could be caused by wrongly setting different domain_id on
          servers that are replicating the same set of data. If you're eg. replicating
          A -> B -> C -> D, there should be just one domain_id shared among all
          servers. And then there are no domains to clean up if the replication
          topology changes to C -> D -> A -> B or whatever.
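
          (A minimal sketch, with a hypothetical value, of what "one shared domain" means in practice; 0 happens to be the default:)

              -- Run on each of A, B, C and D (and keep the same value in each server's config so it survives restarts)
              SET GLOBAL gtid_domain_id = 0;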

          But if all the domain_ids were needed at one point and then later no longer
          wanted, this can be handled too:

          > Maybe an implementation of a way to force
          > those deletions without checking what was present on the first binlog, which
          > is unlikely to be kept in very busy production environments?.

          This is already implemented. To delete from the slave position, use
          SET GLOBAL gtid_slave_pos=<position with domains deleted>. To delete from
          the binlog state, use FLUSH BINARY LOGS DELETE_DOMAIN_ID, after all binlog
          files containing transactions in those domains have been purged.
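
          (For illustration, a sketch using the same hypothetical values as the example further down in this thread:)

              -- Remove domain 10 from the slave position: set the position without that domain
              STOP SLAVE;
              SET GLOBAL gtid_slave_pos = '20-20-2000';
              START SLAVE;
              -- Remove domain 10 from the binlog state, once no remaining binlog file contains GTIDs from it
              FLUSH BINARY LOGS DELETE_DOMAIN_ID = (10);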

          Note there is no requirement that the first binlog file be present. It's the
          other way around: if old binlog files are still present that contain GTIDs
          from the domain, then the domain can not be deleted. The reason is that this
          would cause the gtid_binlog_state to not match the existing binlog files. As
          soon as any old binlog files containing the domain are purged, the domain
          can be deleted with DELETE_DOMAIN_ID.

          > Thanks for the answer. It is impossible to be 100% sure, but I am 99.9%
          > sure we've not done any of that. It is very very unlikely we've done so.
          > We always remove binlogs (if we have to) via purge binary logs to

          > @@gtid_binlog_state
          > | 171966607-171966607-267605286,171970746-171970746-1813679613,171974854-171974854-64140767,171978763-171978763-83528410,172001292-172001292-1616193628 |

          Right. So the domain 171966484 that couldn't be deleted is not in the
          current binlog state. Maybe it was never there, and the misleading error
          message appears for some unknown reason that I couldn't guess from looking
          at the code.

          The gtid_binlog_state is basically a summary of the GTID information in the
          binlog (GTID_LIST and GTID events). If the domain 171966484 is nowhere found
          in the binlogs, it shouldn't be in the gtid_binlog_state (and if it was, it
          would be written as a GTID_LIST event in any new binlog files).

          > In any case, what is the recommended workaround if that'd have been the
          > case? Is it just issuing reset master on the primary master and then
          > rebuild the whole replication topology? If that is the case, that's pretty
          > much impossible in most production environments.

          No, that is not necessary.

          If you really have the situation that domain D=171966484 is not in any
          binlog files but is in the gtid_binlog_state, then just wait for the
          existing binlog files to be purged (or speed it up with FLUSH BINARY LOGS
          and PURGE BINARY LOGS TO ...). The newly written binlog files will contain
          the gtid_binlog_state in the GTID_LIST events. So once all the old binlog
          files with incorrect GTID_LIST have been purged, the command will work. But
          it is not a situation that is likely to occur in practice.
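
          (A sketch of speeding this up; db1117-bin.NNNNNN stands for the newest file shown by SHOW BINARY LOGS:)

              FLUSH BINARY LOGS;                                   -- rotate: the new file gets a GTID_LIST reflecting the current gtid_binlog_state
              PURGE BINARY LOGS TO 'db1117-bin.NNNNNN';            -- purge all files older than the newest one
              FLUSH BINARY LOGS DELETE_DOMAIN_ID = (171966484);    -- should now succeed, per the above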

          If alternatively the situation is that the domain is already not present in
          the binlog state (as we suspect here) when the DELETE_DOMAIN_ID command is
          run, of course the workaround is just to ignore the misleading error.

          > Is there a way to implement a solution from this cause I don't think it is
          > something too weird to see on other environments (people deleting binlogs
          > manually rather than via mariadb CLI)

          It's not a problem to delete old binlog files manually; it will not cause a
          problem with gtid_binlog_state. You would only get the problem if you stop
          the server and delete all binlog files (but not the master-bin.state file).
          Or if you replace all the master-bin.NNNNNN files from a backup (but leave
          the master-bin.state file in place). The solution to implement is to delete
          the .state file as well if you delete all of the binlog, or to restore
          .state from the backup if you restore the binlog.

          Hope this helps,

          - Kristian.
          knielsen Kristian Nielsen added a comment -
          julien.fritsch Julien Fritsch made changes -
          Fix Version/s 10.8 [ 26121 ]

          Kristian, thanks a lot for your very detailed answer. And sorry it's taken me so long to answer - very busy with some things.
          I am going to answer some of the points:

          >Having 10 different domains in the replication position is a very
          >complicated setup. The best way to avoid having unwanted domain_id in the
          >slave position or binlog state is to not introduce them in the first place.
          >Well, I don't know your setup, from the large sequence numbers it does seem
          >to be a very busy system, so maybe it's for a reason. Just pointing out that
          >there's absolutely no requirement to have different domain id per server,
          >and often it's wrong to do so.

          We introduced it years ago to try to get GTID+multisource enabled, but we ran into a very nasty bug, MDEV-12012
          We decided to be consistent and enable gtid_domain_id for all our hosts at the time. But given your explanation, and the fact that we have completely ruled out going for multi-source ever again, maybe we can simply drop it now.

          We do have A -> B -> C (and even more depth) in our infrastructure, so that's why there are so many IDs there. I am going to discuss this with the team.
          We only have master-master in one topology, so we could just leave it there.

          >If you really have the situation that domain D=171966484 is not in any
          >binlog files but is in the gtid_binlog_state, then just wait for the
          >existing binlog files to be purged (or speed it up with FLUSH BINARY LOGS
          >and PURGE BINARY LOGS TO ...). The newly written binlog files will contain
          >the gtid_binlog_state in the GTID_LIST events. So once all the old binlog
          >files with incorrect GTID_LIST have been purged, the command will work. But
          >it is not a situation that is likely to occur in practice.

          That's what actually made me file the bug. As you can see in the original post, that gtid_domain_id isn't present in any of the binlogs. In all this time those logs would have been flushed a few times already.

          >If alternatively the situation is that the domain is already not present in
          >the binlog state (as we suspect here) when the DELETE_DOMAIN_ID command is
          >run, of course the workaround is just to ignore the misleading error.

          I think I will try to go for this one, but also, maybe it is just better for us to completely unset gtid_domain_id in the first place and then try to clean them up entirely, to avoid running into these issues.

          Thanks again for your detailed answer.

          marostegui Manuel Arostegui added a comment -
          marostegui Manuel Arostegui added a comment - - edited

          Kristian, I think it would still be nice to understand what makes FLUSH BINARY LOGS DELETE_DOMAIN_ID fail in the first place with that misleading error, because this is still very unclear to me. We'd still need it to be able to set gtid_domain_id=0 and clean up all the unused values, so that we have a fully clean environment without unexpected GTID errors when handling replication via GTID.
          Let me show you the behavior on a master:

          root@db1125.eqiad.wmnet[(none)]> show slave status\G
          Empty set (0.001 sec)
           
          root@db1125.eqiad.wmnet[(none)]>
           
          root@db1125.eqiad.wmnet[(none)]> show binary logs;
          +-------------------+-----------+
          | Log_name          | File_size |
          +-------------------+-----------+
          | db1125-bin.000046 |       419 |
          +-------------------+-----------+
          1 row in set (0.001 sec)
           
          root@db1125.eqiad.wmnet[(none)]> flush binary logs;
          Query OK, 0 rows affected (0.002 sec)
           
          root@db1125.eqiad.wmnet[(none)]> show binary logs;
          +-------------------+-----------+
          | Log_name          | File_size |
          +-------------------+-----------+
          | db1125-bin.000046 |       467 |
          | db1125-bin.000047 |       419 |
          +-------------------+-----------+
          2 rows in set (0.001 sec)
           
          root@db1125.eqiad.wmnet[(none)]> purge binary logs to 'db1125-bin.000047';
          Query OK, 0 rows affected (0.001 sec)
           
          root@db1125.eqiad.wmnet[(none)]> show binlog events in 'db1125-bin.000047'
              -> ;
          +-------------------+-----+-------------------+-----------+-------------+--------------------------------------------------------------------------------------+
          | Log_name          | Pos | Event_type        | Server_id | End_log_pos | Info                                                                                 |
          +-------------------+-----+-------------------+-----------+-------------+--------------------------------------------------------------------------------------+
          | db1125-bin.000047 |   4 | Format_desc       | 171978825 |         256 | Server ver: 10.6.12-MariaDB-log, Binlog ver: 4                                       |
          | db1125-bin.000047 | 256 | Gtid_list         | 171978825 |         331 | [171970745-171970745-2370743389,171970577-171970577-53037843,0-171978825-1006906065] |
          | db1125-bin.000047 | 331 | Binlog_checkpoint | 171978825 |         375 | db1125-bin.000046                                                                    |
          | db1125-bin.000047 | 375 | Binlog_checkpoint | 171978825 |         419 | db1125-bin.000047                                                                    |
          +-------------------+-----+-------------------+-----------+-------------+--------------------------------------------------------------------------------------+
          4 rows in set (0.001 sec)
          
          

          Now, let's try to delete 171970595 which is indeed not in Gtid_list:

          root@db1125.eqiad.wmnet[(none)]> FLUSH BINARY LOGS DELETE_DOMAIN_ID=(171970595);
          ERROR 1076 (HY000): Could not delete gtid domain. Reason: binlog files may contain gtids from the domain ('171970595') being deleted. Make sure to first purge those files.
          

          Per your comment earlier, it is expected that it will fail if you try to remove a domain id that is not in the Gtid_list event at the start of the binlog, right?

          root@db1125.eqiad.wmnet[(none)]> select @@gtid_domain_id;
          +------------------+
          | @@gtid_domain_id |
          +------------------+
          |                0 |
          +------------------+
          1 row in set (0.001 sec)
           
          root@db1125.eqiad.wmnet[(none)]> select @@gtid_slave_pos;
          +------------------------+
          | @@gtid_slave_pos       |
          +------------------------+
          | 0-171970569-1006906065 |
          +------------------------+
          1 row in set (0.001 sec)
          
          

          I suppose I can also run this on this host, which is now a master but was also a slave at some point (which will be the case for all replication topologies out there as soon as they have a master switchover) - so I'd guess this is not strictly necessary, or shouldn't be - thoughts?:

          set global gtid_slave_pos='';
          

          So my question is, is this still a valid bug with no workaround possible?

          Thanks for all your detailed explanations!


          Hi Manuel,

          Ah, I hadn't understood that you can still reproduce this. Thanks for persevering, I definitely want to get to the bottom of this then.

          Can you provide the value of @@global.gtid_binlog_state ? Does it contain the domain_id 171970595 that we are trying to delete? I suspect it does not, because the value of gtid_binlog_state is what is written in the GTID_LIST event, and this is shown as [171970745-171970745-2370743389,171970577-171970577-53037843,0-171978825-1006906065] in your output, which does not contain 171970595.

          Just in case this isn't clear: FLUSH BINARY LOGS DELETE_DOMAIN_ID=(D) deletes D from the gtid_binlog_state. Not from the gtid_slave_pos or anywhere else.

          Thus, if 171970595 is not in gtid_binlog_state, then it is correct that the DELETE_DOMAIN_ID command fails. It is obviously wrong (and very confusing) that it fails with the wrong message.

          On the other hand, if 171970595 is in the gtid_binlog_state, then there is a bug that the GTID_LIST event written into the new binlog db1125-bin.000047 is missing the domain, and we need to try to understand how that happens.

          So let's get the value of gtid_binlog_state corresponding to a known occurrence of this problem, so we know which of these two cases we are facing.

          Again, thanks Manuel for persisting in this, and thanks for the detailed and obviously competent bug reports and information/discussions.

          knielsen Kristian Nielsen added a comment -

          Yeah, I have a test cluster I can run and break as we want

          171970595 isn't present indeed.

          root@db1125.eqiad.wmnet[(none)]> select @@global.gtid_binlog_state;
          +------------------------------------------------------------------------------------+
          | @@global.gtid_binlog_state                                                         |
          +------------------------------------------------------------------------------------+
          | 0-171978825-1006915085,171970577-171970577-53037843,171970745-171970745-2370743389 |
          +------------------------------------------------------------------------------------+
          1 row in set (0.001 sec)
          

          However, taking this answer:
          > Thus, if 171970595 is not in gtid_binlog_state, then it is correct that the DELETE_DOMAIN_ID command fails. It is obviously wrong (and very confusing) that it fails with the wrong message.

          If it is correct that it fails, then what is the way to actually proceed and delete it from gtid_slave_pos, which is what is messing up replication in the first place, as it breaks with

          Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-XXX-XXXX, which is not in the master's binlog'
          

          For us, if we want to go back to using just domain_id=0 and start playing with GTID, we'd still need to clean them all up, as otherwise moving replicas around with GTID enabled ends up breaking because of the above error.

          marostegui Manuel Arostegui added a comment -

          Or to put it more succinctly:

          1. If you want to remove a domain from the gtid_slave_pos, then the correct command to use is SET GLOBAL gtid_slave_pos= <new position>. FLUSH BINARY LOGS DELETE_DOMAIN_ID cannot be used for this.

          2. The real bug here might be that FLUSH BINARY LOGS DELETE_DOMAIN_ID gives the wrong error message, not that it does something wrong (to be confirmed).

          - Kristian.

          knielsen Kristian Nielsen added a comment -

          > 1. If you want to remove a domain from the gtid_slave_pos, then the correct command to use is SET GLOBAL gtid_slave_pos= <new position>. FLUSH BINARY LOGS DELETE_DOMAIN_ID cannot be used for this.

          I am going to double check the possible scenarios of this + non existent domain_id and report back. Thanks!
          Not that this is easy to do in a busy environment where the master cannot be set to read_only for a long period of time, but I will try it on the test cluster. Once the process is smooth we'd need to see how we can actually make this happen in production without breaking replication/skipping transactions!

          > The real bug here might be that FLUSH BINARY LOGS DELETE_DOMAIN_ID gives the wrong error message, not that it does something wrong (to be confirmed).

          Yes, at least if we can make it a lot more meaningful, that would already be a big win.

          Thank you again for your time

          marostegui Manuel Arostegui added a comment -

          Thanks Manuel for the quick answer.

          > 171970595 isn't present indeed.

          Aha. So the error message is wrong. It should say that the domain 171970595 cannot be deleted from gtid_binlog_state because it is not present. I'll dig a bit more and see if I can reproduce this error locally.

          > what is the way to actually be able to proceed and delete it from gtid_slave_pos

          The way to delete a domain from the gtid_slave_pos is:

          SET GLOBAL gtid_slave_pos = "..."

          (E.g. in this case by specifying the old position with the to-be-deleted domains removed).

          For example, say that the gtid_slave_pos is:

          10-10-1000, 20-20-2000

          But the current master is using domain_id 20 as the only domain, and you are getting an error that position 10-10-1000 is not present on the master binlog and want to remove it. Simply do

          SET GLOBAL gtid_slave_pos = "20-20-2000";

          Maybe I should write up some article that clearly explains how to use the domain_id of MariaDB global transaction id, and see if I can get it visible somehow. There is a way to think of domains that should make it relatively easy to understand what is going on. But I know the whole GTID implementation is somewhat complex, and I remember seeing multiple descriptions around on the 'net that are completely wrong wrt. domain_id.

          - Kristian.

          knielsen Kristian Nielsen added a comment -

          A generic SQL to delete a domain @domain from the gtid_slave_pos:

              SET GLOBAL gtid_slave_pos = SUBSTRING(REGEXP_REPLACE(CONCAT(",", @@GLOBAL.gtid_slave_pos),CONCAT(",", @domain, "-[0-9]+-[0-9]+"), ""), 2);
          
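          For example (the domain id here is just the one from this report; the slave should be stopped while changing gtid_slave_pos):

              STOP SLAVE;
              SET @domain = 171966484;
              SET GLOBAL gtid_slave_pos = SUBSTRING(REGEXP_REPLACE(CONCAT(",", @@GLOBAL.gtid_slave_pos),CONCAT(",", @domain, "-[0-9]+-[0-9]+"), ""), 2);
              START SLAVE;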

          knielsen Kristian Nielsen added a comment -

          Thank you Kristian.

          At this point, I am very suspicious about how corrupted all this could be. I have tried the following approach with a slave.
          This slave has the following replication topology

          A: Primary master with the following gtid_domain_id=171974662 (this runs pt-heartbeat)
          B: Intermediate master with the following gtid_domain_id=180367475 (this runs another instance of pt-heartbeat)

          So the replication is:
          A -> B -> Slave

          This is its replication status:

          root@db2180.codfw.wmnet[(none)]> show slave status\G
          *************************** 1. row ***************************
                          Slave_IO_State: Waiting for master to send event
                             Master_Host: db2129.codfw.wmnet
                             Master_User: repl
                             Master_Port: 3306
                           Connect_Retry: 60
                         Master_Log_File: db2129-bin.003453
                     Read_Master_Log_Pos: 250342937
                          Relay_Log_File: db2180-relay-bin.000002
                           Relay_Log_Pos: 17297697
                   Relay_Master_Log_File: db2129-bin.003453
                        Slave_IO_Running: Yes
                       Slave_SQL_Running: Yes
                         Replicate_Do_DB:
                     Replicate_Ignore_DB:
                      Replicate_Do_Table:
                  Replicate_Ignore_Table:
                 Replicate_Wild_Do_Table:
             Replicate_Wild_Ignore_Table:
                              Last_Errno: 0
                              Last_Error:
                            Skip_Counter: 0
                     Exec_Master_Log_Pos: 250342937
                         Relay_Log_Space: 17298007
                         Until_Condition: None
                          Until_Log_File:
                           Until_Log_Pos: 0
                      Master_SSL_Allowed: Yes
                      Master_SSL_CA_File:
                      Master_SSL_CA_Path:
                         Master_SSL_Cert:
                       Master_SSL_Cipher:
                          Master_SSL_Key:
                   Seconds_Behind_Master: 0
           Master_SSL_Verify_Server_Cert: No
                           Last_IO_Errno: 0
                           Last_IO_Error:
                          Last_SQL_Errno: 0
                          Last_SQL_Error:
             Replicate_Ignore_Server_Ids:
                        Master_Server_Id: 180367475
                          Master_SSL_Crl:
                      Master_SSL_Crlpath:
                              Using_Gtid: Slave_Pos
                             Gtid_IO_Pos: 180367475-180367475-979771585,171978805-171978805-2291133789,180363389-180363389-11242988,171970594-171970594-1063329989,180363370-180363370-7087924,171978904-171978904-198543478,171974883-171974883-1921892293,171974662-171974662-1711761670
                 Replicate_Do_Domain_Ids:
             Replicate_Ignore_Domain_Ids:
                           Parallel_Mode: optimistic
                               SQL_Delay: 0
                     SQL_Remaining_Delay: NULL
                 Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
                        Slave_DDL_Groups: 0
          Slave_Non_Transactional_Groups: 0
              Slave_Transactional_Groups: 6434525
          

          Let's try to clean up the domain_ids. Stopping the slave and capturing the GTID position from both hosts:

                             Gtid_IO_Pos: 180367475-180367475-979771613,171978805-171978805-2291133789,180363389-180363389-11242988,171970594-171970594-1063329989,180363370-180363370-7087924,171978904-171978904-198543478,171974883-171974883-1921892293,171974662-171974662-1711765037
          

          So they are:
          180367475-180367475-979771613
          171974662-171974662-1711765037

          Let's clean the rest:

          root@db2180.codfw.wmnet[(none)]> SET GLOBAL gtid_slave_pos = "180367475-180367475-979771613,171974662-171974662-1711765037"
              -> ;
          Query OK, 0 rows affected (0.036 sec)
           
          root@db2180.codfw.wmnet[(none)]> start slave;
           
           
          show slave status\G (snipped some parts)
                           Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'
                          Last_SQL_Errno: 0
                          Last_SQL_Error:
             Replicate_Ignore_Server_Ids:
                        Master_Server_Id: 180367475
                          Master_SSL_Crl:
                      Master_SSL_Crlpath:
                              Using_Gtid: Slave_Pos
                             Gtid_IO_Pos: 180367475-180367475-979771613,171974662-171974662-1711765037
          

          There is absolutely no way the old binlogs could have been purged. At this point I am not sure anymore whether this is trustworthy or not.
          Interestingly, it all comes back to normal after doing:

          root@db2180.codfw.wmnet[(none)]> STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=no; START SLAVE;
          Query OK, 0 rows affected (0.034 sec)
           
          Query OK, 0 rows affected (0.035 sec)
           
          Query OK, 0 rows affected (0.034 sec)
           
          root@db2180.codfw.wmnet[(none)]> STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;
          Query OK, 0 rows affected (0.037 sec)
           
          Query OK, 0 rows affected (0.040 sec)
           
          Query OK, 0 rows affected (0.036 sec)
           
          root@db2180.codfw.wmnet[(none)]> show slave status\G
          *************************** 1. row ***************************
                          Slave_IO_State: Waiting for master to send event
                             Master_Host: db2129.codfw.wmnet
                             Master_User: repl
                             Master_Port: 3306
                           Connect_Retry: 60
                         Master_Log_File: db2129-bin.003453
                     Read_Master_Log_Pos: 275030845
                          Relay_Log_File: db2180-relay-bin.000002
                           Relay_Log_Pos: 2457404
                   Relay_Master_Log_File: db2129-bin.003453
                        Slave_IO_Running: Yes
                       Slave_SQL_Running: Yes
                         Replicate_Do_DB:
                     Replicate_Ignore_DB:
                      Replicate_Do_Table:
                  Replicate_Ignore_Table:
                 Replicate_Wild_Do_Table:
             Replicate_Wild_Ignore_Table:
                              Last_Errno: 0
                              Last_Error:
                            Skip_Counter: 0
                     Exec_Master_Log_Pos: 259054498
                         Relay_Log_Space: 18434061
                         Until_Condition: None
                          Until_Log_File:
                           Until_Log_Pos: 0
                      Master_SSL_Allowed: Yes
                      Master_SSL_CA_File:
                      Master_SSL_CA_Path:
                         Master_SSL_Cert:
                       Master_SSL_Cipher:
                          Master_SSL_Key:
                   Seconds_Behind_Master: 189
           Master_SSL_Verify_Server_Cert: No
                           Last_IO_Errno: 0
                           Last_IO_Error:
                          Last_SQL_Errno: 0
                          Last_SQL_Error:
             Replicate_Ignore_Server_Ids:
                        Master_Server_Id: 180367475
                          Master_SSL_Crl:
                      Master_SSL_Crlpath:
                              Using_Gtid: Slave_Pos
                             Gtid_IO_Pos: 180367475-180367475-979771885,171978805-171978805-2291133789,180363389-180363389-11242988,171970594-171970594-1063329989,180363370-180363370-7087924,171978904-171978904-198543478,171974883-171974883-1921892293,171974662-171974662-1711800425
                 Replicate_Do_Domain_Ids:
             Replicate_Ignore_Domain_Ids:
                           Parallel_Mode: optimistic
                               SQL_Delay: 0
                     SQL_Remaining_Delay: NULL
                 Slave_SQL_Running_State: closing tables
                        Slave_DDL_Groups: 0
          Slave_Non_Transactional_Groups: 0
              Slave_Transactional_Groups: 6448397
          1 row in set (0.034 sec)
           
           
          
          

          marostegui Manuel Arostegui added a comment -

          > At this point, I am very suspicious on how corrupted all this could be
          > At this point I am not sure anymore whether this is trustable or not.

          I understand this sentiment. But I think it will be useful for you to move past it and get an understanding of what is really going on.

          From what I see here, there is so far just one bug, the incorrectly worded error message (which is sufficiently wrong that it confused both of us). I believe the real problem is the configuration of domain_ids, which is doing something very different from what you expect it to do.

          The TL;DR of it is that by giving every server a distinct domain_id, you have configured GTID for a topology where every server is an active master that is potentially replicating to every other server. This is obviously not what is intended, and most of these potential replication streams are not set up. Therefore, the GTID behaviour of the server is different from what you expect. Let me see if I can illustrate it:

          > Interestingly all come back to normal after doing [disabling GTID]

          Let's start from this, as this is familiar to you. Here, there is a single replication stream. The slave points to a specific location in the master binlog files, and from there it will apply events one after the other in sequence. This matches the non-multisource topology A -> B -> Slave

          Now imagine that we delete all data on the slave and do a RESET SLAVE. We can then reconnect the slave to the master, and it will fetch all events from the start of the binlog (RESET SLAVE makes the position empty). Eventually the slave will be restored to the original state when all events are applied. Note that if binlog files were purged on the master, this will silently lose data; this is something that GTID prevents by instead giving an error when the slave requests data that is missing on the master.
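
          As a rough sketch of that sequence (the host name, credentials and binlog file name below are made up for illustration, not taken from this system):

            -- On the re-initialised slave: forget the old position and point it
            -- explicitly at the oldest binlog the master still has.
            STOP SLAVE;
            RESET SLAVE;
            CHANGE MASTER TO
              MASTER_HOST='master.example',          -- hypothetical master
              MASTER_USER='repl', MASTER_PASSWORD='...',
              MASTER_LOG_FILE='master-bin.000001',   -- oldest file still on the master
              MASTER_LOG_POS=4,
              MASTER_USE_GTID=no;
            START SLAVE;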

          If you were to RESET SLAVE without deleting any data, then you would instead need to RESET MASTER at a point where slave and master are in sync. Otherwise the slave will duplicate old events when it replicates from the start.
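
          A sketch of that variant, assuming the slave is known to be fully caught up first (again, only an illustration, with a hypothetical binlog base name):

            -- On the slave, after it has applied everything from the master:
            STOP SLAVE;
            RESET SLAVE;
            -- On the master, while no new writes are arriving:
            RESET MASTER;
            -- Back on the slave; the master's binlog history now starts empty,
            -- so replicating from the beginning does not duplicate old events:
            CHANGE MASTER TO MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=4;
            START SLAVE;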

          Next, consider GTID replication with a simple gtid_slave_pos=0-1-100. This is the same as non-GTID replication, the slave points to a specific location in the master binlog files (just after transaction 0-1-100). It will apply events from there in sequence. Here also we can delete all data on the slave, set gtid_slave_pos to the empty position, and reconnect to the master. The empty position makes the slave start from the beginning of the binlog on the master and eventually restore the state by applying all events.
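
          The GTID equivalent is just as short; for example (the position here is made up for illustration):

            STOP SLAVE;
            SET GLOBAL gtid_slave_pos = '0-1-100';   -- or '' to start from the very beginning of the master's binlog
            CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
            START SLAVE;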

          All of this is with a single domain id. But imagine we have multi-source replication 1 -> 3, 2 -> 3, 1 -> 4, 2 -> 4. Here server3 has two different positions, one in each of server1 and server2's binlogs. In non-GTID replication there is no way for server3 to replicate from server4 as a master, because in general there is no single position in server4's binlog that corresponds to both of server3's current positions.

          But GTID can do this using the domain_id. Each domain_id in the gtid_slave_pos is essentially an independent replication stream. In this case, two domain_ids will be configured, say 1 and 2. Now server3 can start replicating from two different positions in server4's binlog. And it will continue keeping track of both independent positions so it can later move back to replicating from server1 and server2 individually again.
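
          For illustration, a minimal multi-source sketch on server3 (the connection names, host names and positions are invented):

            SET GLOBAL gtid_slave_pos = '1-1-5000,2-2-7000';   -- one position per domain, i.e. per independent stream
            CHANGE MASTER 'from_server1' TO MASTER_HOST='server1', MASTER_USER='repl', MASTER_USE_GTID=slave_pos;
            CHANGE MASTER 'from_server2' TO MASTER_HOST='server2', MASTER_USER='repl', MASTER_USE_GTID=slave_pos;
            START ALL SLAVES;

          The same two-domain gtid_slave_pos is what would later let server3 point both streams at a single master such as server4 with one CHANGE MASTER ... MASTER_USE_GTID=slave_pos.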

          Finally, suppose we remove server2 (and all its data) from the system (maybe a customer quit or something), leaving us with the topology 1 -> 4 -> 3. Now if we try to make server1 a slave of server3, 4 -> 3 -> 1, we get a problem. server1 is missing all of the data from the domain 2. It will connect to server3 with empty position in domain 2, and will need to replicate all events ever generated on server2. This will not be possible if any binlog files were ever purged on server3.

          To do this, we delete the domain_id=2 from the binlog state of server3: FLUSH BINARY LOGS DELETE_DOMAIN_ID=(2). We do this on the master, because we want to change the history and pretend that domain 2 never existed in the system.
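
          In practice that could look like this (a sketch; it assumes any binlog files that still contain domain 2 have already been purged):

            -- On server3:
            SELECT @@GLOBAL.gtid_binlog_state;          -- domain 2 still listed here
            FLUSH BINARY LOGS DELETE_DOMAIN_ID = (2);
            SELECT @@GLOBAL.gtid_binlog_state;          -- domain 2 no longer listed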

          Then later we might want to move server4 to replicate from server3: 3 -> 4, 3 -> 1. In this case, we want to remove the domain 2 from the slave position of server 4:

          SET GLOBAL gtid_slave_pos = SUBSTRING(REGEXP_REPLACE(CONCAT(",", @@GLOBAL.gtid_slave_pos),",2-[0-9]+-[0-9]+", ""), 2);
          

          You see, here DELETE_DOMAIN_ID corresponds to RESET MASTER, and SET gtid_slave_pos corresponds to RESET SLAVE in the non-GTID case. But we are doing it only for domain 2, which we want to remove, while leaving the other domain intact. So it should be familiar to you once you start thinking of each domain as an independent replication stream.

          In your case, because you configured so many domain_ids, MariaDB GTID is obliged to act as if you have a complex multi-source topology. To remove a domain, you therefore have to do it in the same way as in the example: remove it from the master's gtid_binlog_state at level N with DELETE_DOMAIN_ID, and at the same time remove it from the slave position at level (N+1). Note that you have to compare the gtid_binlog_state on the master with the gtid_slave_pos on the slave to understand what needs to be in sync. These changes have to be synced so the slave can connect to the master. But you can do it one level at a time, from the bottom up.
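
          As a sketch of one such step, using 171978805 purely as an example domain to retire on one master/slave pair (adapt the value, and check the binlog files and positions yourself first):

            -- On the slave at level (N+1): stop replication and drop the domain from its position
            STOP SLAVE;
            SET GLOBAL gtid_slave_pos = SUBSTRING(REGEXP_REPLACE(CONCAT(",", @@GLOBAL.gtid_slave_pos),
                                                                 ",171978805-[0-9]+-[0-9]+", ""), 2);
            -- On the master at level N: first make sure no remaining binlog file still contains that domain
            PURGE BINARY LOGS TO '<oldest file that no longer contains it>';   -- placeholder name
            FLUSH BINARY LOGS DELETE_DOMAIN_ID = (171978805);
            -- Back on the slave
            START SLAVE;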

          Overall, anything you can do with non-GTID replication, you can do with GTID replication using a single domain only. You can do even more advanced things with GTID and multiple domain_id, but it is complex and you need to understand the concept of multiple independent replication streams to work effectively with this.

          I wrote this up as a trial for writing some general article that explains this succinctly. It's still longer than I would like :-/. Maybe it can still be helpful, and any feedback is welcome. Also feel free to reach me on IRC #maria or Zulip for a more interactive chat.

          knielsen Kristian Nielsen added a comment -

          Back to the bug that DELETE_DOMAIN_ID gives the wrong error message when the domain is not present in the gtid_binlog_state. This really is very confusing, and I'd like to get it fixed.

          I tried to reproduce on 10.4.26, but in my test it worked as expected:

          RESET MASTER;
          SET GLOBAL gtid_binlog_state = '0-171978825-1006915085,171970577-171970577-53037843,171970745-171970745-2370743389';
          FLUSH BINARY LOGS;
          SHOW BINARY LOGS;
          Log_name	File_size
          master-bin.000001	423
          master-bin.000002	419
          PURGE BINARY LOGS TO 'master-bin.000002';
          SHOW BINARY LOGS;
          Log_name	File_size
          master-bin.000002	419
          SHOW BINLOG EVENTS IN 'master-bin.000002';
          Log_name	Pos	Event_type	Server_id	End_log_pos	Info
          master-bin.000002	4	Format_desc	1	256	Server ver: 10.4.26-MariaDB-debug-log, Binlog ver: 4
          master-bin.000002	256	Gtid_list	1	331	[171970577-171970577-53037843,0-171978825-1006915085,171970745-171970745-2370743389]
          master-bin.000002	331	Binlog_checkpoint	1	375	master-bin.000001
          master-bin.000002	375	Binlog_checkpoint	1	419	master-bin.000002
          FLUSH BINARY LOGS DELETE_DOMAIN_ID=(171970595);
          Warnings:
          Warning	1076	The gtid domain being deleted ('171970595') is not in the current binlog state
          

          No wrong error message, and a sensible warning.

          And looking at the code, it seems impossible to get this error for a non-existing domain (though as we know, "impossible" is only until the "aha"-moment where the real problem is discovered):

            for (ulong i= 0; i < ids->elements; i++) {
              elem= (rpl_binlog_state::element *) my_hash_search(&hash, (const uchar *) ptr_domain_id, 0);
              if (!elem) {
                continue;
              }
              for (not_match= true, k= 0; k < elem->hash.records; k++)
                  ...
              if (not_match) {
                sprintf(errbuf, "binlog files may contain gtids from the domain ('%u') "
                        "being deleted. Make sure to first purge those files",

          If the domain is not present in the gtid_binlog_state, `elem` will be NULL and the `continue` is hit, skipping the error message that you saw...

          So I'm not sure ... can you still reproduce it easily in the test environment? Could we try with a custom server build with some extra logging, or something?
          Or maybe I will think of something clever if I sleep on it.

          - Kristian.

          knielsen Kristian Nielsen added a comment -
          julien.fritsch Julien Fritsch made changes -
          Fix Version/s 10.9 [ 26905 ]
          julien.fritsch Julien Fritsch made changes -
          Fix Version/s 10.10 [ 27530 ]
          julien.fritsch Julien Fritsch made changes -
          Fix Version/s 10.4 [ 22408 ]

          People

            bnestere Brandon Nesterenko
            marostegui Manuel Arostegui
            Votes: 1
            Watchers: 4

