MariaDB Server
MDEV-30386

flush binary logs delete_domain_id not deleting specific non-present domain_id

Details

    Description

      We are trying to clean up the output of gtid_slave_pos so we can use GTID to move slaves around with Orchestrator without it complaining about positions not existing in the binary logs (they may be leftovers from previous masters that no longer exist in those replication chains).

      While cleaning them up via flush binary logs delete_domain_id we are running into errors claiming that certain GTID domains are still present in the binary logs, which is not the case.

      For this example we have
      db1195 - master
      db1117 - slave

      db1117 is the host we want to clean up so that it only keeps db1195's domain id in its replication state.

      root@db1117.eqiad.wmnet[(none)]> select @@gtid_domain_id;
      +------------------+
      | @@gtid_domain_id |
      +------------------+
      |        171966479 |
      +------------------+
      1 row in set (0.001 sec)
       
      root@db1117.eqiad.wmnet[(none)]>  select @@gtid_slave_pos\G
      *************************** 1. row ***************************
      @@gtid_slave_pos: 0-171966484-2731336216,171966484-171966484-7582228474,171966512-171966512-1959889139,171966556-171966556-1824632116,171966562-171966562-962004828,171970746-171970746-808478946,171974733-171974733-2008457625,171974884-171974884-9104192396,171978763-171978763-83528410,172001292-172001292-1428114291
      1 row in set (0.001 sec)
       
      root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171966484);
      ERROR 1076 (HY000): Could not delete gtid domain. Reason: binlog files may contain gtids from the domain ('171966484') being deleted. Make sure to first purge those files.
      root@db1117.eqiad.wmnet[(none)]>
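
      For context, this is roughly the sequence we are attempting, following the error message's own suggestion to purge files first. This is only a sketch; the file name below is hypothetical:

      -- Rotate, purge the older files, then drop the unused domain
      FLUSH BINARY LOGS;
      PURGE BINARY LOGS TO 'db1117-bin.016860';
      FLUSH BINARY LOGS DELETE_DOMAIN_ID=(171966484);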
      

      This host has one master (db1195) but also has log-slave-updates enabled for backup purposes.

      root@db1117.eqiad.wmnet[(none)]> show binary logs;
      +-------------------+------------+
      | Log_name          | File_size  |
      +-------------------+------------+
      | db1117-bin.016856 |    9845469 |
      | db1117-bin.016857 | 1048576295 |
      | db1117-bin.016858 | 1048781927 |
      | db1117-bin.016859 | 1048659990 |
      | db1117-bin.016860 |  125787283 |
      +-------------------+------------+
      5 rows in set (0.001 sec)
      

      This is its replication data:

      root@db1117.eqiad.wmnet[(none)]> show slave status\G
      *************************** 1. row ***************************
                      Slave_IO_State: Waiting for master to send event
                         Master_Host: db1195.eqiad.wmnet
                         Master_User: repl
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: db1195-bin.001570
                 Read_Master_Log_Pos: 339370363
                      Relay_Log_File: db1117-relay-bin.000010
                       Relay_Log_Pos: 339370663
               Relay_Master_Log_File: db1195-bin.001570
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 339370363
                     Relay_Log_Space: 339371021
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: Yes
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: 0
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error:
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 172001292
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos: 0-171966484-2731336216,171966484-171966484-7582228474,171974733-171974733-2008457625,171966562-171966562-962004828,171970746-171970746-808478946,171966512-171966512-1959889139,171974884-171974884-9104192396,171966556-171966556-1824632116,171978763-171978763-83528410,172001292-172001292-1428128537
             Replicate_Do_Domain_Ids:
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 21
      Slave_Non_Transactional_Groups: 111
          Slave_Transactional_Groups: 1216438
      1 row in set (0.001 sec)
      

      If we scan the binlogs (and the relay logs of this slave, just in case) we can see that the domain_id isn't present in any of them.
      We use MIXED as binlog_format.

       
      root@db1117:/srv/sqldata.m1# ls -lh | egrep "bin"
      -rw-rw---- 1 mysql mysql  9.4M Jan 11 20:31 db1117-bin.016856
      -rw-rw---- 1 mysql mysql 1001M Jan 11 22:31 db1117-bin.016857
      -rw-rw---- 1 mysql mysql 1001M Jan 11 22:52 db1117-bin.016858
      -rw-rw---- 1 mysql mysql 1001M Jan 11 23:20 db1117-bin.016859
      -rw-rw---- 1 mysql mysql  213M Jan 11 23:24 db1117-bin.016860
      -rw-rw---- 1 mysql mysql   100 Jan 11 23:20 db1117-bin.index
      -rw-rw---- 1 mysql mysql   358 Jan 11 23:16 db1117-relay-bin.000009
      -rw-rw---- 1 mysql mysql  380M Jan 11 23:24 db1117-relay-bin.000010
      -rw-rw---- 1 mysql mysql    52 Jan 11 23:16 db1117-relay-bin.index
       
      root@db1117:/srv/sqldata.m1# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done
      db1117-bin.016856
      db1117-bin.016857
      db1117-bin.016858
      db1117-bin.016859
      db1117-bin.016860
      db1117-relay-bin.000009
      db1117-relay-bin.000010
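
      Another way to check this (not shown above) is to look at the Gtid_list event near the start of each binlog file, since, per the code quoted further down, that list seems to be what the DELETE_DOMAIN_ID check compares against. A sketch against the first file listed above:

      SHOW BINLOG EVENTS IN 'db1117-bin.016856' LIMIT 5;
      -- The Gtid_list event in this output shows the domains the server still
      -- carried in its binlog state when the file was created.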
      

      The same result on its master:

      root@db1195.eqiad.wmnet[(none)]> select @@gtid_domain_id;
      +------------------+
      | @@gtid_domain_id |
      +------------------+
      |        172001292 |
      +------------------+
      1 row in set (0.001 sec)
       
       
      root@db1195:/srv/sqldata#  ls | grep bin | grep -v index
      db1195-bin.001565
      db1195-bin.001566
      db1195-bin.001567
      db1195-bin.001568
      db1195-bin.001569
      db1195-bin.001570
       
      root@db1195:/srv/sqldata# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done
      db1195-bin.001565
      db1195-bin.001566
      db1195-bin.001567
      db1195-bin.001568
      db1195-bin.001569
      db1195-bin.001570
      

      Logs have been flushed/rotated just in case on both hosts.

      In sql/rpl_gtid.cc we can see this piece of code:

          for (not_match= true, k= 0; k < elem->hash.records; k++)
          {
            rpl_gtid *d_gtid= (rpl_gtid *)my_hash_element(&elem->hash, k);
            for (ulong l= 0; l < glev->count && not_match; l++)
              not_match= !(*d_gtid == glev->list[l]);
          }
       
          if (not_match)
          {
            sprintf(errbuf, "binlog files may contain gtids from the domain ('%u') "
                    "being deleted. Make sure to first purge those files",
                    *ptr_domain_id);
            errmsg= errbuf;
            goto end;
          }
      

      Per the following comment, the list of domain_ids seems to come from the Gtid_list event (https://mariadb.com/kb/en/gtid_list_event/):

          Gtid list is supposed to come from a binlog's Gtid_list event and
          therefore should be a subset of the current binlog state. That is
          for every domain in the list the binlog state contains a gtid with
          sequence number not less than that of the list.
          Exceptions of this inclusion rule are:
            A. the list may still refer to gtids from already deleted domains.
               Files containing them must have been purged whereas the file
               with the list is not yet.
            B. out of order groups were injected
            C. manually build list of binlog files violating the inclusion
               constraint.
          While A is a normal case (not necessarily distinguishable from C though),
          B and C may require the user's attention so any (incl the A's suspected)
          inconsistency is diagnosed and *warned*.
        */
      

      Based on that documentation page, I would have expected to find that domain_id in at least one of the existing binlogs, but as can be seen above, it is not there.
      In fact, that gtid_domain_id doesn't belong to any of the hosts in this replication topology.

      Some other domain_ids can be cleaned up without issue on that same host:

      root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171974884);
      Query OK, 0 rows affected (0.002 sec)
       
      root@db1117.eqiad.wmnet[(none)]>
      

      The same happens when trying to delete the default gtid_domain_id 0, which has never been present on this system, as we set gtid_domain_id to a different value when we set up the hosts.
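
      One extra data point we could capture (not included above) is the server's own view of its binlog state, which lists the domains it still tracks:

      SELECT @@global.gtid_binlog_state;
      -- A domain listed here but absent from the binlog files themselves might
      -- indicate a stale state entry rather than actual binlog contents.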
