[MDEV-30386] flush binary logs delete_domain_id not deleting specific non present domain_id - Jira

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.4.26
Fix Version/s: 10.5, 10.6, 10.11
Component/s: Replication
Labels:
- binlog
- flush
- gtid
Environment:
debian bullseye

Description

We are trying to clean up the output of gtid_slave_pos so we can use GTID to move slaves around with Orchestrator without them complaining about positions not existing on binary logs (as they might be leftovers from previous masters, no longer existing on those replication chains.

While cleaning them up via flush binary logs delete_domain_id we are running into some errors about some existing GTID domains currently being present on binary logs, which is not the case.

For this example we have
db1195 - master
db1117 - slave

db1117 is the one we want to clean up and leave only replicating with db1195 domain's id

root@db1117.eqiad.wmnet[(none)]> select @@gtid_domain_id;

+------------------+

| @@gtid_domain_id |

+------------------+

|        171966479 |

+------------------+

1 row in set (0.001 sec)

root@db1117.eqiad.wmnet[(none)]>  select @@gtid_slave_pos\G

*************************** 1. row ***************************

@@gtid_slave_pos: 0-171966484-2731336216,171966484-171966484-7582228474,171966512-171966512-1959889139,171966556-171966556-1824632116,171966562-171966562-962004828,171970746-171970746-808478946,171974733-171974733-2008457625,171974884-171974884-9104192396,171978763-171978763-83528410,172001292-172001292-1428114291

1 row in set (0.001 sec)

root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171966484);

ERROR 1076 (HY000): Could not delete gtid domain. Reason: binlog files may contain gtids from the domain ('171966484') being deleted. Make sure to first purge those files.

root@db1117.eqiad.wmnet[(none)]>

This host has one master (db1195) but also has log-slave-updates enabled for backup purposes.

root@db1117.eqiad.wmnet[(none)]> show binary logs;

+-------------------+------------+

| Log_name          | File_size  |

+-------------------+------------+

| db1117-bin.016856 |    9845469 |

| db1117-bin.016857 | 1048576295 |

| db1117-bin.016858 | 1048781927 |

| db1117-bin.016859 | 1048659990 |

| db1117-bin.016860 |  125787283 |

+-------------------+------------+

5 rows in set (0.001 sec)

This is its replication data:

root@db1117.eqiad.wmnet[(none)]> show slave status\G

*************************** 1. row ***************************

                Slave_IO_State: Waiting for master to send event

                   Master_Host: db1195.eqiad.wmnet

                   Master_User: repl

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: db1195-bin.001570

           Read_Master_Log_Pos: 339370363

                Relay_Log_File: db1117-relay-bin.000010

                 Relay_Log_Pos: 339370663

         Relay_Master_Log_File: db1195-bin.001570

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 0

                    Last_Error:

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 339370363

               Relay_Log_Space: 339371021

               Until_Condition: None

                Until_Log_File:

                 Until_Log_Pos: 0

            Master_SSL_Allowed: Yes

            Master_SSL_CA_File:

            Master_SSL_CA_Path:

               Master_SSL_Cert:

             Master_SSL_Cipher:

                Master_SSL_Key:

         Seconds_Behind_Master: 0

 Master_SSL_Verify_Server_Cert: No

                 Last_IO_Errno: 0

                 Last_IO_Error:

                Last_SQL_Errno: 0

                Last_SQL_Error:

   Replicate_Ignore_Server_Ids:

              Master_Server_Id: 172001292

                Master_SSL_Crl:

            Master_SSL_Crlpath:

                    Using_Gtid: Slave_Pos

                   Gtid_IO_Pos: 0-171966484-2731336216,171966484-171966484-7582228474,171974733-171974733-2008457625,171966562-171966562-962004828,171970746-171970746-808478946,171966512-171966512-1959889139,171974884-171974884-9104192396,171966556-171966556-1824632116,171978763-171978763-83528410,172001292-172001292-1428128537

       Replicate_Do_Domain_Ids:

   Replicate_Ignore_Domain_Ids:

                 Parallel_Mode: conservative

                     SQL_Delay: 0

           SQL_Remaining_Delay: NULL

       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it

              Slave_DDL_Groups: 21

Slave_Non_Transactional_Groups: 111

    Slave_Transactional_Groups: 1216438

1 row in set (0.001 sec)

If we scan the binlogs (and relay log of this slave, just in case) we can see that domain_id isn't present on any.
We use MIXED as binlog_format

root@db1117:/srv/sqldata.m1# ls -lh | egrep "bin"

-rw-rw---- 1 mysql mysql  9.4M Jan 11 20:31 db1117-bin.016856

-rw-rw---- 1 mysql mysql 1001M Jan 11 22:31 db1117-bin.016857

-rw-rw---- 1 mysql mysql 1001M Jan 11 22:52 db1117-bin.016858

-rw-rw---- 1 mysql mysql 1001M Jan 11 23:20 db1117-bin.016859

-rw-rw---- 1 mysql mysql  213M Jan 11 23:24 db1117-bin.016860

-rw-rw---- 1 mysql mysql   100 Jan 11 23:20 db1117-bin.index

-rw-rw---- 1 mysql mysql   358 Jan 11 23:16 db1117-relay-bin.000009

-rw-rw---- 1 mysql mysql  380M Jan 11 23:24 db1117-relay-bin.000010

-rw-rw---- 1 mysql mysql    52 Jan 11 23:16 db1117-relay-bin.index

root@db1117:/srv/sqldata.m1# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done

db1117-bin.016856

db1117-bin.016857

db1117-bin.016858

db1117-bin.016859

db1117-bin.016860

db1117-relay-bin.000009

db1117-relay-bin.000010

The same result on its master

root@db1195.eqiad.wmnet[(none)]> select @@gtid_domain_id;

+------------------+

| @@gtid_domain_id |

+------------------+

|        172001292 |

+------------------+

1 row in set (0.001 sec)

root@db1195:/srv/sqldata#  ls | grep bin | grep -v index

db1195-bin.001565

db1195-bin.001566

db1195-bin.001567

db1195-bin.001568

db1195-bin.001569

db1195-bin.001570

root@db1195:/srv/sqldata# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done

db1195-bin.001565

db1195-bin.001566

db1195-bin.001567

db1195-bin.001568

db1195-bin.001569

db1195-bin.001570

Logs have been flushed/rotated just in case on both hosts.

On sql/rpl_gtid.cc we can see this piece of code

    for (not_match= true, k= 0; k < elem->hash.records; k++)

      rpl_gtid *d_gtid= (rpl_gtid *)my_hash_element(&elem->hash, k);

      for (ulong l= 0; l < glev->count && not_match; l++)

        not_match= !(*d_gtid == glev->list[l]);

    if (not_match)

      sprintf(errbuf, "binlog files may contain gtids from the domain ('%u') "

              "being deleted. Make sure to first purge those files",

              *ptr_domain_id);

      errmsg= errbuf;

      goto end;

Which per the following comment seems to be getting the list of domain_ids from GTID_LIST_EVENT (https://mariadb.com/kb/en/gtid_list_event/):

    Gtid list is supposed to come from a binlog's Gtid_list event and

    therefore should be a subset of the current binlog state. That is

    for every domain in the list the binlog state contains a gtid with

    sequence number not less than that of the list.

    Exceptions of this inclusion rule are:

      A. the list may still refer to gtids from already deleted domains.

         Files containing them must have been purged whereas the file

         with the list is not yet.

      B. out of order groups were injected

      C. manually build list of binlog files violating the inclusion

         constraint.

    While A is a normal case (not necessarily distinguishable from C though),

    B and C may require the user's attention so any (incl the A's suspected)

    inconsistency is diagnosed and *warned*.

*/

Using that documentation page, I would have expected to find that domain_id present on any of the existing binlogs, but as can be seen above, it is not.
In fact, that gtid_domain_id doesn't belong to any of the hosts present on this replication topology.

Some other do work fine when being cleaned up on that same host:

root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171974884);

Query OK, 0 rows affected (0.002 sec)

root@db1117.eqiad.wmnet[(none)]>

The same happens when trying to delete the default gtid_domain_id 0 which has never been present on this system as we set gtid_domain_id to a different value when we set up the hosts.

Attachments

Activity

No workflow transitions have been executed yet.

People

Assignee:: Brandon Nesterenko

Reporter:: Manuel Arostegui

Votes:: 1 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2023-01-11 23:40

Updated:: 2024-09-10 14:58

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server