[MDEV-30386] flush binary logs delete_domain_id not deleting specific non present domain_id - Jira

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.4.26
Fix Version/s: 10.5, 10.6, 10.11
Component/s: Replication
Labels:
- binlog
- flush
- gtid
Environment:
debian bullseye

Description

We are trying to clean up the output of gtid_slave_pos so we can use GTID to move slaves around with Orchestrator without them complaining about positions not existing on binary logs (as they might be leftovers from previous masters, no longer existing on those replication chains.

While cleaning them up via flush binary logs delete_domain_id we are running into some errors about some existing GTID domains currently being present on binary logs, which is not the case.

For this example we have
db1195 - master
db1117 - slave

db1117 is the one we want to clean up and leave only replicating with db1195 domain's id

root@db1117.eqiad.wmnet[(none)]> select @@gtid_domain_id;

+------------------+

| @@gtid_domain_id |

+------------------+

|        171966479 |

+------------------+

1 row in set (0.001 sec)

root@db1117.eqiad.wmnet[(none)]>  select @@gtid_slave_pos\G

*************************** 1. row ***************************

@@gtid_slave_pos: 0-171966484-2731336216,171966484-171966484-7582228474,171966512-171966512-1959889139,171966556-171966556-1824632116,171966562-171966562-962004828,171970746-171970746-808478946,171974733-171974733-2008457625,171974884-171974884-9104192396,171978763-171978763-83528410,172001292-172001292-1428114291

1 row in set (0.001 sec)

root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171966484);

ERROR 1076 (HY000): Could not delete gtid domain. Reason: binlog files may contain gtids from the domain ('171966484') being deleted. Make sure to first purge those files.

root@db1117.eqiad.wmnet[(none)]>

This host has one master (db1195) but also has log-slave-updates enabled for backup purposes.

root@db1117.eqiad.wmnet[(none)]> show binary logs;

+-------------------+------------+

| Log_name          | File_size  |

+-------------------+------------+

| db1117-bin.016856 |    9845469 |

| db1117-bin.016857 | 1048576295 |

| db1117-bin.016858 | 1048781927 |

| db1117-bin.016859 | 1048659990 |

| db1117-bin.016860 |  125787283 |

+-------------------+------------+

5 rows in set (0.001 sec)

This is its replication data:

root@db1117.eqiad.wmnet[(none)]> show slave status\G

*************************** 1. row ***************************

                Slave_IO_State: Waiting for master to send event

                   Master_Host: db1195.eqiad.wmnet

                   Master_User: repl

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: db1195-bin.001570

           Read_Master_Log_Pos: 339370363

                Relay_Log_File: db1117-relay-bin.000010

                 Relay_Log_Pos: 339370663

         Relay_Master_Log_File: db1195-bin.001570

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 0

                    Last_Error:

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 339370363

               Relay_Log_Space: 339371021

               Until_Condition: None

                Until_Log_File:

                 Until_Log_Pos: 0

            Master_SSL_Allowed: Yes

            Master_SSL_CA_File:

            Master_SSL_CA_Path:

               Master_SSL_Cert:

             Master_SSL_Cipher:

                Master_SSL_Key:

         Seconds_Behind_Master: 0

 Master_SSL_Verify_Server_Cert: No

                 Last_IO_Errno: 0

                 Last_IO_Error:

                Last_SQL_Errno: 0

                Last_SQL_Error:

   Replicate_Ignore_Server_Ids:

              Master_Server_Id: 172001292

                Master_SSL_Crl:

            Master_SSL_Crlpath:

                    Using_Gtid: Slave_Pos

                   Gtid_IO_Pos: 0-171966484-2731336216,171966484-171966484-7582228474,171974733-171974733-2008457625,171966562-171966562-962004828,171970746-171970746-808478946,171966512-171966512-1959889139,171974884-171974884-9104192396,171966556-171966556-1824632116,171978763-171978763-83528410,172001292-172001292-1428128537

       Replicate_Do_Domain_Ids:

   Replicate_Ignore_Domain_Ids:

                 Parallel_Mode: conservative

                     SQL_Delay: 0

           SQL_Remaining_Delay: NULL

       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it

              Slave_DDL_Groups: 21

Slave_Non_Transactional_Groups: 111

    Slave_Transactional_Groups: 1216438

1 row in set (0.001 sec)

If we scan the binlogs (and relay log of this slave, just in case) we can see that domain_id isn't present on any.
We use MIXED as binlog_format

root@db1117:/srv/sqldata.m1# ls -lh | egrep "bin"

-rw-rw---- 1 mysql mysql  9.4M Jan 11 20:31 db1117-bin.016856

-rw-rw---- 1 mysql mysql 1001M Jan 11 22:31 db1117-bin.016857

-rw-rw---- 1 mysql mysql 1001M Jan 11 22:52 db1117-bin.016858

-rw-rw---- 1 mysql mysql 1001M Jan 11 23:20 db1117-bin.016859

-rw-rw---- 1 mysql mysql  213M Jan 11 23:24 db1117-bin.016860

-rw-rw---- 1 mysql mysql   100 Jan 11 23:20 db1117-bin.index

-rw-rw---- 1 mysql mysql   358 Jan 11 23:16 db1117-relay-bin.000009

-rw-rw---- 1 mysql mysql  380M Jan 11 23:24 db1117-relay-bin.000010

-rw-rw---- 1 mysql mysql    52 Jan 11 23:16 db1117-relay-bin.index

root@db1117:/srv/sqldata.m1# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done

db1117-bin.016856

db1117-bin.016857

db1117-bin.016858

db1117-bin.016859

db1117-bin.016860

db1117-relay-bin.000009

db1117-relay-bin.000010

The same result on its master

root@db1195.eqiad.wmnet[(none)]> select @@gtid_domain_id;

+------------------+

| @@gtid_domain_id |

+------------------+

|        172001292 |

+------------------+

1 row in set (0.001 sec)

root@db1195:/srv/sqldata#  ls | grep bin | grep -v index

db1195-bin.001565

db1195-bin.001566

db1195-bin.001567

db1195-bin.001568

db1195-bin.001569

db1195-bin.001570

root@db1195:/srv/sqldata# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done

db1195-bin.001565

db1195-bin.001566

db1195-bin.001567

db1195-bin.001568

db1195-bin.001569

db1195-bin.001570

Logs have been flushed/rotated just in case on both hosts.

On sql/rpl_gtid.cc we can see this piece of code

    for (not_match= true, k= 0; k < elem->hash.records; k++)

      rpl_gtid *d_gtid= (rpl_gtid *)my_hash_element(&elem->hash, k);

      for (ulong l= 0; l < glev->count && not_match; l++)

        not_match= !(*d_gtid == glev->list[l]);

    if (not_match)

      sprintf(errbuf, "binlog files may contain gtids from the domain ('%u') "

              "being deleted. Make sure to first purge those files",

              *ptr_domain_id);

      errmsg= errbuf;

      goto end;

Which per the following comment seems to be getting the list of domain_ids from GTID_LIST_EVENT (https://mariadb.com/kb/en/gtid_list_event/):

    Gtid list is supposed to come from a binlog's Gtid_list event and

    therefore should be a subset of the current binlog state. That is

    for every domain in the list the binlog state contains a gtid with

    sequence number not less than that of the list.

    Exceptions of this inclusion rule are:

      A. the list may still refer to gtids from already deleted domains.

         Files containing them must have been purged whereas the file

         with the list is not yet.

      B. out of order groups were injected

      C. manually build list of binlog files violating the inclusion

         constraint.

    While A is a normal case (not necessarily distinguishable from C though),

    B and C may require the user's attention so any (incl the A's suspected)

    inconsistency is diagnosed and *warned*.

*/

Using that documentation page, I would have expected to find that domain_id present on any of the existing binlogs, but as can be seen above, it is not.
In fact, that gtid_domain_id doesn't belong to any of the hosts present on this replication topology.

Some other do work fine when being cleaned up on that same host:

root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171974884);

Query OK, 0 rows affected (0.002 sec)

root@db1117.eqiad.wmnet[(none)]>

The same happens when trying to delete the default gtid_domain_id 0 which has never been present on this system as we set gtid_domain_id to a different value when we set up the hosts.

Attachments

Activity

Descending order - Click to sort in ascending order

Kristian Nielsen added a comment - 2023-05-10 21:35

Back to the bug that DELETE_DOMAIN_ID gives the wrong error message when the domain is not present in the gtid_binlog_state. This really is very confusing, and I'd like to get it fixed.

I tried to reproduce on 10.4.26, but in my test it worked as expected:

RESET MASTER;

SET GLOBAL gtid_binlog_state = '0-171978825-1006915085,171970577-171970577-53037843,171970745-171970745-2370743389';

FLUSH BINARY LOGS;

SHOW BINARY LOGS;

Log_name	File_size

master-bin.000001	423

master-bin.000002	419

PURGE BINARY LOGS TO 'master-bin.000002';

SHOW BINARY LOGS;

Log_name	File_size

master-bin.000002	419

SHOW BINLOG EVENTS IN 'master-bin.000002';

Log_name	Pos	Event_type	Server_id	End_log_pos	Info

master-bin.000002	4	Format_desc	1	256	Server ver: 10.4.26-MariaDB-debug-log, Binlog ver: 4

master-bin.000002	256	Gtid_list	1	331	[171970577-171970577-53037843,0-171978825-1006915085,171970745-171970745-2370743389]

master-bin.000002	331	Binlog_checkpoint	1	375	master-bin.000001

master-bin.000002	375	Binlog_checkpoint	1	419	master-bin.000002

FLUSH BINARY LOGS DELETE_DOMAIN_ID=(171970595);

Warnings:

Warning	1076	The gtid domain being deleted ('171970595') is not in the current binlog state

No wrong error message, and a sensible warning.

And looking at the code, it seems impossible to get this error for a non-existing domain (though as we know, "impossible" is only until the "aha"-moment where the real problem is discovered):

  for (ulong i= 0; i < ids->elements; i++) {

    elem= (rpl_binlog_state::element *) my_hash_search(&hash, (const uchar *) ptr_domain_id, 0);

    if (!elem) {

      continue;

    for (not_match= true, k= 0; k < elem->hash.records; k++)

...

    if (not_match) {

      sprintf(errbuf, "binlog files may contain gtids from the domain ('%u') "

              "being deleted. Make sure to first purge those files",

If the domain is not present in the gtid_binlog_state, `elem` will be false and the `continue` is hit, skipping the error message that you saw...

So I'm not sure ... can you still reproduce it easily in the test environment? Could we try with a custom server build with some extra logging, or something?
Or maybe I will think of something clever if I sleep on it.

- Kristian.

Kristian Nielsen added a comment - 2023-05-10 21:35 Back to the bug that DELETE_DOMAIN_ID gives the wrong error message when the domain is not present in the gtid_binlog_state. This really is very confusing, and I'd like to get it fixed. I tried to reproduce on 10.4.26, but in my test it worked as expected: RESET MASTER; SET GLOBAL gtid_binlog_state = '0-171978825-1006915085,171970577-171970577-53037843,171970745-171970745-2370743389'; FLUSH BINARY LOGS; SHOW BINARY LOGS; Log_name File_size master-bin.000001 423 master-bin.000002 419 PURGE BINARY LOGS TO 'master-bin.000002'; SHOW BINARY LOGS; Log_name File_size master-bin.000002 419 SHOW BINLOG EVENTS IN 'master-bin.000002'; Log_name Pos Event_type Server_id End_log_pos Info master-bin.000002 4 Format_desc 1 256 Server ver: 10.4.26-MariaDB-debug-log, Binlog ver: 4 master-bin.000002 256 Gtid_list 1 331 [171970577-171970577-53037843,0-171978825-1006915085,171970745-171970745-2370743389] master-bin.000002 331 Binlog_checkpoint 1 375 master-bin.000001 master-bin.000002 375 Binlog_checkpoint 1 419 master-bin.000002 FLUSH BINARY LOGS DELETE_DOMAIN_ID=(171970595); Warnings: Warning 1076 The gtid domain being deleted ('171970595') is not in the current binlog state No wrong error message, and a sensible warning. And looking at the code, it seems impossible to get this error for a non-existing domain (though as we know, "impossible" is only until the "aha"-moment where the real problem is discovered): for (ulong i= 0; i < ids->elements; i++) { elem= (rpl_binlog_state::element *) my_hash_search(&hash, (const uchar *) ptr_domain_id, 0); if (!elem) { continue; for (not_match= true, k= 0; k < elem->hash.records; k++) ... if (not_match) { sprintf(errbuf, "binlog files may contain gtids from the domain ('%u') " "being deleted. Make sure to first purge those files", If the domain is not present in the gtid_binlog_state, `elem` will be false and the `continue` is hit, skipping the error message that you saw... So I'm not sure ... can you still reproduce it easily in the test environment? Could we try with a custom server build with some extra logging, or something? Or maybe I will think of something clever if I sleep on it. - Kristian.

Kristian Nielsen added a comment - 2023-05-10 11:02

> At this point, I am very suspicious on how corrupted all this could be
> At this point I am not sure anymore whether this is trustable or not.

I understand this sentiment. But I think it will be useful for you to get out of this sentiment and get an understanding what is really going on.

From what I see here, there is so far just one bug, the incorrectly worded error message (which is sufficiently wrong that it confused both of us). I believe the real problem is the configuration of domain_ids, which is doing something very differently from what you expect it to do.

The TL;DR of it is that by giving every server a distinct domain_id, you have configured GTID for a topology where every server is an active master that is potentially replicating to every other server. This is obviously not what is intended, and most of these potential replication streams are not set up. Therefore, the GTID behaviour of the server is different from what you expect. Let me see if I can illustrate it:

> Interestingly all come back to normal after doing [disabling GTID]

Let's start from this, as this is familiar to you. Here, there is a single replication stream. The slave points to a specific location in the master binlog files, and from there it will apply events one after the other in sequence. This matches the non-multisource topology A -> B -> Slave

Now imagine that we delete all data on the slave and do a RESET SLAVE. We can then reconnect the slave to the master, and it will fetch all events from the start of the binlog (RESET SLAVE makes the position empty). Eventually the slave will be restored to the original state when all events are applied. Note that if binlog files were purged on the master, this will silently lose data; this is something that GTID prevents by instead giving an error when the slave requests data that is missing on the master.

If you were to RESET SLAVE without deleting any data, then you would instead need to RESET MASTER at a point where slave and master are in sync. Otherwise the slave will duplicate old events when it replicates from the start.

Next, consider GTID replication with a simple gtid_slave_pos=0-1-100. This is the same as non-GTID replication, the slave points to a specific location in the master binlog files (just after transaction 0-1-100). It will apply events from there in sequence. Here also we can delete all data on the slave, set gtid_slave_pos to the empty position, and reconnect to the master. The empty position makes the slave start from the beginning of the binlog on the master and eventually restore the state by applying all events.

All of this is with a single domain id. But imagine we have multi-source replication 1 -> 3, 2 -> 3, 1 -> 4, 2 -> 4. Here server3 has two different positions, one in each of server1 and server2's binlogs. In non-GTID replication there is no way for server3 to replicate from server4 as a master, because in general there is no single position it server4's binlog that corresponds to both of server3's current positions.

But GTID can do this using the domain_id. Each domain_id in the gtid_slave_pos is essentially an independent replication stream. In this case, two domain_ids will be configured, say 1 and 2. Now server3 can start replicating from two different positions in server4's binlog. And it will continue keeping track of both independent positions so it can later move back to replicating from server1 and server2 individually again.

Finally, suppose we remove server2 (and all its data) from the system (maybe a customer quit or something), leaving us with the topology 1 -> 4 -> 3. Now if we try to make server1 a slave of server3, 4 -> 3 -> 1, we get a problem. server1 is missing all of the data from the domain 2. It will connect to server3 with empty position in domain 2, and will need to replicate all events ever generated on server2. This will not be possible if any binlog files were ever purged on server3.

To do this, we delete the domain_id=2 from the binlog state of server3: FLUSH BINARY LOGS DELETE_DOMAIN_ID=(2). We do this on the master, because we want to change the history and pretend that domain 2 never existed in the system.

Then later we might want to move server4 to replicate from server3: 3 -> 4, 3 -> 1. In this case, we want to remove the domain 2 from the slave position of server 4:

SET GLOBAL gtid_slave_pos = SUBSTRING(REGEXP_REPLACE(CONCAT(",", @@GLOBAL.gtid_slave_pos),",2-[0-9]+-[0-9]+", ""), 2);

You see, here DELETE_DOMAIN_ID corresponds to RESET MASTER, and SET gtid_slave_pos corresponds to RESET SLAVE in the non-GTID case. But we are doing it only in the domain 2 which we want to remove, while leaving the other domain intact. So it should be familiar to you once you start thinking of each domain as independent replication streams.

In your case, because you configured so many domain_id's, MariaDB GTID is obliged to act as if you have a complex multi-source topology. To remove a domain, you therefore have to do it in the same way as in the example: Remove it from the master binlog_state at level N with DELETE_DOMAIN_ID, and at the same time remove it from the slave position at level (N+1). Note that you have to compare the gtid_binlog_state on the master with the gtid_slave_pos on the slave to understand what needs to be in sync. These changes have to be synced so the slave can connect to the master. But you can do it one level at a time, from the bottom up.

Overall, anything you can do with non-GTID replication, you can do with GTID replication using a single domain only. You can do even more advanced things with GTID and multiple domain_id, but it is complex and you need to understand the concept of multiple independent replication streams to work effectively with this.

I wrote this up as a trail for writing some general article that explains this succinctly. It's still longer than I would like :-/. Maybe it can still be helpful, and any feedback welcome. Also feel free to reach me on IRC #maria or Zulip for more interactive chat.

Kristian Nielsen added a comment - 2023-05-10 11:02 > At this point, I am very suspicious on how corrupted all this could be > At this point I am not sure anymore whether this is trustable or not. I understand this sentiment. But I think it will be useful for you to get out of this sentiment and get an understanding what is really going on. From what I see here, there is so far just one bug, the incorrectly worded error message (which is sufficiently wrong that it confused both of us). I believe the real problem is the configuration of domain_ids, which is doing something very differently from what you expect it to do. The TL;DR of it is that by giving every server a distinct domain_id, you have configured GTID for a topology where every server is an active master that is potentially replicating to every other server. This is obviously not what is intended, and most of these potential replication streams are not set up. Therefore, the GTID behaviour of the server is different from what you expect. Let me see if I can illustrate it: > Interestingly all come back to normal after doing [disabling GTID] Let's start from this, as this is familiar to you. Here, there is a single replication stream. The slave points to a specific location in the master binlog files, and from there it will apply events one after the other in sequence. This matches the non-multisource topology A -> B -> Slave Now imagine that we delete all data on the slave and do a RESET SLAVE. We can then reconnect the slave to the master, and it will fetch all events from the start of the binlog (RESET SLAVE makes the position empty). Eventually the slave will be restored to the original state when all events are applied. Note that if binlog files were purged on the master, this will silently lose data; this is something that GTID prevents by instead giving an error when the slave requests data that is missing on the master. If you were to RESET SLAVE without deleting any data, then you would instead need to RESET MASTER at a point where slave and master are in sync. Otherwise the slave will duplicate old events when it replicates from the start. Next, consider GTID replication with a simple gtid_slave_pos=0-1-100. This is the same as non-GTID replication, the slave points to a specific location in the master binlog files (just after transaction 0-1-100). It will apply events from there in sequence. Here also we can delete all data on the slave, set gtid_slave_pos to the empty position, and reconnect to the master. The empty position makes the slave start from the beginning of the binlog on the master and eventually restore the state by applying all events. All of this is with a single domain id. But imagine we have multi-source replication 1 -> 3, 2 -> 3, 1 -> 4, 2 -> 4. Here server3 has two different positions, one in each of server1 and server2's binlogs. In non-GTID replication there is no way for server3 to replicate from server4 as a master, because in general there is no single position it server4's binlog that corresponds to both of server3's current positions. But GTID can do this using the domain_id. Each domain_id in the gtid_slave_pos is essentially an independent replication stream. In this case, two domain_ids will be configured, say 1 and 2. Now server3 can start replicating from two different positions in server4's binlog. And it will continue keeping track of both independent positions so it can later move back to replicating from server1 and server2 individually again. Finally, suppose we remove server2 (and all its data) from the system (maybe a customer quit or something), leaving us with the topology 1 -> 4 -> 3. Now if we try to make server1 a slave of server3, 4 -> 3 -> 1, we get a problem. server1 is missing all of the data from the domain 2. It will connect to server3 with empty position in domain 2, and will need to replicate all events ever generated on server2. This will not be possible if any binlog files were ever purged on server3. To do this, we delete the domain_id=2 from the binlog state of server3: FLUSH BINARY LOGS DELETE_DOMAIN_ID=(2). We do this on the master, because we want to change the history and pretend that domain 2 never existed in the system. Then later we might want to move server4 to replicate from server3: 3 -> 4, 3 -> 1. In this case, we want to remove the domain 2 from the slave position of server 4: SET GLOBAL gtid_slave_pos = SUBSTRING(REGEXP_REPLACE(CONCAT(",", @@GLOBAL.gtid_slave_pos),",2-[0-9]+-[0-9]+", ""), 2); You see, here DELETE_DOMAIN_ID corresponds to RESET MASTER, and SET gtid_slave_pos corresponds to RESET SLAVE in the non-GTID case. But we are doing it only in the domain 2 which we want to remove, while leaving the other domain intact. So it should be familiar to you once you start thinking of each domain as independent replication streams. In your case, because you configured so many domain_id's, MariaDB GTID is obliged to act as if you have a complex multi-source topology. To remove a domain, you therefore have to do it in the same way as in the example: Remove it from the master binlog_state at level N with DELETE_DOMAIN_ID, and at the same time remove it from the slave position at level (N+1). Note that you have to compare the gtid_binlog_state on the master with the gtid_slave_pos on the slave to understand what needs to be in sync. These changes have to be synced so the slave can connect to the master. But you can do it one level at a time, from the bottom up. Overall, anything you can do with non-GTID replication, you can do with GTID replication using a single domain only. You can do even more advanced things with GTID and multiple domain_id, but it is complex and you need to understand the concept of multiple independent replication streams to work effectively with this. I wrote this up as a trail for writing some general article that explains this succinctly. It's still longer than I would like :-/. Maybe it can still be helpful, and any feedback welcome. Also feel free to reach me on IRC #maria or Zulip for more interactive chat.

Manuel Arostegui added a comment - 2023-05-10 06:18

Thank you Kristian.

At this point, I am very suspicious on how corrupted all this could be. I have tried the following approach with a slave.
This slave has the following replication topology

A: Primary master with the following gtid_domain_id=171974662 (this runs pt-heartbeat)
B: Intermediate master with the following gtid_domain_id=180367475 (this runs another instance of pt-heartbeat)

So the replication is:
A -> B -> Slave

This is its replication status:

root@db2180.codfw.wmnet[(none)]> show slave status\G

*************************** 1. row ***************************

                Slave_IO_State: Waiting for master to send event

                   Master_Host: db2129.codfw.wmnet

                   Master_User: repl

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: db2129-bin.003453

           Read_Master_Log_Pos: 250342937

                Relay_Log_File: db2180-relay-bin.000002

                 Relay_Log_Pos: 17297697

         Relay_Master_Log_File: db2129-bin.003453

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 0

                    Last_Error:

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 250342937

               Relay_Log_Space: 17298007

               Until_Condition: None

                Until_Log_File:

                 Until_Log_Pos: 0

            Master_SSL_Allowed: Yes

            Master_SSL_CA_File:

            Master_SSL_CA_Path:

               Master_SSL_Cert:

             Master_SSL_Cipher:

                Master_SSL_Key:

         Seconds_Behind_Master: 0

 Master_SSL_Verify_Server_Cert: No

                 Last_IO_Errno: 0

                 Last_IO_Error:

                Last_SQL_Errno: 0

                Last_SQL_Error:

   Replicate_Ignore_Server_Ids:

              Master_Server_Id: 180367475

                Master_SSL_Crl:

            Master_SSL_Crlpath:

                    Using_Gtid: Slave_Pos

                   Gtid_IO_Pos: 180367475-180367475-979771585,171978805-171978805-2291133789,180363389-180363389-11242988,171970594-171970594-1063329989,180363370-180363370-7087924,171978904-171978904-198543478,171974883-171974883-1921892293,171974662-171974662-1711761670

       Replicate_Do_Domain_Ids:

   Replicate_Ignore_Domain_Ids:

                 Parallel_Mode: optimistic

                     SQL_Delay: 0

           SQL_Remaining_Delay: NULL

       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates

              Slave_DDL_Groups: 0

Slave_Non_Transactional_Groups: 0

    Slave_Transactional_Groups: 6434525

Let's try to clean up the domain_ids. Stoping slave and capturing the GTID position from both hosts:

                   Gtid_IO_Pos: 180367475-180367475-979771613,171978805-171978805-2291133789,180363389-180363389-11242988,171970594-171970594-1063329989,180363370-180363370-7087924,171978904-171978904-198543478,171974883-171974883-1921892293,171974662-171974662-1711765037

So they are:
180367475-180367475-979771613
171974662-171974662-1711765037

Let's clean the rest:

root@db2180.codfw.wmnet[(none)]> SET GLOBAL gtid_slave_pos = "180367475-180367475-979771613,171974662-171974662-1711765037"

    -> ;

Query OK, 0 rows affected (0.036 sec)

root@db2180.codfw.wmnet[(none)]> start slave;

show slave status\G (snipped some parts)

                 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'

                Last_SQL_Errno: 0

                Last_SQL_Error:

   Replicate_Ignore_Server_Ids:

              Master_Server_Id: 180367475

                Master_SSL_Crl:

            Master_SSL_Crlpath:

                    Using_Gtid: Slave_Pos

                   Gtid_IO_Pos: 180367475-180367475-979771613,171974662-171974662-1711765037

There is absolutely no way the the old binlogs could have been purged. At this point I am not sure anymore whether this is trustable or not.
Interestingly all come back to normal after doing:

root@db2180.codfw.wmnet[(none)]> STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=no; START SLAVE;

Query OK, 0 rows affected (0.034 sec)

Query OK, 0 rows affected (0.035 sec)

Query OK, 0 rows affected (0.034 sec)

root@db2180.codfw.wmnet[(none)]> STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE;

Query OK, 0 rows affected (0.037 sec)

Query OK, 0 rows affected (0.040 sec)

Query OK, 0 rows affected (0.036 sec)

root@db2180.codfw.wmnet[(none)]> show slave status\G

*************************** 1. row ***************************

                Slave_IO_State: Waiting for master to send event

                   Master_Host: db2129.codfw.wmnet

                   Master_User: repl

                   Master_Port: 3306

                 Connect_Retry: 60

               Master_Log_File: db2129-bin.003453

           Read_Master_Log_Pos: 275030845

                Relay_Log_File: db2180-relay-bin.000002

                 Relay_Log_Pos: 2457404

         Relay_Master_Log_File: db2129-bin.003453

              Slave_IO_Running: Yes

             Slave_SQL_Running: Yes

               Replicate_Do_DB:

           Replicate_Ignore_DB:

            Replicate_Do_Table:

        Replicate_Ignore_Table:

       Replicate_Wild_Do_Table:

   Replicate_Wild_Ignore_Table:

                    Last_Errno: 0

                    Last_Error:

                  Skip_Counter: 0

           Exec_Master_Log_Pos: 259054498

               Relay_Log_Space: 18434061

               Until_Condition: None

                Until_Log_File:

                 Until_Log_Pos: 0

            Master_SSL_Allowed: Yes

            Master_SSL_CA_File:

            Master_SSL_CA_Path:

               Master_SSL_Cert:

             Master_SSL_Cipher:

                Master_SSL_Key:

         Seconds_Behind_Master: 189

 Master_SSL_Verify_Server_Cert: No

                 Last_IO_Errno: 0

                 Last_IO_Error:

                Last_SQL_Errno: 0

                Last_SQL_Error:

   Replicate_Ignore_Server_Ids:

              Master_Server_Id: 180367475

                Master_SSL_Crl:

            Master_SSL_Crlpath:

                    Using_Gtid: Slave_Pos

                   Gtid_IO_Pos: 180367475-180367475-979771885,171978805-171978805-2291133789,180363389-180363389-11242988,171970594-171970594-1063329989,180363370-180363370-7087924,171978904-171978904-198543478,171974883-171974883-1921892293,171974662-171974662-1711800425

       Replicate_Do_Domain_Ids:

   Replicate_Ignore_Domain_Ids:

                 Parallel_Mode: optimistic

                     SQL_Delay: 0

           SQL_Remaining_Delay: NULL

       Slave_SQL_Running_State: closing tables

              Slave_DDL_Groups: 0

Slave_Non_Transactional_Groups: 0

    Slave_Transactional_Groups: 6448397

1 row in set (0.034 sec)

Manuel Arostegui added a comment - 2023-05-10 06:18 Thank you Kristian. At this point, I am very suspicious on how corrupted all this could be. I have tried the following approach with a slave. This slave has the following replication topology A: Primary master with the following gtid_domain_id=171974662 (this runs pt-heartbeat) B: Intermediate master with the following gtid_domain_id=180367475 (this runs another instance of pt-heartbeat) So the replication is: A -> B -> Slave This is its replication status: root@db2180.codfw.wmnet[(none)]> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: db2129.codfw.wmnet Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: db2129-bin.003453 Read_Master_Log_Pos: 250342937 Relay_Log_File: db2180-relay-bin.000002 Relay_Log_Pos: 17297697 Relay_Master_Log_File: db2129-bin.003453 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 250342937 Relay_Log_Space: 17298007 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: Yes Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 180367475 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 180367475-180367475-979771585,171978805-171978805-2291133789,180363389-180363389-11242988,171970594-171970594-1063329989,180363370-180363370-7087924,171978904-171978904-198543478,171974883-171974883-1921892293,171974662-171974662-1711761670 Replicate_Do_Domain_Ids: Replicate_Ignore_Domain_Ids: Parallel_Mode: optimistic SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates Slave_DDL_Groups: 0 Slave_Non_Transactional_Groups: 0 Slave_Transactional_Groups: 6434525 Let's try to clean up the domain_ids. Stoping slave and capturing the GTID position from both hosts: Gtid_IO_Pos: 180367475-180367475-979771613,171978805-171978805-2291133789,180363389-180363389-11242988,171970594-171970594-1063329989,180363370-180363370-7087924,171978904-171978904-198543478,171974883-171974883-1921892293,171974662-171974662-1711765037 So they are: 180367475-180367475-979771613 171974662-171974662-1711765037 Let's clean the rest: root@db2180.codfw.wmnet[(none)]> SET GLOBAL gtid_slave_pos = "180367475-180367475-979771613,171974662-171974662-1711765037" -> ; Query OK, 0 rows affected (0.036 sec) root@db2180.codfw.wmnet[(none)]> start slave; show slave status\G (snipped some parts) Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.' Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 180367475 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 180367475-180367475-979771613,171974662-171974662-1711765037 There is absolutely no way the the old binlogs could have been purged. At this point I am not sure anymore whether this is trustable or not. Interestingly all come back to normal after doing: root@db2180.codfw.wmnet[(none)]> STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=no; START SLAVE; Query OK, 0 rows affected (0.034 sec) Query OK, 0 rows affected (0.035 sec) Query OK, 0 rows affected (0.034 sec) root@db2180.codfw.wmnet[(none)]> STOP SLAVE; CHANGE MASTER TO MASTER_USE_GTID=Slave_pos; START SLAVE; Query OK, 0 rows affected (0.037 sec) Query OK, 0 rows affected (0.040 sec) Query OK, 0 rows affected (0.036 sec) root@db2180.codfw.wmnet[(none)]> show slave status\G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: db2129.codfw.wmnet Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: db2129-bin.003453 Read_Master_Log_Pos: 275030845 Relay_Log_File: db2180-relay-bin.000002 Relay_Log_Pos: 2457404 Relay_Master_Log_File: db2129-bin.003453 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 259054498 Relay_Log_Space: 18434061 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: Yes Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 189 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 180367475 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 180367475-180367475-979771885,171978805-171978805-2291133789,180363389-180363389-11242988,171970594-171970594-1063329989,180363370-180363370-7087924,171978904-171978904-198543478,171974883-171974883-1921892293,171974662-171974662-1711800425 Replicate_Do_Domain_Ids: Replicate_Ignore_Domain_Ids: Parallel_Mode: optimistic SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: closing tables Slave_DDL_Groups: 0 Slave_Non_Transactional_Groups: 0 Slave_Transactional_Groups: 6448397 1 row in set (0.034 sec)

Kristian Nielsen added a comment - 2023-05-09 15:10

A generic SQL to delete a domain @domain from the gtid_slave_pos:

    SET GLOBAL gtid_slave_pos = SUBSTRING(REGEXP_REPLACE(CONCAT(",", @@GLOBAL.gtid_slave_pos),CONCAT(",", @domain, "-[0-9]+-[0-9]+"), ""), 2);

Kristian Nielsen added a comment - 2023-05-09 15:10 A generic SQL to delete a domain @domain from the gtid_slave_pos: SET GLOBAL gtid_slave_pos = SUBSTRING(REGEXP_REPLACE(CONCAT(",", @@GLOBAL.gtid_slave_pos),CONCAT(",", @domain, "-[0-9]+-[0-9]+"), ""), 2);

Kristian Nielsen added a comment - 2023-05-09 14:50

Thanks Manuel for the quick answer.

> 171970595 isn't present indeed.

Aha. So the error message is wrong. It should say that the domain 171970595 cannot be deleted from gtid_binlog_state because it is not present. I'll dig a bit more and see if I can reproduce this error locally.

> what is the way to actually be able to proceed and delete it from gtid_slave_pos

The way to delete a domain from the gtid_slave_pos is:

SET GLOBAL gtid_slave_pos = "..."

(Eg.in this case by specifying the old position with the to-be-deleted domains removed).

For example, say that the gtid_slave_pos is:

10-10-1000, 20-20-2000

But the current master is using domain_id 20 as the only domain, and you are getting an error that position 10-10-1000 is not present on the master binlog and want to remove it. Simply do

SET GLOBAL gtid_slave_pos = "20-20-2000";

Maybe I should write up some article that clearly explains how to use the domain_id of MariaDB global transaction id, and see if I can get it visible somehow. There is a way to think of domains that should make it relatively easy to understand what is going on. But I know the whole GTID implementation is somewhat complex, and I remember seeing multiple descriptions around on the 'net that are completely wrong wrt. domain_id.

- Kristian.

Kristian Nielsen added a comment - 2023-05-09 14:50 Thanks Manuel for the quick answer. > 171970595 isn't present indeed. Aha. So the error message is wrong. It should say that the domain 171970595 cannot be deleted from gtid_binlog_state because it is not present. I'll dig a bit more and see if I can reproduce this error locally. > what is the way to actually be able to proceed and delete it from gtid_slave_pos The way to delete a domain from the gtid_slave_pos is: SET GLOBAL gtid_slave_pos = "..." (Eg.in this case by specifying the old position with the to-be-deleted domains removed). For example, say that the gtid_slave_pos is: 10-10-1000, 20-20-2000 But the current master is using domain_id 20 as the only domain, and you are getting an error that position 10-10-1000 is not present on the master binlog and want to remove it. Simply do SET GLOBAL gtid_slave_pos = "20-20-2000"; Maybe I should write up some article that clearly explains how to use the domain_id of MariaDB global transaction id, and see if I can get it visible somehow. There is a way to think of domains that should make it relatively easy to understand what is going on. But I know the whole GTID implementation is somewhat complex, and I remember seeing multiple descriptions around on the 'net that are completely wrong wrt. domain_id. - Kristian.

View 12 older comments

People

Assignee:: Brandon Nesterenko

Reporter:: Manuel Arostegui

Votes:: 1 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2023-01-11 23:40

Updated:: 2024-09-10 14:58

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server