Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 10.4.26
Environment: Debian Bullseye
Description
We are trying to clean up the output of gtid_slave_pos so we can use GTID to move slaves around with Orchestrator without it complaining about positions not existing on the binary logs (they may be leftovers from previous masters that no longer exist on those replication chains).
While cleaning them up via flush binary logs delete_domain_id we are running into errors claiming that some GTID domains are currently present on the binary logs, which is not the case.
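The cleanup described above boils down to parsing gtid_slave_pos into (domain_id, server_id, seq_no) triples and picking out the domains that no longer belong to the topology. A minimal sketch of that bookkeeping (the function names and the idea of an "active domains" keep-list are ours for illustration, not anything MariaDB or Orchestrator ships):

```python
# Sketch only: parse a gtid_slave_pos value into (domain_id, server_id,
# seq_no) triples and list the domains that are candidates for
# "flush binary logs delete_domain_id" because they are not the domain
# of any current master. Function names are hypothetical.

def parse_gtid_pos(pos):
    """Split 'd-s-q,d-s-q,...' into integer (domain, server, seq) triples."""
    return [tuple(int(part) for part in item.split('-'))
            for item in pos.split(',') if item]

def stale_domains(pos, active_domains):
    """Domains present in the position but not in the active set."""
    seen = {domain for domain, _server, _seq in parse_gtid_pos(pos)}
    return sorted(seen - set(active_domains))
```

For db1117, with 172001292 as the only active domain, every other domain listed in its @@gtid_slave_pos would come back as stale.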
For this example we have:
db1195 - master
db1117 - slave
db1117 is the one we want to clean up, leaving only the domain id of db1195 in its replication state.
root@db1117.eqiad.wmnet[(none)]> select @@gtid_domain_id;
+------------------+
| @@gtid_domain_id |
+------------------+
|        171966479 |
+------------------+
1 row in set (0.001 sec)

root@db1117.eqiad.wmnet[(none)]> select @@gtid_slave_pos\G
*************************** 1. row ***************************
@@gtid_slave_pos: 0-171966484-2731336216,171966484-171966484-7582228474,171966512-171966512-1959889139,171966556-171966556-1824632116,171966562-171966562-962004828,171970746-171970746-808478946,171974733-171974733-2008457625,171974884-171974884-9104192396,171978763-171978763-83528410,172001292-172001292-1428114291
1 row in set (0.001 sec)

root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171966484);
ERROR 1076 (HY000): Could not delete gtid domain. Reason: binlog files may contain gtids from the domain ('171966484') being deleted. Make sure to first purge those files.
root@db1117.eqiad.wmnet[(none)]>
|
This host has one master (db1195) but also has log-slave-updates enabled for backup purposes.
root@db1117.eqiad.wmnet[(none)]> show binary logs;
+-------------------+------------+
| Log_name          | File_size  |
+-------------------+------------+
| db1117-bin.016856 |    9845469 |
| db1117-bin.016857 | 1048576295 |
| db1117-bin.016858 | 1048781927 |
| db1117-bin.016859 | 1048659990 |
| db1117-bin.016860 |  125787283 |
+-------------------+------------+
5 rows in set (0.001 sec)
|
This is its replication data:
root@db1117.eqiad.wmnet[(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: db1195.eqiad.wmnet
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: db1195-bin.001570
Read_Master_Log_Pos: 339370363
Relay_Log_File: db1117-relay-bin.000010
Relay_Log_Pos: 339370663
Relay_Master_Log_File: db1195-bin.001570
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 339370363
Relay_Log_Space: 339371021
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 172001292
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-171966484-2731336216,171966484-171966484-7582228474,171974733-171974733-2008457625,171966562-171966562-962004828,171970746-171970746-808478946,171966512-171966512-1959889139,171974884-171974884-9104192396,171966556-171966556-1824632116,171978763-171978763-83528410,172001292-172001292-1428128537
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Slave_DDL_Groups: 21
Slave_Non_Transactional_Groups: 111
Slave_Transactional_Groups: 1216438
1 row in set (0.001 sec)
|
If we scan the binlogs (and the relay logs of this slave, just in case) we can see that the domain_id isn't present in any of them. We use MIXED as binlog_format:
|
root@db1117:/srv/sqldata.m1# ls -lh | egrep "bin"
-rw-rw---- 1 mysql mysql 9.4M Jan 11 20:31 db1117-bin.016856
-rw-rw---- 1 mysql mysql 1001M Jan 11 22:31 db1117-bin.016857
-rw-rw---- 1 mysql mysql 1001M Jan 11 22:52 db1117-bin.016858
-rw-rw---- 1 mysql mysql 1001M Jan 11 23:20 db1117-bin.016859
-rw-rw---- 1 mysql mysql 213M Jan 11 23:24 db1117-bin.016860
-rw-rw---- 1 mysql mysql 100 Jan 11 23:20 db1117-bin.index
-rw-rw---- 1 mysql mysql 358 Jan 11 23:16 db1117-relay-bin.000009
-rw-rw---- 1 mysql mysql 380M Jan 11 23:24 db1117-relay-bin.000010
-rw-rw---- 1 mysql mysql 52 Jan 11 23:16 db1117-relay-bin.index

root@db1117:/srv/sqldata.m1# for i in `ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done
db1117-bin.016856
db1117-bin.016857
db1117-bin.016858
db1117-bin.016859
db1117-bin.016860
db1117-relay-bin.000009
db1117-relay-bin.000010
|
The same result on its master:
root@db1195.eqiad.wmnet[(none)]> select @@gtid_domain_id;
+------------------+
| @@gtid_domain_id |
+------------------+
|        172001292 |
+------------------+
1 row in set (0.001 sec)

root@db1195:/srv/sqldata# ls | grep bin | grep -v index
db1195-bin.001565
db1195-bin.001566
db1195-bin.001567
db1195-bin.001568
db1195-bin.001569
db1195-bin.001570

root@db1195:/srv/sqldata# for i in `ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done
db1195-bin.001565
db1195-bin.001566
db1195-bin.001567
db1195-bin.001568
db1195-bin.001569
db1195-bin.001570
|
Logs have been flushed/rotated just in case on both hosts.
In sql/rpl_gtid.cc we can see this piece of code:
      for (not_match= true, k= 0; k < elem->hash.records; k++)
      {
        rpl_gtid *d_gtid= (rpl_gtid *)my_hash_element(&elem->hash, k);
        for (ulong l= 0; l < glev->count && not_match; l++)
          not_match= !(*d_gtid == glev->list[l]);
      }

      if (not_match)
      {
        sprintf(errbuf, "binlog files may contain gtids from the domain ('%u') "
                "being deleted. Make sure to first purge those files",
                *ptr_domain_id);
        errmsg= errbuf;
        goto end;
      }
|
Which per the following comment seems to be getting the list of domain_ids from GTID_LIST_EVENT (https://mariadb.com/kb/en/gtid_list_event/):
  Gtid list is supposed to come from a binlog's Gtid_list event and
  therefore should be a subset of the current binlog state. That is
  for every domain in the list the binlog state contains a gtid with
  sequence number not less than that of the list.

  Exceptions of this inclusion rule are:
    A. the list may still refer to gtids from already deleted domains.
       Files containing them must have been purged whereas the file
       with the list is not yet.
    B. out of order groups were injected
    C. manually build list of binlog files violating the inclusion
       constraint.
  While A is a normal case (not necessarily distinguishable from C though),
  B and C may require the user's attention so any (incl the A's suspected)
  inconsistency is diagnosed and *warned*.
*/
|
Based on that documentation page, I would have expected to find that domain_id in at least one of the existing binlogs, but as can be seen above, it is not there.
In fact, that gtid_domain_id doesn't belong to any of the hosts present in this replication topology.
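Reading the loop above, the server refuses the deletion unless at least one GTID recorded in the binlog state for the domain compares equal to an entry of the Gtid_list event. A small Python model of that condition, assuming rpl_gtid::operator== compares the full (domain_id, server_id, seq_no) triple (the data below is made up for illustration):

```python
# Illustrative model of the not_match loop in sql/rpl_gtid.cc.
# A GTID is modelled as a (domain_id, server_id, seq_no) tuple; equality
# is assumed to require all three components to match.

def domain_deletable(state_gtids, gtid_list_event):
    """True if some binlog-state GTID of the domain appears verbatim in
    the Gtid_list event; otherwise ER 1076 would be raised."""
    listed = set(gtid_list_event)
    return any(gtid in listed for gtid in state_gtids)
```

Under this reading, a domain whose binlog-state GTID no longer exactly matches the one stored in the binlog's Gtid_list event would be reported as present even if no remaining file actually contains events from it, which could explain the behaviour observed here; we have not confirmed where the mismatch comes from.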
Some other domain ids can be cleaned up fine on that same host:
root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171974884);
Query OK, 0 rows affected (0.002 sec)

root@db1117.eqiad.wmnet[(none)]>
|
The same happens when trying to delete the default gtid_domain_id 0, which has never been present on this system, as we set gtid_domain_id to a different value when we set up the hosts.
Attachments
Activity
Field | Original Value | New Value
---|---|---
Summary | flush binary logs delete_domain_id not | flush binary logs delete_domain_id not deleting specific non present domain_id
Assignee | | Brandon Nesterenko [ JIRAUSER48702 ]
Description |
We are trying to clean up the output of _gtid_slave_pos_ so we can use GTID to move slaves around with Orchestrator without them complaining about positions not existing on binary logs (as they might be leftovers from previous masters, no longer existing on those replication chains.
While cleaning them up via _flush binary logs delete_domain_id_ we are running into some errors about some existing GTID domains currently being present on binary logs, which is not the case. For this example we have db1195 - master db1117 - slave db1117 is the one we want to clean up and leave only replicating with db1195 domain's id {code} root@db1117.eqiad.wmnet[(none)]> select @@gtid_domain_id; +------------------+ | @@gtid_domain_id | +------------------+ | 171966479 | +------------------+ 1 row in set (0.001 sec) root@db1117.eqiad.wmnet[(none)]> select @@gtid_slave_pos\G *************************** 1. row *************************** @@gtid_slave_pos: 0-171966484-2731336216,171966484-171966484-7582228474,171966512-171966512-1959889139,171966556-171966556-1824632116,171966562-171966562-962004828,171970746-171970746-808478946,171974733-171974733-2008457625,171974884-171974884-9104192396,171978763-171978763-83528410,172001292-172001292-1428114291 1 row in set (0.001 sec) root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171966484); ERROR 1076 (HY000): Could not delete gtid domain. Reason: binlog files may contain gtids from the domain ('171966484') being deleted. Make sure to first purge those files. root@db1117.eqiad.wmnet[(none)]> {code} This host has one master (db1195) but also has log-slave-updates enabled for backup purposes. {code} root@db1117.eqiad.wmnet[(none)]> show binary logs; +-------------------+------------+ | Log_name | File_size | +-------------------+------------+ | db1117-bin.016856 | 9845469 | | db1117-bin.016857 | 1048576295 | | db1117-bin.016858 | 1048781927 | | db1117-bin.016859 | 1048659990 | | db1117-bin.016860 | 125787283 | +-------------------+------------+ 5 rows in set (0.001 sec) {code} This is its replication data: {code} root@db1117.eqiad.wmnet[(none)]> show slave status\G *************************** 1. 
row *************************** Slave_IO_State: Waiting for master to send event Master_Host: db1195.eqiad.wmnet Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: db1195-bin.001570 Read_Master_Log_Pos: 339370363 Relay_Log_File: db1117-relay-bin.000010 Relay_Log_Pos: 339370663 Relay_Master_Log_File: db1195-bin.001570 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 339370363 Relay_Log_Space: 339371021 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: Yes Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 172001292 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 0-171966484-2731336216,171966484-171966484-7582228474,171974733-171974733-2008457625,171966562-171966562-962004828,171970746-171970746-808478946,171966512-171966512-1959889139,171974884-171974884-9104192396,171966556-171966556-1824632116,171978763-171978763-83528410,172001292-172001292-1428128537 Replicate_Do_Domain_Ids: Replicate_Ignore_Domain_Ids: Parallel_Mode: conservative SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it Slave_DDL_Groups: 21 Slave_Non_Transactional_Groups: 111 Slave_Transactional_Groups: 1216438 1 row in set (0.001 sec) {code} If we scan the binlogs (and relay log of this slave, just in case) we can see that domain_id isn't present on any. 
We use MIXED as binlog_format {code} root@db1117:/srv/sqldata.m1# ls -lh | egrep "bin" -rw-rw---- 1 mysql mysql 9.4M Jan 11 20:31 db1117-bin.016856 -rw-rw---- 1 mysql mysql 1001M Jan 11 22:31 db1117-bin.016857 -rw-rw---- 1 mysql mysql 1001M Jan 11 22:52 db1117-bin.016858 -rw-rw---- 1 mysql mysql 1001M Jan 11 23:20 db1117-bin.016859 -rw-rw---- 1 mysql mysql 213M Jan 11 23:24 db1117-bin.016860 -rw-rw---- 1 mysql mysql 100 Jan 11 23:20 db1117-bin.index -rw-rw---- 1 mysql mysql 358 Jan 11 23:16 db1117-relay-bin.000009 -rw-rw---- 1 mysql mysql 380M Jan 11 23:24 db1117-relay-bin.000010 -rw-rw---- 1 mysql mysql 52 Jan 11 23:16 db1117-relay-bin.index root@db1117:/srv/sqldata.m1# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done db1117-bin.016856 db1117-bin.016857 db1117-bin.016858 db1117-bin.016859 db1117-bin.016860 db1117-relay-bin.000009 db1117-relay-bin.000010 {code} The same result on its master {code} root@db1195.eqiad.wmnet[(none)]> select @@gtid_domain_id; +------------------+ | @@gtid_domain_id | +------------------+ | 172001292 | +------------------+ 1 row in set (0.001 sec) root@db1195:/srv/sqldata# ls | grep bin | grep -v index db1195-bin.001565 db1195-bin.001566 db1195-bin.001567 db1195-bin.001568 db1195-bin.001569 db1195-bin.001570 root@db1195:/srv/sqldata# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done db1195-bin.001565 db1195-bin.001566 db1195-bin.001567 db1195-bin.001568 db1195-bin.001569 db1195-bin.001570 {code} Logs have been flushed/rotated just in case on both hosts. 
On _sql/rpl_gtid.cc_ we can see this piece of code {code} for (not_match= true, k= 0; k < elem->hash.records; k++) { rpl_gtid *d_gtid= (rpl_gtid *)my_hash_element(&elem->hash, k); for (ulong l= 0; l < glev->count && not_match; l++) not_match= !(*d_gtid == glev->list[l]); } if (not_match) { sprintf(errbuf, "binlog files may contain gtids from the domain ('%u') " "being deleted. Make sure to first purge those files", *ptr_domain_id); errmsg= errbuf; goto end; } {code} Which per the following comment seems to be getting the list of domain_ids from GTID_LIST_EVENT (https://mariadb.com/kb/en/gtid_list_event/): {code} Gtid list is supposed to come from a binlog's Gtid_list event and therefore should be a subset of the current binlog state. That is for every domain in the list the binlog state contains a gtid with sequence number not less than that of the list. Exceptions of this inclusion rule are: A. the list may still refer to gtids from already deleted domains. Files containing them must have been purged whereas the file with the list is not yet. B. out of order groups were injected C. manually build list of binlog files violating the inclusion constraint. While A is a normal case (not necessarily distinguishable from C though), B and C may require the user's attention so any (incl the A's suspected) inconsistency is diagnosed and *warned*. */ {code} Using that documentation page, I would have expected to find that domain_id present on any of the existing binlogs, but as can be seen above, it is not. In fact, that gtid_domain_id doesn't belong to any of the hosts present on this replication topology. 
Some other do work fine when being cleaned up on that same host: {code} root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171974884); Query OK, 0 rows affected (0.002 sec) root@db1117.eqiad.wmnet[(none)]> {code} The same happens when trying to delete the default _gtid_domain_id_ 0 which has never been present on this system as set gtid_domain_id to a different value when we set up the hosts. |
We are trying to clean up the output of _gtid_slave_pos_ so we can use GTID to move slaves around with Orchestrator without them complaining about positions not existing on binary logs (as they might be leftovers from previous masters, no longer existing on those replication chains.
While cleaning them up via _flush binary logs delete_domain_id_ we are running into errors about existing GTID domains supposedly still being present on the binary logs, which is not the case.

For this example we have:
db1195 - master
db1117 - slave

db1117 is the one we want to clean up, leaving only db1195's domain id:
{code}
root@db1117.eqiad.wmnet[(none)]> select @@gtid_domain_id;
+------------------+
| @@gtid_domain_id |
+------------------+
|        171966479 |
+------------------+
1 row in set (0.001 sec)

root@db1117.eqiad.wmnet[(none)]> select @@gtid_slave_pos\G
*************************** 1. row ***************************
@@gtid_slave_pos: 0-171966484-2731336216,171966484-171966484-7582228474,171966512-171966512-1959889139,171966556-171966556-1824632116,171966562-171966562-962004828,171970746-171970746-808478946,171974733-171974733-2008457625,171974884-171974884-9104192396,171978763-171978763-83528410,172001292-172001292-1428114291
1 row in set (0.001 sec)

root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171966484);
ERROR 1076 (HY000): Could not delete gtid domain. Reason: binlog files may contain gtids from the domain ('171966484') being deleted. Make sure to first purge those files.
root@db1117.eqiad.wmnet[(none)]>
{code}

This host has one master (db1195) but also has log-slave-updates enabled for backup purposes.
{code}
root@db1117.eqiad.wmnet[(none)]> show binary logs;
+-------------------+------------+
| Log_name          | File_size  |
+-------------------+------------+
| db1117-bin.016856 |    9845469 |
| db1117-bin.016857 | 1048576295 |
| db1117-bin.016858 | 1048781927 |
| db1117-bin.016859 | 1048659990 |
| db1117-bin.016860 |  125787283 |
+-------------------+------------+
5 rows in set (0.001 sec)
{code}

This is its replication data:
{code}
root@db1117.eqiad.wmnet[(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: db1195.eqiad.wmnet
                   Master_User: repl
                   Master_Port: 3306
                 Connect_Retry: 60
               Master_Log_File: db1195-bin.001570
           Read_Master_Log_Pos: 339370363
                Relay_Log_File: db1117-relay-bin.000010
                 Relay_Log_Pos: 339370663
         Relay_Master_Log_File: db1195-bin.001570
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 339370363
               Relay_Log_Space: 339371021
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 172001292
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-171966484-2731336216,171966484-171966484-7582228474,171974733-171974733-2008457625,171966562-171966562-962004828,171970746-171970746-808478946,171966512-171966512-1959889139,171974884-171974884-9104192396,171966556-171966556-1824632116,171978763-171978763-83528410,172001292-172001292-1428128537
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 21
Slave_Non_Transactional_Groups: 111
    Slave_Transactional_Groups: 1216438
1 row in set (0.001 sec)
{code}

If we scan the binlogs (and the relay logs of this slave, just in case) we can see that the domain_id isn't present on any of them. We use MIXED as binlog_format.
{code}
root@db1117:/srv/sqldata.m1# ls -lh | egrep "bin"
-rw-rw---- 1 mysql mysql 9.4M Jan 11 20:31 db1117-bin.016856
-rw-rw---- 1 mysql mysql 1001M Jan 11 22:31 db1117-bin.016857
-rw-rw---- 1 mysql mysql 1001M Jan 11 22:52 db1117-bin.016858
-rw-rw---- 1 mysql mysql 1001M Jan 11 23:20 db1117-bin.016859
-rw-rw---- 1 mysql mysql 213M Jan 11 23:24 db1117-bin.016860
-rw-rw---- 1 mysql mysql 100 Jan 11 23:20 db1117-bin.index
-rw-rw---- 1 mysql mysql 358 Jan 11 23:16 db1117-relay-bin.000009
-rw-rw---- 1 mysql mysql 380M Jan 11 23:24 db1117-relay-bin.000010
-rw-rw---- 1 mysql mysql 52 Jan 11 23:16 db1117-relay-bin.index
root@db1117:/srv/sqldata.m1# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done
db1117-bin.016856
db1117-bin.016857
db1117-bin.016858
db1117-bin.016859
db1117-bin.016860
db1117-relay-bin.000009
db1117-relay-bin.000010
{code}

The same result on its master:
{code}
root@db1195.eqiad.wmnet[(none)]> select @@gtid_domain_id;
+------------------+
| @@gtid_domain_id |
+------------------+
|        172001292 |
+------------------+
1 row in set (0.001 sec)

root@db1195:/srv/sqldata# ls | grep bin | grep -v index
db1195-bin.001565
db1195-bin.001566
db1195-bin.001567
db1195-bin.001568
db1195-bin.001569
db1195-bin.001570
root@db1195:/srv/sqldata# for i in ` ls | grep bin | grep -v index`; do echo $i; mysqlbinlog -vvv --base64-output=DECODE-ROW $i | grep "171966484" ; done
db1195-bin.001565
db1195-bin.001566
db1195-bin.001567
db1195-bin.001568
db1195-bin.001569
db1195-bin.001570
{code}

Logs have been flushed/rotated just in case on both hosts.
In _sql/rpl_gtid.cc_ we can see this piece of code:
{code}
    for (not_match= true, k= 0; k < elem->hash.records; k++)
    {
      rpl_gtid *d_gtid= (rpl_gtid *)my_hash_element(&elem->hash, k);
      for (ulong l= 0; l < glev->count && not_match; l++)
        not_match= !(*d_gtid == glev->list[l]);
    }

    if (not_match)
    {
      sprintf(errbuf,
              "binlog files may contain gtids from the domain ('%u') "
              "being deleted. Make sure to first purge those files",
              *ptr_domain_id);
      errmsg= errbuf;
      goto end;
    }
{code}
which, per the following comment, seems to get the list of domain ids from the GTID_LIST_EVENT (https://mariadb.com/kb/en/gtid_list_event/):
{code}
  Gtid list is supposed to come from a binlog's Gtid_list event and
  therefore should be a subset of the current binlog state. That is
  for every domain in the list the binlog state contains a gtid with
  sequence number not less than that of the list.

  Exceptions of this inclusion rule are:
    A. the list may still refer to gtids from already deleted domains.
       Files containing them must have been purged whereas the file
       with the list is not yet.
    B. out of order groups were injected
    C. manually build list of binlog files violating the inclusion
       constraint.
  While A is a normal case (not necessarily distinguishable from C though),
  B and C may require the user's attention so any (incl the A's suspected)
  inconsistency is diagnosed and *warned*.
*/
{code}
Based on that documentation page, I would have expected to find that domain_id present in at least one of the existing binlogs, but as can be seen above, it is not. In fact, that gtid_domain_id doesn't belong to any of the hosts present in this replication topology.
Some other domains do get cleaned up fine on that same host:
{code}
root@db1117.eqiad.wmnet[(none)]> flush binary logs delete_domain_id=(171974884);
Query OK, 0 rows affected (0.002 sec)

root@db1117.eqiad.wmnet[(none)]>
{code}
The same happens when trying to delete the default _gtid_domain_id_ 0, which has never been present on this system, as we set gtid_domain_id to a different value when we set up the hosts.
Fix Version/s | 10.4 [ 22408 ] | |
Fix Version/s | 10.5 [ 23123 ] | |
Fix Version/s | 10.6 [ 24028 ] | |
Fix Version/s | 10.7 [ 24805 ] | |
Fix Version/s | 10.8 [ 26121 ] | |
Fix Version/s | 10.9 [ 26905 ] | |
Fix Version/s | 10.10 [ 27530 ] | |
Fix Version/s | 10.11 [ 27614 ] |
Thanks, Manuel Arostegui, for an excellent write-up of the problem.
This part of the code you refer to looks obviously wrong:
{code}
      for (ulong l= 0; l < glev->count && not_match; l++)
        not_match= !(*d_gtid == glev->list[l]);
{code}
The test is reversed. If the domain to be deleted has multiple entries in the GTID_LIST event (two or more different server_ids), then obviously all but one will not match.
So this would make it impossible to delete any domain which was served by two different masters. When I get time, I'll try to write a test for this.
Some extra information would be useful: The value of @@GTID_BINLOG_STATE, and the mysqlbinlog dump of the GTID_LIST event at the start of the first binlog file on the slave. And also the output of SHOW WARNINGS after the failing command, if any. I know this is an old bug report, so I understand if this is no longer possible to obtain.
Also, I wonder if there is some confusion here. The DELETE_DOMAIN_ID is for removing domains from the @@GTID_BINLOG_STATE. But the report mentions deleting from the @@GTID_SLAVE_POS. The latter is done simply by setting @@GTID_SLAVE_POS to the new desired value without the unwanted domains. However, the observed behaviour still looks like a bug to me...