Details

Type: Bug
Status: Confirmed
Priority: Major
Resolution: Unresolved
Affects Version: 10.6.25
Environment: Oracle Linux Server 9.7
Labels: Can result in data loss
Description
Hi there, I've got a scripted switchover running in automated tests (GTID is used, gtid_strict_mode is ON, replica read_only is ON, continuous but moderate test payload; see below) where I recently spotted unexpected behaviour concerning "shutdown wait for all replicas;". Contrary to my expectations, some transactions / binlog events were applied on the old primary but never sent to the old replica. This naturally caused the new replica to diverge from the new primary, so that "start replica;" on the new replica failed with fatal error 1236.
Here is what the scripted switchover does, all strictly sequential in one code block from Python/Bash (host mdb02 is old-primary / new-replica, host mdb03 is old-replica / new-primary):
# bring everything down
# mdb02 as old-primary, mdb03 as old-replica
mdb02$ mariadb -e "flush tables with read lock; shutdown wait for all replicas;"
mdb03$ mariadb -e "stop all replicas; reset replica all;"
mdb03$ systemctl stop mariadb.service
mdb02$ # sed read_only to ON in the config
mdb03$ # sed read_only to OFF in the config

# get everything up again
# mdb02 as new-replica, mdb03 as new-primary
mdb03$ systemctl start mariadb.service
mdb02$ systemctl start mariadb.service
mdb02$ mariadb -e "stop replica;"
mdb02$ mariadb -e "change master to master_host=... master_use_gtid=current_pos;"
mdb02$ mariadb -e "start replica;"
mdb02$ mariadb -e "reset master;"
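To spell out the invariant I expect "shutdown wait for all replicas" to guarantee, here is a minimal sketch in Python (the helper names and the parsing are mine, not any MariaDB API): a replica position covers the primary's binlog position when, for every replication domain, the replica's sequence number is at least the primary's.

```python
def parse_gtid_pos(pos: str) -> dict:
    """Parse a MariaDB GTID position list like '0-287274328-2576600'
    into {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in filter(None, (p.strip() for p in pos.split(","))):
        domain, server, seq = (int(x) for x in gtid.split("-"))
        result[domain] = (server, seq)
    return result

def gtid_covers(replica_pos: str, primary_pos: str) -> bool:
    """True if the replica has applied at least everything in primary_pos."""
    replica = parse_gtid_pos(replica_pos)
    for domain, (_server, seq) in parse_gtid_pos(primary_pos).items():
        if domain not in replica or replica[domain][1] < seq:
            return False
    return True

# Positions from the log output below: the old replica stopped at
# seq 2576600 while the old primary had already committed seq 2576601.
print(gtid_covers("0-287274328-2576600", "0-287274328-2576601"))  # False
print(gtid_covers("0-287274328-2576601", "0-287274328-2576601"))  # True
```

With the numbers from this incident, the check comes out False, which is exactly the state "shutdown wait for all replicas" should (as I understand it) never leave behind.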
OK, this is the MariaDB log output (extracts) from the actual occurrence of the problem, each line prefixed with the host it came from:

# mdb02 as old-primary, mdb03 as old-replica
mdb02: 2026-03-29 20:05:13 0 [Note] /usr/sbin/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
mdb03: 2026-03-29 20:05:13 0 [Note] /usr/sbin/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
mdb03: 2026-03-29 20:05:13 9 [Note] Slave SQL thread exiting, replication stopped in log 'log-bin-287274328.000002' at position 58033618; GTID position '0-287274328-2576600', master: ps-mdb-test02.ham4.portrix-systems.de:3306
mdb02: 2026-03-29 20:05:15 0 [Note] InnoDB: Starting shutdown...
mdb02: 2026-03-29 20:05:16 0 [Note] InnoDB: Shutdown completed
mdb03: 2026-03-29 20:05:16 0 [Note] InnoDB: Starting shutdown...
mdb03: 2026-03-29 20:05:19 0 [Note] InnoDB: Shutdown completed

# mdb02 as new-replica, mdb03 as new-primary
mdb03: 2026-03-29 20:05:19 0 [Note] Starting MariaDB 10.6.25-MariaDB-log
mdb03: 2026-03-29 20:05:21 0 [Note] /usr/sbin/mariadbd: ready for connections
mdb02: 2026-03-29 20:05:22 0 [Note] Starting MariaDB 10.6.25-MariaDB-log
mdb02: 2026-03-29 20:05:22 0 [Note] /usr/sbin/mariadbd: ready for connections
mdb02: 2026-03-29 20:05:34 7 [Note] Slave I/O thread: connected to master 'replication_user@ps-mdb-test03.ham4.portrix-systems.de:3306',replication starts at GTID position '0-287274328-2576601'
mdb02: 2026-03-29 20:05:34 7 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-287274328-2576601, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions', Internal MariaDB error code: 1236
mdb02: 2026-03-29 20:05:34 7 [Note] Slave I/O thread exiting, read up to log 'FIRST', position 4; GTID position 0-287274328-2576601, master ps-mdb-test03.ham4.portrix-systems.de:3306
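As a guard against exactly this gap (independent of whether "shutdown wait for all replicas" should already cover it), the old replica can be made to wait explicitly for the old primary's binlog position before replication is torn down. A minimal sketch in Python, shelling out to the `mariadb` CLI; the helper names and the orchestration are mine, while `@@gtid_binlog_pos` and `MASTER_GTID_WAIT()` are actual MariaDB features:

```python
import subprocess

def mariadb_cmd(host: str, statement: str) -> list:
    """Build a batch-mode mariadb CLI invocation (no column headers)."""
    return ["mariadb", "-h", host, "-N", "-B", "-e", statement]

def sql(host: str, statement: str) -> str:
    """Run one statement on the given host and return its trimmed stdout."""
    return subprocess.run(mariadb_cmd(host, statement),
                          check=True, capture_output=True, text=True).stdout.strip()

def wait_for_catchup(primary: str, replica: str, timeout_s: int = 30) -> None:
    """Block until `replica` has applied everything in `primary`'s binlog.

    Assumes writes on the primary have already been stopped (read_only,
    or FLUSH TABLES WITH READ LOCK held open in another session - the
    lock would not survive a one-shot CLI call from here).
    """
    pos = sql(primary, "select @@gtid_binlog_pos;")
    # MASTER_GTID_WAIT() returns 0 once the position is reached, -1 on timeout.
    if sql(replica, f"select master_gtid_wait('{pos}', {timeout_s});") != "0":
        raise RuntimeError(f"{replica} did not reach GTID position {pos} "
                           f"within {timeout_s}s")
```

In the switchover script above, such a check would run between stopping writes on mdb02 and issuing "stop all replicas" on mdb03.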
Finally, as mentioned above, the mockdata payload is just some inserts and deletes on a simple table, as follows:

create database if not exists mockdata;

create table if not exists mockdata.mockdata (
  id bigint unsigned auto_increment primary key,
  mock_name text not null,
  created_on timestamp);

insert into mockdata.mockdata (mock_name, created_on)
select concat('Hello Peter ', seq), current_timestamp from test.seq_1_to_5;

delete from mockdata.mockdata where created_on < date_sub(current_timestamp, interval 7 day);
Yet again, for the actual problem (time window) in question, I've found this:

# mdb02
7200613  Hello Peter 5  2026-03-29 20:05:13.0
7200612  Hello Peter 4  2026-03-29 20:05:13.0
7200611  Hello Peter 3  2026-03-29 20:05:13.0
7200610  Hello Peter 2  2026-03-29 20:05:13.0
7200609  Hello Peter 1  2026-03-29 20:05:13.0

7200606  Hello Peter 5  2026-03-29 20:05:01.0
7200605  Hello Peter 4  2026-03-29 20:05:01.0
7200604  Hello Peter 3  2026-03-29 20:05:01.0
7200603  Hello Peter 2  2026-03-29 20:05:01.0
7200602  Hello Peter 1  2026-03-29 20:05:01.0

# mdb03
7200611  Hello Peter 5  2026-03-29 20:05:22.0
7200610  Hello Peter 4  2026-03-29 20:05:22.0
7200609  Hello Peter 3  2026-03-29 20:05:22.0
7200608  Hello Peter 2  2026-03-29 20:05:22.0
7200607  Hello Peter 1  2026-03-29 20:05:22.0

7200606  Hello Peter 5  2026-03-29 20:05:01.0
7200605  Hello Peter 4  2026-03-29 20:05:01.0
7200604  Hello Peter 3  2026-03-29 20:05:01.0
7200603  Hello Peter 2  2026-03-29 20:05:01.0
7200602  Hello Peter 1  2026-03-29 20:05:01.0
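To make the divergence concrete, the id sets differ exactly where the lost transaction sits; a trivial cross-check (id values copied from the dumps above):

```python
# ids visible in the affected window on each host (from the dumps above)
mdb02 = {7200602, 7200603, 7200604, 7200605, 7200606,
         7200609, 7200610, 7200611, 7200612, 7200613}
mdb03 = {7200602, 7200603, 7200604, 7200605, 7200606,
         7200607, 7200608, 7200609, 7200610, 7200611}

print(sorted(mdb02 - mdb03))  # [7200612, 7200613] exist only on the old primary
print(sorted(mdb03 - mdb02))  # [7200607, 7200608] re-used by the new primary
```

Note also that ids 7200609 to 7200611 exist on both hosts but hold different rows (timestamps 20:05:13 vs 20:05:22), since the new primary re-used the auto_increment range it never received.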
Obviously, the mockdata rows timestamped "2026-03-29 20:05:13.0" never made it to the old replica, which is the reason for the divergence.
Does anyone have an idea?
Regards, Peter