Details

Type: Bug
Status: Confirmed
Priority: Major
Resolution: Unresolved
Affects Version: 10.6.25
Environment: Oracle Linux Server 9.7
Labels: Can result in data loss
Description
Hi there, I've got a scripted switchover running in automated tests (GTID is used, gtid_strict_mode is ON, replica read_only is ON, continuous but moderate test payload; see below) where I recently spotted unexpected behaviour concerning "shutdown wait for all replicas;". Contrary to my expectations, some transactions / binlog events were applied on the old primary but never sent to the old replica. This naturally caused the new replica to diverge from the new primary, so that "start replica;" on the new replica failed with fatal error 1236.
Here is what the scripted switchover does, all strictly sequential in one code block from Python/Bash (host mdb02 is old-primary / new-replica, host mdb03 is old-replica / new-primary):
# bring everything down
# mdb02 as old-primary, mdb03 as old-replica
mdb02$ mariadb -e "flush tables with read lock; shutdown wait for all replicas;"
mdb03$ mariadb -e "stop all replicas; reset replica all;"
mdb03$ systemctl stop mariadb.service
mdb02$ # sed read_only to ON in the config
mdb03$ # sed read_only to OFF in the config

# get everything up again
# mdb02 as new-replica, mdb03 as new-primary
mdb03$ systemctl start mariadb.service
mdb02$ systemctl start mariadb.service
mdb02$ mariadb -e "stop replica;"
mdb02$ mariadb -e "change master to master_host=... master_use_gtid=current_pos;"
mdb02$ mariadb -e "start replica;"
mdb02$ mariadb -e "reset master;"
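To spell out the invariant I expect "shutdown wait for all replicas" to guarantee, here is a minimal sketch in Python (the helper names and the parsing are mine, not any MariaDB API): a replica position covers the primary's binlog position when, for every replication domain, the replica's sequence number is at least the primary's.

```python
def parse_gtid_pos(pos: str) -> dict:
    """Parse a MariaDB GTID position list like '0-287274328-2576600'
    into {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in filter(None, (p.strip() for p in pos.split(","))):
        domain, server, seq = (int(x) for x in gtid.split("-"))
        result[domain] = (server, seq)
    return result

def gtid_covers(replica_pos: str, primary_pos: str) -> bool:
    """True if the replica has applied at least everything in primary_pos."""
    replica = parse_gtid_pos(replica_pos)
    for domain, (_server, seq) in parse_gtid_pos(primary_pos).items():
        if domain not in replica or replica[domain][1] < seq:
            return False
    return True

# Positions from the log output below: the old replica stopped at
# seq 2576600 while the old primary had already committed seq 2576601.
print(gtid_covers("0-287274328-2576600", "0-287274328-2576601"))  # False
print(gtid_covers("0-287274328-2576601", "0-287274328-2576601"))  # True
```

With the numbers from this incident, the check comes out False, which is exactly the state "shutdown wait for all replicas" should (as I understand it) never leave behind.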
OK, this is the MariaDB log output (extracts) from the actual occurrence of the problem, each line prefixed with the host it came from:

# mdb02 as old-primary, mdb03 as old-replica
mdb02: 2026-03-29 20:05:13 0 [Note] /usr/sbin/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
mdb03: 2026-03-29 20:05:13 0 [Note] /usr/sbin/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
mdb03: 2026-03-29 20:05:13 9 [Note] Slave SQL thread exiting, replication stopped in log 'log-bin-287274328.000002' at position 58033618; GTID position '0-287274328-2576600', master: ps-mdb-test02.ham4.portrix-systems.de:3306
mdb02: 2026-03-29 20:05:15 0 [Note] InnoDB: Starting shutdown...
mdb02: 2026-03-29 20:05:16 0 [Note] InnoDB: Shutdown completed
mdb03: 2026-03-29 20:05:16 0 [Note] InnoDB: Starting shutdown...
mdb03: 2026-03-29 20:05:19 0 [Note] InnoDB: Shutdown completed

# mdb02 as new-replica, mdb03 as new-primary
mdb03: 2026-03-29 20:05:19 0 [Note] Starting MariaDB 10.6.25-MariaDB-log
mdb03: 2026-03-29 20:05:21 0 [Note] /usr/sbin/mariadbd: ready for connections
mdb02: 2026-03-29 20:05:22 0 [Note] Starting MariaDB 10.6.25-MariaDB-log
mdb02: 2026-03-29 20:05:22 0 [Note] /usr/sbin/mariadbd: ready for connections
mdb02: 2026-03-29 20:05:34 7 [Note] Slave I/O thread: connected to master 'replication_user@ps-mdb-test03.ham4.portrix-systems.de:3306',replication starts at GTID position '0-287274328-2576601'
mdb02: 2026-03-29 20:05:34 7 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-287274328-2576601, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions', Internal MariaDB error code: 1236
mdb02: 2026-03-29 20:05:34 7 [Note] Slave I/O thread exiting, read up to log 'FIRST', position 4; GTID position 0-287274328-2576601, master ps-mdb-test03.ham4.portrix-systems.de:3306
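As a guard against exactly this gap (independent of whether "shutdown wait for all replicas" should already cover it), the old replica can be made to wait explicitly for the old primary's binlog position before replication is torn down. A minimal sketch in Python, shelling out to the `mariadb` CLI; the helper names and the orchestration are mine, while `@@gtid_binlog_pos` and `MASTER_GTID_WAIT()` are actual MariaDB features:

```python
import subprocess

def mariadb_cmd(host: str, statement: str) -> list:
    """Build a batch-mode mariadb CLI invocation (no column headers)."""
    return ["mariadb", "-h", host, "-N", "-B", "-e", statement]

def sql(host: str, statement: str) -> str:
    """Run one statement on the given host and return its trimmed stdout."""
    return subprocess.run(mariadb_cmd(host, statement),
                          check=True, capture_output=True, text=True).stdout.strip()

def wait_for_catchup(primary: str, replica: str, timeout_s: int = 30) -> None:
    """Block until `replica` has applied everything in `primary`'s binlog.

    Assumes writes on the primary have already been stopped (read_only,
    or FLUSH TABLES WITH READ LOCK held open in another session - the
    lock would not survive a one-shot CLI call from here).
    """
    pos = sql(primary, "select @@gtid_binlog_pos;")
    # MASTER_GTID_WAIT() returns 0 once the position is reached, -1 on timeout.
    if sql(replica, f"select master_gtid_wait('{pos}', {timeout_s});") != "0":
        raise RuntimeError(f"{replica} did not reach GTID position {pos} "
                           f"within {timeout_s}s")
```

In the switchover script above, such a check would run between stopping writes on mdb02 and issuing "stop all replicas" on mdb03.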
Finally, as mentioned above, the mockdata payload is just some inserts and deletes on a simple table, as follows:

create database if not exists mockdata;

create table if not exists mockdata.mockdata (
  id bigint unsigned auto_increment primary key,
  mock_name text not null,
  created_on timestamp);

insert into mockdata.mockdata (mock_name, created_on)
select concat('Hello Peter ', seq), current_timestamp from test.seq_1_to_5;

delete from mockdata.mockdata where created_on < date_sub(current_timestamp, interval 7 day);
Yet again, for the actual problem (time window) in question, I've found this:

# mdb02
7200613  Hello Peter 5  2026-03-29 20:05:13.0
7200612  Hello Peter 4  2026-03-29 20:05:13.0
7200611  Hello Peter 3  2026-03-29 20:05:13.0
7200610  Hello Peter 2  2026-03-29 20:05:13.0
7200609  Hello Peter 1  2026-03-29 20:05:13.0

7200606  Hello Peter 5  2026-03-29 20:05:01.0
7200605  Hello Peter 4  2026-03-29 20:05:01.0
7200604  Hello Peter 3  2026-03-29 20:05:01.0
7200603  Hello Peter 2  2026-03-29 20:05:01.0
7200602  Hello Peter 1  2026-03-29 20:05:01.0

# mdb03
7200611  Hello Peter 5  2026-03-29 20:05:22.0
7200610  Hello Peter 4  2026-03-29 20:05:22.0
7200609  Hello Peter 3  2026-03-29 20:05:22.0
7200608  Hello Peter 2  2026-03-29 20:05:22.0
7200607  Hello Peter 1  2026-03-29 20:05:22.0

7200606  Hello Peter 5  2026-03-29 20:05:01.0
7200605  Hello Peter 4  2026-03-29 20:05:01.0
7200604  Hello Peter 3  2026-03-29 20:05:01.0
7200603  Hello Peter 2  2026-03-29 20:05:01.0
7200602  Hello Peter 1  2026-03-29 20:05:01.0
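To make the divergence concrete, the id sets differ exactly where the lost transaction sits; a trivial cross-check (id values copied from the dumps above):

```python
# ids visible in the affected window on each host (from the dumps above)
mdb02 = {7200602, 7200603, 7200604, 7200605, 7200606,
         7200609, 7200610, 7200611, 7200612, 7200613}
mdb03 = {7200602, 7200603, 7200604, 7200605, 7200606,
         7200607, 7200608, 7200609, 7200610, 7200611}

print(sorted(mdb02 - mdb03))  # [7200612, 7200613] exist only on the old primary
print(sorted(mdb03 - mdb02))  # [7200607, 7200608] re-used by the new primary
```

Note also that ids 7200609 to 7200611 exist on both hosts but hold different rows (timestamps 20:05:13 vs 20:05:22), since the new primary re-used the auto_increment range it never received.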
Obviously, the mockdata rows timestamped "2026-03-29 20:05:13.0" never made it to the old replica, which is the reason for the divergence.
Does anyone have an idea?
Regards, Peter