Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.6.12
Fix Version/s: None
Description
We want to build a 1 + N replication MariaDB environment, but we noticed behavior that was not what we expected.
With the default settings:
relay_log_purge = ON
relay_log_recovery = OFF
According to the documentation:
https://mariadb.com/kb/en/replication-and-binary-log-system-variables/
with relay_log_purge = ON, relay log files that are no longer necessary (all of their contents have been applied on the replica node) are purged, and
with relay_log_recovery = ON, the replica will drop all relay logs that haven't yet been processed and retrieve relay logs from the primary.
We understand "relay logs that haven't yet been processed" to mean relay log files whose contents have not all been applied on the replica node.
In our test, we had these relay log files that were not yet fully applied:
-rw-rw---- 1 mariadba mariadba 350 Aug 17 12:39 /var/mariadb/log_base/relay-bin.000045
-rw-rw---- 1 mariadba mariadba 1073741994 Aug 17 12:45 /var/mariadb/log_base/relay-bin.000046
...
-rw-rw---- 1 mariadba mariadba 1014 Aug 17 13:28 /var/mariadb/log_base/relay-bin.index
-rw-rw---- 1 mariadba mariadba 456786881 Aug 17 13:46 /var/mariadb/log_base/relay-bin.000070
We expected that when we ran STOP SLAVE and then START SLAVE, those relay log files would not be dropped, and the replica node would try to fetch binary log contents from the primary node starting from the relay logs' last position.
When we ran STOP SLAVE, those relay log files were still on disk, but when we ran START SLAVE, they were dropped! Only these files remained on disk:
-rw-rw---- 1 mariadba mariadba 256 Aug 17 13:52 /var/mariadb/log_base/relay-bin.000001
-rw-rw---- 1 mariadba mariadba 39 Aug 17 13:52 /var/mariadb/log_base/relay-bin.index
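A minimal sketch of the statements we ran on the replica for this test (SHOW SLAVE STATUS is just how we observed the positions; the file listings above were taken before and after):

STOP SLAVE;
-- relay log files are still on disk at this point
START SLAVE;
-- relay log files are now dropped; check the replication state:
SHOW SLAVE STATUS\G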
And since the primary node had already purged its binary log files according to the binlog_expire_logs_seconds setting, the replica node could not get the binary logs from the primary node, and showed the error below:
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
It seemed the replica node wanted to fetch the binary log from a position it had already applied, not from the relay logs' last position.
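For what it's worth, one way to compare the GTID positions involved (gtid_slave_pos and gtid_binlog_pos are standard MariaDB system variables; in our case the replica's position was no longer covered by any binlog remaining on the primary):

-- on the replica: the GTID position up to which transactions were applied
SELECT @@GLOBAL.gtid_slave_pos;
-- on the primary: the current binlog GTID position, and which binlogs remain
SELECT @@GLOBAL.gtid_binlog_pos;
SHOW BINARY LOGS;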
And of course, we had to restore the replica node from a full backup of the primary node to recover from this situation.
For another test, we only manipulated the individual replication threads:
1. We ran STOP SLAVE SQL_THREAD while keeping the IO thread working; after START SLAVE SQL_THREAD, the replica node kept the relay log files and applied the changes from them.
2. We then ran STOP SLAVE SQL_THREAD for a period of time, letting the relay logs grow to multiple files; after START SLAVE SQL_THREAD, the replica node again kept the relay log files and applied the changes from them.
3. While the SQL thread had not yet applied all relay logs, we ran STOP SLAVE IO_THREAD; the relay log files stayed on disk and the SQL thread kept working.
4. After START SLAVE IO_THREAD, the relay log files were still on disk as well, and both the SQL thread and the IO thread worked.
It seems MariaDB needs the SQL thread to keep running in order to keep the relay log files on disk.
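The thread-level sequence from this test, as a sketch (replica side only):

STOP SLAVE SQL_THREAD;   -- IO thread keeps writing relay logs
START SLAVE SQL_THREAD;  -- relay logs kept, changes applied
STOP SLAVE SQL_THREAD;   -- wait, let relay logs grow to multiple files
START SLAVE SQL_THREAD;  -- relay logs kept, changes applied
STOP SLAVE IO_THREAD;    -- while SQL thread still has relay logs to apply
START SLAVE IO_THREAD;   -- relay logs still on disk, both threads working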
Is this behavior correct? Although we set relay_log_recovery = OFF, MariaDB still dropped all relay log files from disk and then tried to fetch the binary log contents from the primary node. This is risky when the SQL thread does not apply changes as fast as the primary node produces them. The replica node had already fetched the binary logs from the primary node and stored that information in relay log files on disk; we think they should not be dropped.
Please help us confirm this issue. Thank you.