[MDEV-25395] server recovery hits replication event checksum error Created: 2021-04-12  Updated: 2023-04-27

Status: Stalled
Project: MariaDB Server
Component/s: Replication, Server
Affects Version/s: 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.4, 10.5

Type: Bug Priority: Major
Reporter: Andrei Elkin Assignee: Andrei Elkin
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Relates
relates to MDEV-21117 refine the server binlog-based recove... Closed

 Description   

In the unlikely case of a crash while @@global.binlog_checksum is being changed from none to
crc32, where only the first of the two Binlog_checkpoint_log_event events gets written to
the crc32-rotated binlog file, the subsequent recovery hits a checksum verification error.

How to repeat:

set @@global.binlog_checksum=none; 
set @@global.debug_dbug='d,crash_before_write_second_checkpoint_event';
set @@global.binlog_checksum=crc32; # => CRASH

Now, when the server is restarted with --master-verify-checksum=1, the error log
receives the following:

 [ERROR] Replication event checksum verification failed while reading from a log file
 [ERROR] Error in Log_event::read_log_event(): 'Replication event checksum verification failed while reading from a log file', data_len: 25, event_type: 163

Nevertheless, the server ignores these errors and finishes initialization.
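
For readers unfamiliar with the check that fails here, the following is a minimal sketch of a CRC32 trailer verification. It is not the server's code. The helper names (crc32_ieee, append_checksum, verify_checksum) are hypothetical, and the layout (a 4-byte little-endian CRC32 appended to the event image) is an assumption made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Width of the checksum trailer in bytes (a CRC32 value is 4 bytes).
constexpr std::size_t kChecksumLen = 4;

// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
// Slow but dependency-free; enough for an illustration.
std::uint32_t crc32_ieee(const std::uint8_t* data, std::size_t len) {
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < len; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return crc ^ 0xFFFFFFFFu;
}

// Hypothetical helper: append a little-endian CRC32 trailer to an event image.
void append_checksum(std::vector<std::uint8_t>& event) {
  const std::uint32_t crc = crc32_ieee(event.data(), event.size());
  for (int shift = 0; shift < 32; shift += 8)
    event.push_back(static_cast<std::uint8_t>(crc >> shift));
}

// Hypothetical helper: read the last 4 bytes as the stored CRC32 and compare
// it with the CRC32 computed over all bytes that precede the trailer.
bool verify_checksum(const std::vector<std::uint8_t>& event) {
  if (event.size() < kChecksumLen) return false;  // too short to hold a trailer
  const std::size_t body_len = event.size() - kChecksumLen;
  std::uint32_t stored = 0;
  for (std::size_t i = 0; i < kChecksumLen; ++i)
    stored |= static_cast<std::uint32_t>(event[body_len + i]) << (8 * i);
  return stored == crc32_ieee(event.data(), body_len);
}
```

Under these assumptions, an event image passes verify_checksum() only if its last four bytes were produced by append_checksum() over the preceding bytes. Any disagreement between how an event was written and how it is read back shows up as a verification failure of the kind logged above.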

The simulation label is defined as

--- a/sql/log.cc
+++ b/sql/log.cc
@@ -6784,6 +6784,11 @@ void MYSQL_BIN_LOG::purge()
 
 void MYSQL_BIN_LOG::checkpoint_and_purge(ulong binlog_id)
 {
+  DBUG_EXECUTE_IF("crash_before_write_second_checkpoint_event",
+                  flush_io_cache(&log_file);
+                  mysql_file_sync(log_file.file, MYF(MY_WME));
+                  DBUG_SUICIDE(););
+
   do_checkpoint_request(binlog_id);
   purge();
 }



 Comments   
Comment by Andrei Elkin [ 2021-04-12 ]

A patch is made in the MDEV-21117 branch, which is to be updated shortly with a few more commits
dealing with this issue.

Comment by Andrei Elkin [ 2021-04-15 ]

serg: the patch had to be refined to satisfy existing tests, which were tolerant of checksum errors
at recovery. 412e696fd2b implements a plan discussed on Slack.

The server now stops when master-verify-checksum = 1, and the error messages contain
the binlog offset of the corrupted event.
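
The behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in 412e696fd2b. ScannedEvent, ScanOutcome and scan_for_recovery are hypothetical names, and the per-event checksum result is assumed to be already known.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one event read from a binlog file during recovery.
struct ScannedEvent {
  std::size_t length;   // bytes the event occupies in the file
  bool checksum_ok;     // outcome of the checksum check for this event
};

// Result of a recovery scan over one binlog file.
struct ScanOutcome {
  bool completed;            // true if the scan reached the end of the log
  std::size_t events_read;   // events accepted before stopping
  std::string error;         // empty unless the scan was aborted
};

// Walk the events in order. When verification is enabled, stop at the first
// event whose checksum does not match and report the offset where it starts.
ScanOutcome scan_for_recovery(const std::vector<ScannedEvent>& log,
                              bool verify_checksum,
                              std::size_t start_offset) {
  ScanOutcome out{true, 0, ""};
  std::size_t offset = start_offset;
  for (const ScannedEvent& ev : log) {
    if (verify_checksum && !ev.checksum_ok) {
      out.completed = false;
      out.error = "checksum verification failed for event at binlog offset " +
                  std::to_string(offset);
      return out;
    }
    ++out.events_read;
    offset += ev.length;
  }
  return out;
}
```

With verify_checksum set to false the sketch never consults checksum_ok, so the scan runs to the end of the log regardless of corrupted events.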

Comment by Andrei Elkin [ 2021-08-11 ]

To,
> fdle changes - ok, I've seen them in 21117.
cur_log/etc - clear.

> What are the changes around prev_event_pos for?

I replied:
_

I think I moved it out of Recovery_context, where it was a member, to satisfy
builds compiled without HAVE_REPLICATION.
 
Notice
 
int TC_LOG_BINLOG::recover()
...
#ifdef HAVE_REPLICATION
  Recovery_context ctx;
#endif

_
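
The point about HAVE_REPLICATION can be illustrated with a small sketch. It is not the actual TC_LOG_BINLOG::recover() code: RecoveryContextSketch and recover_sketch are hypothetical, and only the idea is taken from the reply above. State that is needed in every build has to live outside a type that exists only when HAVE_REPLICATION is defined.

```cpp
#include <cstdint>

// #define HAVE_REPLICATION 1  // uncomment to model a replication-enabled build

#ifdef HAVE_REPLICATION
// Hypothetical stand-in for a helper that exists only in replication builds.
struct RecoveryContextSketch {
  int events_seen = 0;
  void note_event() { ++events_seen; }
};
#endif

// Walks `count` fixed-size events starting at `first_pos` and returns the
// position of the last event visited (or `first_pos` if there are none).
std::uint64_t recover_sketch(std::uint64_t first_pos, std::uint64_t event_len,
                             int count) {
  // Kept as a plain local rather than a member of the guarded type, so the
  // function compiles whether or not HAVE_REPLICATION is defined.
  std::uint64_t prev_event_pos = first_pos;
#ifdef HAVE_REPLICATION
  RecoveryContextSketch ctx;
#endif
  std::uint64_t pos = first_pos;
  for (int i = 0; i < count; ++i) {
    prev_event_pos = pos;
#ifdef HAVE_REPLICATION
    ctx.note_event();
#endif
    pos += event_len;
  }
  return prev_event_pos;
}
```

Had prev_event_pos been a member of the guarded struct, every use of it would also need an #ifdef, or the build without HAVE_REPLICATION would fail to compile.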

Comment by Sergei Golubchik [ 2022-01-10 ]

412e696fd2bc is ok to push
