[MDEV-28986] rpl tests sometimes failing on freebsd builders Created: 2022-06-30  Updated: 2023-07-25  Resolved: 2023-07-25

Status: Closed
Project: MariaDB Server
Component/s: CI
Affects Version/s: 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Angelique Sklavounos (Inactive) Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates

 Description   

10.6 2fa3ada07202bfd8764cff0bf2e5bbf8

rpl.rpl_xa 'innodb,mix'                  w2 [ fail ]
        Test ended at 2022-06-28 18:58:40
 
CURRENT_TEST: rpl.rpl_xa
mysqltest: In included file "./include/sync_with_master_gtid.inc": 
included from /usr/local/share/mysql-test/suite/rpl/t/rpl_xa.inc at line 350:
included from /usr/local/share/mysql-test/suite/rpl/t/rpl_xa.test at line 4:
At line 48: Failed to sync with master
 
The result from queries just before the failure was:
< snip >
connection master;
xa recover;
formatID	gtrid_length	bqual_length	data
1	11	0	skip_binlog
set @@session.sql_log_bin = OFF;
xa rollback 'skip_binlog';
set @@session.sql_log_bin = ON;
include/save_master_gtid.inc
*** Zero must be in the list:
connection master;
xa recover;
formatID	gtrid_length	bqual_length	data
*** At the end of skip_log_binb section gtid list has 0 more:
flush logs;
show binlog events in 'master-bin.000007' limit 1,1;
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
master-bin.000007	#	Gtid_list	1	#	[0-1-25]
connection slave;
include/sync_with_master_gtid.inc
Timeout in master_gtid_wait('0-1-25', 120), current slave GTID position is: 0-1-11.

The replica error log shows:

2022-06-28 18:56:39 7 [ERROR] Error reading packet from server: bogus data in log event; the first event 'master-bin.000001' at 4, the last event read from 'master-bin.000002' at 343, the last byte read from 'master-bin.000002' at 362. (server_errno=1236)
2022-06-28 18:56:39 7 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event; the first event 'master-bin.000001' at 4, the last event read from 'master-bin.000002' at 343, the last byte read from 'master-bin.000002' at 362.', Internal MariaDB error code: 1236
2022-06-28 18:56:39 7 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 343, master 127.0.0.1:16020

Test always seems to be retry-pass.
Also seen on 10.7, 10.10 bb branches.
Var dir: https://drive.google.com/drive/folders/1FGUV55ChN3Wpc4qG7XeFw4nDnaWzzAJj

rpl.rpl_xa_gtid_pos_auto_engine failures seem similar (replica error log has same messages):

10.6 2fa3ada07202bfd8764cff0bf2e5bbf8

rpl.rpl_xa_gtid_pos_auto_engine 'innodb,mix' w2 [ fail ]
        Test ended at 2022-06-28 19:00:47
 
CURRENT_TEST: rpl.rpl_xa_gtid_pos_auto_engine
mysqltest: In included file "./include/sync_with_master_gtid.inc": 
included from /usr/local/share/mysql-test/suite/rpl/t/rpl_xa.inc at line 350:
included from /usr/local/share/mysql-test/suite/rpl/t/rpl_xa_gtid_pos_auto_engine.test at line 16:
At line 48: Failed to sync with master
 
The result from queries just before the failure was:
< snip >
connection master;
xa recover;
formatID	gtrid_length	bqual_length	data
1	11	0	skip_binlog
set @@session.sql_log_bin = OFF;
xa rollback 'skip_binlog';
set @@session.sql_log_bin = ON;
include/save_master_gtid.inc
*** Zero must be in the list:
connection master;
xa recover;
formatID	gtrid_length	bqual_length	data
*** At the end of skip_log_binb section gtid list has 0 more:
flush logs;
show binlog events in 'master-bin.000007' limit 1,1;
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
master-bin.000007	#	Gtid_list	1	#	[0-1-25]
connection slave;
include/sync_with_master_gtid.inc
Timeout in master_gtid_wait('0-1-25', 120), current slave GTID position is: 0-1-11.



 Comments   
Comment by Brandon Nesterenko [ 2022-07-15 ]

Other tests which seem to be failing with the same error on kvm-ports-freebsd130-amd64, starting around the same date:

rpl.rpl_auto_increment
rpl.rpl_skip_replication
rpl.rpl_checksum_cache
rpl.rpl_row_tbl_metadata
rpl.rpl_mark_optimize_tbl_ddl
rpl.rpl_stm_innodb
rpl.rpl_sp
rpl.rpl_rotate_logs
rpl.rpl_temp_table
rpl.rpl_parallel_wrong_exec_master_pos
binlog_encryption.rpl_skip_replication

Comment by Angelique Sklavounos (Inactive) [ 2022-07-25 ]

And rpl.rpl_row_img_sequence https://buildbot.askmonty.org/buildbot/builders/kvm-ports-freebsd130-amd64/builds/2093

The var dirs are in the Google Drive link above

10.3 8aa37c264f1c67beb9dc5cd4127379f0

rpl.rpl_row_img_sequence 'innodb,row'    w2 [ fail ]
        Test ended at 2022-07-25 09:04:51
 
CURRENT_TEST: rpl.rpl_row_img_sequence
mysqltest: In included file "./include/rpl_row_img_general_loop.inc": 
included from /usr/local/share/mysql-test/suite/rpl/t/rpl_row_img_sequence.test at line 54:
At line 42: Failed to sync with master
 
The result from queries just before the failure was:
< snip >
# Verifying all expected column ids appear in binlog event output..
# ..success
# Verifying only expected column ids appear in binlog event output..
# ..success
# Validate server_3 binlogged NEXTVAL with the correct columns
connection server_3;
FLUSH LOGS;
include/ensure_binlog_row_event_columns.inc [(1,2,3,4,5,6,7,8)]
# MYSQL_BINLOG mysqld_datadir/binlog_filename -vv > assert_file
# Verifying all expected column ids appear in binlog event output..
# ..success
# Verifying only expected column ids appear in binlog event output..
# ..success
# Cleanup
connection server_1;
DROP TABLE s1;
include/save_master_gtid.inc
connection server_3;
include/sync_with_master_gtid.inc
Timeout in master_gtid_wait('0-1-41', 120), current slave GTID position is: 0-1-40,1-2-2.

2022-07-25  9:02:50 16 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event; the first event 'master-bin.000017' at 805, the last event read from 'master-bin.000027' at 359, the last byte read from 'master-bin.000027' at 378.', Internal MariaDB error code: 1236

Comment by Angelique Sklavounos (Inactive) [ 2022-08-22 ]

And rpl.rpl_sporadic_master https://buildbot.askmonty.org/buildbot/builders/kvm-ports-freebsd130-amd64/builds/2602

10.6 f02ca429f70c16be2b2e3d5671d9990c

rpl.rpl_sporadic_master 'row'            w4 [ fail ]
        Test ended at 2022-08-22 06:57:08
 
CURRENT_TEST: rpl.rpl_sporadic_master
analyze: sync_with_master
mysqltest: In included file "/usr/local/share/mysql-test/suite/rpl/include/rpl_sporadic_master.inc": 
included from /usr/local/share/mysql-test/suite/rpl/t/rpl_sporadic_master.test at line 2:
At line 26: sync_slave_with_master failed: 'select master_pos_wait('master-bin.000002', 820, 300, '')' returned -1 indicating timeout after 300 seconds
 
The result from queries just before the failure was:
include/master-slave.inc
[connection master]
create table t2(n int);
create table t1(n int not null auto_increment primary key);
insert into t1 values (NULL),(NULL);
truncate table t1;
insert into t1 values (4),(NULL);
connection slave;
include/stop_slave.inc
include/start_slave.inc
connection master;
insert into t1 values (NULL),(NULL);
flush logs;
truncate table t1;
insert into t1 values (10),(NULL),(NULL),(NULL),(NULL),(NULL);
 
 
 == /tmp/var/4/tmp/analyze-sync_with_master-mysqld.1.err ==
 
############################## default ##############################
 
**** SHOW WARNINGS on default ****
SHOW WARNINGS;
Level	Code	Message
 
**** SELECT replication-related variables on default ****
SELECT NOW(), @@SERVER_ID;
NOW()	@@SERVER_ID
2022-08-22 06:57:07	1
 
**** SHOW SLAVE STATUS on default ****
SHOW SLAVE STATUS;
 
**** SHOW MASTER STATUS on default ****
SHOW MASTER STATUS;
File	master-bin.000002
Position	820
Binlog_Do_DB	
Binlog_Ignore_DB	
 
**** SHOW SLAVE HOSTS on default ****
SHOW SLAVE HOSTS;
 
**** SHOW PROCESSLIST on default ****
SHOW PROCESSLIST;
Id	User	Host	db	Command	Time	State	Info	Progress
12	root	localhost	NULL	Query	0	starting	SHOW PROCESSLIST	0.000
 
**** SHOW BINARY LOGS on default ****
SHOW BINARY LOGS;
Log_name	File_size
master-bin.000001	1599
master-bin.000002	820
 
**** SHOW BINLOG EVENTS on default ****
binlog_name = 'master-bin.000002'
SHOW BINLOG EVENTS IN 'master-bin.000002';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
master-bin.000002	4	Format_desc	1	256	Server ver: 10.6.10-MariaDB-log, Binlog ver: 4
master-bin.000002	256	Gtid_list	1	299	[0-1-6]
master-bin.000002	299	Binlog_checkpoint	1	343	master-bin.000001
master-bin.000002	343	Binlog_checkpoint	1	387	master-bin.000002
master-bin.000002	387	Gtid	1	429	GTID 0-1-7
master-bin.000002	429	Query	1	513	use `test`; truncate table t1
master-bin.000002	513	Gtid	1	555	BEGIN GTID 0-1-8
master-bin.000002	555	Annotate_rows	1	639	insert into t1 values (10),(NULL),(NULL),(NULL),(NULL),(NULL)
master-bin.000002	639	Table_map	1	684	table_id: 34 (test.t1)
master-bin.000002	684	Write_rows_v1	1	747	table_id: 34 flags: STMT_END_F
master-bin.000002	747	Query	1	820	COMMIT
 
**** SHOW RELAYLOG EVENTS on default ****
relaylog_name = 'No such row'
SHOW RELAYLOG EVENTS IN 'No such row';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
connection default;
 
 == /tmp/var/4/tmp/analyze-sync_with_master-mysqld.2.err ==
 
############################## default ##############################
 
**** SHOW WARNINGS on default ****
SHOW WARNINGS;
Level	Code	Message
 
**** SELECT replication-related variables on default ****
SELECT NOW(), @@SERVER_ID;
NOW()	@@SERVER_ID
2022-08-22 06:57:07	2
 
**** SHOW SLAVE STATUS on default ****
SHOW SLAVE STATUS;
Slave_IO_State	
Master_Host	127.0.0.1
Master_User	root
Master_Port	16060
Connect_Retry	1
Master_Log_File	master-bin.000002
Read_Master_Log_Pos	343
Relay_Log_File	slave-relay-bin.000005
Relay_Log_Pos	643
Relay_Master_Log_File	master-bin.000002
Slave_IO_Running	No
Slave_SQL_Running	Yes
Replicate_Do_DB	
Replicate_Ignore_DB	
Replicate_Do_Table	
Replicate_Ignore_Table	
Replicate_Wild_Do_Table	
Replicate_Wild_Ignore_Table	
Last_Errno	0
Last_Error	
Skip_Counter	0
Exec_Master_Log_Pos	343
Relay_Log_Space	1000
Until_Condition	None
Until_Log_File	
Until_Log_Pos	0
Master_SSL_Allowed	No
Master_SSL_CA_File	
Master_SSL_CA_Path	
Master_SSL_Cert	
Master_SSL_Cipher	
Master_SSL_Key	
Seconds_Behind_Master	NULL
Master_SSL_Verify_Server_Cert	No
Last_IO_Errno	1236
Last_IO_Error	Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event; the first event 'master-bin.000001' at 1290, the last event read from 'master-bin.000002' at 343, the last byte read from 'master-bin.000002' at 362.'
Last_SQL_Errno	0
Last_SQL_Error	
Replicate_Ignore_Server_Ids	
Master_Server_Id	1
Master_SSL_Crl	
Master_SSL_Crlpath	
Using_Gtid	No
Gtid_IO_Pos	
Replicate_Do_Domain_Ids	
Replicate_Ignore_Domain_Ids	
Parallel_Mode	optimistic
SQL_Delay	0
SQL_Remaining_Delay	NULL
Slave_SQL_Running_State	Slave has read all relay log; waiting for more updates
Slave_DDL_Groups	3
Slave_Non_Transactional_Groups	3
Slave_Transactional_Groups	0
 
**** SHOW MASTER STATUS on default ****
SHOW MASTER STATUS;
File	slave-bin.000001
Position	1550
Binlog_Do_DB	
Binlog_Ignore_DB	
 
**** SHOW SLAVE HOSTS on default ****
SHOW SLAVE HOSTS;
 
**** SHOW PROCESSLIST on default ****
SHOW PROCESSLIST;
Id	User	Host	db	Command	Time	State	Info	Progress
12	system user		NULL	Slave_SQL	300	Slave has read all relay log; waiting for more updates	NULL	0.000
13	root	localhost	NULL	Query	0	starting	SHOW PROCESSLIST	0.000
 
**** SHOW BINARY LOGS on default ****
SHOW BINARY LOGS;
Log_name	File_size
slave-bin.000001	1550
 
**** SHOW BINLOG EVENTS on default ****
binlog_name = 'slave-bin.000001'
SHOW BINLOG EVENTS IN 'slave-bin.000001';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
slave-bin.000001	4	Format_desc	2	256	Server ver: 10.6.10-MariaDB-log, Binlog ver: 4
slave-bin.000001	256	Gtid_list	2	285	[]
slave-bin.000001	285	Binlog_checkpoint	2	328	slave-bin.000001
slave-bin.000001	328	Gtid	1	370	GTID 0-1-1
slave-bin.000001	370	Query	1	468	use `test`; create table t2(n int)
slave-bin.000001	468	Gtid	1	510	GTID 0-1-2
slave-bin.000001	510	Query	1	644	use `test`; create table t1(n int not null auto_increment primary key)
slave-bin.000001	644	Gtid	1	686	BEGIN GTID 0-1-3
slave-bin.000001	686	Annotate_rows	1	744	insert into t1 values (NULL),(NULL)
slave-bin.000001	744	Table_map	1	789	table_id: 33 (test.t1)
slave-bin.000001	789	Write_rows_v1	1	832	table_id: 33 flags: STMT_END_F
slave-bin.000001	832	Query	1	905	COMMIT
slave-bin.000001	905	Gtid	1	947	GTID 0-1-4
slave-bin.000001	947	Query	1	1031	use `test`; truncate table t1
slave-bin.000001	1031	Gtid	1	1073	BEGIN GTID 0-1-5
slave-bin.000001	1073	Annotate_rows	1	1128	insert into t1 values (4),(NULL)
slave-bin.000001	1128	Table_map	1	1173	table_id: 34 (test.t1)
slave-bin.000001	1173	Write_rows_v1	1	1216	table_id: 34 flags: STMT_END_F
slave-bin.000001	1216	Query	1	1289	COMMIT
slave-bin.000001	1289	Gtid	1	1331	BEGIN GTID 0-1-6
slave-bin.000001	1331	Annotate_rows	1	1389	insert into t1 values (NULL),(NULL)
slave-bin.000001	1389	Table_map	1	1434	table_id: 34 (test.t1)
slave-bin.000001	1434	Write_rows_v1	1	1477	table_id: 34 flags: STMT_END_F
slave-bin.000001	1477	Query	1	1550	COMMIT
 
**** SHOW RELAYLOG EVENTS on default ****
relaylog_name = 'slave-relay-bin.000005'
SHOW RELAYLOG EVENTS IN 'slave-relay-bin.000005';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
slave-relay-bin.000005	4	Format_desc	2	256	Server ver: 10.6.10-MariaDB-log, Binlog ver: 4
slave-relay-bin.000005	256	Rotate	1	0	master-bin.000002;pos=4
slave-relay-bin.000005	304	Format_desc	1	256	Server ver: 10.6.10-MariaDB-log, Binlog ver: 4
slave-relay-bin.000005	556	Gtid_list	1	299	[0-1-6]
slave-relay-bin.000005	599	Binlog_checkpoint	1	343	master-bin.000001
connection default;
 
 
 
 - saving '/tmp/var/4/log/rpl.rpl_sporadic_master-row/' to '/tmp/var/log/rpl.rpl_sporadic_master-row/'

Comment by Angelique Sklavounos (Inactive) [ 2022-09-07 ]

And rpl.rpl_savepoint https://buildbot.askmonty.org/buildbot/builders/kvm-ports-freebsd130-amd64/builds/2845

10.6 1c730da66

rpl.rpl_savepoint 'innodb,stmt'          w3 [ fail ]
        Test ended at 2022-09-06 14:19:23
 
CURRENT_TEST: rpl.rpl_savepoint
mysqltest: In included file "./include/check_slave_param.inc": 
included from ./include/check_slave_no_error.inc at line 28:
included from ./include/rpl_end.inc at line 74:
included from /usr/local/share/mysql-test/suite/rpl/t/rpl_savepoint.test at line 44:
At line 31: Wrong value for slave parameter Last_IO_Errno
 
The result from queries just before the failure was:
< snip >
master-bin.000002	1200
master-bin.000003	546
 
**** SHOW BINLOG EVENTS on server_1 ****
binlog_name = 'master-bin.000003'
SHOW BINLOG EVENTS IN 'master-bin.000003';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
master-bin.000003	4	Format_desc	1	256	Server ver: 10.6.10-MariaDB-log, Binlog ver: 4
master-bin.000003	256	Gtid_list	1	299	[0-1-6]
master-bin.000003	299	Binlog_checkpoint	1	343	master-bin.000002
master-bin.000003	343	Binlog_checkpoint	1	387	master-bin.000003
master-bin.000003	387	Gtid	1	429	GTID 0-1-7
master-bin.000003	429	Query	1	546	use `test`; DROP TABLE `nt` /* generated by server */
 
**** SHOW RELAYLOG EVENTS on server_1 ****
relaylog_name = 'No such row'
SHOW RELAYLOG EVENTS IN 'No such row';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
connection server_2;
Wrong value for Last_IO_Errno. Expected '0', got '1236'

The replica error log shows (https://buildbot.askmonty.org/buildbot/builders/kvm-ports-freebsd130-amd64/builds/2845/steps/mtr/logs/mysqld.2.err.3)

2022-09-06 14:19:23 7 [ERROR] Error reading packet from server: bogus data in log event; the first event 'master-bin.000001' at 4, the last event read from 'master-bin.000003' at 343, the last byte read from 'master-bin.000003' at 362. (server_errno=1236)
2022-09-06 14:19:23 7 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event; the first event 'master-bin.000001' at 4, the last event read from 'master-bin.000003' at 343, the last byte read from 'master-bin.000003' at 362.', Internal MariaDB error code: 1236

Comment by Sergei Golubchik [ 2022-09-23 ]

rpl.rpl_row_img_sequence_full has just failed in the bb-10.11-release branch: https://buildbot.askmonty.org/buildbot/builders/kvm-ports-freebsd130-amd64/builds/3192/steps/mtr/logs/stdio

Comment by Angelique Sklavounos (Inactive) [ 2023-02-08 ]

rpl.rpl_row_img_sequence_min
https://buildbot.mariadb.net/buildbot/builders/kvm-bintar-freebsd130-amd64/builds/1777/steps/mtr/logs/mysqld.3.err.4

2023-02-08  2:51:34 14 [Note] Slave I/O thread: Start asynchronous replication to master 'root@127.0.0.1:16061' in log 'master-bin.000001' at position 571
2023-02-08  2:51:34 15 [Note] Slave SQL thread initialized, starting replication in log 'master-bin.000001' at position 571, relay log './mysqld-relay-bin.000002' position: 871
2023-02-08  2:51:34 14 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16061',replication started in log 'master-bin.000001' at position 571
2023-02-08  2:51:37 14 [ERROR] Error reading packet from server: bogus data in log event; the first event 'master-bin.000001' at 571, the last event read from 'master-bin.000015' at 359, the last byte read from 'master-bin.000015' at 378. (server_errno=1236)
2023-02-08  2:51:37 14 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event; the first event 'master-bin.000001' at 571, the last event read from 'master-bin.000015' at 359, the last byte read from 'master-bin.000015' at 378.', Internal MariaDB error code: 1236
2023-02-08  2:51:37 14 [Note] Slave I/O thread exiting, read up to log 'master-bin.000015', position 359

Comment by Brandon Nesterenko [ 2023-07-02 ]

It looks like this might be a problem with disk synchronization. The issue seems to only happen when dumping from a newly rotated binary log, right after the binlog background thread writes its binlog_checkpoint_event. It seems after the BCE is "written", the binlog dump thread sees that the end_pos is updated from the IO_Cache, yet when it tries to read the data, it only gets NULL bytes. It seems more prevalent when running concurrently with more heavy-weight tests which fill swap space, and when full, this bug seems to occur quite frequently.

marko had mentioned a similar problem he ran into:

We had to deal with something similar after implementing strict write-ahead-logging in MDEV-24626. Any contents of a newly created file is possible after a killed server. Often it has the correct length and is filled with NUL bytes; sometimes, the length is 0. It is also possible that some pages inside the file have been written. The InnoDB I/O layer is of course different: we use asynchronous writes (io_uring or libaio when available; simulated with threads and pwrite() otherwise), and we always pre-allocate files by writing them with zeroes. I guess that the binlog would append to files. In that case, what would be read from the file near-concurrently with the write could depend on the file system or its parameters.

Elkin noted a potential similarity with MDEV-29981.

Mentioning knielsen for reference.

Comment by Kristian Nielsen [ 2023-07-22 ]

This failure only occurs when running with vardir in the freebsd tmpfs:

sudo mount -t tmpfs tmpfs /tmp
./mtr --vardir=/tmp/var ...

Without mounting tmpfs on /tmp first, it doesn't fail, with the tmpfs it fails quickly.

Possibly there's a bug in freebsd 13.0 tmpfs? It sure looks like there's some race so that reader thread reads zeros after writer thread has written Binlog_checksum_log_event. It seems that should never be possible, reader should either get EOF or valid data.

Maybe it's possible to repeat it with a simpler test program that could be reported to freebsd upstream...

Comment by Kristian Nielsen [ 2023-07-22 ]

I reproduced the failure using FreeBSD ktrace. This shows that the FreeBSD kernel is returning incorrect data:

# File open for write:
 33358 898673 mariadbd 8.128300 CALL  openat(AT_FDCWD,0x1842d40,0x100201<O_WRONLY|O_CREAT|O_CLOEXEC>,0660<S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP>)
 33358 898673 mariadbd 8.128324 NAMI  "./master-bin.000002"
 33358 898673 mariadbd 8.128358 RET   openat 84/0x54
 
# File write first part:
 33358 898673 mariadbd 8.128619 CALL  write(0x54,0x808cf03d8,0x157)
 33358 898673 mariadbd 8.128651 GIO   fd 84 wrote 343 bytes
       0x0000 fe62 696e f23e bc64 0f01 0000 00fc 0000  |.bin.>.d........|
       0x0010 0000 0100 0001 0004 0031 312e 322e 302d  |.........11.2.0-|
       0x0020 4d61 7269 6144 422d 6c6f 6700 0000 0000  |MariaDB-log.....|
       0x0030 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x0040 0000 0000 0000 0000 0000 0000 0000 0013  |................|
       0x0050 380d 0008 0012 0004 0404 0412 0000 e400  |8...............|
       0x0060 041a 0800 0000 0808 0802 0000 000a 0a0a  |................|
       0x0070 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x0080 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x0090 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00a0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00b0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00c0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00d0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00e0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00f0 0413 0400 0d08 0808 0a0a 0a01 f184 ad09  |................|
       0x0100 f23e bc64 a301 0000 002b 0000 002b 0100  |.>.d.....+...+..|
       0x0110 0000 0001 0000 0000 0000 0001 0000 0007  |................|
       0x0120 0000 0000 0000 004b c30b 8ff2 3ebc 64a1  |.......K....>.d.|
       0x0130 0100 0000 2c00 0000 5701 0000 0000 1100  |....,...W.......|
       0x0140 0000 6d61 7374 6572 2d62 696e 2e30 3030  |..master-bin.000|
       0x0150 3030 31d1 53cd ff                        |001.S..|
 33358 898673 mariadbd 8.128848 RET   write 343/0x157
 
# File open for read:
 33358 898672 mariadbd 8.130722 CALL  openat(AT_FDCWD,0x7fffdec20010,0x100000<O_RDONLY|O_CLOEXEC>)
 33358 898672 mariadbd 8.130747 NAMI  "./master-bin.000002"
 33358 898672 mariadbd 8.130772 RET   openat 23/0x17
 
# Write write second part starts:
 33358 898649 mariadbd 8.130846 CALL  write(0x54,0x808cf03d8,0x2c)
 
# Read of file starts concurrently:
 33358 898672 mariadbd 8.131108 CALL  read(0x17,0x807e31558,0x4000)
 33358 898672 mariadbd 8.131192 GIO   fd 23 read 387 bytes
       0x0000 fe62 696e f23e bc64 0f01 0000 00fc 0000  |.bin.>.d........|
       0x0010 0000 0100 0001 0004 0031 312e 322e 302d  |.........11.2.0-|
       0x0020 4d61 7269 6144 422d 6c6f 6700 0000 0000  |MariaDB-log.....|
       0x0030 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x0040 0000 0000 0000 0000 0000 0000 0000 0013  |................|
       0x0050 380d 0008 0012 0004 0404 0412 0000 e400  |8...............|
       0x0060 041a 0800 0000 0808 0802 0000 000a 0a0a  |................|
       0x0070 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x0080 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x0090 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00a0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00b0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00c0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00d0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00e0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x00f0 0413 0400 0d08 0808 0a0a 0a01 f184 ad09  |................|
       0x0100 f23e bc64 a301 0000 002b 0000 002b 0100  |.>.d.....+...+..|
       0x0110 0000 0001 0000 0000 0000 0001 0000 0007  |................|
       0x0120 0000 0000 0000 004b c30b 8ff2 3ebc 64a1  |.......K....>.d.|
       0x0130 0100 0000 2c00 0000 5701 0000 0000 1100  |....,...W.......|
       0x0140 0000 6d61 7374 6572 2d62 696e 2e30 3030  |..master-bin.000|
       0x0150 3030 31d1 53cd ff00 0000 0000 0000 0000  |001.S...........|
       0x0160 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x0170 0000 0000 0000 0000 0000 0000 0000 0000  |................|
       0x0180 0000 00                                  |...|
 33358 898672 mariadbd 8.131301 RET   read 387/0x183
 
# Second write completes. The read sees the length of the file including the
# second write, but the data is missing, all zeros.
 33358 898649 mariadbd 8.131540 GIO   fd 84 wrote 44 bytes
       0x0000 f23e bc64 a101 0000 002c 0000 0083 0100  |.>.d.....,......|
       0x0010 0000 0011 0000 006d 6173 7465 722d 6269  |.......master-bi|
       0x0020 6e2e 3030 3030 3032 0539 6465            |n.000002.9de|
 33358 898649 mariadbd 8.131596 RET   write 44/0x2c

Here we clearly see that the dump thread is reading the file in parallel with the binlog background thread writing the second binlog checkpoint. The read sees the length as if the second write had completed, but it does not see the correct data, it reads all zeros.

So this is not a bug in MariaDB, it's a problem with FreeBSD.
We can solve this by not using tmpfs for /tmp/var on the FreeBSD 13.0 builder.

Comment by Kristian Nielsen [ 2023-07-23 ]

Confirmed as a FreeBSD bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272678

Comment by Kristian Nielsen [ 2023-07-24 ]

Fix for buildbot: https://github.com/MariaDB/mariadb.org-tools/pull/176
elenst are you the one who can update the old buildbot to include this fix?

Comment by Elena Stepanova [ 2023-07-24 ]

As discussed elsewhere, we are going to try their suggested workaround first,

sudo sysctl debug.vn_io_pgcache_read_enable=0

It should be effective in buildbot starting from build 4870 on FreeBSD 130 bintar builder.

If it doesn't help, we'll merge the pull request above.

Comment by Kristian Nielsen [ 2023-07-25 ]

Fixed by using a work-around in buildbot for the FreeBSD bug:

https://github.com/MariaDB/mariadb.org-tools/commit/ecbc909d5bdd0838be28ca61014857fadd70cd67

Generated at Thu Feb 08 10:04:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.