Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-28986

rpl tests sometimes failing on freebsd builders

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.3(EOL), 10.4(EOL), 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL)
    • N/A
    • CI
    • None

    Description

      10.6 2fa3ada07202bfd8764cff0bf2e5bbf8

      rpl.rpl_xa 'innodb,mix'                  w2 [ fail ]
              Test ended at 2022-06-28 18:58:40
       
      CURRENT_TEST: rpl.rpl_xa
      mysqltest: In included file "./include/sync_with_master_gtid.inc": 
      included from /usr/local/share/mysql-test/suite/rpl/t/rpl_xa.inc at line 350:
      included from /usr/local/share/mysql-test/suite/rpl/t/rpl_xa.test at line 4:
      At line 48: Failed to sync with master
       
      The result from queries just before the failure was:
      < snip >
      connection master;
      xa recover;
      formatID	gtrid_length	bqual_length	data
      1	11	0	skip_binlog
      set @@session.sql_log_bin = OFF;
      xa rollback 'skip_binlog';
      set @@session.sql_log_bin = ON;
      include/save_master_gtid.inc
      *** Zero must be in the list:
      connection master;
      xa recover;
      formatID	gtrid_length	bqual_length	data
      *** At the end of skip_log_binb section gtid list has 0 more:
      flush logs;
      show binlog events in 'master-bin.000007' limit 1,1;
      Log_name	Pos	Event_type	Server_id	End_log_pos	Info
      master-bin.000007	#	Gtid_list	1	#	[0-1-25]
      connection slave;
      include/sync_with_master_gtid.inc
      Timeout in master_gtid_wait('0-1-25', 120), current slave GTID position is: 0-1-11.
      

      The replica error log shows:

      2022-06-28 18:56:39 7 [ERROR] Error reading packet from server: bogus data in log event; the first event 'master-bin.000001' at 4, the last event read from 'master-bin.000002' at 343, the last byte read from 'master-bin.000002' at 362. (server_errno=1236)
      2022-06-28 18:56:39 7 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event; the first event 'master-bin.000001' at 4, the last event read from 'master-bin.000002' at 343, the last byte read from 'master-bin.000002' at 362.', Internal MariaDB error code: 1236
      2022-06-28 18:56:39 7 [Note] Slave I/O thread exiting, read up to log 'master-bin.000002', position 343, master 127.0.0.1:16020
      

      Test always seems to be retry-pass.
      Also seen on 10.7, 10.10 bb branches.
      Var dir: https://drive.google.com/drive/folders/1FGUV55ChN3Wpc4qG7XeFw4nDnaWzzAJj

      rpl.rpl_xa_gtid_pos_auto_engine failures seem similar (replica error log has same messages):

      10.6 2fa3ada07202bfd8764cff0bf2e5bbf8

      rpl.rpl_xa_gtid_pos_auto_engine 'innodb,mix' w2 [ fail ]
              Test ended at 2022-06-28 19:00:47
       
      CURRENT_TEST: rpl.rpl_xa_gtid_pos_auto_engine
      mysqltest: In included file "./include/sync_with_master_gtid.inc": 
      included from /usr/local/share/mysql-test/suite/rpl/t/rpl_xa.inc at line 350:
      included from /usr/local/share/mysql-test/suite/rpl/t/rpl_xa_gtid_pos_auto_engine.test at line 16:
      At line 48: Failed to sync with master
       
      The result from queries just before the failure was:
      < snip >
      connection master;
      xa recover;
      formatID	gtrid_length	bqual_length	data
      1	11	0	skip_binlog
      set @@session.sql_log_bin = OFF;
      xa rollback 'skip_binlog';
      set @@session.sql_log_bin = ON;
      include/save_master_gtid.inc
      *** Zero must be in the list:
      connection master;
      xa recover;
      formatID	gtrid_length	bqual_length	data
      *** At the end of skip_log_binb section gtid list has 0 more:
      flush logs;
      show binlog events in 'master-bin.000007' limit 1,1;
      Log_name	Pos	Event_type	Server_id	End_log_pos	Info
      master-bin.000007	#	Gtid_list	1	#	[0-1-25]
      connection slave;
      include/sync_with_master_gtid.inc
      Timeout in master_gtid_wait('0-1-25', 120), current slave GTID position is: 0-1-11.
      

      Attachments

        Activity

          I reproduced the failure using FreeBSD ktrace. This shows that the FreeBSD kernel is returning incorrect data:

          # File open for write:
           33358 898673 mariadbd 8.128300 CALL  openat(AT_FDCWD,0x1842d40,0x100201<O_WRONLY|O_CREAT|O_CLOEXEC>,0660<S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP>)
           33358 898673 mariadbd 8.128324 NAMI  "./master-bin.000002"
           33358 898673 mariadbd 8.128358 RET   openat 84/0x54
           
          # File write first part:
           33358 898673 mariadbd 8.128619 CALL  write(0x54,0x808cf03d8,0x157)
           33358 898673 mariadbd 8.128651 GIO   fd 84 wrote 343 bytes
                 0x0000 fe62 696e f23e bc64 0f01 0000 00fc 0000  |.bin.>.d........|
                 0x0010 0000 0100 0001 0004 0031 312e 322e 302d  |.........11.2.0-|
                 0x0020 4d61 7269 6144 422d 6c6f 6700 0000 0000  |MariaDB-log.....|
                 0x0030 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x0040 0000 0000 0000 0000 0000 0000 0000 0013  |................|
                 0x0050 380d 0008 0012 0004 0404 0412 0000 e400  |8...............|
                 0x0060 041a 0800 0000 0808 0802 0000 000a 0a0a  |................|
                 0x0070 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x0080 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x0090 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00a0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00b0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00c0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00d0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00e0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00f0 0413 0400 0d08 0808 0a0a 0a01 f184 ad09  |................|
                 0x0100 f23e bc64 a301 0000 002b 0000 002b 0100  |.>.d.....+...+..|
                 0x0110 0000 0001 0000 0000 0000 0001 0000 0007  |................|
                 0x0120 0000 0000 0000 004b c30b 8ff2 3ebc 64a1  |.......K....>.d.|
                 0x0130 0100 0000 2c00 0000 5701 0000 0000 1100  |....,...W.......|
                 0x0140 0000 6d61 7374 6572 2d62 696e 2e30 3030  |..master-bin.000|
                 0x0150 3030 31d1 53cd ff                        |001.S..|
           33358 898673 mariadbd 8.128848 RET   write 343/0x157
           
          # File open for read:
           33358 898672 mariadbd 8.130722 CALL  openat(AT_FDCWD,0x7fffdec20010,0x100000<O_RDONLY|O_CLOEXEC>)
           33358 898672 mariadbd 8.130747 NAMI  "./master-bin.000002"
           33358 898672 mariadbd 8.130772 RET   openat 23/0x17
           
          # Write write second part starts:
           33358 898649 mariadbd 8.130846 CALL  write(0x54,0x808cf03d8,0x2c)
           
          # Read of file starts concurrently:
           33358 898672 mariadbd 8.131108 CALL  read(0x17,0x807e31558,0x4000)
           33358 898672 mariadbd 8.131192 GIO   fd 23 read 387 bytes
                 0x0000 fe62 696e f23e bc64 0f01 0000 00fc 0000  |.bin.>.d........|
                 0x0010 0000 0100 0001 0004 0031 312e 322e 302d  |.........11.2.0-|
                 0x0020 4d61 7269 6144 422d 6c6f 6700 0000 0000  |MariaDB-log.....|
                 0x0030 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x0040 0000 0000 0000 0000 0000 0000 0000 0013  |................|
                 0x0050 380d 0008 0012 0004 0404 0412 0000 e400  |8...............|
                 0x0060 041a 0800 0000 0808 0802 0000 000a 0a0a  |................|
                 0x0070 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x0080 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x0090 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00a0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00b0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00c0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00d0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00e0 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x00f0 0413 0400 0d08 0808 0a0a 0a01 f184 ad09  |................|
                 0x0100 f23e bc64 a301 0000 002b 0000 002b 0100  |.>.d.....+...+..|
                 0x0110 0000 0001 0000 0000 0000 0001 0000 0007  |................|
                 0x0120 0000 0000 0000 004b c30b 8ff2 3ebc 64a1  |.......K....>.d.|
                 0x0130 0100 0000 2c00 0000 5701 0000 0000 1100  |....,...W.......|
                 0x0140 0000 6d61 7374 6572 2d62 696e 2e30 3030  |..master-bin.000|
                 0x0150 3030 31d1 53cd ff00 0000 0000 0000 0000  |001.S...........|
                 0x0160 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x0170 0000 0000 0000 0000 0000 0000 0000 0000  |................|
                 0x0180 0000 00                                  |...|
           33358 898672 mariadbd 8.131301 RET   read 387/0x183
           
          # Second write completes. The read sees the length of the file including the
          # second write, but the data is missing, all zeros.
           33358 898649 mariadbd 8.131540 GIO   fd 84 wrote 44 bytes
                 0x0000 f23e bc64 a101 0000 002c 0000 0083 0100  |.>.d.....,......|
                 0x0010 0000 0011 0000 006d 6173 7465 722d 6269  |.......master-bi|
                 0x0020 6e2e 3030 3030 3032 0539 6465            |n.000002.9de|
           33358 898649 mariadbd 8.131596 RET   write 44/0x2c
          

          Here we clearly see that the dump thread is reading the file in parallel with the binlog background thread writing the second binlog checkpoint. The read sees the length as if the second write had completed, but it does not see the correct data, it reads all zeros.

          So this is not a bug in MariaDB, it's a problem with FreeBSD.
          We can solve this by not using tmpfs for /tmp/var on the FreeBSD 13.0 builder.

          knielsen Kristian Nielsen added a comment - I reproduced the failure using FreeBSD ktrace. This shows that the FreeBSD kernel is returning incorrect data: # File open for write: 33358 898673 mariadbd 8.128300 CALL openat(AT_FDCWD,0x1842d40,0x100201<O_WRONLY|O_CREAT|O_CLOEXEC>,0660<S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP>) 33358 898673 mariadbd 8.128324 NAMI "./master-bin.000002" 33358 898673 mariadbd 8.128358 RET openat 84/0x54   # File write first part: 33358 898673 mariadbd 8.128619 CALL write(0x54,0x808cf03d8,0x157) 33358 898673 mariadbd 8.128651 GIO fd 84 wrote 343 bytes 0x0000 fe62 696e f23e bc64 0f01 0000 00fc 0000 |.bin.>.d........| 0x0010 0000 0100 0001 0004 0031 312e 322e 302d |.........11.2.0-| 0x0020 4d61 7269 6144 422d 6c6f 6700 0000 0000 |MariaDB-log.....| 0x0030 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0040 0000 0000 0000 0000 0000 0000 0000 0013 |................| 0x0050 380d 0008 0012 0004 0404 0412 0000 e400 |8...............| 0x0060 041a 0800 0000 0808 0802 0000 000a 0a0a |................| 0x0070 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0080 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0090 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00a0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00b0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00c0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00d0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00e0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00f0 0413 0400 0d08 0808 0a0a 0a01 f184 ad09 |................| 0x0100 f23e bc64 a301 0000 002b 0000 002b 0100 |.>.d.....+...+..| 0x0110 0000 0001 0000 0000 0000 0001 0000 0007 |................| 0x0120 0000 0000 0000 004b c30b 8ff2 3ebc 64a1 |.......K....>.d.| 0x0130 0100 0000 2c00 0000 5701 0000 0000 1100 |....,...W.......| 0x0140 0000 6d61 7374 6572 2d62 696e 2e30 3030 |..master-bin.000| 0x0150 3030 31d1 53cd ff |001.S..| 33358 898673 mariadbd 8.128848 RET write 343/0x157   # File open for read: 33358 898672 mariadbd 8.130722 CALL openat(AT_FDCWD,0x7fffdec20010,0x100000<O_RDONLY|O_CLOEXEC>) 33358 898672 mariadbd 8.130747 NAMI "./master-bin.000002" 33358 898672 mariadbd 8.130772 RET openat 23/0x17   # Write write second part starts: 33358 898649 mariadbd 8.130846 CALL write(0x54,0x808cf03d8,0x2c)   # Read of file starts concurrently: 33358 898672 mariadbd 8.131108 CALL read(0x17,0x807e31558,0x4000) 33358 898672 mariadbd 8.131192 GIO fd 23 read 387 bytes 0x0000 fe62 696e f23e bc64 0f01 0000 00fc 0000 |.bin.>.d........| 0x0010 0000 0100 0001 0004 0031 312e 322e 302d |.........11.2.0-| 0x0020 4d61 7269 6144 422d 6c6f 6700 0000 0000 |MariaDB-log.....| 0x0030 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0040 0000 0000 0000 0000 0000 0000 0000 0013 |................| 0x0050 380d 0008 0012 0004 0404 0412 0000 e400 |8...............| 0x0060 041a 0800 0000 0808 0802 0000 000a 0a0a |................| 0x0070 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0080 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0090 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00a0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00b0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00c0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00d0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00e0 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x00f0 0413 0400 0d08 0808 0a0a 0a01 f184 ad09 |................| 0x0100 f23e bc64 a301 0000 002b 0000 002b 0100 |.>.d.....+...+..| 0x0110 0000 0001 0000 0000 0000 0001 0000 0007 |................| 0x0120 0000 0000 0000 004b c30b 8ff2 3ebc 64a1 |.......K....>.d.| 0x0130 0100 0000 2c00 0000 5701 0000 0000 1100 |....,...W.......| 0x0140 0000 6d61 7374 6572 2d62 696e 2e30 3030 |..master-bin.000| 0x0150 3030 31d1 53cd ff00 0000 0000 0000 0000 |001.S...........| 0x0160 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0170 0000 0000 0000 0000 0000 0000 0000 0000 |................| 0x0180 0000 00 |...| 33358 898672 mariadbd 8.131301 RET read 387/0x183   # Second write completes. The read sees the length of the file including the # second write, but the data is missing, all zeros. 33358 898649 mariadbd 8.131540 GIO fd 84 wrote 44 bytes 0x0000 f23e bc64 a101 0000 002c 0000 0083 0100 |.>.d.....,......| 0x0010 0000 0011 0000 006d 6173 7465 722d 6269 |.......master-bi| 0x0020 6e2e 3030 3030 3032 0539 6465 |n.000002.9de| 33358 898649 mariadbd 8.131596 RET write 44/0x2c Here we clearly see that the dump thread is reading the file in parallel with the binlog background thread writing the second binlog checkpoint. The read sees the length as if the second write had completed, but it does not see the correct data, it reads all zeros. So this is not a bug in MariaDB, it's a problem with FreeBSD. We can solve this by not using tmpfs for /tmp/var on the FreeBSD 13.0 builder.
          knielsen Kristian Nielsen added a comment - Confirmed as a FreeBSD bug: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272678

          Fix for buildbot: https://github.com/MariaDB/mariadb.org-tools/pull/176
          elenst are you the one who can update the old buildbot to include this fix?

          knielsen Kristian Nielsen added a comment - Fix for buildbot: https://github.com/MariaDB/mariadb.org-tools/pull/176 elenst are you the one who can update the old buildbot to include this fix?

          As discussed elsewhere, we are going to try their suggested workaround first,

          sudo sysctl debug.vn_io_pgcache_read_enable=0
          

          It should be effective in buildbot starting from build 4870 on FreeBSD 130 bintar builder.

          If it doesn't help, we'll merge the pull request above.

          elenst Elena Stepanova added a comment - As discussed elsewhere, we are going to try their suggested workaround first, sudo sysctl debug.vn_io_pgcache_read_enable=0 It should be effective in buildbot starting from build 4870 on FreeBSD 130 bintar builder. If it doesn't help, we'll merge the pull request above.
          knielsen Kristian Nielsen added a comment - Fixed by using a work-around in buildbot for the FreeBSD bug: https://github.com/MariaDB/mariadb.org-tools/commit/ecbc909d5bdd0838be28ca61014857fadd70cd67

          People

            knielsen Kristian Nielsen
            angelique.sklavounos Angelique Sklavounos (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.