Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-11491

binlog_encryption.rpl_checksum fails sporadically in buildbot

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.1(EOL)
    • 10.1.20
    • Tests
    • None

    Description

      http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar-debug/builds/1900/steps/test/logs/stdio

      binlog_encryption.rpl_checksum 'mix'     w4 [ fail ]
              Test ended at 2016-12-05 13:58:46
       
      CURRENT_TEST: binlog_encryption.rpl_checksum
      --- /home/buildbot/maria-slave/power8-vlp03-bintar-debug/build/mysql-test/suite/binlog_encryption/rpl_checksum.result	2016-12-05 13:18:06.104205208 -0500
      +++ /home/buildbot/maria-slave/power8-vlp03-bintar-debug/build/mysql-test/suite/binlog_encryption/rpl_checksum.reject	2016-12-05 13:58:45.454421184 -0500
      @@ -71,7 +71,7 @@
       set @@global.debug_dbug='d,simulate_slave_unaware_checksum';
       start slave;
       include/wait_for_slave_io_error.inc [errno=1236]
      -Last_IO_Error = 'Got fatal error 1236 from master when reading data from binary log: 'Slave can not handle replication events with the checksum that master is configured to log; the first event 'master-bin.000009' at 404, the last event read from 'master-bin.000010' at 4, the last byte read from 'master-bin.000010' at 249.''
      +Last_IO_Error = 'Got fatal error 1236 from master when reading data from binary log: 'Slave can not handle replication events with the checksum that master is configured to log; the first event 'master-bin.000009' at 444, the last event read from 'master-bin.000010' at 4, the last byte read from 'master-bin.000010' at 249.''
       select count(*) as zero from t1;
       zero
       0
       
      mysqltest: Result content mismatch
      

      Attachments

        Issue Links

          Activity

            elenst Elena Stepanova created issue -
            elenst Elena Stepanova made changes -
            Field Original Value New Value
            elenst Elena Stepanova made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            elenst Elena Stepanova added a comment - - edited

            I can get this error quite reliably with this crude injection:

            diff --git a/mysql-test/extra/rpl_tests/rpl_checksum.inc b/mysql-test/extra/rpl_tests/rpl_checksum.inc
            index 14664fd..44d3776 100644
            --- a/mysql-test/extra/rpl_tests/rpl_checksum.inc
            +++ b/mysql-test/extra/rpl_tests/rpl_checksum.inc
            @@ -15,6 +15,9 @@ call mtr.add_suppression('Slave can not handle replication events with the check
             call mtr.add_suppression('Replication event checksum verification failed');
             # due to C failure simulation
             call mtr.add_suppression('Relay log write failure: could not queue event from master');
            +--disable_query_log
            +set @@global.debug_dbug='d,mdev11491';
            +--enable_query_log
             call mtr.add_suppression('Master is configured to log replication events with checksum, but will not send such events to slaves that cannot process them');
             
             # A. read/write access to the global vars:
            @@ -88,6 +91,10 @@ create table t1 (a int);
             flush logs;
             flush logs;
             flush logs;
            +--sleep 10
            +--disable_query_log
            +set @@global.debug_dbug='';
            +--enable_query_log
             
             sync_slave_with_master;
             #connection slave;
            diff --git a/sql/log.cc b/sql/log.cc
            index 45ab5c8..6e0a87b 100644
            --- a/sql/log.cc
            +++ b/sql/log.cc
            @@ -9630,6 +9630,7 @@ binlog_background_thread(void *arg __attribute__((unused)))
                   DEBUG_SYNC(thd, "binlog_background_thread_before_mark_xid_done");
                   /* Grab next pointer first, as mark_xid_done() may free the element. */
                   next= queue->next_in_queue;
            +      DBUG_EXECUTE_IF("mdev11491", { my_sleep(1000000); };);
                   mysql_bin_log.mark_xid_done(queue->binlog_id, true);
                   queue= next;
             
            

            So, it's a race condition for really slow systems (disks?) when mark_xid_done is considerably delayed.
            We need to ensure that it happens before we rotate to the log which for which the position is later printed in the error message.

            elenst Elena Stepanova added a comment - - edited I can get this error quite reliably with this crude injection: diff --git a/mysql-test/extra/rpl_tests/rpl_checksum.inc b/mysql-test/extra/rpl_tests/rpl_checksum.inc index 14664fd..44d3776 100644 --- a/mysql-test/extra/rpl_tests/rpl_checksum.inc +++ b/mysql-test/extra/rpl_tests/rpl_checksum.inc @@ -15,6 +15,9 @@ call mtr.add_suppression('Slave can not handle replication events with the check call mtr.add_suppression('Replication event checksum verification failed'); # due to C failure simulation call mtr.add_suppression('Relay log write failure: could not queue event from master'); +--disable_query_log +set @@global.debug_dbug='d,mdev11491'; +--enable_query_log call mtr.add_suppression('Master is configured to log replication events with checksum, but will not send such events to slaves that cannot process them'); # A. read/write access to the global vars: @@ -88,6 +91,10 @@ create table t1 (a int); flush logs; flush logs; flush logs; +--sleep 10 +--disable_query_log +set @@global.debug_dbug=''; +--enable_query_log sync_slave_with_master; #connection slave; diff --git a/sql/log.cc b/sql/log.cc index 45ab5c8..6e0a87b 100644 --- a/sql/log.cc +++ b/sql/log.cc @@ -9630,6 +9630,7 @@ binlog_background_thread(void *arg __attribute__((unused))) DEBUG_SYNC(thd, "binlog_background_thread_before_mark_xid_done"); /* Grab next pointer first, as mark_xid_done() may free the element. */ next= queue->next_in_queue; + DBUG_EXECUTE_IF("mdev11491", { my_sleep(1000000); };); mysql_bin_log.mark_xid_done(queue->binlog_id, true); queue= next; So, it's a race condition for really slow systems (disks?) when mark_xid_done is considerably delayed. We need to ensure that it happens before we rotate to the log which for which the position is later printed in the error message.
            elenst Elena Stepanova made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            elenst Elena Stepanova added a comment - https://github.com/MariaDB/server/commit/870d7589c6c4596b0f1a641add5071469e2c94a7
            elenst Elena Stepanova made changes -
            Fix Version/s 10.1.20 [ 22112 ]
            Fix Version/s 10.1 [ 16100 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            alice Alice Sherepa made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 78644 ] MariaDB v4 [ 151325 ]

            People

              elenst Elena Stepanova
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.