Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6549

Assertion `(thd->variables.option_bits & (1ULL << 25)) == 0' failed with parallel replication

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.0.13
    • 10.0.14
    • None
    • None

    Description

      10.0/sql/log_event.cc:6555: virtual int Gtid_log_event::do_apply_event(rpl_group_info*): Assertion `(thd->variables.option_bits & (1ULL << 25)) == 0' failed.
      140807  2:27:35 [ERROR] mysqld got signal 6 ;

      #5  0x00007fdc71db07c0 in *__GI_abort () at abort.c:92
      #6  0x00007fdc71da66f1 in *__GI___assert_fail (assertion=0x1019e60 "(thd->variables.option_bits & (1ULL << 25)) == 0", file=<optimized out>, line=6555, function=0x101cae0 "virtual int Gtid_log_event::do_apply_event(rpl_group_info*)") at assert.c:81
      #7  0x00000000009561c4 in Gtid_log_event::do_apply_event (this=0x7fdc47336470, rgi=0x7fdc483dc800) at 10.0/sql/log_event.cc:6555
      #8  0x00000000005edf08 in Log_event::apply_event (this=0x7fdc47336470, rgi=0x7fdc483dc800) at 10.0/sql/log_event.h:1343
      #9  0x00000000005e4f8e in apply_event_and_update_pos (ev=0x7fdc47336470, thd=0x7fdc478bd070, rgi=0x7fdc483dc800, rpt=0x7fdc7163eeb0) at 10.0/sql/slave.cc:3255
      #10 0x00000000007f06b9 in rpt_handle_event (qev=0x7fdc46d83770, rpt=0x7fdc7163eeb0) at 10.0/sql/rpl_parallel.cc:36
      #11 0x00000000007f1f7b in handle_rpl_parallel_thread (arg=0x7fdc7163eeb0) at 10.0/sql/rpl_parallel.cc:724
      #12 0x00007fdc7395fb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
      #13 0x00007fdc71e5720d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

      Stack trace from:

      revision-id: sergii@pisem.net-20140805163520-tz0w14h8uju0ti9n
      revno: 4330
      branch-nick: 10.0

      To reproduce:

      • put the attached set of binary logs into the master datadir;
      • start master with log-bin=mysql-bin;
      • start slave with --slave-skip-errors=1049,1305,1539,1505,1317 --slave-parallel-threads=8;
      • start replication with MASTER_USE_GTID = current_pos;
      • wait.

      Note: most likely not all slave-skip-errors values are needed, but some are.

      The failure happens sporadically in different places over the course of replicating the binary logs. For me it happens 2-3 times before the slave catches up with the master.

      The binary logs were created during a usual stress test on Power 8 machine and were not tampered with anyhow.

      Could not reproduce without parallel replication.

      Attachments

        Activity

          Ok, I can reproduce as described.

          The thread that asserts turns out to have needed to do a deadlock retry of the transaction executed just before the one that asserts.

          It seems likely that the problem is missing cleanup of some transactional state flag in case of transaction retry. I'll dig into it.

          knielsen Kristian Nielsen added a comment - Ok, I can reproduce as described. The thread that asserts turns out to have needed to do a deadlock retry of the transaction executed just before the one that asserts. It seems likely that the problem is missing cleanup of some transactional state flag in case of transaction retry. I'll dig into it.

          Ok, so the real problem here turned out to be as follows:

          When a replicated transaction updates (but does not commit yet) the update of
          gtid_slave_pos, the flag is cleared that marks that a GTID exists to be so
          recorded in gtid_slave_pos. Then if the transaction is later killed due to a
          deadlock and retried, the flag remained cleared, so that during the retry, the
          update of gtid_slave_pos would not happen, and this the GTID position would be
          wrong.

          As a side effect the code would also skip clearing OPTION_GTID_BEGIN when
          skipping the gtid_slave_pos update, and this is what causes the assertion, but
          the real (and serious) problem is the missing update of GTID position.

          Good catch!

          knielsen Kristian Nielsen added a comment - Ok, so the real problem here turned out to be as follows: When a replicated transaction updates (but does not commit yet) the update of gtid_slave_pos, the flag is cleared that marks that a GTID exists to be so recorded in gtid_slave_pos. Then if the transaction is later killed due to a deadlock and retried, the flag remained cleared, so that during the retry, the update of gtid_slave_pos would not happen, and this the GTID position would be wrong. As a side effect the code would also skip clearing OPTION_GTID_BEGIN when skipping the gtid_slave_pos update, and this is what causes the assertion, but the real (and serious) problem is the missing update of GTID position. Good catch!

          Pushed to 10.0.14

          knielsen Kristian Nielsen added a comment - Pushed to 10.0.14

          People

            knielsen Kristian Nielsen
            elenst Elena Stepanova
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.