Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6718

Server crashed in Gtid_log_event::Gtid_log_event with parallel replication

Details

    Description

      #3  <signal handler called>
      #4  0x000000001069f6dc in Gtid_log_event::Gtid_log_event (this=0x3fff7a425c70, buf=0x3fff74b6bdb0 "6m\fT\242\001", event_len=40, description_event=0x3fff73be02b0) at 10.0/sql/log_event.cc:6392
      #5  0x0000000010692a58 in Log_event::read_log_event (buf=0x3fff74b6bdb0 "6m\fT\242\001", event_len=40, error=0x3fffb1575bf0, description_event=0x3fff73be02b0, crc_check=1 '\001') at 10.0/sql/log_event.cc:1666
      #6  0x0000000010692000 in Log_event::read_log_event (file=0x3fffb1576210, log_lock=0x0, description_event=0x3fff73be02b0, crc_check=1 '\001') at 10.0/sql/log_event.cc:1503
      #7  0x000000001048a1e0 in retry_event_group (rgi=0x3fff7ae67800, rpt=0x3fffb0fb8ab0, orig_qev=0x3fff74b26a70) at 10.0/sql/rpl_parallel.cc:373
      #8  0x000000001048b63c in handle_rpl_parallel_thread (arg=0x3fffb0fb8ab0) at 10.0/sql/rpl_parallel.cc:779
      #9  0x000000001097cd58 in pfs_spawn_thread (arg=0x3fffb0e6ec70) at 10.0/storage/perfschema/pfs.cc:1860
      #10 0x00003fffb248c460 in .start_thread () from /lib64/power8/libpthread.so.0

      Stack trace from:

      revision-id: jplindst@mariadb.org-20140825103533-p9jodvh91r9tueo1
      revno: 4362
      branch-nick: 10.0

      All threads' stack traces are attached.
      Binary logs from master are attached.

      It happened once so far during tests on Power.
      The test was run with ~64 threads on master, SBR, 8 slave parallel threads, thread_handling=pool-of-threads. Here is the complete RQG command line (for informational purposes, it does not necessarily reproduce the issue):

      perl ./runall-new.pl \
      --queries=100000000 \
      --no-mask \
      --seed=1410100487 \
      --threads=64 \
      --duration=600 \
      --queries=100M \
      --reporters=QueryTimeout,Backtrace,ErrorLog,Deadlock \
      --validators=TransformerNoComparator \
      --transformers=ExecuteAsPreparedTwice \
      --redefine=conf/mariadb/general-workarounds.yy \
      --redefine=conf/mariadb/10.0-features-redefine.yy \
      --mysqld=--log_output=FILE \
      --mysqld=--log_bin_trust_function_creators=1 \
      --mysqld=--performance_schema=1 \
      --mysqld=--query_cache_size=64M \
      --grammar=conf/replication/replication-ddl_sql.yy \
      --gendata=conf/replication/replication-ddl_data.zz \
      --rpl_mode=statement \
      --mysqld=--slave-skip-errors=1054,1317,1049,1305,1539,1505 \
      --mysqld=--slave_parallel_threads=8 \
      --mysqld=--thread_handling=pool-of-threads \
      --mtr-build-thread=300 \
      --mask=57406 \
      --basedir1=<10.0 basedir>
      --vardir1=<vardir>

      revision-id: elenst@montyprogram.com-20140905081817-dei34iw9d0dhywd1
      revno: 979
      branch-nick: randgen-mariadb-patches

      Attachments

        1. binlogs.tar.gz
          765 kB
        2. full_threads
          139 kB
        3. threads
          49 kB

        Activity

          It is crashing on this line:

          uint8 post_header_len= description_event->post_header_len[GTID_EVENT-1];

          The context is the Gtid_log_event constructor:

          Gtid_log_event::Gtid_log_event(const char *buf, uint event_len,
          const Format_description_log_event *description_event)
          : Log_event(buf, description_event), seq_no(0), commit_id(0)
          {
          uint8 header_size= description_event->common_header_len;
          uint8 post_header_len= description_event->post_header_len[GTID_EVENT-1];
          if (event_len < header_size + post_header_len ||
          post_header_len < GTID_HEADER_LEN)
          return;

          Looks like maybe the description_event pointer is incorrect or something like that?

          knielsen Kristian Nielsen added a comment - It is crashing on this line: uint8 post_header_len= description_event->post_header_len [GTID_EVENT-1] ; The context is the Gtid_log_event constructor: Gtid_log_event::Gtid_log_event(const char *buf, uint event_len, const Format_description_log_event *description_event) : Log_event(buf, description_event), seq_no(0), commit_id(0) { uint8 header_size= description_event->common_header_len; uint8 post_header_len= description_event->post_header_len [GTID_EVENT-1] ; if (event_len < header_size + post_header_len || post_header_len < GTID_HEADER_LEN) return; Looks like maybe the description_event pointer is incorrect or something like that?

          I think I can repeat the problem by running the test case rpl.rpl_parallel_retry a lot of times with --mem --parallel=3 --repeat=1000 (on standard amd64 linux).

          At least the problem looks very similar, corrupt description_event below retry_event_group(). It fails in various ways depending on exactly when the corruption triggers, but the root cause seems likely to be the same.

          knielsen Kristian Nielsen added a comment - I think I can repeat the problem by running the test case rpl.rpl_parallel_retry a lot of times with --mem --parallel=3 --repeat=1000 (on standard amd64 linux). At least the problem looks very similar, corrupt description_event below retry_event_group(). It fails in various ways depending on exactly when the corruption triggers, but the root cause seems likely to be the same.

          Here is a proposed patch:

          http://lists.askmonty.org/pipermail/commits/2014-September/006667.html

          The problem is incorrect handling of the format description event that is used
          when reading events to be retried from the relay log. It was using a
          description event owned by another thread (the SQL driver thread), so there
          was a rare race where this would change (during relay log rotation) just at
          the moment where a worker thread would attempt to simultaneously use it during
          retrying a transaction that failed due to a deadlock.

          knielsen Kristian Nielsen added a comment - Here is a proposed patch: http://lists.askmonty.org/pipermail/commits/2014-September/006667.html The problem is incorrect handling of the format description event that is used when reading events to be retried from the relay log. It was using a description event owned by another thread (the SQL driver thread), so there was a rare race where this would change (during relay log rotation) just at the moment where a worker thread would attempt to simultaneously use it during retrying a transaction that failed due to a deadlock.
          knielsen Kristian Nielsen added a comment - Pushed to 10.0.15: http://lists.askmonty.org/pipermail/commits/2014-November/006974.html

          People

            knielsen Kristian Nielsen
            elenst Elena Stepanova
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.