  1. MariaDB Server
  2. MDEV-10391

During async GTID replication Galera crashes after error writing to binlog

Details

    Description

      We were running three separate Galera clusters of two nodes each, with peer-to-peer asynchronous master-master replication between the clusters. The second Galera node in each cluster served as the master for each slave in case of failure. In this setup we ran into the following error, and replication stopped.

      2016-07-18 17:16:33 140072587262720 [ERROR] Master 'va_2': mysqld: Error writing file 'binlog' (errno: 1950 "Unknown error 1950")
      2016-07-18 17:16:33 140072587262720 [ERROR] Master 'va_2': WSREP: FSM: no such a transition COMMITTING -> ROLLED_BACK
      160718 17:16:33 [ERROR] mysqld got signal 6 ;

      After the reboot, the node had lost its slave settings because of the SST. After we recreated the slave settings, the GTID position was intact, but replication would not start with MASTER_USE_GTID=current_pos and failed with this error message:

      Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 1-104-68680, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'.
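The condition that error 1236 describes can be illustrated with a small sketch. This is a simplified model only, not the server's actual implementation: it assumes MariaDB GTIDs have the form domain_id-server_id-seq_no, and the function names and the sample binlog contents below are hypothetical (only the requested GTID 1-104-68680 comes from the report).

```python
from typing import Iterable, NamedTuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Parse a 'domain-server-seq' string into a Gtid tuple."""
    domain, server, seq = text.strip().split("-")
    return Gtid(int(domain), int(server), int(seq))


def classify_start_position(requested: str, master_binlog: Iterable[str]) -> str:
    """Simplified model of the check described by error 1236.

    Returns 'ok' if the requested GTID is present in the master's binlog,
    'missing-but-higher-seq-exists' if it is absent while the same domain
    already contains higher sequence numbers (the case reported above),
    and 'missing' otherwise.
    """
    want = parse_gtid(requested)
    # Only GTIDs in the same replication domain are comparable.
    same_domain = [g for g in map(parse_gtid, master_binlog)
                   if g.domain_id == want.domain_id]
    if want in same_domain:
        return "ok"
    if any(g.seq_no > want.seq_no for g in same_domain):
        return "missing-but-higher-seq-exists"
    return "missing"


# Hypothetical master binlog contents: domain 1, written by server_id 101 only.
sample_binlog = ["1-101-68679", "1-101-68681"]
result = classify_start_position("1-104-68680", sample_binlog)
```

In this model, a GTID carrying server_id 104 is never found in a binlog that only holds events written by server_id 101, even though neighbouring sequence numbers exist, which matches the shape of the message above when several servers share one domain.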

      Because the slave died while replicating during a load test that was being run from another node, I don't think it could have diverged. I also tried increasing GTID_SLAVE_POS incrementally by 10 transactions, but received the same error message and could not get replication to resume. All clusters used the same gtid_domain_id with a unique server_id per cluster, and every node within a cluster had the same server_id to avoid duplicated replication. Many tests ran fine until we hit the error writing the binlog and the corresponding WSREP error.

      Full log:

      2016-07-18 17:16:33 140072587262720 [ERROR] Master 'va_2': mysqld: Error writing file 'binlog' (errno: 1950 "Unknown error 1950")
      2016-07-18 17:16:33 140072587262720 [ERROR] Master 'va_2': WSREP: FSM: no such a transition COMMITTING -> ROLLED_BACK
      160718 17:16:33 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.

      To report this bug, see https://mariadb.com/kb/en/reporting-bugs

      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed,
      something is definitely wrong and this may fail.

      Server version: 10.1.14-MariaDB-1~xenial
      key_buffer_size=25165824
      read_buffer_size=131072
      max_used_connections=107
      max_threads=202
      thread_count=12
      It is possible that mysqld could use up to
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 468245 K bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.

      Thread pointer: 0x0x7f62f10ba008
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7f6530ce6838 thread_stack 0x48400
      /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55a203d5333e]
      /usr/sbin/mysqld(handle_fatal_signal+0x34d)[0x55a2038a73ad]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x113d0)[0x7f67bbc8f3d0]
      /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f67bb25f418]
      /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f67bb26101a]
      /usr/lib/galera/libgalera_smm.so(_ZN6galera3FSMINS_9TrxHandle5StateENS1_10TransitionENS_10EmptyGuardENS_11EmptyActionEE8shift_toES2_+0x1b8)[0x7f67b1983048]
      /usr/lib/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM13post_rollbackEPNS_9TrxHandleE+0x26)[0x7f67b1977856]
      /usr/lib/galera/libgalera_smm.so(galera_post_rollback+0x6b)[0x7f67b199881b]
      /usr/sbin/mysqld(+0x52bf30)[0x55a20383ff30]
      /usr/sbin/mysqld(+0x52c098)[0x55a203840098]
      /usr/sbin/mysqld(_Z17ha_rollback_transP3THDb+0xfa)[0x55a2038a9dba]
      /usr/sbin/mysqld(_Z15ha_commit_transP3THDb+0x5bc)[0x55a2038aa61c]
      /usr/sbin/mysqld(_Z12trans_commitP3THD+0x5b)[0x55a2037fce5b]
      /usr/sbin/mysqld(_ZN13Xid_log_event14do_apply_eventEP14rpl_group_info+0xae)[0x55a2039678fe]
      /usr/sbin/mysqld(_Z26apply_event_and_update_posP9Log_eventP3THDP14rpl_group_infoP19rpl_parallel_thread+0x1e1)[0x55a2036a3a31]
      /usr/sbin/mysqld(handle_slave_sql+0x2abb)[0x55a2036a6f3b]
      /usr/sbin/mysqld(+0x702b31)[0x55a203a16b31]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x76fa)[0x7f67bbc856fa]
      /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f67bb330b5d]

      Trying to get some variables.
      Some pointers may be invalid and cause the dump to abort.
      Query (0x0):
      Connection ID (thread ID): 381
      Status: NOT_KILLED

            People

              jplindst Jan Lindström (Inactive)
              sysguru Ryan Lavelle