[MDEV-10391] During async GTID replication Galera crashes after error writing to binlog

Details

Type: Bug
Status: Closed
Priority: Trivial
Resolution: Incomplete
Affects Version/s: 10.1.14
Fix Version/s: N/A
Component/s: Galera, Replication
Labels:
- galera
- replication
Environment:
Ubuntu 16.04 Amazon m4.xlarge

Description

We were running three separate Galera clusters of two nodes each, with P2P async master-master replication between the clusters. The second Galera node in each cluster served as the master for each slave, in case of failure. In this setup we ran into the following error, after which replication stopped.

2016-07-18 17:16:33 140072587262720 [ERROR] Master 'va_2': mysqld: Error writing file 'binlog' (errno: 1950 "Unknown error 1950")
2016-07-18 17:16:33 140072587262720 [ERROR] Master 'va_2': WSREP: FSM: no such a transition COMMITTING -> ROLLED_BACK
160718 17:16:33 [ERROR] mysqld got signal 6 ;

After the reboot, the node had lost its slave settings because of the SST. Once the slave settings were recreated, the GTID position was intact, but replication would not start with MASTER_USE_GTID=current_pos and failed with this error message:

Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 1-104-68680, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions'.

The slave died while replicating during a load test that was being run from another node, so I don't think it could have diverged. I also tried increasing GTID_SLAVE_POS incrementally by 10 transactions, but received the same error message each time and couldn't get replication to resume. Additionally, all clusters were running on the same gtid_domain_id with a unique server_id per cluster; every node within a cluster shared the same server_id to avoid duplicated replication. Many tests ran fine until we hit the error writing the binlog and the corresponding WSREP error.
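
For readers retracing the "increase the position by 10 transactions" attempts, here is a minimal Python sketch of how the candidate positions could be computed. It relies only on the MariaDB GTID layout of domain ID, server ID, and sequence number (as in 1-104-68680 from the error above). The names Gtid, parse_gtid, and candidate_positions are hypothetical helpers made up for this illustration, not part of any MariaDB tool, and the sketch only builds the position strings; it does not talk to a server.

```python
from typing import List, NamedTuple


class Gtid(NamedTuple):
    """A MariaDB GTID: domain ID, server ID, sequence number."""
    domain_id: int
    server_id: int
    seq_no: int

    def __str__(self) -> str:
        # Render in the same domain-server-seq form the server prints.
        return f"{self.domain_id}-{self.server_id}-{self.seq_no}"


def parse_gtid(text: str) -> Gtid:
    """Parse a single 'domain-server-seq' string (hypothetical helper)."""
    parts = text.strip().split("-")
    if len(parts) != 3:
        raise ValueError(f"expected domain-server-seq, got {text!r}")
    domain_id, server_id, seq_no = (int(p) for p in parts)
    return Gtid(domain_id, server_id, seq_no)


def candidate_positions(start: Gtid, step: int = 10, count: int = 3) -> List[Gtid]:
    """Return positions start+step, start+2*step, ... within the same
    domain and server ID (the step of 10 mirrors the report)."""
    return [
        start._replace(seq_no=start.seq_no + step * i)
        for i in range(1, count + 1)
    ]


if __name__ == "__main__":
    start = parse_gtid("1-104-68680")
    for pos in candidate_positions(start):
        print(pos)
```

Each printed value is a position that could be tried as gtid_slave_pos. As the report notes, none of the positions tried this way let replication resume.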

Full log:

2016-07-18 17:16:33 140072587262720 [ERROR] Master 'va_2': mysqld: Error writing file 'binlog' (errno: 1950 "Unknown error 1950")
2016-07-18 17:16:33 140072587262720 [ERROR] Master 'va_2': WSREP: FSM: no such a transition COMMITTING -> ROLLED_BACK
160718 17:16:33 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.14-MariaDB-1~xenial
key_buffer_size=25165824
read_buffer_size=131072
max_used_connections=107
max_threads=202
thread_count=12
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 468245 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x7f62f10ba008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f6530ce6838 thread_stack 0x48400
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55a203d5333e]
/usr/sbin/mysqld(handle_fatal_signal+0x34d)[0x55a2038a73ad]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x113d0)[0x7f67bbc8f3d0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f67bb25f418]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f67bb26101a]
/usr/lib/galera/libgalera_smm.so(_ZN6galera3FSMINS_9TrxHandle5StateENS1_10TransitionENS_10EmptyGuardENS_11EmptyActionEE8shift_toES2_+0x1b8)[0x7f67b1983048]
/usr/lib/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM13post_rollbackEPNS_9TrxHandleE+0x26)[0x7f67b1977856]
/usr/lib/galera/libgalera_smm.so(galera_post_rollback+0x6b)[0x7f67b199881b]
/usr/sbin/mysqld(+0x52bf30)[0x55a20383ff30]
/usr/sbin/mysqld(+0x52c098)[0x55a203840098]
/usr/sbin/mysqld(_Z17ha_rollback_transP3THDb+0xfa)[0x55a2038a9dba]
/usr/sbin/mysqld(_Z15ha_commit_transP3THDb+0x5bc)[0x55a2038aa61c]
/usr/sbin/mysqld(_Z12trans_commitP3THD+0x5b)[0x55a2037fce5b]
/usr/sbin/mysqld(_ZN13Xid_log_event14do_apply_eventEP14rpl_group_info+0xae)[0x55a2039678fe]
/usr/sbin/mysqld(_Z26apply_event_and_update_posP9Log_eventP3THDP14rpl_group_infoP19rpl_parallel_thread+0x1e1)[0x55a2036a3a31]
/usr/sbin/mysqld(handle_slave_sql+0x2abb)[0x55a2036a6f3b]
/usr/sbin/mysqld(+0x702b31)[0x55a203a16b31]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76fa)[0x7f67bbc856fa]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f67bb330b5d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0):
Connection ID (thread ID): 381
Status: NOT_KILLED

Issue Links

relates to: MDEV-10259 mysqld crash with certain statement length and order with Galera and encrypt-tmp-files=1 (Closed)