[MDEV-18248] Multiple crashes on Galera 5-nodes after simple ALTER-statement
Created: 2019-01-15  Updated: 2019-03-14  Resolved: 2019-03-14

Status: Closed
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Galera, Storage Engine - InnoDB
Affects Version/s: 10.2.14
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Martijn Keijzer
Assignee: Unassigned
Resolution: Incomplete
Votes: 1
Labels: crash, galera, innodb, need_feedback
Environment: RedHat 6.10



 Description   

While updating a simple table with a simple ALTER statement, the 5-node multi-master Galera cluster stops working and the table gets corrupted. This has now happened several times, with different tables and simple ALTER statements (without triggers).

The error log shows this:

2019-01-15 10:47:19 140487941920512 [Note] WSREP: Member 1.0 (elavsu6211.company.local) synced with group.
2019-01-15 11:07:45 140487941920512 [Note] WSREP: Member 1.0 (elavsu6211.company.local) desyncs itself from group
2019-01-15 11:07:46 140487941920512 [Note] WSREP: Member 1.0 (elavsu6211.company.local) resyncs itself to group
2019-01-15 11:07:46 140487941920512 [Note] WSREP: Member 1.0 (elavsu6211.company.local) synced with group.
2019-01-15 11:27:40 140487941920512 [Note] WSREP: Member 1.0 (elavsu6211.company.local) desyncs itself from group
2019-01-15 11:27:41 140487941920512 [Note] WSREP: Member 1.0 (elavsu6211.company.local) resyncs itself to group
2019-01-15 11:27:41 140487941920512 [Note] WSREP: Member 1.0 (elavsu6211.company.local) synced with group.
2019-01-15 11:47:23 140487941920512 [Note] WSREP: Member 1.0 (elavsu6211.company.local) desyncs itself from group
2019-01-15 11:47:24 140487941920512 [Note] WSREP: Member 1.0 (elavsu6211.company.local) resyncs itself to group
2019-01-15 11:47:24 140487941920512 [Note] WSREP: Member 1.0 (elavsu6211.company.local) synced with group.
2019-01-15 12:24:39 140452405958400 [Note] WSREP: MDL BF-BF conflict

schema: databasename
request: (8227134 seqno 46874664 wsrep (2, 1, 0) cmd 3 3 ALTER TABLE `subject_bpztools_aagenda` ADD `id_subject_cat` int(11) NULL DEFAULT '0' AFTER `id_subject`, ADD INDEX `id_s$
granted: (15 seqno 46874665 wsrep (1, 0, 0) cmd 0 147 (null))
2019-01-15 12:24:40 140452405958400 [Note] WSREP: MDL BF-BF conflict
schema: databasename
request: (8227134 seqno 46874664 wsrep (2, 1, 0) cmd 3 3 ALTER TABLE `subject_bpztools_aagenda` ADD `id_subject_cat` int(11) NULL DEFAULT '0' AFTER `id_subject`, ADD INDEX `id_s$
granted: (15 seqno 46874665 wsrep (1, 0, 0) cmd 0 147 (null))
2019-01-15 12:24:40 140452405958400 [Note] WSREP: MDL BF-BF conflict
schema: databasename
request: (8227134 seqno 46874664 wsrep (2, 1, 0) cmd 3 3 ALTER TABLE `subject_bpztools_aagenda` ADD `id_subject_cat` int(11) NULL DEFAULT '0' AFTER `id_subject`, ADD INDEX `id_s$
granted: (11 seqno 46874666 wsrep (1, 0, 0) cmd 0 147 (null))
2019-01-15 12:24:40 0x7fbd9fc3d700 InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/row/row0merge.cc l$

InnoDB: Failing assertion: table->get_ref_count() == 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/xtradbinnodb-recovery-modes/
InnoDB: about forcing recovery.

190115 12:24:40 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.2.14-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=837
max_threads=1502
thread_count=280

It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 3431472 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fbe2d906c18
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went terribly wrong...
stack_bottom = 0x7fbd9fc3cd80 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x55f4e00d8fab]
/usr/sbin/mysqld(handle_fatal_signal+0x535)[0x55f4dfbad005]
/lib64/libpthread.so.0(+0xf7e0)[0x7fc5f97f67e0]
/lib64/libc.so.6(gsignal+0x35)[0x7fc5f7e50495]
/lib64/libc.so.6(abort+0x175)[0x7fc5f7e51c75]
/usr/sbin/mysqld(+0x47c4eb)[0x55f4df97a4eb]
/usr/sbin/mysqld(+0x90edcc)[0x55f4dfe0cdcc]
/usr/sbin/mysqld(+0x873236)[0x55f4dfd71236]
/usr/sbin/mysqld(_Z17mysql_alter_tableP3THDPcS1_P14HA_CREATE_INFOP10TABLE_LISTP10Alter_infojP8st_orderb+0x29ed)[0x55f4dfab181d]
/usr/sbin/mysqld(_ZN19Sql_cmd_alter_table7executeEP3THD+0x3ae)[0x55f4dfaf62fe]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0xf81)[0x55f4dfa2b251]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x29a)[0x55f4dfa327ca]
/usr/sbin/mysqld(+0x5348c0)[0x55f4dfa328c0]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x18cd)[0x55f4dfa346fd]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x16e)[0x55f4dfa350ee]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x16f)[0x55f4dfaf335f]
/usr/sbin/mysqld(handle_one_connection+0x44)[0x55f4dfaf3484]
/lib64/libpthread.so.0(+0x7aa1)[0x7fc5f97eeaa1]
/lib64/libc.so.6(clone+0x6d)[0x7fc5f7f06bdd]

Trying to get some variables.

Some pointers may be invalid and cause the dump to abort.

Query (0x7fbe2d9141f0): is an invalid pointer

Connection ID (thread ID): 8227134
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_push$

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
We think the query pointer is invalid, but we will try to print it anyway.

Query: ALTER TABLE `subject_bpztools_aagenda` ADD `id_subject_cat` int(11) NULL DEFAULT '0' AFTER `id_subject`, ADD INDEX `id_subject_cat` (`id_subject_cat`)
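
To check which statements were involved in each "MDL BF-BF conflict" record, the error log can be scanned mechanically. The following is a minimal sketch in Python, not an official MariaDB or Galera tool. The names `parse_bf_conflicts`, `Conflict` and `SAMPLE` are hypothetical, and the field layout is inferred only from the log excerpt above: the first number inside the parentheses is assumed to be the connection (thread) ID, because 8227134 matches the "Connection ID (thread ID)" line of the crash report. The sample lines are copied from the log, including the `$` truncation marker.

```python
import re
from collections import namedtuple

# Hypothetical record type: one entry per "MDL BF-BF conflict" block.
Conflict = namedtuple(
    "Conflict",
    "timestamp request_thread request_seqno granted_thread granted_seqno request_query",
)

# Header line that opens a conflict block in the error log.
HEADER = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \d+ \[Note\] WSREP: MDL BF-BF conflict"
)
# "request:" / "granted:" lines. Assumption: the first number is the thread ID.
ENTRY = re.compile(
    r"^(request|granted): \((\d+)\s+seqno\s+(\d+)\s+wsrep \([^)]*\) cmd \d+ \d+\s+(.*)$"
)

# Two records copied from the error log above (the query is truncated at "$").
SAMPLE = """\
2019-01-15 12:24:39 140452405958400 [Note] WSREP: MDL BF-BF conflict

schema: databasename
request: (8227134 seqno 46874664 wsrep (2, 1, 0) cmd 3 3 ALTER TABLE `subject_bpztools_aagenda` ADD `id_subject_cat` int(11) NULL DEFAULT '0' AFTER `id_subject`, ADD INDEX `id_s$
granted: (15 seqno 46874665 wsrep (1, 0, 0) cmd 0 147 (null))
2019-01-15 12:24:40 140452405958400 [Note] WSREP: MDL BF-BF conflict
schema: databasename
request: (8227134 seqno 46874664 wsrep (2, 1, 0) cmd 3 3 ALTER TABLE `subject_bpztools_aagenda` ADD `id_subject_cat` int(11) NULL DEFAULT '0' AFTER `id_subject`, ADD INDEX `id_s$
granted: (11 seqno 46874666 wsrep (1, 0, 0) cmd 0 147 (null))
"""


def parse_bf_conflicts(lines):
    """Pair each request/granted entry that follows a conflict header."""
    conflicts = []
    timestamp = None
    request = None
    for raw in lines:
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            # A new conflict block starts; forget any half-read block.
            timestamp = header.group(1)
            request = None
            continue
        entry = ENTRY.match(line)
        if not entry or timestamp is None:
            continue  # blank lines, "schema:" lines, unrelated log output
        kind, thread, seqno = entry.group(1), int(entry.group(2)), int(entry.group(3))
        if kind == "request":
            request = (thread, seqno, entry.group(4))
        elif request is not None:
            conflicts.append(
                Conflict(timestamp, request[0], request[1], thread, seqno, request[2])
            )
            timestamp = None
            request = None
    return conflicts


if __name__ == "__main__":
    for c in parse_bf_conflicts(SAMPLE.splitlines()):
        print(c.timestamp, "request", c.request_thread, c.request_seqno,
              "vs granted", c.granted_thread, c.granted_seqno)
```

Run against a full error log (for example `parse_bf_conflicts(open(path))`, with `path` pointing at the log file), this lists every request/granted pair, which makes it easier to see whether the same ALTER TABLE keeps conflicting with different lock holders.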



 Comments   
Comment by Elena Stepanova [ 2019-01-15 ]

We have an open issue (general, not Galera-specific) with this assertion failure: MDEV-15776
I suppose you might be hitting the same problem.

martijnk, is it possible that your instances could be running two parallel ALTER TABLE at the same time? (Asking because that's what our test case in MDEV-15776 does to reproduce the problem)

Comment by Martijn Keijzer [ 2019-01-15 ]

We only have a very critical production environment and an extrabackup, so it is not possible for me to run other tests. What matches MDEV-15776 is that the error logs of all the nodes report every second that the table is corrupted. In a previous crash it was not possible to DROP the table because it did not exist, but when I tried to CREATE the table, it already existed... What can I do?

An ALTER works for me now when I stop mysql on 4 of the 5 nodes, make the changes on the first node, and then reconnect the 4 nodes after the ALTERs... In a previous (different) setup of my database system (Linux RedHat 6, 3-node cluster) no problems occurred, and I ran several ALTERs a week on that cluster.

What differs from MDEV-15776 is that the error mentioned in my log is a WSREP MDL BF-BF conflict.

Comment by Elena Stepanova [ 2019-02-14 ]

Sorry for not being clear. I wasn't suggesting that you run the tests; I was asking whether it's possible that your instance was running two parallel ALTER TABLE statements around the time of the failure. That would indicate a possible relation to the above-mentioned bug MDEV-15776.

As for not being able to re-create or drop the table, it's an unfortunate known result of DDL not being completely crash-safe. When your instance happens to crash at a certain stage of performing DDL, it can indeed leave the table in a half-dead state. Maybe marko can suggest how to get rid of the remains.

Generated at Thu Feb 08 08:42:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.