[MDEV-20893] Signal 6 on joining a cluster

Created: 2019-10-24
Updated: 2020-05-18
Resolved: 2020-05-18

Status: Closed
Project: MariaDB Server
Component/s: Galera SST, Storage Engine - InnoDB
Affects Version/s: 10.3.18
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Timon van Rooijen
Assignee: Marko Mäkelä
Resolution: Incomplete
Votes: 0
Labels: need_feedback


 Description   

We run a 5-node cluster on which I want to do a rolling upgrade from 10.3.12 to 10.3.18. Steps:

1. set server server5 maintenance on MaxScale
2. systemctl stop MariaDB-server
3. yum upgrade MariaDB-server galera
4. systemctl start MariaDB-server
5. Server keeps crashing. See log file below:

2019-10-24 13:45:14 2 [Note] WSREP: Receiving IST: 19149 writesets, seqnos 69494459-69513608
2019-10-24 13:45:14 0 [Note] WSREP: Receiving IST...  0.0% (    0/19149 events) complete.
2019-10-24 13:45:14 0 [Note] Reading of all Master_info entries succeeded
2019-10-24 13:45:14 0 [Note] Added new Master_info '' to hash table
2019-10-24 13:45:14 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.3.18-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server
2019-10-24 13:45:14 0x7fd6781c5700  InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.3.18/storage/innobase/row/row0ins.cc line 266
InnoDB: Failing assertion: !cursor->index->is_committed()
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
191024 13:45:14 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.3.18-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=1002
thread_count=11
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2333962 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fd5280009a8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fd6781c4d50 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x560b9c24106e]
/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x560b9bcde11f]
sigaction.c:0(__restore_rt)[0x7fd67f45e5d0]
:0(__GI_raise)[0x7fd67d731207]
:0(__GI_abort)[0x7fd67d7328f8]
/usr/sbin/mysqld(+0x4d3856)[0x560b9ba22856]
/usr/sbin/mysqld(+0x9de499)[0x560b9bf2d499]
/usr/sbin/mysqld(+0x9de5e3)[0x560b9bf2d5e3]
/usr/sbin/mysqld(+0xa19f8b)[0x560b9bf68f8b]
/usr/sbin/mysqld(+0xa1f87a)[0x560b9bf6e87a]
/usr/sbin/mysqld(+0x9efc2b)[0x560b9bf3ec2b]
/usr/sbin/mysqld(+0x92cfa4)[0x560b9be7bfa4]
/usr/sbin/mysqld(_ZN7handler13ha_update_rowEPKhS1_+0x1c2)[0x560b9bce9ca2]
/usr/sbin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEP14rpl_group_info+0x291)[0x560b9bde2f61]
/usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x24c)[0x560b9bdd64ec]
/usr/sbin/mysqld(wsrep_apply_cb+0x4ac)[0x560b9bc59b1c]
src/trx_handle.cpp:312(galera::TrxHandle::apply(void*, wsrep_cb_status (*)(void*, void const*, unsigned long, unsigned int, wsrep_trx_meta const*), wsrep_trx_meta const&) const)[0x7fd679ed0818]
src/replicator_smm.cpp:92(apply_trx_ws(void*, wsrep_cb_status (*)(void*, void const*, unsigned long, unsigned int, wsrep_trx_meta const*), wsrep_cb_status (*)(void*, unsigned int, wsrep_trx_meta const*, bool*, bool), galera::TrxHandle const&, wsrep_trx_meta const&))[0x7fd679f0d993]
src/replicator_smm.cpp:450(galera::ReplicatorSMM::apply_trx(void*, galera::TrxHandle*))[0x7fd679f10aec]
src/gu_mutex.hpp:38(gu::Mutex::unlock() const)[0x7fd679f1e64e]
src/replicator_smm.cpp:368(galera::ReplicatorSMM::async_recv(void*))[0x7fd679f1458b]
src/wsrep_provider.cpp:271(galera_recv)[0x7fd679f222a8]
/usr/sbin/mysqld(+0x70b83c)[0x560b9bc5a83c]
/usr/sbin/mysqld(start_wsrep_THD+0x2da)[0x560b9bc4daea]
pthread_create.c:0(start_thread)[0x7fd67f456dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fd67d7f8ead]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7fd5d801737b): UPDATE `mym_hospitality1_customer_137`.`tbl_customer_data` SET `id`='1852582', `uuid`='cd2ae6fb-f662-11e9-90d7-1e2f4a16327f', `flow_id`='3517', `flows_executed`='[]', `date_created`='2019-10-24 15:33:02', `date_updated`=NULL, `date_started`=NULL, `date_last_run`='2019-10-24 15:33:02', `current_step`='start', `next_step`=NULL, `last_step`='', `num_runs`='0', `status`=0, `flow_status`=1, `data`='{\"PropertyCode\":\"LEU\",\"Type\":\"Reservation\",\"Reference\":\"LEU-FX95765\",\"Status\":\"Modified\",\"Language\":\"\",\"Updates\":[]}', `webhook_data`='{}', `file_size_start`='0', `file_size_end`='0', `system_data`='{}', `runaflow_stack`='{}', `error_log`='{}', `flow_error_id`='0', `search_reference`='LEU-FX95765', `environment`=0, `customer_id`='137' WHERE `mym_hospitality1_customer_137`.`tbl_customer_data`.`id`='1852582'
Connection ID (thread ID): 8
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on
 
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             31106                31106                processes 
Max open files            16364                16364                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       31106                31106                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core
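
For reference, the upgrade steps listed at the top of this description can be written down as a dry run that only prints the commands for one node. This is a minimal sketch, not the exact procedure that was executed: it assumes step 1 was done through the maxctrl client, and the function name rolling_upgrade_commands is hypothetical. The package and service names are the ones given in the steps.

```python
import shlex


def rolling_upgrade_commands(node, service="MariaDB-server",
                             packages=("MariaDB-server", "galera")):
    """Return the per-node commands from the steps above, in order.

    Assumption: maintenance mode is set with maxctrl; the report only
    says "set server server5 maintenance on Maxscale".
    """
    return [
        ["maxctrl", "set", "server", node, "maintenance"],  # step 1
        ["systemctl", "stop", service],                     # step 2
        ["yum", "upgrade", *packages],                      # step 3
        ["systemctl", "start", service],                    # step 4
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in rolling_upgrade_commands("server5"):
        print(shlex.join(argv))
```

Nothing is executed here; the crash in the log above occurs after step 4, while the node is applying IST writesets.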



 Comments   
Comment by Marko Mäkelä [ 2020-04-17 ]

timonr, sorry, I missed this report until now.

While InnoDB notices corruption, the cause of the corruption could be a bug in the Galera snapshot transfer. It could be that an error was overlooked, or there is a bug in the IST that introduces this corruption.

Can you narrow down this failure, or analyze a core dump of the crashed process? In the stack frame of the assertion failure, you should be able to run the following in the debugger:

print cursor->index->name
print cursor->index->table->name

to find the affected table.
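
The two print commands above could be run non-interactively along these lines. This is only a sketch under assumptions: the debugger is gdb, the core file path and the frame number are placeholders that have to be taken from the actual system and from the "bt" output, and the helper name gdb_batch_command is hypothetical. Debug symbols for the server would likely be needed for the frames to resolve. Note also that the resource limits in the log show a soft "Max core file size" of 0, so a core file may not have been written for this particular crash.

```python
import shlex


def gdb_batch_command(core_path, frame, binary="/usr/sbin/mysqld"):
    """Build a gdb command line that prints the index and table names.

    core_path and frame are placeholders: frame must be the number of
    the stack frame in which `cursor` is in scope, as shown by `bt`.
    """
    gdb_commands = [
        "bt",                                # list frames to locate the assertion
        f"frame {frame}",                    # select the frame of the failure
        "print cursor->index->name",         # affected index
        "print cursor->index->table->name",  # affected table
    ]
    argv = ["gdb", "--batch"]
    for command in gdb_commands:
        argv += ["-ex", command]
    argv += [binary, core_path]
    return argv


if __name__ == "__main__":
    # Example values only; print the command line rather than running it.
    print(shlex.join(gdb_batch_command("/var/lib/mysql/core", 7)))
```

In practice one would first run only "bt" to find the right frame number, then rerun with that frame selected.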

Without a SQL-based test case, I am afraid that I cannot do much. Any file copying operation must be treated as a possible cause of the corruption. If you can provide a script that repeats this on an empty database, we can start looking at this.
