[MDEV-13747] MariaDB 10.1.21 server sudden crash Created: 2017-09-06  Updated: 2017-12-13  Resolved: 2017-12-13

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1.21
Fix Version/s: 10.1.30

Type: Bug Priority: Major
Reporter: Artur Čuvašov Assignee: Andrii Nikitin (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS 7.3.1611


Issue Links:
Relates
relates to MDEV-9510 Segmentation fault in binlog thread c... Closed
relates to MDEV-10501 MariaDB crash Closed

 Description   

2017-09-06 14:50:21 140465624627968 [Note] WSREP: cluster conflict due to high priority abort for threads:
2017-09-06 14:50:21 140465624627968 [Note] WSREP: Winning thread: 
   THD: 25160306, mode: total order, state: executing, conflict: no conflict, seqno: 1598878419
   SQL: ALTER TABLE `catalog_product_index_price` ENABLE KEYS
2017-09-06 14:50:21 140465624627968 [Note] WSREP: Victim thread: 
   THD: 25160307, mode: local, state: executing, conflict: no conflict, seqno: -1
   SQL: SELECT GET_LOCK('magentodev19.index_process_18', '5')
2017-09-06 14:50:21 140465624627968 [Note] WSREP: MDL conflict db=magentodev19 table=catalog_product_index_price ticket=4 solved by abort
170906 14:50:21 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.1.21-MariaDB
key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=469
max_threads=1026
thread_count=409
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2286396 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7faa805c5008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7faa76b20080 thread_stack 0x48400
(my_addr_resolve failure: fork)
/usr/sbin/mysqld(my_print_stacktrace+0x2e) [0x7fc8e12309ce]
/usr/sbin/mysqld(handle_fatal_signal+0x305) [0x7fc8e0d56355]
/lib64/libpthread.so.0(+0xf370) [0x7fc8e0372370]
/usr/sbin/mysqld(MDL_lock::Ticket_list::remove_ticket(MDL_ticket*)+0x11) [0x7fc8e0cac3c1]
/usr/sbin/mysqld(MDL_lock::remove_ticket(LF_PINS*, MDL_lock::Ticket_list MDL_lock::*, MDL_ticket*)+0x8c) [0x7fc8e0cacedc]
/usr/sbin/mysqld(MDL_context::release_lock(enum_mdl_duration, MDL_ticket*)+0x24) [0x7fc8e0cadf34]
/usr/sbin/mysqld(MDL_context::release_locks_stored_before(enum_mdl_duration, MDL_ticket*)+0x35) [0x7fc8e0cadfb5]
/usr/sbin/mysqld(mysql_execute_command(THD*)+0xaed) [0x7fc8e0bd10fd]
/usr/sbin/mysqld(mysql_parse(THD*, char*, unsigned int, Parser_state*)+0x332) [0x7fc8e0bd9a02]
/usr/sbin/mysqld(+0x439229) [0x7fc8e0bda229]
/usr/sbin/mysqld(dispatch_command(enum_server_command, THD*, char*, unsigned int)+0x1f5a) [0x7fc8e0bdc83a]
/usr/sbin/mysqld(do_command(THD*)+0x14a) [0x7fc8e0bdd6aa]
/usr/sbin/mysqld(do_handle_one_connection(THD*)+0x18a) [0x7fc8e0ca485a]
/usr/sbin/mysqld(handle_one_connection+0x40) [0x7fc8e0ca4a00]
/usr/sbin/mysqld(+0x96d05d) [0x7fc8e110e05d]
/lib64/libpthread.so.0(+0x7dc5) [0x7fc8e036adc5]
/lib64/libc.so.6(clone+0x6d) [0x7fc8de78973d]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7faa0b3bc020): INSERT INTO MCHECKPOINT (CP_MCHAIN_ID,CP_SHOP_ID,CP_SYSTEM,CP_ACTION,CP_SUBACTION,CP_LASTCHECKED,CP_LASTSUCCESS,CP_LASTALERTED,CP_LASTSWEEPED,ETS,UTS) VALUES (53,'Chain','DT','Export','CustArtSystemLst','20170906145021',' ',' ',' ', '20170906145021648', '20170906145021648')
Connection ID (thread ID): 25160372
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off

After this WSREP Recovery process starts.



 Comments   
Comment by Artur Čuvašov [ 2017-09-09 ]

Another nde has crashed today with a similar message:

```
2017-09-09 19:02:13 140325335070464 [Warning] WSREP: send monitor wait timed out, waited for PT10.1646S
2017-09-09 19:02:13 140325335070464 [Warning] WSREP: Failed to report last committed 1613666648, -4 (Interrupted system call)
2017-09-09 19:03:04 140325335070464 [Warning] WSREP: Failed to report last committed 1613666773, -4 (Interrupted system call)
2017-09-09 19:03:10 140325335070464 [Warning] WSREP: Failed to report last committed 1613666799, -4 (Interrupted system call)
2017-09-09 19:03:19 140325335070464 [Warning] WSREP: Failed to report last committed 1613666811, -4 (Interrupted system call)
2017-09-09 19:05:25 140325335070464 [Warning] WSREP: Failed to report last committed 1613666909, -4 (Interrupted system call)
2017-09-09 19:05:30 140325335070464 [Warning] WSREP: Failed to report last committed 1613666917, -4 (Interrupted system call)
2017-09-09 19:07:18 140325335070464 [Warning] WSREP: Failed to report last committed 1613666973, -4 (Interrupted system call)
2017-09-09 19:07:25 140325335070464 [Warning] WSREP: Failed to report last committed 1613666982, -4 (Interrupted system call)
2017-09-09 19:08:37 140325335070464 [Warning] WSREP: Failed to report last committed 1613667135, -4 (Interrupted system call)
2017-09-09 19:08:41 140325335070464 [Warning] WSREP: Failed to report last committed 1613667167, -4 (Interrupted system call)
2017-09-09 19:09:57 140325335070464 [Warning] WSREP: Failed to report last committed 1613667240, -4 (Interrupted system call)
2017-09-09 19:10:00 140325335070464 [Warning] WSREP: Failed to report last committed 1613667281, -4 (Interrupted system call)
2017-09-09 19:11:03 140325335070464 [Warning] WSREP: Failed to report last committed 1613667374, -4 (Interrupted system call)
2017-09-09 19:11:06 140325335070464 [Warning] WSREP: Failed to report last committed 1613667457, -4 (Interrupted system call)
2017-09-09 19:11:21 140325335070464 [Warning] WSREP: Failed to report last committed 1613667484, -4 (Interrupted system call)
2017-09-09 19:11:42 140325335070464 [Warning] WSREP: Failed to report last committed 1613667685, -4 (Interrupted system call)
2017-09-09 19:12:38 140325335070464 [Warning] WSREP: Failed to report last committed 1613668016, -4 (Interrupted system call)
170909 19:12:40 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.21-MariaDB
key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=27
max_threads=1026
thread_count=13
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2286396 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f8a1fc12008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fa835f090a0 thread_stack 0x48400
(my_addr_resolve failure: fork)
/usr/sbin/mysqld(my_print_stacktrace+0x2e) [0x7fa836ab29ce]
/usr/sbin/mysqld(handle_fatal_signal+0x305) [0x7fa8365d8355]
/lib64/libpthread.so.0(+0xf370) [0x7fa835bf4370]
/usr/sbin/mysqld(MYSQL_BIN_LOG::mark_xid_done(unsigned long, bool)+0xe8) [0x7fa836699d68]
/usr/sbin/mysqld(binlog_background_thread+0x2c5) [0x7fa83669a405]
/usr/sbin/mysqld(+0x96d05d) [0x7fa83699005d]
/lib64/libpthread.so.0(+0x7dc5) [0x7fa835becdc5]
/lib64/libc.so.6(clone+0x6d) [0x7fa83400b73d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0):
Connection ID (thread ID): 3
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
```

Comment by Andrii Nikitin (Inactive) [ 2017-09-11 ]

Hi mtxd ,
I am sorry that you still observe issue identical to MDEV-10501 - we are still trying to produce reliable test case to reproduce the problem.
Did you consider disabling binary logging to work around the crash?

To collect related info, (which may help in troubleshooting) could you provide:
1. content of configuration files
2. execute commands below in one of the nodes and send us generated file /tmp/sql.txt :
tee /tmp/sql.txt;
show global variables;
show global status;
show processlist;
show engine innodb status\G
do sleep(10);
show global status;
show processlist;
show engine innodb status\G
notee;
3. upload as much as possible content of error log from every node for each occurrence of the crash

Comment by Andrii Nikitin (Inactive) [ 2017-09-11 ]

Disabling binary logging is known workaround to MDEV-9510 and current crashes really look like duplicate of that problem

Comment by Artur Čuvašov [ 2017-09-11 ]

Does it mean it's just enough to add skip-log-bin and comment log-bin, expire-logs-days, sync-binlog ? What about binlog_format ?

Since this is a know workaround for more than a year, is anything clear about the reason and a solution? Some people might need binary logs together with galera clustering.

Comment by Andrii Nikitin (Inactive) [ 2017-09-11 ]

I can say only that crash is related to binary logging and it is confirmed that if binary logging doesn't happen - the crash will not occur.
It is very possible that commenting out only expire-logs-days (and probably somehow making sure purge of binary logs doesn't happen on 'unlucky' moments) will be enough.
I personally did spend several days on MDEV-9510 (and related) and didn't come up with solution. Now the ticket is assigned to another qualified engineer, so I hope he will come up with solution. I will still analyze new data in this ticket and see if new ideas come up.

Comment by Brendan P [ 2017-10-08 ]

Some further information,

On our cluster we observed that the node we do most of the writes to will eventually stop purging binlogs, expire-logs-days ceases to function.
No amount of purge binlog commands will purge anything either. When the server is in this state we find it is extremely easy to trigger the crash.

Running a reset master will delete all the binary logs as expected, but the server has crashed several times randomly after, this could be 10 or 30 minutes after the fact or never at all, however expire-logs-days does work again if the server hasn't crashed. We think that pt-online-schema-change could be a factor, it has crashed the server many times with the same random time period after running it during alters, and also seems to be a cause of the binary log purging failure after some time after a few concurrent successful alters.

Comment by Artur Čuvašov [ 2017-10-08 ]

Check this out if any of your apps use "ENABLE/DISABLE KEYS" construction:

https://binary-data.github.io/2017/04/05/magento-mysql-crash-deadlock-when-index-under-highload/
https://github.com/OpenMage/magento-lts/pull/188

Br,
Arthur

Comment by Daniel Black [ 2017-10-09 ]

Also worth noting GET_LOCK is a known limitation and unsupported in Galera https://mariadb.com/kb/en/library/mariadb-galera-cluster-known-limitations/

Comment by Andrei Elkin [ 2017-12-13 ]

Fixed by MDEV-9510.

Generated at Thu Feb 08 08:08:02 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.