[MDEV-25378] MariaDB 10.4.18 (galera) crash Created: 2021-04-09  Updated: 2021-05-03  Resolved: 2021-04-13

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.18
Fix Version/s: 10.4.19

Type: Bug Priority: Major
Reporter: Derk-Jan Hartman Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Debian GNU/Linux 8.11 (jessie) x86_64


Issue Links:
Relates
relates to MDEV-25579 MariaDB 10.4.18 crashes with InnoDB: ... Closed
relates to MDEV-24923 Port selected Galera conflict resolut... Closed

 Description   

Yesterday my entire cluster crashed and was eventually reinitialised from scratch. In the process we also updated MariaDB on some of the servers. This afternoon I had this crash on node 5, which then took out 4 other nodes, leaving only 1 (everything recovered luckily)

MariaDB 10.4.18
galera: 4.7(ree4f10fc)

The crash that started it seems to have been:
(some information obfuscated)
```
2021-04-09 14:37:31 1 [ERROR] InnoDB: Conflicting lock on table: ``.`table1` index: PRIMARY that has lock
RECORD LOCKS space id 267 page no 21525 n bits 120 index PRIMARY of table ``.`table1` trx id 196886519 lock mode S locks rec but not gap
Record lock, heap no 47 PHYSICAL RECORD: n_fields 8; compact format; info bits 0
0: len 8; hex 000000000057d9c6; asc W ;;
1: len 6; hex 000000000000; asc ;;
2: len 7; hex 80000000000000; asc ;;
3: len 4; hex 8000166d; asc m;;
4: SQL NULL;
5: len 11; hex 424d5f5748415453415050; asc astring;;
6: len 13; hex 2b393730353637353535383033; asc astring;;
7: len 1; hex 80; asc ;;

2021-04-09 14:37:31 1 [ERROR] InnoDB: WSREP state:
2021-04-09 14:37:31 1 [ERROR] WSREP: Thread BF trx_id: 196886521 thread: 1 seqno: 99055 client_state: exec client_mode: high priority transaction_mode: replaying applier: 1 toi: 0 local: 0 query: UPDATE table1 SET `profileId` = number, `isBot` = 0 WHERE `serviceId` = number AND `channelId` = 'astring' AND `channelUserId` = 'astring'?Jp`
2021-04-09 14:37:31 1 [ERROR] WSREP: Thread BF trx_id: 196886519 thread: 34 seqno: 99054 client_state: exec client_mode: high priority transaction_mode: executing applier: 1 toi: 0 local: 0 query: INSERT INTO table2 (`uuid`, `type`, `conversationId`, `timestamp`, `channelUserId`, `channelAccountId`) VALUES ('guid', 'string', 11731790, 1617971851864, 5546465, 10799)?Jp`
2021-04-09 14:37:31 0x7f53edbe1700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.4.18/storage/innobase/lock/lock0lock.cc line 706
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
210409 14:37:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.4.18-MariaDB-1:10.4.18+maria~bionic-log
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=129
max_threads=65537
thread_count=78
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 403037846 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f533005de18
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f53edbe0d28 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5605d336452e]
/usr/sbin/mysqld(handle_fatal_signal+0x515)[0x5605d2dda2d5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f59e4c648a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f59e3598f47]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f59e359a8b1]
/usr/sbin/mysqld(+0x5889fd)[0x5605d2ae09fd]
/usr/sbin/mysqld(+0xa71d49)[0x5605d2fc9d49]
/usr/sbin/mysqld(+0xa72ecd)[0x5605d2fcaecd]
/usr/sbin/mysqld(+0xa782d1)[0x5605d2fd02d1]
/usr/sbin/mysqld(+0xb10c37)[0x5605d3068c37]
/usr/sbin/mysqld(+0xb16d74)[0x5605d306ed74]
/usr/sbin/mysqld(+0xa353b2)[0x5605d2f8d3b2]
/usr/sbin/mysqld(ZN7handler10ha_rnd_posEPhS0+0x132)[0x5605d2ddf562]
/usr/sbin/mysqld(+0x7a47c3)[0x5605d2cfc7c3]
/usr/sbin/mysqld(_ZN14Rows_log_event8find_rowEP14rpl_group_info+0x3a0)[0x5605d2edc4e0]
/usr/sbin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEP14rpl_group_info+0xe0)[0x5605d2edcb90]
/usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x237)[0x5605d2ecfcb7]
/usr/sbin/mysqld(_Z18wsrep_apply_eventsP3THDP14Relay_log_infoPKvm+0x1d8)[0x5605d2d4d488]
/usr/sbin/mysqld(_ZN22Wsrep_replayer_service15apply_write_setERKN5wsrep7ws_metaERKNS0_12const_bufferERNS0_14mutable_bufferE+0xf4)[0x5605d2d344a4]
/usr/sbin/mysqld(+0xe83432)[0x5605d33db432]
/usr/sbin/mysqld(_ZN5wsrep12server_state8on_applyERNS_21high_priority_serviceERKNS_9ws_handleERKNS_7ws_metaERKNS_12const_bufferE+0x155)[0x5605d33dc245]
/usr/sbin/mysqld(+0xe98af4)[0x5605d33f0af4]
/usr/lib/galera/libgalera_smm.so(+0x30fb5)[0x7f59e157cfb5]
/usr/lib/galera/libgalera_smm.so(+0x4845c)[0x7f59e159445c]
/usr/lib/galera/libgalera_smm.so(+0x1fd75)[0x7f59e156bd75]
/usr/sbin/mysqld(_ZN5wsrep18wsrep_provider_v266replayERKNS_9ws_handleEPNS_21high_priority_serviceE+0x29)[0x5605d33f1459]
/usr/sbin/mysqld(_ZN20Wsrep_client_service6replayEv+0xf4)[0x5605d2d340d4]
/usr/sbin/mysqld(_ZN5wsrep11transaction6replayERSt11unique_lockINS_5mutexEE+0x96)[0x5605d33ea996]
/usr/sbin/mysqld(_ZN5wsrep11transaction15after_statementEv+0x1b4)[0x5605d33ec5a4]
/usr/sbin/mysqld(_ZN5wsrep12client_state15after_statementEv+0xb9)[0x5605d33cf229]
/usr/sbin/mysqld(+0x6757ad)[0x5605d2bcd7ad]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x22d2)[0x5605d2bd01e2]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x11c)[0x5605d2bd0aec]
/usr/sbin/mysqld(_Z11tp_callbackP13TP_connection+0xef)[0x5605d2dab70f]
/usr/sbin/mysqld(+0xa0b3e0)[0x5605d2f633e0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f59e4c596db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f59e367ba3f]
```

other nodes then tried to recover, did quorum checks and start state transfer routines, which caused 4 of them to crash with this stack trace
```
=================================================
2021-04-09 14:37:44 17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
210409 14:37:44 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.4.18-MariaDB-1:10.4.18+maria~bionic-log
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=91
max_threads=65537
thread_count=83
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 403037846 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f74ec000c08
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f75741c9e08 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5632ddda152e]
/usr/sbin/mysqld(handle_fatal_signal+0x515)[0x5632dd8172d5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f7b5ac268a0]
/usr/sbin/mysqld(+0x60843d)[0x5632dd59d43d]
/usr/sbin/mysqld(_Z11lock_tablesP3THDP10TABLE_LISTjj+0x4a0)[0x5632dd5a3460]
/usr/sbin/mysqld(_Z20open_and_lock_tablesP3THDRK14DDL_options_stP10TABLE_LISTbjP19Prelocking_strategy+0x8a)[0x5632dd5a4eea]
/usr/sbin/mysqld(_Z24open_n_lock_single_tableP3THDP10TABLE_LIST13thr_lock_typejP19Prelocking_strategy+0x41)[0x5632dd5a4f91]
/usr/sbin/mysqld(+0x7f78a7)[0x5632dd78c8a7]
/usr/sbin/mysqld(+0x7f8324)[0x5632dd78d324]
/usr/sbin/mysqld(_ZNK12Wsrep_schema12restore_viewEP3THDRKN5wsrep2idE+0x24b)[0x5632dd79039b]
/usr/sbin/mysqld(_ZN20Wsrep_server_service8log_viewEPN5wsrep21high_priority_serviceERKNS0_4viewE+0x700)[0x5632dd999dc0]
/usr/sbin/mysqld(_ZN5wsrep12server_state7on_viewERKNS_4viewEPNS_21high_priority_serviceE+0x3c5)[0x5632dde1eef5]
/usr/sbin/mysqld(+0xe9bc1c)[0x5632dde30c1c]
/usr/lib/galera/libgalera_smm.so(+0x3ee89)[0x7f7b5754ce89]
/usr/lib/galera/libgalera_smm.so(+0x3fc98)[0x7f7b5754dc98]
/usr/lib/galera/libgalera_smm.so(+0x4b188)[0x7f7b57559188]
/usr/lib/galera/libgalera_smm.so(+0x4b3be)[0x7f7b575593be]
/usr/lib/galera/libgalera_smm.so(+0x7c848)[0x7f7b5758a848]
/usr/lib/galera/libgalera_smm.so(+0x7caa2)[0x7f7b5758aaa2]
/usr/lib/galera/libgalera_smm.so(+0x49890)[0x7f7b57557890]
/usr/lib/galera/libgalera_smm.so(+0x1c2ab)[0x7f7b5752a2ab]
/usr/sbin/mysqld(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0xe)[0x5632dde2e1de]
/usr/sbin/mysqld(+0x7f758f)[0x5632dd78c58f]
/usr/sbin/mysqld(_Z15start_wsrep_THDPv+0x413)[0x5632dd77bf33]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f7b5ac1b6db]
2021-04-09 14:37:46 0 [Note] WSREP: (1cc929aa-b399, 'tcp://0.0.0.0:4567') reconnecting to dd6e91d4-a06b (tcp://10.30.126.13:4567), attempt 0
2021-04-09 14:37:46 0 [Note] WSREP: (1cc929aa-b399, 'tcp://0.0.0.0:4567') reconnecting to cc37987d-b641 (tcp://10.35.126.15:4567), attempt 0
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f7b5963da3f]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): (null)
Connection ID (thread ID): 17
Status: NOT_KILLED
```

This all looks very not normal... I'm not sure how to analyse this.
The only thing that I could find that looks a little bit similar is MDEV-23851

Any help with figuring out what this stack trace means would be appreciated



 Comments   
Comment by Marko Mäkelä [ 2021-04-09 ]

jplindst changed that code in MDEV-24923. The function wsrep_assert_no_bf_bf_wait() that reports the fatal error would only be available in debug builds.

I think that he should analyze this failure and comment on whether it could be a real error.

Comment by Derk-Jan Hartman [ 2021-04-13 ]

For various reasons we have since updated the cluster and we are now on 10.5.9. We also switched from wsrep_sst_method rsync to mariadbbackup.

I think however that we are finding a similar problem however, see also MDEV-25405. So far the problem just hasn't been as catastrophic as on our older setup, where it caused complete cluster lockup on the nodes.

Comment by Jan Lindström (Inactive) [ 2021-04-13 ]

trx_id: 196886519 has seqno: 99054 and is holding the lock. while trx_id: 196886521 seqno: 99055 is requesting conflicting record lock. Because trx_id: 196886521 has larger seqno it should wait for it turn i.e. do lock wait. This has been now fixed in MDEV-24923.

Comment by Derk-Jan Hartman [ 2021-04-13 ]

As far as I can tell 10.4.19 and 10.5.10 have not yet been packaged for ubuntu.. But as soon as they are available we will update I guess, because the HA part isn't too HA right now.

Comment by Tom [ 2021-05-01 ]

How long should I expect to wait until a Debian package is available?

I had a cluster of 3 nodes running 10.4.15. One upgraded to 10.4.18 on Thursday and on Saturday morning I got a crash like the one Derk-Jan Hartman reported. I worry that we'll get more such crashes on that node if I don't revert it to 10.4.15.

Had another crash this afternoon. Will look at reverting now.

Generated at Thu Feb 08 09:37:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.