[MDEV-25992] Galera 3 Server crash with signal 6 after RBR event apply failed - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.3.29
Fix Version/s: 10.3.32
Component/s: Data Manipulation - Delete, Galera, Platform Debian, Storage Engine - InnoDB
Labels:
- crash
- galera
- innodb
Environment:

3 Node MariaDB / Galera cluster with ProxySQL v2.0.15-20-g32bb92cd proxy in front.

10.3.29-MariaDB-1:10.3.29+maria~bionic-log
  wsrep_patch: wsrep_25.24
  wsrep_provider: 25.3.33(r15123524)

OS: Ubuntu 20.04 LTS
Kernel: 4.15.0-144-generic

Description

We have a 3-node cluster in our UAT environment, with all traffic going to node 3. An incoming delete conflict causes nodes 1 and 2 to crash, which in turn causes node 3 to go non-primary (as expected).

The crash is always on the non-write nodes (either one of them or both crash) that are applying the deletes concurrently.

The delete is always on the same table, named "blobs", which has a self-referencing foreign key (table definition below).

Of note, the logged SQL for the conflict appears to have some garbled data on the end of it (that I can't quite capture in this form):

"SQL: DELETE FROM blobs WHERE id = '7432858'???`^S^F"

The table has been rebuilt with an ALTER TABLE ... ENGINE=InnoDB, yet the issue still occurs.

The crashes started 5 days after we upgraded from:

10.3.24-MariaDB-1:10.3.24+maria~bionic-log
patch: wsrep_25.24
prov: 25.3.29(r3902)

On a related note, though I must emphasise this is an entirely different cluster: our Production cluster, which has yet to be upgraded (version as above), is logging "[ERROR] InnoDB: Record field 15 len 18446744073709551615", which I've traced back to the same collection of tables in the same schema. I've been unable to identify any corruption in the tables themselves (by selecting out data and forcing index usage). The UAT cluster to which this report relates hasn't logged these messages, but I have a sneaking suspicion there is some relationship, even if a loose one, between the issues.
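One observation on the length in that message: 18446744073709551615 is exactly 2^64 - 1, which is the bit pattern of -1 when stored in an unsigned 64-bit integer. The short check below only confirms the arithmetic. Reading the value as a sentinel or an underflowed length rather than a real field size is an assumption, not something confirmed from the InnoDB source.

```python
# The length reported in the InnoDB error message, copied from the log line.
reported_len = 18446744073709551615

# The largest value an unsigned 64-bit integer can hold.
UINT64_MAX = 2**64 - 1

print(reported_len == UINT64_MAX)  # True

# Reinterpret the same 64 bits as a signed (two's complement) integer.
as_signed = reported_len - 2**64 if reported_len >= 2**63 else reported_len
print(as_signed)  # -1
```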

Detail of the crash in the UAT env

*************************** 1. row ***************************
       Table: blobs
Create Table: CREATE TABLE `blobs` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `original_blob_id` int(11) DEFAULT NULL,
  `sys_name` varchar(100) DEFAULT NULL,
  `storage_loc` varchar(50) DEFAULT NULL,
  `storage_loc_pref` varchar(50) DEFAULT NULL,
  `storage_loc_specific` varchar(50) DEFAULT NULL,
  `save_path` varchar(255) DEFAULT NULL,
  `file_url` varchar(255) DEFAULT NULL,
  `filename` varchar(255) NOT NULL,
  `filesize` int(11) NOT NULL,
  `content_type` varchar(50) NOT NULL,
  `authcode` varchar(50) NOT NULL,
  `blob_hash` varchar(40) NOT NULL,
  `is_media_upload` tinyint(1) NOT NULL,
  `title` varchar(255) NOT NULL,
  `dim_w` int(11) NOT NULL,
  `dim_h` int(11) NOT NULL,
  `date_created` datetime NOT NULL,
  `is_temp` tinyint(1) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_896C3E356BBE2052` (`original_blob_id`),
  KEY `authcode_idx` (`authcode`),
  KEY `storage_loc_idx` (`storage_loc`,`storage_loc_pref`),
  KEY `sys_name_idx` (`sys_name`),
  KEY `date_created_idx` (`date_created`,`is_temp`),
  KEY `storage_loc_pref_idx` (`storage_loc_pref`),
  CONSTRAINT `FK_896C3E356BBE2052` FOREIGN KEY (`original_blob_id`) REFERENCES `blobs` (`id`) ON DELETE SET NULL
) ENGINE=InnoDB AUTO_INCREMENT=7700386 DEFAULT CHARSET=utf
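To illustrate what the self-referencing ON DELETE SET NULL constraint implies, here is a minimal sketch using Python's built-in sqlite3 module. It is not InnoDB or Galera and does not reproduce the crash. The table is trimmed to three columns, and the ids and filenames are made up for illustration. It only shows that a single-row DELETE on such a table also modifies every row that references the deleted one, so a delete can touch more rows than its WHERE clause names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

# Trimmed-down stand-in for the `blobs` table (hypothetical subset of columns).
conn.execute("""
    CREATE TABLE blobs (
        id INTEGER PRIMARY KEY,
        original_blob_id INTEGER DEFAULT NULL
            REFERENCES blobs (id) ON DELETE SET NULL,
        filename TEXT NOT NULL
    )
""")

# Row 1 is an original; rows 2 and 3 are derived from it (illustrative data).
conn.executemany(
    "INSERT INTO blobs (id, original_blob_id, filename) VALUES (?, ?, ?)",
    [(1, None, "original.png"), (2, 1, "thumb.png"), (3, 1, "preview.png")],
)

# Deleting the original also updates the two rows that reference it.
conn.execute("DELETE FROM blobs WHERE id = 1")
rows = conn.execute(
    "SELECT id, original_blob_id FROM blobs ORDER BY id"
).fetchall()
print(rows)  # [(2, None), (3, None)]
```

Whether this implicit update of referencing rows is what makes two applier threads collide here is an assumption; the logs in this report show only that the conflicting transactions are both deletes on this table.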

node 1 crash:

2021-06-22 12:35:02 11 [Note] WSREP: cluster conflict due to high priority abort for threads:
2021-06-22 12:35:02 11 [Note] WSREP: Winning thread:
   THD: 11, mode: applier, state: executing, conflict: no conflict, seqno: 89327026
   SQL: DELETE FROM blobs WHERE id = '7432858'???`^S^F
2021-06-22 12:35:02 11 [Note] WSREP: Victim thread:
   THD: 15, mode: applier, state: idle, conflict: no conflict, seqno: 89327028
   SQL: NULL
2021-06-22 12:35:02 0 [ERROR] WSREP: Trx 89327026 tries to abort slave trx 89327028. This could be caused by:
        1) unsupported configuration options combination, please check documentation.
        2) a bug in the code.
        3) a database corruption.
 Node consistency compromized, need to abort. Restart the node to resync with cluster.
210622 12:35:02 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.3.29-MariaDB-1:10.3.29+maria~bionic-log
key_buffer_size=134217728
read_buffer_size=134217728
max_used_connections=26
max_threads=1002
thread_count=24
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 262821945 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x12c00000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x560ee4349b8e]
/usr/sbin/mysqld(handle_fatal_signal+0x515)[0x560ee3dde8c5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7f9ca6c54980]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f9ca688ffb7]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f9ca6891921]
/usr/sbin/mysqld(+0x90a914)[0x560ee3f68914]
/usr/sbin/mysqld(+0x90f83c)[0x560ee3f6d83c]
/usr/sbin/mysqld(handle_manager+0x1f3)[0x560ee3bf0a83]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f9ca6c496db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f9ca697271f]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /srv/galera-uat/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             15430                15430                processes
Max open files            4096                 4096                 files
Max locked memory         67108864             67108864             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       15430                15430                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/share/apport/apport %p %s %c %d %P %E

node 2 crash:

2021-06-22 12:35:02 15 [Note] WSREP: cluster conflict due to high priority abort for threads:
2021-06-22 12:35:02 15 [Note] WSREP: Winning thread:
   THD: 15, mode: applier, state: executing, conflict: no conflict, seqno: 89327026
   SQL: DELETE FROM blobs WHERE id = '7432858'???`^S^F
2021-06-22 12:35:02 15 [Note] WSREP: Victim thread:
   THD: 9, mode: applier, state: executing, conflict: no conflict, seqno: 89327028
   SQL: DELETE FROM blobs WHERE id = '7432867'???`^S^F
2021-06-22 12:35:02 0 [ERROR] WSREP: Trx 89327026 tries to abort slave trx 89327028. This could be caused by:
        1) unsupported configuration options combination, please check documentation.
        2) a bug in the code.
        3) a database corruption.
 Node consistency compromized, need to abort. Restart the node to resync with cluster.
210622 12:35:02 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.3.29-MariaDB-1:10.3.29+maria~bionic-log
key_buffer_size=134217728
read_buffer_size=134217728
max_used_connections=8
max_threads=1002
thread_count=23
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 262821945 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
2021-06-22 12:35:02 9 [Warning] WSREP: conflict state after RBR event applying: 1, 89327028
2021-06-22 12:35:02 9 [Warning] WSREP: RBR event apply failed, rolling back: 89327028
stack_bottom = 0x0 thread_stack 0x12c00000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5591922ceb8e]
/usr/sbin/mysqld(handle_fatal_signal+0x515)[0x559191d638c5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7ffb20e45980]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7ffb20a80fb7]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7ffb20a82921]
/usr/sbin/mysqld(+0x90a914)[0x559191eed914]
/usr/sbin/mysqld(+0x90f83c)[0x559191ef283c]
/usr/sbin/mysqld(handle_manager+0x1f3)[0x559191b75a83]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7ffb20e3a6db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7ffb20b6371f]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /srv/galera-uat/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             15430                15430                processes
Max open files            4096                 4096                 files
Max locked memory         67108864             67108864             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       15430                15430                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/share/apport/apport %p %s %c %d %P %E
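For anyone comparing conflict reports across nodes, the "Winning thread" / "Victim thread" detail lines in the logs above follow a fixed comma-separated layout. The sketch below pulls out the thread id, mode, state and seqno so the pairs can be compared programmatically. The function name `parse_conflicts` and the pattern `THD_LINE` are hypothetical helpers written for this illustration, and the sample text is copied from the node 1 log. The pattern is inferred from these two logs only and may not cover every variant the server can emit.

```python
import re

# Matches the detail line that follows "Winning thread:" / "Victim thread:".
# Layout inferred from the logs in this report (an assumption, not a spec).
THD_LINE = re.compile(
    r"THD:\s*(?P<thd>\d+),\s*mode:\s*(?P<mode>[^,]+),\s*state:\s*(?P<state>[^,]+),"
    r"\s*conflict:\s*(?P<conflict>[^,]+),\s*seqno:\s*(?P<seqno>-?\d+)"
)


def parse_conflicts(log_text):
    """Return a list of {'winning': {...}, 'victim': {...}} dicts."""
    conflicts, role, current = [], None, {}
    for line in log_text.splitlines():
        if "WSREP: Winning thread:" in line:
            role, current = "winning", {}
        elif "WSREP: Victim thread:" in line:
            role = "victim"
        elif role:
            match = THD_LINE.search(line)
            if match:
                info = match.groupdict()
                info["thd"] = int(info["thd"])
                info["seqno"] = int(info["seqno"])
                current[role] = info
                if role == "victim":
                    conflicts.append(current)
                role = None  # wait for the next Winning/Victim header
    return conflicts


sample = """\
2021-06-22 12:35:02 11 [Note] WSREP: Winning thread:
   THD: 11, mode: applier, state: executing, conflict: no conflict, seqno: 89327026
2021-06-22 12:35:02 11 [Note] WSREP: Victim thread:
   THD: 15, mode: applier, state: idle, conflict: no conflict, seqno: 89327028
"""

result = parse_conflicts(sample)
for c in result:
    both_appliers = all(c[r]["mode"] == "applier" for r in ("winning", "victim"))
    print(c["winning"]["seqno"], c["victim"]["seqno"], both_appliers)
```

In both logs here the winning and victim threads are appliers, which matches the fatal "tries to abort slave trx" error that follows each conflict report.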

Attachments

- gdb.114.125551.1624967167.gz (2021-06-29 16:30, 40 kB, Glyn Astill)
- mariadb1_uat.txt (2021-06-22 16:38, 4 kB, Glyn Astill)

Issue Links

is caused by: MDEV-25114 Crash: WSREP: invalid state ROLLED_BACK (FATAL) (Closed)
