[MDEV-25835] mariadb 10.3.29 galera cluster crashes with errors like: "[ERROR] WSREP: Trx 236236 tries to abort slave trx 236238." - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.2.36, 10.3.29
Fix Version/s: 10.2.41, 10.3.32, 10.4.22, 10.5.13, 10.6.5
Component/s: Galera
Labels:
None
Environment:

Hide
KVM machine with ubuntu bionic cloud image. Dockerised deployment of mariadb 10.3.29 from mariadb repository based on ubuntu bionic docker image:
$ uname -a
Linux os-ctrl-1 4.15.0-109-generic #110-Ubuntu SMP Tue Jun 23 02:39:32 UTC 2020 x8
$ docker exec -it mariadb mysql --version
mysql Ver 15.1 Distrib 10.3.29-MariaDB, for debian-linux-gnu (x86_64) using readline 5.26_64 x86_64 x86_64 GNU/Linux

Show
KVM machine with ubuntu bionic cloud image. Dockerised deployment of mariadb 10.3.29 from mariadb repository based on ubuntu bionic docker image: $ uname -a Linux os-ctrl-1 4.15.0-109-generic #110-Ubuntu SMP Tue Jun 23 02:39:32 UTC 2020 x8 $ docker exec -it mariadb mysql --version mysql Ver 15.1 Distrib 10.3.29-MariaDB, for debian-linux-gnu (x86_64) using readline 5.26_64 x86_64 x86_64 GNU/Linux

Description

Starting with mariadb 10.3.29 when deploying a three node galera cluster, we are seeing crashes with errors like in [1].
Please find attached the log files from all three nodes and their configurations. A core dump of the crash in [1] can also be provided.

[1]

2021-06-01  9:06:16 0 [ERROR] WSREP: Trx 236236 tries to abort slave trx 236238. This could be caused by:

	1) unsupported configuration options combination, please check documentation.

	2) a bug in the code.

	3) a database corruption.

 Node consistency compromized, need to abort. Restart the node to resync with cluster.

210601  9:06:16 [ERROR] mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 10.3.29-MariaDB-1:10.3.29+maria~bionic-log

key_buffer_size=67108864

read_buffer_size=131072

max_used_connections=1

max_threads=10002

thread_count=12

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 22055535 K  bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

stack_bottom = 0x0 thread_stack 0x49000

/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5647d689bb8e]

/usr/sbin/mysqld(handle_fatal_signal+0x515)[0x5647d63308c5]

/lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fd6316ed980]

/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fd631328fb7]

/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fd63132a921]

/usr/sbin/mysqld(+0x90a914)[0x5647d64ba914]

/usr/sbin/mysqld(+0x90f83c)[0x5647d64bf83c]

/usr/sbin/mysqld(handle_manager+0x1f3)[0x5647d6142a83]

/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fd6316e26db]

/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fd63140b71f]

The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains

information that should help you find out what is causing the crash.

Writing a core file...

Working directory at /var/lib/mysql

Resource Limits:

Limit                     Soft Limit           Hard Limit           Units

Max cpu time              unlimited            unlimited            seconds

Max file size             unlimited            unlimited            bytes

Max data size             unlimited            unlimited            bytes

Max stack size            8388608              unlimited            bytes

Max core file size        unlimited            unlimited            bytes

Max resident set          unlimited            unlimited            bytes

Max processes             unlimited            unlimited            processes

Max open files            1048576              1048576              files

Max locked memory         16777216             16777216             bytes

Max address space         unlimited            unlimited            bytes

Max file locks            unlimited            unlimited            locks

Max pending signals       64055                64055                signals

Max msgqueue size         819200               819200               bytes

Max nice priority         0                    0

Max realtime priority     0                    0

Max realtime timeout      unlimited            unlimited            us

Core pattern: core

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

os-ctrl-0-galera.cnf
1 kB
2021-06-01 10:09
os-ctrl-0-mariadb.log
275 kB
2021-06-01 10:13
os-ctrl-1-galera.cnf
1 kB
2021-06-01 10:09
os-ctrl-1-mariadb.log
373 kB
2021-06-01 10:13
os-ctrl-2-galera.cnf
1 kB
2021-06-01 10:09
os-ctrl-2-mariadb.log
143 kB
2021-06-01 10:13

Issue Links

is duplicated by

MDEV-26697 MariaDB keeps rebooting after upgrade from 10.3.23 to 10.3.31

Closed

relates to

MDEV-24978 SIGABRT in __libc_message

Closed

Activity

Ascending order - Click to sort in descending order

Jan Horstmann created issue - 2021-06-01 10:22

Julien Fritsch made changes - 2021-06-03 09:22

Field	Original Value	New Value
Link		This issue relates to ~~MDEV-25114~~ [ ~~MDEV-25114~~ ]

Walter Doekes added a comment - 2021-06-03 11:26

Okay, so we notice this too:

2021-06-02 20:58:51 0 [Note] WSREP: Signalling provider to continue.

2021-06-02 20:58:51 0 [Note] WSREP: SST received: 13af80aa-836a-11eb-8c23-32f57cb601d4:54087553

2021-06-02 20:58:51 2 [Note] WSREP: Receiving IST: 1683 writesets, seqnos 54087553-54089236

2021-06-02 20:58:51 0 [Note] WSREP: Receiving IST...  0.0% (   0/1683 events) complete.

2021-06-02 20:58:51 0 [Note] Reading of all Master_info entries succeeded

2021-06-02 20:58:51 0 [Note] Added new Master_info '' to hash table

2021-06-02 20:58:51 0 [Note] /usr/sbin/mysqld: ready for connections.

Version: '10.3.29-MariaDB-1:10.3.29+maria~bionic-log'  socket: '/run/mysqld/MY_DEPLOYMENT-mysqld.sock'  port: 3306  mariadb.org binary distribution

2021-06-02 20:58:51 0 [ERROR] WSREP: Trx 54088471 tries to abort slave trx 54088527. This could be caused by:

        1) unsupported configuration options combination, please check documentation.

        2) a bug in the code.

        3) a database corruption.

 Node consistency compromized, need to abort. Restart the node to resync with cluster.

It went into a crashloop, with the same: Trx 54088471 tries to abort slave trx 54088527.

First core dump looks like:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51

#1  0x00007f30f06fa9fb in __GI_abort () at abort.c:100

#2  0x000055d85b55d914 in wsrep_abort_slave_trx (bf_seqno=<optimized out>, victim_seqno=<optimized out>) at ./storage/innobase/handler/ha_innodb.cc:18893

#3  0x000055d85b56283c in bg_wsrep_kill_trx (void_arg=0x7f1088023830) at ./storage/innobase/handler/ha_innodb.cc:19058

#4  0x000055d85b1e5a83 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112

#5  0x000055d85b8ea67a in pfs_spawn_thread (arg=0x55d8c61e5228) at ./storage/perfschema/pfs.cc:1869

#6  0x00007f30f0ab26db in start_thread (arg=0x7f1153399700) at pthread_create.c:463

        case QUERY_IDLE:

                WSREP_DEBUG("kill IDLE for " TRX_ID_FMT, victim_trx->id);

                if (wsrep_thd_exec_mode(thd) == REPL_RECV) {

                        WSREP_DEBUG("kill BF IDLE, seqno: %lld",

                                    (long long)wsrep_thd_trx_seqno(thd));

                        wsrep_abort_slave_trx(arg->bf_seqno,     // <-- CRASH

                                              wsrep_thd_trx_seqno(thd));

                        goto ret_unlock;

                /* This will lock thd from proceeding after net_read() */

                wsrep_thd_set_conflict_state(thd, ABORTING);

(gdb) print *arg

$2 = {thd_id = 67, trx_id = 7493842400, bf_seqno = 54088471, signal = 1}

(gdb) print *(unsigned*)((void*)thd + 0x58e0)

$4 = 54088527

(gdb) print thd->wsrep_trx_meta

$22 = {gtid = {uuid = {data = "\023\257\200\252\203j\021\353\214#2\365|\266", <incomplete sequence \324>}, seqno = 54088527}, depends_on = 54088417}

(gdb) print thd->proc_info

$23 = 0x7f0fb0006570 "Applied write set 54088527"

(gdb) info locals

rcode = <optimized out>

arg = 0x7f1088023830

thd = 0x7f0fb0000c08

victim_trx = 0x7f242978d1f8

awake = false

(gdb) up

#4  0x000055d85b1e5a83 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112

(gdb) info locals

next = 0x0

cb = 0x7f1088023868

error = 0

abstime = {tv_sec = 94387747591648, tv_nsec = 139848169358893}

reset_flush_time = true

Went back to 10.3.25 because of ~~MDEV-25190~~ (among others).

Let me know if I can get you more debug info from the dump.

Walter Doekes added a comment - 2021-06-03 11:26 Okay, so we notice this too: 2021-06-02 20:58:51 0 [Note] WSREP: Signalling provider to continue. 2021-06-02 20:58:51 0 [Note] WSREP: SST received: 13af80aa-836a-11eb-8c23-32f57cb601d4:54087553 2021-06-02 20:58:51 2 [Note] WSREP: Receiving IST: 1683 writesets, seqnos 54087553-54089236 2021-06-02 20:58:51 0 [Note] WSREP: Receiving IST... 0.0% ( 0/1683 events) complete. 2021-06-02 20:58:51 0 [Note] Reading of all Master_info entries succeeded 2021-06-02 20:58:51 0 [Note] Added new Master_info '' to hash table 2021-06-02 20:58:51 0 [Note] /usr/sbin/mysqld: ready for connections. Version: '10.3.29-MariaDB-1:10.3.29+maria~bionic-log' socket: '/run/mysqld/MY_DEPLOYMENT-mysqld.sock' port: 3306 mariadb.org binary distribution 2021-06-02 20:58:51 0 [ERROR] WSREP: Trx 54088471 tries to abort slave trx 54088527. This could be caused by: 1) unsupported configuration options combination, please check documentation. 2) a bug in the code. 3) a database corruption. Node consistency compromized, need to abort. Restart the node to resync with cluster. It went into a crashloop, with the same: Trx 54088471 tries to abort slave trx 54088527. First core dump looks like: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x00007f30f06fa9fb in __GI_abort () at abort.c:100 #2 0x000055d85b55d914 in wsrep_abort_slave_trx (bf_seqno=<optimized out>, victim_seqno=<optimized out>) at ./storage/innobase/handler/ha_innodb.cc:18893 #3 0x000055d85b56283c in bg_wsrep_kill_trx (void_arg=0x7f1088023830) at ./storage/innobase/handler/ha_innodb.cc:19058 #4 0x000055d85b1e5a83 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112 #5 0x000055d85b8ea67a in pfs_spawn_thread (arg=0x55d8c61e5228) at ./storage/perfschema/pfs.cc:1869 #6 0x00007f30f0ab26db in start_thread (arg=0x7f1153399700) at pthread_create.c:463 case QUERY_IDLE: { WSREP_DEBUG("kill IDLE for " TRX_ID_FMT, victim_trx->id); if (wsrep_thd_exec_mode(thd) == REPL_RECV) { WSREP_DEBUG("kill BF IDLE, seqno: %lld", (long long)wsrep_thd_trx_seqno(thd)); wsrep_abort_slave_trx(arg->bf_seqno, // <-- CRASH wsrep_thd_trx_seqno(thd)); goto ret_unlock; } /* This will lock thd from proceeding after net_read() */ wsrep_thd_set_conflict_state(thd, ABORTING); (gdb) print *arg $2 = {thd_id = 67, trx_id = 7493842400, bf_seqno = 54088471, signal = 1} (gdb) print *(unsigned*)((void*)thd + 0x58e0) $4 = 54088527 (gdb) print thd->wsrep_trx_meta $22 = {gtid = {uuid = {data = "\023\257\200\252\203j\021\353\214#2\365|\266", <incomplete sequence \324>}, seqno = 54088527}, depends_on = 54088417} (gdb) print thd->proc_info $23 = 0x7f0fb0006570 "Applied write set 54088527" (gdb) info locals rcode = <optimized out> arg = 0x7f1088023830 thd = 0x7f0fb0000c08 victim_trx = 0x7f242978d1f8 awake = false (gdb) up #4 0x000055d85b1e5a83 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112 (gdb) info locals next = 0x0 cb = 0x7f1088023868 error = 0 abstime = {tv_sec = 94387747591648, tv_nsec = 139848169358893} reset_flush_time = true Went back to 10.3.25 because of MDEV-25190 (among others). Let me know if I can get you more debug info from the dump.

Jan Horstmann added a comment - 2021-06-04 12:26

@wdoekes would you consider it to be save to go back from 10.3.29 to 10.3.28 or lower? Did you do a full SST per node and which method do you use?

Jan Horstmann added a comment - 2021-06-04 12:26 @wdoekes would you consider it to be save to go back from 10.3.29 to 10.3.28 or lower? Did you do a full SST per node and which method do you use?

Walter Doekes added a comment - 2021-06-04 13:55

If you're not affected by ~~MDEV-25190~~, then I guess 10.3.28 is fine. We've had a bad streak after 10.3.25, so we're reverting back to that.

I'm not aware what ops did when reverting. SSTs generally start automatically here after certain failures, more often than appears necessary. So if there was an SST, I'm not sure if it was intentional either.

The crash loop started on the second node, apparently. ("Rebooted, started maria on the 2nd node. Crashloop. Downgrading made it happy again.")

Walter Doekes added a comment - 2021-06-04 13:55 If you're not affected by MDEV-25190 , then I guess 10.3.28 is fine. We've had a bad streak after 10.3.25, so we're reverting back to that. I'm not aware what ops did when reverting. SSTs generally start automatically here after certain failures, more often than appears necessary. So if there was an SST, I'm not sure if it was intentional either. The crash loop started on the second node, apparently. ("Rebooted, started maria on the 2nd node. Crashloop. Downgrading made it happy again.")

Kyle Joiner (Inactive) made changes - 2021-06-16 15:25

Affects Version/s

10.2.36 [ 25026 ]

Julien Fritsch made changes - 2021-06-18 06:18

Fix Version/s		10.2 [ 14601 ]
Fix Version/s		10.3 [ 22126 ]

Julien Fritsch made changes - 2021-06-18 06:18

Assignee

Jan Lindström [ jplindst ]

Patrick Schlirf added a comment - 2021-06-25 09:24

Hey guys,

are there any updates?

We downgraded after the first three times this happened and freeze the patch cycle of a big number of database servers to prevent to hit this bug again.

Kind regards,
Patrick

Patrick Schlirf added a comment - 2021-06-25 09:24 Hey guys, are there any updates? We downgraded after the first three times this happened and freeze the patch cycle of a big number of database servers to prevent to hit this bug again. Kind regards, Patrick

Julien Fritsch made changes - 2021-07-08 09:11

Priority

Major [ 3 ]

Critical [ 2 ]

Jan Lindström (Inactive) added a comment - 2021-07-26 06:56 - edited

This issue will be fixed when (if)~~MDEV-25114~~ is reviewed and applied.

Jan Lindström (Inactive) added a comment - 2021-07-26 06:56 - edited This issue will be fixed when (if) MDEV-25114 is reviewed and applied.

Julien Fritsch made changes - 2021-08-12 12:08

Link

This issue is blocked by ~~MDEV-25114~~ [ ~~MDEV-25114~~ ]

Julien Fritsch made changes - 2021-08-12 12:08

Link

This issue relates to ~~MDEV-25114~~ [ ~~MDEV-25114~~ ]

Jan Lindström (Inactive) added a comment - 2021-09-20 09:17

I need more information. Can you run with --wsrep-debug=1 and provide error logs ? Naturally, steps how to reproduce would be most useful.

Jan Lindström (Inactive) added a comment - 2021-09-20 09:17 I need more information. Can you run with --wsrep-debug=1 and provide error logs ? Naturally, steps how to reproduce would be most useful.

Jan Lindström (Inactive) made changes - 2021-09-20 09:17

Labels

need_feedback

Julien Fritsch made changes - 2021-09-28 16:02

Assignee

Jan Lindström [ jplindst ]

Pramod Mahto [ pramod.mahto@mariadb.com ]

Pramod Mahto made changes - 2021-10-05 13:57

Assignee

Pramod Mahto [ pramod.mahto@mariadb.com ]

Muhammad Irfan [ muhammad.irfan ]

Richard Stracke made changes - 2021-10-06 13:47

Assignee

Muhammad Irfan [ muhammad.irfan ]

Jan Lindström [ jplindst ]

Richard Stracke made changes - 2021-10-06 13:48

Labels

need_feedback

Jan Lindström (Inactive) made changes - 2021-10-07 06:27

Assignee

Jan Lindström [ jplindst ]

Seppo Jaakola [ seppo ]

Seppo Jaakola made changes - 2021-10-20 14:57

Status

Open [ 1 ]

In Progress [ 3 ]

Seppo Jaakola added a comment - 2021-10-20 15:08

This issue seems to cause applier-applier conflicts, configuring for only one applier (wsrep_slave_threads=1) should be immediate help for this.
I suspect this is a duplicate of ~~MDEV-24978~~, as the table definition has foreign key constraint over multi-byte character set

Seppo Jaakola added a comment - 2021-10-20 15:08 This issue seems to cause applier-applier conflicts, configuring for only one applier (wsrep_slave_threads=1) should be immediate help for this. I suspect this is a duplicate of MDEV-24978 , as the table definition has foreign key constraint over multi-byte character set

Seppo Jaakola made changes - 2021-10-20 15:09

Link

This issue relates to ~~MDEV-24978~~ [ ~~MDEV-24978~~ ]

Seppo Jaakola added a comment - 2021-10-20 15:09

probably a duplicate

Seppo Jaakola added a comment - 2021-10-20 15:09 probably a duplicate

Seppo Jaakola made changes - 2021-10-27 08:10

Status

In Progress [ 3 ]

Stalled [ 10000 ]

Jan Lindström (Inactive) made changes - 2021-10-27 08:17

Link

This issue is blocked by ~~MDEV-25114~~ [ ~~MDEV-25114~~ ]

Jan Lindström (Inactive) made changes - 2021-10-27 08:17

Fix Version/s		10.2.41 [ 26032 ]
Fix Version/s		10.3.32 [ 26029 ]
Fix Version/s		10.4.22 [ 26031 ]
Fix Version/s		10.5.13 [ 26026 ]
Fix Version/s		10.6.5 [ 26034 ]
Fix Version/s	10.2 [ 14601 ]
Fix Version/s	10.3 [ 22126 ]
Resolution		Fixed [ 1 ]
Status	Stalled [ 10000 ]	Closed [ 6 ]

Daniel Black made changes - 2021-10-29 02:29

Link

This issue is duplicated by ~~MDEV-26697~~ [ ~~MDEV-26697~~ ]

Daniel Black added a comment - 2021-10-29 05:20

With 47ba5523046094db33e68c92a182491a629bbd56 reverting ~~MDEV-24978~~, is this still closed?

Daniel Black added a comment - 2021-10-29 05:20 With 47ba5523046094db33e68c92a182491a629bbd56 reverting MDEV-24978 , is this still closed?

Jan Lindström (Inactive) added a comment - 2021-10-29 08:34

Yes, ~~MDEV-24978~~ was fixed on d5a15f04f4ab0738b0a5f993b208bcfaad522fd4

Jan Lindström (Inactive) added a comment - 2021-10-29 08:34 Yes, MDEV-24978 was fixed on d5a15f04f4ab0738b0a5f993b208bcfaad522fd4

Sergei Golubchik made changes - 2021-12-06 21:53

Workflow

MariaDB v3 [ 122368 ]

MariaDB v4 [ 159354 ]

Hartmut Holzgraefe made changes - 2022-02-10 11:30

Link

This issue relates to MENT-1410 [ MENT-1410 ]

Jira Automation (IT) made changes - 2024-07-04 02:23

Zendesk Related Tickets

135202 147109 133910

People

Assignee:: Seppo Jaakola

Reporter:: Jan Horstmann

Votes:: 8 Vote for this issue

Watchers:: 18 Start watching this issue

Dates

Created:: 2021-06-01 10:22

Updated:: 2024-07-07 21:16

Resolved:: 2021-10-27 08:17

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration