[MDEV-23101] SIGSEGV in lock_rec_unlock() when Galera is enabled - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.4.13, 10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Fix Version/s: 10.2.35, 10.3.26, 10.4.16, 10.5.7, 10.6.0
Component/s: Galera, Server
Labels:
None

Description

Server is crashing few times a day with the following stacktrace:

/lib64/libc.so.6(+0x115b62)[0x7fcadad58b62]

/lib64/libc.so.6(+0x117947)[0x7fcadad5a947]

/usr/sbin/mysqld(my_addr_resolve+0xda)[0x55f2dc88214a]

/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x55f2dc86a862]

/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x55f2dc300f7f]

/lib64/libpthread.so.0(+0xf5d0)[0x7fcadc9a65d0]

/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]

/usr/sbin/mysqld(+0xa9d47e)[0x55f2dc4ee47e]

/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]

/usr/sbin/mysqld(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x2110)[0x55f2dc1a5c60]

/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2ff5)[0x55f2dc0fcd15]

/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x1fb)[0x55f2dc1028ab]

/usr/sbin/mysqld(+0x5a0090)[0x55f2dbff1090]

/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x28f8)[0x55f2dc1059d8]

/usr/sbin/mysqld(_Z10do_commandP3THD+0x109)[0x55f2dc106049]

/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1fa)[0x55f2dc1e3bfa]

/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55f2dc1e3cdd]

/usr/sbin/mysqld(+0xdcb78d)[0x55f2dc81c78d]

/lib64/libpthread.so.0(+0x7dd5)[0x7fcadc99edd5]

/lib64/libc.so.6(clone+0x6d)[0x7fcadad40ead]

3 Top frames, resolved
/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908] lock_rec_has_to_wait_in_queue(ib_lock_t const*) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79).
/usr/sbin/mysqld(0xa9d47e)[0x55f2dc4ee47e] lock_rec_unlock(trx_t, buf_block_t const, unsigned char const*, lock_mode) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214).
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626] row_unlock_for_mysql(row_prebuilt_t*, unsigned long) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2093).

We tried to set up for core dump generation, but it appears it crashes trying to access memory and therefore no core dump is generated:

200703 13:10:16 [ERROR] mysqld got signal 11 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 10.4.13-MariaDB-log

key_buffer_size=134217728

read_buffer_size=2097152

max_used_connections=10223

max_threads=20002

thread_count=8406

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 123517134 K  bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7faf48086b98

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

stack_bottom = 0x7facc4190cc0 thread_stack 0x49000

*** buffer overflow detected ***: /usr/sbin/mysqld terminated

======= Backtrace: =========

/lib64/libc.so.6(__fortify_fail+0x37)[0x7fcadad5a9e7]

/lib64/libc.so.6(+0x115b62)[0x7fcadad58b62]

/lib64/libc.so.6(+0x117947)[0x7fcadad5a947]

/usr/sbin/mysqld(my_addr_resolve+0xda)[0x55f2dc88214a]

/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x55f2dc86a862]

/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x55f2dc300f7f]

/lib64/libpthread.so.0(+0xf5d0)[0x7fcadc9a65d0]

/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]

/usr/sbin/mysqld(+0xa9d47e)[0x55f2dc4ee47e]

/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]

/usr/sbin/mysqld(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x2110)[0x55f2dc1a5c60]

/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2ff5)[0x55f2dc0fcd15]

/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x1fb)[0x55f2dc1028ab]

/usr/sbin/mysqld(+0x5a0090)[0x55f2dbff1090]

/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x28f8)[0x55f2dc1059d8]

/usr/sbin/mysqld(_Z10do_commandP3THD+0x109)[0x55f2dc106049]

/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1fa)[0x55f2dc1e3bfa]

/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55f2dc1e3cdd]

/usr/sbin/mysqld(+0xdcb78d)[0x55f2dc81c78d]

/lib64/libpthread.so.0(+0x7dd5)[0x7fcadc99edd5]

/lib64/libc.so.6(clone+0x6d)[0x7fcadad40ead]

======= Memory map: ========

55f2dba51000-55f2dceca000 r-xp 00000000 fd:00 436210356                  /usr/sbin/mysqld

55f2dd0ca000-55f2dd1fd000 r--p 01479000 fd:00 436210356                  /usr/sbin/mysqld

55f2dd1fd000-55f2dd2b7000 rw-p 015ac000 fd:00 436210356                  /usr/sbin/mysqld

55f2dd2b7000-55f2ddb66000 rw-p 00000000 00:00 0

55f2de4cb000-55f348284000 rw-p 00000000 00:00 0                          [heap]

7fabcc000000-7fabcdb8a000 rw-p 00000000 00:00 0

[ ... ]

7fcadcdd6000-7fcadcdd7000 rw-p 00000000 00:00 0

7fff76cfc000-7fff76d1d000 rw-p 00000000 00:00 0                          [stack]

7fff76d84000-7fff76d86000 r-xp 00000000 00:00 0                          [vdso]

ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Fatal signal 6 while backtracing

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

thd.txt
71 kB
2020-08-12 08:24
trx.txt
8 kB
2020-08-12 08:24

Issue Links

causes

MDEV-23897 ASAN SEGV in lock_grant_and_move_on_page

Closed

relates to

MDEV-23186 mysqld doesn't create core dump if crashing while backtracing or dumping memory

Closed

MDEV-23409 Galera node with role master crashes in lock_rec_has_to_wait_in_queue()

Closed

MDEV-14846 InnoDB: assertion on trx->state because of deadlock error ignored

Closed

MDEV-20666 Assertion failures prebuilt->sql_stat_start ... in row_search / index_read upon concurrent INSERT and CREATE .. SELECT

Closed

MDEV-21910 KIlling thread on Galera could cause mutex deadlock

Closed

(1 relates to)

Activity

Ascending order - Click to sort in descending order

Rick Pizzi (Inactive) created issue - 2020-07-06 10:42

Elena Stepanova made changes - 2020-07-06 11:18

Field	Original Value	New Value
Labels		need_feedback

Rick Pizzi (Inactive) added a comment - 2020-07-06 12:11

Resolving the last 3 levels of stack trace:

/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]

lock_rec_has_to_wait_in_queue(ib_lock_t const*) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79).

/usr/sbin/mysqld(0xa9d47e)[0x55f2dc4ee47e]

lock_rec_unlock(trx_t*, buf_block_t const*, unsigned char const*, lock_mode) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214).

/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]

row_unlock_for_mysql(row_prebuilt_t*, unsigned long) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2093).

Rick Pizzi (Inactive) added a comment - 2020-07-06 12:11 Resolving the last 3 levels of stack trace: /usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908] lock_rec_has_to_wait_in_queue(ib_lock_t const*) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79). /usr/sbin/mysqld(0xa9d47e)[0x55f2dc4ee47e] lock_rec_unlock(trx_t*, buf_block_t const*, unsigned char const*, lock_mode) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214). /usr/sbin/mysqld(+0xb22626)[0x55f2dc573626] row_unlock_for_mysql(row_prebuilt_t*, unsigned long) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2093).

Elena Stepanova added a comment - 2020-07-06 12:24

The closest that we have among open issues seems to be ~~MDEV-14846~~ (and consequently ~~MDEV-20666~~ which ~~MDEV-14846~~ links itself to). Both of them are only known to cause problems on debug builds, but it doesn't mean that the underlying cause can't affect non-debug ones.

Elena Stepanova added a comment - 2020-07-06 12:24 The closest that we have among open issues seems to be MDEV-14846 (and consequently MDEV-20666 which MDEV-14846 links itself to). Both of them are only known to cause problems on debug builds, but it doesn't mean that the underlying cause can't affect non-debug ones.

Elena Stepanova made changes - 2020-07-06 12:24

Link

This issue relates to ~~MDEV-14846~~ [ ~~MDEV-14846~~ ]

Elena Stepanova made changes - 2020-07-06 12:24

Link

This issue relates to ~~MDEV-20666~~ [ ~~MDEV-20666~~ ]

Elena Stepanova made changes - 2020-07-06 12:24

Fix Version/s		10.4 [ 22408 ]
Assignee		Oleksandr Byelkin [ sanja ]
Labels	need_feedback

Oleksandr Byelkin added a comment - 2020-07-07 12:20

rpizzi if it crashes repeatadly where is test suite?

Oleksandr Byelkin added a comment - 2020-07-07 12:20 rpizzi if it crashes repeatadly where is test suite?

Rick Pizzi (Inactive) added a comment - 2020-07-07 12:25

This is a production system crashing, of a customer of RDBA. We were unable to reproduce the crash, especially because the cause is not known.

Rick Pizzi (Inactive) added a comment - 2020-07-07 12:25 This is a production system crashing, of a customer of RDBA. We were unable to reproduce the crash, especially because the cause is not known.

Oleksandr Byelkin added a comment - 2020-07-07 12:39

Ah, that stack trace in innodb is a part which was unresolved
so it looks like innodb crash

Oleksandr Byelkin added a comment - 2020-07-07 12:39 Ah, that stack trace in innodb is a part which was unresolved so it looks like innodb crash

Oleksandr Byelkin made changes - 2020-07-07 12:39

Comment

[ A comment with security level 'Developers' was removed. ]

Oleksandr Byelkin made changes - 2020-07-07 12:39

Comment

[ I see here at least 2 different traces...
1) in my_addr_resolve
2) lock_rec_has_to_wait_in_queue in innodb
is it correct? how thay are related? ]

Oleksandr Byelkin made changes - 2020-07-07 12:43

Assignee

Oleksandr Byelkin [ sanja ]

Marko Mäkelä [ marko ]

Oleksandr Byelkin added a comment - 2020-07-07 12:44

update processing order by very strange, what they do at startup?

Oleksandr Byelkin added a comment - 2020-07-07 12:44 update processing order by very strange, what they do at startup?

Marko Mäkelä added a comment - 2020-07-14 11:57

It seems that lock is an invalid pointer:

                for (lock = first_lock; lock != NULL;

                         lock = lock_rec_get_next(heap_no, lock)) {

                        if (lock_get_wait(lock)

                                && !lock_rec_has_to_wait_in_queue(lock)) {

The signal is sent in lock_rec_has_to_wait_in_queue(lock), which appears to be inlined in the stack trace:

	for (lock = lock_rec_get_first_on_page_addr(hash, space, page_no);

	     lock != wait_lock;

	     lock = lock_rec_get_next_on_page_const(lock)) {

		const byte*	p = (const byte*) &lock[1];

		if (heap_no < lock_rec_get_n_bits(lock)

We would fail to stop the iteration if we never find wait_lock. Invoking lock_rec_get_n_bits(lock=NULL) would certainly cause the SIGSEGV.

rpizzi, Does the crash involve Galera? Or innodb_lock_schedule_algorithm=VATS? The root cause of this failure should be an invalid call to unlock a row. It might also involve some complex join. Unlocking rows ("semi-consistent reads") was originally introduced in MySQL Bug #3300.

Marko Mäkelä added a comment - 2020-07-14 11:57 It seems that lock is an invalid pointer: for (lock = first_lock; lock != NULL; lock = lock_rec_get_next(heap_no, lock)) { if (lock_get_wait(lock) && !lock_rec_has_to_wait_in_queue(lock)) { The signal is sent in lock_rec_has_to_wait_in_queue(lock) , which appears to be inlined in the stack trace: for (lock = lock_rec_get_first_on_page_addr(hash, space, page_no); lock != wait_lock; lock = lock_rec_get_next_on_page_const(lock)) { const byte* p = ( const byte*) &lock[1]; if (heap_no < lock_rec_get_n_bits(lock) We would fail to stop the iteration if we never find wait_lock . Invoking lock_rec_get_n_bits(lock=NULL) would certainly cause the SIGSEGV. rpizzi , Does the crash involve Galera? Or innodb_lock_schedule_algorithm=VATS ? The root cause of this failure should be an invalid call to unlock a row. It might also involve some complex join. Unlocking rows ("semi-consistent reads") was originally introduced in MySQL Bug #3300.

Rick Pizzi (Inactive) added a comment - 2020-07-14 12:40

Yes this is a galera node, crash always happens on the node that has the master role marko

Rick Pizzi (Inactive) added a comment - 2020-07-14 12:40 Yes this is a galera node, crash always happens on the node that has the master role marko

Marko Mäkelä added a comment - 2020-07-14 12:56

As far as I can tell, lock_rec_unlock() is continuously holding lock_sys->mutex, which should protect the creation and destruction of explicit record locks. The most plausible explanation for the crash is the following code in lock_rec_has_to_wait_in_queue(lock), which causes us to skip wait_lock and reach the end of the list:

	for (lock = lock_rec_get_first_on_page_addr(hash, space, page_no);

	     lock != wait_lock;

	     lock = lock_rec_get_next_on_page_const(lock)) {

		const byte*	p = (const byte*) &lock[1];

		if (heap_no < lock_rec_get_n_bits(lock)

		    && (p[bit_offset] & bit_mask)

		    && lock_has_to_wait(wait_lock, lock)) {

#ifdef WITH_WSREP

			if (wsrep_thd_is_BF(wait_lock->trx->mysql_thd, FALSE) &&

			    wsrep_thd_is_BF(lock->trx->mysql_thd, TRUE)) {

				if (UNIV_UNLIKELY(wsrep_debug)) {

					mtr_t mtr;

					ib::info() << "WSREP: waiting BF trx: " << ib::hex(wait_lock->trx->id)

						   << " query: " << wsrep_thd_query(wait_lock->trx->mysql_thd);

					lock_rec_print(stderr, wait_lock, mtr);

					ib::info() << "WSREP: do not wait another BF trx: " << ib::hex(lock->trx->id)

						   << " query: " << wsrep_thd_query(lock->trx->mysql_thd);

					lock_rec_print(stderr, lock, mtr);

				/* don't wait for another BF lock */

				continue;

#endif /* WITH_WSREP */

			return(lock);

	return(NULL);

rpizzi, do you see those WSREP messages in the server error log right before the crash?

It might be that this bug would be addressed by fixing ~~MDEV-21910~~, or that the above WITH_WSREP code is trying to work around ~~MDEV-21910~~. In any case, I think that the code is misplaced, and the code should be moved to the relevant caller of lock_rec_has_to_wait_in_queue().

The problematic code is already present in 10.1 (which is the first major version where the WITH_WSREP changes are part of the main server).

Marko Mäkelä added a comment - 2020-07-14 12:56 As far as I can tell, lock_rec_unlock() is continuously holding lock_sys->mutex , which should protect the creation and destruction of explicit record locks. The most plausible explanation for the crash is the following code in lock_rec_has_to_wait_in_queue(lock) , which causes us to skip wait_lock and reach the end of the list: for (lock = lock_rec_get_first_on_page_addr(hash, space, page_no); lock != wait_lock; lock = lock_rec_get_next_on_page_const(lock)) { const byte* p = ( const byte*) &lock[1]; if (heap_no < lock_rec_get_n_bits(lock) && (p[bit_offset] & bit_mask) && lock_has_to_wait(wait_lock, lock)) { #ifdef WITH_WSREP if (wsrep_thd_is_BF(wait_lock->trx->mysql_thd, FALSE) && wsrep_thd_is_BF(lock->trx->mysql_thd, TRUE)) { if (UNIV_UNLIKELY(wsrep_debug)) { mtr_t mtr; ib::info() << "WSREP: waiting BF trx: " << ib::hex(wait_lock->trx->id) << " query: " << wsrep_thd_query(wait_lock->trx->mysql_thd); lock_rec_print(stderr, wait_lock, mtr); ib::info() << "WSREP: do not wait another BF trx: " << ib::hex(lock->trx->id) << " query: " << wsrep_thd_query(lock->trx->mysql_thd); lock_rec_print(stderr, lock, mtr); } /* don't wait for another BF lock */ continue ; } #endif /* WITH_WSREP */ return (lock); } } return (NULL); rpizzi , do you see those WSREP messages in the server error log right before the crash? It might be that this bug would be addressed by fixing MDEV-21910 , or that the above WITH_WSREP code is trying to work around MDEV-21910 . In any case, I think that the code is misplaced, and the code should be moved to the relevant caller of lock_rec_has_to_wait_in_queue() . The problematic code is already present in 10.1 (which is the first major version where the WITH_WSREP changes are part of the main server).

Marko Mäkelä made changes - 2020-07-14 12:56

Link

This issue relates to ~~MDEV-21910~~ [ ~~MDEV-21910~~ ]

Marko Mäkelä made changes - 2020-07-14 12:56

Component/s		Galera [ 10124 ]
Fix Version/s		10.1 [ 16100 ]
Fix Version/s		10.2 [ 14601 ]
Fix Version/s		10.3 [ 22126 ]
Fix Version/s		10.5 [ 23123 ]
Affects Version/s		10.1 [ 16100 ]
Affects Version/s		10.2 [ 14601 ]
Affects Version/s		10.3 [ 22126 ]
Affects Version/s		10.4 [ 22408 ]
Affects Version/s		10.5 [ 23123 ]
Assignee	Marko Mäkelä [ marko ]	Jan Lindström [ jplindst ]

Marko Mäkelä made changes - 2020-07-14 12:58

Summary

server crashing repeatedly

SIGSEGV in lock_rec_unlock() when Galera is enabled

Rick Pizzi (Inactive) added a comment - 2020-07-14 13:34

No we did not see that error but maybe it requires wsrep_debug = ON. I will try and enable it.

Rick Pizzi (Inactive) added a comment - 2020-07-14 13:34 No we did not see that error but maybe it requires wsrep_debug = ON. I will try and enable it.

Marko Mäkelä added a comment - 2020-07-15 13:41

I believe that this code should be disabled by default. However, until ~~MDEV-19563~~ was addressed, Galera was recommending setting the deprecated parameter innodb_locks_unsafe_for_binlog. The lock_rec_unlock() should only be effective if ‘semi-consistent reads’ are enabled (when that deprecated parameter is set, or the transaction isolation level is READ UNCOMMITTED or READ COMMITTED).

Marko Mäkelä added a comment - 2020-07-15 13:41 I believe that this code should be disabled by default. However, until MDEV-19563 was addressed, Galera was recommending setting the deprecated parameter innodb_locks_unsafe_for_binlog . The lock_rec_unlock() should only be effective if ‘semi-consistent reads’ are enabled (when that deprecated parameter is set, or the transaction isolation level is READ UNCOMMITTED or READ COMMITTED ).

Jan Lindström (Inactive) added a comment - 2020-07-15 14:41

In Galera case it should go that piece of code only if we have two transactions running on applier node making them both Brute Force (BF) as there is wsrep_thd_is_BF conditions for both locks. I'm surprised if error log does not have anything else about this situation.

Jan Lindström (Inactive) added a comment - 2020-07-15 14:41 In Galera case it should go that piece of code only if we have two transactions running on applier node making them both Brute Force (BF) as there is wsrep_thd_is_BF conditions for both locks. I'm surprised if error log does not have anything else about this situation.

Rick Pizzi (Inactive) added a comment - 2020-07-15 14:45

So get surprised Nothing in logs.

Rick Pizzi (Inactive) added a comment - 2020-07-15 14:45 So get surprised Nothing in logs.

Nick (Inactive) made changes - 2020-07-15 16:47

Assignee

Jan Lindström [ jplindst ]

Nick [ nicklamb ]

Rick Pizzi (Inactive) made changes - 2020-07-15 20:05

Link

This issue relates to ~~MDEV-23186~~ [ ~~MDEV-23186~~ ]

Rick Pizzi (Inactive) added a comment - 2020-07-16 07:21

We need to run the server with the already existing option to skip backtraces. I have investigated the source code and in fact that's where it crashes the 2nd time, so using --skip-stack-trace should avoid that.

Rick Pizzi (Inactive) added a comment - 2020-07-16 07:21 We need to run the server with the already existing option to skip backtraces. I have investigated the source code and in fact that's where it crashes the 2nd time, so using --skip-stack-trace should avoid that.

Julien Fritsch made changes - 2020-07-16 15:32

Assignee

Nick [ nicklamb ]

Julius Goryavsky [ sysprg ]

Julien Fritsch made changes - 2020-07-16 15:33

Status

Open [ 1 ]

In Progress [ 3 ]

Jan Lindström (Inactive) added a comment - 2020-07-20 07:01

rpizzi Does the problem repeat with --wsrep-debug=ON ? That setting has some negative effect on performance but I think we need that error log with it to narrow down the search.

Jan Lindström (Inactive) added a comment - 2020-07-20 07:01 rpizzi Does the problem repeat with --wsrep-debug=ON ? That setting has some negative effect on performance but I think we need that error log with it to narrow down the search.

Julien Fritsch made changes - 2020-07-20 07:55

Assignee

Julius Goryavsky [ sysprg ]

Jan Lindström [ jplindst ]

Julien Fritsch made changes - 2020-07-20 07:55

Labels

need_feedback

Rick Pizzi (Inactive) added a comment - 2020-07-20 08:26

jplindst the crash did not happen since about 2 weeks now (without any change that we are aware of).
We have disabled backtracing in the hope to get a core dump. wsrep_debug in galera 4 is too much verbose and can't be enabled with customers which have a busy workload like in this case, as it produces gigabytes of logs per hour. If we get a core it's not important anyway.

Thanks

Rick

Rick Pizzi (Inactive) added a comment - 2020-07-20 08:26 jplindst the crash did not happen since about 2 weeks now (without any change that we are aware of). We have disabled backtracing in the hope to get a core dump. wsrep_debug in galera 4 is too much verbose and can't be enabled with customers which have a busy workload like in this case, as it produces gigabytes of logs per hour. If we get a core it's not important anyway. Thanks Rick

Jan Lindström (Inactive) added a comment - 2020-07-29 12:27

I was trying to understand why we would not want to wait another BF transaction lock in lock_rec_has_to_wait_in_queue() function. However, I could not find any reason, if something like that happens there is no option except wait for your turn. I even tested it by removing the code and running regression test suite and could not find a single regression. For some reason there is more complex conditions on e.g. MySQL 8.0 version of this code.

Jan Lindström (Inactive) added a comment - 2020-07-29 12:27 I was trying to understand why we would not want to wait another BF transaction lock in lock_rec_has_to_wait_in_queue() function. However, I could not find any reason, if something like that happens there is no option except wait for your turn. I even tested it by removing the code and running regression test suite and could not find a single regression. For some reason there is more complex conditions on e.g. MySQL 8.0 version of this code.

Jan Lindström (Inactive) made changes - 2020-08-03 05:50

Status

In Progress [ 3 ]

Stalled [ 10000 ]

Jan Lindström (Inactive) made changes - 2020-08-03 06:02

Fix Version/s

10.1 [ 16100 ]

Jan Lindström (Inactive) made changes - 2020-08-03 06:06

Status

Stalled [ 10000 ]

In Progress [ 3 ]

Jan Lindström (Inactive) made changes - 2020-08-04 04:56

Status

In Progress [ 3 ]

Stalled [ 10000 ]

Roel Van de Paar added a comment - 2020-08-04 05:33 - edited

*** buffer overflow detected ***

: If it is possible to use an ASAN instrumented binary for some time (for example on a clone/binlog slave, or even on a low-use production machine) then we may be able to more easily see the bug show up the error log.

Roel Van de Paar added a comment - 2020-08-04 05:33 - edited *** buffer overflow detected *** : If it is possible to use an ASAN instrumented binary for some time (for example on a clone/binlog slave, or even on a low-use production machine) then we may be able to more easily see the bug show up the error log.

Roel Van de Paar made changes - 2020-08-04 05:44

Description

Server is crashing few times a day with the following stacktrace:

{code}
/lib64/libc.so.6(+0x115b62)[0x7fcadad58b62]
/lib64/libc.so.6(+0x117947)[0x7fcadad5a947]
/usr/sbin/mysqld(my_addr_resolve+0xda)[0x55f2dc88214a]
/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x55f2dc86a862]
/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x55f2dc300f7f]
/lib64/libpthread.so.0(+0xf5d0)[0x7fcadc9a65d0]
/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]
/usr/sbin/mysqld(+0xa9d47e)[0x55f2dc4ee47e]
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]
/usr/sbin/mysqld(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x2110)[0x55f2dc1a5c60]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2ff5)[0x55f2dc0fcd15]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x1fb)[0x55f2dc1028ab]
/usr/sbin/mysqld(+0x5a0090)[0x55f2dbff1090]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x28f8)[0x55f2dc1059d8]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x109)[0x55f2dc106049]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1fa)[0x55f2dc1e3bfa]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55f2dc1e3cdd]
/usr/sbin/mysqld(+0xdcb78d)[0x55f2dc81c78d]
/lib64/libpthread.so.0(+0x7dd5)[0x7fcadc99edd5]
/lib64/libc.so.6(clone+0x6d)[0x7fcadad40ead]
{code}

We tried to set up for core dump generation, but it appears it crashes trying to access memory and therefore no core dump is generated:

{code}
200703 13:10:16 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.4.13-MariaDB-log
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=10223
max_threads=20002
thread_count=8406
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 123517134 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7faf48086b98
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7facc4190cc0 thread_stack 0x49000
*** buffer overflow detected ***: /usr/sbin/mysqld terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7fcadad5a9e7]
/lib64/libc.so.6(+0x115b62)[0x7fcadad58b62]
/lib64/libc.so.6(+0x117947)[0x7fcadad5a947]
/usr/sbin/mysqld(my_addr_resolve+0xda)[0x55f2dc88214a]
/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x55f2dc86a862]
/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x55f2dc300f7f]
/lib64/libpthread.so.0(+0xf5d0)[0x7fcadc9a65d0]
/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]
/usr/sbin/mysqld(+0xa9d47e)[0x55f2dc4ee47e]
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]
/usr/sbin/mysqld(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x2110)[0x55f2dc1a5c60]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2ff5)[0x55f2dc0fcd15]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x1fb)[0x55f2dc1028ab]
/usr/sbin/mysqld(+0x5a0090)[0x55f2dbff1090]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x28f8)[0x55f2dc1059d8]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x109)[0x55f2dc106049]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1fa)[0x55f2dc1e3bfa]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55f2dc1e3cdd]
/usr/sbin/mysqld(+0xdcb78d)[0x55f2dc81c78d]
/lib64/libpthread.so.0(+0x7dd5)[0x7fcadc99edd5]
/lib64/libc.so.6(clone+0x6d)[0x7fcadad40ead]
======= Memory map: ========
55f2dba51000-55f2dceca000 r-xp 00000000 fd:00 436210356 /usr/sbin/mysqld
55f2dd0ca000-55f2dd1fd000 r--p 01479000 fd:00 436210356 /usr/sbin/mysqld
55f2dd1fd000-55f2dd2b7000 rw-p 015ac000 fd:00 436210356 /usr/sbin/mysqld
55f2dd2b7000-55f2ddb66000 rw-p 00000000 00:00 0
55f2de4cb000-55f348284000 rw-p 00000000 00:00 0 [heap]
7fabcc000000-7fabcdb8a000 rw-p 00000000 00:00 0

[ ... ]

7fcadcdd6000-7fcadcdd7000 rw-p 00000000 00:00 0
7fff76cfc000-7fff76d1d000 rw-p 00000000 00:00 0 [stack]
7fff76d84000-7fff76d86000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Fatal signal 6 while backtracing

{code}

Server is crashing few times a day with the following stacktrace:

{code}
/lib64/libc.so.6(+0x115b62)[0x7fcadad58b62]
/lib64/libc.so.6(+0x117947)[0x7fcadad5a947]
/usr/sbin/mysqld(my_addr_resolve+0xda)[0x55f2dc88214a]
/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x55f2dc86a862]
/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x55f2dc300f7f]
/lib64/libpthread.so.0(+0xf5d0)[0x7fcadc9a65d0]
/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]
/usr/sbin/mysqld(+0xa9d47e)[0x55f2dc4ee47e]
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]
/usr/sbin/mysqld(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x2110)[0x55f2dc1a5c60]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2ff5)[0x55f2dc0fcd15]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x1fb)[0x55f2dc1028ab]
/usr/sbin/mysqld(+0x5a0090)[0x55f2dbff1090]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x28f8)[0x55f2dc1059d8]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x109)[0x55f2dc106049]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1fa)[0x55f2dc1e3bfa]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55f2dc1e3cdd]
/usr/sbin/mysqld(+0xdcb78d)[0x55f2dc81c78d]
/lib64/libpthread.so.0(+0x7dd5)[0x7fcadc99edd5]
/lib64/libc.so.6(clone+0x6d)[0x7fcadad40ead]
{code}

{noformat:title=3 Top frames, resolved}
/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908] lock_rec_has_to_wait_in_queue(ib_lock_t const*) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79).
/usr/sbin/mysqld(0xa9d47e)[0x55f2dc4ee47e] lock_rec_unlock(trx_t*, buf_block_t const*, unsigned char const*, lock_mode) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214).
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626] row_unlock_for_mysql(row_prebuilt_t*, unsigned long) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2093).
{noformat}

We tried to set up for core dump generation, but it appears it crashes trying to access memory and therefore no core dump is generated:

{code}
200703 13:10:16 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.4.13-MariaDB-log
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=10223
max_threads=20002
thread_count=8406
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 123517134 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7faf48086b98
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7facc4190cc0 thread_stack 0x49000
*** buffer overflow detected ***: /usr/sbin/mysqld terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7fcadad5a9e7]
/lib64/libc.so.6(+0x115b62)[0x7fcadad58b62]
/lib64/libc.so.6(+0x117947)[0x7fcadad5a947]
/usr/sbin/mysqld(my_addr_resolve+0xda)[0x55f2dc88214a]
/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x55f2dc86a862]
/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x55f2dc300f7f]
/lib64/libpthread.so.0(+0xf5d0)[0x7fcadc9a65d0]
/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]
/usr/sbin/mysqld(+0xa9d47e)[0x55f2dc4ee47e]
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]
/usr/sbin/mysqld(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x2110)[0x55f2dc1a5c60]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2ff5)[0x55f2dc0fcd15]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x1fb)[0x55f2dc1028ab]
/usr/sbin/mysqld(+0x5a0090)[0x55f2dbff1090]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x28f8)[0x55f2dc1059d8]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x109)[0x55f2dc106049]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1fa)[0x55f2dc1e3bfa]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55f2dc1e3cdd]
/usr/sbin/mysqld(+0xdcb78d)[0x55f2dc81c78d]
/lib64/libpthread.so.0(+0x7dd5)[0x7fcadc99edd5]
/lib64/libc.so.6(clone+0x6d)[0x7fcadad40ead]
======= Memory map: ========
55f2dba51000-55f2dceca000 r-xp 00000000 fd:00 436210356 /usr/sbin/mysqld
55f2dd0ca000-55f2dd1fd000 r--p 01479000 fd:00 436210356 /usr/sbin/mysqld
55f2dd1fd000-55f2dd2b7000 rw-p 015ac000 fd:00 436210356 /usr/sbin/mysqld
55f2dd2b7000-55f2ddb66000 rw-p 00000000 00:00 0
55f2de4cb000-55f348284000 rw-p 00000000 00:00 0 [heap]
7fabcc000000-7fabcdb8a000 rw-p 00000000 00:00 0

[ ... ]

7fcadcdd6000-7fcadcdd7000 rw-p 00000000 00:00 0
7fff76cfc000-7fff76d1d000 rw-p 00000000 00:00 0 [stack]
7fff76d84000-7fff76d86000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Fatal signal 6 while backtracing

{code}

Rick Pizzi (Inactive) added a comment - 2020-08-04 06:27 - edited

Roelthis is what we (RDBA Team) have been asking since the beginning.
We can certainly run a special binary as long as it is production quality, the only issue is the crash is not happening anymore since 2 weeks (it was happening almost daily before). We did not change anything and neither did the customer.

Rick

Rick Pizzi (Inactive) added a comment - 2020-08-04 06:27 - edited Roel this is what we (RDBA Team) have been asking since the beginning. We can certainly run a special binary as long as it is production quality, the only issue is the crash is not happening anymore since 2 weeks (it was happening almost daily before). We did not change anything and neither did the customer. Rick

Sergei Golubchik made changes - 2020-08-04 11:12

Priority

Blocker [ 1 ]

Major [ 3 ]

Valerii Kravchuk made changes - 2020-08-05 11:54

Labels

need_feedback

Valerii Kravchuk made changes - 2020-08-05 12:00

Priority

Major [ 3 ]

Critical [ 2 ]

Jan Lindström (Inactive) added a comment - 2020-08-05 13:00

This confirms that we could end up referencing NULL pointer on lock_rec_get_n_bits because of bug on lock_rec_has_to_wait_in_queue that is fixed a8ec45863b958757da61af3b2ce0a38b0a79d92c for 10.2 and it should be merged to other releases.

Jan Lindström (Inactive) added a comment - 2020-08-05 13:00 This confirms that we could end up referencing NULL pointer on lock_rec_get_n_bits because of bug on lock_rec_has_to_wait_in_queue that is fixed a8ec45863b958757da61af3b2ce0a38b0a79d92c for 10.2 and it should be merged to other releases.

Marko Mäkelä made changes - 2020-08-05 15:42

Link

This issue relates to ~~MDEV-23409~~ [ ~~MDEV-23409~~ ]

Marko Mäkelä added a comment - 2020-08-05 17:46

valerii posted information that confirms that the READ COMMITTED isolation level was being used: trx->isolation_level=1=TRX_ISO_READ_COMMITTED. It is the designed way of enabling the semi-consistent read for UPDATE operations. The idea of MySQL Bug #3300 is to unlock non-matching rows in a table scan.

Setting the isolation level to the default REPEATABLE READ should avoid hitting this Galera bug.

jplindst must debug this in order to create a repeatable test case. It is not enough to prevent a crash when Galera is enabled; we must also ensure that all locks are being replicated correctly across the cluster.

Marko Mäkelä added a comment - 2020-08-05 17:46 valerii posted information that confirms that the READ COMMITTED isolation level was being used: trx->isolation_level=1=TRX_ISO_READ_COMMITTED . It is the designed way of enabling the semi-consistent read for UPDATE operations. The idea of MySQL Bug #3300 is to unlock non-matching rows in a table scan. Setting the isolation level to the default REPEATABLE READ should avoid hitting this Galera bug. jplindst must debug this in order to create a repeatable test case. It is not enough to prevent a crash when Galera is enabled; we must also ensure that all locks are being replicated correctly across the cluster.

Jan Lindström (Inactive) made changes - 2020-08-06 05:53

Status

Stalled [ 10000 ]

In Progress [ 3 ]

Jan Lindström (Inactive) made changes - 2020-08-06 10:18

Status

In Progress [ 3 ]

Stalled [ 10000 ]

Jan Lindström (Inactive) made changes - 2020-08-06 10:18

Labels

need_feedback

Marko Mäkelä added a comment - 2020-08-07 09:43

I checked all assignments to trx_t::isolation_level in MariaDB 10.4.13, and it looks like the only way for InnoDB to use READ COMMITTED is when thd_tx_isolation() returns ISO_READ_COMMITTED. Outside the SQL parser, the only assignments that refer to that constant are related to Galera:

git grep -w ISO_READ_COMMITTED mariadb-10.4.13
mariadb-10.4.13:sql/wsrep_high_priority_service.cc: thd->variables.tx_isolation= ISO_READ_COMMITTED;
mariadb-10.4.13:sql/wsrep_high_priority_service.cc: thd->tx_isolation = ISO_READ_COMMITTED;
mariadb-10.4.13:sql/wsrep_schema.cc: thd->variables.tx_isolation= ISO_READ_COMMITTED;
mariadb-10.4.13:sql/wsrep_sst.cc: thd->variables.tx_isolation= ISO_READ_COMMITTED;
mariadb-10.4.13:sql/wsrep_storage_service.cc: thd->variables.tx_isolation = ISO_READ_COMMITTED;

Based on this finding, it might be that this bug cannot be worked around.

Marko Mäkelä added a comment - 2020-08-07 09:43 I checked all assignments to trx_t::isolation_level in MariaDB 10.4.13, and it looks like the only way for InnoDB to use READ COMMITTED is when thd_tx_isolation() returns ISO_READ_COMMITTED . Outside the SQL parser, the only assignments that refer to that constant are related to Galera: git grep -w ISO_READ_COMMITTED mariadb-10.4.13 mariadb-10.4.13:sql/wsrep_high_priority_service.cc: thd->variables.tx_isolation= ISO_READ_COMMITTED; mariadb-10.4.13:sql/wsrep_high_priority_service.cc: thd->tx_isolation = ISO_READ_COMMITTED; mariadb-10.4.13:sql/wsrep_schema.cc: thd->variables.tx_isolation= ISO_READ_COMMITTED; mariadb-10.4.13:sql/wsrep_sst.cc: thd->variables.tx_isolation= ISO_READ_COMMITTED; mariadb-10.4.13:sql/wsrep_storage_service.cc: thd->variables.tx_isolation = ISO_READ_COMMITTED; Based on this finding, it might be that this bug cannot be worked around.

Jan Lindström (Inactive) added a comment - 2020-08-08 10:38

Analysis:

wsrep_high_priority_service : Used only on applier threads (i.e. brute force)
wsrep_schema : Used only when node starts from empty database
wsrep_sst : Used on joiner during SST only
wsrep_storage_service : Used for streaming replication

There is no indication that these are involved on this case.

Jan Lindström (Inactive) added a comment - 2020-08-08 10:38 Analysis: wsrep_high_priority_service : Used only on applier threads (i.e. brute force) wsrep_schema : Used only when node starts from empty database wsrep_sst : Used on joiner during SST only wsrep_storage_service : Used for streaming replication There is no indication that these are involved on this case.

Jan Lindström (Inactive) made changes - 2020-08-12 07:13

Status

Stalled [ 10000 ]

In Progress [ 3 ]

Jan Lindström (Inactive) added a comment - 2020-08-12 08:22 - edited

I made analysis from full stack dump and found out following points:

Thread 9222 (Thread 0x7f694c3bf700 (LWP 54589)):

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185

#1  0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158

#2  wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325

#3  os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502

#4  0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f694c3ba7e0: 0x7f858c6db810) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471

#5  0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=1937) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477

#6  0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=n_spins@entry=30, n_delay=4, line=line@entry=1937, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594

#7  0x000055dee6f2e051 in lock_rec_lock (impl=impl@entry=false, mode=1027, block=0x7f7fa2f61118, heap_no=55, index=index@entry=0x7f6f5009db78, thr=thr@entry=0x7f6b88779e48) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:1937

#8  0x000055dee6f2ebc9 in lock_clust_rec_read_check_and_lock (flags=flags@entry=0, block=<optimized out>, rec=rec@entry=0x7f7fa47e5e4a <Address 0x7f7fa47e5e4a out of bounds>, index=index@entry=0x7f6f5009db78, offsets=<optimized out>, mode=<optimized out>, gap_mode=1024, thr=thr@entry=0x7f6b88779e48) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:5829

#9  0x000055dee6a3d5a0 in sel_set_rec_lock (pcur=<optimized out>, rec=0x7f7fa47e5e4a <Address 0x7f7fa47e5e4a out of bounds>, index=0x7f6f5009db78, offsets=<optimized out>, mode=<optimized out>, type=<optimized out>, thr=0x7f6b88779e48, mtr=0x7f694c3bb870) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0sel.cc:1264

#10 0x000055dee6fc99c9 in row_search_mvcc (buf=<optimized out>, mode=PAGE_CUR_G, prebuilt=0x7f6b88779628, match_mode=<optimized out>, direction=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0sel.cc:5063

#11 0x000055dee6ee91d7 in ha_innobase::general_fetch (this=0x7f6b883c3500, buf=0x7f6b886e9ea0 "\365)\216\005", direction=1, match_mode=0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/handler/ha_innodb.cc:9544

#12 0x000055dee6d3f411 in handler::ha_rnd_next (this=0x7f6b883c3500, buf=0x7f6b886e9ea0 "\365)\216\005") at /usr/src/debug/MariaDB-10.4.13/src_0/sql/handler.cc:2839

#13 0x000055dee6e6da6c in rr_sequential (info=0x7f694c3bc020) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/records.cc:477

#14 0x000055dee6bdea68 in read_record (this=0x7f694c3bc020) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/records.h:69

#15 mysql_update (thd=thd@entry=0x7f6ef8305878, table_list=0x7f6ef8313f00, fields=..., values=..., conds=0x7f6ef8314d50, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f694c3bc560, updated_return=updated_return@entry=0x7f694c3bc620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:975

#16 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f6ef8305878) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376

#17 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f6ef8305878, rawbuf=rawbuf@entry=0x7f6ef8313ba0 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f694c3be350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900

This is local UPDATE waiting lock_sys->mutex in lock_rec_lock().

Thread 6699 (Thread 0x7f696e127700 (LWP 715)):

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185

#1  0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158

#2  wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325

#3  os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502

#4  0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f696e123460: 0x7f858c6db190) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471

#5  0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=4170) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477

#6  0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=30, n_delay=4, line=line@entry=4170, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594

#7  0x000055dee6f283a4 in lock_rec_unlock (trx=trx@entry=0x7f6bd00c9148, block=0x7f7fa2f5c0f0, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4170

#8  0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6f882010b8, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091

#9  0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x7f704c285228, table_list=0x7f704c27b190, fields=..., values=..., conds=0x7f704c27bfe0, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f696e124560, updated_return=updated_return@entry=0x7f696e124620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168

#10 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f704c285228) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376

#11 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f704c285228, rawbuf=rawbuf@entry=0x7f704c27ae30 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f696e126350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900

Here local UPDATE waiting lock_sys->mutex on lock_rec_unlock().

Thread 3484 (Thread 0x7f6968817700 (LWP 54282)):

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185

#1  0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158

#2  wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325

#3  os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502

#4  0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f6968813460: 0x7f858c6db410) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471

#5  0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=4170) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477

#6  0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=30, n_delay=4, line=line@entry=4170, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594

#7  0x000055dee6f283a4 in lock_rec_unlock (trx=trx@entry=0x7f6ea427bdd8, block=0x7f7fa2f5d608, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4170

#8  0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6ab8578ec8, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091

#9  0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x55df52f78a68, table_list=0x55df52f87930, fields=..., values=..., conds=0x55df52f88780, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f6968814560, updated_return=updated_return@entry=0x7f6968814620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168

#10 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x55df52f78a68) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376

#11 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x55df52f78a68, rawbuf=rawbuf@entry=0x55df52f875d0 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f6968816350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900

This is also local UPDATE waiting for lock_sys->mutex in lock_rec_unlock(). Finally, the crashing thread:

Thread 1 (Thread 0x7f6a24847700 (LWP 423789)):

#0  0x00007f884411e9d1 in __pthread_kill (threadid=<optimized out>, signo=11) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:59

#1  0x000055dee6d3affe in handle_fatal_signal (sig=11) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/signal_handler.cc:343

#2  <signal handler called>

#3  lock_rec_get_n_bits (lock=0x0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79

#4  lock_rec_has_to_wait_in_queue (wait_lock=wait_lock@entry=0x7f80d12e3210) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:2050

#5  0x000055dee6f2847e in lock_rec_unlock (trx=trx@entry=0x7f858c6a69b8, block=<optimized out>, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214

#6  0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6f50094958, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091

#7  0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x7f6d8c0921d8, table_list=0x7f6d8c0a17c0, fields=..., values=..., conds=0x7f6d8c0a2610, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f6a24844560, updated_return=updated_return@entry=0x7f6a24844620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168

#8  0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f6d8c0921d8) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376

#9  0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f6d8c0921d8, rawbuf=rawbuf@entry=0x7f6d8c0a1460 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f6a24846350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900

Finds NULL pointer on lock variable. Analysis:

Customer is using single-master Galera topology. There is no indication of any applier threads running during crash.
Transaction isolation level for crashing trx is READ COMMITTED
There is no indication of Galera replay

As this is single-master condition in lock_rec_has_to_wait_in_queue() i.e.

			if (wsrep_thd_is_BF(wait_lock->trx->mysql_thd, FALSE) &&

			    wsrep_thd_is_BF(lock->trx->mysql_thd, TRUE)) {

is always false.

As this is master node this is not lock replication problem
At this point I do not see any evidence that this is Galera bug
Listed threads are only one's that are interesting at the crash point

Jan Lindström (Inactive) added a comment - 2020-08-12 08:22 - edited I made analysis from full stack dump and found out following points: Thread 9222 (Thread 0x7f694c3bf700 (LWP 54589)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158 #2 wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325 #3 os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502 #4 0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f694c3ba7e0: 0x7f858c6db810) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471 #5 0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=1937) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477 #6 0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=n_spins@entry=30, n_delay=4, line=line@entry=1937, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594 #7 0x000055dee6f2e051 in lock_rec_lock (impl=impl@entry=false, mode=1027, block=0x7f7fa2f61118, heap_no=55, index=index@entry=0x7f6f5009db78, thr=thr@entry=0x7f6b88779e48) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:1937 #8 0x000055dee6f2ebc9 in lock_clust_rec_read_check_and_lock (flags=flags@entry=0, block=<optimized out>, rec=rec@entry=0x7f7fa47e5e4a <Address 0x7f7fa47e5e4a out of bounds>, index=index@entry=0x7f6f5009db78, offsets=<optimized out>, mode=<optimized out>, gap_mode=1024, thr=thr@entry=0x7f6b88779e48) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:5829 #9 0x000055dee6a3d5a0 in sel_set_rec_lock (pcur=<optimized out>, rec=0x7f7fa47e5e4a <Address 0x7f7fa47e5e4a out of bounds>, index=0x7f6f5009db78, offsets=<optimized out>, mode=<optimized out>, type=<optimized out>, thr=0x7f6b88779e48, mtr=0x7f694c3bb870) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0sel.cc:1264 #10 0x000055dee6fc99c9 in row_search_mvcc (buf=<optimized out>, mode=PAGE_CUR_G, prebuilt=0x7f6b88779628, match_mode=<optimized out>, direction=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0sel.cc:5063 #11 0x000055dee6ee91d7 in ha_innobase::general_fetch (this=0x7f6b883c3500, buf=0x7f6b886e9ea0 "\365)\216\005", direction=1, match_mode=0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/handler/ha_innodb.cc:9544 #12 0x000055dee6d3f411 in handler::ha_rnd_next (this=0x7f6b883c3500, buf=0x7f6b886e9ea0 "\365)\216\005") at /usr/src/debug/MariaDB-10.4.13/src_0/sql/handler.cc:2839 #13 0x000055dee6e6da6c in rr_sequential (info=0x7f694c3bc020) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/records.cc:477 #14 0x000055dee6bdea68 in read_record (this=0x7f694c3bc020) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/records.h:69 #15 mysql_update (thd=thd@entry=0x7f6ef8305878, table_list=0x7f6ef8313f00, fields=..., values=..., conds=0x7f6ef8314d50, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f694c3bc560, updated_return=updated_return@entry=0x7f694c3bc620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:975 #16 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f6ef8305878) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376 #17 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f6ef8305878, rawbuf=rawbuf@entry=0x7f6ef8313ba0 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f694c3be350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900 This is local UPDATE waiting lock_sys->mutex in lock_rec_lock(). Thread 6699 (Thread 0x7f696e127700 (LWP 715)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158 #2 wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325 #3 os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502 #4 0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f696e123460: 0x7f858c6db190) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471 #5 0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=4170) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477 #6 0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=30, n_delay=4, line=line@entry=4170, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594 #7 0x000055dee6f283a4 in lock_rec_unlock (trx=trx@entry=0x7f6bd00c9148, block=0x7f7fa2f5c0f0, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4170 #8 0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6f882010b8, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091 #9 0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x7f704c285228, table_list=0x7f704c27b190, fields=..., values=..., conds=0x7f704c27bfe0, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f696e124560, updated_return=updated_return@entry=0x7f696e124620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168 #10 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f704c285228) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376 #11 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f704c285228, rawbuf=rawbuf@entry=0x7f704c27ae30 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f696e126350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900 Here local UPDATE waiting lock_sys->mutex on lock_rec_unlock(). Thread 3484 (Thread 0x7f6968817700 (LWP 54282)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158 #2 wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325 #3 os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502 #4 0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f6968813460: 0x7f858c6db410) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471 #5 0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=4170) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477 #6 0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=30, n_delay=4, line=line@entry=4170, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594 #7 0x000055dee6f283a4 in lock_rec_unlock (trx=trx@entry=0x7f6ea427bdd8, block=0x7f7fa2f5d608, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4170 #8 0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6ab8578ec8, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091 #9 0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x55df52f78a68, table_list=0x55df52f87930, fields=..., values=..., conds=0x55df52f88780, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f6968814560, updated_return=updated_return@entry=0x7f6968814620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168 #10 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x55df52f78a68) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376 #11 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x55df52f78a68, rawbuf=rawbuf@entry=0x55df52f875d0 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f6968816350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900 This is also local UPDATE waiting for lock_sys->mutex in lock_rec_unlock(). Finally, the crashing thread: Thread 1 (Thread 0x7f6a24847700 (LWP 423789)): #0 0x00007f884411e9d1 in __pthread_kill (threadid=<optimized out>, signo=11) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:59 #1 0x000055dee6d3affe in handle_fatal_signal (sig=11) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/signal_handler.cc:343 #2 <signal handler called> #3 lock_rec_get_n_bits (lock=0x0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79 #4 lock_rec_has_to_wait_in_queue (wait_lock=wait_lock@entry=0x7f80d12e3210) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:2050 #5 0x000055dee6f2847e in lock_rec_unlock (trx=trx@entry=0x7f858c6a69b8, block=<optimized out>, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214 #6 0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6f50094958, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091 #7 0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x7f6d8c0921d8, table_list=0x7f6d8c0a17c0, fields=..., values=..., conds=0x7f6d8c0a2610, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f6a24844560, updated_return=updated_return@entry=0x7f6a24844620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168 #8 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f6d8c0921d8) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376 #9 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f6d8c0921d8, rawbuf=rawbuf@entry=0x7f6d8c0a1460 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f6a24846350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900 Finds NULL pointer on lock variable. Analysis: Customer is using single-master Galera topology. There is no indication of any applier threads running during crash. Transaction isolation level for crashing trx is READ COMMITTED There is no indication of Galera replay As this is single-master condition in lock_rec_has_to_wait_in_queue() i.e. if (wsrep_thd_is_BF(wait_lock->trx->mysql_thd, FALSE) && wsrep_thd_is_BF(lock->trx->mysql_thd, TRUE)) { is always false. As this is master node this is not lock replication problem At this point I do not see any evidence that this is Galera bug Listed threads are only one's that are interesting at the crash point

Jan Lindström (Inactive) made changes - 2020-08-12 08:23

Assignee

Jan Lindström [ jplindst ]

Marko Mäkelä [ marko ]

Jan Lindström (Inactive) made changes - 2020-08-12 08:24

Attachment

thd.txt [ 53215 ]

Jan Lindström (Inactive) made changes - 2020-08-12 08:24

Attachment

trx.txt [ 53216 ]

Jan Lindström (Inactive) made changes - 2020-08-12 09:49

Labels

need_feedback

Marko Mäkelä made changes - 2020-08-12 12:25

Assignee

Marko Mäkelä [ marko ]

Jan Lindström [ jplindst ]

Jan Lindström (Inactive) added a comment - 2020-08-18 13:27

https://github.com/MariaDB/server/commit/fb2fc051004d11b4aeea81683596f505b58cab23

This is only for requested condition move with additional instrumentation when something goes wrong. I have failed to reproduce the actual case using RQG, mtr and sysbench. At the moment I do not have good ideas how to continue. One possibility is to give special build to customer and see if we still hit same bug. At least in that case we have additional information on error log to analyze.

Jan Lindström (Inactive) added a comment - 2020-08-18 13:27 https://github.com/MariaDB/server/commit/fb2fc051004d11b4aeea81683596f505b58cab23 This is only for requested condition move with additional instrumentation when something goes wrong. I have failed to reproduce the actual case using RQG, mtr and sysbench. At the moment I do not have good ideas how to continue. One possibility is to give special build to customer and see if we still hit same bug. At least in that case we have additional information on error log to analyze.

Jan Lindström (Inactive) made changes - 2020-08-18 13:27

Status

In Progress [ 3 ]

In Review [ 10002 ]

Jan Lindström (Inactive) made changes - 2020-08-20 04:10

Assignee

Jan Lindström [ jplindst ]

Marko Mäkelä [ marko ]

Marko Mäkelä added a comment - 2020-08-24 08:07

Some code cleanup in this area is welcome, but I think that more effort is needed to reproduce this error.

Please revise the patch and request a further review.

When it comes to the suspicion that the loop would fail to terminate without Galera involvement, I think that it is extremely unlikely. The function lock_rec_unlock() would acquire lock_sys.mutex and trx_t::mutex, find a matching lock, and proceed to release the lock while holding those mutexes. The lock should not possibly disappear while we are holding those mutexes. Besides, locks should only be released by the thread that is executing the transaction, not by other threads. Other threads may only convert implicit locks into explicit ones, on behalf of the lock-holder transaction. That activity is blocked by both said mutexes.

Marko Mäkelä added a comment - 2020-08-24 08:07 Some code cleanup in this area is welcome, but I think that more effort is needed to reproduce this error. Please revise the patch and request a further review. When it comes to the suspicion that the loop would fail to terminate without Galera involvement, I think that it is extremely unlikely. The function lock_rec_unlock() would acquire lock_sys.mutex and trx_t::mutex , find a matching lock, and proceed to release the lock while holding those mutexes. The lock should not possibly disappear while we are holding those mutexes. Besides, locks should only be released by the thread that is executing the transaction, not by other threads. Other threads may only convert implicit locks into explicit ones, on behalf of the lock-holder transaction. That activity is blocked by both said mutexes.

Marko Mäkelä made changes - 2020-08-24 08:07

Assignee	Marko Mäkelä [ marko ]	Jan Lindström [ jplindst ]
Status	In Review [ 10002 ]	Stalled [ 10000 ]

Jan Lindström (Inactive) added a comment - 2020-08-27 04:17

https://github.com/MariaDB/server/commit/c2c4a77004e38bcdb735c330b5b9529c625b372d

I have run lot of randgen runs on both release build and debug build, but still failed to reproduce this problem (some others I have found and created MDEVs for them).

Jan Lindström (Inactive) added a comment - 2020-08-27 04:17 https://github.com/MariaDB/server/commit/c2c4a77004e38bcdb735c330b5b9529c625b372d I have run lot of randgen runs on both release build and debug build, but still failed to reproduce this problem (some others I have found and created MDEVs for them).

Jan Lindström (Inactive) made changes - 2020-08-27 04:17

Assignee	Jan Lindström [ jplindst ]	Marko Mäkelä [ marko ]
Status	Stalled [ 10000 ]	In Review [ 10002 ]

Marko Mäkelä added a comment - 2020-09-02 09:03

This looks mostly OK now. Please revert some unnecessary changes (in particular, the change that would have made lock_trx_handle_wait() dead code), and request one more review.

Marko Mäkelä added a comment - 2020-09-02 09:03 This looks mostly OK now. Please revert some unnecessary changes (in particular, the change that would have made lock_trx_handle_wait() dead code), and request one more review.

Marko Mäkelä made changes - 2020-09-02 09:03

Assignee	Marko Mäkelä [ marko ]	Jan Lindström [ jplindst ]
Status	In Review [ 10002 ]	Stalled [ 10000 ]

Jan Lindström (Inactive) added a comment - 2020-09-07 11:02 - edited

Jan Lindström (Inactive) added a comment - 2020-09-07 11:02 - edited 10.2: https://github.com/MariaDB/server/commit/6830ebe08139fda352573b2f224f3bcec280494f 10.3: https://github.com/MariaDB/server/commit/45093c0da83ecad05414d3092cb637df30361506 10.4: https://github.com/MariaDB/server/commit/2713dacd5d69654536b2a2c03eec693c799f025b 10.5: https://github.com/MariaDB/server/commit/d48ba030015b07d04ba2c78985be2a8208933131

Rick Pizzi (Inactive) added a comment - 2020-09-07 14:56

jplindst the customer is running the debug build since last week. We will try to trigger the crash again. Usually it appears a few hours later some DDL has been run.

Rick Pizzi (Inactive) added a comment - 2020-09-07 14:56 jplindst the customer is running the debug build since last week. We will try to trigger the crash again. Usually it appears a few hours later some DDL has been run.

Jan Lindström (Inactive) added a comment - 2020-09-07 16:27

There is bigger difference between 10.3 and 10.4 (Galera 4), but all that is on server code.

Jan Lindström (Inactive) added a comment - 2020-09-07 16:27 There is bigger difference between 10.3 and 10.4 (Galera 4), but all that is on server code.

Jan Lindström (Inactive) made changes - 2020-09-07 16:27

Assignee	Jan Lindström [ jplindst ]	Marko Mäkelä [ marko ]
Status	Stalled [ 10000 ]	In Review [ 10002 ]

Marko Mäkelä added a comment - 2020-09-09 15:05

I posted a few minor comments to the 10.2 version. OK push to 10.2 after addressing those. I will use the other versions for reference when merging to later branches.

Marko Mäkelä added a comment - 2020-09-09 15:05 I posted a few minor comments to the 10.2 version. OK push to 10.2 after addressing those. I will use the other versions for reference when merging to later branches.

Marko Mäkelä made changes - 2020-09-09 15:05

Assignee	Marko Mäkelä [ marko ]	Jan Lindström [ jplindst ]
Status	In Review [ 10002 ]	Stalled [ 10000 ]

Jan Lindström (Inactive) made changes - 2020-09-10 10:19

Status

Stalled [ 10000 ]

In Progress [ 3 ]

Jan Lindström (Inactive) made changes - 2020-09-10 10:19

issue.field.resolutiondate

2020-09-10 10:19:56.0

2020-09-10 10:19:56.562

Jan Lindström (Inactive) made changes - 2020-09-10 10:20

Fix Version/s		10.2.34 [ 24505 ]
Fix Version/s		10.3.25 [ 24506 ]
Fix Version/s		10.4.15 [ 24507 ]
Fix Version/s		10.5.6 [ 24508 ]
Fix Version/s		10.6.0 [ 24431 ]
Fix Version/s	10.2 [ 14601 ]
Fix Version/s	10.3 [ 22126 ]
Fix Version/s	10.4 [ 22408 ]
Fix Version/s	10.5 [ 23123 ]
Resolution		Fixed [ 1 ]
Status	In Progress [ 3 ]	Closed [ 6 ]

Ralf Gebhardt made changes - 2020-10-06 18:35

Fix Version/s

10.5.7 [ 25019 ]

Ralf Gebhardt made changes - 2020-10-06 18:38

Fix Version/s

10.5.6 [ 24508 ]

Ralf Gebhardt made changes - 2020-10-06 18:40

Fix Version/s

10.4.16 [ 25020 ]

Ralf Gebhardt made changes - 2020-10-06 18:42

Fix Version/s

10.4.15 [ 24507 ]

Ralf Gebhardt made changes - 2020-10-06 18:43

Fix Version/s

10.3.26 [ 25021 ]

Ralf Gebhardt made changes - 2020-10-06 18:45

Fix Version/s

10.3.25 [ 24506 ]

Ralf Gebhardt made changes - 2020-10-06 18:47

Fix Version/s

10.2.35 [ 25022 ]

Ralf Gebhardt made changes - 2020-10-06 18:48

Fix Version/s

10.2.34 [ 24505 ]

Marko Mäkelä made changes - 2020-10-06 18:53

Link

This issue causes ~~MDEV-23897~~ [ ~~MDEV-23897~~ ]

Sergei Golubchik made changes - 2021-12-06 21:51

Workflow

MariaDB v3 [ 110893 ]

MariaDB v4 [ 158061 ]

Jira Automation (IT) made changes - 2024-07-04 03:03

Zendesk Related Tickets

177484 152496

People

Assignee:: Jan Lindström (Inactive)

Reporter:: Rick Pizzi (Inactive)

Votes:: 5 Vote for this issue

Watchers:: 16 Start watching this issue

Dates

Created:: 2020-07-06 10:42

Updated:: 2024-07-07 21:58

Resolved:: 2020-09-10 10:19

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration