[MDEV-23101] SIGSEGV in lock_rec_unlock() when Galera is enabled Created: 2020-07-06  Updated: 2021-04-19  Resolved: 2020-09-10

Status: Closed
Project: MariaDB Server
Component/s: Galera, Server
Affects Version/s: 10.1, 10.4.13, 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.2.35, 10.3.26, 10.4.16, 10.5.7, 10.6.0

Type: Bug Priority: Critical
Reporter: Rick Pizzi Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 5
Labels: None

Attachments: Text File thd.txt, Text File trx.txt
Issue Links:
Problem/Incident
causes MDEV-23897 ASAN SEGV in lock_grant_and_move_on_page Closed
Relates
relates to MDEV-23186 mysqld doesn't create core dump if cr... Closed
relates to MDEV-23409 Galera node with role master crashes ... Closed
relates to MDEV-14846 InnoDB: assertion on trx->state becau... Closed
relates to MDEV-20666 Assertion failures prebuilt->sql_stat... Stalled
relates to MDEV-21910 KIlling thread on Galera could cause ... Closed

 Description   

The server is crashing a few times a day with the following stack trace:

/lib64/libc.so.6(+0x115b62)[0x7fcadad58b62]
/lib64/libc.so.6(+0x117947)[0x7fcadad5a947]
/usr/sbin/mysqld(my_addr_resolve+0xda)[0x55f2dc88214a]
/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x55f2dc86a862]
/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x55f2dc300f7f]
/lib64/libpthread.so.0(+0xf5d0)[0x7fcadc9a65d0]
/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]
/usr/sbin/mysqld(+0xa9d47e)[0x55f2dc4ee47e]
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]
/usr/sbin/mysqld(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x2110)[0x55f2dc1a5c60]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2ff5)[0x55f2dc0fcd15]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x1fb)[0x55f2dc1028ab]
/usr/sbin/mysqld(+0x5a0090)[0x55f2dbff1090]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x28f8)[0x55f2dc1059d8]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x109)[0x55f2dc106049]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1fa)[0x55f2dc1e3bfa]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55f2dc1e3cdd]
/usr/sbin/mysqld(+0xdcb78d)[0x55f2dc81c78d]
/lib64/libpthread.so.0(+0x7dd5)[0x7fcadc99edd5]
/lib64/libc.so.6(clone+0x6d)[0x7fcadad40ead]

Top 3 frames, resolved:

/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908] lock_rec_has_to_wait_in_queue(ib_lock_t const*) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79).
/usr/sbin/mysqld(+0xa9d47e)[0x55f2dc4ee47e] lock_rec_unlock(trx_t*, buf_block_t const*, unsigned char const*, lock_mode) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214).
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626] row_unlock_for_mysql(row_prebuilt_t*, unsigned long) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2093).

We tried to set up core dump generation, but it appears the server crashes again while accessing memory during the backtrace, and therefore no core dump is generated:

200703 13:10:16 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.4.13-MariaDB-log
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=10223
max_threads=20002
thread_count=8406
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 123517134 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7faf48086b98
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7facc4190cc0 thread_stack 0x49000
*** buffer overflow detected ***: /usr/sbin/mysqld terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7fcadad5a9e7]
/lib64/libc.so.6(+0x115b62)[0x7fcadad58b62]
/lib64/libc.so.6(+0x117947)[0x7fcadad5a947]
/usr/sbin/mysqld(my_addr_resolve+0xda)[0x55f2dc88214a]
/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x55f2dc86a862]
/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x55f2dc300f7f]
/lib64/libpthread.so.0(+0xf5d0)[0x7fcadc9a65d0]
/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]
/usr/sbin/mysqld(+0xa9d47e)[0x55f2dc4ee47e]
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]
/usr/sbin/mysqld(_Z12mysql_updateP3THDP10TABLE_LISTR4ListI4ItemES6_PS4_jP8st_orderybPySA_+0x2110)[0x55f2dc1a5c60]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x2ff5)[0x55f2dc0fcd15]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x1fb)[0x55f2dc1028ab]
/usr/sbin/mysqld(+0x5a0090)[0x55f2dbff1090]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x28f8)[0x55f2dc1059d8]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x109)[0x55f2dc106049]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1fa)[0x55f2dc1e3bfa]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55f2dc1e3cdd]
/usr/sbin/mysqld(+0xdcb78d)[0x55f2dc81c78d]
/lib64/libpthread.so.0(+0x7dd5)[0x7fcadc99edd5]
/lib64/libc.so.6(clone+0x6d)[0x7fcadad40ead]
======= Memory map: ========
55f2dba51000-55f2dceca000 r-xp 00000000 fd:00 436210356                  /usr/sbin/mysqld
55f2dd0ca000-55f2dd1fd000 r--p 01479000 fd:00 436210356                  /usr/sbin/mysqld
55f2dd1fd000-55f2dd2b7000 rw-p 015ac000 fd:00 436210356                  /usr/sbin/mysqld
55f2dd2b7000-55f2ddb66000 rw-p 00000000 00:00 0 
55f2de4cb000-55f348284000 rw-p 00000000 00:00 0                          [heap]
7fabcc000000-7fabcdb8a000 rw-p 00000000 00:00 0 
 
[ ... ]
 
7fcadcdd6000-7fcadcdd7000 rw-p 00000000 00:00 0 
7fff76cfc000-7fff76d1d000 rw-p 00000000 00:00 0                          [stack]
7fff76d84000-7fff76d86000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
Fatal signal 6 while backtracing



 Comments   
Comment by Rick Pizzi [ 2020-07-06 ]

Resolving the last 3 levels of stack trace:

/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]
lock_rec_has_to_wait_in_queue(ib_lock_t const*) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79).
 
/usr/sbin/mysqld(+0xa9d47e)[0x55f2dc4ee47e]
lock_rec_unlock(trx_t*, buf_block_t const*, unsigned char const*, lock_mode) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214).
 
 
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]
row_unlock_for_mysql(row_prebuilt_t*, unsigned long) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2093).

Comment by Elena Stepanova [ 2020-07-06 ]

The closest that we have among open issues seems to be MDEV-14846 (and consequently MDEV-20666 which MDEV-14846 links itself to). Both of them are only known to cause problems on debug builds, but it doesn't mean that the underlying cause can't affect non-debug ones.

Comment by Oleksandr Byelkin [ 2020-07-07 ]

rpizzi, if it crashes repeatedly, where is the test suite?

Comment by Rick Pizzi [ 2020-07-07 ]

This is a production system of an RDBA customer that is crashing. We were unable to reproduce the crash, especially because the cause is not known.

Comment by Oleksandr Byelkin [ 2020-07-07 ]

Ah, the unresolved part of that stack trace is in InnoDB, so it looks like an InnoDB crash.

Comment by Oleksandr Byelkin [ 2020-07-07 ]

The UPDATE processing with ORDER BY is very strange; what do they do at startup?

Comment by Marko Mäkelä [ 2020-07-14 ]

It seems that lock is an invalid pointer:

                for (lock = first_lock; lock != NULL;
                         lock = lock_rec_get_next(heap_no, lock)) {
                        if (lock_get_wait(lock)
                                && !lock_rec_has_to_wait_in_queue(lock)) {

The signal is sent in lock_rec_has_to_wait_in_queue(lock), which appears to be inlined in the stack trace:

	for (lock = lock_rec_get_first_on_page_addr(hash, space, page_no);
	     lock != wait_lock;
	     lock = lock_rec_get_next_on_page_const(lock)) {
 
		const byte*	p = (const byte*) &lock[1];
 
		if (heap_no < lock_rec_get_n_bits(lock)

We would fail to stop the iteration if we never find wait_lock. Invoking lock_rec_get_n_bits(lock=NULL) would certainly cause the SIGSEGV.

rpizzi, does the crash involve Galera, or innodb_lock_schedule_algorithm=VATS? The root cause of this failure should be an invalid call to unlock a row. It might also involve some complex join. Unlocking rows ("semi-consistent reads") was originally introduced in MySQL Bug #3300.

Comment by Rick Pizzi [ 2020-07-14 ]

Yes, this is a Galera node; the crash always happens on the node that has the master role, marko

Comment by Marko Mäkelä [ 2020-07-14 ]

As far as I can tell, lock_rec_unlock() is continuously holding lock_sys->mutex, which should protect the creation and destruction of explicit record locks. The most plausible explanation for the crash is the following code in lock_rec_has_to_wait_in_queue(lock), which causes us to skip wait_lock and reach the end of the list:

	for (lock = lock_rec_get_first_on_page_addr(hash, space, page_no);
	     lock != wait_lock;
	     lock = lock_rec_get_next_on_page_const(lock)) {
 
		const byte*	p = (const byte*) &lock[1];
 
		if (heap_no < lock_rec_get_n_bits(lock)
		    && (p[bit_offset] & bit_mask)
		    && lock_has_to_wait(wait_lock, lock)) {
#ifdef WITH_WSREP
			if (wsrep_thd_is_BF(wait_lock->trx->mysql_thd, FALSE) &&
			    wsrep_thd_is_BF(lock->trx->mysql_thd, TRUE)) {
				if (UNIV_UNLIKELY(wsrep_debug)) {
					mtr_t mtr;
					ib::info() << "WSREP: waiting BF trx: " << ib::hex(wait_lock->trx->id)
						   << " query: " << wsrep_thd_query(wait_lock->trx->mysql_thd);
					lock_rec_print(stderr, wait_lock, mtr);
					ib::info() << "WSREP: do not wait another BF trx: " << ib::hex(lock->trx->id)
						   << " query: " << wsrep_thd_query(lock->trx->mysql_thd);
					lock_rec_print(stderr, lock, mtr);
				}
				/* don't wait for another BF lock */
				continue;
			}
#endif /* WITH_WSREP */
 
			return(lock);
		}
	}
 
	return(NULL);

rpizzi, do you see those WSREP messages in the server error log right before the crash?

It might be that this bug would be addressed by fixing MDEV-21910, or that the above WITH_WSREP code is trying to work around MDEV-21910. In any case, I think that the code is misplaced, and the code should be moved to the relevant caller of lock_rec_has_to_wait_in_queue().

The problematic code is already present in 10.1 (which is the first major version where the WITH_WSREP changes are part of the main server).

Comment by Rick Pizzi [ 2020-07-14 ]

No, we did not see that error, but maybe it requires wsrep_debug = ON. I will try to enable it.

Comment by Marko Mäkelä [ 2020-07-15 ]

I believe that this code should be disabled by default. However, until MDEV-19563 was addressed, Galera recommended setting the deprecated parameter innodb_locks_unsafe_for_binlog. lock_rec_unlock() should only be effective if ‘semi-consistent reads’ are enabled (when that deprecated parameter is set, or the transaction isolation level is READ UNCOMMITTED or READ COMMITTED).

Comment by Jan Lindström (Inactive) [ 2020-07-15 ]

In the Galera case, that piece of code should only be reached if we have two transactions running on an applier node, making them both Brute Force (BF), as there are wsrep_thd_is_BF conditions for both locks. I'm surprised if the error log does not have anything else about this situation.

Comment by Rick Pizzi [ 2020-07-15 ]

So get surprised: nothing in the logs.

Comment by Rick Pizzi [ 2020-07-16 ]

We need to run the server with the already existing option to skip backtraces. I have investigated the source code, and in fact that is where it crashes the second time, so using --skip-stack-trace should avoid that.

Comment by Jan Lindström (Inactive) [ 2020-07-20 ]

rpizzi Does the problem repeat with --wsrep-debug=ON? That setting has some negative effect on performance, but I think we need the error log with it to narrow down the search.

Comment by Rick Pizzi [ 2020-07-20 ]

jplindst the crash has not happened for about 2 weeks now (without any change that we are aware of).
We have disabled backtracing in the hope of getting a core dump. wsrep_debug in Galera 4 is far too verbose and can't be enabled for customers with a busy workload like this one, as it produces gigabytes of logs per hour. If we get a core, it's not important anyway.

Thanks

Rick

Comment by Jan Lindström (Inactive) [ 2020-07-29 ]

I was trying to understand why we would not want to wait for another BF transaction's lock in the lock_rec_has_to_wait_in_queue() function. However, I could not find any reason; if something like that happens, there is no option except to wait for your turn. I even tested it by removing the code and running the regression test suite, and could not find a single regression. For some reason there are more complex conditions in e.g. the MySQL 8.0 version of this code.

Comment by Roel Van de Paar [ 2020-08-04 ]

Regarding the *** buffer overflow detected *** message: if it is possible to use an ASAN-instrumented binary for some time (for example on a clone/binlog slave, or even on a low-use production machine), then we may be able to more easily see the bug show up in the error log.

Comment by Rick Pizzi [ 2020-08-04 ]

Roel, this is what we (RDBA Team) have been asking for since the beginning.
We can certainly run a special binary as long as it is production quality; the only issue is that the crash has not happened for 2 weeks (it was happening almost daily before). We did not change anything, and neither did the customer.

Rick

Comment by Jan Lindström (Inactive) [ 2020-08-05 ]

This confirms that we could end up referencing a NULL pointer in lock_rec_get_n_bits() because of a bug in lock_rec_has_to_wait_in_queue() that is fixed in a8ec45863b958757da61af3b2ce0a38b0a79d92c for 10.2; it should be merged to the other releases.

Comment by Marko Mäkelä [ 2020-08-05 ]

valerii posted information that confirms that the READ COMMITTED isolation level was being used: trx->isolation_level=1=TRX_ISO_READ_COMMITTED. That is the designed way of enabling semi-consistent reads for UPDATE operations. The idea of MySQL Bug #3300 is to unlock non-matching rows in a table scan.

Setting the isolation level to the default REPEATABLE READ should avoid hitting this Galera bug.

jplindst must debug this in order to create a repeatable test case. It is not enough to prevent a crash when Galera is enabled; we must also ensure that all locks are being replicated correctly across the cluster.

Comment by Marko Mäkelä [ 2020-08-07 ]

I checked all assignments to trx_t::isolation_level in MariaDB 10.4.13, and it looks like the only way for InnoDB to use READ COMMITTED is when thd_tx_isolation() returns ISO_READ_COMMITTED. Outside the SQL parser, the only assignments that refer to that constant are related to Galera:

git grep -w ISO_READ_COMMITTED mariadb-10.4.13

mariadb-10.4.13:sql/wsrep_high_priority_service.cc:  thd->variables.tx_isolation= ISO_READ_COMMITTED;
mariadb-10.4.13:sql/wsrep_high_priority_service.cc:  thd->tx_isolation          = ISO_READ_COMMITTED;
mariadb-10.4.13:sql/wsrep_schema.cc:  thd->variables.tx_isolation= ISO_READ_COMMITTED;
mariadb-10.4.13:sql/wsrep_sst.cc:    thd->variables.tx_isolation= ISO_READ_COMMITTED;
mariadb-10.4.13:sql/wsrep_storage_service.cc:  thd->variables.tx_isolation = ISO_READ_COMMITTED;

Based on this finding, it might be that this bug cannot be worked around.

Comment by Jan Lindström (Inactive) [ 2020-08-08 ]

Analysis:

  • wsrep_high_priority_service : used only in applier threads (i.e. brute force)
  • wsrep_schema : used only when a node starts from an empty database
  • wsrep_sst : used on the joiner during SST only
  • wsrep_storage_service : used for streaming replication

There is no indication that any of these are involved in this case.

Comment by Jan Lindström (Inactive) [ 2020-08-12 ]

I analyzed the full stack dump and found the following points:

Thread 9222 (Thread 0x7f694c3bf700 (LWP 54589)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158
#2  wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325
#3  os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502
#4  0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f694c3ba7e0: 0x7f858c6db810) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471
#5  0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=1937) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477
#6  0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=n_spins@entry=30, n_delay=4, line=line@entry=1937, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594
#7  0x000055dee6f2e051 in lock_rec_lock (impl=impl@entry=false, mode=1027, block=0x7f7fa2f61118, heap_no=55, index=index@entry=0x7f6f5009db78, thr=thr@entry=0x7f6b88779e48) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:1937
#8  0x000055dee6f2ebc9 in lock_clust_rec_read_check_and_lock (flags=flags@entry=0, block=<optimized out>, rec=rec@entry=0x7f7fa47e5e4a <Address 0x7f7fa47e5e4a out of bounds>, index=index@entry=0x7f6f5009db78, offsets=<optimized out>, mode=<optimized out>, gap_mode=1024, thr=thr@entry=0x7f6b88779e48) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:5829
#9  0x000055dee6a3d5a0 in sel_set_rec_lock (pcur=<optimized out>, rec=0x7f7fa47e5e4a <Address 0x7f7fa47e5e4a out of bounds>, index=0x7f6f5009db78, offsets=<optimized out>, mode=<optimized out>, type=<optimized out>, thr=0x7f6b88779e48, mtr=0x7f694c3bb870) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0sel.cc:1264
#10 0x000055dee6fc99c9 in row_search_mvcc (buf=<optimized out>, mode=PAGE_CUR_G, prebuilt=0x7f6b88779628, match_mode=<optimized out>, direction=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0sel.cc:5063
#11 0x000055dee6ee91d7 in ha_innobase::general_fetch (this=0x7f6b883c3500, buf=0x7f6b886e9ea0 "\365)\216\005", direction=1, match_mode=0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/handler/ha_innodb.cc:9544
#12 0x000055dee6d3f411 in handler::ha_rnd_next (this=0x7f6b883c3500, buf=0x7f6b886e9ea0 "\365)\216\005") at /usr/src/debug/MariaDB-10.4.13/src_0/sql/handler.cc:2839
#13 0x000055dee6e6da6c in rr_sequential (info=0x7f694c3bc020) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/records.cc:477
#14 0x000055dee6bdea68 in read_record (this=0x7f694c3bc020) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/records.h:69
#15 mysql_update (thd=thd@entry=0x7f6ef8305878, table_list=0x7f6ef8313f00, fields=..., values=..., conds=0x7f6ef8314d50, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f694c3bc560, updated_return=updated_return@entry=0x7f694c3bc620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:975
#16 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f6ef8305878) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376
#17 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f6ef8305878, rawbuf=rawbuf@entry=0x7f6ef8313ba0 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f694c3be350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900

This is a local UPDATE waiting for lock_sys->mutex in lock_rec_lock().

Thread 6699 (Thread 0x7f696e127700 (LWP 715)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158
#2  wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325
#3  os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502
#4  0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f696e123460: 0x7f858c6db190) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471
#5  0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=4170) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477
#6  0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=30, n_delay=4, line=line@entry=4170, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594
#7  0x000055dee6f283a4 in lock_rec_unlock (trx=trx@entry=0x7f6bd00c9148, block=0x7f7fa2f5c0f0, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4170
#8  0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6f882010b8, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091
#9  0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x7f704c285228, table_list=0x7f704c27b190, fields=..., values=..., conds=0x7f704c27bfe0, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f696e124560, updated_return=updated_return@entry=0x7f696e124620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168
#10 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f704c285228) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376
#11 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f704c285228, rawbuf=rawbuf@entry=0x7f704c27ae30 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f696e126350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900

Here a local UPDATE is waiting for lock_sys->mutex in lock_rec_unlock().

Thread 3484 (Thread 0x7f6968817700 (LWP 54282)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158
#2  wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325
#3  os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502
#4  0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f6968813460: 0x7f858c6db410) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471
#5  0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=4170) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477
#6  0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=30, n_delay=4, line=line@entry=4170, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594
#7  0x000055dee6f283a4 in lock_rec_unlock (trx=trx@entry=0x7f6ea427bdd8, block=0x7f7fa2f5d608, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4170
#8  0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6ab8578ec8, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091
#9  0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x55df52f78a68, table_list=0x55df52f87930, fields=..., values=..., conds=0x55df52f88780, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f6968814560, updated_return=updated_return@entry=0x7f6968814620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168
#10 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x55df52f78a68) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376
#11 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x55df52f78a68, rawbuf=rawbuf@entry=0x55df52f875d0 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f6968816350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900

This is also a local UPDATE waiting for lock_sys->mutex in lock_rec_unlock(). Finally, the crashing thread:

Thread 1 (Thread 0x7f6a24847700 (LWP 423789)):
#0  0x00007f884411e9d1 in __pthread_kill (threadid=<optimized out>, signo=11) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:59
#1  0x000055dee6d3affe in handle_fatal_signal (sig=11) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/signal_handler.cc:343
#2  <signal handler called>
#3  lock_rec_get_n_bits (lock=0x0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79
#4  lock_rec_has_to_wait_in_queue (wait_lock=wait_lock@entry=0x7f80d12e3210) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:2050
#5  0x000055dee6f2847e in lock_rec_unlock (trx=trx@entry=0x7f858c6a69b8, block=<optimized out>, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214
#6  0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6f50094958, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091
#7  0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x7f6d8c0921d8, table_list=0x7f6d8c0a17c0, fields=..., values=..., conds=0x7f6d8c0a2610, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f6a24844560, updated_return=updated_return@entry=0x7f6a24844620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168
#8  0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f6d8c0921d8) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376
#9  0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f6d8c0921d8, rawbuf=rawbuf@entry=0x7f6d8c0a1460 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f6a24846350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900

This thread finds a NULL pointer in the lock variable. Analysis:

  • The customer is using a single-master Galera topology. There is no indication of any applier threads running during the crash.
  • The transaction isolation level of the crashing trx is READ COMMITTED.
  • There is no indication of Galera replay.
  • As this is single-master, the condition in lock_rec_has_to_wait_in_queue(), i.e.

    			if (wsrep_thd_is_BF(wait_lock->trx->mysql_thd, FALSE) &&
    			    wsrep_thd_is_BF(lock->trx->mysql_thd, TRUE)) {
    

    is always false.

  • As this is the master node, this is not a lock replication problem.
  • At this point I do not see any evidence that this is a Galera bug.
  • The listed threads are the only ones that are interesting at the crash point.

Comment by Jan Lindström (Inactive) [ 2020-08-18 ]

https://github.com/MariaDB/server/commit/fb2fc051004d11b4aeea81683596f505b58cab23

This is only the requested condition move, with additional instrumentation for when something goes wrong. I have failed to reproduce the actual case using RQG, mtr, and sysbench. At the moment I do not have good ideas on how to continue. One possibility is to give a special build to the customer and see if we still hit the same bug. At least in that case we would have additional information in the error log to analyze.

Comment by Marko Mäkelä [ 2020-08-24 ]

Some code cleanup in this area is welcome, but I think that more effort is needed to reproduce this error.

Please revise the patch and request a further review.

When it comes to the suspicion that the loop would fail to terminate without Galera involvement, I think that it is extremely unlikely. The function lock_rec_unlock() would acquire lock_sys.mutex and trx_t::mutex, find a matching lock, and proceed to release the lock while holding those mutexes. The lock cannot possibly disappear while we are holding those mutexes. Besides, locks should only be released by the thread that is executing the transaction, not by other threads. Other threads may only convert implicit locks into explicit ones, on behalf of the lock-holder transaction. That activity is blocked by both said mutexes.

Comment by Jan Lindström (Inactive) [ 2020-08-27 ]

https://github.com/MariaDB/server/commit/c2c4a77004e38bcdb735c330b5b9529c625b372d

I have run a lot of randgen runs on both release and debug builds, but still failed to reproduce this problem (I have found some other problems and created MDEVs for them).

Comment by Marko Mäkelä [ 2020-09-02 ]

This looks mostly OK now. Please revert some unnecessary changes (in particular, the change that would have made lock_trx_handle_wait() dead code), and request one more review.

Comment by Jan Lindström (Inactive) [ 2020-09-07 ]
Comment by Rick Pizzi [ 2020-09-07 ]

jplindst the customer has been running the debug build since last week. We will try to trigger the crash again. Usually it appears a few hours after some DDL has been run.

Comment by Jan Lindström (Inactive) [ 2020-09-07 ]

There is a bigger difference between 10.3 and 10.4 (Galera 4), but all of that is in server code.

Comment by Marko Mäkelä [ 2020-09-09 ]

I posted a few minor comments on the 10.2 version. OK to push to 10.2 after addressing those. I will use the other versions for reference when merging to later branches.
