|
Resolving the last 3 levels of stack trace:
/usr/sbin/mysqld(+0xa9a908)[0x55f2dc4eb908]
|
lock_rec_has_to_wait_in_queue(ib_lock_t const*) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79).
|
|
/usr/sbin/mysqld(0xa9d47e)[0x55f2dc4ee47e]
|
lock_rec_unlock(trx_t*, buf_block_t const*, unsigned char const*, lock_mode) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214).
|
|
|
/usr/sbin/mysqld(+0xb22626)[0x55f2dc573626]
|
row_unlock_for_mysql(row_prebuilt_t*, unsigned long) (/usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2093).
|
|
|
|
The closest that we have among open issues seems to be MDEV-14846 (and consequently MDEV-20666 which MDEV-14846 links itself to). Both of them are only known to cause problems on debug builds, but it doesn't mean that the underlying cause can't affect non-debug ones.
|
|
rpizzi if it crashes repeatadly where is test suite?
|
|
This is a production system crashing, of a customer of RDBA. We were unable to reproduce the crash, especially because the cause is not known.
|
|
Ah, that stack trace in innodb is a part which was unresolved
so it looks like innodb crash
|
|
update processing order by very strange, what they do at startup?
|
|
It seems that lock is an invalid pointer:
for (lock = first_lock; lock != NULL;
|
lock = lock_rec_get_next(heap_no, lock)) {
|
if (lock_get_wait(lock)
|
&& !lock_rec_has_to_wait_in_queue(lock)) {
|
The signal is sent in lock_rec_has_to_wait_in_queue(lock), which appears to be inlined in the stack trace:
for (lock = lock_rec_get_first_on_page_addr(hash, space, page_no);
|
lock != wait_lock;
|
lock = lock_rec_get_next_on_page_const(lock)) {
|
|
const byte* p = (const byte*) &lock[1];
|
|
if (heap_no < lock_rec_get_n_bits(lock)
|
We would fail to stop the iteration if we never find wait_lock. Invoking lock_rec_get_n_bits(lock=NULL) would certainly cause the SIGSEGV.
rpizzi, Does the crash involve Galera? Or innodb_lock_schedule_algorithm=VATS? The root cause of this failure should be an invalid call to unlock a row. It might also involve some complex join. Unlocking rows ("semi-consistent reads") was originally introduced in MySQL Bug #3300.
|
|
Yes this is a galera node, crash always happens on the node that has the master role marko
|
|
As far as I can tell, lock_rec_unlock() is continuously holding lock_sys->mutex, which should protect the creation and destruction of explicit record locks. The most plausible explanation for the crash is the following code in lock_rec_has_to_wait_in_queue(lock), which causes us to skip wait_lock and reach the end of the list:
for (lock = lock_rec_get_first_on_page_addr(hash, space, page_no);
|
lock != wait_lock;
|
lock = lock_rec_get_next_on_page_const(lock)) {
|
|
const byte* p = (const byte*) &lock[1];
|
|
if (heap_no < lock_rec_get_n_bits(lock)
|
&& (p[bit_offset] & bit_mask)
|
&& lock_has_to_wait(wait_lock, lock)) {
|
#ifdef WITH_WSREP
|
if (wsrep_thd_is_BF(wait_lock->trx->mysql_thd, FALSE) &&
|
wsrep_thd_is_BF(lock->trx->mysql_thd, TRUE)) {
|
if (UNIV_UNLIKELY(wsrep_debug)) {
|
mtr_t mtr;
|
ib::info() << "WSREP: waiting BF trx: " << ib::hex(wait_lock->trx->id)
|
<< " query: " << wsrep_thd_query(wait_lock->trx->mysql_thd);
|
lock_rec_print(stderr, wait_lock, mtr);
|
ib::info() << "WSREP: do not wait another BF trx: " << ib::hex(lock->trx->id)
|
<< " query: " << wsrep_thd_query(lock->trx->mysql_thd);
|
lock_rec_print(stderr, lock, mtr);
|
}
|
/* don't wait for another BF lock */
|
continue;
|
}
|
#endif /* WITH_WSREP */
|
|
return(lock);
|
}
|
}
|
|
return(NULL);
|
rpizzi, do you see those WSREP messages in the server error log right before the crash?
It might be that this bug would be addressed by fixing MDEV-21910, or that the above WITH_WSREP code is trying to work around MDEV-21910. In any case, I think that the code is misplaced, and the code should be moved to the relevant caller of lock_rec_has_to_wait_in_queue().
The problematic code is already present in 10.1 (which is the first major version where the WITH_WSREP changes are part of the main server).
|
|
No we did not see that error but maybe it requires wsrep_debug = ON. I will try and enable it.
|
|
I believe that this code should be disabled by default. However, until MDEV-19563 was addressed, Galera was recommending setting the deprecated parameter innodb_locks_unsafe_for_binlog. The lock_rec_unlock() should only be effective if ‘semi-consistent reads’ are enabled (when that deprecated parameter is set, or the transaction isolation level is READ UNCOMMITTED or READ COMMITTED).
|
|
In Galera case it should go that piece of code only if we have two transactions running on applier node making them both Brute Force (BF) as there is wsrep_thd_is_BF conditions for both locks. I'm surprised if error log does not have anything else about this situation.
|
|
So get surprised Nothing in logs.
|
|
We need to run the server with the already existing option to skip backtraces. I have investigated the source code and in fact that's where it crashes the 2nd time, so using --skip-stack-trace should avoid that.
|
|
rpizzi Does the problem repeat with --wsrep-debug=ON ? That setting has some negative effect on performance but I think we need that error log with it to narrow down the search.
|
|
jplindst the crash did not happen since about 2 weeks now (without any change that we are aware of).
We have disabled backtracing in the hope to get a core dump. wsrep_debug in galera 4 is too much verbose and can't be enabled with customers which have a busy workload like in this case, as it produces gigabytes of logs per hour. If we get a core it's not important anyway.
Thanks
Rick
|
|
I was trying to understand why we would not want to wait another BF transaction lock in lock_rec_has_to_wait_in_queue() function. However, I could not find any reason, if something like that happens there is no option except wait for your turn. I even tested it by removing the code and running regression test suite and could not find a single regression. For some reason there is more complex conditions on e.g. MySQL 8.0 version of this code.
|
|
*** buffer overflow detected ***
|
: If it is possible to use an ASAN instrumented binary for some time (for example on a clone/binlog slave, or even on a low-use production machine) then we may be able to more easily see the bug show up the error log.
|
|
Roelthis is what we (RDBA Team) have been asking since the beginning.
We can certainly run a special binary as long as it is production quality, the only issue is the crash is not happening anymore since 2 weeks (it was happening almost daily before). We did not change anything and neither did the customer.
Rick
|
|
This confirms that we could end up referencing NULL pointer on lock_rec_get_n_bits because of bug on lock_rec_has_to_wait_in_queue that is fixed a8ec45863b958757da61af3b2ce0a38b0a79d92c for 10.2 and it should be merged to other releases.
|
|
valerii posted information that confirms that the READ COMMITTED isolation level was being used: trx->isolation_level=1=TRX_ISO_READ_COMMITTED. It is the designed way of enabling the semi-consistent read for UPDATE operations. The idea of MySQL Bug #3300 is to unlock non-matching rows in a table scan.
Setting the isolation level to the default REPEATABLE READ should avoid hitting this Galera bug.
jplindst must debug this in order to create a repeatable test case. It is not enough to prevent a crash when Galera is enabled; we must also ensure that all locks are being replicated correctly across the cluster.
|
|
I checked all assignments to trx_t::isolation_level in MariaDB 10.4.13, and it looks like the only way for InnoDB to use READ COMMITTED is when thd_tx_isolation() returns ISO_READ_COMMITTED. Outside the SQL parser, the only assignments that refer to that constant are related to Galera:
|
git grep -w ISO_READ_COMMITTED mariadb-10.4.13
|
mariadb-10.4.13:sql/wsrep_high_priority_service.cc: thd->variables.tx_isolation= ISO_READ_COMMITTED;
|
mariadb-10.4.13:sql/wsrep_high_priority_service.cc: thd->tx_isolation = ISO_READ_COMMITTED;
|
mariadb-10.4.13:sql/wsrep_schema.cc: thd->variables.tx_isolation= ISO_READ_COMMITTED;
|
mariadb-10.4.13:sql/wsrep_sst.cc: thd->variables.tx_isolation= ISO_READ_COMMITTED;
|
mariadb-10.4.13:sql/wsrep_storage_service.cc: thd->variables.tx_isolation = ISO_READ_COMMITTED;
|
Based on this finding, it might be that this bug cannot be worked around.
|
|
Analysis:
- wsrep_high_priority_service : Used only on applier threads (i.e. brute force)
- wsrep_schema : Used only when node starts from empty database
- wsrep_sst : Used on joiner during SST only
- wsrep_storage_service : Used for streaming replication
There is no indication that these are involved on this case.
|
|
I made analysis from full stack dump and found out following points:
Thread 9222 (Thread 0x7f694c3bf700 (LWP 54589)):
|
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
|
#1 0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158
|
#2 wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325
|
#3 os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502
|
#4 0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f694c3ba7e0: 0x7f858c6db810) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471
|
#5 0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=1937) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477
|
#6 0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=n_spins@entry=30, n_delay=4, line=line@entry=1937, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594
|
#7 0x000055dee6f2e051 in lock_rec_lock (impl=impl@entry=false, mode=1027, block=0x7f7fa2f61118, heap_no=55, index=index@entry=0x7f6f5009db78, thr=thr@entry=0x7f6b88779e48) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:1937
|
#8 0x000055dee6f2ebc9 in lock_clust_rec_read_check_and_lock (flags=flags@entry=0, block=<optimized out>, rec=rec@entry=0x7f7fa47e5e4a <Address 0x7f7fa47e5e4a out of bounds>, index=index@entry=0x7f6f5009db78, offsets=<optimized out>, mode=<optimized out>, gap_mode=1024, thr=thr@entry=0x7f6b88779e48) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:5829
|
#9 0x000055dee6a3d5a0 in sel_set_rec_lock (pcur=<optimized out>, rec=0x7f7fa47e5e4a <Address 0x7f7fa47e5e4a out of bounds>, index=0x7f6f5009db78, offsets=<optimized out>, mode=<optimized out>, type=<optimized out>, thr=0x7f6b88779e48, mtr=0x7f694c3bb870) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0sel.cc:1264
|
#10 0x000055dee6fc99c9 in row_search_mvcc (buf=<optimized out>, mode=PAGE_CUR_G, prebuilt=0x7f6b88779628, match_mode=<optimized out>, direction=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0sel.cc:5063
|
#11 0x000055dee6ee91d7 in ha_innobase::general_fetch (this=0x7f6b883c3500, buf=0x7f6b886e9ea0 "\365)\216\005", direction=1, match_mode=0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/handler/ha_innodb.cc:9544
|
#12 0x000055dee6d3f411 in handler::ha_rnd_next (this=0x7f6b883c3500, buf=0x7f6b886e9ea0 "\365)\216\005") at /usr/src/debug/MariaDB-10.4.13/src_0/sql/handler.cc:2839
|
#13 0x000055dee6e6da6c in rr_sequential (info=0x7f694c3bc020) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/records.cc:477
|
#14 0x000055dee6bdea68 in read_record (this=0x7f694c3bc020) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/records.h:69
|
#15 mysql_update (thd=thd@entry=0x7f6ef8305878, table_list=0x7f6ef8313f00, fields=..., values=..., conds=0x7f6ef8314d50, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f694c3bc560, updated_return=updated_return@entry=0x7f694c3bc620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:975
|
#16 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f6ef8305878) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376
|
#17 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f6ef8305878, rawbuf=rawbuf@entry=0x7f6ef8313ba0 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f694c3be350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900
|
This is local UPDATE waiting lock_sys->mutex in lock_rec_lock().
Thread 6699 (Thread 0x7f696e127700 (LWP 715)):
|
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
|
#1 0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158
|
#2 wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325
|
#3 os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502
|
#4 0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f696e123460: 0x7f858c6db190) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471
|
#5 0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=4170) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477
|
#6 0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=30, n_delay=4, line=line@entry=4170, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594
|
#7 0x000055dee6f283a4 in lock_rec_unlock (trx=trx@entry=0x7f6bd00c9148, block=0x7f7fa2f5c0f0, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4170
|
#8 0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6f882010b8, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091
|
#9 0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x7f704c285228, table_list=0x7f704c27b190, fields=..., values=..., conds=0x7f704c27bfe0, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f696e124560, updated_return=updated_return@entry=0x7f696e124620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168
|
#10 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f704c285228) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376
|
#11 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f704c285228, rawbuf=rawbuf@entry=0x7f704c27ae30 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f696e126350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900
|
Here local UPDATE waiting lock_sys->mutex on lock_rec_unlock().
Thread 3484 (Thread 0x7f6968817700 (LWP 54282)):
|
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
|
#1 0x000055dee6f5c561 in wait (this=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:158
|
#2 wait_low (reset_sig_count=<optimized out>, this=0x55df51c9c5d0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:325
|
#3 os_event_wait_low (event=0x55df51c9c5d0, reset_sig_count=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/os/os0event.cc:502
|
#4 0x000055dee6feddb8 in sync_array_wait_event (arr=0x55deea37b190, cell=@0x7f6968813460: 0x7f858c6db410) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/sync/sync0arr.cc:471
|
#5 0x000055dee6f1525f in TTASEventMutex<GenericPolicy>::enter (this=this@entry=0x55dee7d30280 <lock_sys+64>, max_spins=60, max_spins@entry=30, max_delay=max_delay@entry=4, filename=filename@entry=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", line=line@entry=4170) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:477
|
#6 0x000055dee6f2750b in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (n_spins=30, n_delay=4, line=line@entry=4170, name=0x55dee74aa480 "/home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc", this=0x55dee7d30280 <lock_sys+64>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/ib0mutex.h:594
|
#7 0x000055dee6f283a4 in lock_rec_unlock (trx=trx@entry=0x7f6ea427bdd8, block=0x7f7fa2f5d608, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4170
|
#8 0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6ab8578ec8, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091
|
#9 0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x55df52f78a68, table_list=0x55df52f87930, fields=..., values=..., conds=0x55df52f88780, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f6968814560, updated_return=updated_return@entry=0x7f6968814620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168
|
#10 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x55df52f78a68) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376
|
#11 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x55df52f78a68, rawbuf=rawbuf@entry=0x55df52f875d0 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f6968816350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900
|
This is also local UPDATE waiting for lock_sys->mutex in lock_rec_unlock(). Finally, the crashing thread:
Thread 1 (Thread 0x7f6a24847700 (LWP 423789)):
|
#0 0x00007f884411e9d1 in __pthread_kill (threadid=<optimized out>, signo=11) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:59
|
#1 0x000055dee6d3affe in handle_fatal_signal (sig=11) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/signal_handler.cc:343
|
#2 <signal handler called>
|
#3 lock_rec_get_n_bits (lock=0x0) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/include/lock0priv.ic:79
|
#4 lock_rec_has_to_wait_in_queue (wait_lock=wait_lock@entry=0x7f80d12e3210) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:2050
|
#5 0x000055dee6f2847e in lock_rec_unlock (trx=trx@entry=0x7f858c6a69b8, block=<optimized out>, rec=<optimized out>, lock_mode=LOCK_X) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/lock/lock0lock.cc:4214
|
#6 0x000055dee6fad626 in row_unlock_for_mysql (prebuilt=0x7f6f50094958, has_latches_on_recs=<optimized out>) at /usr/src/debug/MariaDB-10.4.13/src_0/storage/innobase/row/row0mysql.cc:2091
|
#7 0x000055dee6bdfc60 in mysql_update (thd=thd@entry=0x7f6d8c0921d8, table_list=0x7f6d8c0a17c0, fields=..., values=..., conds=0x7f6d8c0a2610, order_num=<optimized out>, order=<optimized out>, limit=18446744073709551615, ignore=false, found_return=found_return@entry=0x7f6a24844560, updated_return=updated_return@entry=0x7f6a24844620) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_update.cc:1168
|
#8 0x000055dee6b36d15 in mysql_execute_command (thd=thd@entry=0x7f6d8c0921d8) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:4376
|
#9 0x000055dee6b3c8ab in mysql_parse (thd=thd@entry=0x7f6d8c0921d8, rawbuf=rawbuf@entry=0x7f6d8c0a1460 "UPDATE ..."..., length=length@entry=317, parser_state=parser_state@entry=0x7f6a24846350, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at /usr/src/debug/MariaDB-10.4.13/src_0/sql/sql_parse.cc:7900
|
Finds NULL pointer on lock variable. Analysis:
- Customer is using single-master Galera topology. There is no indication of any applier threads running during crash.
- Transaction isolation level for crashing trx is READ COMMITTED
- There is no indication of Galera replay
- As this is single-master condition in lock_rec_has_to_wait_in_queue() i.e.
if (wsrep_thd_is_BF(wait_lock->trx->mysql_thd, FALSE) &&
|
wsrep_thd_is_BF(lock->trx->mysql_thd, TRUE)) {
|
is always false.
- As this is master node this is not lock replication problem
- At this point I do not see any evidence that this is Galera bug
- Listed threads are only one's that are interesting at the crash point
|
|
https://github.com/MariaDB/server/commit/fb2fc051004d11b4aeea81683596f505b58cab23
This is only for requested condition move with additional instrumentation when something goes wrong. I have failed to reproduce the actual case using RQG, mtr and sysbench. At the moment I do not have good ideas how to continue. One possibility is to give special build to customer and see if we still hit same bug. At least in that case we have additional information on error log to analyze.
|
|
Some code cleanup in this area is welcome, but I think that more effort is needed to reproduce this error.
Please revise the patch and request a further review.
When it comes to the suspicion that the loop would fail to terminate without Galera involvement, I think that it is extremely unlikely. The function lock_rec_unlock() would acquire lock_sys.mutex and trx_t::mutex, find a matching lock, and proceed to release the lock while holding those mutexes. The lock should not possibly disappear while we are holding those mutexes. Besides, locks should only be released by the thread that is executing the transaction, not by other threads. Other threads may only convert implicit locks into explicit ones, on behalf of the lock-holder transaction. That activity is blocked by both said mutexes.
|
|
https://github.com/MariaDB/server/commit/c2c4a77004e38bcdb735c330b5b9529c625b372d
I have run lot of randgen runs on both release build and debug build, but still failed to reproduce this problem (some others I have found and created MDEVs for them).
|
|
This looks mostly OK now. Please revert some unnecessary changes (in particular, the change that would have made lock_trx_handle_wait() dead code), and request one more review.
|
|
|
|
jplindst the customer is running the debug build since last week. We will try to trigger the crash again. Usually it appears a few hours later some DDL has been run.
|
|
There is bigger difference between 10.3 and 10.4 (Galera 4), but all that is on server code.
|
|
I posted a few minor comments to the 10.2 version. OK push to 10.2 after addressing those. I will use the other versions for reference when merging to later branches.
|