[MDEV-23693] Failing assertion: my_atomic_load32_explicit(&lock->lock_word, MY_MEMORY_ORDER_RELAXED) == X_LOCK_DECR Created: 2020-09-08  Updated: 2020-12-14  Resolved: 2020-10-27

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.3.24
Fix Version/s: 10.2.35, 10.3.26, 10.4.16, 10.5.7

Type: Bug Priority: Blocker
Reporter: Geza Lucz Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 2
Labels: rr-profile
Environment:

Centos 7 / CloudLinux 7


Issue Links:
Blocks
blocks MDEV-23989 Merge new release of InnoDB 5.7.32 to... Closed
Duplicate
duplicates MDEV-23967 SEGV in my_atomic_loadlint storage/in... Closed
Relates
relates to MDEV-21329 InnoDB: Failing assertion: lock->lock... Closed
relates to MDEV-23452 Assertion `buf_page_get_io_fix(bpage)... Closed
relates to MDEV-23637 binlog_encryption.encrypted_master f... Closed
relates to MDEV-24030 SUMMARY: AddressSanitizer: heap-use-a... Closed

 Description   

I started seeing the following crash on August 26th, and I have received four of them since. There are no hardware errors, nothing else suspicious is logged, and InnoDB is not complaining about data corruption. I only get this assertion failure and then a crash.

InnoDB: Failing assertion: my_atomic_load32_explicit(&lock->lock_word, MY_MEMORY_ORDER_RELAXED) == X_LOCK_DECR
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
200826 19:38:25 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.3.24-MariaDB
key_buffer_size=1048576000
read_buffer_size=33554432
max_used_connections=101
max_threads=102
thread_count=22
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 17737955 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f05fdc2ca28
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f065355ad30 thread_stack 0x49000
*** buffer overflow detected ***: /usr/sbin/mysqld terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7f07589d9577]
/lib64/libc.so.6(+0x1166f2)[0x7f07589d76f2]
/lib64/libc.so.6(+0x1184d7)[0x7f07589d94d7]
/usr/sbin/mysqld(my_addr_resolve+0xda)[0x5629c874e32a]
/usr/sbin/mysqld(my_print_stacktrace+0x1c2)[0x5629c87376b2]
/usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x5629c81cca4f]
/lib64/libpthread.so.0(+0xf630)[0x7f075a627630]
/lib64/libc.so.6(gsignal+0x37)[0x7f07588f7387]
/lib64/libc.so.6(abort+0x148)[0x7f07588f8a78]
/usr/sbin/mysqld(+0x4ed8a0)[0x5629c7f078a0]
/usr/sbin/mysqld(+0xa657c6)[0x5629c847f7c6]
/usr/sbin/mysqld(+0xb132a4)[0x5629c852d2a4]
/usr/sbin/mysqld(+0xb34cbb)[0x5629c854ecbb]
/usr/sbin/mysqld(+0xad4ddf)[0x5629c84eeddf]
/usr/sbin/mysqld(+0xb02174)[0x5629c851c174]
/usr/sbin/mysqld(+0xb03f8b)[0x5629c851df8b]
/usr/sbin/mysqld(+0xb0480c)[0x5629c851e80c]
/usr/sbin/mysqld(+0xae0961)[0x5629c84fa961]
/usr/sbin/mysqld(+0xad16ee)[0x5629c84eb6ee]
/usr/sbin/mysqld(+0xad6d9c)[0x5629c84f0d9c]
/usr/sbin/mysqld(+0xac9bf8)[0x5629c84e3bf8]
/usr/sbin/mysqld(+0xa37fb2)[0x5629c8451fb2]
/usr/sbin/mysqld(+0xa3b49e)[0x5629c845549e]
/usr/sbin/mysqld(+0x95be77)[0x5629c8375e77]
/usr/sbin/mysqld(_ZN7handler18ha_index_next_sameEPhPKhj+0x1c5)[0x5629c81d3225]
/usr/sbin/mysqld(+0x613fed)[0x5629c802dfed]
/usr/sbin/mysqld(_Z10sub_selectP4JOINP13st_join_tableb+0x211)[0x5629c8021c21]
/usr/sbin/mysqld(+0x5fcecc)[0x5629c8016ecc]
/usr/sbin/mysqld(_Z10sub_selectP4JOINP13st_join_tableb+0x1de)[0x5629c8021bee]
/usr/sbin/mysqld(_ZN4JOIN10exec_innerEv+0xa8b)[0x5629c8045b4b]
/usr/sbin/mysqld(_ZN4JOIN4execEv+0x33)[0x5629c8045d63]
/usr/sbin/mysqld(_Z12mysql_selectP3THDP10TABLE_LISTjR4ListI4ItemEPS4_jP8st_orderS9_S7_S9_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x11a)[0x5629c804426a]
/usr/sbin/mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0x1cc)[0x5629c8044d7c]
/usr/sbin/mysqld(+0x4daf06)[0x5629c7ef4f06]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x63ba)[0x5629c7fed8ca]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x36d)[0x5629c7ff063d]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0xfe1)[0x5629c7ff1ed1]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x11b)[0x5629c7ff416b]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1d6)[0x5629c80caa86]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x5629c80cab9d]
/lib64/libpthread.so.0(+0x7ea5)[0x7f075a61fea5]
/lib64/libc.so.6(clone+0x6d)[0x7f07589bf8dd]
======= Memory map: ========

I now also think that it is data related. I was running 10.3.24 for two weeks before getting the first crash; since then it has come roughly once every two days. I rolled back to 10.3.23 four days ago and have had no crashes since. It's probably too early to say, but it seems like only 10.3.24 is affected. I run a lot of 10.0 and 10.1 servers and they don't crash with the latest updates.



 Comments   
Comment by Ross Campbell [ 2020-09-15 ]

I am on Ubuntu 18.04 LTS, and upgraded to 10.3.24 (from 10.3.23+maria~bionic) on Sep 9th. I immediately started seeing this error in my automated test suite several times per day. What exactly triggers it seems to be random; sometimes it happens during the database load (before tests have begun) and sometimes during the tests themselves. I can't find a way to reproduce it reliably, although I honestly haven't tried all that hard.

This definitely did NOT occur on 10.3.23 - my test suite runs at minimum once per day on a timer, and far more than that during active development, so I'm sure I would have seen it. We had been running 10.3.23 since July 13th 2020 with zero crashes.

Update: Because my test suite was failing most of the time, I downgraded to 10.3.23 and this crash immediately went away.

I would be happy to provide more information or any other assistance in troubleshooting this issue; please just let me know if I can help. In the meantime, I am strongly considering moving to 10.5.

Logs are as follows:

Sep 11 15:40:58 otto-m1 mysqld[1814]: 2020-09-11 15:40:58 0x7f90e3be5700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.3.24/storage/innobase/sync/sync0rw.cc line 258
Sep 11 15:40:58 otto-m1 mysqld[1814]: InnoDB: Failing assertion: my_atomic_load32_explicit(&lock->lock_word, MY_MEMORY_ORDER_RELAXED) == X_LOCK_DECR
Sep 11 15:40:58 otto-m1 mysqld[1814]: InnoDB: We intentionally generate a memory trap.
Sep 11 15:40:58 otto-m1 mysqld[1814]: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
Sep 11 15:40:58 otto-m1 mysqld[1814]: InnoDB: If you get repeated assertion failures or crashes, even
Sep 11 15:40:58 otto-m1 mysqld[1814]: InnoDB: immediately after the mysqld startup, there may be
Sep 11 15:40:58 otto-m1 mysqld[1814]: InnoDB: corruption in the InnoDB tablespace. Please refer to
Sep 11 15:40:58 otto-m1 mysqld[1814]: InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
Sep 11 15:40:58 otto-m1 mysqld[1814]: InnoDB: about forcing recovery.
Sep 11 15:40:58 otto-m1 mysqld[1814]: 200911 15:40:58 [ERROR] mysqld got signal 6 ;
Sep 11 15:40:58 otto-m1 mysqld[1814]: This could be because you hit a bug. It is also possible that this binary
Sep 11 15:40:58 otto-m1 mysqld[1814]: or one of the libraries it was linked against is corrupt, improperly built,
Sep 11 15:40:58 otto-m1 mysqld[1814]: or misconfigured. This error can also be caused by malfunctioning hardware.
Sep 11 15:40:58 otto-m1 mysqld[1814]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Sep 11 15:40:58 otto-m1 mysqld[1814]: We will try our best to scrape up some info that will hopefully help
Sep 11 15:40:58 otto-m1 mysqld[1814]: diagnose the problem, but since we have already crashed,
Sep 11 15:40:58 otto-m1 mysqld[1814]: something is definitely wrong and this may fail.
Sep 11 15:40:58 otto-m1 mysqld[1814]: Server version: 10.3.24-MariaDB-1:10.3.24+maria~bionic
Sep 11 15:40:58 otto-m1 mysqld[1814]: key_buffer_size=134217728
Sep 11 15:40:58 otto-m1 mysqld[1814]: read_buffer_size=131072
Sep 11 15:40:58 otto-m1 mysqld[1814]: max_used_connections=499
Sep 11 15:40:58 otto-m1 mysqld[1814]: max_threads=1002
Sep 11 15:40:58 otto-m1 mysqld[1814]: thread_count=163
Sep 11 15:40:58 otto-m1 mysqld[1814]: It is possible that mysqld could use up to
Sep 11 15:40:58 otto-m1 mysqld[1814]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2333977 K  bytes of memory
Sep 11 15:40:58 otto-m1 mysqld[1814]: Hope that's ok; if not, decrease some variables in the equation.
Sep 11 15:40:58 otto-m1 mysqld[1814]: Thread pointer: 0x7f92f4000c08
Sep 11 15:40:58 otto-m1 mysqld[1814]: Attempting backtrace. You can use the following information to find out
Sep 11 15:40:58 otto-m1 mysqld[1814]: where mysqld died. If you see no messages after this, something went
Sep 11 15:40:58 otto-m1 mysqld[1814]: terribly wrong...
Sep 11 15:40:58 otto-m1 mysqld[1814]: stack_bottom = 0x7f90e3be4dd8 thread_stack 0x49000
Sep 11 15:40:59 otto-m1 mysqld[1814]: *** buffer overflow detected ***: /usr/sbin/mysqld terminated

Comment by Robert Klikics [ 2020-09-29 ]

We're facing the same issue since 10.3.24 on CentOS 7.8 with MariaDB from the official repos.

Comment by Marko Mäkelä [ 2020-09-29 ]

This might be related to the regression MDEV-23452 that we noticed and fixed soon after the 10.3.24 release. That fix was ported to the MariaDB Enterprise releases 10.2.33-8, 10.3.24-8, 10.4.14-8, 10.5.5-3.

Comment by Ulrich Moser (Inactive) [ 2020-10-01 ]

Since this bug also affects users of the Community edition, I would expect the fix to be available there too. Is a bugfix release >10.3.24 on the way?

Comment by Daniel Black [ 2020-10-05 ]

Is https://github.com/mariadb-corporation/MariaDBEnterprise/commit/5a73c44cf08259af41f952415d2fbeaf5bf44e7c (which I can't read) in the community branches? It seems to be the only commit not by Thiru.

Comment by Marko Mäkelä [ 2020-10-05 ]

danblack, https://github.com/mariadb-corporation/MariaDBEnterprise/commit/5a73c44cf08259af41f952415d2fbeaf5bf44e7c is the MariaDB Enterprise Server 10.5.5-3 version of the first of the two MDEV-23452 commits. It changed the file storage/innobase/buf/buf0buf.cc to be identical to this 10.5 revision:

diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
index 6a8c4869659..5977271dc20 100644
--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -3840,7 +3840,7 @@ buf_page_create(fil_space_t *space, uint32_t offset,
     const bool drop_hash_entry= block->page.state() == BUF_BLOCK_FILE_PAGE &&
       UNIV_LIKELY_NULL(block->index);
     if (UNIV_UNLIKELY(drop_hash_entry))
-      block->page.set_io_fix(BUF_IO_PIN);
+      rw_lock_x_lock(&block->lock);
 #endif /* BTR_CUR_HASH_ADAPT */
 
     /* Page can be found in buf_pool */
@@ -3851,7 +3851,7 @@ buf_page_create(fil_space_t *space, uint32_t offset,
     if (UNIV_UNLIKELY(drop_hash_entry))
     {
       btr_search_drop_page_hash_index(block);
-      block->page.io_unfix();
+      rw_lock_x_unlock(&block->lock);
     }
 #endif /* BTR_CUR_HASH_ADAPT */
 

In 10.5 the change was much simpler than in 10.2, because MDEV-15053 had simplified the code flow.

Because the MariaDB Enterprise Server releases occur some time after the Community Server releases that they are based on, some bug fixes that missed the underlying Community Server release may be included.

Comment by Matthias Leich [ 2020-10-16 ]

I got a similar assertion failure on a development tree:
origin/bb-10.3-MDEV-23072 23f5afe7fdb13e8f968eb9a587c1c8cf319927e8 2020-10-16T12:05:18+05:30
The server claims to be a 10.3.26.
 
Thread 4 received signal SIGABRT, Aborted.
...
# 2020-10-16T11:51:20 [908677] | Thread 4 (Thread 910010.910312):
# 2020-10-16T11:51:20 [908677] | #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
# 2020-10-16T11:51:20 [908677] | #1  0x000014c6b0928859 in __GI_abort () at abort.c:79
# 2020-10-16T11:51:20 [908677] | #2  0x000055809f72f033 in ut_dbg_assertion_failed (expr=expr@entry=0x5580a073f440 "my_atomic_load32_explicit(&lock->lock_word, MY_MEMORY_ORDER_RELAXED) == X_LOCK_DECR", file=file@entry=0x5580a073f660 "/home/mleich/Server/bb-10.3-MDEV-23072A/storage/innobase/sync/sync0rw.cc", line=line@entry=257) at /home/mleich/Server/bb-10.3-MDEV-23072A/storage/innobase/ut/ut0dbg.cc:60
# 2020-10-16T11:51:20 [908677] | #3  0x000055809f62183f in rw_lock_free_func (lock=lock@entry=0x4a6c5ade1440) at /home/mleich/Server/bb-10.3-MDEV-23072A/storage/innobase/sync/sync0rw.cc:257
# 2020-10-16T11:51:20 [908677] | #4  0x000055809f82bc2e in pfs_rw_lock_free_func (lock=0x4a6c5ade1440) at /home/mleich/Server/bb-10.3-MDEV-23072A/storage/innobase/include/sync0rw.ic:626
# 2020-10-16T11:51:20 [908677] | #5  0x000055809f86ab52 in buf_pool_resize () at /home/mleich/Server/bb-10.3-MDEV-23072A/storage/innobase/buf/buf0buf.cc:2928
# 2020-10-16T11:51:20 [908677] | #6  0x000055809f86dc61 in buf_resize_thread () at /home/mleich/Server/bb-10.3-MDEV-23072A/storage/innobase/buf/buf0buf.cc:3207
# 2020-10-16T11:51:20 [908677] | #7  0x0000137d33c62609 in start_thread (arg=<optimized out>) at pthread_create.c:477
# 2020-10-16T11:51:20 [908677] | #8  0x000014c6b0a25103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
rr:/home/leich/RQG/storage/1602842276/
001086.log -- RQG log
001086.tgz -- Archive of remains including rr trace
 
Replaying via "rr"
==============
cd /home/mleich/RQG/storage/1602842276/TBR-580/dev/shm/vardir/1602842276/35/1/rr
_RR_TRACE_DIR="." rr replay --mark-stdio
 
RQG
====
git clone https://github.com/mleich1/rqg --branch experimental RQG
 
perl rqg.pl \
--grammar=conf/mariadb/table_stress_innodb_nocopy1.yy \
--gendata=conf/mariadb/table_stress.zz \
--gendata_sql=conf/mariadb/table_stress.sql \
--mysqld=--innodb_use_native_aio=1 \
--mysqld=--innodb_lock_schedule_algorithm=fcfs \
--mysqld=--loose-idle_write_transaction_timeout=0 \
--mysqld=--loose-idle_transaction_timeout=0 \
--mysqld=--loose-idle_readonly_transaction_timeout=0 \
--mysqld=--connect_timeout=60 \
--mysqld=--interactive_timeout=28800 \
--mysqld=--slave_net_timeout=60 \
--mysqld=--net_read_timeout=30 \
--mysqld=--net_write_timeout=60 \
--mysqld=--loose-table_lock_wait_timeout=50 \
--mysqld=--wait_timeout=28800 \
--mysqld=--lock-wait-timeout=86400 \
--mysqld=--innodb-lock-wait-timeout=50 \
--no-mask \
--queries=10000000 \
--seed=random \
--reporters=Backtrace \
--reporters=ErrorLog \
--reporters=Deadlock1 \
--validators=None \
--mysqld=--log_output=none \
--mysqld=--log-bin \
--mysqld=--log_bin_trust_function_creators=1 \
--mysqld=--loose-debug_assert_on_not_freed_memory=0 \
--engine=InnoDB \
--restart_timeout=120 \
--duration=400 \
--mysqld=--loose-innodb_fatal_semaphore_wait_threshold=400 \
--mysqld=--loose-innodb-sync-debug \
--mysqld=--innodb_stats_persistent=on \
--mysqld=--loose-max-statement-time=30 \
--threads=33 \
--mysqld=--innodb_page_size=32K \
--mysqld=--innodb-buffer-pool-size=24M \
--duration=400 \
--no_mask \
--workdir=<local settings> \
--vardir=<local settings> \
--mtr-build-thread=<local settings> \
--basedir1=<local settings> \
--script_debug=_nix_ \
--rr=Server \
--rr_options=--chaos
 
RQG error tagging pattern
[ 'TBR-580' , 'InnoDB: Assertion failure in file.{1,150}sync0rw.cc.{1,250}InnoDB: Failing assertion: my_atomic_load32_explicit\(&lock->lock_word, MY_MEMORY_ORDER_RELAXED\) == X_LOCK_DECR' ],
 
Hitting that problem is very rare.

Comment by Marko Mäkelä [ 2020-10-20 ]

Bug #31036301 ASSERTION FAILURE: SYNC0RW.IC:429:LOCK->LOCK_WORD in MySQL 5.7.32 might be addressing this issue.

Comment by Marko Mäkelä [ 2020-10-23 ]

The port of the fix looks correct. I made a few minor comments.

Comment by Marko Mäkelä [ 2020-10-27 ]

While testing this, we encountered another race condition with buffer pool resizing, related to buf_pool->page_hash. We will handle that separately later, in MDEV-24030.

Comment by Matthias Leich [ 2020-10-27 ]

bb-10.2-MDEV-23693 and bb-10.3-MDEV-23693 behaved well during RQG testing.

Generated at Thu Feb 08 09:24:19 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.