[MDEV-24035] Failing assertion: UT_LIST_GET_LEN(lock.trx_locks) == 0 causing disruption and replication failure Created: 2020-10-27  Updated: 2024-02-06

Status: Open
Project: MariaDB Server
Component/s: Replication, Storage Engine - InnoDB
Affects Version/s: 10.5.5, 10.5.6, 10.11.3, 10.6.12, 10.11.6
Fix Version/s: 10.6, 10.11, 11.0, 11.1, 11.2

Type: Bug Priority: Critical
Reporter: Martin Nix Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 4
Labels: crash
Environment:

Centos 7.8.2003


Attachments: Text File MDEV-24035.txt    
Issue Links:
Blocks
is blocked by MDEV-25163 Rowid filter does not process storage... Closed
Duplicate
duplicates MDEV-24322 Assertion failure in file /home/build... Closed
is duplicated by MDEV-24322 Assertion failure in file /home/build... Closed
is duplicated by MDEV-29478 Database Crashed with no reason Closed
is duplicated by MDEV-29536 InnoDB: Assertion failure Closed
is duplicated by MDEV-31259 mariadb server crash dump Closed
is duplicated by MDEV-31780 InnoDB: Assertion failure in file D:\... Closed
Relates
relates to MDEV-20870 InnoDB: Assertion failure in file /ho... Closed
relates to MDEV-31941 Assertion failures upon using OQGraph... Confirmed
relates to MDEV-32448 MariaDB 10.6.11 - InnoDB: Assertion f... Closed
relates to MDEV-33176 Assertion failure in /home/buildbot/b... Open

 Description   

See the error dump below; it happens sporadically (every 2 to 3 weeks), resulting in all connections being dropped and evacuation of the cache, taking around 20 seconds to recover. In each case a random, simple SELECT triggers the issue. All tables are InnoDB and have been checked with mysqlcheck with no issues found. The issue did not occur under 10.4 but coincided with the move to 10.5.5. The system is not under excessive load at the time the problem manifests. A recent upgrade to 10.5.6 has not resolved the issue.

IMPORTANT: The knock-on effect for replication slaves is duplicate records causing replication to fail, so this effectively looks like two bugs (the duplicate updates can be skipped manually). I believe the assertion failure handling may be causing this secondary serious issue:

1) assertion failure on master host
2) replication failure on slave host
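As a stopgap on the slave side only (not a fix for the underlying assertion), the duplicate-record replication errors mentioned above can be skipped one event at a time. A minimal sketch, assuming a classic primary/replica setup and that a duplicate-key error is what stops the SQL thread; skipping discards the offending event, so use with care:

```sql
-- On the replica, after the SQL thread stops with a duplicate-key error:
STOP SLAVE;
-- Skip exactly one replicated event (the one that failed):
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;
-- Confirm Slave_SQL_Running is back to Yes:
SHOW SLAVE STATUS\G
```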

2020-10-21 14:47:03 0x7f45b0c42700  InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.5.5/storage/innobase/trx/trx0trx.cc line 1329
InnoDB: Failing assertion: UT_LIST_GET_LEN(lock.trx_locks) == 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
201021 14:47:03 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.5-MariaDB-log
key_buffer_size=67108864
read_buffer_size=1048576
max_used_connections=1137
max_threads=1252
thread_count=486
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2660907 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f43a0064958
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f45b0c41c90 thread_stack 0x49000
Can't start addr2line
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x556585139f3e]
/usr/sbin/mariadbd(handle_fatal_signal+0x307)[0x556584b3c607]
/lib64/libpthread.so.0(+0xf630)[0x7f7d8a62e630]
/lib64/libc.so.6(gsignal+0x37)[0x7f7d88688387]
/lib64/libc.so.6(abort+0x148)[0x7f7d88689a78]
/usr/sbin/mariadbd(+0x64c5e7)[0x55658481a5e7]
/usr/sbin/mariadbd(+0xda9197)[0x556584f77197]
/usr/sbin/mariadbd(+0xda92c3)[0x556584f772c3]
/usr/sbin/mariadbd(+0xda95a4)[0x556584f775a4]
/usr/sbin/mariadbd(+0xc4b6cc)[0x556584e196cc]
/usr/sbin/mariadbd(+0xc56723)[0x556584e24723]
/usr/sbin/mariadbd(_ZN7handler16ha_external_lockEP3THDi+0x261)[0x556584b48681]
/usr/sbin/mariadbd(+0xa6ae7a)[0x556584c38e7a]
/usr/sbin/mariadbd(_Z24mysql_unlock_read_tablesP3THDP13st_mysql_lock+0x72)[0x556584c39702]
/usr/sbin/mariadbd(_ZN4JOIN9join_freeEv+0x169)[0x556584976cb9]
/usr/sbin/mariadbd(_ZN4JOIN10exec_innerEv+0xc6c)[0x55658499b55c]
/usr/sbin/mariadbd(_ZN4JOIN4execEv+0x33)[0x55658499b813]
/usr/sbin/mariadbd(_Z12mysql_selectP3THDP10TABLE_LISTR4ListI4ItemEPS4_jP8st_orderS9_S7_S9_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x186)[0x5565849999d6]
/usr/sbin/mariadbd(_Z13handle_selectP3THDP3LEXP13select_resultm+0x1cc)[0x55658499a53c]
/usr/sbin/mariadbd(+0x63343a)[0x55658480143a]
/usr/sbin/mariadbd(_Z21mysql_execute_commandP3THD+0x37f5)[0x556584940605]
/usr/sbin/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x36a)[0x55658494350a]
/usr/sbin/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x1bbd)[0x55658494587d]
/usr/sbin/mariadbd(_Z10do_commandP3THD+0x10b)[0x55658494701b]
/usr/sbin/mariadbd(_Z24do_handle_one_connectionP7CONNECTb+0x1b2)[0x556584a2fb32]
/usr/sbin/mariadbd(handle_one_connection+0x34)[0x556584a2fde4]
/usr/sbin/mariadbd(+0xbb62ad)[0x556584d842ad]
/lib64/libpthread.so.0(+0x7ea5)[0x7f7d8a626ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f7d887508dd]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f43a0074710): SELECT * FROM `lookup_permissions_grid` WHERE  `application` IN ('guest')
Connection ID (thread ID): 187768692
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /mnt/db/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1030944              1030944              processes
Max open files            16384                16384                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       1030944              1030944              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: core
 
2020-10-21 14:47:27 0 [Note] mariadbd: Aria engine: starting recovery
tables to flush: 2 1 0
 (0.0 seconds);
2020-10-21 14:47:27 0 [Note] mariadbd: Aria engine: recovery done

mysqld --print-defaults
mysqld would have been started with the following arguments:

--datadir=/mnt/db/mysql --tmpdir=/var/lib/mysqltmp --sql-mode=STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION --userstat=1 --log_warnings=2 --thread_handling=one-thread-per-connection --thread-cache-size=16 --table-open-cache=4096 --table-definition-cache=2048 --query-cache-type=0 --sort-buffer-size=1M --read-buffer-size=1M --read-rnd-buffer-size=1M --join-buffer-size=1M --tmp-table-size=128M --max-heap-table-size=128M --back-log=100 --max-connections=1250 --max-connect-errors=10000 --max-allowed-packet=16M --interactive-timeout=28800 --wait-timeout=28800 --default-storage-engine=InnoDB --innodb=FORCE --key-buffer-size=64M --myisam-sort-buffer-size=128M --innodb-buffer-pool-size=224256M --innodb-log-file-size=100M --innodb-file-per-table=1 --innodb-open-files=4096 --innodb_flush_log_at_trx_commit=0 --innodb_buffer_pool_instances=16 --innodb_flush_method=O_DIRECT --log-output=FILE --slow-query-log=1 --slow-query-log-file=/mnt/db/mysql_slow.log --long-query-time=1 --character-set-server=utf8 --group_concat_max_len=40960 --log-error=/mnt/db/mysqld.log --open-files-limit=65535 --server-id=1 --gtid-domain-id=1 --log-bin=/mnt/db/mysqllogs/bin-log --relay-log-space-limit=16G --expire-logs-days=2 --sync-binlog=0 --innodb_io_capacity=2000 --innodb_thread_concurrency=0



 Comments   
Comment by Alice Sherepa [ 2020-10-27 ]

Please provide the steps to reproduce, or describe the steps you took in an attempt to reproduce, plus the output of SHOW CREATE TABLE lookup_permissions_grid;

Comment by Martin Nix [ 2020-10-27 ]

Cannot reproduce on demand - it occurs when it occurs; there appears to be no pattern to the behaviour. It is also a different table each time. The table below has a total of 57 records:

CREATE TABLE `lookup_permissions_grid` (
  `permid` int(10) NOT NULL AUTO_INCREMENT,
  `application` varchar(40) NOT NULL,
  `name` varchar(100) NOT NULL,
  `enabled_def` tinyint(1) NOT NULL,
  `title` varchar(255) DEFAULT NULL,
  `desc` text DEFAULT NULL,
  `indx` int(11) DEFAULT NULL,
  PRIMARY KEY (`permid`),
  KEY `search1` (`application`)
) ENGINE=InnoDB AUTO_INCREMENT=121 DEFAULT CHARSET=utf8

Comment by Alice Sherepa [ 2020-11-10 ]

Do you happen to use sequences, by any chance? There was a bug with a similar assertion (MDEV-21822); it may somehow be related.
Would it be possible to turn on general logging?

Comment by Martin Nix [ 2020-11-11 ]

We do not use sequences at all, Alice.

Comment by Elena Stepanova [ 2020-12-15 ]

Let's track it in MDEV-24322 which Marko is already looking at.

Comment by Marko Mäkelä [ 2023-01-19 ]

Apparently, MDEV-24322 was hijacked for some Galera-related bug, while marnixgb is not using Galera.

vlad.lesin has debugged and fixed locking issues between replication and InnoDB in the past.

Comment by Rick Pizzi [ 2023-02-15 ]

Just got a core out of this crash.
Updating the ticket with the threads dump from gdb.

Comment by Marko Mäkelä [ 2023-05-15 ]

This is a long shot, but I wonder if MDEV-31185 could be the root cause.

Comment by Vladislav Lesin [ 2023-05-15 ]

marko, I think, yes, MDEV-31185 can be the root cause.

Comment by Marko Mäkelä [ 2023-05-19 ]

In case MDEV-31185 fixes this, the build https://ci.mariadb.org/35156/ that corresponds to this commit based on MariaDB Server 10.6.11 could be worth testing. marnixgb and others might want to give it a try.

Comment by Ivan [ 2023-08-01 ]

I'm seeing this error every 15-20 minutes on our server, version 10.6.14. Is there any additional information I could provide to help debug the problem? I understand there might be a fix for MDEV-31185 in 10.6.15, which hasn't been released yet, but I wonder if that actually fixes this issue. It started showing up a few weeks ago every couple of days, but now it's happening several times every hour. The database crashes and restarts; every time there is a different query, but the same assertion. The server acts as a master in a replication setup, but even with the slaves disabled the error happens several times per hour.

Do you have any hint on what we can check or change to improve our situation?

Comment by Sergei Golubchik [ 2023-08-16 ]

Rey, 10.6.15 is released. Please add a comment if this error keeps happening in 10.6.15 and we'll reopen this issue.

Comment by Marko Mäkelä [ 2023-08-17 ]

According to comments in MDEV-25163, this crash does still occur in 10.6.15 and 10.11.5. There was some heavy refactoring of the InnoDB lock_sys in 10.6. Theoretically, the crash in MariaDB Server 10.4.31 and 10.5.22 could still have been fixed by MDEV-31185.

Comment by Marko Mäkelä [ 2023-08-17 ]

In MDEV-31941 there is an example where a SELECT that is executed 'inside' LOCK TABLES can cause this assertion failure.
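For illustration, the LOCK TABLES pattern referred to above looks roughly like this (a hypothetical sketch reusing the table from this report; the actual MDEV-31941 test case involves OQGraph and is not reproduced here):

```sql
-- A SELECT executed while the session holds an explicit table lock:
LOCK TABLES lookup_permissions_grid READ;
SELECT * FROM lookup_permissions_grid WHERE application IN ('guest');
UNLOCK TABLES;
```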

Comment by Marko Mäkelä [ 2023-09-04 ]

MDEV-25163 (basically this same bug) was reproduced on 10.5.22, which includes the fix of MDEV-31185.

Theoretically, 10.6 and later major versions could hit this assertion due to some other bug than 10.5 or earlier major versions.

Comment by Vladislav Lesin [ 2023-09-26 ]

I am not sure this is a duplicate of MDEV-25163. It could be either MDEV-25163, MDEV-31185, or some not-yet-fixed bug. We need feedback from the bug reporter. If a build with both fixes solves their issue, then we can close the bug.

Comment by Martin Nix [ 2023-09-26 ]

Not reproducible since moving from 10.6.8 to 10.6.11 (running reliably for almost a year now).

Have since moved to 10.11 on Rocky 8, and despite some trouble over swap necessity (kernel tuning), all seems good.

Comment by Marko Mäkelä [ 2023-09-26 ]

Based on recent progress (reproducing and fixing MDEV-25163, as well as reproducing but not yet fixing MDEV-31941), I believe that the cause of this assertion failure should be some wrongdoing outside InnoDB, that is, ignoring the fact that a transaction was rolled back and carrying on as if nothing happened. The assertion would fail because a supposedly non-locking transaction turns out to hold locks during commit.

I think that we will need feedback (not any longer from marnixgb, but from our support customers) to find out if there are further causes of this bug that have not yet been fixed.

Comment by Michael Widenius [ 2023-11-05 ]

What is the state of this MDEV?
It has not been pushed to 10.6, and there is no reason I can find why this is in NEED FEEDBACK.
What feedback are we waiting for, and what will happen to the commit that exists?
If the commit is wrong or does nothing, then we can ignore it, but that should be spelled out in the MDEV.
There are 6 support cases related to this MDEV, so it is a reasonably common bug that we need fixed ASAP.

Generated at Thu Feb 08 09:26:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.