[MDEV-23939] Crash with mariadb 10.4: storage/innobase/row/row0sel.cc:4480: row_search_mvcc Assertion `prebuilt->sql_stat_start || trx->state == TRX_STATE_ACTIVE || (prebuilt->table->no_rollback() && trx->state == TRX_STATE_NOT_STARTED)’ failed. Created: 2020-10-12  Updated: 2023-05-02  Resolved: 2023-05-02

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: 10.4.13, 10.4.14
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Jérome Perrin Assignee: Oleksandr Byelkin
Resolution: Incomplete Votes: 0
Labels: crash

Attachments: Text File gdb.txt     Text File mariadb_error.log     File mariadb_general.log.xz    
Issue Links:
Relates
relates to MDEV-20666 Assertion failures prebuilt->sql_stat... Stalled
relates to MDEV-23338 server crash on select for update Open
relates to MDEV-26291 ER_FILSORT_ABORT triggers - crash que... Closed
relates to MDEV-10087 mysqld_update()/mysql_delete() contin... Closed

 Description   

We are observing mariadb sever crashes with 10.4 (confirmed with 10.4.13 and 10.4.14) in a scenario where multiple clients are modifying rows from the same innodb tables. The same scenario not to happen with mariadb 10.3 (confirmed OK with 10.3.22).

On a debug build (built with -DCMAKE_BUILD_TYPE=Debug) , the error log contain:

mysqld: /srv/slapgrid/slappart4/srv/slapos/inst/slappart0/tmp/soft/5f89b8831c4d60d07cb75ad034cd3ec6/parts/mariadb__compile__/mariadb-10.4.14/storage/innobase/row/row0sel.cc:4480: dberr_t row_search_mvcc(byte*, page_cur_mode_t, row_prebuilt_t*, ulint, ulint): Assertion `prebuilt->sql_stat_start || trx->state == TRX_STATE_ACTIVE || (prebuilt->table->no_rollback() && trx->state == TRX_STATE_NOT_STARTED)' failed.
201008  7:10:57 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.4.14-MariaDB-debug-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=40
max_threads=132
thread_count=47
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 421632 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f17cc000a88
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f18493eae88 thread_stack 0x49000
/srv/slapgrid/slappart4/srv/slapos/inst/slappart0/tmp/soft/5f89b8831c4d60d07cb75ad034cd3ec6/parts/mariadb/bin/mysqld(my_print_stacktrace+0x2b)[0x563fa496a232]
mysys/stacktrace.c:270(my_print_stacktrace)[0x563fa41bcc09]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0)[0x7f1864c080c0]
linux/raise.c:51(__GI_raise)[0x7f1863895fcf]
stdlib/abort.c:91(__GI_abort)[0x7f18638973fa]
assert/assert.c:92(__assert_fail_base)[0x7f186388ee37]
/lib/x86_64-linux-gnu/libc.so.6(+0x2bee2)[0x7f186388eee2]
/srv/slapgrid/slappart4/srv/slapos/inst/slappart0/tmp/soft/5f89b8831c4d60d07cb75ad034cd3ec6/parts/mariadb/bin/mysqld(+0xd56934)[0x563fa4595934]
include/sync0types.h:451(row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long))[0x563fa43e8fa7]
:0(ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function))[0x563fa41d4c2c]
sql/handler.h:3631(handler::index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function))[0x563fa41c5474]
sql/handler.cc:2919(handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function))[0x563fa41cd459]
sql/handler.cc:5878(handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool))[0x563fa40830cd]
sql/multi_range_read.cc:353(handler::multi_range_read_next(void**))[0x563fa4083158]
sql/multi_range_read.cc:390(Mrr_simple_index_reader::get_next(void**))[0x563fa4084f3a]
sql/multi_range_read.cc:1491(DsMrr_impl::dsmrr_next(void**))[0x563fa43d1370]
:0(ha_innobase::multi_range_read_next(void**))[0x563fa432606e]
sql/opt_range.cc:12224(QUICK_RANGE_SELECT::get_next())[0x563fa41bb099]
sql/filesort.cc:772(find_all_keys)[0x563fa3f60fee]
sql/sql_select.cc:23654(create_sort_index(THD*, JOIN*, st_join_table*, Filesort*))[0x563fa3f612fe]
sql/sql_select.cc:21400(st_join_table::sort_table())[0x563fa3f61432]
sql/sql_select.cc:21338(join_init_read_record(st_join_table*))[0x563fa3f49d10]
sql/sql_select.cc:20414(sub_select(JOIN*, st_join_table*, bool))[0x563fa3f7d39e]
sql/sql_select.cc:19955(do_select)[0x563fa3f7d7d5]
sql/sql_select.cc:4256(JOIN::exec())[0x563fa3f7bb92]
sql/sql_select.cc:4687(mysql_select(THD*, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*))[0x563fa3f7beac]
sql/sql_select.cc:410(handle_select(THD*, LEX*, select_result*, unsigned long))[0x563fa3ef4a65]
sql/sql_parse.cc:6355(execute_sqlcom_select(THD*, TABLE_LIST*))[0x563fa3effe3b]
sql/sql_parse.cc:3889(mysql_execute_command(THD*))[0x563fa3f0a33e]
sql/sql_parse.cc:7896(mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool))[0x563fa3f0cb89]
sql/sql_parse.cc:1834(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool))[0x563fa3f10504]
sql/sql_parse.cc:1352(do_command(THD*))[0x563fa403ca5d]
sql/sql_connect.cc:1412(do_handle_one_connection(CONNECT*))[0x563fa403cb74]
nptl/pthread_create.c:333(start_thread)[0x7f1864bfe494]
x86_64/clone.S:99(clone)[0x7f186394baff]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f17cc0112e0): SELECT * FROM message WHERE processing_node IN (0, 2) AND path IN ('/erp5/portal_simulation/1/7/1/1/1/1/1','/erp5/portal_simulation/1/7/1/1/1/1','/erp5/portal_simulation/1/7/1/1/1','/erp5/portal_simulation/1/7/1/1','/erp5/portal_simulation/1/7/1','/erp5/portal_simulation/1/7','/erp5/portal_simulation/1') AND method_id = 'expand' AND group_method_id = '\0' ORDER BY path LIMIT 1 FOR UPDATE
Connection ID (thread ID): 47
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /srv/slapgrid/slappart4/srv/slapos/inst/slappart0/tmp/inst/crash3/srv/mariadb
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             257485               257485               processes
Max open files            4096                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       257485               257485               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: core

on a non-debug build, it was something like InnoDB: Assertion failure in file /opt/slapgrid/shared/mariadb/4da287626e382c1b113859af7d7565a9_compile_/mariadb-10.4.13/storage/innobase/lock/lock0lock.cc line 6898.

We have collected a core dump and produced a full back trace of all threads, please see attached gdb.txt

I also attach the general log from this execution, from the creation of the database until the crash. In that scenario, we are using mariadb for a kind of "task queue" application, the typical queries that might be causing the crash are using something like this simplified pattern, executed by several clients:

-- pick a task to execute
SELECT uid FROM message WHERE processing_node=0 LIMIT 1 FOR UPDATE;
-- assign to self
UPDATE message SET processing_node=(worker id) WHERE uid=(uid of the picked task)
-- remove when done
DELETE FROM message WHERE uid=(uid of the picked task)

We don't have a simple reproduction, but we have a script which compiles mariadb and all its dependencies, start an application creating random data and it usually crashes after less than one hour. I'd be happy to try a patch with this script if you'd like, describe how to run this script, give access to a virtual machine with an already set up environment, or anything else which could help.



 Comments   
Comment by Marko Mäkelä [ 2022-11-01 ]

This looks like a possible duplicate of MDEV-10087. The assertion in trx_state_eq() is intended to catch cases where the SQL layer is ignoring that the execution of a transaction was aborted, for example due to a deadlock of transactions.

Comment by Elena Stepanova [ 2023-02-19 ]

perrinjerome,

A fix for the bug MDEV-10087 which marko mentioned above was included into the latest release (in 10.4 series it is 10.4.28). Would you be able to upgrade and see if it helps?

Comment by Jérome Perrin [ 2023-02-22 ]

@elenst thank you very much. I still have my script reproducing the problem, with mariadb 1.0.17, I could see a crash after 15 minutes. The same script with mariadb 10.4.28 has been running for more than 4 hours and there was no problem, so it really seems that this was resolved.

Comment by Jérome Perrin [ 2023-02-24 ]

I was trying more 10.4.28 today and I saw another crash, I need to re-run with a debug build of mariadb to see if it is the same problem.

Comment by Elena Stepanova [ 2023-03-27 ]

perrinjerome,

Any luck with reproducing?

Comment by Jérome Perrin [ 2023-03-30 ]

Thank you elenst I am still investigating this "another crash", but I really believe it's another independent issue, I also observe it on 10.3 and the conditions are different. If you were waiting for my feedback to close this issue, then I believe you can close this issue already. I'll probably open another issue for that "another crash" once I have more information.

Thanks again

Generated at Thu Feb 08 09:26:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.