[MDEV-23897] ASAN SEGV in lock_grant_and_move_on_page Created: 2020-10-06  Updated: 2020-10-07  Resolved: 2020-10-06

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.2.35, 10.3.26, 10.4.16, 10.5.7

Type: Bug Priority: Blocker
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: crash, debug, regression

Issue Links:
Problem/Incident
is caused by MDEV-23101 SIGSEGV in lock_rec_unlock() when Gal... Closed
Relates
relates to MDEV-11039 Add new scheduling algorithm for redu... Closed
relates to MDEV-16664 InnoDB: Failing assertion: !other_loc... Closed

 Description   

This failure has been happening concurrent tests quite a lot recently. I don't have anything reproducible yet, but marko apparently knows how to fix it already.

It's been filed for 10.5 because the corresponding concurrent tests on ASAN debug builds only run on 10.5 and 10.6 (and fail on both). However, it may very well affect previous versions.

10.5 7fba16d53fe3c

ASAN:DEADLYSIGNAL
=================================================================
==6765==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000001e2e061 bp 0x7f2f5c5b1d40 sp 0x7f2f5c5b1d10 T15)
==6765==The signal is caused by a READ memory access.
==6765==Hint: address points to the zero page.
    #0 0x1e2e060 in lock_grant_and_move_on_page /home/vsts/src/storage/innobase/lock/lock0lock.cc:2085
    #1 0x1e2ed2c in lock_rec_dequeue_from_page /home/vsts/src/storage/innobase/lock/lock0lock.cc:2166
    #2 0x1e39a46 in lock_release(trx_t*) /home/vsts/src/storage/innobase/lock/lock0lock.cc:4147
    #3 0x21c8d6a in trx_t::release_locks() /home/vsts/src/storage/innobase/trx/trx0trx.cc:503
    #4 0x21c9ae5 in trx_t::commit_in_memory(mtr_t const*) /home/vsts/src/storage/innobase/trx/trx0trx.cc:1368
    #5 0x21c18e0 in trx_t::commit_low(mtr_t*) /home/vsts/src/storage/innobase/trx/trx0trx.cc:1552
    #6 0x21c19e3 in trx_t::commit() /home/vsts/src/storage/innobase/trx/trx0trx.cc:1566
    #7 0x21c2475 in trx_commit_for_mysql(trx_t*) /home/vsts/src/storage/innobase/trx/trx0trx.cc:1698
    #8 0x1c8936c in innobase_commit_low(trx_t*) /home/vsts/src/storage/innobase/handler/ha_innodb.cc:4004
    #9 0x1c89a97 in innobase_commit_ordered_2 /home/vsts/src/storage/innobase/handler/ha_innodb.cc:4110
    #10 0x1c8a559 in innobase_commit /home/vsts/src/storage/innobase/handler/ha_innodb.cc:4214
    #11 0x124d799 in commit_one_phase_2 /home/vsts/src/sql/handler.cc:1941
    #12 0x124d4b9 in ha_commit_one_phase(THD*, bool) /home/vsts/src/sql/handler.cc:1920
    #13 0x124b850 in ha_commit_trans(THD*, bool) /home/vsts/src/sql/handler.cc:1714
    #14 0xecfd9e in trans_commit_implicit(THD*) /home/vsts/src/sql/transaction.cc:329
    #15 0xa74d09 in mysql_execute_command(THD*) /home/vsts/src/sql/sql_parse.cc:5054
    #16 0xa8883a in mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool) /home/vsts/src/sql/sql_parse.cc:7994
    #17 0xa5f2e1 in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool) /home/vsts/src/sql/sql_parse.cc:1867
    #18 0xa5bb55 in do_command(THD*) /home/vsts/src/sql/sql_parse.cc:1348
    #19 0xe9087f in do_handle_one_connection(CONNECT*, bool) /home/vsts/src/sql/sql_connect.cc:1410
    #20 0xe901d8 in handle_one_connection /home/vsts/src/sql/sql_connect.cc:1312
    #21 0x1aad4a8 in pfs_spawn_thread /home/vsts/src/storage/perfschema/pfs.cc:2201
    #22 0x7f2f80c206b9 in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76b9)
    #23 0x7f2f7fe4e4dc in clone (/lib/x86_64-linux-gnu/libc.so.6+0x1074dc)
 
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /home/vsts/src/storage/innobase/lock/lock0lock.cc:2085 in lock_grant_and_move_on_page
Thread T15 created by T0 here:
    #0 0x7f2f826f5d6f in pthread_create (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x37d6f)
    #1 0x1aa81fa in my_thread_create /home/vsts/src/storage/perfschema/my_thread.h:38
    #2 0x1aad897 in pfs_spawn_thread_v1 /home/vsts/src/storage/perfschema/pfs.cc:2252
    #3 0x7604da in inline_mysql_thread_create /home/vsts/src/include/mysql/psi/mysql_thread.h:1321
    #4 0x775b33 in create_thread_to_handle_connection(CONNECT*) /home/vsts/src/sql/mysqld.cc:6022
    #5 0x776194 in create_new_thread(CONNECT*) /home/vsts/src/sql/mysqld.cc:6081
    #6 0x7764ba in handle_accepted_socket(st_mysql_socket, st_mysql_socket) /home/vsts/src/sql/mysqld.cc:6146
    #7 0x776fd7 in handle_connections_sockets() /home/vsts/src/sql/mysqld.cc:6273
    #8 0x77538c in mysqld_main(int, char**) /home/vsts/src/sql/mysqld.cc:5668
    #9 0x75ec06 in main /home/vsts/src/sql/main.cc:25
    #10 0x7f2f7fd6783f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2083f)
 
==6765==ABORTING
201005  7:29:04 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.5.6-MariaDB-debug-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=6
max_threads=153
thread_count=8
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467965 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x62b0000bd288
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f2f5c5b59b0 thread_stack 0x5fc00
/usr/lib/x86_64-linux-gnu/libasan.so.4(+0x55900)[0x7f2f82713900]
/home/vsts/server/bin/mysqld(my_print_stacktrace+0xc3)[0x26eb694]
buf/buf0flu.cc:398(buf_flush_insert_into_flush_list(buf_block_t*, unsigned long))[0x123de43]
??:0(__restore_rt)[0x7f2f80c2a390]
linux/raise.c:54(__GI_raise)[0x7f2f7fd7c438]
stdlib/abort.c:91(__GI_abort)[0x7f2f7fd7e03a]
??:0(__sanitizer_cov_trace_pc_guard_init)[0x7f2f827be76e]
??:0(__sanitizer_symbolize_global)[0x7f2f827c6558]
??:0(__asan_unpoison_intra_object_redzone)[0x7f2f827a47b9]
??:0(__asan_unpoison_intra_object_redzone)[0x7f2f827a3302]
??:0(__restore_rt)[0x7f2f80c2a390]
/home/vsts/server/bin/mysqld[0x1e2e061]
maria/ma_backup.c:132(_ma_base_info_read)[0x1e2ed2d]
maria/ma_backup.c:184(aria_read_index)[0x1e39a47]
maria/ma_ft_nlq_search.c:76(walk_and_match)[0x21c8d6b]
handler/i_s.cc:6545(i_s_dict_fill_sys_tablespaces(THD*, unsigned int, char const*, unsigned long, TABLE*))[0x21c9ae6]
handler/i_s.cc:6621(i_s_sys_tablespaces_fill_table(THD*, TABLE_LIST*, Item*))[0x21c18e1]
handler/i_s.cc:5293(i_s_dict_fill_sys_indexes(THD*, unsigned long, unsigned long, dict_index_t*, TABLE*))[0x21c19e4]
handler/i_s.cc:5295(i_s_dict_fill_sys_indexes(THD*, unsigned long, unsigned long, dict_index_t*, TABLE*))[0x21c2476]
handler/i_s.cc:5330(i_s_sys_indexes_fill_table(THD*, TABLE_LIST*, Item*))[0x1c8936d]
maria/ma_pagecache.c:3072(read_block)[0x1c89a98]
maria/ma_pagecache.c:3111(read_block)[0x1c8a55a]
maria/ma_pagecache.c:3201(pagecache_unlock)[0x124d79a]
sql/field.h:2058(Field_num::pack_length_from_metadata(unsigned int) const)[0x124d4ba]
sql/field.h:2040(Field_num::decimals() const)[0x124b851]
sql/field.h:1133(Field::value_length())[0xecfd9f]
sql/sql_prepare.cc:3228(mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool))[0xa74d0a]
/home/vsts/server/bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x6c8)[0xa8883b]
/home/vsts/server/bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x2312)[0xa5f2e2]
/home/vsts/server/bin/mysqld(_Z10do_commandP3THD+0x1156)[0xa5bb56]
/home/vsts/server/bin/mysqld(_Z24do_handle_one_connectionP7CONNECTb+0x4cd)[0xe90880]
sql/sql_parse.cc:9231(kill_one_thread(THD*, long long, killed_state, killed_type))[0xe901d9]
sql/sql_parse.cc:9152(find_thread_by_id_with_thd_data_lock(long long, bool))[0x1aad4a9]
??:0(start_thread)[0x7f2f80c206ba]
x86_64/clone.S:111(clone)[0x7f2f7fe4e4dd]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x62b0000c42a8): LOCK /* QNO 17386 CON_ID 10 */ TABLES `view_t4_MyISAM` READ LOCAL, `view_t6_Aria` READ, `t2` WRITE, `view_t3_Aria` w READ LOCAL, `view_t1_MyISAM` READ
 
Connection ID (thread ID): 10
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /home/vsts/logs/vardir/data
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             27709                27709                processes 
Max open files            65536                65536                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       27709                27709                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: |/usr/share/apport/apport %p %s %c %d %P %E{



 Comments   
Comment by Marko Mäkelä [ 2020-10-06 ]

The problematic assertion was added in MDEV-23101, and apparently it can be hit with the MDEV-16664 test case grammar on 10.2 as well. I see two potentially problematic assertions where lock may be a null pointer:

@@ -2346,6 +2306,7 @@ lock_grant_and_move_on_page(ulint rec_fold, ulint space, ulint page_no)
 		lock = previous->hash;
 	}
 
+	ut_ad(!lock->trx->is_wsrep());
 	ut_ad(previous->hash == lock || previous == lock);
 	/* Grant locks if there are no conflicting locks ahead.
 	 Move granted locks to the head of the list. */
@@ -4252,6 +4211,7 @@ lock_grant_and_move_on_rec(
 	    }
 		lock = previous->hash;
 	}
+	ut_ad(!lock->trx->is_wsrep());
 	/* Grant locks if there are no conflicting locks ahead.
 	 Move granted locks to the head of the list. */
 	for (;lock != NULL;) {

In the reported stack trace, the null pointer is being defererenced for the first hunk.

In retrospect, it is obvious that we must be coming from the following branch:

	else {
		while (previous->hash &&
		       (previous->hash->un_member.rec_lock.page_id != id)) {
			previous = previous->hash;
		}
		lock = previous->hash;
	}
 
	ut_ad(!lock->trx->is_wsrep());

The while condition may be terminated when previous->hash is a null pointer, and that pointer can be assigned to lock. The solution is to move the assertion inside the while (lock) loop that follows.

The second hunk is basically a copy of the same, with the exception that the while (lock) loop is actually for (;lock != NULL;) and visible in the diff above. Also in that hunk, we may assign a null pointer to lock.

I am sorry for not being more careful in the MDEV-23101 review. Luckily, unlike the MDEV-23101 itself, the SIGSEGV in the fix only affects debug builds.

Comment by Marko Mäkelä [ 2020-10-06 ]

I just understood why the bug was not repeatable on buildbot: both problematic hunks are only reachable if the non-default parameter innodb_lock_schedule_algorithm=VATS is set. Because that parameter is never set with Galera, it means that MDEV-23101 (which was fixing a crash with Galera replication) may have included some unnecessary changes.

Comment by Marko Mäkelä [ 2020-10-06 ]

Very many mtr tests that use InnoDB would crash if the test is started with

./mtr --mysqld=--loose-innodb-lock-schedule-algorithm=VATS

I am really sorry that I assumed that such testing would have been conducted as part of the developer testing of MDEV-23101. That code was touching the VATS code paths on purpose. The debug assertions are trying to say that Galera cannot be in use together with VATS.

Here is rather small test that would crash:

10.2 2b832151ad66556cb07ed232b139411f4b6defe3

CURRENT_TEST: innodb.innodb-alter-nullable
mysqltest: At line 86: query 'ALTER IGNORE TABLE t1 ADD UNIQUE(c)' failed: 2013: Lost connection to MySQL server during query
#2  0x00005579283fffbe in handle_fatal_signal (sig=11) at /mariadb/10.2o/sql/signal_handler.cc:343
#3  <signal handler called>
#4  lock_grant_and_move_on_page (rec_fold=rec_fold@entry=424183894671, space=space@entry=12, page_no=page_no@entry=4) at /mariadb/10.2o/storage/innobase/lock/lock0lock.cc:2309
#5  0x0000557928670bf1 in lock_rec_dequeue_from_page (in_lock=in_lock@entry=0x7f9c2972d388) at /mariadb/10.2o/storage/innobase/lock/lock0lock.cc:2395
#6  0x00005579286765a4 in lock_release (trx=trx@entry=0x7f9c2972d140) at /mariadb/10.2o/storage/innobase/lock/lock0lock.cc:4400
#7  0x00005579286766ca in lock_trx_release_locks (trx=trx@entry=0x7f9c2972d140) at /mariadb/10.2o/storage/innobase/lock/lock0lock.cc:6521
#8  0x00005579287f7530 in trx_t::release_locks (this=0x7f9c2972d140) at /mariadb/10.2o/storage/innobase/trx/trx0trx.cc:602
#9  trx_commit_in_memory (trx=trx@entry=0x7f9c2972d140, mtr=mtr@entry=0x7f9c28a2dc50, serialised=serialised@entry=true) at /mariadb/10.2o/storage/innobase/trx/trx0trx.cc:1746
#10 0x00005579287f9012 in trx_commit_low (trx=trx@entry=0x7f9c2972d140, mtr=0x7f9c28a2dc50) at /mariadb/10.2o/storage/innobase/trx/trx0trx.cc:1961
#11 0x00005579287f93ca in trx_commit (trx=trx@entry=0x7f9c2972d140) at /mariadb/10.2o/storage/innobase/trx/trx0trx.cc:1985

However, this test already runs with 5 innodb_page_size combinations. I will try to find an existing small test that is short and does not include combinations.

Comment by Marko Mäkelä [ 2020-10-06 ]

This only affected debug builds when the non-default read-only configuration parameter innodb_lock_schedule_algorithm=VATS was set.

Generated at Thu Feb 08 09:25:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.