[MDEV-28589] Mariadb signal11 crash and restart after select Created: 2022-05-17  Updated: 2022-06-20

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - Aria
Affects Version/s: 10.4.18
Fix Version/s: None

Type: Bug Priority: Major
Reporter: yongjian.wu Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: crash, signal
Environment:

MariaDB single


Attachments: Text File gdb.log    

 Description   

Hi, I might an issue with my PROD database.
The database crashed with signal 11 and then restarted.
Below is the related alert message and gdb backtrace. anyone can help with this?
Thank you

the SQL statement below is not a one time query, it will take exec lots of times per hour

220517 15:00:51 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.4.18-MariaDB-log
key_buffer_size=10485760
read_buffer_size=131072
max_used_connections=457
max_threads=2051
thread_count=27
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 8725647 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x5625418f2008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f8d7e372bd8 thread_stack 0x49000
mysys/stacktrace.c:175(my_print_stacktrace)[0x5624e7cd3dd9]
sql/signal_handler.cc:209(handle_fatal_signal)[0x5624e77fcc3f]
/lib64/libpthread.so.0(+0xf630)[0x7f8d97276630]
bits/string3.h:51(memcpy)[0x5624e7bfd160]
maria/ma_blockrec.c:3564(allocate_and_write_block_record)[0x5624e7bfe206]
maria/ma_write.c:157(maria_write)[0x5624e7c05a34]
sql/sql_class.h:6638(handler::ha_write_tmp_row(unsigned char*))[0x5624e7663a80]
sql/sql_select.cc:21833(end_write(JOIN*, st_join_table*, bool))[0x5624e764f3d7]
sql/sql_class.h:3908(THD::get_stmt_da())[0x5624e7631967]
sql/sql_select.cc:20413(sub_select(JOIN*, st_join_table*, bool))[0x5624e763eec6]
sql/sql_class.h:3908(THD::get_stmt_da())[0x5624e7631967]
sql/sql_select.cc:20413(sub_select(JOIN*, st_join_table*, bool))[0x5624e763eec6]
sql/sql_select.cc:19944(do_select)[0x5624e766147a]
sql/sql_select.cc:4269(JOIN::exec())[0x5624e7661694]
sql/sql_select.cc:4705(mysql_select(THD*, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*))[0x5624e765f940]
sql/sql_select.cc:422(handle_select(THD*, LEX*, select_result*, unsigned long))[0x5624e76604df]
sql/sql_parse.cc:6419(execute_sqlcom_select(THD*, TABLE_LIST*))[0x5624e750963f]
sql/sql_parse.cc:3937(mysql_execute_command(THD*))[0x5624e7609f2a]
sql/sql_parse.cc:7960(mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool))[0x5624e760f4d6]
sql/sql_parse.cc:1917(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool))[0x5624e7611864]
sql/sql_parse.cc:1374(do_command(THD*))[0x5624e7612bd0]
sql/threadpool_common.cc:373(threadpool_process_request)[0x5624e77d1ebf]
sql/threadpool_generic.cc:1605(worker_main(void*))[0x5624e796e23e]
perfschema/pfs.cc:1872(pfs_spawn_thread)[0x5624e7c96184]
pthread_create.c:0(start_thread)[0x7f8d9726eea5]
/lib64/libc.so.6(clone+0x6d)[0x7f8d94fe39fd]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x56253f542820): SELECT
                   A.TR_ID,
                   A.SVC_TR_ID,
               A.INTER_AUTH_TOKEN,
               A.RESLT_CD FINAL_RESLT_CD,
               TIMESTAMPDIFF(SECOND, NOW(), STR_TO_DATE(A.expire_dt, '%Y%m%d%H%i%s')) AS avlbl_time,
               B.RESLT_CD,
               B.RESLT_MSG,
               A.DEVICE_MNGT_ID,
               A.BIZ_REQ_TYPE,
               B.EXECT_TYPE,
               B.EXECT_ORDER,
               B.APD_EXECT_TYPE,
               B.APD_EXECT_ORDER,
               A.SVC_TR_CHALLENGE,
               B.TR_PLAIN,
               B.TR_PLAIN1,
               B.TR_PLAIN2,
               B.TR_PLAIN3,
               B.TR_PLAIN4,
               B.TR_PLAIN5,
                B.TR_SIGN,
               B.TR_SIGN1,
               B.TR_SIGN2,
               B.TR_SIGN3,
               B.TR_SIGN4,
               B.TR_SIGN5,
               A.POLICY_ID,
               A.SITE_ID,
               A.SVC_ID,
               B.AAID,
               A.USER_ID,
               A.CUST_MNGT_ID,
                           A.REQ_PROTOCOL_TYPE,
                           A.TRAN_DATA_TYPE,
                           A.APP_PKG_NM,
                           A.FACET_ID,
                           A.KEY_ID
          FROM OP_SESSION_MNGT A, OP_SESSION_JOB_LIST B
          WHERE 1 = 1
                AND  A.TR_ID = 'here is a number I masked'
                    AND A.TR_ID = B.TR_ID
                ORDER BY A.TR_ID, B.EXECT_ORDER ASC
 
Connection ID (thread ID): 76621082
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=on,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=on,mrr_cost_based=on,mrr_sort_keys=on,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on

GDB backtrace

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/maria10.1/bin/mysqld'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f8d94fd8ccd in poll () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install bits-mariadb-galera-rac-2.4.4-1.x86_64
(gdb) bt
#0  0x00007f8d94fd8ccd in poll () from /lib64/libc.so.6
#1  0x00005624e7554409 in poll (__timeout=-1, __nfds=2, __fds=0x7ffcea0612e0) at /usr/include/bits/poll2.h:41
#2  handle_connections_sockets () at /data/rpm/source/mariadb-10.4.18/sql/mysqld.cc:6465
#3  0x00005624e75570bd in mysqld_main (argc=147, argv=0x5624eae3e668) at /data/rpm/source/mariadb-10.4.18/sql/mysqld.cc:5901
#4  0x00007f8d94f07555 in __libc_start_main () from /lib64/libc.so.6
#5  0x00005624e754b6e7 in _start ()



 Comments   
Comment by Daniel Black [ 2022-05-18 ]

Thank you for the bug report. Is the query crashing on every occurrence?

Thanks for providing the gdb backtrace. Unfortunately the crash is in a different thread. If you still have a core file, please use thread apply all bt -frame-arguments all full. If you are on an older RHEL7 version of gdb:

 set print frame-arguments all
 thread apply all bt full

The crash is in the temporary table creation. Altering your query to:

                AND  B.TR_ID = 'here is a number I masked'
                    AND A.TR_ID = B.TR_ID
                ORDER BY B.TR_ID, B.EXECT_ORDER ASC

With an OP_SESSION_JOB_LIST index on (TR_ID, EXECT_ORDER might be sufficient to avoid the temporary table creation, avoid the crash, and might even be quicker.

I haven't seen anything in later versions that may fix this. f24038b8515864e496dd2c55d973a0227b3698df changes allocations, but it shouldn't materially affect this bug report. Note: changed to Aria, because even though the tables might be innodb, Aria is used as the temporary table and its in here that the crash occurs.

Comment by yongjian.wu [ 2022-05-18 ]

Hi Daniel
thanks a lot for your help and analysis ,
for the first question
--Is the query crashing on every occurrence? - no, this query runs very frequently per minute, and this database has already run for about 1 year with no problem.
for the second
--Unfortunately the crash is in a different thread - I attach a full log with name gdb.log gdb.log
for the third
could I get more about why when the temporary table created will cause the crash in this case?
for the last
I also find no matter add the index that you recommend or not the query still uses the temporary table. unless I give up the "exect_order" in the "order by".
thanks a ton

Comment by Daniel Black [ 2022-05-18 ]

Glad to know its not a frequent occurrence. It could be starting to happen now as the nature of the data changes a new query plan happens.

"Perfect" crash gdb report, thanks. Extracting for read/search-ability:

memcpy code location ma_blockrec.c#L3453

thread 5

Thread 5 (Thread 0x7f8d7e373700 (LWP 4452)):
#0  0x00005624e7bfd160 in memcpy (__len=8, __src=0x56264264c01b, __dest=<synthetic pointer>) at /usr/include/bits/string3.h:51
No locals.
#1  write_block_record (info=info@entry=0x562569196008, old_record=old_record@entry=0x0, record=record@entry=0x56254213c020 "", 
    row=row@entry=0x562569196078, bitmap_blocks=bitmap_blocks@entry=0x562569196078, head_block_is_read=<optimized out>, 
    row_pos=row_pos@entry=0x7f8d7e36f940, undo_lsn=undo_lsn@entry=1, old_record_checksum=old_record_checksum@entry=0)
    at /data/rpm/source/mariadb-10.4.18/storage/maria/ma_blockrec.c:3453
        blob_pos = <optimized out>
        length = 4294967288
        blob_length = <optimized out>
        data = <optimized out>
        end_of_data = <optimized out>
        tmp_data_used = 0x0
        tmp_data = 0x56256919681a ""
        row_extents_first_part = <optimized out>
        row_extents_second_part = <optimized out>
        field_length_data = <optimized out>
        page_buff = <optimized out>
        block = 0x56253c6d41a8
        head_block = 0x56253c6d4190
        share = <optimized out>
        column = 0x56253d92b0c8
        end_column = 0x56253d92b330
        page_link = {link = 0x562522f60b48, unlock = PAGECACHE_LOCK_READ_UNLOCK, write_lock = 22053, changed = 1 '\001'}
        block_size = <optimized out>
        flag = <optimized out>
        head_length = <optimized out>
        blob_lengths = 0x562540c99148
        row_extents_in_use = <optimized out>
        blob_full_pages_exists = <optimized out>
        lsn = 0
        position = <optimized out>
        save_my_errno = <optimized out>
#2  0x00005624e7bfe206 in allocate_and_write_block_record (undo_lsn=1, row=0x562569196078, record=0x56254213c020 "", info=0x562569196008)
    at /data/rpm/source/mariadb-10.4.18/storage/maria/ma_blockrec.c:3564
        row_pos = {buff = 0x5625691967c8 "", data = 0x5625691967d4 "", dir = 0x5625691987c0 "\f", length = 8172, rownr = 0, empty_space = 8102}
        blocks = 0x562569196078
        save_my_errno = <optimized out>
#3  _ma_write_init_block_record (info=0x562569196008, record=0x56254213c020 "")
    at /data/rpm/source/mariadb-10.4.18/storage/maria/ma_blockrec.c:3604
No locals.
#4  0x00005624e7c05a34 in maria_write (info=0x562569196008, record=0x56254213c020 "")
    at /data/rpm/source/mariadb-10.4.18/storage/maria/ma_write.c:157
        share = 0x56253d92a008
        i = <optimized out>
        save_errno = <optimized out>
        filepos = <optimized out>
        oldpos = 18446744073709551615
        buff = <optimized out>
        lock_tree = 0 '\000'
        fatal_error = <optimized out>
        keyinfo = <optimized out>
#5  0x00005624e7663a80 in handler::ha_write_tmp_row (this=0x56253eee0020, buf=0x56254213c020 "")
    at /data/rpm/source/mariadb-10.4.18/sql/sql_class.h:6638
        this_tracker = <optimized out>
        error = <optimized out>
#6  0x00005624e764f3d7 in end_write (join=0x562540faecf8, join_tab=0x5625405193d8, end_of_records=<optimized out>)
    at /data/rpm/source/mariadb-10.4.18/sql/sql_select.cc:21833
        error = <optimized out>
        table = 0x56253ec47020

could I get more about why when the temporary table created will cause the crash in this case?

Not sure yet. When I look closer something may relieve itself. Is the normally query execution plan for this query to create a temp table?

Based of the backtrace length could this be a fully populated blob column of this particular TR_ID? A gdb p *column as this thread 5 location using up to get to the right function might be insightful. And at least a partial table definition around this column.

Sorry adding the index didn't help. You did change A.TR_ID = X to B.TR_ID = X in the query? Either way thanks for the feedback.

Comment by yongjian.wu [ 2022-05-18 ]

thanks for your quick response
– Is the normally query execution plan for this query to create a temp table? -yes, I think it is a normal exec plain as I also check on non-prod env and also on my
test env, and the plain is the same.
– could this be a fully populated blob column of this particular TR_ID? - the TR_ID of both A&B is varchar(20) btw the B.EXECT_ORDER is decimal(2,0)
– You did change A.TR_ID = X to B.TR_ID = X in the query? - sorry for my fault I did not change all "A.TR_ID to B.TR_ID ".
I add the index on OP_SESSION_JOB_LIST (TR_ID,EXECT_ORDER) then the plain changed from "Using where; Using temporary; Using filesort" to "Using where; Using filesort".
and if i give up the B.EXECT_ORDER in the "order by" it changes from "Using where; Using temporary; Using filesort" to "Using where; " although I did not add the index.
seems there is a way to be forbidden to use the temp table.

could you let me know how can you analyze that this case crash point is temporary table creation, which part error message can get this or ..... ? and is there a newer version that can fix this issue?

Comment by Daniel Black [ 2022-05-18 ]

The blob/text field could be any of the result columns.

gdb to get column details associated with the crash

thread 5
up
p *column

I don't think a newer version would fix this, there just hasn't been much changes in code in this area recently. There maybe other fixes elsewhere that I haven't seen that affect this.

If you had a repeatable crash, it would be appreciated a test on a latest 10.4 to confirm.

Comment by yongjian.wu [ 2022-05-18 ]

thanks, buddy

Comment by yongjian.wu [ 2022-05-19 ]

Hi Daniel

if we change the tmp engine to myisam by "aria_used_for_temp_tables" "OFF" , if this issue can be avoided?
actually, we are thinking how to avoid this but do not change the query statement, although we add the index that you suggest we still need to change the statement.
or change the column definition from text to varchar is the text data type is the culprit?

Comment by Daniel Black [ 2022-05-19 ]

A tmp engine to myisam is possible, however the code base is rather stagnant and you could run into more issues rather than less (especially as its a non-default setting). I haven't looked how different the myisam implementation of write_block_record is.

Changing column definitions to varchar rather than text will certainly avoid this bit of code. Is your text type a LONGTEXT by chance? From the backtrace length=4294967288 is 2^32 - 8(size of a pointer), if correct, implies longtext which certainly won't fit in a varchar.

Are any of your text fields particularly large that could make length a miscalculation?

Comment by yongjian.wu [ 2022-05-19 ]

Unfortunately for the application side, it is a vendor Product and only provides to us as a package.it is difficult for us to make change. that's why we are thinking temporary solution from the db side. And for this application their original environment is using amazon aurora but never faced the issue like this time we met, we are trying to migrate it to MariaDB.

All of the columns defined with text is "text COLLATE utf8mb4_unicode_ci DEFAULT NULL",is it safe to transfer to varchar?
Thanks for always helping.

Comment by Daniel Black [ 2022-05-23 ]

safety depends on ensuring that the data in those columns can fit inside the varchar. Aside from that I can't think of any functional differences.

Generated at Thu Feb 08 10:01:54 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.