[MDEV-23657] SIGSEGV in malloc_size_and_flag from my_free / PROFILING::finish_current_query (on optimized builds) Created: 2020-09-03  Updated: 2021-04-15  Resolved: 2021-04-15

Status: Closed
Project: MariaDB Server
Component/s: N/A
Affects Version/s: 10.4
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Roel Van de Paar Assignee: Sujatha Sivakumar (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Attachments: File MDEV-23657_8K.sql     File MDEV-23657_FULL_ORIGINAL.sql.tar.gz     File master.err    
Issue Links:
Relates
relates to MDEV-23534 SIGSEGV in sf_malloc_usable_size/my_f... Closed

 Description   

This bug is likely one of the most sporadic issues I have ever worked on. Both elenst and myself have observed this bug in QA runs. It is very hard to reduce and reproduce. I am thus uploading everything I have so far in the hope that a developer can find the issue through code + core dump analysis. The stack trace is also very short so hopefully bug analysis is not hard. A full core dump is available. An SQL trace reduced to just over 8K lines is attached as MDEV-23657_8K.sql, but reproducibility with the same (only a single thread is needed) is ultra low.

10.4.15 eae968f62d285de97ed607c87bc131cd863d5d03 (Optimized)

Core was generated by `/test/MD110820-mariadb-10.4.15-linux-x86_64-opt/bin/mysqld --no-defaults --max_'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=11)
    at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
[Current thread is 1 (Thread 0x14ad7c0c9700 (LWP 876069))]
(gdb) bt
#0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=11)
    at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
#1  0x000055b679d54a77 in my_write_core (sig=sig@entry=11) at /test/10.4_opt/mysys/stacktrace.c:482
#2  0x000055b67972c62a in handle_fatal_signal (sig=11) at /test/10.4_opt/sql/signal_handler.cc:343
#3  <signal handler called>
#4  malloc_size_and_flag (is_thread_specific=<synthetic pointer>, p=0x9)
    at /test/10.4_opt/mysys/my_malloc.c:43
#5  my_free (ptr=0x9) at /test/10.4_opt/mysys/my_malloc.c:213
#6  0x000055b679606348 in PROFILING::finish_current_query_impl (this=this@entry=0x14ad340042f0)
    at /test/10.4_opt/sql/sql_profile.cc:391
#7  0x000055b679524dc4 in PROFILING::finish_current_query (this=0x14ad340042f0)
    at /test/10.4_opt/sql/sql_profile.h:302
#8  dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x14ad34000c08, 
    packet=<optimized out>, packet@entry=0x14ad34924b69 "", packet_length=<optimized out>, 
    packet_length@entry=86, is_com_multi=is_com_multi@entry=false, 
    is_next_command=is_next_command@entry=false) at /test/10.4_opt/sql/sql_parse.cc:2466
#9  0x000055b679526e04 in do_command (thd=0x14ad34000c08) at /test/10.4_opt/sql/sql_parse.cc:1352
#10 0x000055b679603dbe in do_handle_one_connection (connect=connect@entry=0x55b67bfeb1d8)
    at /test/10.4_opt/sql/sql_connect.cc:1412
#11 0x000055b679603e7d in handle_one_connection (arg=arg@entry=0x55b67bfeb1d8)
    at /test/10.4_opt/sql/sql_connect.cc:1316
#12 0x000055b679c5c4ea in pfs_spawn_thread (arg=0x55b67bf71958)
    at /test/10.4_opt/storage/perfschema/pfs.cc:1869
#13 0x000014ad98ad36db in start_thread (arg=0x14ad7c0c9700) at pthread_create.c:463
#14 0x000014ad97c4da3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

All other details + files (core + mysqld + ldd files, data dir, error log, reduced testcase) uploaded as attachment.

elenst had a great idea to change the reducer from an exact bug match to a more generic one, and I next tried to reduce towards 'malloc' in the error log only, rather then looking for a specific unique bug ID match.

This led to the following testcase (exactly as reduced to avoid the risk of non-reproducibility):

DROP DATABASE transforms;CREATE DATABASE transforms;DROP DATABASE test;CREATE DATABASE test;USE test;
DROP DATABASE transforms;
SET @@GLOBAL.OPTIMIZER_SWITCH="duplicateweedout=ON";#ERROR: 1231 - Variable 'optimizer_switch' can't be set to the value of 'duplicateweedout=ON'
ALTER TABLE tab MODIFY COLUMN c3 POLYGON NOT NULL;#ERROR: 1399 - XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state
SELECT * FROM t1  WHERE c2 BETWEEN '1971-01-01 00:00:01' AND '2038-01-09 03:14:07' ORDER BY c1,c2 DESC;#ERROR: 1054 - Nepoznata kolona 'c2' u 'where clause'
insert into ti (id) values (1) on duplicate key update y = 0, z = 42;#ERROR: 1792 - Cannot execute statement in a READ ONLY transaction
SET @@session.default_master_connection = '1234-5678';#NOERROR
SELECT ST_NUMINTERIORRINGS(1);#ERROR: 4079 - Illegal parameter data type int for operation 'st_numinteriorrings'
change master to IGNORE_SERVER_IDS= ();#NOERROR
SET @@GLOBAL.replicate_wild_ignore_table="";#NOERROR
SELECT SLEEP(3);

Likely quite a few items in this testcase are superfluous like the initial DROP DATABASE transforms statement etc. And please note that this testcase leads to malloc being mentioned in the error log only, not to the stack shown above. From memory, I setup reducer to reduce based on the MySQL client: I expect this issue to be reproducible in high repetition parallel MTR runs with minor modifications if required for MTR. The same applies for the 8K SQL testcase (MDEV-23657_8K.sql).

For reference, I have also uploaded the original SQL trace (88K lines) as MDEV-23657_FULL_ORIGINAL.sql.tar.gz and the full set file (ref first comment below) also includes all in-between stages (_out files, the more _out's the more reduced)

MDEV-22706 has a slightly similar stacktrace, though I doubt it's related, and it was previously fixed.



 Comments   
Comment by Roel Van de Paar [ 2020-09-03 ]

With thanks to elenst - the shorter version of the testcase seems to lead us back to MDEV-23534. sujatha.sivakumar please have a look at this bug and core dump and let us know your thoughts? Thank you

Comment by Roel Van de Paar [ 2020-09-04 ]

A scan for 'malloc' in the logs while reducing gives different issues;

/dev/shm/1599177442/subreducer$ grep "malloc" */log/master.err
1/log/master.err:/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x197)[0x153c98bfe277]
1/log/master.err:/test/MD110820-mariadb-10.4.15-linux-x86_64-opt/bin/mysqld(my_malloc+0x36)[0x5624e369df16]
1/log/master.err:mysys/my_malloc.c:103(my_malloc)[0x5624e2e7022a]
109/log/master.err:malloc/malloc.c:5350(_IO_str_seekoff)[0x153bfff2097a]
109/log/master.err:malloc/malloc.c:4157(_int_free)[0x153bfff27e8c]
132/log/master.err:malloc/malloc.c:5350(_IO_str_seekoff)[0x14ad24c5f97a]
132/log/master.err:malloc/malloc.c:4161(_int_free)[0x14ad24c66e9c]
170/log/master.err:/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x197)[0x149302d5a277]
170/log/master.err:/test/MD110820-mariadb-10.4.15-linux-x86_64-opt/bin/mysqld(my_malloc+0x36)[0x55621cff3f16]
170/log/master.err:mysys/my_malloc.c:103(my_malloc)[0x55621c8a8ede]
170/log/master.err:mysys/my_malloc.c:256(my_strndup)[0x55621c7c948c]

However, some good news, using a new 10.4 build made today (@ rev: 1cda462f46305daf2a5becb1ed0ce4fcdf3ae404) which includes the patch for MDEV-23534 no longer produces any malloc issues, including the PROFILING one.

In summary, it may be that the patch for MDEV-23534 solved all issues in this cluster of issues.

Comment by Sujatha Sivakumar (Inactive) [ 2020-09-04 ]

Hello Roel

Thank you for the bug report. I looked into the stack trace, but I could not associate the current issue with MDEV-22706.
From the attached sql file we could see that SET @@GLOBAL.replicate_wild_ignore_table="";#NOERROR is set just few instructions before the crashing query. Hence it could be caused by MDEV-23534. It is better to keep this issue open and monitor for some more time, to ensure it is completely gone.

Comment by Roel Van de Paar [ 2020-09-04 ]

Thank you sujatha.sivakumar. From the testing with the patched version it "looks" gone but indeed I can also not say for sure it is. Let's keep an eye open.

Deadline proposal; if this issue was not re-observed by 1-1-2021 I think we can close it.

Comment by Roel Van de Paar [ 2020-09-04 ]

Small update; elenst is seeing this particular bug also for trials which are not using replicate variables.

Comment by Roel Van de Paar [ 2020-11-20 ]

Attempting to run the original (shorter) testcase under ASAN on various versions including 10.4 yield no results.

Comment by Roel Van de Paar [ 2020-11-20 ]

However, running the longer testcase under ASAN did produce a number of ASAN issues, reported in the error log.

sujatha.sivakumar I uploaded the error log. Would you please have a look to see if any of the issues in there clearly related to the SIGSEGV? If so, I will reduce towards such a particular issue. All I need is a specific string from the ASAN trace to validate against, for example 'Query_arena::set_query_arena' etc.

It looks like most of the issues (like the Query_arena) are known, i.e. MDEV-22737.

Comment by Roel Van de Paar [ 2021-02-26 ]

Hi sujatha.sivakumar. Could you have a look at the ASAN issues seen to see if there is any clearly related to the SIGSEGV please? If so, I can reduce towards that. Thank you.

Comment by Sujatha Sivakumar (Inactive) [ 2021-03-01 ]

Hello Roel

Thank you for the details.

The SIGSEGV is caused during the cleanup of 'QUERY_PROFILE' objects. Upon enabling 'set profiling=1' a history of executed queries is maintained in a queue. By default 'profiling_history_size=15'. When number of query elements within history queue execede this size, elements are popped from the history queue.

    /* Maintain the history size. */
    while (history.elements > thd->variables.profiling_history_size)
      delete history.pop();

When history.elements=16 and thd->variables.profiling_history_size=15 then element is popped from queue as shown below.

#0  malloc_size_and_flag (p=0x7fd93c013a58, is_thread_specific=0x7fd99b5b3277 "") at /home/sujatha/bug_repo/test-10.4/mysys/my_malloc.c:43
#1  0x0000559ebda16779 in my_free (ptr=0x7fd93c013a58) at /home/sujatha/bug_repo/test-10.4/mysys/my_malloc.c:213
#2  0x0000559ebce7649b in Queue<QUERY_PROFILE>::pop (this=0x7fd93c0045d0) at /home/sujatha/bug_repo/test-10.4/sql/sql_profile.h:136
#3  0x0000559ebce7493c in PROFILING::finish_current_query_impl (this=0x7fd93c0045a8)
    at /home/sujatha/bug_repo/test-10.4/sql/sql_profile.cc:391
#4  0x0000559ebcbf9006 in PROFILING::finish_current_query (this=0x7fd93c0045a8) at /home/sujatha/bug_repo/test-10.4/sql/sql_profile.h:302
#5  0x0000559ebcce110c in dispatch_command (command=COM_QUERY, thd=0x7fd93c000ce8, packet=0x7fd93c00a129 "", packet_length=25, 
    is_com_multi=false, is_next_command=false) at /home/sujatha/bug_repo/test-10.4/sql/sql_parse.cc:2483
#6  0x0000559ebccdd621 in do_command (thd=0x7fd93c000ce8) at /home/sujatha/bug_repo/test-10.4/sql/sql_parse.cc:1373
#7  0x0000559ebce6f90a in do_handle_one_connection (connect=0x559ec0e1a1e8) at /home/sujatha/bug_repo/test-10.4/sql/sql_connect.cc:1412
#8  0x0000559ebce6f659 in handle_one_connection (arg=0x559ec0e1a1e8) at /home/sujatha/bug_repo/test-10.4/sql/sql_connect.cc:1316
#9  0x0000559ebd8d0300 in pfs_spawn_thread (arg=0x559ec0d549e8) at /home/sujatha/bug_repo/test-10.4/storage/perfschema/pfs.cc:1869
#10 0x00007fd9a71cd6db in start_thread (arg=0x7fd99b5b4700) at pthread_create.c:463
#11 0x00007fd9a636871f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The SIGSEGV seems to be caused because of an invalid QUERY_PROFILE item being freed.

I looked into the master.err file (Excluded Query_arena), I could not find any relation between ASAN stack traces and current SIGSEGV.

Comment by Roel Van de Paar [ 2021-04-15 ]

I did some more testing and could not reproduce this issue.
It is presumed fixed (in whole or in part) by the patch in MDEV-23534.
It it occurs again, in the same form, this ticket can be re-opened.

Generated at Thu Feb 08 09:24:03 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.