Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-23657

SIGSEGV in malloc_size_and_flag from my_free / PROFILING::finish_current_query (on optimized builds)

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Duplicate
    • 10.4(EOL)
    • N/A
    • N/A
    • None

    Description

      This bug is likely one of the most sporadic issues I have ever worked on. Both elenst and myself have observed this bug in QA runs. It is very hard to reduce and reproduce. I am thus uploading everything I have so far in the hope that a developer can find the issue through code + core dump analysis. The stack trace is also very short so hopefully bug analysis is not hard. A full core dump is available. An SQL trace reduced to just over 8K lines is attached as MDEV-23657_8K.sql, but reproducibility with the same (only a single thread is needed) is ultra low.

      10.4.15 eae968f62d285de97ed607c87bc131cd863d5d03 (Optimized)

      Core was generated by `/test/MD110820-mariadb-10.4.15-linux-x86_64-opt/bin/mysqld --no-defaults --max_'.
      Program terminated with signal SIGSEGV, Segmentation fault.
      #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=11)
          at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
      [Current thread is 1 (Thread 0x14ad7c0c9700 (LWP 876069))]
      (gdb) bt
      #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=11)
          at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
      #1  0x000055b679d54a77 in my_write_core (sig=sig@entry=11) at /test/10.4_opt/mysys/stacktrace.c:482
      #2  0x000055b67972c62a in handle_fatal_signal (sig=11) at /test/10.4_opt/sql/signal_handler.cc:343
      #3  <signal handler called>
      #4  malloc_size_and_flag (is_thread_specific=<synthetic pointer>, p=0x9)
          at /test/10.4_opt/mysys/my_malloc.c:43
      #5  my_free (ptr=0x9) at /test/10.4_opt/mysys/my_malloc.c:213
      #6  0x000055b679606348 in PROFILING::finish_current_query_impl (this=this@entry=0x14ad340042f0)
          at /test/10.4_opt/sql/sql_profile.cc:391
      #7  0x000055b679524dc4 in PROFILING::finish_current_query (this=0x14ad340042f0)
          at /test/10.4_opt/sql/sql_profile.h:302
      #8  dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x14ad34000c08, 
          packet=<optimized out>, packet@entry=0x14ad34924b69 "", packet_length=<optimized out>, 
          packet_length@entry=86, is_com_multi=is_com_multi@entry=false, 
          is_next_command=is_next_command@entry=false) at /test/10.4_opt/sql/sql_parse.cc:2466
      #9  0x000055b679526e04 in do_command (thd=0x14ad34000c08) at /test/10.4_opt/sql/sql_parse.cc:1352
      #10 0x000055b679603dbe in do_handle_one_connection (connect=connect@entry=0x55b67bfeb1d8)
          at /test/10.4_opt/sql/sql_connect.cc:1412
      #11 0x000055b679603e7d in handle_one_connection (arg=arg@entry=0x55b67bfeb1d8)
          at /test/10.4_opt/sql/sql_connect.cc:1316
      #12 0x000055b679c5c4ea in pfs_spawn_thread (arg=0x55b67bf71958)
          at /test/10.4_opt/storage/perfschema/pfs.cc:1869
      #13 0x000014ad98ad36db in start_thread (arg=0x14ad7c0c9700) at pthread_create.c:463
      #14 0x000014ad97c4da3f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      All other details + files (core + mysqld + ldd files, data dir, error log, reduced testcase) uploaded as attachment.

      elenst had a great idea to change the reducer from an exact bug match to a more generic one, and I next tried to reduce towards 'malloc' in the error log only, rather then looking for a specific unique bug ID match.

      This led to the following testcase (exactly as reduced to avoid the risk of non-reproducibility):

      DROP DATABASE transforms;CREATE DATABASE transforms;DROP DATABASE test;CREATE DATABASE test;USE test;
      DROP DATABASE transforms;
      SET @@GLOBAL.OPTIMIZER_SWITCH="duplicateweedout=ON";#ERROR: 1231 - Variable 'optimizer_switch' can't be set to the value of 'duplicateweedout=ON'
      ALTER TABLE tab MODIFY COLUMN c3 POLYGON NOT NULL;#ERROR: 1399 - XAER_RMFAIL: The command cannot be executed when global transaction is in the  ACTIVE state
      SELECT * FROM t1  WHERE c2 BETWEEN '1971-01-01 00:00:01' AND '2038-01-09 03:14:07' ORDER BY c1,c2 DESC;#ERROR: 1054 - Nepoznata kolona 'c2' u 'where clause'
      insert into ti (id) values (1) on duplicate key update y = 0, z = 42;#ERROR: 1792 - Cannot execute statement in a READ ONLY transaction
      SET @@session.default_master_connection = '1234-5678';#NOERROR
      SELECT ST_NUMINTERIORRINGS(1);#ERROR: 4079 - Illegal parameter data type int for operation 'st_numinteriorrings'
      change master to IGNORE_SERVER_IDS= ();#NOERROR
      SET @@GLOBAL.replicate_wild_ignore_table="";#NOERROR
      SELECT SLEEP(3);
      

      Likely quite a few items in this testcase are superfluous like the initial DROP DATABASE transforms statement etc. And please note that this testcase leads to malloc being mentioned in the error log only, not to the stack shown above. From memory, I setup reducer to reduce based on the MySQL client: I expect this issue to be reproducible in high repetition parallel MTR runs with minor modifications if required for MTR. The same applies for the 8K SQL testcase (MDEV-23657_8K.sql).

      For reference, I have also uploaded the original SQL trace (88K lines) as MDEV-23657_FULL_ORIGINAL.sql.tar.gz and the full set file (ref first comment below) also includes all in-between stages (_out files, the more _out's the more reduced)

      MDEV-22706 has a slightly similar stacktrace, though I doubt it's related, and it was previously fixed.

      Attachments

        Issue Links

          Activity

            Roel Roel Van de Paar added a comment - - edited

            Attempting to run the original (shorter) testcase under ASAN on various versions including 10.4 yield no results.

            Roel Roel Van de Paar added a comment - - edited Attempting to run the original (shorter) testcase under ASAN on various versions including 10.4 yield no results.
            Roel Roel Van de Paar added a comment - - edited

            However, running the longer testcase under ASAN did produce a number of ASAN issues, reported in the error log.

            sujatha.sivakumar I uploaded the error log. Would you please have a look to see if any of the issues in there clearly related to the SIGSEGV? If so, I will reduce towards such a particular issue. All I need is a specific string from the ASAN trace to validate against, for example 'Query_arena::set_query_arena' etc.

            It looks like most of the issues (like the Query_arena) are known, i.e. MDEV-22737.

            Roel Roel Van de Paar added a comment - - edited However, running the longer testcase under ASAN did produce a number of ASAN issues, reported in the error log. sujatha.sivakumar I uploaded the error log. Would you please have a look to see if any of the issues in there clearly related to the SIGSEGV? If so, I will reduce towards such a particular issue. All I need is a specific string from the ASAN trace to validate against, for example 'Query_arena::set_query_arena' etc. It looks like most of the issues (like the Query_arena) are known, i.e. MDEV-22737 .
            Roel Roel Van de Paar added a comment - - edited

            Hi sujatha.sivakumar. Could you have a look at the ASAN issues seen to see if there is any clearly related to the SIGSEGV please? If so, I can reduce towards that. Thank you.

            Roel Roel Van de Paar added a comment - - edited Hi sujatha.sivakumar . Could you have a look at the ASAN issues seen to see if there is any clearly related to the SIGSEGV please? If so, I can reduce towards that. Thank you.

            Hello Roel

            Thank you for the details.

            The SIGSEGV is caused during the cleanup of 'QUERY_PROFILE' objects. Upon enabling 'set profiling=1' a history of executed queries is maintained in a queue. By default 'profiling_history_size=15'. When number of query elements within history queue execede this size, elements are popped from the history queue.

                /* Maintain the history size. */
                while (history.elements > thd->variables.profiling_history_size)
                  delete history.pop();
            

            When history.elements=16 and thd->variables.profiling_history_size=15 then element is popped from queue as shown below.

            #0  malloc_size_and_flag (p=0x7fd93c013a58, is_thread_specific=0x7fd99b5b3277 "") at /home/sujatha/bug_repo/test-10.4/mysys/my_malloc.c:43
            #1  0x0000559ebda16779 in my_free (ptr=0x7fd93c013a58) at /home/sujatha/bug_repo/test-10.4/mysys/my_malloc.c:213
            #2  0x0000559ebce7649b in Queue<QUERY_PROFILE>::pop (this=0x7fd93c0045d0) at /home/sujatha/bug_repo/test-10.4/sql/sql_profile.h:136
            #3  0x0000559ebce7493c in PROFILING::finish_current_query_impl (this=0x7fd93c0045a8)
                at /home/sujatha/bug_repo/test-10.4/sql/sql_profile.cc:391
            #4  0x0000559ebcbf9006 in PROFILING::finish_current_query (this=0x7fd93c0045a8) at /home/sujatha/bug_repo/test-10.4/sql/sql_profile.h:302
            #5  0x0000559ebcce110c in dispatch_command (command=COM_QUERY, thd=0x7fd93c000ce8, packet=0x7fd93c00a129 "", packet_length=25, 
                is_com_multi=false, is_next_command=false) at /home/sujatha/bug_repo/test-10.4/sql/sql_parse.cc:2483
            #6  0x0000559ebccdd621 in do_command (thd=0x7fd93c000ce8) at /home/sujatha/bug_repo/test-10.4/sql/sql_parse.cc:1373
            #7  0x0000559ebce6f90a in do_handle_one_connection (connect=0x559ec0e1a1e8) at /home/sujatha/bug_repo/test-10.4/sql/sql_connect.cc:1412
            #8  0x0000559ebce6f659 in handle_one_connection (arg=0x559ec0e1a1e8) at /home/sujatha/bug_repo/test-10.4/sql/sql_connect.cc:1316
            #9  0x0000559ebd8d0300 in pfs_spawn_thread (arg=0x559ec0d549e8) at /home/sujatha/bug_repo/test-10.4/storage/perfschema/pfs.cc:1869
            #10 0x00007fd9a71cd6db in start_thread (arg=0x7fd99b5b4700) at pthread_create.c:463
            #11 0x00007fd9a636871f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            The SIGSEGV seems to be caused because of an invalid QUERY_PROFILE item being freed.

            I looked into the master.err file (Excluded Query_arena), I could not find any relation between ASAN stack traces and current SIGSEGV.

            sujatha.sivakumar Sujatha Sivakumar (Inactive) added a comment - - edited Hello Roel Thank you for the details. The SIGSEGV is caused during the cleanup of 'QUERY_PROFILE' objects. Upon enabling 'set profiling=1' a history of executed queries is maintained in a queue. By default 'profiling_history_size=15'. When number of query elements within history queue execede this size, elements are popped from the history queue. /* Maintain the history size. */ while (history.elements > thd->variables.profiling_history_size) delete history.pop(); When history.elements=16 and thd->variables.profiling_history_size=15 then element is popped from queue as shown below. #0 malloc_size_and_flag (p=0x7fd93c013a58, is_thread_specific=0x7fd99b5b3277 "") at /home/sujatha/bug_repo/test-10.4/mysys/my_malloc.c:43 #1 0x0000559ebda16779 in my_free (ptr=0x7fd93c013a58) at /home/sujatha/bug_repo/test-10.4/mysys/my_malloc.c:213 #2 0x0000559ebce7649b in Queue<QUERY_PROFILE>::pop (this=0x7fd93c0045d0) at /home/sujatha/bug_repo/test-10.4/sql/sql_profile.h:136 #3 0x0000559ebce7493c in PROFILING::finish_current_query_impl (this=0x7fd93c0045a8) at /home/sujatha/bug_repo/test-10.4/sql/sql_profile.cc:391 #4 0x0000559ebcbf9006 in PROFILING::finish_current_query (this=0x7fd93c0045a8) at /home/sujatha/bug_repo/test-10.4/sql/sql_profile.h:302 #5 0x0000559ebcce110c in dispatch_command (command=COM_QUERY, thd=0x7fd93c000ce8, packet=0x7fd93c00a129 "", packet_length=25, is_com_multi=false, is_next_command=false) at /home/sujatha/bug_repo/test-10.4/sql/sql_parse.cc:2483 #6 0x0000559ebccdd621 in do_command (thd=0x7fd93c000ce8) at /home/sujatha/bug_repo/test-10.4/sql/sql_parse.cc:1373 #7 0x0000559ebce6f90a in do_handle_one_connection (connect=0x559ec0e1a1e8) at /home/sujatha/bug_repo/test-10.4/sql/sql_connect.cc:1412 #8 0x0000559ebce6f659 in handle_one_connection (arg=0x559ec0e1a1e8) at /home/sujatha/bug_repo/test-10.4/sql/sql_connect.cc:1316 #9 0x0000559ebd8d0300 in pfs_spawn_thread (arg=0x559ec0d549e8) at /home/sujatha/bug_repo/test-10.4/storage/perfschema/pfs.cc:1869 #10 0x00007fd9a71cd6db in start_thread (arg=0x7fd99b5b4700) at pthread_create.c:463 #11 0x00007fd9a636871f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 The SIGSEGV seems to be caused because of an invalid QUERY_PROFILE item being freed. I looked into the master.err file (Excluded Query_arena), I could not find any relation between ASAN stack traces and current SIGSEGV.

            I did some more testing and could not reproduce this issue.
            It is presumed fixed (in whole or in part) by the patch in MDEV-23534.
            It it occurs again, in the same form, this ticket can be re-opened.

            Roel Roel Van de Paar added a comment - I did some more testing and could not reproduce this issue. It is presumed fixed (in whole or in part) by the patch in MDEV-23534 . It it occurs again, in the same form, this ticket can be re-opened.

            People

              sujatha.sivakumar Sujatha Sivakumar (Inactive)
              Roel Roel Van de Paar
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.