MariaDB Server / MDEV-40010

SIGSEGV in _ma_read_block_record2 during INFORMATION_SCHEMA query while Galera node enters GATHER/shutdown

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.11.18
    • Fix Version/s: None
    • Component/s: Galera
    • Labels: None
    • Environment: Debian bookworm amd64

    Description

      I experienced a crash of one of my servers (let's call it "delta") in a 6-node Galera cluster. A different machine, "gamma", hung for 3.3 seconds in what I believe was THP compaction. The other five machines in the cluster gave up on gamma. Delta's MariaDB instance gave up and then crashed completely.

      Claude helped me analyze the attached stack trace and generated the analysis below. I hope it's helpful.

      ------------------------------
      Summary: MariaDB crashes with SIGSEGV (signal 11) in _ma_read_block_record2() → memset() when an INFORMATION_SCHEMA.COLUMNS query is executing concurrently with a Galera view change that triggers wsrep_shutdown().

      Crash location: storage/maria/ma_blockrec.c:4887 — the extent.extent pointer is 0x40 (invalid), causing a segfault in memset called from _ma_read_block_record2.

      Stack (crashing thread):

      __memset_avx2_unaligned_erms
      _ma_read_block_record2 (ma_blockrec.c:4887)
      _ma_scan_block_record (ma_blockrec.c:5506)
      handler::ha_rnd_next (handler.cc:3613)
      rr_sequential (records.cc:519)
      sub_select (sql_select.cc:23074)
      ...
      wsrep_mysql_parse (sql_parse.cc:8031)

      Query:

      SELECT COLUMN_NAME, COLUMN_COMMENT, COLUMN_TYPE
      FROM INFORMATION_SCHEMA.COLUMNS
      WHERE TABLE_SCHEMA = 'atoz'
        AND TABLE_NAME = 'c_circ_rules'
        AND COLUMN_COMMENT LIKE '%flags="%'

      Context: The main thread (Thread 11) was executing wsrep_stop_replication_common() → wsrep_shutdown() from mysqld_main() at the time of the crash. This was triggered by a Galera GATHER phase after another node in the cluster became unresponsive. The INFORMATION_SCHEMA query was mid-scan of an Aria temporary table when the Galera state transition appears to have invalidated or freed underlying Aria structures.

      Impact: This crash escalated a recoverable 2-node Galera event into a full 6-node cluster outage, as the crash prevented the remaining nodes from completing the GATHER consensus needed to maintain a Primary Component.

      How to reproduce: Difficult to reproduce on demand. Requires an INFORMATION_SCHEMA query executing at the exact moment a Galera view change triggers wsrep shutdown. In our case, the view change was caused by a peer node stalling for ~3.3 seconds due to THP compaction.
      ------------------------------
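Since the report attributes the peer stall to THP compaction, it may help to record each node's THP mode alongside the backtrace. The sketch below is an illustration only: the function names `thp_parse_active` and `thp_read_active` are made up for this example, and it assumes the usual Linux sysfs format in which the active mode is the bracketed token (for example `always [madvise] never`).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed token from a line such as "always [madvise] never"
   into out. Returns 0 on success, -1 if there is no bracketed token or it
   does not fit in out. (Hypothetical helper, not part of any library.) */
int thp_parse_active(const char *line, char *out, size_t out_len)
{
  assert(line != NULL && out != NULL);
  const char *open = strchr(line, '[');
  const char *close = open ? strchr(open, ']') : NULL;
  if (!close)
    return -1;
  size_t len = (size_t)(close - open - 1);
  if (len == 0 || len >= out_len)
    return -1;
  memcpy(out, open + 1, len);
  out[len] = '\0';
  return 0;
}

/* Read the first line of one THP sysfs file and extract the active mode.
   Returns -1 if the file cannot be opened or read. */
int thp_read_active(const char *path, char *out, size_t out_len)
{
  char line[128];
  FILE *f = fopen(path, "r");
  if (!f)
    return -1;
  char *got = fgets(line, sizeof line, f);
  fclose(f);
  return got ? thp_parse_active(line, out, out_len) : -1;
}
```

On a Linux host, calling `thp_read_active("/sys/kernel/mm/transparent_hugepage/enabled", buf, sizeof buf)` and the same for `.../transparent_hugepage/defrag` would give the two settings worth noting. This only records configuration; it does not confirm that compaction caused the 3.3-second stall.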

      Me again. I also asked it to suggest a fix.

      It said:
      ------------------------------
      Proposed Patch (Defensive)
      I'd include this in the JIRA as a suggested fix, not a formal patch:

      --- a/storage/maria/ma_blockrec.c
      +++ b/storage/maria/ma_blockrec.c
      @@ -XXXX,6 +XXXX,7 @@
         else
         {
           cur_row->extents_count= 0;
      +    extent.extent= NULL;
           (*cur_row->tail_positions)= 0;
           extent.page_count= 0;
           extent.extent_count= 1;

      And in read_next_extent():

      @@ -XXXX,6 +XXXX,8 @@
         if (!extent->page_count)
         {
           uint page_count;
      +    if (!extent->extent)
      +      goto crashed;
           if (!--extent->extent_count)
             goto crashed;
           extent->extent+= ROW_EXTENT_SIZE;

      This prevents the worst case (dereferencing uninitialized stack garbage) but doesn't address the root cause, which is the race between wsrep_shutdown() and active queries. The wsrep shutdown path should kill or wait for in-progress INFORMATION_SCHEMA scans before proceeding, or the Aria scan loop should check for shutdown/killed status between rows.

      Include this analysis in your JIRA ticket along with the backtrace. The MariaDB developers will know whether to fix it at the Aria layer, the wsrep layer, or both.
      ------------------------------
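To make the two ideas in the suggested fix concrete, here is a minimal standalone sketch. It is not MariaDB code: every `demo_` name, the cursor struct, and the `DEMO_EXTENT_SIZE` value are hypothetical stand-ins chosen for illustration. The first function mirrors the proposed NULL guard ahead of the extent-count decrement; the second shows a scan loop that polls a shutdown flag between rows.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_EXTENT_SIZE 7 /* hypothetical; stands in for ROW_EXTENT_SIZE */

/* Hypothetical cursor; not the real Aria cursor type. */
typedef struct {
  const unsigned char *extent; /* current extent entry, NULL if the row has none */
  unsigned extent_count;       /* entries left, counting the current one */
} demo_extent_cursor;

/* Advance to the next extent entry. Returns 0 on success and -1 when the
   cursor is unusable, mirroring the "goto crashed" exits in the patch. */
int demo_next_extent(demo_extent_cursor *cur)
{
  assert(cur != NULL);
  if (!cur->extent)         /* the added guard: never advance a NULL pointer */
    return -1;
  if (!--cur->extent_count) /* no entries left */
    return -1;
  cur->extent += DEMO_EXTENT_SIZE;
  return 0;
}

/* Set by the (hypothetical) shutdown path and polled by scans. */
static atomic_bool demo_shutdown_requested;

void demo_request_shutdown(void)
{
  atomic_store(&demo_shutdown_requested, true);
}

/* Visit up to row_count rows, checking the shutdown flag before each one.
   Returns the number of rows visited. */
size_t demo_scan(size_t row_count)
{
  size_t visited = 0;
  while (visited < row_count && !atomic_load(&demo_shutdown_requested))
    visited++; /* a real scan would read the row here */
  return visited;
}
```

The guard only turns a wild-pointer dereference into a clean error return; it relies on the pointer having been set to NULL in the first place, which is what the first hunk of the suggested patch does. The flag check narrows the race window but does not close it, so the shutdown path would still need to wait for scans that are already inside a row read.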

      Attachments

        1. gdb.txt
          148 kB
          Xan Charbonnet

          People

            Assignee: Seppo Jaakola (seppo)
            Reporter: Xan Charbonnet (xan@biblionix.com)