  1. MariaDB ColumnStore
  2. MCOL-6325

PrimProc Crash: Null pointer dereference in TupleBPS::sendError() via uninitialized SBS msgBpp

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 23.02.18
    • None
    • None
    • None
    • MariaDB Enterprise Server: 10.6.15_10_23.02.18-1.el8.x86_64
      Component: MariaDB ColumnStore (`PrimProc`)

    Description

      Under heavy concurrent workloads, the `PrimProc` process aborts with an `Assertion 'px != 0' failed` error. The crash is caused by an unconditional null pointer dereference in the error-handling path of the `TupleBPS` class.

      Steps to Reproduce / Trigger
      1. Initiate a heavy workload of concurrent `cpimport` bulk loading jobs.
      2. At the same time, run large aggregation (`AGG`) `SELECT` queries that make heavy use of `DictScanJob` evaluations.
      3. Wait for resource contention or a transient timeout to force the system to report an error via `TupleBPS::sendError()`.
      4. `PrimProc` aborts immediately, taking the cluster offline.

      Root Cause Analysis (Code Level)
      The crash occurs in `storage/columnstore/columnstore/dbcon/joblist/tuple-bps.cpp` inside `void TupleBPS::sendError(uint16_t status)`.

      Starting at line 1558:

      1558: void TupleBPS::sendError(uint16_t status)
      1559: {
      1560:   SBS msgBpp;
      1561:   fBPP->setCount(1);
      1562:   fBPP->setStatus(status);
      1563:   fBPP->runErrorBPP(*msgBpp);

      On line 1560, `msgBpp` is default-constructed (`SBS msgBpp;`), so it is an empty shared pointer to `messageqcpp::ByteStream` that owns no object. Nothing is allocated for it before it is dereferenced (`*msgBpp`) on line 1563. The dereference trips the `px != 0` assertion inside `boost::shared_ptr`, which aborts the `PrimProc` daemon.
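
The empty state of a default-constructed shared pointer can be illustrated with a minimal sketch. It assumes `std::shared_ptr` in place of `boost::shared_ptr` so that it builds without Boost, and `ByteStream` here is a hypothetical empty stand-in, not the real `messageqcpp::ByteStream`. The helper names are illustrative only.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for messageqcpp::ByteStream (assumption).
struct ByteStream {};

// ColumnStore's SBS is a boost::shared_ptr<ByteStream>; std::shared_ptr is
// substituted here only to keep the sketch free of the Boost dependency.
using SBS = std::shared_ptr<ByteStream>;

// Default construction, as on line 1560: no ByteStream is allocated.
SBS makeEmpty()
{
    return SBS();
}

// Construction with an allocation: the pointer owns a ByteStream.
SBS makeAllocated()
{
    return std::make_shared<ByteStream>();
}

// Guarded dereference. The assert mirrors the px != 0 check that
// boost::shared_ptr performs in operator* when assertions are enabled.
ByteStream& checkedDeref(const SBS& p)
{
    assert(p.get() != nullptr);
    return *p;
}
```

Calling `checkedDeref(makeEmpty())` would abort on the assertion, which is the same failure mode the report describes, while `checkedDeref(makeAllocated())` returns a valid reference.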

      Suggested Fix
      Properly allocate the `messageqcpp::ByteStream` object before attempting to dereference it inside `sendError()`.
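
One possible shape of the suggested fix is sketched below. This is not the actual patch: `ByteStream` and `FakeBPP` are hypothetical stand-ins for `messageqcpp::ByteStream` and the object behind `fBPP`, `std::shared_ptr` replaces `boost::shared_ptr`, and `sendErrorSketch` reproduces only the four statements quoted above, with whatever follows them in the real function omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for messageqcpp::ByteStream (assumption).
struct ByteStream
{
    std::vector<uint8_t> buf;
};

// std::shared_ptr stands in for the boost::shared_ptr behind the real SBS.
using SBS = std::shared_ptr<ByteStream>;

// Hypothetical stand-in for the object behind fBPP. It records the count and
// status and writes the status into the stream, low byte first.
struct FakeBPP
{
    uint32_t count = 0;
    uint16_t status = 0;

    void setCount(uint32_t c) { count = c; }
    void setStatus(uint16_t s) { status = s; }

    void runErrorBPP(ByteStream& bs) const
    {
        bs.buf.push_back(static_cast<uint8_t>(status & 0xFF));
        bs.buf.push_back(static_cast<uint8_t>(status >> 8));
    }
};

// Mirrors the quoted lines of TupleBPS::sendError(), with the allocation added.
SBS sendErrorSketch(FakeBPP& bpp, uint16_t status)
{
    SBS msgBpp(new ByteStream());  // allocate before any dereference
    bpp.setCount(1);
    bpp.setStatus(status);
    bpp.runErrorBPP(*msgBpp);      // safe: msgBpp now owns a ByteStream
    return msgBpp;
}
```

The only behavioral change relative to the quoted code is the first statement: the shared pointer is constructed with a freshly allocated `ByteStream`, so the dereference on the last quoted line has a valid target.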

      Customer Impact
      Critical / S2. In-flight queries are dropped and the crash causes a full production outage; the cluster must be restarted manually.

          People

            alexey.antipovsky Aleksei Antipovskii
            kyle.hutchinson Kyle Hutchinson
            Votes: 0
            Watchers: 5