[MCOL-5748] Remove strict crash Not SM must set the cluster into read-only if it encounters a serious issue in runtime - Jira

Details

Type: New Feature
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 23.02.11
Component/s: Storage Manager
Labels:
None
Environment:
CS 23.02.8 Rhel 8, On-premise S3

Description

The crash shown in the trace below was not actually fixed at its root, as the following developer comment explains. What was delivered in the 23.02.11 StorageManager was the removal of an overly strict crash:

My actual hypothesis is that there were competing commands at SM start. These competing SM commands created a phantom PrefixCache record that doesn't have a corresponding file. This caused the PrefixCache code to crash because the code enforces this StorageManager internal state invariant.

IMHO the invariant is too strict so we removed the crash.

Original Ask
Root-cause research showed that the S3-based installation had run out of disk space. After the shortage was addressed, the MCS cluster was not restarted but kept running. The internal state of StorageManager had nevertheless become inconsistent, and this inconsistency later surfaced as an SM crash.
MCS must become stricter in case of a disk-space (or similar resource) shortage: it must switch to read-only, explicitly signaling that manual intervention is required.
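
To make the requested behavior concrete, here is a minimal sketch of a latch that trips into read-only when free space drops below a threshold and stays tripped until an operator clears it. `ReadOnlyGuard`, its methods, and the threshold are hypothetical names for illustration only; this is not how StorageManager or MCS implements read-only mode.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <system_error>

// Hypothetical guard, not part of ColumnStore: latches into read-only on a space shortage.
class ReadOnlyGuard
{
 public:
  explicit ReadOnlyGuard(std::uintmax_t minFreeBytes) : minFreeBytes_(minFreeBytes) {}

  // Pure decision step, kept separate so it can be exercised without a real disk.
  // Once tripped, the guard stays read-only even if space recovers later.
  bool update(std::uintmax_t availableBytes)
  {
    if (availableBytes < minFreeBytes_)
      readOnly_.store(true);
    return readOnly_.load();
  }

  // Query the filesystem holding `path` and feed the result into update().
  // A failed query is treated as a shortage (a conservative assumption).
  bool check(const std::filesystem::path& path)
  {
    std::error_code ec;
    const std::filesystem::space_info info = std::filesystem::space(path, ec);
    if (ec)
    {
      readOnly_.store(true);
      return true;
    }
    return update(info.available);
  }

  bool isReadOnly() const { return readOnly_.load(); }

  // Manual intervention: only an operator action resets the latch.
  void clear() { readOnly_.store(false); }

 private:
  std::uintmax_t minFreeBytes_;
  std::atomic<bool> readOnly_{false};
};
```

The latch is deliberate: in the incident described above, the shortage was resolved but the inconsistent state remained, so recovering free space alone should not silently return the system to read-write.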

Crash Trace

Date/time: 2024-05-13 10:43:23

Signal: 6

/lib64/libstoragemanager.so(_Z12fatalHandleri+0x139)[0x7fd7a6623399]

/usr/bin/StorageManager(+0x80d1)[0x56189513d0d1]

/lib64/libpthread.so.0(+0x12ce0)[0x7fd7a5db0ce0]

/lib64/libc.so.6(gsignal+0x10f)[0x7fd7a5082a9f]

/lib64/libc.so.6(abort+0x127)[0x7fd7a5055e05]

/lib64/libc.so.6(+0x21cd9)[0x7fd7a5055cd9]

/lib64/libc.so.6(+0x473f6)[0x7fd7a507b3f6]

/lib64/libstoragemanager.so(_ZN14storagemanager11PrefixCache10_makeSpaceEm+0x45f)[0x7fd7a661fc6f]

/lib64/libstoragemanager.so(_ZN14storagemanager11PrefixCache11doneReadingERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x97)[0x7fd7a661ff07]

/lib64/libstoragemanager.so(_ZN14storagemanager13IOCoordinator4readEPKcPhlm+0xcc0)[0x7fd7a65bdea0]

/lib64/libstoragemanager.so(_ZN14storagemanager8ReadTask3runEv+0x129)[0x7fd7a65b15b9]

/lib64/libstoragemanager.so(_ZN14storagemanager11ProcessTaskclEv+0xc4)[0x7fd7a65b1184]

/lib64/libstoragemanager.so(_ZN14storagemanager10ThreadPool15_processingLoopERN5boost11unique_lockINS1_5mutexEEE+0x4f8)[0x7fd7a65b3c28]

/lib64/libstoragemanager.so(_ZN14storagemanager10ThreadPool14processingLoopEv+0x34)[0x7fd7a65b3e04]

/lib64/libstoragemanager.so(+0xdc587)[0x7fd7a6625587]

/lib64/libpthread.so.0(+0x81cf)[0x7fd7a5da61cf]

/lib64/libc.so.6(clone+0x43)[0x7fd7a506ddd3]
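
The frame names in the trace are Itanium-ABI mangled symbols. As a reading aid, the sketch below decodes them with `abi::__cxa_demangle`, which is available with GCC and Clang; the helper name `demangle` is an assumption for illustration (the `c++filt` tool does the same job on the command line).

```cpp
#include <cxxabi.h>  // GCC/Clang-specific: abi::__cxa_demangle
#include <cassert>
#include <cstdlib>
#include <memory>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol, or the input unchanged on failure.
std::string demangle(const std::string& mangled)
{
  int status = 0;
  // __cxa_demangle allocates the result with malloc, so it must be released with free.
  std::unique_ptr<char, decltype(&std::free)> out(
      abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status), &std::free);
  return (status == 0 && out) ? std::string(out.get()) : mangled;
}
```

For example, `demangle("_ZN14storagemanager11PrefixCache10_makeSpaceEm")` yields `storagemanager::PrefixCache::_makeSpace(unsigned long)`, the frame in which the assertion fired.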

Crash Trace Analyzer

/usr/src/debug/MariaDB-/src_0/storage/columnstore/columnstore/storage-manager/src/PrefixCache.cpp:474

Code

    while (it != lru.end())
    {
      // make sure it's not currently being read or being flushed by another _makeSpace() call
      if ((doNotEvict.find(it) == doNotEvict.end()) && (toBeDeleted.find(it) == toBeDeleted.end()))
        break;
      ++it;
    }

    if (it == lru.end())
    {
      // nothing can be deleted right now
      return;
    }

    // ran into this a couple times, still happens as of commit 948ee1aa5
    // BT: made this more visible in logging.
    //     likely related to MCOL-3499 and lru containing double entries.
    if (!bf::exists(cachePrefix / *it))
      logger->log(LOG_WARNING, "PrefixCache::makeSpace(): doesn't exist, %s/%s", cachePrefix.string().c_str(),
                  ((string)(*it)).c_str());

    assert(bf::exists(cachePrefix / *it));  // <------ FAIL HERE

    /*
        tell Synchronizer that this key will be evicted
        delete the file
        remove it from our structs
        update current size
    */
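
The description says the overly strict crash was removed but does not show the delivered change. As an illustration only, the sketch below shows one way an eviction step can tolerate a cache record with no backing file: log it, drop the stale bookkeeping entry, and carry on. `TinyCache` and `evictOne` are hypothetical, simplified stand-ins that use `std::filesystem` instead of the `bf` alias above; this is not the actual 23.02.11 patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <list>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical, simplified stand-in for the cache bookkeeping; not the real PrefixCache.
struct TinyCache
{
  fs::path prefix;
  std::list<std::string> lru;  // least recently used key at the front
  std::uintmax_t currentSize = 0;

  // Evict the oldest entry. A record without a backing file is logged and dropped
  // instead of aborting the process. Returns false if there was nothing to evict.
  bool evictOne()
  {
    if (lru.empty())
      return false;

    const fs::path victim = prefix / lru.front();
    std::error_code ec;
    if (!fs::exists(victim, ec))
    {
      std::cerr << "evictOne(): phantom record, no file at " << victim << '\n';
      lru.pop_front();  // drop the stale entry; the accounted size is left unchanged
      return true;
    }

    const std::uintmax_t bytes = fs::file_size(victim, ec);
    const bool sizeKnown = !ec;
    fs::remove(victim, ec);
    if (sizeKnown)
      currentSize -= std::min(bytes, currentSize);  // clamp so the counter cannot underflow
    lru.pop_front();
    return true;
  }
};
```

Leaving the size counter untouched for a phantom record is a design choice of this sketch: nothing is known about the missing file's size, so the counter may stay slightly high rather than being corrected by a guess.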

People

Assignee: Roman

Reporter: Allen Herrera

Votes: 0

Watchers: 4

Dates

Created: 2024-05-14 14:09

Updated: 2025-04-23 18:09

Resolved: 2024-09-10 20:34
