  1. MariaDB ColumnStore
  2. MCOL-5748

Remove strict crash (originally: SM must set the cluster into read-only mode if it encounters a serious issue at runtime)


Details

    • New Feature
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • None
    • 23.02.11
    • Storage Manager
    • None
    • CS 23.02.8 Rhel 8, On-premise S3

    Description

      The crash shown in the trace below was not actually fixed, as the developer comment below explains. What was delivered in StorageManager 23.02.11 was the removal of an overly strict crash.

      My current hypothesis is that competing commands ran during SM startup. These competing SM commands created a phantom PrefixCache record that has no corresponding file. This caused the PrefixCache code to crash because the code enforces this StorageManager internal-state invariant.
       
      IMHO the invariant is too strict so we removed the crash.
      

      Original Ask
      Research into the root cause revealed that the S3-based installation had suffered from a lack of disk space. After the shortage was addressed, the MCS cluster was not restarted but survived nevertheless. The internal state of StorageManager had become inconsistent, though, and this inconsistency later backfired in the form of an SM crash.
      MCS must become stricter in case of a disk-space (or similar resource) shortage: it must become read-only, explicitly signaling the need for manual intervention.
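
The original ask can be illustrated with a minimal sketch. It is not what was delivered in 23.02.11 and it is not StorageManager code: `ResourceGuard`, `minFreeBytes`, `update` and `check` are hypothetical names, and the threshold is an assumed parameter. The only real library call is `std::filesystem::space` (C++17). The flag is sticky on purpose, so the read-only state persists until someone intervenes, even if free space recovers on its own.

```cpp
#include <atomic>
#include <cstdint>
#include <filesystem>
#include <system_error>

// Hypothetical guard, not part of StorageManager: latches a read-only flag
// once available disk space drops below an assumed threshold.
struct ResourceGuard
{
  std::uintmax_t minFreeBytes;        // assumed threshold, chosen by the operator
  std::atomic<bool> readOnly{false};  // sticky: never cleared automatically

  explicit ResourceGuard(std::uintmax_t minFree) : minFreeBytes(minFree) {}

  // Pure decision step: latch read-only when space is short.
  // Returns the current read-only state.
  bool update(std::uintmax_t availableBytes)
  {
    if (availableBytes < minFreeBytes)
      readOnly.store(true);
    return readOnly.load();
  }

  // Queries the filesystem that holds `dir` and feeds the result to update().
  // A failed query is treated as a shortage (0 bytes available).
  bool check(const std::filesystem::path& dir)
  {
    std::error_code ec;
    const std::filesystem::space_info si = std::filesystem::space(dir, ec);
    return update(ec ? 0 : si.available);
  }
};
```

A write path would call `check()` on the cache directory before accepting work and refuse writes while `readOnly` is set. Clearing the flag is deliberately left to an explicit operator action.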

      Crash Trace

      Date/time: 2024-05-13 10:43:23
      Signal: 6
       
      /lib64/libstoragemanager.so(_Z12fatalHandleri+0x139)[0x7fd7a6623399]
      /usr/bin/StorageManager(+0x80d1)[0x56189513d0d1]
      /lib64/libpthread.so.0(+0x12ce0)[0x7fd7a5db0ce0]
      /lib64/libc.so.6(gsignal+0x10f)[0x7fd7a5082a9f]
      /lib64/libc.so.6(abort+0x127)[0x7fd7a5055e05]
      /lib64/libc.so.6(+0x21cd9)[0x7fd7a5055cd9]
      /lib64/libc.so.6(+0x473f6)[0x7fd7a507b3f6]
      /lib64/libstoragemanager.so(_ZN14storagemanager11PrefixCache10_makeSpaceEm+0x45f)[0x7fd7a661fc6f]
      /lib64/libstoragemanager.so(_ZN14storagemanager11PrefixCache11doneReadingERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x97)[0x7fd7a661ff07]
      /lib64/libstoragemanager.so(_ZN14storagemanager13IOCoordinator4readEPKcPhlm+0xcc0)[0x7fd7a65bdea0]
      /lib64/libstoragemanager.so(_ZN14storagemanager8ReadTask3runEv+0x129)[0x7fd7a65b15b9]
      /lib64/libstoragemanager.so(_ZN14storagemanager11ProcessTaskclEv+0xc4)[0x7fd7a65b1184]
      /lib64/libstoragemanager.so(_ZN14storagemanager10ThreadPool15_processingLoopERN5boost11unique_lockINS1_5mutexEEE+0x4f8)[0x7fd7a65b3c28]
      /lib64/libstoragemanager.so(_ZN14storagemanager10ThreadPool14processingLoopEv+0x34)[0x7fd7a65b3e04]
      /lib64/libstoragemanager.so(+0xdc587)[0x7fd7a6625587]
      /lib64/libpthread.so.0(+0x81cf)[0x7fd7a5da61cf]
      /lib64/libc.so.6(clone+0x43)[0x7fd7a506ddd3]
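
The frame names in the trace are mangled C++ symbols. As a reading aid, the sketch below demangles one of them with `abi::__cxa_demangle` from `<cxxabi.h>`, which is available with GCC and Clang. The helper name `demangle` is only illustrative.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>   // std::free
#include <memory>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol, or the input
// unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled)
{
  int status = 0;
  // __cxa_demangle returns a malloc'ed buffer that the caller must free.
  std::unique_ptr<char, void (*)(void*)> out(
      abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status), std::free);
  return (status == 0 && out) ? std::string(out.get()) : mangled;
}
```

For example, `demangle("_ZN14storagemanager11PrefixCache10_makeSpaceEm")` yields `storagemanager::PrefixCache::_makeSpace(unsigned long)`, the frame in which the abort originates.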
      

      Crash Trace Analyzer

      /usr/src/debug/MariaDB-/src_0/storage/columnstore/columnstore/storage-manager/src/PrefixCache.cpp:474
      

      Code

          while (it != lru.end())
          {
            // make sure it's not currently being read or being flushed by another _makeSpace() call
            if ((doNotEvict.find(it) == doNotEvict.end()) && (toBeDeleted.find(it) == toBeDeleted.end()))
              break;
            ++it;
          }
          if (it == lru.end())
          {
            // nothing can be deleted right now
            return;
          }
       
          // ran into this a couple times, still happens as of commit 948ee1aa5
          // BT: made this more visable in logging.
          //     likely related to MCOL-3499 and lru containing double entries.
          if (!bf::exists(cachePrefix / *it))
            logger->log(LOG_WARNING, "PrefixCache::makeSpace(): doesn't exist, %s/%s", cachePrefix.string().c_str(),
                        ((string)(*it)).c_str());
          assert(bf::exists(cachePrefix / *it));  // <-- assertion fails here
          /*
              tell Synchronizer that this key will be evicted
              delete the file
              remove it from our structs
              update current size
          */
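
One way to tolerate a phantom entry instead of asserting can be sketched as follows. This is not the patch that shipped in 23.02.11, which this ticket does not show. `dropPhantomEntries` is a hypothetical helper. It uses `std::filesystem` and a plain `std::list<std::string>` in place of the real `bf` alias and LRU structures, and it ignores the `doNotEvict` and `toBeDeleted` sets and the cache size accounting.

```cpp
#include <cstddef>
#include <filesystem>
#include <list>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper, not part of StorageManager: walks an LRU list of cache
// keys and erases entries whose backing file is missing under cachePrefix.
// Returns the number of phantom entries dropped.
std::size_t dropPhantomEntries(std::list<std::string>& lru, const fs::path& cachePrefix)
{
  std::size_t dropped = 0;
  for (auto it = lru.begin(); it != lru.end();)
  {
    std::error_code ec;
    // Non-throwing overload: drop the entry only when the path is definitely
    // absent; if the status query itself failed (ec set), keep the entry.
    if (!fs::exists(cachePrefix / *it, ec) && !ec)
    {
      it = lru.erase(it);  // erase() returns the next valid iterator
      ++dropped;
    }
    else
    {
      ++it;
    }
  }
  return dropped;
}
```

A caller would log the returned count as a warning, much as the existing `LOG_WARNING` line does, and then continue evicting from the cleaned list.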
      


            People

              drrtuy Roman
              allen.herrera Allen Herrera