Details
-
New Feature
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
None
-
None
-
CS 23.02.8 Rhel 8, On-premise S3
Description
The crash shown in the trace below was not actually fixed, as the developer comment below explains. What was delivered in the 23.02.11 storagemanager was the removal of an overly strict check that caused the crash.
My actual hypothesis is that competing commands ran during SM start. These competing SM commands created a phantom PrefixCache record that has no corresponding file. This caused the PrefixCache code to crash because the code enforces this StorageManager internal-state invariant.
IMHO the invariant is too strict, so we removed the crash.
Original Ask
The research on the root cause reveals that the S3-based installation ran out of disk space. After the shortage was addressed, the MCS cluster was not restarted but nevertheless survived. The internal state of StorageManager had become inconsistent, though. This inconsistency later backfired in the form of an SM crash.
MCS must become stricter in case of a disk-space (or similar resource) shortage: it must become read-only, explicitly signaling the need for manual intervention.
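The requested behavior could be sketched roughly as follows. This is an illustration only, not StorageManager code: the `SpaceGuard` class, its threshold parameter, and the `operatorReset()` method are hypothetical names, and the sketch assumes C++17 `std::filesystem`. The idea is that the read-only flag latches once a shortage is seen and is cleared only by an explicit operator action, not when space becomes available again.

```cpp
#include <atomic>
#include <cstdint>
#include <filesystem>
#include <system_error>
#include <utility>

// Hypothetical guard, not part of StorageManager: switches to read-only when
// free space under a path drops below a threshold, and stays read-only until
// an operator explicitly resets it.
class SpaceGuard
{
 public:
  SpaceGuard(std::filesystem::path path, std::uintmax_t minFreeBytes)
   : path_(std::move(path)), minFreeBytes_(minFreeBytes)
  {
  }

  // Re-checks free space. Returns true if writes are still allowed.
  bool check()
  {
    std::error_code ec;
    const std::filesystem::space_info info = std::filesystem::space(path_, ec);
    // A failed query is treated like a shortage: refuse writes rather than guess.
    if (ec || info.available < minFreeBytes_)
      readOnly_.store(true);
    return !readOnly_.load();
  }

  bool readOnly() const
  {
    return readOnly_.load();
  }

  // Freeing disk space alone does not clear the flag; this call does.
  void operatorReset()
  {
    readOnly_.store(false);
  }

 private:
  std::filesystem::path path_;
  std::uintmax_t minFreeBytes_;
  std::atomic<bool> readOnly_{false};
};
```

A write path would call `check()` before accepting a write and reject the request while `readOnly()` is true. Latching the flag mirrors the incident described above, where the shortage was resolved without a restart and the inconsistent state surfaced only later.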
Crash Trace
Date/time: 2024-05-13 10:43:23
Signal: 6

/lib64/libstoragemanager.so(_Z12fatalHandleri+0x139)[0x7fd7a6623399]
/usr/bin/StorageManager(+0x80d1)[0x56189513d0d1]
/lib64/libpthread.so.0(+0x12ce0)[0x7fd7a5db0ce0]
/lib64/libc.so.6(gsignal+0x10f)[0x7fd7a5082a9f]
/lib64/libc.so.6(abort+0x127)[0x7fd7a5055e05]
/lib64/libc.so.6(+0x21cd9)[0x7fd7a5055cd9]
/lib64/libc.so.6(+0x473f6)[0x7fd7a507b3f6]
/lib64/libstoragemanager.so(_ZN14storagemanager11PrefixCache10_makeSpaceEm+0x45f)[0x7fd7a661fc6f]
/lib64/libstoragemanager.so(_ZN14storagemanager11PrefixCache11doneReadingERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x97)[0x7fd7a661ff07]
/lib64/libstoragemanager.so(_ZN14storagemanager13IOCoordinator4readEPKcPhlm+0xcc0)[0x7fd7a65bdea0]
/lib64/libstoragemanager.so(_ZN14storagemanager8ReadTask3runEv+0x129)[0x7fd7a65b15b9]
/lib64/libstoragemanager.so(_ZN14storagemanager11ProcessTaskclEv+0xc4)[0x7fd7a65b1184]
/lib64/libstoragemanager.so(_ZN14storagemanager10ThreadPool15_processingLoopERN5boost11unique_lockINS1_5mutexEEE+0x4f8)[0x7fd7a65b3c28]
/lib64/libstoragemanager.so(_ZN14storagemanager10ThreadPool14processingLoopEv+0x34)[0x7fd7a65b3e04]
/lib64/libstoragemanager.so(+0xdc587)[0x7fd7a6625587]
/lib64/libpthread.so.0(+0x81cf)[0x7fd7a5da61cf]
/lib64/libc.so.6(clone+0x43)[0x7fd7a506ddd3]
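The frames above carry mangled C++ symbol names. As a reading aid, here is a minimal sketch of how they can be demangled, assuming a GCC or Clang toolchain that provides `abi::__cxa_demangle` in `<cxxabi.h>`. The `demangle` helper is a hypothetical name introduced for this example.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Hypothetical helper: demangles an Itanium-ABI symbol name and returns the
// input unchanged if demangling fails.
std::string demangle(const std::string& mangled)
{
  int status = 0;
  // With a null output buffer, __cxa_demangle allocates the result with malloc.
  char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || out == nullptr)
    return mangled;
  std::string result(out);
  std::free(out);
  return result;
}
```

Applied to the crashing frame, `demangle("_ZN14storagemanager11PrefixCache10_makeSpaceEm")` yields `storagemanager::PrefixCache::_makeSpace(unsigned long)`, which matches the `PrefixCache.cpp` location reported by the analyzer.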
Crash Trace Analyzer
/usr/src/debug/MariaDB-/src_0/storage/columnstore/columnstore/storage-manager/src/PrefixCache.cpp:474
Code
while (it != lru.end())
{
  // make sure it's not currently being read or being flushed by another _makeSpace() call
  if ((doNotEvict.find(it) == doNotEvict.end()) && (toBeDeleted.find(it) == toBeDeleted.end()))
    break;
  ++it;
}

if (it == lru.end())
{
  // nothing can be deleted right now
  return;
}

// ran into this a couple times, still happens as of commit 948ee1aa5
// BT: made this more visable in logging.
// likely related to MCOL-3499 and lru containing double entries.
if (!bf::exists(cachePrefix / *it))
  logger->log(LOG_WARNING, "PrefixCache::makeSpace(): doesn't exist, %s/%s", cachePrefix.string().c_str(),
              ((string)(*it)).c_str());

assert(bf::exists(cachePrefix / *it));  // <------ FAIL HERE
/*
    tell Synchronizer that this key will be evicted
    delete the file
    remove it from our structs
    update current size
*/