Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Fixed
-
23.02.3
-
None
-
3-node HA cluster, Red Hat 8.7
Each node: 48 cores, 750 GB RAM
NFS shared storage
-
2023-8, 2023-10, 2023-11
Description
See the Developer Comments for the types of queries running and for logs.
Core dumps are enabled; a core will be shared once one is captured (8/24/2023: the segfault still hasn't recurred).
Queries currently hang every night, and the cluster has to be restarted before the nightly ETL (daily aggregation) can continue. The only red flag found so far is Crash Trace 1 below, which was seen once.
We need to:
1) prevent queries from hanging
2) prevent the segfault
3) have the cluster recover when a single PrimProc restarts
Crash Trace 1
Date/time: 2023-08-20 20:39:11
Signal: 11

/usr/bin/PrimProc(+0xb70f6)[0x556a0ed230f6]
/lib64/libpthread.so.0(+0x12cf0)[0x7fcaa8c5dcf0]
/lib64/libbrm.so(_ZN5boost9unordered13unordered_mapIiNS1_IjNS_9container6vectorIlNS_12interprocess9allocatorIlNS4_15segment_managerIcNS4_15rbtree_best_fitINS4_12mutex_familyENS4_10offset_ptrIvlmLm0EEELm0EEENS4_10iset_indexEEEEEvEENS_4hashIjEESt8equal_toIjENS5_ISt4pairIKjSF_ESD_EEEENSG_IiEESI_IiENS5_ISK_IKiSO_ESD_EEE4findERSR_+0x15b)[0x7fcaa943040b]
/lib64/libbrm.so(_ZN3BRM18ExtentMapIndexImpl14search2ndLayerERN5boost9unordered13unordered_mapIiNS3_IjNS1_9container6vectorIlNS1_12interprocess9allocatorIlNS6_15segment_managerIcNS6_15rbtree_best_fitINS6_12mutex_familyENS6_10offset_ptrIvlmLm0EEELm0EEENS6_10iset_indexEEEEEvEENS1_4hashIjEESt8equal_toIjENS7_ISt4pairIKjSH_ESF_EEEENSI_IiEESK_IiENS7_ISM_IKiSQ_ESF_EEEEi+0x49)[0x7fcaa9416ce9]
/lib64/libbrm.so(_ZN3BRM18ExtentMapIndexImpl4findEti+0x77)[0x7fcaa9417b57]
/lib64/libbrm.so(_ZN3BRM9ExtentMap10getExtentsEiRSt6vectorINS_7EMEntryESaIS2_EEbbb+0xe0)[0x7fcaa9423e10]
/lib64/libbrm.so(_ZN3BRM4DBRM10getExtentsEiRSt6vectorINS_7EMEntryESaIS2_EEbbb+0x23)[0x7fcaa9404883]
/lib64/libjoblist.so(_ZN7joblist15pDictionaryScanC1EiiRKN8execplan20CalpontSystemCatalog7ColTypeERKNS_7JobInfoE+0x32a)[0x7fcaaa17737a]
/lib64/libjoblist.so(+0x1652a7)[0x7fcaaa0db2a7]
/lib64/libjoblist.so(_ZN7joblist21JLF_ExecPlanToJobList8walkTreeEPN8execplan9ParseTreeERNS_7JobInfoE+0x212)[0x7fcaaa0dea42]
/lib64/libjoblist.so(_ZN7joblist21JLF_ExecPlanToJobList8walkTreeEPN8execplan9ParseTreeERNS_7JobInfoE+0x606)[0x7fcaaa0dee36]
/lib64/libjoblist.so(+0x1cac4f)[0x7fcaaa140c4f]
/lib64/libjoblist.so(_ZN7joblist12makeJobStepsEPN8execplan26CalpontSelectExecutionPlanERNS_7JobInfoERSt6vectorIN5boost10shared_ptrINS_7JobStepEEESaIS9_EESC_RSt3mapIiS9_St4lessIiESaISt4pairIKiS9_EEE+0x249)[0x7fcaaa143b89]
/lib64/libjoblist.so(+0x1cfcfd)[0x7fcaaa145cfd]
/lib64/libjoblist.so(_ZN7joblist14JobListFactory11makeJobListEPN8execplan20CalpontExecutionPlanEPNS_15ResourceManagerERK26PrimitiveServerThreadPoolsbb+0x62)[0x7fcaaa146142]
/usr/bin/PrimProc(+0xb2163)[0x556a0ed1e163]
/lib64/libthreadpool.so(_ZN10threadpool10ThreadPool11beginThreadEv+0x615)[0x7fcaa89f2ad5]
/usr/bin/PrimProc(+0xb8bd7)[0x556a0ed24bd7]
/lib64/libpthread.so.0(+0x81ca)[0x7fcaa8c531ca]
/lib64/libc.so.6(clone+0x43)[0x7fcaa7668e73]
Debugging Trace 1
LBID_tFindResult ExtentMapIndexImpl::find(const DBRootT dbroot, const OID_t oid)
{
  ExtentMapIndex& emIndex = *get();
  if (dbroot >= emIndex.size())
    return {};
  return search2ndLayer(emIndex[dbroot], oid);
}
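For illustration only, the two-layer lookup in Debugging Trace 1 can be pictured as a container indexed by dbroot whose elements map an OID to a list of extent identifiers. The sketch below is not the ColumnStore implementation: it assumes C++17, uses heap-allocated std::vector and std::unordered_map instead of Boost.Interprocess containers in shared memory, and the names TwoLayerIndex, insert and find are hypothetical. It shows one way to keep a reader from traversing the maps while a writer resizes or rehashes them, which is the general kind of hazard named in the titles of the linked issues.

```cpp
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the extent map index: layer 1 is indexed by
// dbroot, layer 2 maps an OID to the list of extent identifiers.
class TwoLayerIndex
{
 public:
  using ExtentIds = std::vector<int64_t>;

  // Writers take the lock exclusively because inserting may grow the
  // vector or rehash a map, which invalidates what readers are traversing.
  void insert(uint16_t dbroot, int32_t oid, int64_t extentId)
  {
    std::unique_lock<std::shared_mutex> guard(mutex_);
    if (dbroot >= layers_.size())
      layers_.resize(dbroot + 1);
    layers_[dbroot][oid].push_back(extentId);
  }

  // Readers hold the lock in shared mode for the whole lookup and return a
  // copy, so no reference into the containers escapes the critical section.
  ExtentIds find(uint16_t dbroot, int32_t oid) const
  {
    std::shared_lock<std::shared_mutex> guard(mutex_);
    if (dbroot >= layers_.size())
      return {};
    const auto& secondLayer = layers_[dbroot];
    const auto it = secondLayer.find(oid);
    return it == secondLayer.end() ? ExtentIds{} : it->second;
  }

 private:
  mutable std::shared_mutex mutex_;
  std::vector<std::unordered_map<int32_t, ExtentIds>> layers_;
};
```

Returning a copy costs an allocation per lookup; it is chosen here only so that the sketch never hands out a reference that could outlive the lock.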
Messages File 1
# On Node 1
Aug 20 20:39:44 atx-mdb101pl messagequeue[516171]: 44.604637 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic(): I/O error2.1: err = -1 e = 104: Connection reset by peer %%10%%
Aug 20 20:39:44 atx-mdb101pl env[516171]: DEC Caught EXCEPTION: InetStreamSocket::readToMagic(): I/O error2.1: err = -1 e = 104: Connection reset by peer

# On Node 2
Aug 20 20:39:44 atx-mdb102pl messagequeue[3758352]: 44.591925 |0|0|0| W 31 CAL0000: MessageQueueClient::write: error writing 16 bytes to IOSocket: sd: 24 inet: 10.224.140.32 port: 8620. Socket error was InetStreamSocket::write error: Broken pipe -- write from InetStreamSocket: sd: 24 inet: 10.224.140.32 port: 8620#012 %%10%%

# On Node 3
Aug 20 20:39:11 atx-mdb103pl systemd[1]: Started Process Core Dump (PID 1287119/UID 0).
Aug 20 20:39:11 atx-mdb103pl systemd-coredump[1287121]: Resource limits disable core dumping for process 792020 (PrimProc).
Aug 20 20:39:11 atx-mdb103pl systemd-coredump[1287121]: Process 792020 (PrimProc) of user 993 dumped core.
Workload
The customer ingests 3 TB of raw data a day (over 2.6 billion records) and batches the data for hourly import into each table. At night they aggregate all the hourly data into a daily table to summarize the data and reduce its footprint. This topic/query is split into 48 parts, both to optimize extent elimination based on date and lat/long search and to divide the data so that the GROUP BY calculation does not run out of memory, because the GROUP BY criteria have so many distinct values.
Parts 0 to 13 are loaded on the first node, parts 14 to 26 on the second node, and parts 27 to 47 on the third node. The parts are not equally distributed because they represent geographical areas of a map; imagine a lat/long grid.
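The part-to-node layout described here can be written down as a small lookup table. The following is a minimal sketch for illustration only; the names PartRange, kPartRanges, nodeForPart and partsOnNode are hypothetical and do not come from the customer's scripts. Only the three inclusive ranges are taken from the description.

```cpp
#include <array>

// Inclusive range of part numbers loaded on one node.
struct PartRange
{
  int node;
  int first;
  int last;
};

// Ranges as stated in the workload description.
const std::array<PartRange, 3> kPartRanges = {{
    {1, 0, 13},
    {2, 14, 26},
    {3, 27, 47},
}};

// Returns the node (1 to 3) that loads the given part, or -1 if the part
// number is outside 0..47.
int nodeForPart(int part)
{
  for (const auto& r : kPartRanges)
    if (part >= r.first && part <= r.last)
      return r.node;
  return -1;
}

// Returns how many parts are assigned to the given node, or 0 for an
// unknown node.
int partsOnNode(int node)
{
  for (const auto& r : kPartRanges)
    if (r.node == node)
      return r.last - r.first + 1;
  return 0;
}
```

With these ranges the nodes carry 14, 13 and 21 parts respectively, which makes the uneven distribution explicit.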

High-level example:
sudo -u mysql ${MCSMYSQL} -qs ${DATABASE} -N -e"${DAILY_SQL}" > ${TOPIC}_DAILY.tbl
sudo -u mysql ${CPIMPORT} -m 3 -j1337
Crash Trace 2
At the time of this crash, one query was running on node 2 and a couple of cpimport jobs were running across all nodes.
Date/time: 2023-08-29 08:34:04
Signal: 11

/usr/bin/PrimProc(+0xb70f6)[0x561b2fe610f6]
/lib64/libpthread.so.0(+0x12cf0)[0x7f0304045cf0]
/lib64/libbrm.so(_ZN3BRM9ExtentMap10findByLBIDEl+0x185)[0x7f0304801045]
/lib64/libbrm.so(_ZN3BRM9ExtentMap18getEmIdentsByLbidsERKN5boost9container6vectorIlvvEE+0x1e4)[0x7f0304801974]
/lib64/libbrm.so(_ZN3BRM9ExtentMap10getExtentsEiRSt6vectorINS_7EMEntryESaIS2_EEbbb+0x11c)[0x7f030480be4c]
/lib64/libbrm.so(_ZN3BRM4DBRM10getExtentsEiRSt6vectorINS_7EMEntryESaIS2_EEbbb+0x23)[0x7f03047ec883]
/lib64/libjoblist.so(_ZN7joblist8pColStepC2EiiRKN8execplan20CalpontSystemCatalog7ColTypeERKNS_7JobInfoE+0x41e)[0x7f03055583ae]
/lib64/libjoblist.so(+0x163f30)[0x7f03054c1f30]
/lib64/libjoblist.so(_ZN7joblist21JLF_ExecPlanToJobList8walkTreeEPN8execplan9ParseTreeERNS_7JobInfoE+0x212)[0x7f03054c6a42]
/lib64/libjoblist.so(+0x1cac4f)[0x7f0305528c4f]
/lib64/libjoblist.so(_ZN7joblist12makeJobStepsEPN8execplan26CalpontSelectExecutionPlanERNS_7JobInfoERSt6vectorIN5boost10shared_ptrINS_7JobStepEEESaIS9_EESC_RSt3mapIiS9_St4lessIiESaISt4pairIKiS9_EEE+0x249)[0x7f030552bb89]
/lib64/libjoblist.so(+0x1cfcfd)[0x7f030552dcfd]
/lib64/libjoblist.so(_ZN7joblist14JobListFactory11makeJobListEPN8execplan20CalpontExecutionPlanEPNS_15ResourceManagerERK26PrimitiveServerThreadPoolsbb+0x62)[0x7f030552e142]
/lib64/libexecplan.so(_ZN8execplan20CalpontSystemCatalog13getSysData_ECERNS_26CalpontSelectExecutionPlanERNS0_14NJLSysDataListERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x98)[0x7f0304f15258]
/lib64/libexecplan.so(_ZN8execplan20CalpontSystemCatalog10getSysDataERNS_26CalpontSelectExecutionPlanERNS0_14NJLSysDataListERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x8b1)[0x7f0304f161b1]
/lib64/libexecplan.so(_ZN8execplan20CalpontSystemCatalog9getTablesENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0xb6d)[0x7f0304f1cb5d]
/lib64/libexecplan.so(_ZN8execplan20CalpontSystemCatalog13getSchemaInfoERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x28d)[0x7f0304f2ae9d]
/usr/bin/PrimProc(+0xb05e1)[0x561b2fe5a5e1]
/usr/bin/PrimProc(+0xb0653)[0x561b2fe5a653]
/usr/bin/PrimProc(+0xb0653)[0x561b2fe5a653]
/usr/bin/PrimProc(+0xb30f2)[0x561b2fe5d0f2]
/lib64/libthreadpool.so(_ZN10threadpool10ThreadPool11beginThreadEv+0x615)[0x7f0303ddaad5]
/usr/bin/PrimProc(+0xb8bd7)[0x561b2fe62bd7]
/lib64/libpthread.so.0(+0x81ca)[0x7f030403b1ca]
/lib64/libc.so.6(clone+0x43)[0x7f0302a50e73]
Debugging Trace 2
/usr/src/debug/MariaDB-/src_0/storage/columnstore/columnstore/.boost/boost-lib/include/boost/intrusive/bstree_algorithms.hpp:2034
template<class KeyType, class KeyNodePtrCompare>
static node_ptr lower_bound_loop
   (node_ptr x, node_ptr y, const KeyType &key, KeyNodePtrCompare comp)
{
   while(x){
      if(comp(x, key)){   // <----------- line 2034
         x = NodeTraits::get_right(x);
      }
      else{
         y = x;
         x = NodeTraits::get_left(x);
      }
   }
   return y;
}
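To make the loop in Debugging Trace 2 easier to follow, here is a minimal standalone sketch of the same lower-bound descent over a plain binary search tree. It is an illustration under simplifying assumptions, not the Boost.Intrusive code: the Node struct and the lowerBound function are hypothetical, keys are plain ints, and children are raw pointers, whereas the real tree lives in a shared memory segment. The comment marks where the node is dereferenced, which is one place a fault could surface if the tree were modified concurrently.

```cpp
// Hypothetical plain-pointer node; Boost.Intrusive reaches the children
// through NodeTraits instead.
struct Node
{
  int key;
  Node* left;
  Node* right;
};

// Returns the first node whose key is not less than `key`, or `fallback`
// when every key in the subtree is smaller.
Node* lowerBound(Node* x, Node* fallback, int key)
{
  Node* y = fallback;
  while (x)
  {
    // x is dereferenced here. If another thread or process freed or
    // relinked this node in the meantime, the read is undefined behavior
    // and may crash.
    if (x->key < key)
    {
      x = x->right;  // everything at or left of x is too small
    }
    else
    {
      y = x;         // x is a candidate; look for a smaller one on the left
      x = x->left;
    }
  }
  return y;
}
```

The loop itself has no bounds to check, so it is only as safe as the guarantee that nobody changes the tree while it runs.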
Issue Links
- is duplicated by
  MCOL-5488 Shmem RWLock is not strict enough for the operation it guards. (Closed)
- relates to
  MCOL-5488 Shmem RWLock is not strict enough for the operation it guards. (Closed)
  MCOL-5565 Queries stuck in MDB waiting for an answer from PP (Closed)
  MCOL-5487 A race in BRM causes SEGV working with managed shared mem segment (Closed)