[MCOL-5559] Shmem segment remap causes SEGV in ExtentMapIndexImpl::find - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 23.02.3
Fix Version/s: 23.10.1, 23.10.0, 23.02.6
Component/s: PrimProc
Labels:
None
Environment:
HA 3 node redhat 8.7
Each node: 48 core, 750GB ram
NFS shared storage

Sprint:
2023-8, 2023-10, 2023-11

Description

See Developer Comments for the types of queries running and logs
Core dumps are enabled; a dump will be shared when one is available (as of 8/24/2023 the segfault still has not reoccurred).

~~This is currently happening every night,~~
Currently, queries hang every night, and the cluster must be restarted before the nightly ETL (daily aggregation) can continue. The only red flag found so far is the following crash trace, which was seen once.

We need to:
1) prevent queries from hanging
2) prevent the segfault
3) have the cluster recover when a single PrimProc restarts

Crash Trace 1

Date/time: 2023-08-20 20:39:11

Signal: 11

/usr/bin/PrimProc(+0xb70f6)[0x556a0ed230f6]

/lib64/libpthread.so.0(+0x12cf0)[0x7fcaa8c5dcf0]

/lib64/libbrm.so(_ZN5boost9unordered13unordered_mapIiNS1_IjNS_9container6vectorIlNS_12interprocess9allocatorIlNS4_15segment_managerIcNS4_15rbtree_best_fitINS4_12mutex_familyENS4_10offset_ptrIvlmLm0EEELm0EEENS4_10iset_indexEEEEEvEENS_4hashIjEESt8equal_toIjENS5_ISt4pairIKjSF_ESD_EEEENSG_IiEESI_IiENS5_ISK_IKiSO_ESD_EEE4findERSR_+0x15b)[0x7fcaa943040b]

/lib64/libbrm.so(_ZN3BRM18ExtentMapIndexImpl14search2ndLayerERN5boost9unordered13unordered_mapIiNS3_IjNS1_9container6vectorIlNS1_12interprocess9allocatorIlNS6_15segment_managerIcNS6_15rbtree_best_fitINS6_12mutex_familyENS6_10offset_ptrIvlmLm0EEELm0EEENS6_10iset_indexEEEEEvEENS1_4hashIjEESt8equal_toIjENS7_ISt4pairIKjSH_ESF_EEEENSI_IiEESK_IiENS7_ISM_IKiSQ_ESF_EEEEi+0x49)[0x7fcaa9416ce9]

/lib64/libbrm.so(_ZN3BRM18ExtentMapIndexImpl4findEti+0x77)[0x7fcaa9417b57]

/lib64/libbrm.so(_ZN3BRM9ExtentMap10getExtentsEiRSt6vectorINS_7EMEntryESaIS2_EEbbb+0xe0)[0x7fcaa9423e10]

/lib64/libbrm.so(_ZN3BRM4DBRM10getExtentsEiRSt6vectorINS_7EMEntryESaIS2_EEbbb+0x23)[0x7fcaa9404883]

/lib64/libjoblist.so(_ZN7joblist15pDictionaryScanC1EiiRKN8execplan20CalpontSystemCatalog7ColTypeERKNS_7JobInfoE+0x32a)[0x7fcaaa17737a]

/lib64/libjoblist.so(+0x1652a7)[0x7fcaaa0db2a7]

/lib64/libjoblist.so(_ZN7joblist21JLF_ExecPlanToJobList8walkTreeEPN8execplan9ParseTreeERNS_7JobInfoE+0x212)[0x7fcaaa0dea42]

/lib64/libjoblist.so(_ZN7joblist21JLF_ExecPlanToJobList8walkTreeEPN8execplan9ParseTreeERNS_7JobInfoE+0x606)[0x7fcaaa0dee36]

/lib64/libjoblist.so(+0x1cac4f)[0x7fcaaa140c4f]

/lib64/libjoblist.so(_ZN7joblist12makeJobStepsEPN8execplan26CalpontSelectExecutionPlanERNS_7JobInfoERSt6vectorIN5boost10shared_ptrINS_7JobStepEEESaIS9_EESC_RSt3mapIiS9_St4lessIiESaISt4pairIKiS9_EEE+0x249)[0x7fcaaa143b89]

/lib64/libjoblist.so(+0x1cfcfd)[0x7fcaaa145cfd]

/lib64/libjoblist.so(_ZN7joblist14JobListFactory11makeJobListEPN8execplan20CalpontExecutionPlanEPNS_15ResourceManagerERK26PrimitiveServerThreadPoolsbb+0x62)[0x7fcaaa146142]

/usr/bin/PrimProc(+0xb2163)[0x556a0ed1e163]

/lib64/libthreadpool.so(_ZN10threadpool10ThreadPool11beginThreadEv+0x615)[0x7fcaa89f2ad5]

/usr/bin/PrimProc(+0xb8bd7)[0x556a0ed24bd7]

/lib64/libpthread.so.0(+0x81ca)[0x7fcaa8c531ca]

/lib64/libc.so.6(clone+0x43)[0x7fcaa7668e73]
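The frames above carry Itanium-ABI mangled names, which are hard to read directly. A minimal sketch of a demangling helper follows, assuming a GCC or Clang toolchain that ships `<cxxabi.h>`; the helper name `demangle` is hypothetical and is not part of ColumnStore.

```cpp
#include <cxxabi.h>
#include <cstdlib>
#include <string>

// Demangle one Itanium-ABI symbol name.
// Returns the input unchanged if the demangler rejects it.
std::string demangle(const std::string& mangled)
{
  int status = 0;
  // With a null output buffer, __cxa_demangle allocates the result with malloc.
  char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || out == nullptr)
    return mangled;

  std::string result(out);
  std::free(out);  // the caller owns the malloc'ed buffer
  return result;
}
```

Passing `_ZN3BRM18ExtentMapIndexImpl4findEti` from the trace yields `BRM::ExtentMapIndexImpl::find(unsigned short, int)`. The binutils `c++filt` tool does the same job from the command line.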

Debugging Trace 1

LBID_tFindResult ExtentMapIndexImpl::find(const DBRootT dbroot, const OID_t oid)
{
  ExtentMapIndex& emIndex = *get();

  if (dbroot >= emIndex.size())
    return {};

  return search2ndLayer(emIndex[dbroot], oid);
}
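The function above binds a reference to an object inside the shared-memory segment and then keeps using it. If the segment is remapped in between, as the issue title suggests, that reference would point at memory that is no longer mapped. The sketch below illustrates the general hazard and one common mitigation: resolve an offset against the current base on every access, under the same lock that guards the remap. `Segment` is a hypothetical stand-in built on heap memory, and it is not the ColumnStore implementation or the fix that was shipped. A real implementation would more likely use a reader/writer lock.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for a growable shared-memory segment.
class Segment
{
 public:
  explicit Segment(std::size_t bytes) : size_(bytes), base_(new char[bytes]()) {}

  // "Remap": move the contents to a larger region at a different address.
  // Any raw pointer or reference taken before this call is stale afterwards.
  void grow(std::size_t newBytes)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    if (newBytes <= size_)
      return;
    std::unique_ptr<char[]> bigger(new char[newBytes]());
    std::memcpy(bigger.get(), base_.get(), size_);
    base_ = std::move(bigger);  // the old region is released here
    size_ = newBytes;
  }

  void writeInt(std::size_t offset, int value)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    check(offset);
    std::memcpy(base_.get() + offset, &value, sizeof(value));
  }

  // The address is resolved from the offset while the lock is held,
  // so a concurrent grow() cannot invalidate it mid-read.
  int readInt(std::size_t offset) const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    check(offset);
    int value = 0;
    std::memcpy(&value, base_.get() + offset, sizeof(value));
    return value;
  }

  std::size_t size() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return size_;
  }

 private:
  void check(std::size_t offset) const
  {
    if (offset > size_ || size_ - offset < sizeof(int))
      throw std::out_of_range("offset outside segment");
  }

  mutable std::mutex mutex_;
  std::size_t size_;
  std::unique_ptr<char[]> base_;
};
```

A value written at offset 8 stays readable through `readInt(8)` after `grow(4096)`, whereas a `char*` captured before the call would dangle.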

Messages File 1

# On Node 1

Aug 20 20:39:44 atx-mdb101pl messagequeue[516171]: 44.604637 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic(): I/O error2.1: err = -1 e = 104: Connection reset by peer        %%10%%

Aug 20 20:39:44 atx-mdb101pl env[516171]: DEC Caught EXCEPTION: InetStreamSocket::readToMagic(): I/O error2.1: err = -1 e = 104: Connection reset by peer

# On Node 2

Aug 20 20:39:44 atx-mdb102pl messagequeue[3758352]: 44.591925 |0|0|0| W 31 CAL0000: MessageQueueClient::write: error writing 16 bytes to IOSocket: sd: 24 inet: 10.224.140.32 port: 8620. Socket error was InetStreamSocket::write error: Broken pipe -- write from InetStreamSocket: sd: 24 inet: 10.224.140.32 port: 8620#012         %%10%%

# On Node 3

Aug 20 20:39:11 atx-mdb103pl systemd[1]: Started Process Core Dump (PID 1287119/UID 0).

Aug 20 20:39:11 atx-mdb103pl systemd-coredump[1287121]: Resource limits disable core dumping for process 792020 (PrimProc).

Aug 20 20:39:11 atx-mdb103pl systemd-coredump[1287121]: Process 792020 (PrimProc) of user 993 dumped core.

Workload

The customer ingests 3 TB of raw data a day, over 2.6 billion records, and batches it for hourly import into each table. At night they aggregate all the hourly data into a daily table to summarize the data and reduce its footprint. This topic/query is split into 48 parts to optimize extent elimination for date and lat/long searches, and to divide the data so that the GROUP BY calculation does not run out of memory, because the GROUP BY criteria have so many distinct values.

Parts 0 to 13 are loaded on the first node, parts 14 to 26 on the second node, and parts 27 to 47 on the third node. The parts are not equally distributed because they represent geographical areas of a map; imagine a lat/long grid.

High-level example:

sudo -u mysql ${MCSMYSQL} -qs ${DATABASE} -N -e"${DAILY_SQL}" > ${TOPIC}_DAILY.tbl

sudo -u mysql ${CPIMPORT} -m 3 -j1337
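The part ranges given in the workload description can be written down as a small lookup, which makes the uneven split explicit. This is only an illustration of the stated ranges; the function names `nodeForPart` and `partsOnNode` are hypothetical and do not come from the customer's scripts.

```cpp
// Node (1-based) that loads a given part, per the ranges in the description:
// parts 0-13 on node 1, parts 14-26 on node 2, parts 27-47 on node 3.
// Returns 0 for a part number outside 0..47.
int nodeForPart(int part)
{
  if (part < 0 || part > 47)
    return 0;
  if (part <= 13)
    return 1;
  if (part <= 26)
    return 2;
  return 3;
}

// Number of the 48 parts that land on the given node.
int partsOnNode(int node)
{
  int count = 0;
  for (int part = 0; part < 48; ++part)
  {
    if (nodeForPart(part) == node)
      ++count;
  }
  return count;
}
```

Counting with `partsOnNode` gives 14, 13, and 21 parts for nodes 1, 2, and 3 respectively, so the third node handles noticeably more parts than the other two.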

Crash Trace 2:
When this crash occurred, one query was running on node 2 and a couple of cpimport jobs were running across the nodes.

Date/time: 2023-08-29 08:34:04

Signal: 11

/usr/bin/PrimProc(+0xb70f6)[0x561b2fe610f6]

/lib64/libpthread.so.0(+0x12cf0)[0x7f0304045cf0]

/lib64/libbrm.so(_ZN3BRM9ExtentMap10findByLBIDEl+0x185)[0x7f0304801045]

/lib64/libbrm.so(_ZN3BRM9ExtentMap18getEmIdentsByLbidsERKN5boost9container6vectorIlvvEE+0x1e4)[0x7f0304801974]

/lib64/libbrm.so(_ZN3BRM9ExtentMap10getExtentsEiRSt6vectorINS_7EMEntryESaIS2_EEbbb+0x11c)[0x7f030480be4c]

/lib64/libbrm.so(_ZN3BRM4DBRM10getExtentsEiRSt6vectorINS_7EMEntryESaIS2_EEbbb+0x23)[0x7f03047ec883]

/lib64/libjoblist.so(_ZN7joblist8pColStepC2EiiRKN8execplan20CalpontSystemCatalog7ColTypeERKNS_7JobInfoE+0x41e)[0x7f03055583ae]

/lib64/libjoblist.so(+0x163f30)[0x7f03054c1f30]

/lib64/libjoblist.so(_ZN7joblist21JLF_ExecPlanToJobList8walkTreeEPN8execplan9ParseTreeERNS_7JobInfoE+0x212)[0x7f03054c6a42]

/lib64/libjoblist.so(+0x1cac4f)[0x7f0305528c4f]

/lib64/libjoblist.so(_ZN7joblist12makeJobStepsEPN8execplan26CalpontSelectExecutionPlanERNS_7JobInfoERSt6vectorIN5boost10shared_ptrINS_7JobStepEEESaIS9_EESC_RSt3mapIiS9_St4lessIiESaISt4pairIKiS9_EEE+0x249)[0x7f030552bb89]

/lib64/libjoblist.so(+0x1cfcfd)[0x7f030552dcfd]

/lib64/libjoblist.so(_ZN7joblist14JobListFactory11makeJobListEPN8execplan20CalpontExecutionPlanEPNS_15ResourceManagerERK26PrimitiveServerThreadPoolsbb+0x62)[0x7f030552e142]

/lib64/libexecplan.so(_ZN8execplan20CalpontSystemCatalog13getSysData_ECERNS_26CalpontSelectExecutionPlanERNS0_14NJLSysDataListERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x98)[0x7f0304f15258]

/lib64/libexecplan.so(_ZN8execplan20CalpontSystemCatalog10getSysDataERNS_26CalpontSelectExecutionPlanERNS0_14NJLSysDataListERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x8b1)[0x7f0304f161b1]

/lib64/libexecplan.so(_ZN8execplan20CalpontSystemCatalog9getTablesENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0xb6d)[0x7f0304f1cb5d]

/lib64/libexecplan.so(_ZN8execplan20CalpontSystemCatalog13getSchemaInfoERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEi+0x28d)[0x7f0304f2ae9d]

/usr/bin/PrimProc(+0xb05e1)[0x561b2fe5a5e1]

/usr/bin/PrimProc(+0xb0653)[0x561b2fe5a653]

/usr/bin/PrimProc(+0xb0653)[0x561b2fe5a653]

/usr/bin/PrimProc(+0xb30f2)[0x561b2fe5d0f2]

/lib64/libthreadpool.so(_ZN10threadpool10ThreadPool11beginThreadEv+0x615)[0x7f0303ddaad5]

/usr/bin/PrimProc(+0xb8bd7)[0x561b2fe62bd7]

/lib64/libpthread.so.0(+0x81ca)[0x7f030403b1ca]

/lib64/libc.so.6(clone+0x43)[0x7f0302a50e73]

Debugging Trace 2:
/usr/src/debug/MariaDB-/src_0/storage/columnstore/columnstore/.boost/boost-lib/include/boost/intrusive/bstree_algorithms.hpp:2034

   template<class KeyType, class KeyNodePtrCompare>
   static node_ptr lower_bound_loop
      (node_ptr x, node_ptr y, const KeyType &key, KeyNodePtrCompare comp)
   {
      while(x){
         if(comp(x, key)){                 <----------- line 2034
            x = NodeTraits::get_right(x);
         }
         else{
            y = x;
            x = NodeTraits::get_left(x);
         }
      }
      return y;
   }
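The loop in the trace is a standard lower-bound descent of a binary search tree. The comparison at line 2034 is the first place the current node is dereferenced, so a node pointer into memory that is no longer mapped would plausibly fault there. Below is a simplified, self-contained sketch of the same walk over plain pointers; the `Node` type and `lowerBound` function are hypothetical, and the Boost code works on its own node traits rather than this struct.

```cpp
struct Node
{
  int key;
  Node* left;
  Node* right;
};

// Returns the first node whose key is not less than `key`, or `y` if none exists.
// x walks down the tree; y remembers the best candidate seen so far.
const Node* lowerBound(const Node* x, const Node* y, int key)
{
  while (x)
  {
    if (x->key < key)  // corresponds to comp(x, key); x is dereferenced here
    {
      x = x->right;
    }
    else
    {
      y = x;
      x = x->left;
    }
  }
  return y;
}
```

For a tree with root 20 and children 10 and 30, a search for 25 returns the node holding 30, and a search for 31 returns the initial `y` (a null pointer when called with `nullptr`).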

Attachments

Issue Links

is duplicated by

MCOL-5488 Shmem RWLock is not strict enough for the operation it guards.

Closed

relates to

MCOL-5488 Shmem RWLock is not strict enough for the operation it guards.

Closed

MCOL-5565 Queries stuck in MDB waiting for an answer from PP

Closed

MCOL-5487 A race in BRM causes SEGV working with managed shared mem segment

Closed

Activity

People

Assignee: Roman

Reporter: Allen Herrera

Assigned for Review: Gagan Goel (Inactive)

Assigned for Testing: Allen M Herrera

Votes: 0

Watchers: 3

Dates

Created: 2023-08-21 16:25

Updated: 2024-08-06 17:18

Resolved: 2023-11-01 17:06
