[MCOL-5368] PrimProc eventually failed on slave node. Docker. Created: 2022-12-21  Updated: 2022-12-27  Resolved: 2022-12-27

Status: Closed
Project: MariaDB ColumnStore
Component/s: cmapi
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Alan Mologorsky Assignee: Alan Mologorsky
Resolution: Not a Bug Votes: 0
Labels: None


 Description   

Steps to reproduce:
1. Build and start a 3-node cluster, with or without MXS. Build and verification are OK (green).
2. Exec into slave node 2 using docker exec -it mcs2 bash; for me it is always mcs2.
3. Check the process list (ps aux or mcs cluster status); all MCS processes should be present.
4. Wait 1-2 minutes without touching the cluster.
5. Check the process list again; the PrimProc process is now gone.

After PrimProc was gone, I collected this additional information.
From /var/log/mariadb/columnstore/trace/PrimProc****:

Date/time: 2022-12-13 16:11:25
Signal: 11
 
/usr/bin/PrimProc(+0xbe6c6)[0x55dacb8066c6]
/lib64/libpthread.so.0(+0x12cf0)[0x7f29c217dcf0]
/lib64/libjoblist.so(_ZN7joblist21DistributedEngineComm5SetupEv+0x1384)[0x7f29c3634b14]
/lib64/libjoblist.so(_ZN7joblist21DistributedEngineComm6ListenEN5boost10shared_ptrIN11messageqcpp18MessageQueueClientEEEj+0x522)[0x7f29c3635e02]
/lib64/libjoblist.so(+0x13b046)[0x7f29c3636046]
/usr/bin/PrimProc(+0xc01a7)[0x55dacb8081a7]
/lib64/libpthread.so.0(+0x81cf)[0x7f29c21731cf]
/lib64/libc.so.6(clone+0x43)[0x7f29c0b87e73]

Using the MariaDB-columnstore-engine-debuginfo package, I got the following:

nm /usr/lib/debug/usr/lib64/libjoblist.so-10.6.11_6_22.08.4-1.el8.x86_64.debug | grep _ZN7joblist21DistributedEngineComm5SetupEv
0000000000138790 T _ZN7joblist21DistributedEngineComm5SetupEv
00000000000acd94 t _ZN7joblist21DistributedEngineComm5SetupEv.cold
 
0x1384 + 0x138790 = 0x139B14
 
addr2line -f -e /lib64/libjoblist.so 0x139b14
_ZN7joblist21DistributedEngineComm5SetupEv
/usr/src/debug/MariaDB-/src_0/storage/columnstore/columnstore/.boost/boost-lib/include/boost/smart_ptr/shared_ptr.hpp:786
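
The offset arithmetic above (symbol start from nm plus the offset from the backtrace frame) can be sketched in Python. This is only an illustration: the function names parse_frame, symbol_table, and resolve are hypothetical, and the frame format is assumed to be the path(symbol+0xOFFSET)[0xADDRESS] layout seen in the trace above.

```python
import re

# Assumed frame layout, taken from the trace above: path(symbol+0xOFFSET)[0xADDRESS]
FRAME_RE = re.compile(
    r"^(?P<path>[^(]+)\((?P<symbol>[^+)]*)\+0x(?P<offset>[0-9a-fA-F]+)\)"
)


def parse_frame(line):
    """Return (path, symbol, offset) for one backtrace line, or None if it does not match."""
    m = FRAME_RE.match(line.strip())
    if m is None:
        return None
    return m.group("path"), m.group("symbol"), int(m.group("offset"), 16)


def symbol_table(nm_output):
    """Map symbol name -> start address from nm output lines ('address type name')."""
    table = {}
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3:
            table[parts[2]] = int(parts[0], 16)
    return table


def resolve(frame_line, nm_output):
    """Compute symbol start + frame offset, i.e. the address to pass to addr2line."""
    _path, symbol, offset = parse_frame(frame_line)
    return symbol_table(nm_output)[symbol] + offset


# Inputs copied from the report above.
frame = "/lib64/libjoblist.so(_ZN7joblist21DistributedEngineComm5SetupEv+0x1384)[0x7f29c3634b14]"
nm_out = """0000000000138790 T _ZN7joblist21DistributedEngineComm5SetupEv
00000000000acd94 t _ZN7joblist21DistributedEngineComm5SetupEv.cold"""

print(hex(resolve(frame, nm_out)))  # prints 0x139b14
```

Frames without a symbol name, such as /usr/bin/PrimProc(+0xbe6c6), parse with an empty symbol; their offset is already relative to the binary and needs no nm lookup.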

At the same time I observed these messages in debug.log.
They seem unrelated, but I include them anyway.

Dec 13 16:55:49 mcs2 joblist[564]: 49.623685 |0|0|0| W 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/dbcon/joblist/distributedenginecomm.cpp @ 308 Could not connect to PMS0: Connection refused from PMS0      %%10%%
Dec 13 16:55:49 mcs2 joblist[564]: 49.624413 |0|0|0| W 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/dbcon/joblist/distributedenginecomm.cpp @ 308 Could not connect to PMS0: Connection refused from PMS0      %%10%%
Dec 13 16:55:49 mcs2 joblist[564]: 49.624919 |0|0|0| W 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/dbcon/joblist/distributedenginecomm.cpp @ 308 Could not connect to PMS0: Connection refused from PMS0      %%10%%
Dec 13 16:55:49 mcs2 joblist[564]: 49.625456 |0|0|0| W 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/dbcon/joblist/distributedenginecomm.cpp @ 308 Could not connect to PMS0: Connection refused from PMS0      %%10%%
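
Because the same warning repeats many times, it can help to collapse debug.log entries into counts per source location. The following is a minimal sketch, not part of ColumnStore tooling: the summarize function is hypothetical, and the line layout it expects is inferred only from the sample lines above.

```python
import re
from collections import Counter

# Assumed layout, inferred from the sample lines only:
# "<syslog prefix> CALnnnn: <source file> @ <line number> <message> %%n%%"
ENTRY_RE = re.compile(r"CAL\d+: (?P<file>\S+) @ (?P<line>\d+) (?P<msg>.*)$")


def summarize(log_text):
    """Count log entries grouped by (source file name, line number, message)."""
    counts = Counter()
    for raw in log_text.splitlines():
        m = ENTRY_RE.search(raw)
        if m is None:
            continue  # skip lines that do not follow the assumed layout
        # Drop the trailing "%%10%%" style marker and surrounding whitespace.
        msg = re.sub(r"\s*%%\d+%%\s*$", "", m.group("msg")).strip()
        source = m.group("file").rsplit("/", 1)[-1]
        counts[(source, int(m.group("line")), msg)] += 1
    return counts


# Two of the lines from the report above.
sample = (
    "Dec 13 16:55:49 mcs2 joblist[564]: 49.623685 |0|0|0| W 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/dbcon/joblist/distributedenginecomm.cpp @ 308 Could not connect to PMS0: Connection refused from PMS0      %%10%%\n"
    "Dec 13 16:55:49 mcs2 joblist[564]: 49.624413 |0|0|0| W 05 CAL0000: /mdb/verylongdirnameforverystrangecpackbehavior/storage/columnstore/columnstore/dbcon/joblist/distributedenginecomm.cpp @ 308 Could not connect to PMS0: Connection refused from PMS0      %%10%%\n"
)

for (source, line, msg), n in summarize(sample).items():
    print(f"{n}x {source}:{line} {msg}")
```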



 Comments   
Comment by Alan Mologorsky [ 2022-12-21 ]

alexey.vorovich Is it on Windows? If so, then I have no idea about it.
tntnatbry UPD: I tested again today. Here is /var/log/mariadb/columnstore/trace/PrimProc.632.log after the crash on mcs2:

Date/time: 2022-12-21 16:00:26
Signal: 11
 
/usr/bin/PrimProc(+0xbe6c6)[0x55a9279c26c6]
/lib64/libpthread.so.0(+0x12cf0)[0x7fbd12969cf0]
/usr/bin/PrimProc(+0xc90f6)[0x55a9279cd0f6]
/usr/bin/PrimProc(+0x9fcd2)[0x55a9279a3cd2]
/usr/bin/PrimProc(+0xc01a7)[0x55a9279c41a7]
/lib64/libpthread.so.0(+0x81cf)[0x7fbd1295f1cf]
/lib64/libc.so.6(clone+0x43)[0x7fbd11373e73]

I installed the debuginfo package at build time. Here is the strace result after the crash:

futex(0x7fff7aa38188, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ killed by SIGSEGV (core dumped) +++

Generated at Thu Feb 08 02:57:22 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.