[MCOL-3273] ExeMgr crash - __memcpy_ssse3_back Created: 2019-04-22  Updated: 2021-07-08  Resolved: 2021-07-08

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.2.3
Fix Version/s: 1.4.5

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: David Hall (Inactive)
Resolution: Won't Do Votes: 0
Labels: None
Environment:

2 UM, 2 PM system


Sprint: 2021-4

 Description   

The customer reported an ExeMgr crash. The system recovered and continued working. According to the customer, memory usage was normal, and no logs reported memory usage problems.

From the core dump:

Program terminated with signal 11, Segmentation fault.
#0 0x00007f0985a455bf in __memcpy_ssse3_back () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install mariadb-columnstore-platform-1.2.3-1.x86_64
(gdb) bt
#0 0x00007f0985a455bf in __memcpy_ssse3_back () from /lib64/libc.so.6
#1 0x00007f09884af126 in memcpy (__len=251805234, __src=0x7f056b41a040, __dest=<optimized out>) at /usr/include/bits/string3.h:51
#2 messageqcpp::ByteStream::append (this=this@entry=0x7f055ebfe090, bp=0x7f056b41a040 "", len=len@entry=251805234) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/utils/messageqcpp/bytestream.cpp:475
#3 0x00007f098aba412e in rowgroup::RGData::serialize (this=0x7f0715702940, bs=..., amount=251805234) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/utils/rowgroup/rowgroup.cpp:426
#4 0x00007f098aba5307 in rowgroup::RowGroup::serializeRGData (this=this@entry=0x7f056e1074d8, bs=...) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/utils/rowgroup/rowgroup.cpp:1380
#5 0x00007f098b7b7e05 in joblist::TupleAnnexStep::nextBand (this=0x7f056e107000, bs=...) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/dbcon/joblist/tupleannexstep.cpp:279
#6 0x00007f098b72875c in joblist::TupleJobList::projectTable (this=0x7f056e058180, bs=...) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/dbcon/joblist/joblist.cpp:378
#7 0x000055a9d609939b in (anonymous namespace)::SessionThread::operator() (this=<optimized out>) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/exemgr/main.cpp:954
#8 0x00007f098949deb4 in operator() (this=0x7f055a40cdd8) at /usr/include/boost/function/function_template.hpp:767
#9 threadpool::ThreadPool::beginThread (this=0x7ffe186160a0) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/utils/threadpool/threadpool.cpp:391
#10 0x00007f0987a1e27a in thread_proxy () from /lib64/libboost_thread-mt.so.1.53.0
#11 0x00007f0986f7edd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f09859edead in clone () from /lib64/libc.so.6

From the um1 log:

Apr 22 10:17:36 usfit-scdb1 ProcessMonitor[112390]: 36.751600 |0|0|0| C 18 CAL0000: *****MariaDB ColumnStore Process Restarting: ExeMgr, old PID = 179302

From the crash trace:

Date/time: 2019-04-22 10:17:30
Signal: 11

[0x55a9d60a4b40]
/lib64/libpthread.so.0(+0xf5d0)[0x7f0986f865d0]
/lib64/libc.so.6(+0x1555bf)[0x7f0985a455bf]
/usr/local/mariadb/columnstore/lib/libmessageqcpp.so.1(_ZN11messageqcpp10ByteStream6appendEPKhj+0x56)[0x7f09884af126]
/usr/local/mariadb/columnstore/lib/librowgroup.so.1(_ZNK8rowgroup6RGData9serializeERN11messageqcpp10ByteStreamEj+0x3e)[0x7f098aba412e]
/usr/local/mariadb/columnstore/lib/libjoblist.so.1(_ZN7joblist14TupleAnnexStep8nextBandERN11messageqcpp10ByteStreamE+0xe5)[0x7f098b7b7e05]
/usr/local/mariadb/columnstore/lib/libjoblist.so.1(_ZN7joblist12TupleJobList12projectTableEiRN11messageqcpp10ByteStreamE+0x1c)[0x7f098b72875c]
[0x55a9d609939b]
/usr/local/mariadb/columnstore/lib/libthreadpool.so.1(_ZN10threadpool10ThreadPool11beginThreadEv+0x434)[0x7f098949deb4]
/lib64/libboost_thread-mt.so.1.53.0(+0xd27a)[0x7f0987a1e27a]
/lib64/libpthread.so.0(+0x7dd5)[0x7f0986f7edd5]
/lib64/libc.so.6(clone+0x6d)[0x7f09859edead]



 Comments   
Comment by David Hill (Inactive) [ 2019-05-17 ]

Support issue number corrected; requested the core file.

Comment by David Hill (Inactive) [ 2019-05-17 ]

Compressed the core file (it was 22 GB) and sent it via FTP as core.ExeMgr.179302.gz.

Comment by Andrew Hutchings (Inactive) [ 2019-05-30 ]

Core file analysed.

The crash happens because rgData is a null pointer, but RGData::serialize() tries to append it anyway. The length is likewise garbage read through a bad pointer. This happens at tupleannexstep.cpp:280.

We get here because fDie is true, meaning the query was aborted. I suspect the abort happened before any data was delivered, so fRowGroupDeliver.setData() was never called.

I'm not sure how TupleAnnexStep (TAS) gets into that state, but I think it might be an empty result set.

Comment by Andrew Hutchings (Inactive) [ 2019-05-30 ]

OK. It is because nextBand() is called during the abort after the TupleAnnexStep has already been freed. This, of course, should not happen.

Generated at Thu Feb 08 02:41:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.