[MCOL-835] Killing MariaDB connections can crash ExeMgr Created: 2017-07-26  Updated: 2017-08-18  Resolved: 2017-08-18

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.0.9, 1.1.0
Fix Version/s: 1.0.11, 1.1.0

Type: Bug Priority: Blocker
Reporter: Andrew Hutchings (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 2
Labels: None

Sprint: 2017-15, 2017-16

 Description   

How I reproduce this:

1. Run 100 concurrent connections looping ColumnStore queries
2. Kill random connections until ExeMgr dies (usually 1 in 50)

Crash happens here in inetstreamsocket.cpp:

	if (stats)
		stats->dataSent(msglen + sizeof(msglen) + sizeof(magic));

Core file bt:

Core was generated by `'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f93c0000168 in ?? ()
[Current thread is 1 (Thread 0x7f91a6ff5700 (LWP 14761))]
(gdb) bt
#0  0x00007f93c0000168 in ?? ()
#1  0x00007f94373e6458 in messageqcpp::InetStreamSocket::do_write (
    this=this@entry=0x13c3a30, msg=..., whichMagic=whichMagic@entry=352043319, 
    stats=stats@entry=0x7f93c0022ff0)
    at /home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/utils/messageqcpp/inetstreamsocket.cpp:627
#2  0x00007f94373e658d in messageqcpp::InetStreamSocket::write (
    this=this@entry=0x13c3a30, msg=..., stats=stats@entry=0x7f93c0022ff0)
    at /home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/utils/messageqcpp/inetstreamsocket.cpp:632
#3  0x00007f94373edcd1 in messageqcpp::CompressedInetStreamSocket::write (
    this=0x13c3a30, msg=..., stats=0x7f93c0022ff0)
    at /home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/utils/messageqcpp/compressed_iss.cpp:122
#4  0x00007f94373d9f6e in messageqcpp::IOSocket::write (stats=0x7f93c0022ff0, 
    msg=..., this=0x13c3968)
    at /home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/utils/messageqcpp/iosocket.h:206
#5  messageqcpp::MessageQueueClient::write (this=this@entry=0x13c3930, 
    msg=..., timeout=timeout@entry=0x0, stats=stats@entry=0x7f93c0022ff0)
    at /home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/utils/messageqcpp/messagequeue.cpp:278
#6  0x00007f943a231b1d in joblist::DistributedEngineComm::writeToClient (
    is@entry=0x139c5e0, index=1, bs=..., sender=sender@entry=2284, 
    doInterleaving=doInterleaving@entry=true)
    at /home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp:857
#7  0x00007f943a2346c6 in joblist::DistributedEngineComm::write (this=0x139c5e0, senderID=2284, msg=...)
    at /home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp:747
#8  0x00007f943a30d68f in joblist::TupleBPS::sendJobs (this=this@entry=0x7f93c0112e80, 
    jobs=std::vector of length 65, capacity 128 = {...})
    at /home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/tuple-bps.cpp:1298
#9  0x00007f943a312bf3 in joblist::TupleBPS::sendPrimitiveMessages (this=0x7f93c0112e80)
    at /home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/tuple-bps.cpp:1691
#10 0x00007f943a321fef in joblist::TupleBPSPrimitive::operator() (this=<optimised out>)
    at /home/linuxjedi/Programming/Git/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/tuple-bps.cpp:108
#11 boost::detail::thread_data<joblist::TupleBPSPrimitive>::run (this=0x7f93c01c9390)
    at /usr/include/boost/thread/detail/thread.hpp:116
#12 0x00007f94369365d5 in boost::(anonymous namespace)::thread_proxy (param=<optimised out>)
    at libs/thread/src/pthread/thread.cpp:168
#13 0x00007f9435dec6ba in start_thread (arg=0x7f91a6ff5700) at pthread_create.c:333
#14 0x00007f94348743dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2017-08-05 ]

I can also trigger this about every 3 hours running working_tpch1/misc/bug4488.sql in a loop. Upgrading this to a blocker since this is a race that regular queries can hit.

Comment by Andrew Hutchings (Inactive) [ 2017-08-11 ]

The problem was that the Stats object was retrieved while holding a lock but then used after the lock had been released. This means that it could be freed before use due to a small race.

This fix gets the shared pointer for the parent MQE object and gets the Stats object from that, so that it cannot be freed whilst it is being used.
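
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual ColumnStore code: Registry, MQE, Stats, unsafeStats(), entry() and sendAcrossDisconnect() are all hypothetical stand-ins, and the real classes in messageqcpp and joblist differ. The sketch only shows why copying a shared pointer to the parent object under the lock keeps the Stats alive, whereas a raw pointer taken under the same lock can dangle once the lock is released.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-ins for illustration; not the real ColumnStore classes.
struct Stats {
    std::size_t sent = 0;
    void dataSent(std::size_t n) { sent += n; }
};

// Parent entry that owns its Stats by value, so the Stats lives exactly
// as long as the entry does.
struct MQE {
    Stats stats;
};

class Registry {
public:
    void add(int id) {
        std::lock_guard<std::mutex> guard(mu_);
        entries_[id] = std::make_shared<MQE>();
    }

    // Called on disconnect: drops the registry's reference to the entry.
    void remove(int id) {
        std::lock_guard<std::mutex> guard(mu_);
        entries_.erase(id);
    }

    // Racy pattern: the raw pointer outlives the lock, so it can dangle
    // as soon as remove() runs on another thread.
    Stats* unsafeStats(int id) {
        std::lock_guard<std::mutex> guard(mu_);
        auto it = entries_.find(id);
        if (it == entries_.end()) return nullptr;
        return &it->second->stats;
    }

    // Safer pattern: copy the shared_ptr while holding the lock, so the
    // caller shares ownership of the entry (and therefore of its Stats).
    std::shared_ptr<MQE> entry(int id) {
        std::lock_guard<std::mutex> guard(mu_);
        auto it = entries_.find(id);
        if (it == entries_.end()) return nullptr;
        return it->second;
    }

private:
    std::mutex mu_;
    std::map<int, std::shared_ptr<MQE>> entries_;
};

// Simulates a disconnect landing between the lookup and the use.
std::size_t sendAcrossDisconnect(std::size_t bytes) {
    Registry registry;
    registry.add(1);
    std::shared_ptr<MQE> mqe = registry.entry(1);  // reference taken under the lock
    assert(mqe);
    registry.remove(1);          // "disconnect": the registry lets go of the entry
    mqe->stats.dataSent(bytes);  // still valid, because we hold our own reference
    return mqe->stats.sent;
}
```

With unsafeStats() in place of entry(), the dataSent() call after remove() would touch freed memory, which matches the kind of failure seen in the backtrace in the description.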

For QA:
There is no easy way to reproduce this yet. The two ways I've done it are in the description and comments. It happens during connection disconnect.

Comment by Daniel Lee (Inactive) [ 2017-08-18 ]

Builds verified: 1.0.11-1, 1.1.0-1
Executed the mentioned test for 17 hours for each release. It ran close to 25,000 iterations on each. For 1.0.11-1, ExeMgr did crash one time. I am not sure of the exact cause yet. The fix did make a huge stability improvement for this test case.

This test also covered MCOL-744.
