[MCOL-5264] writeToClient() sporadically throws an exception processing exception Created: 2022-10-13  Updated: 2023-11-17  Resolved: 2022-12-13

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 6.3.1, 6.4.5-dompe
Fix Version/s: 22.08.7

Type: Bug
Priority: Critical
Reporter: Roman
Assignee: Andrey Piskunov (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Assigned for Review: Roman Roman

 Description   

One of the clients found the following crash trace after ExeMgr (EM) crashed.

Date/time: 2022-10-10 02:05:07
Signal: 6
 
/usr/bin/ExeMgr(+0x2840c)[0x561f9cabf40c]
/lib64/libpthread.so.0(+0xf630)[0x7f056b405630]
/lib64/libc.so.6(gsignal+0x37)[0x7f056932d387]
/lib64/libc.so.6(abort+0x148)[0x7f056932ea78]
/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x7f0569a27a95]
/lib64/libstdc++.so.6(+0x5ea06)[0x7f0569a25a06]
/lib64/libstdc++.so.6(+0x5d9b9)[0x7f0569a249b9]
/lib64/libstdc++.so.6(__gxx_personality_v0+0x564)[0x7f0569a25624]
/usr/bin/ExeMgr(+0x2abf3)[0x561f9cac1bf3]
/usr/bin/ExeMgr(+0x2b3e7)[0x561f9cac23e7]
/usr/bin/ExeMgr(+0x274c8)[0x561f9cabe4c8]
/lib64/libjoblist.so(_ZN7joblist21DistributedEngineComm13writeToClientEmRKN11messageqcpp10ByteStreamEjb+0x179)[0x7f056fd7de79]
/lib64/libjoblist.so(_ZN7joblist21DistributedEngineComm5writeEjRN11messageqcpp10ByteStreamE+0x286)[0x7f056fd7ff76]

Signal 6 is SIGABRT, which is raised when the code throws an exception that is not handled.
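
The libjoblist.so frames in the trace above are mangled C++ symbols. As a reading aid, here is a minimal sketch of demangling such a name, assuming a GCC or Clang toolchain that provides abi::__cxa_demangle in <cxxabi.h>. The helper name demangle() is hypothetical and is not part of ColumnStore.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium-ABI symbol,
// or the input unchanged if demangling fails.
std::string demangle(const std::string& mangled)
{
  int status = 0;
  // __cxa_demangle returns a malloc()-allocated buffer that the caller must free.
  std::unique_ptr<char, decltype(&std::free)> buf(
      abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status), &std::free);
  return (status == 0 && buf) ? std::string(buf.get()) : mangled;
}
```

Passing the symbol from the writeToClient frame (the text between the parenthesis and the +0x179 offset) should yield a readable signature for joblist::DistributedEngineComm::writeToClient. The c++filt command-line tool does the same job.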



 Comments   
Comment by Roman [ 2022-11-23 ]

The line

/lib64/libjoblist.so(_ZN7joblist21DistributedEngineComm13writeToClientEmRKN11messageqcpp10ByteStreamEjb+0x179)[0x7f056fd7de79]

corresponds to line 329 in the boost::unique_lock destructor, ~unique_lock(). The actual Boost version is 1.53.

 325     ~unique_lock()
 326     {
 327       if (owns_lock())
 328       {
 329         m->unlock();
 330       }
 331     }

Here is the boost::unique_lock::unlock() code.

 437     void unlock()
 438     { 
 439       if (m == 0)
 440       {
 441         boost::throw_exception(
 442             boost::lock_error(system::errc::operation_not_permitted, "boost unique_lock has no mutex"));
 443       }
 444       if (!owns_lock())
 445       {
 446         boost::throw_exception(
 447             boost::lock_error(system::errc::operation_not_permitted, "boost unique_lock doesn't own the mutex"));
 448       }
 449       m->unlock();
 450       is_locked = false;
 451     } 

There are two potential exceptions in unlock(). The second is guarded by owns_lock(), which should return true here, so I presume it is the first throw, where the underlying mutex pointer is 0, that produces the SIGABRT.
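
To illustrate the kind of failure described above, here is a small sketch that uses std::unique_lock as a stand-in for the Boost class. It is an analogy and not a reproduction of the ExeMgr crash: a default-constructed std::unique_lock has no associated mutex, and calling unlock() on it throws std::system_error with operation_not_permitted. The function name unlockWithoutMutexThrows() is hypothetical.

```cpp
#include <mutex>
#include <system_error>

// Returns true if unlock() on a lock with no associated mutex throws
// std::system_error carrying errc::operation_not_permitted.
bool unlockWithoutMutexThrows()
{
  std::unique_lock<std::mutex> lock;  // no mutex attached, owns_lock() is false
  try
  {
    lock.unlock();
  }
  catch (const std::system_error& e)
  {
    return e.code() == std::errc::operation_not_permitted;
  }
  return false;
}
```

In this sketch the exception is caught. If an exception of this kind goes unhandled instead, the runtime calls std::terminate(), which by default calls abort() and raises SIGABRT, matching the abort and __verbose_terminate_handler frames in the trace.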

Comment by Roman [ 2022-11-23 ]

The suggested solution is similar to what denis0x0D implemented in MCOL-5265, namely to replace boost::mutex and boost::scoped_lock with their standard library counterparts.
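
A rough sketch of what such a replacement can look like is given here. ClientWriter and its members are hypothetical names for illustration and are not taken from the DistributedEngineComm sources. The comments mark where the Boost types would have been used.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for a class that serializes writes to a client.
class ClientWriter
{
 public:
  void write(const std::string& msg)
  {
    // Before: boost::mutex::scoped_lock lk(mutex_);
    std::unique_lock<std::mutex> lk(mutex_);
    sent_.push_back(msg);
  }  // lk releases mutex_ when it goes out of scope

  std::size_t sentCount() const
  {
    // std::lock_guard is enough when the lock is never released early.
    std::lock_guard<std::mutex> lk(mutex_);
    return sent_.size();
  }

 private:
  mutable std::mutex mutex_;  // Before: boost::mutex
  std::vector<std::string> sent_;
};
```

Callers on different threads can share one ClientWriter and call write() concurrently. std::unique_lock is the closer match when code needs to unlock early or hand the lock to a condition variable.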

Comment by Roman [ 2022-11-24 ]

Please review.

Generated at Thu Feb 08 02:56:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.