[MCOL-5165] CAL0055: ERROR: ExeMgr has caught an exception. Resource temporarily unavailable Created: 2022-07-22  Updated: 2023-12-15

Status: Stalled
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 6.1.1
Fix Version/s: 23.10.1

Type: Task Priority: Critical
Reporter: Ricardo Galvan Assignee: Joe Cotellese
Resolution: Unresolved Votes: 3
Labels: triage

Attachments: Text File mcs1_columnstoreMetrics.txt     Text File mcs2_columnstoreMetrics.txt     Text File mcs3_columnstoreMetrics.txt    

 Description   

The customer ran into issues reporting this message:

prod.ERROR: SQLSTATE[HY000]: General error: 1815 Internal error: IDB-2004: Cannot connect to ExeMgr.

The error.log file reports this:

Jul 22 12:46:26 ip-10-111-0-54 joblist[9737]: 26.282984 |198936|0|0| C 05 CAL0000: TupleAggregateStep::doThreadedAggregate() caught Throw location unknown (consider using BOOST_THROW_EXCEPTION)#012Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::thread_resource_error> >#012std::exception::what: boost::thread_resource_error: Resource temporarily unavailable
Jul 22 12:46:26 ip-10-111-0-54 joblist[9737]: 26.324209 |198936|0|0| C 05 CAL0000: TupleAggregateStep::doThreadedAggregate() caught Throw location unknown (consider using BOOST_THROW_EXCEPTION)#012Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::thread_resource_error> >#012std::exception::what: boost::thread_resource_error: Resource temporarily unavailable
Jul 22 12:46:26 ip-10-111-0-54 ExeMgr[9737]: 26.359785 |198936|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. Resource temporarily unavailable
Jul 22 12:46:35 ip-10-111-0-54 ExeMgr[9737]: 35.473785 |198913|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. Resource temporarily unavailable
Jul 22 12:46:39 ip-10-111-0-54 ExeMgr[9737]: 39.722336 |198930|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. Resource temporarily unavailable
Jul 22 13:00:12 ip-10-111-0-54 threadpool[9737]: 12.171413 |0|0|0| E 22 CAL0005: ThreadPool: Caught exception during execution:  boost::thread_resource_error: Resource temporarily unavailable
Jul 22 13:01:10 ip-10-111-0-54 ExeMgr[9737]: 10.819957 |2147683103|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 13:01:11 ip-10-111-0-54 ExeMgr[9737]: 11.007345 |2147683105|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 13:01:12 ip-10-111-0-54 ExeMgr[9737]: 12.039356 |2147683111|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 13:05:58 ip-10-111-0-54 ExeMgr[9737]: 58.503954 |2147683259|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 13:06:57 ip-10-111-0-54 ExeMgr[9737]: 57.268840 |2147683295|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 13:06:57 ip-10-111-0-54 ExeMgr[9737]: 57.636075 |2147683294|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 13:07:19 ip-10-111-0-54 ExeMgr[9737]: 19.297657 |2147683301|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable

The debug.log reports this:

Jul 22 14:01:14 ip-10-111-0-54 ExeMgr[9737]: 14.409662 |2147684316|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 14:01:15 ip-10-111-0-54 ExeMgr[9737]: 15.246773 |2147684822|0|0| D 16 CAL0041: Start SQL statement: select objectid,columnname from syscolumn where schema='app' and tablename='client14spenddata' --columnRIDs/FE; ||
Jul 22 14:01:15 ip-10-111-0-54 ExeMgr[9737]: 15.265702 |2147684822|0|0| D 16 CAL0042: End SQL statement
Jul 22 14:01:15 ip-10-111-0-54 ExeMgr[9737]: 15.265770 |2147684822|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 14:01:16 ip-10-111-0-54 ExeMgr[9737]: 16.928010 |2147684823|0|0| D 16 CAL0041: Start SQL statement: select objectid,columnname from syscolumn where schema='app' and tablename='client14spenddata' --columnRIDs/FE; ||
Jul 22 14:01:16 ip-10-111-0-54 ExeMgr[9737]: 16.947776 |2147684823|0|0| D 16 CAL0042: End SQL statement
Jul 22 14:01:16 ip-10-111-0-54 ExeMgr[9737]: 16.947852 |2147684823|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 14:01:23 ip-10-111-0-54 ExeMgr[9737]: 23.982106 |2147684824|0|0| D 16 CAL0041: Start SQL statement: select objectid,columnname from syscolumn where schema='app' and tablename='client1spenddata' --columnRIDs/FE; ||
Jul 22 14:01:24 ip-10-111-0-54 ExeMgr[9737]: 24.004953 |2147684824|0|0| D 16 CAL0042: End SQL statement
Jul 22 14:01:24 ip-10-111-0-54 ExeMgr[9737]: 24.005033 |2147684824|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 14:01:30 ip-10-111-0-54 ExeMgr[9737]: 30.015599 |2147684827|0|0| D 16 CAL0041: Start SQL statement: select objectid,columnname from syscolumn where schema='app' and tablename='client540spenddata' --columnRIDs/FE; ||
Jul 22 14:01:30 ip-10-111-0-54 ExeMgr[9737]: 30.036076 |2147684827|0|0| D 16 CAL0042: End SQL statement
Jul 22 14:01:30 ip-10-111-0-54 ExeMgr[9737]: 30.036159 |2147684827|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 14:01:31 ip-10-111-0-54 ExeMgr[9737]: 31.475184 |2147684829|0|0| D 16 CAL0041: Start SQL statement: select objectid,columnname from syscolumn where schema='app' and tablename='client540spenddata' --columnRIDs/FE; ||
Jul 22 14:01:31 ip-10-111-0-54 ExeMgr[9737]: 31.493704 |2147684829|0|0| D 16 CAL0042: End SQL statement
Jul 22 14:01:31 ip-10-111-0-54 ExeMgr[9737]: 31.493773 |2147684829|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. boost::thread_resource_error: Resource temporarily unavailable
Jul 22 14:01:41 ip-10-111-0-54 DMLProc[9769]: 41.976756 |0|0|0| I 20 CAL0002: DMLProc has been told to rollback all DML transactions.
Jul 22 14:02:01 ip-10-111-0-54 IDBFile[9424]: 01.311265 |0|0|0| D 35 CAL0002: IDBFactory::installPlugin: installed filesystem plugin libcloudio.so
Jul 22 14:02:02 ip-10-111-0-54 StorageManager[9416]: SessionManager Caught Signal 15
Jul 22 14:02:02 ip-10-111-0-54 StorageManager[9416]: Shutdown StorageManager...
Jul 22 14:02:02 ip-10-111-0-54 controllernode[9185]: 02.375516 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jul 22 14:02:02 ip-10-111-0-54 controllernode[9185]: 02.375800 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jul 22 14:02:02 ip-10-111-0-54 StorageManager[9416]: Ownership: releasing ownership of data1
Jul 22 14:02:02 ip-10-111-0-54 StorageManager[9416]: StorageManager Shutdown Complete.
Jul 22 14:03:44 ip-10-111-0-54 controllernode[9876]: 44.722037 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jul 22 14:03:45 ip-10-111-0-54 controllernode[9876]: 45.909634 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jul 22 14:03:46 ip-10-111-0-54 controllernode[9876]: 46.298849 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jul 22 14:03:47 ip-10-111-0-54 controllernode[9876]: 47.698537 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jul 22 14:03:56 ip-10-111-0-54 StorageManager[10100]: Using the config file found at /etc/columnstore/storagemanager.cnf



 Comments   
Comment by David Hall (Inactive) [ 2022-08-03 ]

If this should happen again, it would be most useful to get the thread counts as above as well as memory consumption over memeory capacity at the point of failure. If at all possible, it would be nice to see the same numbers after ten minutes of idle to see if resources are (eventually) returned. I understand this may not be desirable in a production environment, but it would be helpful if possible.

Comment by Michael Amadi [ 2022-09-02 ]

allen.herrera,

Please see Massimo's comment below. would that suffice? or you'll need us to recapture these afresh again?

Comment by David Hall (Inactive) [ 2022-12-08 ]

It's possible that the kernel values for pid_max and max_map_count are too small
We need to modify their values in /usr/lib/sysctl.d. There may be a number of .conf files in this directory.

pid_max
There may be a file called 50-pid-max.conf or perhaps something similar. If so, modify it to be
kernel.pid_max = 4194304

If not we need to create a file for it.

sudo echo "kernel.pid_max = 4194304" > /usr/lib/sysctl.d/50-pid-max.conf

max_map_count
There may be a file that contains max_map_count. if so, modify it to
vm.max_map_count=262144
If not we need to create a file for it.

sudo echo "vm.max_map_count=4262144" > /usr/lib/sysctl.d/50-max_map_count.conf

Finish up
After these changes, load them with

sudo sysctl -p

Check that the changes took

cat /proc/sys/kernel/pid_max
cat /proc/sys/vm/max_map_count

Comment by David Hall (Inactive) [ 2022-12-09 ]

bbancroft
I would suggest the values I gave above for all machines, testbed and production
kernel/threads-max = 2000000
kernel/pid_max = 4194304
vm.max_map_count=4262144

These numbers should be large enough. Exactness isn't important, but much larger than the default is needed

Comment by David Hall (Inactive) [ 2023-01-09 ]

It appears the ulimit for max user processes is set to 4096, a way too small of a number. It should be 63069. Modify /etc/security/limits.conf to increase both the soft and hard limits.

Comment by JiraAutomate [ 2023-12-15 ]

Automated message:
----------------------------
Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

Generated at Thu Feb 08 02:55:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.