[MCOL-3470] ServerMonitor hung, running at 100% cpu and not responding Created: 2019-09-03  Updated: 2023-10-26  Resolved: 2020-04-15

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.2.2
Fix Version/s: N/A

Type: Bug
Priority: Minor
Reporter: David Hill (Inactive)
Assignee: Unassigned
Resolution: Won't Fix
Votes: 0
Labels: None
Environment:

1um 3pm with local query enabled



 Description   

The customer reports that ServerMonitor on PM1 repeatedly hangs: it runs at 100% CPU and stops responding to mcsadmin commands. A restartSystem resolves the issue temporarily, but the process eventually returns to the same state.

mcsadmin> getModuleCpuUsers pm1
getmodulecpuusers Tue Sep 3 09:33:57 2019

Failed to get Top CPU Users: API Failure return in getTopProcessCpuUsers API

top from PM1:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
70427 root 20 0 230136 34488 8940 S 100.0 0.0 3921:54 ServerMonitor
7814 root 39 19 0 0 0 S 3.0 0.0 110:44.50 kipmi0
120599 root 20 0 608500 48744 2912 S 2.0 0.0 567:48.58 ProcMgr

gdb of ServerMonitor:

(gdb) bt
#0 0x00007f2f69a314ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f2f69a2cdcb in _L_lock_883 () from /lib64/libpthread.so.0
#2 0x00007f2f69a2cc98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00005560d23ffa52 in msgProcessor () at /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/oamapps/serverMonitor/msgProcessor.cpp:144
#4 0x00005560d23dd113 in main (argc=<optimized out>, argv=<optimized out>) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/oamapps/serverMonitor/main.cpp:325
(gdb) info threads
Id Target Id Frame
5 Thread 0x7f2f62fff700 (LWP 70503) "ServerMonitor" 0x00007f2f69a2ed12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 0x7f2f61fff700 (LWP 70504) "ServerMonitor" 0x00007f2f6a8ff366 in alarmmanager::operator>> (input=..., alarm=...) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/oamapps/alarmmanager/alarm.cpp:121
3 Thread 0x7f2f60fff700 (LWP 74505) "ServerMonitor" 0x00007f2f69a314ed in __lll_lock_wait () from /lib64/libpthread.so.0
2 Thread 0x7f2f607fe700 (LWP 74506) "ServerMonitor" 0x00007f2f69a314ed in __lll_lock_wait () from /lib64/libpthread.so.0
* 1 Thread 0x7f2f6e0c88c0 (LWP 70427) "ServerMonitor" 0x00007f2f69a314ed in __lll_lock_wait () from /lib64/libpthread.so.0
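
In the thread list above, threads 1, 2 and 3 are waiting in __lll_lock_wait and thread 5 in pthread_cond_timedwait; only thread 4 is in application code (alarmmanager::operator>>). Blocked threads normally consume little CPU, so it may help to confirm which LWP is actually accumulating CPU time. The following is a minimal diagnostic sketch, not part of ColumnStore: it samples per-thread CPU ticks from Linux /proc twice and ranks threads by the difference. The function names (parse_stat_line, thread_cpu_ticks, busiest_threads) are hypothetical, and the field positions are assumed to follow the proc(5) layout of /proc/[pid]/task/[tid]/stat.

```python
import os
import time
from typing import Dict, List, Tuple

# (comm, state, cpu_ticks) for one thread
ThreadSample = Tuple[str, str, int]


def parse_stat_line(line: str) -> ThreadSample:
    """Parse one /proc/<pid>/task/<tid>/stat line into (comm, state, utime + stime)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    left, right = line.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    fields = right.split()
    state = fields[0]        # field 3 in proc(5)
    utime = int(fields[11])  # field 14: user-mode ticks
    stime = int(fields[12])  # field 15: kernel-mode ticks
    return comm, state, utime + stime


def thread_cpu_ticks(pid: int) -> Dict[int, ThreadSample]:
    """Sample every thread (LWP) of a process once."""
    samples: Dict[int, ThreadSample] = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                samples[int(tid)] = parse_stat_line(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listdir() and open()/read()
    return samples


def busiest_threads(before: Dict[int, ThreadSample],
                    after: Dict[int, ThreadSample]) -> List[Tuple[int, int]]:
    """Return (tid, tick_delta) pairs, largest delta first."""
    deltas = [(tid, after[tid][2] - before[tid][2])
              for tid in after if tid in before]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    import sys
    target = int(sys.argv[1])  # e.g. the ServerMonitor PID reported by top
    first = thread_cpu_ticks(target)
    time.sleep(5)
    second = thread_cpu_ticks(target)
    for tid, delta in busiest_threads(first, second):
        print(tid, second[tid][1], delta)
```

The thread IDs printed by this script correspond to the LWP numbers in the gdb "info threads" output, so the busiest entry can be matched directly against the list above. Running top -H -p <pid> gives the same per-thread view interactively.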


 Comments   
Comment by David Hill (Inactive) [ 2019-09-30 ]

Update from customer:

Just an update: we now have a different system running the CentOS 6 packages with ColumnStore 1.2.5 installed, and we still see ServerMonitor running at 100% CPU on PM1 while the system is idle. The symptoms are the same: it appears to be stuck on a mutex lock, as the gdb output matches the earlier one. Specifically, restarting ServerMonitor still resolves it for the time being.

(gdb) bt
#0 0x0000003b5d60e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x0000003b5d6095d8 in _L_lock_854 () from /lib64/libpthread.so.0
#2 0x0000003b5d6094a7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x0000000000431a2a in msgProcessor() ()
#4 0x000000000040f61e in main ()

Comment by Todd Stoffel (Inactive) [ 2020-04-15 ]

OAM is being deprecated and replaced by an enhanced API and the MaxScale orchestration project.

Generated at Thu Feb 08 02:42:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.