[MCOL-2245] altersystem-disablemodule return with failure on a busy system Created: 2019-03-15  Updated: 2023-10-26  Resolved: 2019-07-10

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.2.2
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: Unassigned
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

2um 2pm system



 Description   

Reported by customer and seen by support in a shared session. The altersystem-disableModule command failed on the first try. It passed on the second try.

I tried to reproduce the issue on a local 2um/2pm system while it was idle and couldn't reproduce it.

I looked at the logs from the customer system and saw that a message was sent to UM2 to stop the process. UM2 received the message and replied about 1:04 minutes later (per the UM2 log timestamps). But the timeout in ProcMgr on PM1 is 1 minute, so it timed out and returned an error to the user.

Viewing the logs, the system was active with bulk loading, which probably contributed to the message timeout. So the timeout might need to be increased from 1 minute to something higher for the command to work on a busy system. However, it is not recommended that users run this command on an active system to start with.
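Since the command is not recommended on an active system, a pre-check for running bulk loads could be done before issuing it. The following is only a minimal sketch, not part of ColumnStore: the function names and the `BULK_LOAD_PROCESSES` tuple are hypothetical, `cpimport.bin` is taken from the PM1 log excerpt, and the check covers only the node it runs on.

```python
import subprocess

# Process names treated as "bulk load in progress". cpimport.bin is the name
# seen in the PM1 syslog excerpt; the tuple itself is an assumption.
BULK_LOAD_PROCESSES = ("cpimport", "cpimport.bin")


def bulk_load_running(process_names):
    """Return True if any name in process_names is a bulk-load process."""
    return any(name.strip() in BULK_LOAD_PROCESSES for name in process_names)


def local_process_names():
    """List command names of local processes via `ps -eo comm=` (Linux)."""
    out = subprocess.run(
        ["ps", "-eo", "comm="], capture_output=True, text=True, check=True
    ).stdout
    return out.splitlines()
```

An operator script might call `bulk_load_running(local_process_names())` on each node and postpone the disable while it returns True. This does not detect other sources of load, so it is a partial check at best.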

The logs below show why the altersystem-disablemodule command failed.
PM1 has a 1 minute wait time and it timed out. UM2 received the message but took about 1:04 minutes to respond, so it was slower than the timeout.

Bulk loading (cpimport jobs) was running when the disablemodule was run. This might have contributed to the timeout and to the disablemodule taking longer to perform, and it would explain why I couldn't reproduce the issue: I was running on an idle system.

So the best bet is that altersystem-disableModule would have worked on your system if it was idle. I can't say for sure. I will go ahead and open a new bug to request an increase in the timeout, but development might say that altersystem-disablemodule should only be done on an idle system and that 1 minute is valid.

Thought I would pass this on. This is my report on the altersystem-disablemodule failure your system had.

Pm1 logs

Thu Mar 14 15:01:02 2019: altersystem-disablemodule um2 y

BULK LOAD WAS GOING ON AT THE TIME OF THE DISABLE

Mar 14 15:01:01 ip-172-48-32-68 cpimport.bin[30605]: 01.613470 |0|0|0| I 34 CAL0081: Start BulkLoad: JobId-3049; db-tradealert

Mar 14 15:01:02 ip-172-48-32-68 ProcessManager[4727]: 02.970810 |0|0|0| I 17 CAL0000: MSG RECEIVED: Stop Module request on um2

THE WAIT IS 1 MINUTE, SO A TIMEOUT OCCURRED WHICH CAUSES THE FAILURE TO THE USER ON THE DISABLE-MODULE COMMAND

Mar 14 15:02:03 ip-172-48-32-68 ProcessManager[4727]: 03.034561 |0|0|0| E 17 CAL0000: line: 6901 sendMsgProcMon: ProcMon Msg timeout on module um2

Mar 14 15:02:03 ip-172-48-32-68 ProcessManager[4727]: 03.034635 |0|0|0| W 17 CAL0000: um2 module failed to stop!!

um2 – took 1:04 minutes to respond

Mar 14 15:01:02 ip-172-48-44-207 ProcessMonitor[3853]: 02.988421 |0|0|0| I 18 CAL0000: MSG RECEIVED: Stop All process request…
Mar 14 15:02:06 ip-172-48-44-207 ProcessMonitor[3853]: 06.668032 |0|0|0| I 18 CAL0000: STOPALL: ACK back to ProcMgr, return status = 0
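
The timing can be checked directly from the four ProcessManager/ProcessMonitor lines above. This is a minimal sketch under two assumptions: the field after the process tag (for example `02.970810`) is seconds and microseconds within the minute shown in the syslog timestamp, and the year is 2019 as in the command log line. The helper name `parse_time` is hypothetical.

```python
import re
from datetime import datetime

# Syslog prefix, host, process tag, then the assumed seconds.microseconds field.
LINE_RE = re.compile(
    r"^(?P<mon>\w{3}) +(?P<day>\d+) (?P<hh>\d{2}):(?P<mm>\d{2}):\d{2} "
    r"\S+ \S+: (?P<sec>\d{2}\.\d{6}) "
)


def parse_time(line, year=2019):
    """Build a microsecond-resolution datetime from one log line."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError("unrecognized log line: " + line)
    stamp = f"{year} {m['mon']} {m['day']} {m['hh']}:{m['mm']}:{m['sec']}"
    return datetime.strptime(stamp, "%Y %b %d %H:%M:%S.%f")


pm1_request = "Mar 14 15:01:02 ip-172-48-32-68 ProcessManager[4727]: 02.970810 |0|0|0| I 17 CAL0000: MSG RECEIVED: Stop Module request on um2"
pm1_timeout = "Mar 14 15:02:03 ip-172-48-32-68 ProcessManager[4727]: 03.034561 |0|0|0| E 17 CAL0000: line: 6901 sendMsgProcMon: ProcMon Msg timeout on module um2"
um2_request = "Mar 14 15:01:02 ip-172-48-44-207 ProcessMonitor[3853]: 02.988421 |0|0|0| I 18 CAL0000: MSG RECEIVED: Stop All process request…"
um2_ack = "Mar 14 15:02:06 ip-172-48-44-207 ProcessMonitor[3853]: 06.668032 |0|0|0| I 18 CAL0000: STOPALL: ACK back to ProcMgr, return status = 0"

# Each interval is measured on a single host's clock, so clock skew between
# PM1 and UM2 does not affect either number.
pm1_wait = (parse_time(pm1_timeout) - parse_time(pm1_request)).total_seconds()
um2_elapsed = (parse_time(um2_ack) - parse_time(um2_request)).total_seconds()

print(f"PM1 waited {pm1_wait:.3f} s before timing out")   # 60.064
print(f"UM2 took   {um2_elapsed:.3f} s to acknowledge")    # 63.680
```

Under these assumptions UM2's acknowledgement arrived roughly 3.6 seconds after the 1 minute wait on PM1 had expired.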

