  1. MariaDB ColumnStore
  2. MCOL-2245

altersystem-disablemodule returns with failure on a busy system

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Won't Fix
    • 1.2.2
    • Icebox
    • ?
    • None
    • 2um 2pm system

    Description

      Reported by customer and seen by support in a shared session. The altersystem-disableModule command failed on the first try. It passed on the second try.

      I tried to reproduce the issue on a local 2um/2pm system while it was idle and couldn't reproduce it.

      I looked at the logs from the customer system and saw that a message was sent to UM2 to stop the process. UM2 received the message and replied about 1:04 minutes later. But the timeout in ProcMgr on PM1 is 1 minute, so it timed out and returned an error to the user.

      According to the logs, the system was active with bulk loading, which probably contributed to the message timeout. So the timeout might need to be increased from 1 minute to something higher for the command to work on a busy system. But it's not recommended that users run this command on an active system to begin with.

      The logs below show why the altersystem-disablemodule command failed.
      Pm1 has a 1 minute wait time and it timed out. Um2 received the message but took about 1:04 minutes to respond, so it was slower than the timeout.
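
The failure mode described above can be modelled in a few lines. This is only a sketch under stated assumptions, not ColumnStore code: `send_stop_and_wait` is a hypothetical helper, a `queue.Queue` stands in for the ProcMgr/ProcMon message channel, and the delays are scaled down (0.2 s for the 1 minute wait, 0.5 s for a slow reply) rather than matching the real ratio.

```python
import queue
import threading
import time


def send_stop_and_wait(reply_delay: float, timeout: float):
    """Hypothetical model: the sender waits `timeout` seconds for an ACK
    that the receiver posts after `reply_delay` seconds."""
    acks: queue.Queue = queue.Queue()

    def receiver() -> None:
        time.sleep(reply_delay)  # stand-in for stopping processes under load
        acks.put(0)              # ACK with return status 0

    threading.Thread(target=receiver, daemon=True).start()
    try:
        acks.get(timeout=timeout)
        return "stopped", acks
    except queue.Empty:
        # The sender gives up and reports a failure to the user.
        return "timeout", acks


# Slow receiver: the sender reports a timeout...
result, acks = send_stop_and_wait(reply_delay=0.5, timeout=0.2)
# ...but the ACK still arrives afterwards with status 0.
late_status = acks.get(timeout=2)
print(result, late_status)
```

In this model the sender reports a failure even though the receiver later completes with status 0, which matches the ordering seen in the logs: the timeout on pm1 is logged before the STOPALL ACK on um2.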

      Bulk loading (cpimport jobs) was running when the disablemodule was run. This might have contributed to the timeout and to the disablemodule taking longer to complete, and it would explain why I couldn't reproduce the issue: I was running on an idle system.

      So the best bet is that altersystem-disableModule would have worked on your system if it had been idle. I can't say for sure. I will go ahead and open a new bug to request an increase of the timeout, but development might say that altersystem-disablemodule should only be done on an idle system and that 1 minute is valid.

      Thought I would pass this on. This is my report on the altersystem-disablemodule failure your system had.

      Pm1 logs

      Thu Mar 14 15:01:02 2019: altersystem-disablemodule um2 y

      BULK LOAD WAS GOING ON AT THE TIME OF THE DISABLE

      Mar 14 15:01:01 ip-172-48-32-68 cpimport.bin[30605]: 01.613470 |0|0|0| I 34 CAL0081: Start BulkLoad: JobId-3049; db-tradealert

      Mar 14 15:01:02 ip-172-48-32-68 ProcessManager[4727]: 02.970810 |0|0|0| I 17 CAL0000: MSG RECEIVED: Stop Module request on um2

      THE WAIT IS 1 MINUTE, SO A TIMEOUT OCCURRED WHICH CAUSES THE FAILURE TO THE USER ON THE DISABLE-MODULE COMMAND

      Mar 14 15:02:03 ip-172-48-32-68 ProcessManager[4727]: 03.034561 |0|0|0| E 17 CAL0000: line: 6901 sendMsgProcMon: ProcMon Msg timeout on module um2

      Mar 14 15:02:03 ip-172-48-32-68 ProcessManager[4727]: 03.034635 |0|0|0| W 17 CAL0000: um2 module failed to stop!!

      um2 – took 1:04 minutes to respond

      Mar 14 15:01:02 ip-172-48-44-207 ProcessMonitor[3853]: 02.988421 |0|0|0| I 18 CAL0000: MSG RECEIVED: Stop All process request…
      Mar 14 15:02:06 ip-172-48-44-207 ProcessMonitor[3853]: 06.668032 |0|0|0| I 18 CAL0000: STOPALL: ACK back to ProcMgr, return status = 0
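
The timing can be checked directly from the four log lines above. The sketch below makes two assumptions: the second field of each line (for example `02.988421`) is read as seconds plus microseconds, and the year 2019 is taken from the command-history line because the syslog stamps omit it. The `parse` helper is hypothetical.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=1)  # ProcMgr wait on pm1, as stated in the report


def parse(stamp: str, sec_frac: str) -> datetime:
    """Combine a syslog stamp with the 'SS.ffffff' field that follows it."""
    base = datetime.strptime(stamp, "%b %d %H:%M:%S %Y")
    return base.replace(microsecond=int(sec_frac.split(".")[1]))


# pm1: stop request received, then the ProcMon message timeout
pm1_request = parse("Mar 14 15:01:02 2019", "02.970810")
pm1_timeout = parse("Mar 14 15:02:03 2019", "03.034561")
# um2: stop request received, then the STOPALL ACK
um2_received = parse("Mar 14 15:01:02 2019", "02.988421")
um2_ack = parse("Mar 14 15:02:06 2019", "06.668032")

pm1_wait = pm1_timeout - pm1_request
um2_elapsed = um2_ack - um2_received

print(f"pm1 waited   {pm1_wait.total_seconds():.2f} s")
print(f"um2 took     {um2_elapsed.total_seconds():.2f} s")
print(f"over timeout {(um2_elapsed - TIMEOUT).total_seconds():.2f} s")
```

By these timestamps um2 needed roughly 63.7 seconds, so the ACK came a few seconds after the 1 minute wait on pm1 had expired.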

          People

            Unassigned
            David Hill (Inactive)