MariaDB ColumnStore / MCOL-3341

PM ExeMgr doesn't restart on User Module failure w/ local query enabled

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.2.3
    • Fix Version/s: N/A
    • Component/s: oam
    • Labels:
      None
    • Environment:
      2um 2pm with local query enabled

      Description

      Reported by customer and reproduced:

      On a system with multiple UMs and local query enabled, if um1 goes down, all ExeMgrs are stopped and restarted as part of the recovery process. The ExeMgrs on the PMs fail to start, leaving the system in this state:

      System BUSY_INIT Thu May 30 14:45:51 2019

      Module um1 AUTO_DISABLED/DEGRADED Thu May 30 14:45:57 2019
      Module um2 FAILED Thu May 30 14:48:20 2019
      Module pm1 ACTIVE Thu May 30 14:48:02 2019
      Module pm2 ACTIVE Thu May 30 14:48:03 2019

      Active Parent OAM Performance Module is 'pm1'
      Primary Front-End MariaDB ColumnStore Module is 'um2'
      Local Query Feature is enabled
      MariaDB ColumnStore Replication Feature is enabled
      MariaDB ColumnStore set for Distributed Install

      MariaDB ColumnStore Process statuses

      Process             Module  Status           Last Status Change        Process ID
      ------------------  ------  ---------------  ------------------------  ----------
      ProcessMonitor      um1     AUTO_OFFLINE     Thu May 30 14:45:57 2019
      ServerMonitor       um1     AUTO_OFFLINE     Thu May 30 14:45:57 2019
      DBRMWorkerNode      um1     AUTO_OFFLINE     Thu May 30 14:45:57 2019
      ExeMgr              um1     AUTO_OFFLINE     Thu May 30 14:45:57 2019
      DDLProc             um1     AUTO_OFFLINE     Thu May 30 14:45:57 2019
      DMLProc             um1     AUTO_OFFLINE     Thu May 30 14:45:57 2019
      mysqld              um1     AUTO_OFFLINE     Thu May 30 14:45:57 2019

      ProcessMonitor      um2     ACTIVE           Thu May 30 14:42:22 2019  7059
      ServerMonitor       um2     ACTIVE           Thu May 30 14:42:48 2019  7497
      DBRMWorkerNode      um2     ACTIVE           Thu May 30 14:47:19 2019  11086
      ExeMgr              um2     ACTIVE           Thu May 30 14:47:50 2019  11270
      DDLProc             um2     COLD_STANDBY     Thu May 30 14:46:48 2019
      DMLProc             um2     COLD_STANDBY     Thu May 30 14:46:49 2019
      mysqld              um2     ACTIVE           Thu May 30 14:48:24 2019  11521

      ProcessMonitor      pm1     ACTIVE           Thu May 30 14:41:30 2019  9303
      ProcessManager      pm1     ACTIVE           Thu May 30 14:41:36 2019  9427
      DBRMControllerNode  pm1     ACTIVE           Thu May 30 14:47:16 2019  23967
      ServerMonitor       pm1     ACTIVE           Thu May 30 14:42:42 2019  11653
      DBRMWorkerNode      pm1     ACTIVE           Thu May 30 14:47:23 2019  24115
      PrimProc            pm1     ACTIVE           Thu May 30 14:47:32 2019  24253
      ExeMgr              pm1     MAN_OFFLINE      Thu May 30 14:45:59 2019
      WriteEngineServer   pm1     ACTIVE           Thu May 30 14:47:45 2019  24491
      mysqld              pm1     ACTIVE           Thu May 30 14:48:02 2019  24952

      ProcessMonitor      pm2     ACTIVE           Thu May 30 14:42:32 2019  7669
      ProcessManager      pm2     HOT_STANDBY      Thu May 30 14:42:33 2019  7765
      DBRMControllerNode  pm2     COLD_STANDBY     Thu May 30 14:47:15 2019
      ServerMonitor       pm2     ACTIVE           Thu May 30 14:42:53 2019  8137
      DBRMWorkerNode      pm2     ACTIVE           Thu May 30 14:47:28 2019  10444
      PrimProc            pm2     ACTIVE           Thu May 30 14:47:36 2019  10512
      ExeMgr              pm2     MAN_OFFLINE      Thu May 30 14:45:59 2019
      WriteEngineServer   pm2     ACTIVE           Thu May 30 14:47:46 2019  10589
      mysqld              pm2     ACTIVE           Thu May 30 14:48:03 2019  10855
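
      To make the process listing above easier to scan, here is a small sketch that parses status rows and reports the ExeMgr instances that are not ACTIVE. The row layout (process, module, status, five-token timestamp, optional PID) is inferred from the listing itself; `ProcRow`, `parse_row`, and `inactive` are hypothetical helpers, not part of any ColumnStore tool.

```python
from typing import List, NamedTuple, Optional


class ProcRow(NamedTuple):
    process: str
    module: str
    status: str
    changed: str
    pid: Optional[int]


def parse_row(line: str) -> ProcRow:
    """Parse one row: process, module, status, 5-token timestamp, optional PID."""
    parts = line.split()
    if len(parts) not in (8, 9):
        raise ValueError(f"unexpected row: {line!r}")
    pid = int(parts[8]) if len(parts) == 9 else None
    return ProcRow(parts[0], parts[1], parts[2], " ".join(parts[3:8]), pid)


def inactive(rows: List[ProcRow], process: str) -> List[ProcRow]:
    """Return the rows for `process` whose status is not ACTIVE."""
    return [r for r in rows if r.process == process and r.status != "ACTIVE"]


# ExeMgr rows copied from the listing above.
SAMPLE = """\
ExeMgr um1 AUTO_OFFLINE Thu May 30 14:45:57 2019
ExeMgr um2 ACTIVE Thu May 30 14:47:50 2019 11270
ExeMgr pm1 MAN_OFFLINE Thu May 30 14:45:59 2019
ExeMgr pm2 MAN_OFFLINE Thu May 30 14:45:59 2019
"""

rows = [parse_row(line) for line in SAMPLE.splitlines()]
for r in inactive(rows, "ExeMgr"):
    print(r.module, r.status)
```

      On these rows, only um2 has a running ExeMgr: um1 is down as expected, and both PM instances remain MAN_OFFLINE.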

      From the pm1 logs when ExeMgr is trying to start back up:

      May 30 14:46:47 ip-172-31-38-221 ProcessMonitor[9303]: 47.487022 |0|0|0| E 18 CAL0000: Process location: not found
      May 30 14:47:52 ip-172-31-38-221 ProcessMonitor[9303]: 52.591412 |0|0|0| E 18 CAL0000: Process location: not found

      I think the issue is that in a separate system install, the ExeMgr process configuration shows it running on the UM module type only, which would explain the error above. It looks like additional code is needed to handle the local query option.

      Process #7 Configuration information
      ProcessName = ExeMgr
      ModuleType = um
      ProcessLocation = /usr/local/mariadb/columnstore/bin/ExeMgr
      BootLaunch = 2
      LaunchID = 30
      DepModuleName1 = pm*
      DepProcessName1 = PrimProc
      RunType = LOADSHARE
      LogFile = off
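
      The hypothesis above can be illustrated with a minimal sketch. This is not ColumnStore's OAM code: `ProcessConfig`, `find_process_location`, and the `local_query` / `honor_local_query` flags are hypothetical names, used only to show how a lookup keyed on ModuleType would miss ExeMgr on a PM, and how a local-query check might cover that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProcessConfig:
    """Hypothetical stand-in for one process configuration entry."""
    name: str
    module_type: str
    location: str


# Entry taken from the ExeMgr configuration shown above.
CONFIGS = [
    ProcessConfig("ExeMgr", "um", "/usr/local/mariadb/columnstore/bin/ExeMgr"),
]


def find_process_location(name: str, module: str,
                          local_query: bool = False,
                          honor_local_query: bool = False) -> Optional[str]:
    """Return the binary path for `name` on `module`, or None if not found.

    With honor_local_query=False this mimics the suspected behavior: only
    entries whose ModuleType matches the module's type are considered.
    With honor_local_query=True it shows one possible fix (an assumption):
    a UM-typed ExeMgr entry also matches on a PM when local query is enabled.
    """
    module_type = module[:2]  # "pm1" -> "pm", "um2" -> "um"
    for cfg in CONFIGS:
        if cfg.name != name:
            continue
        if cfg.module_type == module_type:
            return cfg.location
        if (honor_local_query and local_query
                and name == "ExeMgr" and module_type == "pm"):
            return cfg.location
    return None


# Suspected behavior: the lookup on pm1 finds nothing.
print(find_process_location("ExeMgr", "pm1", local_query=True))
# Possible handling: the UM entry is accepted on a PM when local query is on.
print(find_process_location("ExeMgr", "pm1", local_query=True,
                            honor_local_query=True))
```

      In this sketch the first call returns None, which corresponds to the "Process location: not found" condition in the log, while the second returns the configured ExeMgr path.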

            People

            Assignee:
            ben.thompson Ben Thompson
            Reporter:
            hill David Hill (Inactive)