Uploaded image for project: 'MariaDB ColumnStore'
  1. MariaDB ColumnStore
  2. MCOL-1977

Amazon AMI with EBS - pm failover failed - ExeMgr in BUSY_INIT for 2 miuntes

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Won't Do
    • 1.2.2
    • N/A
    • ExeMgr
    • None
    • Amazon AMI with EBS 1um and 2 pm with EBS

    Description

      Started with 1um 2pm all Active and stopped PM2 instance.
      Notice ExeMgr was in a BUST_INIT state on UM1 for a long time, over 2 minutes. This is not normal
      And DDL/DMLProc remained MAN_OFFLINE. DDL/DML didnt start up because ExeMgr in BUSY_INIT too long. code gave up too soon on bring up DDL/DML

      System BUSY_INIT Wed Nov 28 16:11:40 2018

      Module um1 ACTIVE Wed Nov 28 16:12:39 2018
      Module pm1 ACTIVE Wed Nov 28 16:08:01 2018
      Module pm2 AUTO_DISABLED/DEGRADED Wed Nov 28 16:11:51 2018

      Active Parent OAM Performance Module is 'pm1'
      MariaDB ColumnStore Replication Feature is enabled
      MariaDB ColumnStore set for Distributed Install

      MariaDB ColumnStore Process statuses

      Process Module Status Last Status Change Process ID
      ------------------ ------ --------------- ------------------------ ----------
      ProcessMonitor um1 ACTIVE Wed Nov 28 16:07:37 2018 17077
      ServerMonitor um1 ACTIVE Wed Nov 28 16:07:58 2018 17683
      DBRMWorkerNode um1 ACTIVE Wed Nov 28 16:12:10 2018 20984
      ExeMgr um1 ACTIVE Wed Nov 28 16:14:39 2018 21107
      DDLProc um1 MAN_OFFLINE Wed Nov 28 16:11:55 2018
      DMLProc um1 MAN_OFFLINE Wed Nov 28 16:11:56 2018
      mysqld um1 ACTIVE Wed Nov 28 16:12:39 2018 21403

      ProcessMonitor pm1 ACTIVE Wed Nov 28 16:07:16 2018 23923
      ProcessManager pm1 ACTIVE Wed Nov 28 16:07:22 2018 24379
      DBRMControllerNode pm1 ACTIVE Wed Nov 28 16:12:05 2018 31560
      ServerMonitor pm1 ACTIVE Wed Nov 28 16:07:58 2018 25265
      DBRMWorkerNode pm1 ACTIVE Wed Nov 28 16:12:14 2018 31640
      PrimProc pm1 ACTIVE Wed Nov 28 16:12:20 2018 31690
      WriteEngineServer pm1 ACTIVE Wed Nov 28 16:12:26 2018 31751

      ProcessMonitor pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
      ProcessManager pm2 AUTO_OFFLINE Wed Nov 28 16:13:04 2018
      DBRMControllerNode pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
      ServerMonitor pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
      DBRMWorkerNode pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
      PrimProc pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
      WriteEngineServer pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018

      Active Alarm Counts: Critical = 3, Major = 1, Minor = 2, Warning = 0, Info = 0

      DDL/DML didnt start up because ExeMgr in BUSY_INIT – log files from um1

      Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.256138 |0|0|0| D 18 CAL0000: STARTING Process: DDLProc
      Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.256167 |0|0|0| D 18 CAL0000: Process location: /home/mysql/mariadb/columnstore/bin/DDLProc
      Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.257881 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm1 is 4
      Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.261321 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm2 is 1
      Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.268606 |0|0|0| D 18 CAL0000: Dependent process of DBRMWorkerNode/um1 is 4
      Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.269818 |0|0|0| D 18 CAL0000: Dependent process of ExeMgr/um1 is 19
      Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.272698 |0|0|0| D 18 CAL0000: Dependent Process is not in correct state, Failed Restoral
      Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.272776 |0|0|0| I 18 CAL0000: START: ACK back to ProcMgr, return status = 8
      Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.281977 |0|0|0| I 18 CAL0000: MSG RECEIVED: Start process request on: DMLProc
      Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.284498 |0|0|0| D 18 CAL0000: checkSpecialProcessState on : DMLProc
      Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.285704 |0|0|0| D 18 CAL0000: checkSpecialProcessState status return : 2
      Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.285823 |0|0|0| D 18 CAL0000: STARTING Process: DMLProc
      Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.285857 |0|0|0| D 18 CAL0000: Process location: /home/mysql/mariadb/columnstore/bin/DMLProc
      Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.287570 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm1 is 4
      Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.290779 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm2 is 1
      Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.295106 |0|0|0| D 18 CAL0000: Dependent process of DBRMWorkerNode/um1 is 4
      Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.296215 |0|0|0| D 18 CAL0000: Dependent process of DDLProc/um1 is 0
      Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.296256 |0|0|0| D 18 CAL0000: Dependent Process is not in correct state, Failed Restoral

      The storage is correct on this test

      getstorageconfig Wed Nov 28 16:24:15 2018

      System Storage Configuration

      Performance Module (DBRoot) Storage Type = external
      User Module Storage Type = internal
      System Assigned DBRoot Count = 2
      DBRoot IDs assigned to 'pm1' = 1, 2
      DBRoot IDs assigned to 'pm2' =

      This was in UM1 logs, which is from ExeMgr. After this log, ExeMgr went Active. So it was trying to communicate with PM2 PrimProc, which wasnt there.

      Nov 28 16:14:39 ip-172-31-35-86 joblist[21107]: 39.523080 |0|0|0| W 05 CAL0000: /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 298 Could not connect to PMS2: InetStreamSocket::connect: connect() error: Connection timed out to: InetStreamSocket: sd: 4 inet: 172.31.44.128 port: 8620

      Columnstore.xml is correct, it shows 1 pm and its IP address

      <PrimitiveServers>
      <Count>1</Count>

      <PMS1>
      <IPAddr>172.31.41.24</IPAddr>
      <Port>8620</Port>
      </PMS1>
      <PMS2>
      <IPAddr>172.31.41.24</IPAddr>
      <Port>8620</Port>

      Attachments

        Activity

          People

            Unassigned Unassigned
            hill David Hill (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.