Network error incorrectly handled: Amazon DBROOT detach failed, but the dbroot was still reassigned

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 1.1.4
    • 1.1.5
    • ?
    • None
    • Amazon EC2 with EBS storage
    • 2018-11, 2018-12

    Description

      Issue reported by a customer. The analysis shows that the system was in a bad state because a detach failed, but the dbroot still got reassigned.

      OK, here is what I see. There was some network issue where pm5 wasn't responding to pings from pm1, so pm1 put it into failover state.
      But based on the logs, pm5 was not down and looked to be idle. So, best guess, it looks like a network issue between pm1 and pm5.

      But since PM5 was still up, and some of the CS processes were still running and had access to the DBROOT, the detach failed.
      The failover code assumes the module is down and that the EBS volume can be detached and reattached to pm1. So that is a BUG: the detach failed, but DBROOT 5
      still got assigned to PM1. We will open a JIRA on that issue.
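
The expected behavior described above can be sketched as follows. This is only an illustration of the ordering (reassign the dbroot only after detach and attach both succeed), not the actual ColumnStore failover code. All names here (`move_dbroot`, `DetachError`, `DbrootAssignments`, and the detach/attach callables) are hypothetical and stand in for the real OAM routines.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


class DetachError(Exception):
    """Hypothetical error raised when a volume cannot be detached."""


@dataclass
class DbrootAssignments:
    # Maps dbroot id -> owning module name, e.g. {5: "pm5"}.
    owner: Dict[int, str] = field(default_factory=dict)


def move_dbroot(
    assignments: DbrootAssignments,
    dbroot: int,
    target: str,
    detach_volume: Callable[[int, str], None],
    attach_volume: Callable[[int, str], None],
) -> bool:
    """Reassign `dbroot` to `target` only if detach and attach both succeed."""
    previous = assignments.owner[dbroot]
    try:
        detach_volume(dbroot, previous)  # may fail if the old module is still alive
        attach_volume(dbroot, target)
    except DetachError:
        # Leave the assignment untouched so the config matches reality.
        return False
    assignments.owner[dbroot] = target
    return True


def busy_detach(dbroot: int, module: str) -> None:
    """Simulates the case in this issue: the volume is still in use on pm5."""
    raise DetachError(f"dbroot {dbroot} still in use on {module}")


def noop(dbroot: int, module: str) -> None:
    """Simulates a step that succeeds."""


state = DbrootAssignments({5: "pm5"})
moved = move_dbroot(state, 5, "pm1", busy_detach, noop)
print(moved, state.owner)
```

With the failing detach, the function reports failure and DBROOT 5 stays assigned to pm5, which is the opposite of what the logs in this issue show.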

      I don't know if you have anything on your side to check for network issues between pm1 and pm5 on Apr 25 13:41:26.

      PM1

      LOGS SHOWING THAT PM5 WASN'T RESPONDING TO PINGS, WHICH INITIATED A MODULE DOWN AND FAILOVER
      info.log:Apr 25 13:41:26 mcs1-pm1 ProcessManager[62467]: 26.465309 |0|0|0| W 17 CAL0000: NIC failed to respond to ping: i-09e7594ac8af0c07e
      info.log:Apr 25 13:41:26 mcs1-pm1 ProcessManager[62467]: 26.475801 |0|0|0| W 17 CAL0000: module failed to respond to pings: pm5
      info.log:Apr 25 13:41:43 mcs1-pm1 ProcessManager[62467]: 43.365051 |0|0|0| W 17 CAL0000: NIC failed to respond to ping: i-09e7594ac8af0c07e
      info.log:Apr 25 13:41:43 mcs1-pm1 ProcessManager[62467]: 43.367980 |0|0|0| W 17 CAL0000: module failed to respond to pings: pm5
      info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.310225 |0|0|0| W 17 CAL0000: NIC failed to respond to ping: i-09e7594ac8af0c07e
      info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.313222 |0|0|0| W 17 CAL0000: module failed to respond to pings: pm5
      info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.313373 |0|0|0| C 17 CAL0000: module is down: pm5

      FAILURE TO DETACH DBROOT 5 FROM PM5
      info.log:Apr 25 13:47:02 mcs1-pm1 oamcpp[62467]: 02.952210 |0|0|0| E 08 CAL0000: ERROR: amazonReattach, detachEC2Volume failed on vol-09f84a3ec4b5f7dbb
      info.log:Apr 25 13:47:02 mcs1-pm1 oamcpp[62467]: 02.952312 |0|0|0| E 08 CAL0000: ERROR: amazonReattach api failure
      info.log:Apr 25 13:47:02 mcs1-pm1 oamcpp[62467]: 02.952374 |0|0|0| E 08 CAL0000: ERROR: manualMovePmDbroot failure: pm1:5:pm5

      PM5

      THIS SAYS THAT PM5 WAS BASICALLY IDLE AS FAR AS THE CS LOGS SHOW, NO ACTIVE CPIMPORTS
      info.log:Apr 25 13:24:55 mcs1-pm5 cpimport.bin[13602]: 55.066750 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-689411; status-SUCCESS
      info.log:Apr 25 13:24:55 mcs1-pm5 writeengineserver[57886]: 55.075839 |0|0|0| I 32 CAL0000: 6607 : cpimport exit on success

      THIS SHOWS THAT 3 SECONDS AFTER PM1 DECLARED THE MODULE DOWN FOR NOT RESPONDING TO PINGS, PM5 RECEIVED MSGS FROM PM1
      info.log:Apr 25 13:42:36 mcs1-pm5 ProcessMonitor[101111]: 36.678186 |0|0|0| I 18 CAL0000: MSG RECEIVED: Re-Init process request on: cpimport
      info.log:Apr 25 13:42:36 mcs1-pm5 ProcessMonitor[101111]: 36.925981 |0|0|0| I 18 CAL0000: PROCREINITPROCESS: completed, no ack to ProcMgr
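
The correlation made by hand above (pm1 declares pm5 down, yet pm5 logs activity seconds later) can be sketched as a small script. This is a rough sketch under assumptions: the function names are hypothetical, the module name is taken to be the hostname suffix after the last `-` (e.g. `mcs1-pm5` is `pm5`), and the year is passed in because syslog lines omit it (2018 is assumed here).

```python
import re
from datetime import datetime

# Matches lines like "info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: ..."
LINE_RE = re.compile(
    r"^(?:\S+:)?(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<proc>[^\[\s]+)\[\d+\]:\s+(?P<msg>.*)$"
)
DOWN_RE = re.compile(r"module is down: (\S+)")


def parse(line, year):
    """Return (timestamp, host, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
    return ts, m["host"], m["msg"]


def activity_after_down(lines, year=2018):
    """List (module, seconds) for modules that logged after being declared down."""
    events = sorted(
        (e for e in (parse(l, year) for l in lines) if e), key=lambda e: e[0]
    )
    down_at = {}
    findings = []
    for ts, host, msg in events:
        d = DOWN_RE.search(msg)
        if d:
            down_at.setdefault(d.group(1), ts)
            continue
        module = host.rsplit("-", 1)[-1]  # assumption: "mcs1-pm5" -> "pm5"
        if module in down_at:
            delay = int((ts - down_at.pop(module)).total_seconds())
            findings.append((module, delay))
    return findings


SAMPLE = [
    "info.log:Apr 25 13:24:55 mcs1-pm5 writeengineserver[57886]: 55.075839 |0|0|0| I 32 CAL0000: 6607 : cpimport exit on success",
    "info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.313222 |0|0|0| W 17 CAL0000: module failed to respond to pings: pm5",
    "info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.313373 |0|0|0| C 17 CAL0000: module is down: pm5",
    "info.log:Apr 25 13:42:36 mcs1-pm5 ProcessMonitor[101111]: 36.678186 |0|0|0| I 18 CAL0000: MSG RECEIVED: Re-Init process request on: cpimport",
]

print(activity_after_down(SAMPLE))  # [('pm5', 3)]
```

A non-empty result suggests the module was reachable and running after being declared down, so detaching its volume is likely to fail.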

          People

            dleeyh Daniel Lee (Inactive)
            hill David Hill (Inactive)