Uploaded image for project: 'MariaDB ColumnStore'
  1. MariaDB ColumnStore
  2. MCOL-2078

PM2 failover with external disk - dbroot didnt get assigned to recovered pm

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Won't Fix
    • 1.2.2
    • N/A
    • ?
    • None
    • 2um 3pm with external storage

    Description

      Customer reported that had a pm outage and the dbroot didnt get reassigned to the pm once it recovered. The dbroot did get assigned to pm1 when the outage occurred, but it failed ti get moved back to pm2 when it recovered.

      from the logs, it looks like it ran through the code to move back to pm2, but it actually just remounted back to pm1.

      Pm1 logs

      PM2 OUTAGE

      Jan 9 15:22:35 usiadmdmaxp01 ProcessManager[8766]: 35.301372 |0|0|0| C 17 CAL0000: *** module is down: pm2

      Jan 9 15:23:58 usiadmdmaxp01 ProcessManager[8766]: 58.861228 |0|0|0| D 17 CAL0000: Call autoMovePmDbroot
      Jan 9 15:23:58 usiadmdmaxp01 oamcpp[8766]: 58.861259 |0|0|0| D 08 CAL0000: autoMovePmDbroot: pm2
      Jan 9 15:23:58 usiadmdmaxp01 oamcpp[8766]: 58.879593 |0|0|0| D 08 CAL0000: autoMovePmDbroot/setPmDbrootConfig : pm1:2
      Jan 9 15:24:03 usiadmdmaxp01 oamcpp[8766]: 03.880523 |0|0|0| D 08 CAL0000: mountDBRoot api, mount dbroot2
      Jan 9 15:24:03 usiadmdmaxp01 ProcessManager[8766]: 03.890372 |0|0|0| I 17 CAL0000: MSG RECEIVED: mount dbroot : 2
      Jan 9 15:24:03 usiadmdmaxp01 ProcessManager[8766]: 03.905581 |0|0|0| D 17 CAL0000: send mountDBRoot to pm: 2/pm1
      Jan 9 15:24:04 usiadmdmaxp01 ProcessManager[8766]: 04.300736 |0|0|0| I 17 CAL0000: Mount Completed status: 0
      Jan 9 15:24:04 usiadmdmaxp01 oamcpp[8766]: 04.313360 |0|0|0| D 08 CAL0000: WRITE3: echo 'pm2|pm1|2' >> /usr/local/mariadb/columnstore/local/moveDbrootTransactionLog
      Jan 9 15:24:04 usiadmdmaxp01 ProcessManager[8766]: 04.313664 |0|0|0| D 17 CAL0000: autoMovePmDbroot success

      Jan 9 15:25:18 usiadmdmaxp01 ProcessManager[8766]: 18.815945 |0|0|0| D 17 CAL0000: *** Module alive, bring it back online: pm2

      Jan 9 15:25:27 usiadmdmaxp01 ProcessManager[8766]: 27.237232 |0|0|0| D 17 CAL0000: enableModule request for pm2 completed
      Jan 9 15:25:27 usiadmdmaxp01 ProcessManager[8766]: 27.237552 |0|0|0| D 17 CAL0000: Call autoUnMovePmDbroot
      Jan 9 15:25:27 usiadmdmaxp01 oamcpp[8766]: 27.245169 |0|0|0| D 08 CAL0000: autoUnMovePmDbroot: pm2
      Jan 9 15:25:27 usiadmdmaxp01 oamcpp[8766]: 27.246312 |0|0|0| D 08 CAL0000: READ: pm2|pm1|2
      Jan 9 15:25:27 usiadmdmaxp01 oamcpp[8766]: 27.251094 |0|0|0| D 08 CAL0000: manualMovePmDbroot: 2 from pm1 to pm2
      Jan 9 15:25:27 usiadmdmaxp01 oamcpp[8766]: 27.265265 |0|0|0| D 08 CAL0000: mountDBRoot api, umount dbroot2
      Jan 9 15:25:27 usiadmdmaxp01 ProcessManager[8766]: 27.274643 |0|0|0| I 17 CAL0000: MSG RECEIVED: Unmount dbroot : 2
      Jan 9 15:25:27 usiadmdmaxp01 ProcessManager[8766]: 27.281937 |0|0|0| D 17 CAL0000: send unmountDBRoot to pm: 2/pm1
      Jan 9 15:25:27 usiadmdmaxp01 ProcessManager[8766]: 27.282077 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm1
      Jan 9 15:25:27 usiadmdmaxp01 ProcessMonitor[17836]: 27.282605 |0|0|0| I 18 CAL0000: MSG RECEIVED: Unmount DBRoot: 2
      Jan 9 15:25:30 usiadmdmaxp01 ProcessMonitor[17836]: 30.079500 |0|0|0| D 18 CAL0000: flushInodeCache successful
      Jan 9 15:25:30 usiadmdmaxp01 ProcessMonitor[17836]: 30.345769 |0|0|0| I 18 CAL0000: PROCUNMOUNT: ACK back to ProcMgr, status: 0
      Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.345875 |0|0|0| I 17 CAL0000: UnMount Completed status: 0

      Jan 9 15:25:30 usiadmdmaxp01 oamcpp[8766]: 30.399497 |0|0|0| D 08 CAL0000: mountDBRoot api, mount dbroot2
      Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.421611 |0|0|0| I 17 CAL0000: MSG RECEIVED: mount dbroot : 2
      Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.427101 |0|0|0| D 17 CAL0000: send mountDBRoot to pm: 2/pm1
      Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.506374 |0|0|0| I 17 CAL0000: Mount Completed status: 0

      Jan 9 15:25:30 usiadmdmaxp01 oamcpp[8766]: 30.508297 |0|0|0| D 08 CAL0000: autoUnMovePmDbroot/manualMovePmDbroot : pm1:2:pm2
      Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.517216 |0|0|0| W 17 CAL0000: autoUnMovePmDbroot left no dbroots mounted, failing module restart: pm2
      Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.520170 |0|0|0| D 17 CAL0000: disableModule request for pm2

      PM2 LOGS – THIS TELLS ME THAT PM2 SERVER DIDNT CRASH, BUT JUST WAS FAILING TO RESPOND TO PINGS FROM PM1. SO SOME NETWORK ISSUE OR PM2 SERVER LOCKED UP AND RECOVERED TO START PINGING

      Jan 9 15:22:51 usiadmdmaxp02 ProcessMonitor[1688]: 51.188086 |0|0|0| D 18 CAL0000: State out-of-sync, update on ProcessMonitor/4/1688

      DIDNT RECEIVE ANY MESSAGES FROM PM2 TO REMOUNT DBROOT 2 AT 15:25 FROM PM1. IT SHOULD HAVE.

      Attachments

        Activity

          People

            ben.thompson Ben Thompson (Inactive)
            hill David Hill (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.