Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Won't Fix
-
1.2.2
-
None
-
2um 3pm with external storage
Description
Customer reported that had a pm outage and the dbroot didnt get reassigned to the pm once it recovered. The dbroot did get assigned to pm1 when the outage occurred, but it failed ti get moved back to pm2 when it recovered.
from the logs, it looks like it ran through the code to move back to pm2, but it actually just remounted back to pm1.
Pm1 logs
PM2 OUTAGE
Jan 9 15:22:35 usiadmdmaxp01 ProcessManager[8766]: 35.301372 |0|0|0| C 17 CAL0000: *** module is down: pm2
Jan 9 15:23:58 usiadmdmaxp01 ProcessManager[8766]: 58.861228 |0|0|0| D 17 CAL0000: Call autoMovePmDbroot
Jan 9 15:23:58 usiadmdmaxp01 oamcpp[8766]: 58.861259 |0|0|0| D 08 CAL0000: autoMovePmDbroot: pm2
Jan 9 15:23:58 usiadmdmaxp01 oamcpp[8766]: 58.879593 |0|0|0| D 08 CAL0000: autoMovePmDbroot/setPmDbrootConfig : pm1:2
Jan 9 15:24:03 usiadmdmaxp01 oamcpp[8766]: 03.880523 |0|0|0| D 08 CAL0000: mountDBRoot api, mount dbroot2
Jan 9 15:24:03 usiadmdmaxp01 ProcessManager[8766]: 03.890372 |0|0|0| I 17 CAL0000: MSG RECEIVED: mount dbroot : 2
Jan 9 15:24:03 usiadmdmaxp01 ProcessManager[8766]: 03.905581 |0|0|0| D 17 CAL0000: send mountDBRoot to pm: 2/pm1
Jan 9 15:24:04 usiadmdmaxp01 ProcessManager[8766]: 04.300736 |0|0|0| I 17 CAL0000: Mount Completed status: 0
Jan 9 15:24:04 usiadmdmaxp01 oamcpp[8766]: 04.313360 |0|0|0| D 08 CAL0000: WRITE3: echo 'pm2|pm1|2' >> /usr/local/mariadb/columnstore/local/moveDbrootTransactionLog
Jan 9 15:24:04 usiadmdmaxp01 ProcessManager[8766]: 04.313664 |0|0|0| D 17 CAL0000: autoMovePmDbroot success
Jan 9 15:25:18 usiadmdmaxp01 ProcessManager[8766]: 18.815945 |0|0|0| D 17 CAL0000: *** Module alive, bring it back online: pm2
Jan 9 15:25:27 usiadmdmaxp01 ProcessManager[8766]: 27.237232 |0|0|0| D 17 CAL0000: enableModule request for pm2 completed
Jan 9 15:25:27 usiadmdmaxp01 ProcessManager[8766]: 27.237552 |0|0|0| D 17 CAL0000: Call autoUnMovePmDbroot
Jan 9 15:25:27 usiadmdmaxp01 oamcpp[8766]: 27.245169 |0|0|0| D 08 CAL0000: autoUnMovePmDbroot: pm2
Jan 9 15:25:27 usiadmdmaxp01 oamcpp[8766]: 27.246312 |0|0|0| D 08 CAL0000: READ: pm2|pm1|2
Jan 9 15:25:27 usiadmdmaxp01 oamcpp[8766]: 27.251094 |0|0|0| D 08 CAL0000: manualMovePmDbroot: 2 from pm1 to pm2
Jan 9 15:25:27 usiadmdmaxp01 oamcpp[8766]: 27.265265 |0|0|0| D 08 CAL0000: mountDBRoot api, umount dbroot2
Jan 9 15:25:27 usiadmdmaxp01 ProcessManager[8766]: 27.274643 |0|0|0| I 17 CAL0000: MSG RECEIVED: Unmount dbroot : 2
Jan 9 15:25:27 usiadmdmaxp01 ProcessManager[8766]: 27.281937 |0|0|0| D 17 CAL0000: send unmountDBRoot to pm: 2/pm1
Jan 9 15:25:27 usiadmdmaxp01 ProcessManager[8766]: 27.282077 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm1
Jan 9 15:25:27 usiadmdmaxp01 ProcessMonitor[17836]: 27.282605 |0|0|0| I 18 CAL0000: MSG RECEIVED: Unmount DBRoot: 2
Jan 9 15:25:30 usiadmdmaxp01 ProcessMonitor[17836]: 30.079500 |0|0|0| D 18 CAL0000: flushInodeCache successful
Jan 9 15:25:30 usiadmdmaxp01 ProcessMonitor[17836]: 30.345769 |0|0|0| I 18 CAL0000: PROCUNMOUNT: ACK back to ProcMgr, status: 0
Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.345875 |0|0|0| I 17 CAL0000: UnMount Completed status: 0
Jan 9 15:25:30 usiadmdmaxp01 oamcpp[8766]: 30.399497 |0|0|0| D 08 CAL0000: mountDBRoot api, mount dbroot2
Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.421611 |0|0|0| I 17 CAL0000: MSG RECEIVED: mount dbroot : 2
Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.427101 |0|0|0| D 17 CAL0000: send mountDBRoot to pm: 2/pm1
Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.506374 |0|0|0| I 17 CAL0000: Mount Completed status: 0
Jan 9 15:25:30 usiadmdmaxp01 oamcpp[8766]: 30.508297 |0|0|0| D 08 CAL0000: autoUnMovePmDbroot/manualMovePmDbroot : pm1:2:pm2
Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.517216 |0|0|0| W 17 CAL0000: autoUnMovePmDbroot left no dbroots mounted, failing module restart: pm2
Jan 9 15:25:30 usiadmdmaxp01 ProcessManager[8766]: 30.520170 |0|0|0| D 17 CAL0000: disableModule request for pm2
PM2 LOGS – THIS TELLS ME THAT PM2 SERVER DIDNT CRASH, BUT JUST WAS FAILING TO RESPOND TO PINGS FROM PM1. SO SOME NETWORK ISSUE OR PM2 SERVER LOCKED UP AND RECOVERED TO START PINGING
Jan 9 15:22:51 usiadmdmaxp02 ProcessMonitor[1688]: 51.188086 |0|0|0| D 18 CAL0000: State out-of-sync, update on ProcessMonitor/4/1688
DIDNT RECEIVE ANY MESSAGES FROM PM2 TO REMOUNT DBROOT 2 AT 15:25 FROM PM1. IT SHOULD HAVE.