[MCOL-3308] Cannot move DBRoot to resurrected PM after automatic fail-over Created: 2019-05-14  Updated: 2023-10-26  Resolved: 2022-02-16

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.2.3
Fix Version/s: Icebox

Type: Bug Priority: Critical
Reporter: Assen Totin (Inactive) Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MCOL-1732 movePmDbrootConfig on external disk f... Closed

 Description   

Newly installed system to test a setup for a prospect: 1 UM + 3 PM, 3 DBRoots, initially one per PM. PM1 is the OAM node. Turning off PM3 triggered an automatic fail-over, so DBRoot3 got attached to PM1 and queries were processed properly. (We only have one database with one small testing table.)

After PM3 was booted again, it came up in the MAN_DISABLED state (I guess this was expected?).

To initiate moving DBRoot3 back to PM3, we first had to re-enable the module with "alterSystem-EnableModule pm3", after which PM3 changed to the MAN_OFFLINE state. To be able to initiate a DBRoot move, we then had to stop system processing with "stopSystem", after which the whole system went to MAN_OFFLINE:

Component Status Last Status Change
------------ -------------------------- ------------------------
System MAN_OFFLINE Tue May 14 17:43:37 2019

Module um1 MAN_OFFLINE Tue May 14 17:43:31 2019
Module pm1 MAN_OFFLINE Tue May 14 17:43:34 2019
Module pm2 MAN_OFFLINE Tue May 14 17:43:31 2019
Module pm3 MAN_OFFLINE Tue May 14 17:44:45 2019

We then triggered the move (DBRoot3 from PM1 to PM3):

mcsadmin> movePmDbrootConfig pm1 3 pm3
movepmdbrootconfig Tue May 14 17:45:07 2019

DBRoot IDs currently assigned to 'pm1' = 1, 3
DBRoot IDs currently assigned to 'pm3' =

DBroot IDs being moved, please wait...

DBRoot IDs newly assigned to 'pm1' = 1, 3
DBRoot IDs newly assigned to 'pm3' =

As can be seen, the DBRoot was not moved. Starting the system was then not possible, because PM3 had no DBRoot attached:

May 14 17:19:54 p2w1 ProcessManager[12373]: 54.438248 |0|0|0| C 17 CAL0000: startSystemThread failed: Module 'pm3' has no DBRoots assigned to it

We had to manually disable PM3 in order to start the system, which then came up and began processing queries.

The error log has no entries related to the movePmDbrootConfig command. The debug log has some, which seem to suggest that the move was successful. One line stands out: after the unmountDBRoot for DBRoot3 is sent to PM1 (correct), the mountDBRoot is then sent to pm1 again (?!) rather than to pm3. Am I missing anything here?

May 14 17:46:09 p2w1 oamcpp[6898]: 09.591518 |0|0|0| D 08 CAL0000: manualMovePmDbroot: 3 from pm1 to pm3
May 14 17:46:09 p2w1 oamcpp[6898]: 09.604461 |0|0|0| D 08 CAL0000: mountDBRoot api, umount dbroot3
May 14 17:46:09 p2w1 ProcessManager[12373]: 09.611753 |0|0|0| I 17 CAL0000: MSG RECEIVED: Unmount dbroot : 3
May 14 17:46:09 p2w1 ProcessManager[12373]: 09.616267 |0|0|0| D 17 CAL0000: send unmountDBRoot to pm: 3/pm1
May 14 17:46:09 p2w1 ProcessManager[12373]: 09.616330 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm1
May 14 17:46:09 p2w1 ProcessMonitor[12243]: 09.616562 |0|0|0| I 18 CAL0000: MSG RECEIVED: Unmount DBRoot: 3
May 14 17:46:09 p2w1 ProcessMonitor[12243]: 09.621860 |0|0|0| D 18 CAL0000: flushInodeCache successful
May 14 17:46:09 p2w1 ProcessMonitor[12243]: 09.798351 |0|0|0| I 18 CAL0000: PROCUNMOUNT: ACK back to ProcMgr, status: 0
May 14 17:46:09 p2w1 ProcessManager[12373]: 09.798408 |0|0|0| I 17 CAL0000: UnMount Completed status: 0
May 14 17:46:09 p2w1 oamcpp[6898]: 09.804563 |0|0|0| D 08 CAL0000: mountDBRoot api, mount dbroot3
May 14 17:46:09 p2w1 ProcessManager[12373]: 09.830309 |0|0|0| I 17 CAL0000: MSG RECEIVED: mount dbroot : 3
May 14 17:46:09 p2w1 ProcessManager[12373]: 09.834871 |0|0|0| D 17 CAL0000: send mountDBRoot to pm: 3/pm1
May 14 17:46:10 p2w1 ProcessManager[12373]: 10.216917 |0|0|0| I 17 CAL0000: Mount Completed status: 0
May 14 17:46:10 p2w1 ProcessManager[12373]: 10.224347 |0|0|0| I 17 CAL0000: MSG RECEIVED: Distribute Config File system/Columnstore.xml
May 14 17:46:10 p2w1 ProcessManager[12373]: 10.224415 |0|0|0| D 17 CAL0000: distributeConfigFile called for system file = Columnstore.xml
May 14 17:46:10 p2w1 ProcessManager[12373]: 10.301994 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module um1
May 14 17:46:10 p2w1 ProcessManager[12373]: 10.302527 |0|0|0| D 17 CAL0000: um1 distributeConfigFile success.
May 14 17:46:10 p2w1 ProcessManager[12373]: 10.307771 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm2
May 14 17:46:10 p2w1 ProcessManager[12373]: 10.308426 |0|0|0| D 17 CAL0000: pm2 distributeConfigFile success.
May 14 17:46:10 p2w1 ProcessManager[12373]: 10.313716 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm3
May 14 17:46:10 p2w1 ProcessManager[12373]: 10.314540 |0|0|0| D 17 CAL0000: pm3 distributeConfigFile success.
May 14 17:46:10 p2w1 ProcessManager[12373]: 10.314594 |0|0|0| I 17 CAL0000: Distribute Config File Completed system/Columnstore.xml



 Comments   
Comment by David Hill (Inactive) [ 2019-05-14 ]

This is a BUG. A previous MCOL issue was opened for it:

https://jira.mariadb.org/browse/MCOL-1732

Comment by Assen Totin (Inactive) [ 2019-05-15 ]

It is not a bug; it is completely missing functionality.

Check the Oam::manualMovePmDbroot function: it only modifies todbrootConfigList and residedbrootConfigList if DataRedundancyConfig is set, which in turn is only true when Gluster is enabled.

In our case we don't have Gluster (because Gluster is slow and we are testing a solution that will have to ingest 50K rows per second constantly).

Was this ever working? How can we have automatic fail-over that works with NFS (i.e. does move the DBRoots properly), but no manual move when we need to rejoin a resurrected node?

Comment by Roman [ 2019-07-29 ]

I agree. This looks like missing functionality.

Comment by Roman [ 2022-02-16 ]

We don't have OAM anymore. CMAPI CLI will have this functionality.

Generated at Thu Feb 08 02:41:46 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.