[MCOL-396] AlarmConfig.xml file zero'd out when pm server restarted locally Created: 2016-11-08  Updated: 2023-10-26  Resolved: 2020-04-15

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.0.4
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: David Thompson (Inactive) Assignee: Andrew Hutchings (Inactive)
Resolution: Won't Fix Votes: 0
Labels: community

Issue Links:
Duplicate
duplicates MCOL-399 Node failed to recover during a node... Closed
Relates
relates to MCOL-484 DBRMControllerNode FAILED , CONN_FAIL... Closed
Sprint: 2016-22, 2016-23

 Description   

Created from KB thread: https://mariadb.com/kb/en/mariadb/issues-with-mariadb-columnstore-104/

I reinstalled MariaDB ColumnStore 1.0.4 from scratch after resetting the locale on all worker nodes. The installation now finishes successfully and the system status is Active, so it does not seem to be a firewall issue. I used local disks for the installation.
I have copied the SSH key from PM1 to the other nodes. I also need to copy the SSH key from UM1 to UM2 so that UM1 can configure replication between UM1 and UM2.
My configuration is 2UM3PM. I can reproduce the zero-size AlarmConfig.xml by restarting the processes on a particular PM node with the command "/usr/local/mariadb/columnstore/bin/columnstore restart". You are right that I should use mcsadmin to manage the whole system, but I had to try other workarounds when I failed to stop the system. I have checked the firewall settings and stopped the iptables service, and I still hit the problem when stopping the system. I observed that some processes could not be stopped while the status of the PM node was failed. Please help investigate the stop issue. Thank you very much.



 Comments   
Comment by David Thompson (Inactive) [ 2016-11-18 ]

The probable cause here is that the system periodically replicates configuration files from pm1 to the other nodes. If a node crashes or is brought down outside of an official stop or shutdown from mcsadmin on pm1, and the timing is unlucky, the file may be in the process of being written and be left either empty or incomplete.
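
For illustration only, here is a minimal Python sketch of one common way to avoid leaving a half-written file: write to a temporary file in the same directory and rename it over the target. This is not how ColumnStore replicates its config files; the function name `write_atomically` is hypothetical, and the sketch assumes a POSIX-style filesystem where a rename within one directory is atomic.

```python
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers never see a partial file.

    Hypothetical helper for illustration; not part of ColumnStore.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap: old content or new, never empty
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this pattern, a process killed mid-write leaves at worst a stray temporary file, and the previous copy of the target stays intact.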

We may or may not be able to fix all cases of this, but it's likely a better idea to have system startup 'fix' this or to have an official means of forcing a manual sync of config files before the node is brought back up.
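
As a rough sketch of what such a startup check could look like (an assumption, not existing ColumnStore behavior), the following Python function treats a config file as unusable if it is missing, zero-length, or not well-formed XML. The name `config_is_usable` is hypothetical; what the system should do when the check fails, such as requesting a fresh copy from pm1, is left out.

```python
import os
import xml.etree.ElementTree as ET


def config_is_usable(path: str) -> bool:
    """Return True only if `path` exists, is non-empty, and parses as XML.

    Hypothetical helper for illustration; not part of ColumnStore.
    """
    try:
        if os.path.getsize(path) == 0:
            return False  # the zero-size case described in this issue
        ET.parse(path)  # raises ParseError on truncated or malformed XML
    except (OSError, ET.ParseError):
        return False
    return True
```

A check like this only detects damage. Well-formed XML can still be semantically wrong, so it complements a resync rather than replacing one.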

Comment by David Thompson (Inactive) [ 2016-11-29 ]

What I'd like done here is to test and document a workaround for the case where this happens on a non-pm1 node in a multi-node install. There is an mcsadmin command to manually redistribute config files to the other nodes, which in theory should mitigate this on the rare occasion that it happens.

Obviously, if we can see a simple fix to the file replication logic, that would be good, but it's still likely we'll need a workaround for any further edge cases.

Comment by Todd Stoffel (Inactive) [ 2020-04-15 ]

OAM is being deprecated and replaced by an enhanced API and the MaxScale orchestration project.

Generated at Thu Feb 08 02:20:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.