[MCOL-5302] mcs-savebrm.py overwrites Extent Map files multiple times with shared(non-S3) storage setup Created: 2022-11-11  Updated: 2023-11-17  Resolved: 2023-01-18

Status: Closed
Project: MariaDB ColumnStore
Component/s: installation
Affects Version/s: 22.08.3
Fix Version/s: 22.08.8

Type: Bug Priority: Critical
Reporter: Roman Assignee: Alan Mologorsky
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
causes MCOL-5304 Save extent-map backup on each node c... Open
Sprint: 2022-22, 2022-23
Assigned for Testing: Daniel Lee Daniel Lee (Inactive)

 Description   

When the MCS cluster is shut down, mcs-savebrm.py is called to save the extent map. The original intention was to save it only on the primary node, but mcs-savebrm.py doesn't detect a primary in a shared (non-S3) storage setup. This effectively allows cluster nodes to overwrite the extent map files multiple times during shutdown, and in some cases causes the save_brm binary to get stuck.
The suggested solution is to make mcs-savebrm.py detect the primary for any cluster or storage type. Two detection mechanisms are implemented in mcs-savebrm.py:

  • using CMAPI
  • search for DBRMController.IPAddr in the list of local IPv4 addresses (if CMAPI is unavailable)
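The fallback mechanism above can be sketched as follows. This is a hedged illustration, not the actual mcs-savebrm.py code: the XML tag names (`DBRM_Controller/IPAddr`) and helper names are assumptions, and the real script would try CMAPI first and only fall back to this local-address check.

```python
# Sketch of the fallback primary-detection path: compare the configured
# DBRM controller address against this host's IPv4 addresses.
# Tag names and function names are illustrative assumptions.
import socket
import xml.etree.ElementTree as ET


def local_ipv4_addresses():
    """Collect IPv4 addresses that resolve for this host (plus loopback)."""
    addrs = {"127.0.0.1"}
    try:
        addrs.update(socket.gethostbyname_ex(socket.gethostname())[2])
    except socket.gaierror:
        pass  # hostname may not resolve; loopback remains
    return addrs


def controller_ip_from_config(xml_text):
    """Extract the DBRM controller IPAddr from a Columnstore-style XML config."""
    root = ET.fromstring(xml_text)
    node = root.find(".//DBRM_Controller/IPAddr")  # tag path is an assumption
    return node.text.strip() if node is not None else None


def is_primary(controller_ip, local_ips):
    """This node is primary if the controller address is one of its local IPs."""
    return controller_ip is not None and controller_ip in local_ips
```

With this check in place, only the node whose local address matches the configured DBRM controller would run save_brm, regardless of whether CMAPI is reachable.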


 Comments   
Comment by Roman [ 2022-11-28 ]

4QA
This change should be tested on a multinode cluster backed by NFS, where all dbroots are writable from all nodes. The scenario fixed by this issue is:

  • shutdown a cluster
  • check /var/lib/columnstore/data1/systemFiles/dbrm/* timestamps. The current stable release writes these files multiple times, and there is a race between nodes saving the dbrm files.
    JFYI, save_brm is called from the mcs-workernode@1 and mcs-workernode@2 systemd units. You can add an artificial delay to make the race easier to detect.
    This fix must resolve the race so that only the primary writes the dbrm files; checking /var/lib/columnstore/data1/systemFiles/dbrm/* timestamps should show that the dbrm files are written only once.
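The timestamp check described above can be automated with a small poller that counts how many times each dbrm file's mtime changes during shutdown. This is a hedged sketch for testing, not part of the fix; the directory path comes from the comment, and the helper names are hypothetical.

```python
# Poll a dbrm directory and count mtime changes per file; with the fix,
# each file should change at most once during a cluster shutdown.
import os
import time
from collections import defaultdict

DBRM_DIR = "/var/lib/columnstore/data1/systemFiles/dbrm"  # path from the comment


def snapshot_mtimes(path):
    """Map filename -> mtime for every regular file in `path`."""
    return {e.name: e.stat().st_mtime for e in os.scandir(path) if e.is_file()}


def count_writes(path, duration=30.0, interval=0.5):
    """Poll every `interval` seconds for `duration` seconds.

    Returns a dict of filename -> number of observed mtime changes
    (new files count as a change); files never touched are absent.
    """
    changes = defaultdict(int)
    last = snapshot_mtimes(path)
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        time.sleep(interval)
        cur = snapshot_mtimes(path)
        for name, mtime in cur.items():
            if name not in last or mtime != last[name]:
                changes[name] += 1
        last = cur
    return dict(changes)
```

Running `count_writes(DBRM_DIR)` while issuing `mcs cluster stop` should report at most one change per file on a fixed build, and multiple changes on an affected one.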

Please also test a failure scenario where the primary is lost before the shutdown and: a) failover finishes, b) failover doesn't have time to converge before the cluster is shut down.

Comment by Daniel Lee (Inactive) [ 2023-01-18 ]

Build verified: 23.02 (Drone build# 6492)

Verified on a 3-node cluster with NFS shared storage.

Checked timestamps for all files in the dbrm directory every 0.5 seconds during the test scenarios.

The timestamps of the dbrm files changed only once in each scenario:
1) cluster stop
2) failover completed
3) failed node rejoining the cluster
4) cluster stopped during failover

For scenario #4, stopping the cluster while the primary node was down produced the following messages, which is expected since s1pm2 was the primary node.

[rocky8:root@rocky8~]# mcs cluster stop
{
  "timestamp": "2023-01-18 14:27:40.921970",
  "s1pm2": "HTTPSConnectionPool(host='s1pm2', port=8640): Max retries exceeded with url: /cmapi/0.4.0/node/shutdown (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4afa702dd0>: Failed to establish a new connection: [Errno 113] No route to host'))",
  "s1pm3": { "timestamp": "2023-01-18 14:27:45.538025" },
  "s1pm1": { "timestamp": "2023-01-18 14:27:46.352748" }
}

Generated at Thu Feb 08 02:56:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.