[MCOL-5058] CMAPI and local smcat runs can access Storage Manager too early, causing an assertion failure in the SM runtime

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 5.6.5, 6.2.3
Fix Version/s: Icebox
Component/s: cmapi, Storage Manager
Labels: None

Description

Consider part of the startup procedure for an S3-based installation [1]. When CMAPI receives a cluster/start REST call, it issues node/start calls to all nodes. node/start in turn begins by starting mcs-workernode@1 | mcs-workernode@2. These two units trigger startup of the mcs-loadbrm systemd unit, which in turn starts its local SM by running systemctl start mcs-storagemanager. While SM is bootstrapping, there is a window during which its internal prefix cache structure [2] is not yet populated. If an SM request [4] arrives during this window, SM fails an assertion [3] and aborts. This failure causes the mcs-workernode@{1,2} units to fail [5]. The most severe consequence is that non-primary nodes may appear healthy while holding reduced and corrupted extent maps in /dev/shm, so any extent map write operation distributed by the controllernode will put the cluster into read-only mode.

Together with Alan, we introduced explicit delays between SM startup and the actual extent map image load at the customer's site. However, this workaround is not an appropriate long-term solution. IMHO there are two long-term options:

SM should not assert at this point but return an error, so that the layers above (e.g. smcat, or CMAPI calling smcat) are notified and can retry.
CMAPI must wait until the workernode on the primary node is available.

The second approach does not solve the issue with ahead-of-time local smcat runs, though, so the first one looks more appropriate.
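To illustrate the caller side of the first option, here is a minimal sketch of a retry wrapper with exponential backoff. It assumes SM has already been changed to return an error, so that smcat exits with a non-zero code instead of SM aborting; that behaviour does not exist today. The names run_with_retry and StorageManagerNotReady, and all default values, are hypothetical and not part of CMAPI.

```python
import subprocess
import time


class StorageManagerNotReady(RuntimeError):
    """Raised when every attempt failed (hypothetical name, not a CMAPI class)."""


def run_with_retry(cmd, attempts=5, initial_delay=0.5, backoff=2.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0, sleeping longer after each failure.

    Assumption: a non-zero exit code means "SM is not ready yet".
    runner and sleep are injectable so the logic can be tested without SM.
    """
    delay = initial_delay
    last_rc = None
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True)
        if result.returncode == 0:
            return result.stdout
        last_rc = result.returncode
        if attempt < attempts:
            # Back off before the next attempt; no sleep after the last one.
            sleep(delay)
            delay *= backoff
    raise StorageManagerNotReady(
        f"{cmd!r} still failing after {attempts} attempts (exit code {last_rc})"
    )


# Usage (the file path is a placeholder, not taken from this report):
# data = run_with_retry(["smcat", "<path-to-extent-map-image>"])
```

A bounded number of attempts keeps a permanently broken SM from blocking startup forever; the caller can then report the failure instead of loading a partial extent map.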

1. Here I consider systemd startup; however, the logic is the same for non-systemd container startup.
2. The prefix cache is the set of directory paths, e.g. data1/systemFiles/dbrm/, that SM owns and processes requests for.
3. Apr 13 16:58:42 nvmesh-target-c env[3222951]: StorageManager: /home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/columnstore/columnstore/storage-manager/src/Cache.cpp:300: storagemanager::PrefixCache& storagemanager::Cache::getPCache(const boost::filesystem::path&): Assertion `it != prefixCaches.end()' failed.
Apr 13 16:58:42 nvmesh-target-c systemd[1]: mcs-storagemanager.service: Main process exited, code=killed, status=6/ABRT
Apr 13 16:58:46 nvmesh-target-c systemd[1]: mcs-storagemanager.service: Failed with result 'signal'.
4. smcat runs on the local node, or remote nodes requesting the meta/{em, vbbm, vss, journal} CMAPI REST endpoints.
5. Apr 13 16:17:16 nvmesh-target-c workernode[3195838]: SocketPool::getSocket() failed to connect; got 'Connection refused'
Apr 13 16:17:16 nvmesh-target-c workernode[3195838]: configcpp[3195851]: 16.594847 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
Apr 13 16:17:16 nvmesh-target-c configcpp[3195851]: 16.594847 |0|0|0| E 12 SocketPool::getSocket() failed to connect; got 'Connection refused'
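
The second option (waiting until a service is reachable before proceeding) could be sketched as a polling loop that treats 'Connection refused' as "not ready yet". This is only an illustration: wait_for_port is a hypothetical helper, and the host, port, and timing values are placeholders, since the report does not state which endpoint CMAPI would have to probe. Note that an accepted connection only shows that the process is listening; it does not prove that the SM prefix cache is populated.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if the endpoint is still unreachable when `timeout`
    seconds have passed. Hypothetical helper, not part of CMAPI.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # ConnectionRefusedError and timeouts are subclasses of OSError.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Usage (placeholder address):
# if not wait_for_port("primary-node.example", 12345, timeout=60):
#     raise RuntimeError("primary node is not reachable")
```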

People

Assignee: Unassigned

Reporter: Roman

Votes: 0

Watchers: 1

Dates

Created: 2022-04-18 09:54

Updated: 2022-06-27 22:05
