[MCOL-4253] Possible race in the take-ownership code in storagemanager, needs investigation Created: 2020-08-18  Updated: 2023-10-25  Resolved: 2023-10-25

Status: Closed
Project: MariaDB ColumnStore
Component/s: Storage Manager
Affects Version/s: 1.4.4, 1.5.3
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Patrick LeBlanc (Inactive) Assignee: Leonid Fedorov
Resolution: Won't Fix Votes: 0
Labels: None
Environment: any

 Description   

We ran into an assertion failure in Cache::getPCache(), and I did a little investigation. The assertion verifies that the PrefixCache for the given prefix exists before we use it.

The PrefixCache is created on the first access of that prefix, in the code that takes ownership of a prefix. Looking at Ownership::takeOwnership(), I think that if there are two 'simultaneous' requests for the same prefix (two concurrent takeOwnership() calls), one will be allowed to continue (and, I presume, access the prefix via the PrefixCache) while the other is still acquiring ownership. Look at the code near the top of that function: it checks whether ownedPrefixes[p] exists and, if it doesn't exist yet, creates it atomically. The next call sees the entry ... then ... assumes the prefix is already owned?
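
To make the suspected interleaving concrete, here is a minimal single-threaded sketch that replays it step by step. It is a model of the hypothesis only, not the storagemanager code: RacyOwnership, claim(), finishInit(), cacheExists() and the int stand-in for a PrefixCache are all hypothetical names invented for illustration.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical model of the suspected pattern; NOT the real Ownership class.
struct RacyOwnership {
    std::mutex mtx;
    std::map<std::string, bool> ownedPrefixes;  // entry exists => "assumed owned"
    std::map<std::string, int> prefixCaches;    // stand-in for PrefixCache objects

    // First half of a takeOwnership()-like call.
    // Returns true if this caller created the entry and must do the init.
    bool claim(const std::string& p) {
        std::lock_guard<std::mutex> lk(mtx);
        if (ownedPrefixes.count(p))
            return false;          // entry exists, so the caller returns early
        ownedPrefixes[p] = false;  // created atomically, but init has not run yet
        return true;
    }

    // Second half: the (slow) init that actually creates the prefix cache.
    void finishInit(const std::string& p) {
        std::lock_guard<std::mutex> lk(mtx);
        prefixCaches[p] = 1;
        ownedPrefixes[p] = true;
    }

    bool cacheExists(const std::string& p) {
        std::lock_guard<std::mutex> lk(mtx);
        return prefixCaches.count(p) == 1;
    }
};

// Replays the interleaving: A claims, B arrives before A has finished init.
// Returns true if B would proceed while the prefix cache is still missing.
bool secondCallerSeesMissingCache() {
    RacyOwnership o;
    bool aMustInit = o.claim("p");  // caller A creates the entry
    bool bMustInit = o.claim("p");  // caller B sees the entry and returns
    bool missing = !bMustInit && !o.cacheExists("p");  // B's lookup would assert
    if (aMustInit)
        o.finishInit("p");          // A only finishes after B has moved on
    return missing;
}
```

In this model secondCallerSeesMissingCache() returns true, which corresponds to the second caller reaching a getPCache()-style lookup before the cache exists. Whether the real code permits this ordering is exactly what needs to be confirmed.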

Needs further investigation. We either need to prove this works the way it's supposed to (i.e., find out what prevents another caller from returning before a prefix is owned), or find the real culprit and fix it.



 Comments   
Comment by Patrick LeBlanc (Inactive) [ 2020-08-19 ]

Looking again at the code, it may be simpler than that. Ownership::_takeOwnership() signals to other callers that the prefix is ready before it actually is. I sent a speculative patch to Ben that moves the signalling to after init. To test, I suggest a unit test that loops over 1) dropping ownership of the test prefix, then 2) X simultaneous Ownership::get(test-prefix) calls. To widen the window where the race can happen, make the init of PrefixCache take longer by populating the test-prefix directory with a lot of files it will have to scan through.
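
A rough sketch of both ideas, the "signal after init" ordering and the drop-then-concurrent-get test loop, might look like the following. It is not the speculative patch itself: OwnershipSketch, its State enum, the int stand-in for PrefixCache, stressOwnership(), and the short sleep standing in for the directory scan are all assumptions made for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for Ownership; only the ordering of init vs. signal matters.
class OwnershipSketch {
    enum class State { Initializing, Ready };
    std::mutex mtx_;
    std::condition_variable cv_;
    std::map<std::string, State> state_;
    std::map<std::string, int> caches_;  // stand-in for PrefixCache objects

public:
    // Returns true if the prefix cache exists when the call returns.
    bool get(const std::string& prefix) {
        std::unique_lock<std::mutex> lk(mtx_);
        if (state_.find(prefix) == state_.end()) {
            state_[prefix] = State::Initializing;
            lk.unlock();
            // Simulated slow init (the real one would scan the prefix directory).
            std::this_thread::sleep_for(std::chrono::milliseconds(2));
            lk.lock();
            caches_[prefix] = 1;
            state_[prefix] = State::Ready;  // mark ready only AFTER init is done
            cv_.notify_all();
        } else {
            // Other callers block until the first caller has finished init.
            // Dropping a prefix while callers wait is out of scope for this sketch.
            cv_.wait(lk, [&] {
                auto s = state_.find(prefix);
                return s != state_.end() && s->second == State::Ready;
            });
        }
        return caches_.count(prefix) == 1;
    }

    void drop(const std::string& prefix) {
        std::lock_guard<std::mutex> lk(mtx_);
        state_.erase(prefix);
        caches_.erase(prefix);
    }
};

// Test loop: 1) drop ownership, 2) `callers` simultaneous get() calls.
// Returns how many calls returned before the cache existed.
int stressOwnership(int rounds, int callers) {
    OwnershipSketch o;
    std::atomic<int> failures{0};
    for (int r = 0; r < rounds; ++r) {
        o.drop("test-prefix");
        std::vector<std::thread> threads;
        for (int i = 0; i < callers; ++i)
            threads.emplace_back([&] {
                if (!o.get("test-prefix"))
                    ++failures;
            });
        for (auto& t : threads)
            t.join();
    }
    return failures.load();
}
```

With the ready flag set after init, stressOwnership(20, 8) should report zero failures. Moving the State::Ready assignment and notify_all() to before the simulated scan would reintroduce the window described above, which is the behaviour a real unit test would be trying to catch.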
