[MCOL-4253] Possible race in the take-ownership code in storagemanager, needs investigation Created: 2020-08-18 Updated: 2023-10-25 Resolved: 2023-10-25 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | Storage Manager |
| Affects Version/s: | 1.4.4, 1.5.3 |
| Fix Version/s: | Icebox |
| Type: | Bug | Priority: | Major |
| Reporter: | Patrick LeBlanc (Inactive) | Assignee: | Leonid Fedorov |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Environment: |
any |
||
| Description |
|
We hit an assertion failure in Cache::getPCache(), and I did a little investigation. The assertion verifies that the PrefixCache for the given prefix exists before we use it. The PrefixCache is created on the first access of that prefix, in the code that takes ownership of the prefix. Looking at the code for Ownership::takeOwnership(), I think that if two requests for the same prefix arrive 'simultaneously' (two concurrent takeOwnership() calls), one of them can be allowed to continue (and, I presume, access the prefix via the PrefixCache) while the other is still acquiring ownership. Look at the code near the top of that function: it checks whether ownedPrefixes[p] exists and, if it does not, creates it atomically. The next caller sees the entry and then appears to assume the prefix is already owned. This needs further investigation. We either need to prove that this works the way it is supposed to (i.e., find out what prevents another caller from returning before the prefix is owned), or find the real culprit and fix it. |
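To make the suspected interleaving concrete, here is a minimal single-threaded model of it. This is a sketch under assumptions, not the storagemanager code: `RacyOwnership`, `initHook`, `hasCache()` and `secondCallerSawMissingCache()` are hypothetical names invented for illustration, and the early return on an existing entry is the behaviour suspected above, not a confirmed reading of Ownership::takeOwnership(). The hook runs while the first caller is still initializing, so a "second caller" can be injected at exactly that moment without relying on thread timing.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for the ownership table; NOT the storagemanager code.
class RacyOwnership
{
 public:
  // initHook runs in the middle of the first caller's initialization.
  explicit RacyOwnership(std::function<void()> initHook = {}) : initHook_(std::move(initHook)) {}

  void takeOwnership(const std::string& prefix)
  {
    {
      std::lock_guard<std::mutex> lk(mutex_);
      if (ownedPrefixes_.count(prefix) != 0)
        return;  // suspected hole: the entry exists, but is the prefix ready?
      ownedPrefixes_[prefix] = false;  // entry created atomically, not yet owned
    }
    if (initHook_)
      initHook_();  // stand-in for the slow part (building the prefix cache)
    std::lock_guard<std::mutex> lk(mutex_);
    caches_.insert(prefix);
    ownedPrefixes_[prefix] = true;
  }

  // Models the existence check that the getPCache() assertion performs.
  bool hasCache(const std::string& prefix)
  {
    std::lock_guard<std::mutex> lk(mutex_);
    return caches_.count(prefix) != 0;
  }

 private:
  std::mutex mutex_;
  std::map<std::string, bool> ownedPrefixes_;
  std::set<std::string> caches_;
  std::function<void()> initHook_;
};

// Returns true if a second caller got past takeOwnership() before the cache existed.
bool secondCallerSawMissingCache()
{
  bool missing = false;
  RacyOwnership* self = nullptr;
  RacyOwnership own([&] {
    self->takeOwnership("p1");        // second caller: returns immediately
    missing = !self->hasCache("p1");  // an existence assertion would fail here
  });
  self = &own;
  own.takeOwnership("p1");  // first caller: triggers the hook mid-init
  return missing;
}
```

In this model the second caller returns while the map entry is still `false`, which is the situation the assertion would catch. Whether the real code has a guard that this model omits is exactly the open question.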
| Comments |
| Comment by Patrick LeBlanc (Inactive) [ 2020-08-19 ] |
|
Looking at the code again, it may be simpler than that: Ownership::_takeOwnership() signals to other callers that the prefix is ready before it actually is. I sent Ben a speculative patch that moves the signalling to after init. To test, I suggest a unit test that loops over 1) dropping ownership of the test prefix, then 2) issuing X simultaneous Ownership::get(test-prefix) calls. To widen the window in which the race can happen, make the init of PrefixCache take longer by populating the test-prefix directory with a lot of files that it will have to scan through. |
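The ordering fix and the suggested test loop can be sketched roughly as follows. This is an illustrative model only and not the speculative patch itself: `OrderedOwnership`, `drop()`, `hasCache()` and `countViolations()` are hypothetical names, a short sleep stands in for the slow PrefixCache init (the ticket suggests a directory full of files instead), and a condition variable is one assumed way for waiting callers to block until readiness is published.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <thread>
#include <vector>

// Hypothetical model; names do not correspond to storagemanager internals.
class OrderedOwnership
{
 public:
  void get(const std::string& prefix)
  {
    std::unique_lock<std::mutex> lk(mutex_);
    if (owned_.find(prefix) != owned_.end())
    {
      // Another caller started first: wait until it publishes readiness.
      ready_.wait(lk, [&] {
        auto e = owned_.find(prefix);
        return e != owned_.end() && e->second;
      });
      return;
    }
    owned_[prefix] = false;  // claim the prefix, not ready yet
    lk.unlock();
    std::this_thread::sleep_for(std::chrono::milliseconds(2));  // stand-in for slow init
    lk.lock();
    caches_.insert(prefix);  // init completes first ...
    owned_[prefix] = true;   // ... and only then is readiness signalled
    ready_.notify_all();
  }

  void drop(const std::string& prefix)
  {
    std::lock_guard<std::mutex> lk(mutex_);
    owned_.erase(prefix);
    caches_.erase(prefix);
  }

  bool hasCache(const std::string& prefix)
  {
    std::lock_guard<std::mutex> lk(mutex_);
    return caches_.count(prefix) != 0;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::map<std::string, bool> owned_;
  std::set<std::string> caches_;
};

// Loops over: 1) drop ownership, 2) `callers` simultaneous get() calls.
// Counts how many callers returned from get() without a cache in place.
int countViolations(int rounds, int callers)
{
  OrderedOwnership own;
  std::atomic<int> violations{0};
  for (int r = 0; r < rounds; ++r)
  {
    own.drop("test-prefix");
    std::vector<std::thread> threads;
    for (int i = 0; i < callers; ++i)
      threads.emplace_back([&] {
        own.get("test-prefix");
        if (!own.hasCache("test-prefix"))
          ++violations;
      });
    for (auto& t : threads)
      t.join();
  }
  return violations.load();
}
```

With readiness published after init, `countViolations(20, 8)` should report zero. Swapping the two marked lines in `get()` so that the flag is set before the sleep would reproduce the early-signalling order described in this comment, although whether a given run then observes a violation depends on thread timing.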