[MCOL-4546] Extent Map gets occasionally corrupted when a multi-node cluster with shared storage for HA is recycled. Created: 2021-02-19  Updated: 2021-02-24  Resolved: 2021-02-24

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 5.5.1
Fix Version/s: 5.5.2

Type: Bug Priority: Blocker
Reporter: Gregory Dorman (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: crash

Attachments: File patch.diff     File repro.sh    
Sprint: 2021-3, 2021-4

 Description   

There exists a window of opportunity during a cluster recycle to corrupt the Extent Map. It first gets corrupted in memory on startup, and then permanently on disk at the next save_brm (explicit via the API or implicit on shutdown).

The root cause is an error in mcs-loadbrm.py, by which all nodes point to the same BRM directory and hence the same BRM_saves_em file. During cluster start, the directory is copied to each node, and it is then read by the primary when the processes start.

It is therefore possible for the primary's read to coincide with a secondary node still writing. When HA is deployed (via GlusterFS or any other NFS-style shared storage), the reading primary is liable to lose fragments of the Extent Map.
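The actual fix (per the attached patch) is to stop sharing the directory, but the hazard described above is the classic torn read of a file that is being rewritten in place. A standard mitigation on a local filesystem is to write a temporary file and atomically rename it over the target, so a reader sees either the old or the new complete snapshot. This is a minimal illustrative sketch, not ColumnStore code; note that rename atomicity is a POSIX local-filesystem guarantee and does not fully extend across NFS/GlusterFS clients, which is part of why the real fix avoids the shared path altogether.

```python
import os
import tempfile

def save_brm_atomically(data: bytes, target_path: str) -> None:
    """Write a snapshot to a temp file in the same directory, then
    atomically replace the target. A concurrent reader sees either
    the old complete file or the new complete file, never a partial
    one (on a local POSIX filesystem)."""
    dir_name = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make sure bytes hit disk first
        os.replace(tmp_path, target_path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)
        raise
```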

For this to happen, a fairly large Extent Map is needed (many columns, many rows, or both). The customer's EM was 4.5 MB; the in-house reproduction used an 8 MB map.

When it occurs, it is almost always accompanied by log messages like

CAL0000: ExtentMap::load(): That file is not a valid ExtentMap image
CAL0000: ExtentMap::loadVersion4(): read : No such file or directory

REPRODUCTION

1. Configure a 3-node cluster (more nodes also work). AWS reproduced faster than Docker, but Docker works too.
2. The cluster needs shared storage and must be configured for failover. On AWS or bare metal, use GlusterFS; with Docker, use attached volumes.
3. Create a database with large extent map.
a) An easy way to do this is to create 100 tables with 1000 columns each. You do not need to populate them; empty is just fine.
b) Make sure BRM_saves_em in /var/lib/columnstore/data1/systemFiles/dbrm is at least 8 MB in size.
4. Repeatedly shut the cluster down and start it back up. Do not run any DML while doing this (SELECTs are OK). I used the attached script, but I also saw the corruption with manual actions. The failure is random and depends on exact timings, so repeat the cycle a few times.
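Step 3a can be scripted rather than typed by hand. A sketch that generates the DDL (table and column names are illustrative, not from the attached repro.sh):

```python
def generate_wide_schema_sql(num_tables: int = 100, num_columns: int = 1000):
    """Yield CREATE TABLE statements wide enough to inflate the
    Extent Map when run against ColumnStore. The tables can stay
    empty; only the schema matters for EM size."""
    for t in range(num_tables):
        cols = ",\n  ".join(f"c{c} INT" for c in range(num_columns))
        yield f"CREATE TABLE em_bloat_{t} (\n  {cols}\n) ENGINE=Columnstore;"
```

The statements can be piped into the mcsmysql/mariadb client in one batch.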

After every shutdown, check the size of BRM_saves_em. If it got smaller, you have a corrupted Extent Map and a blown database.



 Comments   
Comment by Gregory Dorman (Inactive) [ 2021-02-22 ]

Roman produced a trial patch (attached here). I reviewed and tested it. I also gave the patched mcs-loadbrm.py to the customer, and they have been quiet ever since. I have high confidence this is a correct patch.

Comment by Daniel Lee (Inactive) [ 2021-02-24 ]

Build tested: 5.5.2 (Drone #1751)

Checked the new mcs-loadbrm.py only:

[dlee@aloha shares]$ diff mcs-loadbrm.old.py mcs-loadbrm.new.py
125,127c125,126
<     # To avoid SM storing BRM files
<     if storage.lower() == 's3' and bucket.lower() != 'some_bucket':
<         dbrmroot = BYPASS_SM_PATH
---
>     # Store BRM files locally to load them up
>     dbrmroot = BYPASS_SM_PATH
129,132c128,131
<         if not os.path.exists(dbrmroot):
<             os.makedirs(dbrmroot)
<             if use_systemd:
<                 shutil.chown(dbrmroot, USER, GROUP)
---
>     if not os.path.exists(dbrmroot):
>         os.makedirs(dbrmroot)
>         if use_systemd:
>             shutil.chown(dbrmroot, USER, GROUP)
[dlee@aloha shares]$

Generated at Thu Feb 08 02:51:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.