Status: Closed
There exists a window of opportunity during cluster recycle to corrupt the Extent Map. It first gets corrupted in memory on startup, and then permanently on disk on the next save_brm (explicit via API or implicit on Shutdown).
The root cause is an error in mcs-loadbrm.py, by which all nodes point to the same BRM directory and hence the same BRM_saves_em file. During cluster start, the directory is copied to each node, and it is then read by the primary when the processes start.
It is therefore possible for the read to coincide with a secondary node still writing. When HA is deployed (via GlusterFS or any other shared/NFS filesystem), the reading primary is liable to lose fragments of the Extent Map.
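The torn-read window can be illustrated with a small, self-contained sketch (the file name and payload below are stand-ins, not ColumnStore code): a reader that opens the image while an in-place rewrite has only partly flushed sees a truncated snapshot.

```python
import os
import tempfile

# Stand-ins for the real image file and its contents (hypothetical values).
path = os.path.join(tempfile.mkdtemp(), "BRM_saves_em")
full_image = b"EMIMG" + b"\x00" * 1024

# A "secondary" node rewrites the file in place and has only flushed a prefix...
writer = open(path, "wb")
writer.write(full_image[:100])
writer.flush()

# ...when the "primary" reads it: the snapshot is a truncated image.
with open(path, "rb") as reader:
    snapshot = reader.read()

writer.write(full_image[100:])
writer.close()

print(len(snapshot), os.path.getsize(path))  # prints: 100 1029
```

Writing to a temporary file and then rename()-ing it into place would close this window on a local POSIX filesystem, since rename atomically replaces the target; atomicity guarantees vary on network filesystems such as GlusterFS/NFS.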
For this to happen, a fairly large Extent Map is needed (many columns, many rows, or both). The customer's EM was 4.5 MB; the in-house reproduction used an 8 MB one.
When it occurs, it is almost always accompanied by log messages like:
CAL0000: ExtentMap::load(): That file is not a valid ExtentMap image
CAL0000: ExtentMap::loadVersion4(): read : No such file or directory
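A quick way to scan a log for this signature (the substrings are taken verbatim from the messages above; the helper name is my own):

```python
# Substrings taken from the corruption log messages quoted above.
SIGNATURES = (
    "That file is not a valid ExtentMap image",
    "ExtentMap::loadVersion4(): read",
)

def corruption_hits(log_lines):
    """Return the log lines matching either corruption signature."""
    return [l for l in log_lines if any(s in l for s in SIGNATURES)]

sample = [
    "CAL0000: ExtentMap::load(): That file is not a valid ExtentMap image",
    "CAL0000: some unrelated message",
]
print(len(corruption_hits(sample)))  # prints: 1
```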
1. Configure a 3-node cluster (more is fine). AWS reproduced faster than Docker, but Docker works too.
2. The cluster needs shared storage and must be configured for failover. On AWS or bare metal, use GlusterFS; with Docker, use attached volumes.
3. Create a database with a large extent map.
a) an easy way to do it: create 100 tables with 1000 columns each. You do not need to populate them; empty is just fine.
b) make sure the BRM_saves_em in /var/lib/columnstore/data1/systemFiles/dbrm is at least 8 MB in size.
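Step 3a can be scripted; a sketch that generates the DDL (the table and column counts come from above, while the schema name `emtest` and the INT column type are arbitrary choices of mine):

```python
def wide_table_ddl(n_tables=100, n_cols=1000, schema="emtest"):
    """Generate CREATE TABLE statements for n_tables ColumnStore tables
    with n_cols INT columns each; the tables can stay empty."""
    stmts = []
    for t in range(n_tables):
        cols = ", ".join(f"c{c} INT" for c in range(n_cols))
        stmts.append(
            f"CREATE TABLE {schema}.t{t} ({cols}) ENGINE=Columnstore;"
        )
    return stmts

ddl = wide_table_ddl()
print(len(ddl))  # prints: 100
```

The resulting statements can be fed to the mariadb client one by one to grow the extent map.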
4. Start doing shutdowns, followed by startups. Do not run any CRUD operations while the cluster is cycling (SELECTs are OK). I used the attached script, but I saw it on manual actions as well. It is random; you may need to do it a few times, as it depends on the exact timing of things.
After every shutdown, check the size of BRM_saves_em. If it got smaller, you have a corrupted Extent Map and a blown database.
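The per-cycle size check can be automated; a minimal sketch (the path is from step 3b, the helper name is my own):

```python
import os

# Path from step 3b of the reproduction.
BRM_EM = "/var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em"

def em_shrunk(prev_size, path=BRM_EM):
    """Compare the current BRM_saves_em size against the previous cycle.

    Returns (shrunk, current_size); a shrink indicates a corrupted
    Extent Map image.
    """
    size = os.path.getsize(path)
    return (prev_size is not None and size < prev_size), size
```

After each shutdown, carry the size forward: `shrunk, prev = em_shrunk(prev)` and stop as soon as `shrunk` is True.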