[MCOL-4566] rebuildEM utility must support compressed segment files Created: 2021-03-01  Updated: 2023-10-26  Resolved: 2021-10-26

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.2.0
Fix Version/s: 6.2.1, 6.2.2

Type: New Feature Priority: Minor
Reporter: Roman Assignee: Roman
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
PartOf
includes MCOL-4624 Implement proper calculation for HWM ... Closed
Problem/Incident
causes MCOL-4685 Eliminate some irrelevant settings (u... Closed
Relates
relates to MCOL-312 utility to rebuild extent maps Closed
relates to MCOL-4533 Research columnstore handling disk I/... Open
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MCOL-4624 Implement proper calculation for HWM ... Sub-Task Closed Denis Khalikov  
MCOL-4635 Properly insert LBID into segment fil... Sub-Task Closed Denis Khalikov  
Sprint: 2021-4, 2021-5, 2021-6, 2021-8, 2021-9, 2021-10, 2021-11, 2021-12, 2021-13

 Description   

There are certain situations in which metadata that was completely or partially lost cannot be restored. Metadata here includes the Extent Map and the auxiliary database OID counter used to allocate OIDs for database objects to be created.
This issue covers the part that deals with partial or complete Extent Map loss. There is a tool under tool/rebuildEM called rebuildEM whose algorithm creates an Extent Map from the existing columnar data files. The algorithm is very simple: it counts the number of blocks and creates the corresponding number of extents in a new in-memory Extent Map. In the end it writes an image of the Extent Map to disk.
The main problem with the tool is that it does not support compressed files (MCS as of 6.1.1 supports Snappy compression). The suggested approach is to change the tool's algorithm so that it uses compressed files to produce the Extent Map. A compressed file starts with a CompressedDBFileHeader structure; its fBlockCount attribute holds the number of blocks in the file.
An important assumption for the tool is that all DBRoots must be available on the node where rebuildEM is run, so the cluster must have either shared or S3 storage.
Note that we are also going to add LZ4 as a compression method.
To test how rebuildEM works with compressed files, one must remove the Extent Map file.
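A minimal sketch, assuming hypothetical field names and the default extent geometry (8M rows per extent, 8 KiB blocks), of how the tool could turn a compressed file's block count into a number of extents. The real `CompressedDBFileHeader` in the engine has more fields than shown here:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the on-disk CompressedDBFileHeader fields the tool
// needs; the engine's real struct has more members (magic, version, etc.).
struct CompressedDBFileHeaderSketch
{
    uint64_t fMagicNumber;
    uint64_t fBlockCount;   // number of data blocks in the file
};

// Default extent geometry assumed for illustration: 8M rows per extent,
// 8 KiB blocks.
constexpr uint64_t kRowsPerExtent = 8ULL * 1024 * 1024;
constexpr uint64_t kBlockSize     = 8192;

// Blocks that one extent covers for a given column width.
uint64_t blocksPerExtent(uint64_t colWidth)
{
    return colWidth * kRowsPerExtent / kBlockSize;
}

// Number of extents rebuildEM would create for this segment file:
// ceil(fBlockCount / blocksPerExtent).
uint64_t extentsForFile(const CompressedDBFileHeaderSketch& hdr, uint64_t colWidth)
{
    const uint64_t per = blocksPerExtent(colWidth);
    return (hdr.fBlockCount + per - 1) / per;
}
```

For a 4-byte column one extent spans 4096 blocks, so a file whose header reports 4096 blocks yields exactly one extent.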



 Comments   
Comment by Denis Khalikov [ 2021-03-02 ]

For this task the file header should contain new information:
1. Column width.
2. Column data type.
We also need a function that maps a full filename to an OID, to be able to create a column extent.
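A segment-file path encodes the OID in its four leading NNN.dir components (one byte each, in decimal), e.g. `000.dir/000.dir/011.dir/185.dir` decodes to OID 3001 since 11 * 256 + 185 = 3001. An illustrative parser, not the engine's own file-naming code:

```cpp
#include <cassert>
#include <cstdint>
#include <regex>
#include <string>

// Decode an OID from a ColumnStore segment-file path such as
// .../000.dir/000.dir/011.dir/185.dir/000.dir/FILE000.cdf, assuming the
// first four NNN.dir components are the OID bytes (high to low), the fifth
// is the partition, and FILENNN is the segment. Illustrative only.
int64_t file2Oid(const std::string& path)
{
    static const std::regex re(
        "(\\d{3})\\.dir/(\\d{3})\\.dir/(\\d{3})\\.dir/(\\d{3})\\.dir/"
        "(\\d{3})\\.dir/FILE(\\d{3})\\.cdf");
    std::smatch m;
    if (!std::regex_search(path, m, re))
        return -1;  // not a segment-file path
    int64_t oid = 0;
    for (int i = 1; i <= 4; ++i)
        oid = (oid << 8) | std::stoll(m[i].str());
    return oid;
}
```

The partition and segment numbers could be pulled from groups 5 and 6 of the same match.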

Comment by Denis Khalikov [ 2021-03-04 ]

file2Oid added for review https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/1794

Comment by Denis Khalikov [ 2021-03-05 ]

patch which adds 2 new fields to `CompressedDBFileHeader` added for review https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/1795

Comment by Denis Khalikov [ 2021-03-10 ]

Added a patch with the rebuildEM tool for review: https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/1808

Comment by Denis Khalikov [ 2021-03-12 ]

Currently this tool works with compressed segment files created by the engine, but that is not enough to rebuild the full Extent Map.
The problem: system segment files are not compressed, which means we cannot simply recover the data needed (column type and width) to create a column extent. Those extents must be present in the Extent Map for the database to work properly.
Some ideas:
1. Keep the initial state of the Extent Map inside the rebuildEM tool and restore it before walking the DBRoot.
a. Keep it as a binary blob, a global inside the rebuildEM tool (the initial size is about 4 KB, and judging from a hexdump it looks like a sparse matrix), and restore it using ExtentMap.load().
b. Keep it as tuples of (oid, partition, segment, width, column data type, and possibly other needed data) and try to restore it by calling createColumnExtent.
I will experiment with this starting next week.
Problems:
I am currently unsure about the structure of the system files; the main question is whether the system tables can grow over time and create additional segment files to hold the data. In that case the stored initial state would become invalid.
2. Alternatively, keep a separate file with the system Extent Map.
Problems:
This seems equivalent to keeping the full Extent Map in a file, which is how it works now.

Comment by Denis Khalikov [ 2021-03-15 ]

After initializing extents for the system tables I got two errors:
1. One related to extent status: if we use `createColumnExtent`, the status is `unavailable` by default, so we need to mark the extent available via setLocalHWM.
2. One related to tables with `varchar` and `char` columns (all columns that have an additional dictionary segment file). The current approach uses a greedy strategy to allocate LBIDs from the freelist, so after running rebuildEM we get extents with a different `range.start` than the original whenever we start from a different OID than the original pass did. As a result I got various errors when selecting from a `varchar` column: it could return `null` values, or values belonging to other tables.
Possible solutions:
1. Extend the file header with a `range.start` field, walk all segment files in the DBRoot, save the needed data, and create the Extent Map.
2. Walk the DBRoot, sort the extents by OID, then create the Extent Map.

Currently I am able to run rebuildEM and get a partially working database: it works with all columns except `varchar`.

Comment by Denis Khalikov [ 2021-03-16 ]

Unfortunately we have to keep the freelist start address in the segment files.
The following example:
create table t1 (a varchar (255)) engine = columnstore;
insert into t1 values("a");
create table t2 (a varchar (255), b varchar (255), c varchar (255)) engine=columnstore;
insert into t2 values("a", "b", "c");

will create the following extents:

range.start|range.size|fileId|blockOffset|HWM|partition|segment|dbroot|width|status|hiVal|loVal|seqNum|isValid|
234496|8|3001|0|0|0|0|1|8|0|0|-1|4|0|
242688|8|3002|0|0|0|0|1|0|0|-9223372036854775808|9223372036854775807|2|0|
250880|8|3004|0|0|0|0|1|8|0|0|-1|2|0|
259072|8|3007|0|0|0|0|1|0|0|-9223372036854775808|9223372036854775807|1|0|
267264|8|3005|0|0|0|0|1|8|0|0|-1|2|0|
275456|8|3008|0|0|0|0|1|0|0|-9223372036854775808|9223372036854775807|1|0|
283648|8|3006|0|0|0|0|1|8|0|0|-1|2|0|
291840|8|3009|0|0|0|0|1|0|0|-9223372036854775808|9223372036854775807|1|0|

To rebuild the Extent Map correctly (so that char data can be accessed through dictionary files) we have to restore `range.start` as it was originally, but currently we do not have enough information. I thought we could walk the DBRoot, collect the needed data sorted by OID, then rebuild the EM, but the example above shows that this is wrong.
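The point can be checked mechanically: sorting the files from the example above by OID visits them in a different order than the original greedy free-list allocation (3007 was allocated before 3005 and 3006), so an OID-ordered walk cannot reproduce the original `range.start` layout. A small illustration using the extent dump above:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ExtentRow
{
    int64_t rangeStart;
    int64_t oid;
};

// (range.start, oid) pairs from the t1/t2 example, listed in the order the
// engine allocated them from the free list.
std::vector<ExtentRow> allocationOrder()
{
    return {{234496, 3001}, {242688, 3002}, {250880, 3004}, {259072, 3007},
            {267264, 3005}, {275456, 3008}, {283648, 3006}, {291840, 3009}};
}

// Returns true iff walking the files sorted by OID visits the extents in the
// same order as the original greedy free-list allocation.
bool oidOrderMatchesAllocation()
{
    const auto rows = allocationOrder();  // already in range.start order
    auto byOid = rows;
    std::sort(byOid.begin(), byOid.end(),
              [](const ExtentRow& a, const ExtentRow& b) { return a.oid < b.oid; });
    for (std::size_t i = 0; i < rows.size(); ++i)
        if (rows[i].oid != byOid[i].oid)
            return false;
    return true;
}
```

Here `oidOrderMatchesAllocation()` returns false: the OID order diverges from the allocation order at the fourth extent.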

Comment by Denis Khalikov [ 2021-03-16 ]

Updated https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/1808
Currently it works in the following way:
1. Initializes the extents for the system tables from the initial binary blob.
2. Walks the DBRoot and collects (oid, partition, segment, column type, column width, isDict) for the segment files, kept sorted via map<FileId, OidComparator>.
3. Rebuilds the Extent Map from the collected data, starting from the lowest OID.

The current version successfully rebuilds the initial tables and works with all columns except varchar types.

This should be updated to sort not by OID but by freelist offset, e.g. map<FileId, RangeStartComparator>; then we can build the Extent Map in a greedy way.

Comment by Denis Khalikov [ 2021-03-18 ]

Updated the patch on review. The current solution creates extents in the order they were originally created.
Tested the solution on different tables; it appears to work.
It needs more testing, e.g. on a table with a million rows.

Comment by Denis Khalikov [ 2021-03-18 ]

Ran some tests with more than 8M rows and found that we also need to support multiple extents per segment file. I think we can hard-code the limit to 2, as in the config file, check the config, and simply refuse to run the tool if the configured value is higher.
Another open question is how to set the HWM properly.

Comment by Denis Khalikov [ 2021-03-19 ]

Currently it does not work after inserting data with `cpimport`. For example, inserting 17M rows:
create table t1(a int) engine=columnstore;
`$cpimport temp t1 content.tbl`
creates three segment files and three extents, but `lbid` does not change from its default value.
running:
`
$rebuildEM -v
FileId is collected [OID: 3001, partition: 0, segment: 1, col width: 4, lbid:-1, isDict: 0]
Processing file: /var/lib/columnstore/data1/000.dir/000.dir/011.dir/185.dir/000.dir/FILE000.cdf [OID: 3001, partition: 0, segment: 0]
FileId is collected [OID: 3001, partition: 0, segment: 0, col width: 4, lbid:-1, isDict: 0]
Processing file: /var/lib/columnstore/data1/000.dir/000.dir/011.dir/185.dir/000.dir/FILE002.cdf [OID: 3001, partition: 0, segment: 2]
FileId is collected [OID: 3001, partition: 0, segment: 2, col width: 4, lbid:-1, isDict: 0]
Build extent map with size 1
Extent is created, allocated size 4096 actual LBID 234496 for [OID: 3001, partition: 0, segment: 1, col width: 4, lbid:-1, isDict: 0]
`
Another thing I don't currently understand: it seems that once an extent exceeds 8M rows, the engine should create a new extent within the same segment file (the number of extents per segment file is defined in the config), but with `cpimport` it creates a new segment file for each new extent.
`
range.start|range.size|fileId|blockOffset|HWM|partition|segment|dbroot|width|status|hiVal|loVal|seqNum|isValid|
234496|4|3001|0|4095|0|0|1|4|0|999999|0|9|2|
238592|4|3001|0|4095|0|1|1|4|0|999999|0|9|2|
242688|4|3001|0|121|0|2|1|4|0|999999|751616|1|2|
`
`
./000.dir/011.dir/185.dir/000.dir/FILE001.cdf
./000.dir/011.dir/185.dir/000.dir/FILE000.cdf
./000.dir/011.dir/185.dir/000.dir/FILE002.cdf
`

Comment by Denis Khalikov [ 2021-03-19 ]

Needed to add an HWM calculation such as (decompressed file size - header size * 2) / block size.

Update: this will not work. The file size does not map to the HWM. Currently the HWM is incremented by one whenever the next block is needed inside an extent.

I will try to calculate the HWM by searching for the first block with an empty value in the file:
`
for (j = 0, curVal = buf; j < totalRowPerBlock; j++, curVal += column.colWidth)
{
    if (isEmptyRow((uint64_t*)curVal, emptyVal, column.colWidth))
    {
`
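A self-contained sketch of the same idea, assuming a fixed-width column, little-endian layout, and a simplified empty-value comparison (the engine uses per-type magic empty values): scan decompressed blocks and take the last block containing a non-empty row as the HWM.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kBlockSize = 8192;  // assumed block size in bytes

// Simplified empty-value test: compare colWidth bytes of the row against the
// low bytes of emptyVal (little-endian assumed). Illustrative only.
bool isEmptyRow(const uint8_t* val, uint64_t emptyVal, std::size_t colWidth)
{
    return std::memcmp(val, &emptyVal, colWidth) == 0;
}

// Sketch of HWM recovery: return the index of the last block that contains
// at least one non-empty row, or -1 if every block is empty. Assumes `data`
// holds whole decompressed blocks.
int64_t findHwm(const std::vector<uint8_t>& data, std::size_t colWidth,
                uint64_t emptyVal)
{
    const std::size_t rowsPerBlock = kBlockSize / colWidth;
    int64_t hwm = -1;
    for (std::size_t b = 0; (b + 1) * kBlockSize <= data.size(); ++b)
    {
        const uint8_t* block = data.data() + b * kBlockSize;
        for (std::size_t r = 0; r < rowsPerBlock; ++r)
        {
            if (!isEmptyRow(block + r * colWidth, emptyVal, colWidth))
            {
                hwm = static_cast<int64_t>(b);
                break;  // block is non-empty; move on to the next block
            }
        }
    }
    return hwm;
}
```

With two all-empty blocks and one value written into the first block, the sketch reports HWM 0; writing a value into the second block moves it to 1.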

Comment by Denis Khalikov [ 2021-03-23 ]

The latest patch https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/1808
adds:
Proper HWM recovery from the segment file.
Support for bulk insertion via cpimport.
Current limitation: it does not work with multiple extents per segment file.
We can easily detect such files when the recovered hwm >= (columnWidth * numExtentRows) / blockSizeInBytes, but there is no straightforward way to create the extents in the same order as they were originally created, because the starting LBID of each extent is not known.
Tested with different table schemas, for example:
create table t1 (a int, b varchar(255), c int, d varchar(255), e int, f varchar(255)) engine=columnstore;
with 20M rows inserted into the table via cpimport.
In this case the bulk load creates 3 segment files for each int column and 6 for each varchar column (1 segment file with tokens and 1 dictionary file).
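The detection condition above works out numerically: for a 4-byte column with the assumed defaults of 8,388,608 rows per extent and 8 KiB blocks, one extent spans 4096 blocks, so a recovered HWM of 4096 or more implies a second extent in the same file. This matches the extent dump earlier in this issue, where a full single-extent file has HWM 4095. A quick check:

```cpp
#include <cassert>
#include <cstdint>

// Defaults assumed for illustration: 8M rows per extent, 8 KiB blocks.
constexpr uint64_t kNumExtentRows = 8ULL * 1024 * 1024;
constexpr uint64_t kBlockBytes    = 8192;

// The detection condition from the comment:
// hwm >= (columnWidth * numExtentRows) / blockSizeInBytes.
bool fileHasMultipleExtents(uint64_t hwm, uint64_t columnWidth)
{
    return hwm >= columnWidth * kNumExtentRows / kBlockBytes;
}
```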

Comment by Denis Khalikov [ 2021-03-24 ]

Found bug related to HWM calculation for dictionary files.

Comment by Denis Khalikov [ 2021-03-25 ]

The final version is on review.
It currently supports 2 extents per segment file; this could be raised if needed, but that would require adding one more field to the compressed header.

Comment by Roman [ 2021-04-02 ]

4QA. Please message me or denis0x0D when you are ready to test the tool so I can explain how it works.

Comment by Roman Navrotskiy [ 2021-04-13 ]

dleeyh
You can try any recent build from the develop branch for testing rpm-based platforms. If you want to test others, you can use this one:

https://cspkg.s3.amazonaws.com/index.html?prefix=develop/pull_request/2141/amd64/

It should also be available from the regular cron builds starting tomorrow, I suppose.

Comment by Daniel Lee (Inactive) [ 2021-04-13 ]

Reopened pending for requirement discussion by management.

Comment by Roman [ 2021-04-20 ]

ExtentsPerSegmentFile controls the number of extents per segment file, so it doesn't affect Dictionary files. We should just remove the ExtentsPerSegmentFile setting from the default Columnstore.xml shipped with the package.
BTW, all this activity is beyond the scope of this project and must be done outside of this issue.

Comment by Roman [ 2021-04-20 ]

gdorman After discussing with Denis I will answer the 4th question: the effort is minimal, since we just need to remove the option from the default config file shipped.

Comment by Roman [ 2021-04-20 ]

The following commentary is a developer note on how to overcome the limitation Denis described previously.
Two fixed extent descriptors are added to every Segment and Dictionary file. A descriptor records the initial LBID and the number of blocks in the extent; its purpose is to map Segment files to their Tokens and Dictionaries. The number of extents in a Dictionary is dynamic, though, so we extend the compressed Dictionary header with a dynamically sized section: after the two mandatory extent descriptors comes a count of additional descriptors, followed by the descriptors themselves.
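One way to read the note is the following hypothetical layout (all names invented for illustration; the engine's real header differs): two fixed descriptors, a count, then a variable-length tail of additional descriptors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical extent descriptor as described in the note: initial LBID
// plus the number of blocks in the extent.
struct ExtentDescriptor
{
    int64_t  startLbid;
    uint64_t blockCount;
};

// Sketch of the proposed dynamic section for Dictionary files: the two
// mandatory descriptors, then a count, then that many additional entries
// (stored contiguously on disk; modeled with a vector here).
struct DictHeaderSketch
{
    ExtentDescriptor fixed[2];
    uint32_t extraCount;
    std::vector<ExtentDescriptor> extra;  // extraCount entries on disk

    std::size_t totalExtents() const { return 2 + extra.size(); }
};
```

Under this layout, a reader would parse the two fixed descriptors, read `extraCount`, then read that many further descriptors.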

Comment by Gregory Dorman (Inactive) [ 2021-04-23 ]

Then don’t. Keep for 6.1.

Comment by Daniel Lee (Inactive) [ 2021-06-16 ]

Build tested: 6.1.1 ( Drone #2576 )

[centos8:root~]# mcsRebuildEM -v
The launch of mcsRebuildEM tool must be sanctioned by MariaDB support.
Requirement: all DBRoots must be on this node.
Note: that the launch can break the cluster.
Do you want to continue Y/N?

Performed a test on a 3-node cluster with local storage and did not receive any error or warning indicating that such a configuration is not supported. I noticed that PR #1884 was declined.

Test #1, without -v option

[centos8:root~]# mcsRebuildEM
The launch of mcsRebuildEM tool must be sanctioned by MariaDB support. 
Requirement: all DBRoots must be on this node. 
Note: that the launch can break the cluster.
Do you want to continue Y/N? 
Y

I have a few concerns about the maturity of the tool.
1. The message "Note: that the launch can break the cluster." is quite alarming. Breaking a cluster is serious enough that a user or the support team would not want to continue. The tool should be mature enough for the user to proceed with confidence.
2. "Requirement: all DBRoots must be on this node." The user should not need to find out whether all DBRoots are on this node. The tool should determine where the data is, and if the requirement is not met it should exit with an appropriate message.
3. "Do you want to continue Y/N?" The user's reply should be on the same line.
4. After answering "Y", the tool simply ended. It must return a message indicating whether the run was successful. If successful, it should print the BRM file that was generated, with its directory path.
5. For a large database the tool may take a while to run. For each DBRoot, the tool should print a few steps of the BRM building process, prefixed with the system timestamp. Such information will be helpful for support engineers should the run fail.
6. The tool should check whether the ColumnStore cluster is active, and if so, exit immediately.

[centos8:root~]# mcsRebuildEM
The launch of mcsRebuildEM tool must be sanctioned by MariaDB support. 
Requirement: all DBRoots must be on this node. 
Note: that the launch can break the cluster.
Do you want to continue Y/N? 
Y
/var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em file exists. 
Please note: this tool is only suitable in situations where there is no `BRM_saves_em` file. 
If `BRM_saves_em` exists extent map will be restored from it. 

7. The tool should check for the existence of the BRM_saves_em file at the beginning of the run. If it exists, the tool should exit immediately without user interaction. It should write BRM_saves_em only when the run is successful.

Test #2, with -v option

[centos8:root~]# mcsRebuildEM -v
The launch of mcsRebuildEM tool must be sanctioned by MariaDB support. 
Requirement: all DBRoots must be on this node. 
Note: that the launch can break the cluster.
Do you want to continue Y/N? 
Y
Initialize system extents from the initial state
Collect extents for the DBRoot /var/lib/columnstore/data1
Cannot read file header from the file /var/lib/columnstore/data1/000.dir/000.dir/008.dir/019.dir/000.dir/FILE000.cdf, probably this file was created without compression. 
Cannot read file header from the file /var/lib/columnstore/data1/000.dir/000.dir/008.dir/013.dir/000.dir/FILE000.cdf, probably this file was created without compression. 
Cannot read file header from the file /var/lib/columnstore/data1/000.dir/000.dir/008.dir/016.dir/000.dir/FILE000.cdf, probably this file was created without compression. 
Cannot read file header from the file /var/lib/columnstore/data1/000.dir/000.dir/008.dir/028.dir/000.dir/FILE000.cdf, probably this file was created without compression. 
Cannot read file header from the file /var/lib/columnstore/data1/000.dir/000.dir/008.dir/025.dir/000.dir/FILE000.cdf, probably this file was created without compression. 
Cannot read file header from the file /var/lib/columnstore/data1/000.dir/000.dir/008.dir/022.dir/000.dir/FILE000.cdf, probably this file was created without compression. 
Cannot read file header from 

The above messages are always output. An engineer explained that these are system catalog files, which are not compressed by design, so this is expected behavior. The tool should not output these messages for system catalog files; they should only apply to user data files that are not compressed. Please do not simply suppress all such messages, only the ones for the system catalog files.

Comment by Roman [ 2021-10-26 ]

This is a definite step toward the generic MCOL-312.

Generated at Thu Feb 08 02:51:18 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.