[MCOL-4912] MCS bulk insertion is slow Created: 2021-11-03  Updated: 2023-11-17  Resolved: 2022-05-16

Status: Closed
Project: MariaDB ColumnStore
Component/s: cpimport
Affects Version/s: 5.6.4, 6.3.1
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Roman Assignee: Roman
Resolution: Done Votes: 0
Labels: None

Attachments: File BRM_saves_em.tar.gz    
Issue Links:
PartOf
includes MCOL-5037 Up-merge EM Index into develop-6 Closed
includes MCOL-5089 Merge RBTree-based Extent Map with EM... Closed
includes MCOL-5090 Up-merge EMIndex + RBTree-based EM in... Closed
includes MCOL-5091 Up-merge RBTree-based EM into develop Closed
is part of MCOL-5313 Re-test new Extent Map implementation Closed
Problem/Incident
causes MCOL-5050 Worker node crash after DDL . Possibl... Closed
causes MCOL-5057 EM index code miscalculates RAM neede... Closed
Relates
relates to MCOL-4988 Table lock remained after DML failure... Closed
Sprint: 2021-14, 2021-15, 2021-16, 2021-17

 Description   

The client experienced a slowdown of bulk insertion operations (and presumably of INSERT operations as well). The system has no obvious hardware bottlenecks, and although the storage is remote (FCoE), there is no evidence that FCoE contributes significantly to the overall timings of bulk ingestion operations.
The table has about 80 columns (mostly dictionary columns). Bulk ingestion of 82 records takes 18 seconds, of which 12 are spent in the preprocessing phase. The preprocessing phase involves BRM communication and the so-called HWM chunk backup (a backup of the most recent compressed chunk of a segment/dictionary file).
A cpimport strace file showed that the BRM communication socket operations and the mutexes involved contribute a lot to the overall timings.
The immediate workaround is to horizontally scale BRM, namely the EM, whilst the permanent solution is to introduce a lookup structure to speed up EM operations.



 Comments   
Comment by Roman [ 2022-01-21 ]

For QA: there are a number of environmental changes:

  • there are new binaries, namely /usr/bin/mcs_load_brm_from_file (converts an EM in text form into its binary representation), /usr/bin/mcs_load_em, /usr/bin/mcs_lock_grabber and /usr/bin/mcs_lock_state (the latter two allow monitoring/manipulating shared memory RW locks).
  • the shared memory segment name prefix has changed from InfiniDB-shm-XXXXXXX to MCS-shm-XXXXXXX.
  • there is a new (managed) shmem segment for the Extent Map index and a corresponding shmem RWLock. This new segment has index id 6.
  • there are two MariaDB UDFs to query the index shmem size and free space: mcs_emindex_size() and mcs_emindex_free().

The patch introduces an index for navigating the Extent Map in order to speed up bulk insertion operations. The index has three layers: dbroot, oid, partition. The last layer contains Extent Map indices that give direct access to the extent map entries belonging to a <dbroot, oid, partition> tuple. To see the benefit of the index, one needs to generate an EM of at least 100 MB.
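
The three-layer lookup can be illustrated with a minimal Python sketch. This is not the ColumnStore implementation (which lives in a shared memory segment); the class name ExtentMapIndex, the tuple layout of an entry and the sample values are all hypothetical. It only shows why a <dbroot, oid, partition> lookup avoids scanning the whole Extent Map.

```python
from collections import defaultdict

# Hypothetical extent map: a flat list in which each entry is a
# (dbroot, oid, partition) tuple standing in for a real EM entry.
extent_map = [
    (1, 3001, 0),
    (1, 3001, 0),
    (1, 3002, 0),
    (2, 3001, 1),
]


class ExtentMapIndex:
    """Nested mapping dbroot -> oid -> partition -> list of EM positions."""

    def __init__(self):
        self._tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def insert(self, dbroot, oid, partition, em_pos):
        self._tree[dbroot][oid][partition].append(em_pos)

    def find(self, dbroot, oid, partition):
        # Use .get() at each layer so that a miss does not create empty nodes.
        oids = self._tree.get(dbroot, {})
        partitions = oids.get(oid, {})
        return list(partitions.get(partition, []))


def linear_scan(em, dbroot, oid, partition):
    """Baseline without an index: visit every EM entry."""
    return [i for i, e in enumerate(em) if e == (dbroot, oid, partition)]


index = ExtentMapIndex()
for pos, (dbroot, oid, partition) in enumerate(extent_map):
    index.insert(dbroot, oid, partition, pos)

print(index.find(1, 3001, 0))   # positions of the matching EM entries
print(index.find(3, 9999, 0))   # unknown tuple: empty list
```

With the index, the cost of a lookup depends on three dictionary accesses rather than on the total number of EM entries, which is presumably why the benefit only becomes visible on a large EM.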

Comment by Roman [ 2022-01-21 ]

I have attached a generated Extent Map image (around 320992 EM entries) that can be used to perf test the patch. It must replace /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em while MCS is shut down.
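
The replacement step could be scripted roughly as follows. This is a sketch under assumptions: the attachment is assumed to have been unpacked already, the helper name replace_em_image and the ".bak" backup suffix are hypothetical, and the function does not verify that MCS is stopped, so that remains the operator's responsibility.

```python
import shutil
from pathlib import Path

# Target path taken from the comment above.
EM_PATH = Path("/var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_em")


def replace_em_image(new_image, target=EM_PATH):
    """Back up the current EM image (if any), then copy the new one in place.

    Does NOT check that ColumnStore is shut down; the caller must ensure that.
    Returns the backup path, or None if there was nothing to back up.
    """
    new_image = Path(new_image)
    target = Path(target)
    backup = None
    if target.exists():
        # Keep the previous image next to the target with a ".bak" suffix.
        backup = target.with_name(target.name + ".bak")
        shutil.copy2(target, backup)
    shutil.copy2(new_image, target)
    return backup
```

Keeping a backup makes it possible to restore the original Extent Map after the performance test.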

Comment by Daniel Lee (Inactive) [ 2022-01-31 ]

Build tested: develop-5 build (3719)

Timing test #1
--------------
1. Create 300 tables, each with 600 columns
2. Insert two rows after creating each table
3. cpimport 1 million rows into one of the tables
4. LDI 1 million rows into the same table

Timing test #2
---------------
1. Load the pre-generated BRM file as the starting point
2. Create a table with 600 columns
3. Insert two rows after table creation
4. cpimport 1 million rows into one of the tables
5. LDI 1 million rows into the same table
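
The 600-column tables used in both timing tests could be generated with a small helper like the one below. This is only a sketch: the actual test scripts are not part of this ticket, so the column names (c1..cN), the INT column type, the inserted values and the function names are assumptions.

```python
def wide_table_ddl(table, n_columns=600, col_type="INT"):
    """Return a CREATE TABLE statement with n_columns columns named c1..cN."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    cols = ",\n".join(f"  c{i} {col_type}" for i in range(1, n_columns + 1))
    return f"CREATE TABLE {table} (\n{cols}\n) ENGINE=ColumnStore;"


def two_row_insert(table, n_columns=600):
    """Return one INSERT statement that adds two rows of integer values."""
    row1 = ", ".join("1" for _ in range(n_columns))
    row2 = ", ".join("2" for _ in range(n_columns))
    return f"INSERT INTO {table} VALUES ({row1}), ({row2});"


# Small demonstration with 3 columns instead of 600.
print(wide_table_ddl("t1", n_columns=3))
print(two_row_insert("t1", n_columns=3))
```

For timing test #1 the two statements would be emitted once per table, 300 times in total; for timing test #2 only once, after loading the pre-generated BRM file.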

Timing comparison between releases 5.6.2-1 and 6.2.2 can be found here:
https://docs.google.com/spreadsheets/d/1AtXR9-D-KZT4hlJ5wA8Ih-_GpNh6CYai2tjz7RNzg2k/edit?usp=sharing

During testing, the following errors occurred intermittently.

When creating tables
ERROR 1815 (HY000) at line 2: Internal error: Lost connection to DDLProc
 
When inserting rows
ERROR 1815 (HY000) at line 608: Internal error: Lost connection to DMLProc [4]
 
During query
--------------
SELECT COUNT(*) FROM t252
--------------
ERROR 1815 (HY000) at line 610: Internal error: Lost connection to ExeMgr. Please contact your administrator

cpimport test
-------------
Tried to cpimport 1 million rows 50 times in a loop, a total of 50 million rows.
Core dumps occurred. This happened on both the develop-5 build and release 6.2.2-1.
MCOL-4974 has been created to track this issue.
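
A loop of this kind could be driven by a sketch like the following. The database name, table name, data file and helper names are hypothetical, and the exact command line used in the test is not recorded in this ticket. The sketch only reports runs with a non-zero cpimport exit status; a core dump in another ColumnStore process would have to be detected separately.

```python
import subprocess


def cpimport_cmd(db, table, data_file, delimiter="|"):
    """Build the argument list for one cpimport run."""
    return ["cpimport", "-s", delimiter, db, table, data_file]


def run_repeated(cmd, times=50, runner=subprocess.run):
    """Run cmd `times` times; return the 1-based iteration numbers that failed."""
    failed = []
    for i in range(1, times + 1):
        result = runner(cmd)
        if result.returncode != 0:
            failed.append(i)
    return failed


if __name__ == "__main__":
    # Hypothetical database, table and input file.
    cmd = cpimport_cmd("test", "t1", "/tmp/1m_rows.tbl")
    print("failed iterations:", run_repeated(cmd, times=50))
```

The runner parameter exists only so that the loop can be exercised without a ColumnStore installation.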

MTR tests
---------
Some MTR tests failed, especially some on window functions. I am still trying to figure out what the issues are.

Autopilot tests
---------------
No issues occurred

3-node installation tests, with cmapi-1.6
cluster shutdown, start tests

Comment by Daniel Lee (Inactive) [ 2022-02-02 ]

Build tested: 5.6.1-1

I retested release 5.6.1-1, a release from before this patch, with the MTR Autopilot test suites. The same window functions also failed. If memory serves me well, the MTR Autopilot test suites were being migrated from the standalone Autopilot tool. Therefore, the failed window functions with this patch are expected. There have been fixes to window functions since the 5.x.x release, and the corresponding test cases have also been updated. The same window functions passed in 6.2.2-1.

Therefore, the "MTR tests" issues in my last comment were non-issues.

Comment by Daniel Lee (Inactive) [ 2022-02-06 ]

Build tested: develop-5 (Jenkins bb-10.5-cs-5.6.4-2)

Centos 8 VM, 40gb memory

Did another round of tests on this new build. The same errors reported in my last test are still occurring.

This build seems to have a new problem: a version buffer overflow error.

There is an update1bRow test in the Autopilot test tool (non-MTR). It updates an integer column in a billion-row table. The version buffer size was 4 GB. This update test failed due to a version buffer overflow error, even after I changed the buffer size to 8 GB and then 16 GB. The 4 GB buffer size worked for the same test in 5.6.1 and 6.2.2.

Comment by Daniel Lee (Inactive) [ 2022-02-06 ]

reopened due to reported errors

Comment by Daniel Lee (Inactive) [ 2022-02-10 ]

Ran the update1bRows test a few more times and did not see the issue I reported earlier. It seems it was a user error on my side: I forgot to restart ColumnStore after setting the new version buffer file size.

Comment by Roman [ 2022-02-11 ]

Fixed.

Comment by Roman [ 2022-05-16 ]

This patch has not yet been made available, either to Todd or to the public.
The functionality tested by Daniel belongs to MCOL-4917.
The overall testing of this feature should be done in terms of MCOL-5089.

Generated at Thu Feb 08 02:53:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.