[MCOL-1408] Columnstore table unable to accept writes after thousands of commits via Bulk SDK Created: 2018-05-12  Updated: 2023-10-26  Resolved: 2018-06-07

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.1.4
Fix Version/s: 1.1.5

Type: Bug Priority: Critical
Reporter: Austin Rutherford (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File 1408.txt.gz     File iot.py    
Issue Links:
Relates
relates to MCOL-1763 mcsapi MCOL-1408 test failing in 1.2.0 Closed
Sprint: 2018-10, 2018-11, 2018-12

 Description   

When running the attached data generator and writing via the Bulk Write SDK, ColumnStore locks up around 9,500 commits and puts the table into a state where no additional writes can occur (even after clearing table locks and/or restarting ColumnStore). The only way identified to allow writes again is to run "truncate table iot" and then restart ColumnStore. This is very reproducible; the attached Python program just needs to run for about 3-4 hours before it occurs.

The Python program is a bit obnoxious and does single-row commits, but the actual use case would be a higher velocity of rows (10,000-100,000/second) committed at the same frequency. If ColumnStore can't handle single-row commits, it would not be able to handle the higher row counts either.

CREATE TABLE `iot` (
`deviecParameter` varchar(20) DEFAULT NULL,
`deviceValue` int(11) DEFAULT NULL,
`deviceId` char(5) DEFAULT NULL,
`dt_Time` datetime DEFAULT NULL
) ENGINE=Columnstore DEFAULT CHARSET=latin1
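For reference, the single-row-commit pattern that triggers the hang looks roughly like the sketch below. This is a reconstruction, not the attached iot.py: the `driver` object is assumed to follow the pymcsapi ColumnStoreDriver/ColumnStoreBulkInsert interface (createBulkInsert, setColumn, writeRow, commit), and the column values are purely illustrative.

```python
import random
import datetime

def run_single_row_commits(driver, num_rows, db="test", table="iot"):
    """Insert num_rows rows into the iot table, one commit per row.
    `driver` is assumed to behave like pymcsapi.ColumnStoreDriver."""
    for i in range(num_rows):
        # commit() finalizes a bulk insert, so a fresh one is created per row
        bulk = driver.createBulkInsert(db, table, 0, 0)
        bulk.setColumn(0, "temperature")              # deviecParameter
        bulk.setColumn(1, random.randint(0, 100))     # deviceValue
        bulk.setColumn(2, "d%03d" % (i % 1000))       # deviceId (char(5))
        bulk.setColumn(3, datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S"))  # dt_Time
        bulk.writeRow()
        bulk.commit()
        print("committed %d" % i)
```

With one commit per row, each commit forces HWM bookkeeping in the extent map, which is why the failure appears only after thousands of iterations.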



 Comments   
Comment by David Thompson (Inactive) [ 2018-05-13 ]

Can reproduce; for me it took a lot more iterations:
committed 16383
Traceback (most recent call last):
File "iot.py", line 78, in <module>
bulk.commit()
File "/home/david/dev/mariadb-columnstore-api/python/pymcsapi.py", line 414, in commit
def commit(self): return _pymcsapi.ColumnStoreBulkInsert_commit(self)
RuntimeError: Error setting HWM

debug.log has: ExtentMap::setLocalHWM(): new HWM is past the end of the file for OID 11880; partition 0; segment 0; HWM 16384

editem:
Col OID = 11880, NumExtents = 5, width = 8
22884352 - 22892543 (8192) min: notset, max: notset, seqNum: 8190, state: invalid, fbo: 0, DBRoot: 1, part#: 0, seg#: 0, HWM: 8190; status: avail
22917120 - 22925311 (8192) min: notset, max: notset, seqNum: 8191, state: invalid, fbo: 0, DBRoot: 1, part#: 0, seg#: 1, HWM: 8190; status: avail
22949888 - 22958079 (8192) min: notset, max: notset, seqNum: 8191, state: invalid, fbo: 0, DBRoot: 1, part#: 0, seg#: 2, HWM: 8190; status: avail
22982656 - 22990847 (8192) min: notset, max: notset, seqNum: 8191, state: invalid, fbo: 0, DBRoot: 1, part#: 0, seg#: 3, HWM: 8190; status: avail
23019520 - 23027711 (8192) min: notset, max: notset, seqNum: 1, state: updating, fbo: 8192, DBRoot: 1, part#: 0, seg#: 0, HWM: 0; status: unavail

I can clear the table lock but then re-running hits the HWM error straight away.

Comment by David Thompson (Inactive) [ 2018-05-13 ]

In my case the 16384th value is obviously a much more 'special' number; 9500-ish is less special. Is that repeatable for you, Austin?

Comment by David Thompson (Inactive) [ 2018-05-14 ]

Attached what I had in my terminal after adding driver.setDebug(true) to enable debug output.

Comment by Andrew Hutchings (Inactive) [ 2018-05-14 ]

For my notes: I think this might be related to MCOL-984 or MCOL-1176, which means this might also be reproducible with INSERT...SELECT and infinidb_use_import_for_batchinsert disabled.

Comment by Andrew Hutchings (Inactive) [ 2018-05-14 ]

I've reproduced this using a C API based test. Looks like the HWM on the second extent in a segment is miscalculated when an insert rolls over an extent. I can probably produce a much faster test case.

Comment by Andrew Hutchings (Inactive) [ 2018-05-15 ]

Status so far:

Other related errors I'm hitting:

Compression and single commit over barrier: [BRM error status: UNKNOWN (-1)]
Uncompressed columns + dictionary: Error reading from a database file.

Best guess is the segment file isn't growing as required when it needs to use the second extent of a segment. In particular this happens with dictionary columns. When the dictionary column is removed and compression is disabled the tests pass, but tests using compressed columns still fail.

Comment by Andrew Hutchings (Inactive) [ 2018-05-29 ]

OK, problem is allocRowId() is calculating the start HWM for a new extent based on the first column, not the smallest column. So the start HWM should be 4096 for the second extent in a segment (since we should always use the smallest column for HWM calculations) but instead it ends up being 8192 for that column. This is then doubled-up for the 8-byte columns so it ends up being a start HWM of 16384 for those columns.
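A back-of-the-envelope sketch of that arithmetic (assuming the default 8M-rows-per-extent, 8KB-block configuration, which matches the 8192-block extents in the editem output above; the helper names are illustrative, not engine code):

```python
BLOCK_SIZE = 8192                    # bytes per ColumnStore block
ROWS_PER_EXTENT = 8 * 1024 * 1024    # default rows per extent

def extent_blocks(width):
    """Blocks one extent occupies for a column of the given byte width."""
    return ROWS_PER_EXTENT * width // BLOCK_SIZE

def second_extent_start_hwm(base_width, target_width, smallest_width):
    """Start HWM for a segment's second extent: the base column's extent
    size in blocks, scaled up by the target/smallest width ratio."""
    return extent_blocks(base_width) * (target_width // smallest_width)

SMALLEST = 4   # the int(11) column
FIRST = 8      # an 8-byte column (e.g. a dictionary token)

# Bug: base the calculation on the first column, then scale for 8-byte columns
buggy = second_extent_start_hwm(FIRST, 8, SMALLEST)       # 8192 * 2 = 16384
# Fix: base it on the smallest column
correct = second_extent_start_hwm(SMALLEST, 8, SMALLEST)  # 4096 * 2 = 8192
```

The buggy value, 16384, is exactly the HWM in the "new HWM is past the end of the file" error from debug.log.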

Comment by Andrew Hutchings (Inactive) [ 2018-05-29 ]

Pull request for engine and API (the API pull request is to add a test).

For QA: A new test is added to the API regression suite, it requires the pull request from engine to pass. (warning: this test can take 10-15 minutes to execute)

Comment by Daniel Lee (Inactive) [ 2018-06-05 ]

Build tested: 1.1.5-1

/root/columnstore/mariadb-columnstore-server
commit 3e6e2b692a5847cc6e89fb200558ebc781b14b09
Merge: 0c983bf 1f94211
Author: David.Hall <david.hall@mariadb.com>
Date: Thu May 31 09:30:12 2018 -0500

Merge pull request #118 from mariadb-corporation/MCOL-1435

Merge MariaDB 10.2.15 into develop-1.1

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit 36a536f4722ff2b2c8e21985de0639f22dd4a6bd
Merge: 3ab634b ed8e774
Author: benthompson15 <ben.thompson@mariadb.com>
Date: Mon Jun 4 09:56:04 2018 -0500

Merge pull request #488 from mariadb-corporation/MCOL-1370

Mcol 1370

I executed the API test for MCOL-1408 on 1.1.5-1 on CentOS 7 and got:
[root@localhost test]# ./mcol1408
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from mcol1408
[ RUN ] mcol1408.mcol1408
/root/mariadb-columnstore-api/test/mcol1408.cpp:126: Failure
Failed
Could not get result data: Internal error: An unexpected condition within the query caused an internal processing error within InfiniDB. Please check the log files for more details. Additional Information: error in BatchPrimitiveProces
[ FAILED ] mcol1408.mcol1408 (95470 ms)
[ RUN ] mcol1408.BRM
/root/mariadb-columnstore-api/test/mcol1408.cpp:179: Failure
Failed
Error caught: Backup error for column file for OID-3040; DBRoot-1; partition-0; segment-0; hwm-8192; chunkIdx-16; numPtrs-16; not in hdrPtrs; Error searching for compressed chunk in db file being backed up.

[ FAILED ] mcol1408.BRM (351 ms)
[----------] 2 tests from mcol1408 (95821 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (96243 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 2 tests, listed below:
[ FAILED ] mcol1408.mcol1408
[ FAILED ] mcol1408.BRM

2 FAILED TESTS

I tried reproducing the reporter's test case on 1.1.4-1. The script stopped/hung after 4095 commits:

.
.
committed 4091
committed 4092
committed 4093
committed 4094
committed 4095

The WriteEngineServ process was cranking at 100% CPU.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16035 root 20 0 2611044 1.275g 17124 S 100.0 22.5 137:54.24 WriteEngineServ

The same test on 1.1.5-1 did get past 8250 commits.

Comment by Daniel Lee (Inactive) [ 2018-06-05 ]

Reopening per my last comment.

Comment by Andrew Hutchings (Inactive) [ 2018-06-05 ]

With the CentOS 7 test, the LBID/HWM for the second extent in the segment doesn't appear to have been committed, causing the header read error (extent stuck in the "Updating" state). My guess is some kind of de-duplication not taking the offset into account.

Comment by Andrew Hutchings (Inactive) [ 2018-06-06 ]

Pull request in engine and API to fix more HWM issues found in CentOS 7.

Engine pull request fixes:

  • Generate error if calculateRowId fails
  • No data written when first extent is completely full on a write, all data going to second extent.
  • 0 byte valArray malloc
  • valArray free() on no malloc
  • Column touched but no data written if all data going to second extent
  • Wrong colWidth used on second extent calculateRowId
  • Out of bounds memory write (crash) when no data for first extent
  • Extent not committed if all data going to second extent

API pull request fixes a duplicate-extent issue when multiple extents from the same segment report HWMs (so the wrong HWM can be committed).
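The intent of that API-side de-duplication can be sketched as follows (an illustrative model of the fix's intent, not the actual pull-request code): when several extents from the same (DBRoot, partition, segment) file report an HWM, only one should survive, and it should be the highest so the later extent in the segment wins.

```python
def dedupe_hwm_reports(reports):
    """reports: iterable of (dbroot, partition, segment, hwm) tuples.
    Keeps one entry per segment file, preferring the highest HWM so the
    later extent in the segment wins."""
    best = {}
    for dbroot, part, seg, hwm in reports:
        key = (dbroot, part, seg)
        if key not in best or hwm > best[key]:
            best[key] = hwm
    return sorted((d, p, s, h) for (d, p, s), h in best.items())
```

With the editem numbers above, a first extent reporting HWM 8190 and a second extent in the same segment reporting a higher HWM would collapse to the second extent's value instead of committing the stale one.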

For QA: API test should pass with these pull requests.

Comment by Daniel Lee (Inactive) [ 2018-06-07 ]

Build verified: 1.1.5-1 source

/root/columnstore/mariadb-columnstore-server
commit 3e6e2b692a5847cc6e89fb200558ebc781b14b09
Merge: 0c983bf 1f94211
Author: David.Hall <david.hall@mariadb.com>
Date: Thu May 31 09:30:12 2018 -0500

Merge pull request #118 from mariadb-corporation/MCOL-1435

Merge MariaDB 10.2.15 into develop-1.1

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit 4a20da68acb197e9d300b57f37757e7d7669ea80
Merge: 10479de 05f1752
Author: benthompson15 <ben.thompson@mariadb.com>
Date: Wed Jun 6 14:17:27 2018 -0500

Merge pull request #491 from mariadb-corporation/MCOL-1405

MCOL-1405 - fix launch of mysql monitor thread on seperate module ins…

Executed the following tests:

1. Reporter's test case: ran beyond 16383 commits
2. API mcol1408 test: passed
3. Autopilot's boundary test: passed

Got the boundary test working again today.

Generated at Thu Feb 08 02:28:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.