[MCOL-5153] Disk-based aggregation fails with ERROR 1815 (HY000): Internal error: TupleAggregateStep::threadedAggregateRowGroups()[24] MCS-2054: Unknown error while aggregation. (part 1) Created: 2022-07-07 Updated: 2022-10-26 Resolved: 2022-08-17 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | ExeMgr |
| Affects Version/s: | 6.2.3, 6.3.1, 6.4.1 |
| Fix Version/s: | 22.08.1, 6.4.4-dompe |
| Type: | Bug | Priority: | Major |
| Reporter: | Roman | Assignee: | Alexey Antipovsky (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Sprint: | 2021-17 |
| Assigned for Testing: | |
| Description |
|
Aggregation on a VARCHAR(128) column (approximately 31 billion distinct values) fails with an obscure error.
The current implementation of RowAggStorage::increaseSize() can raise RowAggStorage::Data::fMask four times before a rehash happens. The guarding check in increaseSize() is too restrictive and fails easily for large values of fCurData->fMask and fCurData->fSize (see RowAggStorage::increaseSize() for details). The suggested solution is to increase the multiplier in the guard expression:
|
| Comments |
| Comment by Roman [ 2022-07-07 ] |
|
Plz review. |
| Comment by Roman [ 2022-07-09 ] |
|
For QA: I have seen this in the wild on beefy hardware with 1.5 TB of RAM, running an S3-based cluster on NVMe. The issue happens with aggregation on a VARCHAR(30) column when the number of DISTINCT values equals 31 billion. |
| Comment by Daniel Lee (Inactive) [ 2022-07-18 ] |
|
Build tested: 6.4.2-1 (Jenkins build bb-10.6.8-4-cs-6.4.2-1), storage: local. With disk-join disabled, the query would run out of memory. Also tested "select count". Is this test for S3 only, or would local storage be sufficient? |
| Comment by Roman [ 2022-08-04 ] |
|
Another iteration on the disk-based aggregation code. This attempt replaces MariaDB's collation-aware hashing with a combination of strnxfrm (which converts a byte array into a collation-aware array of sort weights) and an MM3 (MurmurHash3) byte-array hash. There is also an optimization borrowed from Robin Hood hashing that is triggered when RowStorage::increaseSize() is called while there is plenty of space available in the current fCurData, without taking more RAM (see the patch for details). |
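A minimal sketch of the strnxfrm-then-MM3 idea. The MurmurHash3 x86_32 routine follows Austin Appleby's public-domain reference; toyWeights is a hypothetical stand-in for the server's strnxfrm weight transform, here just folding ASCII case to emulate a case-insensitive collation:

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <cstring>
#include <string>

// MurmurHash3 x86_32 (after Austin Appleby's public-domain reference).
uint32_t mm3_32(const void* key, size_t len, uint32_t seed) {
  const uint8_t* data = static_cast<const uint8_t*>(key);
  const size_t nblocks = len / 4;
  uint32_t h1 = seed;
  const uint32_t c1 = 0xcc9e2d51, c2 = 0x1b873593;
  for (size_t i = 0; i < nblocks; i++) {
    uint32_t k1;
    std::memcpy(&k1, data + i * 4, 4);
    k1 *= c1; k1 = (k1 << 15) | (k1 >> 17); k1 *= c2;
    h1 ^= k1; h1 = (h1 << 13) | (h1 >> 19); h1 = h1 * 5 + 0xe6546b64;
  }
  const uint8_t* tail = data + nblocks * 4;
  uint32_t k1 = 0;
  switch (len & 3) {
    case 3: k1 ^= static_cast<uint32_t>(tail[2]) << 16; [[fallthrough]];
    case 2: k1 ^= static_cast<uint32_t>(tail[1]) << 8;  [[fallthrough]];
    case 1: k1 ^= tail[0];
            k1 *= c1; k1 = (k1 << 15) | (k1 >> 17); k1 *= c2; h1 ^= k1;
  }
  h1 ^= static_cast<uint32_t>(len);
  h1 ^= h1 >> 16; h1 *= 0x85ebca6b; h1 ^= h1 >> 13;
  h1 *= 0xc2b2ae35; h1 ^= h1 >> 16;
  return h1;
}

// Hypothetical stand-in for the server's strnxfrm: a real collation
// produces an array of sort weights; here we only fold ASCII case.
std::string toyWeights(const std::string& s) {
  std::string w(s);
  std::transform(w.begin(), w.end(), w.begin(),
                 [](unsigned char c) { return std::tolower(c); });
  return w;
}

// Two strings that are equal under the collation hash identically,
// because they map to the same weight bytes before hashing.
uint32_t collationAwareHash(const std::string& s) {
  const std::string w = toyWeights(s);
  return mm3_32(w.data(), w.size(), 0);
}
```

The key property the patch relies on is that the hash is computed over the weight bytes rather than the raw string, so collation-equal keys land in the same bucket without a collation-aware comparison inside the hash function itself.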
| Comment by Daniel Lee (Inactive) [ 2022-08-10 ] |
|
Build tested: 22.08-1 (#5243). Ran the same 300 GB DBT3 database test above successfully.