[MCOL-3490] CAL0002: Delete Failed: IDB-2008: The version buffer overflowed. Increase VersionBufferFileSize or limit the rows to be processed but variable is 64G already Created: 2019-09-09  Updated: 2021-05-10  Resolved: 2021-05-10

Status: Closed
Project: MariaDB ColumnStore
Component/s: writeengine
Affects Version/s: 1.2.4, 1.2.5
Fix Version/s: 5.6.1

Type: Bug Priority: Major
Reporter: Rick Pizzi Assignee: Roman
Resolution: Cannot Reproduce Votes: 1
Labels: None


 Description   

One of our customers encountered the above error while trying to delete from a 3-row table, with the variable mentioned already set to 64 GB.

Running restartSystem resolved the problem.

System Info:

getsysteminfo Mon Sep 9 06:17:59 2019
 
System columnstore-1
 
System and Module statuses
 
Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Fri Aug 23 10:43:03 2019
 
Module um1 ACTIVE Fri Aug 23 10:42:54 2019
Module pm1 ACTIVE Fri Aug 23 10:42:36 2019
Module pm2 ACTIVE Fri Aug 23 10:42:44 2019
 
Active Parent OAM Performance Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled
MariaDB ColumnStore set for Distributed Install
 
 
MariaDB ColumnStore Process statuses
 
Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Fri Aug 23 10:42:00 2019 11832
ServerMonitor um1 ACTIVE Fri Aug 23 10:42:22 2019 12484
DBRMWorkerNode um1 ACTIVE Fri Aug 23 10:42:36 2019 12611
ExeMgr um1 ACTIVE Fri Aug 23 10:42:47 2019 14097
DDLProc um1 ACTIVE Fri Aug 23 10:42:51 2019 14147
DMLProc um1 ACTIVE Fri Aug 23 10:43:01 2019 14193
mysqld um1 ACTIVE Thu Aug 29 13:40:46 2019 2797
 
ProcessMonitor pm1 ACTIVE Fri Aug 23 10:41:42 2019 9241
ProcessManager pm1 ACTIVE Fri Aug 23 10:41:49 2019 9650
DBRMControllerNode pm1 ACTIVE Fri Aug 23 10:42:30 2019 10507
ServerMonitor pm1 ACTIVE Fri Aug 23 10:42:32 2019 10528
DBRMWorkerNode pm1 ACTIVE Fri Aug 23 10:42:32 2019 10560
PrimProc pm1 ACTIVE Fri Aug 23 10:42:36 2019 10630
WriteEngineServer pm1 ACTIVE Fri Aug 23 10:42:37 2019 10654
 
ProcessMonitor pm2 ACTIVE Fri Aug 23 10:42:08 2019 9241
ProcessManager pm2 HOT_STANDBY Fri Aug 23 10:43:01 2019 9813
DBRMControllerNode pm2 COLD_STANDBY Fri Aug 23 10:42:23 2019
ServerMonitor pm2 ACTIVE Fri Aug 23 10:42:26 2019 9671
DBRMWorkerNode pm2 ACTIVE Fri Aug 23 10:42:40 2019 9708
PrimProc pm2 ACTIVE Fri Aug 23 10:42:44 2019 9725
WriteEngineServer pm2 ACTIVE Fri Aug 23 10:42:45 2019 9735
 
Active Alarm Counts: Critical = 1, Major = 4, Minor = 4, Warning = 0, Info = 0



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2019-10-07 ]

We need some more information here: schema, delete query, data, etc.

Comment by Andrew Hutchings (Inactive) [ 2019-10-08 ]

For public:
The user is doing a loop of 10000 deletes at a time.

Given the feedback provided, a single DELETE command can create several GB of version buffer data, depending on how the data is spread. A DELETE in ColumnStore is currently an in-place update of every column for that row: ColumnStore takes a copy of each pre-modified block when applying the delete, and the delete touches every column of the row. So if the affected rows are widely spread, a lot of version buffer data is created.

If they are running the loop inside a single transaction (which I strongly suspect is the case), then it would be very easy to blow through 64 GB of data.
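The arithmetic behind that claim can be sketched roughly as follows. This is an illustrative back-of-envelope estimate, not a measured figure: it assumes ColumnStore's 8 KB block size and, in the worst case, that every deleted row lands in a distinct block for every column, so each (row, column) pair forces a separate pre-image block copy into the version buffer. Table width and batch size are hypothetical.

```python
# Worst-case version buffer growth estimate for a ColumnStore DELETE,
# per the explanation above: one pre-modified 8 KB block is copied for
# every column of every affected row when rows are widely scattered.

BLOCK_SIZE = 8 * 1024  # ColumnStore data block size (8 KB)

def version_buffer_estimate(rows_deleted, num_columns, rows_per_block=1):
    """rows_per_block=1 models the worst case (fully scattered rows);
    larger values model deletes hitting contiguous rows in shared blocks."""
    blocks_copied = (rows_deleted // rows_per_block) * num_columns
    return blocks_copied * BLOCK_SIZE

# One batch of 10,000 scattered deletes against a hypothetical 50-column table:
print(version_buffer_estimate(10_000, 50))  # 4096000000 bytes, roughly 4 GB
```

Under these assumptions a single 10,000-row batch already produces about 4 GB of version buffer data, so a handful of batches inside one open transaction would exhaust 64 GB.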

Comment by Rick Pizzi [ 2019-10-22 ]

We verified that each loop iteration is actually a new connection to the server, so there is no long transaction in play here. The issue keeps recurring, not only with DELETE but also with UPDATE.
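The batching pattern described in the comments, one bounded DELETE per fresh autocommit connection so that no single transaction accumulates version buffer data, can be sketched as below. The statement generation is shown standalone; the table name, batch size, and the use of DELETE ... LIMIT are illustrative assumptions, not details from the ticket.

```python
# Sketch of the customer's batched-delete pattern: split a large delete
# into fixed-size batches, each intended to run on its own autocommit
# connection so version buffer usage stays bounded per statement.

BATCH_SIZE = 10_000

def batch_statements(total_rows, batch_size=BATCH_SIZE, table="t"):
    """Yield one bounded DELETE per batch. Each statement would be issued
    on a new connection, matching the behavior confirmed above."""
    full, rest = divmod(total_rows, batch_size)
    for _ in range(full):
        yield f"DELETE FROM {table} LIMIT {batch_size}"
    if rest:
        yield f"DELETE FROM {table} LIMIT {rest}"

stmts = list(batch_statements(25_000))
print(len(stmts))  # 3 batches: two of 10,000 rows and one of 5,000
```

Since each batch commits independently, the version buffer should be released between batches, which is why the continued overflow pointed Roman toward locked extents rather than transaction size.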

Comment by Roman [ 2019-11-13 ]

Extents in our metadata storage, called the Extent Map, can be either locked or unlocked. The long UPDATE that eventually failed caused extents to become locked until the next reboot. These locked extents in turn severely break the version buffer operations used by DELETE and UPDATE.
I'm building a special release for the customer this week. It contains a workaround and additional debugging information that will tell us what causes the buffer-related failures.
However, the original problem, a very slow (>30 min) UPDATE over 10,000 records, calls for additional research that goes beyond the scope of this issue.

Comment by David Hill (Inactive) [ 2020-10-01 ]

Customer again is asking for an update.

Generated at Thu Feb 08 02:43:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.