[MCOL-140] Concurrent Insert threads generate errors Created: 2016-06-14  Updated: 2016-08-23  Resolved: 2016-08-23

Status: Closed
Project: MariaDB ColumnStore
Component/s: DDLProc, DMLProc
Affects Version/s: None
Fix Version/s: 1.0.2

Type: Task Priority: Critical
Reporter: Anders Karlsson Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Sprint: 1.0.2-2

 Description   

Inserting data into a simple table using multiple concurrent threads causes errors:
Table schema:
CREATE TABLE `facts` (
`invoice_id` int(11) NOT NULL,
`order_value` int(11) NOT NULL,
`customer_id` int(11) NOT NULL
) ENGINE=Columnstore
Multiple concurrent inserts (two or more), with or without array inserts, eventually seem to generate this error:
2016-06-14 20:24:11 MySQL Error: (1815)
Internal error: CAL0001: Insert Failed: a BRM VB entry error.



 Comments   
Comment by Dipti Joshi (Inactive) [ 2016-06-22 ]

This may be related to MCOL-66, which David.Hall is working on right now.

Comment by David Hall (Inactive) [ 2016-07-21 ]

This is caused by a combination of two things. First, the VSS can't handle transactions arriving out of order if they are acting on the same block. Second, since VSS access is down the execution path, simultaneous threaded transactions arrive in unpredictable order based on cpu cycles given to each thread by the scheduler. Sometimes these two properties clash and result in a transaction being rejected.

Comment by David Hall (Inactive) [ 2016-07-28 ]

I've created a mechanism to serialize transactions only within each table. Transactions on different tables continue concurrently. The mechanism is surrounded by an ifdef, so it will be easy to turn off when we start testing a new VSS.

The mechanism is thus (as documented in the code):
// Blocks a thread if there is another trx working on the same fTableOid
// return 1 when thread should continue.
// return 0 if error. Right now, no error detection is implemented.
//
// txnid was being created before the call to this function. This caused race conditions
// so creation is delayed until we're inside the lock here. Nothing needs it before
// this point in the execution.
//
// The algorithm is this. When the first txn for a given fTableOid arrives, start a queue
// containing a list of waiting or working txnId. Put this txnId into the queue (working).
// Put the queue into a map keyed on fTableOid.
//
// When the next txn for this fTableOid arrives, it finds the queue in the map and adds itself,
// then waits for condition.
// When a thread finishes, it removes its txnId from the queue and notifies all. If the queue is
// empty, it removes the entry from the map.
// Upon wakeup from wait(), a thread checks to see if it's next in the queue. If so, it is released
// to do work. Otherwise it goes back to wait.
//
// There's a chance (CTRL+C, for instance) that the txn is no longer in the queue. Release it to work.
// Rollback will most likely be next.
//
// A transaction for one fTableOid is not blocked by a txn for a different fTableOid.
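
For readers following the flow, the algorithm in the comment above can be sketched roughly as below. This is a minimal illustration, not the ColumnStore code: `TableSerializer`, `begin()`, `end()` and `maxConcurrentPerTable()` are hypothetical names, standard C++ primitives stand in for whatever locking DMLProc actually uses, and `begin()` returns the new txn id instead of the 1/0 status described in the comment.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Illustrative only: all names here are hypothetical, not ColumnStore's.
class TableSerializer {
public:
    // Blocks until the caller is first in line for tableOid, then returns
    // the txn id, which is allocated inside the lock.
    uint32_t begin(uint32_t tableOid) {
        std::unique_lock<std::mutex> lk(mtx_);
        const uint32_t txnId = ++nextTxnId_;
        queues_[tableOid].push_back(txnId);  // starts the queue on first use
        cond_.wait(lk, [&]() -> bool {
            auto it = queues_.find(tableOid);
            if (it == queues_.end()) return true;   // queue gone: release
            const std::deque<uint32_t>& q = it->second;
            if (std::find(q.begin(), q.end(), txnId) == q.end())
                return true;                        // no longer queued: release
            return q.front() == txnId;              // otherwise wait for our turn
        });
        return txnId;
    }

    // Removes txnId from its queue, drops an empty queue, and wakes waiters.
    void end(uint32_t tableOid, uint32_t txnId) {
        {
            std::lock_guard<std::mutex> lk(mtx_);
            auto it = queues_.find(tableOid);
            if (it != queues_.end()) {
                std::deque<uint32_t>& q = it->second;
                q.erase(std::remove(q.begin(), q.end(), txnId), q.end());
                if (q.empty()) queues_.erase(it);
            }
        }
        cond_.notify_all();
    }

private:
    std::mutex mtx_;
    std::condition_variable cond_;
    std::map<uint32_t, std::deque<uint32_t>> queues_;  // keyed on table OID
    uint32_t nextTxnId_ = 0;
};

// Runs `threads` workers against each of `tables` tables and returns the
// highest number of txns ever seen working on one table at the same time.
int maxConcurrentPerTable(uint32_t tables, int threads, int iterations) {
    TableSerializer ser;
    std::vector<std::atomic<int>> active(tables);
    for (auto& a : active) a.store(0);
    std::atomic<int> maxSeen{0};
    std::vector<std::thread> pool;
    for (uint32_t t = 0; t < tables; ++t) {
        for (int w = 0; w < threads; ++w) {
            pool.emplace_back([&, t] {
                for (int i = 0; i < iterations; ++i) {
                    const uint32_t txn = ser.begin(t);
                    const int now = ++active[t];
                    int prev = maxSeen.load();
                    while (now > prev &&
                           !maxSeen.compare_exchange_weak(prev, now)) {}
                    std::this_thread::yield();  // stand-in for the DML work
                    --active[t];
                    ser.end(t, txn);
                }
            });
        }
    }
    for (auto& th : pool) th.join();
    return maxSeen.load();
}
```

In this sketch a single mutex and condition variable cover every table, so a finishing txn wakes waiters on all tables and each one re-checks whether it is at the head of its own queue. That matches the "notifies all" wording in the comment; whether the real code shares one condition variable across tables is an assumption.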

A new test212 has been added to the regression suite to test multiple threads doing DML on single tables. The test runs up multiple tables and multiple threads for each table and then slams DML at them.

Comment by David Hall (Inactive) [ 2016-07-28 ]

This is really a kludge. Eventually we need to re-engineer the VSS so that it does not need things arriving in order. However, that is an expensive and risky operation, so we have this kludge. Have fun following the flow!

Comment by Ben Thompson (Inactive) [ 2016-08-01 ]

Review Completed

Comment by Dipti Joshi (Inactive) [ 2016-08-16 ]

David.Hall, has this been run through regression? If so, please assign to dleeyh for Autopilot testing.

Comment by Daniel Lee (Inactive) [ 2016-08-23 ]

Have been testing in Autopilot's concurrent DDL and DML tests. All good.
