[MCOL-66] Autopilot systemTest.concurDML test failed for delete and ldi commands Created: 2016-05-20  Updated: 2016-08-02  Resolved: 2016-08-02

Status: Closed
Project: MariaDB ColumnStore
Component/s: DDLProc, DMLProc
Affects Version/s: None
Fix Version/s: 1.0.2

Type: Bug Priority: Major
Reporter: Daniel Lee (Inactive) Assignee: Ben Thompson (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MCOL-35 211 Concurrent Transactions Test failed Closed
Sprint: 1.0.2-1, 1.0.2-2

 Description   

Build tested:

InfiniDB> getcalpontsoft
getcalpontsoftwareinfo Fri May 20 14:46:37 2016

Name : infinidb-platform Relocations: (not relocatable)
Version : 5.0 Vendor: MariaDB, Inc.
Release : 0 Build Date: Sun 15 May 2016 07:24:29 PM CDT

The issue was identified by Autopilot test systemTest.concurDML. The end-of-test results did not match the expected results.



 Comments   
Comment by Dipti Joshi (Inactive) [ 2016-05-23 ]

Concurrent DML causes some DML statements not to be executed.

Comment by Daniel Lee (Inactive) [ 2016-05-25 ]

Build tested:

mscadmin> getsoftware
getsoftwareinfo Wed May 25 17:10:38 2016
Name : mariadb-columnstore-platform Relocations: (not relocatable)
Version : 1.0 Vendor: MariaDB Corporation Ab
Release : 0 Build Date: Tue 24 May 2016 07:24:06 PM CDT
Install Date: Wed 25 May 2016 12:27:43 PM CDT Build Host: srvbuilder

Executed the same test case with the latest build and did not encounter the issue. It is difficult to identify the root cause, or even a reproducible scenario, that caused the issue.

Comment by Daniel Lee (Inactive) [ 2016-05-26 ]

The Autopilot test run also showed this issue. Suspecting that the concurDDL issue described in MCOL-14, which occurred before this test, might have something to do with this error, I reran the test, skipping concurDDL. The test passed.

Comment by Dipti Joshi (Inactive) [ 2016-06-03 ]

David.Hall: What is the progress on this issue?

Comment by David Hall (Inactive) [ 2016-07-05 ]

There seem to be two separate problems involved.

First, the DDL and DML parsers in columnstore are not thread safe. The parsers use global pointers to hold intermediate values, and all threads create objects into those global values and delete them when done. Depending on the timing, this causes a double delete (abort) or an invalid pointer access (segv); if neither occurs, you could still get a scrambled parse. Fixing this requires re-factoring the bison and flex code.

Second, the vss portion of the DBRM is not transaction safe. That is, concurrent transactions seem to cause problems with each other, causing them all to roll back or fail completely. At this point, some block locks are left on and the DBRM is in a non-recoverable bad state. This problem is only seen once problem #1 is fixed, because the crash otherwise masks it.
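
To illustrate the first problem, here is a minimal sketch of the re-entrant pattern: the intermediate parse state lives in a per-call context object instead of a global pointer, so no thread can delete or overwrite another thread's objects. ParseTree, ParserContext, parseStatement and parseConcurrently are hypothetical names for illustration only, not the actual ColumnStore parser code, and the whitespace tokenizer merely stands in for the bison/flex grammar.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical parse result; a real parser builds a much richer tree.
struct ParseTree {
    std::vector<std::string> tokens;
};

// Per-call parser state. In a non-re-entrant design, `result` would be a
// file-scope global pointer shared (and deleted) by every thread.
struct ParserContext {
    std::unique_ptr<ParseTree> result;
};

// Re-entrant entry point: everything it creates is owned by `ctx`.
void parseStatement(const std::string& sql, ParserContext& ctx) {
    ctx.result = std::make_unique<ParseTree>();
    std::istringstream in(sql);
    std::string tok;
    while (in >> tok) ctx.result->tokens.push_back(tok);
}

// Parse the same statement on several threads, each with its own context,
// and return the token count each thread observed.
std::vector<std::size_t> parseConcurrently(const std::string& sql, int nThreads) {
    assert(nThreads > 0);
    std::vector<std::size_t> counts(static_cast<std::size_t>(nThreads), 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < nThreads; ++i) {
        workers.emplace_back([&counts, &sql, i] {
            ParserContext ctx;  // private to this thread, freed on scope exit
            parseStatement(sql, ctx);
            counts[static_cast<std::size_t>(i)] = ctx.result->tokens.size();
        });
    }
    for (auto& w : workers) w.join();
    return counts;
}
```

Because each context owns its tree through a unique_ptr, the double delete and the dangling pointer access described above cannot occur in this sketch.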

Comment by David Hall (Inactive) [ 2016-07-06 ]

Changes to make the DDL and DML parsers re-entrant are checked in. Concurrent tests still fail, but rather than crashing, they leave the DBRM in an unusable state.

Comment by David Hall (Inactive) [ 2016-07-08 ]

DML statements take a table lock for a portion of their processing. DDL statements do too (except CREATE TABLE). Unfortunately, DDL modifies the system catalog tables and doesn't take a lock there. The DBRM can't handle this and breaks in horrible ways. I've added a mutex lock to simulate a system catalog lock for DDL, serializing that portion of the processing that modifies the DBRM and the system catalog.
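
A minimal sketch of that kind of serialization, assuming a single process-wide std::mutex guarding only the catalog-modifying step. SystemCatalog, ddlCatalogMutex, applyDdl and runConcurrentDdl are hypothetical names; the actual change in DDLProc may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the system catalog; not ColumnStore's real type.
struct SystemCatalog {
    std::vector<std::string> tables;
};

// One process-wide mutex standing in for a system catalog lock.
std::mutex& ddlCatalogMutex() {
    static std::mutex m;
    return m;
}

// Only the portion that modifies the catalog is serialized.
void applyDdl(SystemCatalog& catalog, const std::string& tableName) {
    // ... parsing and validation could still run concurrently here ...
    std::lock_guard<std::mutex> guard(ddlCatalogMutex());
    catalog.tables.push_back(tableName);  // critical section
}

// Run several DDL operations at once and report how many catalog entries exist.
std::size_t runConcurrentDdl(int nThreads) {
    assert(nThreads > 0);
    SystemCatalog catalog;
    std::vector<std::thread> workers;
    for (int i = 0; i < nThreads; ++i) {
        workers.emplace_back([&catalog, i] {
            applyDdl(catalog, "t" + std::to_string(i));
        });
    }
    for (auto& w : workers) w.join();
    return catalog.tables.size();
}
```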

This works great, but it uncovered another problem:

The VSS in the DBRM uses the transaction ID as the version number. The way it's designed, it expects to receive all transactions in numerical order. There are comments in the code stating that if that assumption changes, then we need to do something. Nothing is mentioned about what changes are needed.

With the change to MariaDB, there is no guarantee of transactions arriving at the DBRM in the same order they were created. The transaction IDs don't arrive in strict numerical order when concurrent transactions are executed. Things break. It appears that DROP TABLE breaks more, but I'm not convinced other types of DDL won't also break. More testing needs to be done to see if DML breaks also. So far, concurrent DML hasn't broken for me.
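
The ordering assumption can be made concrete with a small sketch. It models only the "IDs arrive in numerical order" expectation and counts the arrivals that violate it; it does not reproduce the VSS itself, and OrderedVersionTracker and countOutOfOrder are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical tracker that models only the arrival-order assumption.
class OrderedVersionTracker {
public:
    // Returns true if txnId is not lower than every ID seen so far,
    // false if it arrived after a higher ID had already been observed.
    bool observe(std::uint64_t txnId) {
        if (seenAny_ && txnId < highest_) {
            ++outOfOrder_;
            return false;
        }
        highest_ = txnId;
        seenAny_ = true;
        return true;
    }

    std::size_t outOfOrderCount() const { return outOfOrder_; }

private:
    std::uint64_t highest_ = 0;
    bool seenAny_ = false;
    std::size_t outOfOrder_ = 0;
};

// Count how many arrivals in a sequence break the ordering assumption.
std::size_t countOutOfOrder(const std::vector<std::uint64_t>& arrivals) {
    OrderedVersionTracker tracker;
    for (std::uint64_t id : arrivals) tracker.observe(id);
    return tracker.outOfOrderCount();
}
```

With arrivals 1, 3, 2, 4, transaction 2 reaches the tracker after transaction 3, which is the kind of sequence concurrent transactions can produce.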

Comment by David Hall (Inactive) [ 2016-07-20 ]

Added code to force serialization of DDL code independent of DML being re-entrant.

An additional issue was exposed during testing which may be related to eventum #8648. When DELETE is being run with autocommit=off by machine (humans are too slow to make this appear), it's possible for the response from the DELETE to be returned before all cleanup is complete. The COMMAND (COMMIT or ROLLBACK) arrives while the DELETE operation is still in packageHandlerMap. It was not designed to happen like this. However, it shouldn't hurt. Unfortunately, the ctrl-c logic (which detects that the user hit ctrl-c while the delete is running) consumes the COMMAND data, and processing continues with an empty buffer. When the COMMAND logic attempts to read the data, an exception is thrown. For reasons lost in antiquity, the exception handler did nothing – it was an empty block.

catch { <nothing here> }

So nothing was logged and the transaction just hung.

Fixed the code so that ctrl-c logic doesn't consume the buffer. Also added exception handling. Unfortunately, it still leaves the table lock on. At least we're now notified of the error. With the fix of the ctrl-c logic, there's no known way to follow this error path, so I don't think it will be a problem.
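
A minimal sketch of the two changes, under the assumption that the command data sits in a simple queue-like buffer: the ctrl-c check peeks without consuming, and the handler records the failure instead of swallowing it. MessageBuffer, peekIs and readCommand are hypothetical names and do not correspond to the real DMLProc classes.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical message buffer; reading from an empty one throws.
struct MessageBuffer {
    std::vector<std::string> data;

    // Consuming read: removes and returns the oldest message.
    std::string read() {
        if (data.empty()) throw std::runtime_error("read from empty buffer");
        std::string front = data.front();
        data.erase(data.begin());
        return front;
    }

    // Non-consuming check, so a ctrl-c test does not eat the next command.
    bool peekIs(const std::string& expected) const {
        return !data.empty() && data.front() == expected;
    }
};

// Returns the command on success, or an empty string after logging the error.
std::string readCommand(MessageBuffer& buf, std::vector<std::string>& log) {
    try {
        return buf.read();
    } catch (const std::exception& e) {
        // Previously an empty block: nothing was logged and the caller hung.
        log.push_back(std::string("command read failed: ") + e.what());
        return {};
    }
}
```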

Comment by David Hall (Inactive) [ 2016-07-20 ]

This does not fix MCOL-140 where multiple updates or inserts on the same table may cause an error.

Comment by Ben Thompson (Inactive) [ 2016-08-02 ]

Review Complete

Generated at Thu Feb 08 02:18:15 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.