[MCOL-700] Failure of clientrotator.cpp causes DMLProc to lock up Created: 2017-05-04  Updated: 2017-05-08  Resolved: 2017-05-08

Status: Closed
Project: MariaDB ColumnStore
Component/s: DMLProc, ExeMgr, ProcMgr
Affects Version/s: 1.0.8
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Allan Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Centos 7; XFS; 32GB mem;
single server; 1 UM; 2 PM

% ll /usr/local/mariadb
lrwxrwxrwx 1 root root 26 Apr 12 16:59 columnstore -> /data/mariadb/columnstore/

% df -h /data
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/dev2_sys-sysroot 272G 65G 207G 24% /


Attachments: File columnstoreSupportReport.columnstore-1.tar.gz    
Issue Links:
PartOf
includes MCOL-680 cpimport not creating clean (new) par... Closed
includes MCOL-685 support dropping extents in addition ... Closed

 Description   

A Java program was written to work around the difficulty of bulk-deleting data under the current partition scheme (see https://jira.mariadb.org/browse/MCOL-685 and https://jira.mariadb.org/browse/MCOL-680).

This program works by issuing repeated DELETE statements with a fixed LIMIT until all rows for a particular column value are gone. The program starts fine and, across repeated tries, completes some number of successful deletes before the following occurs:

SQLState: HY000; Error Code: 1815; Internal error: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp: Could not get a connection to a ExeMgr

Subsequent attempts always fail with the following:

SQLState: HY000; Error Code: 1815; Internal error: IDB-2009: Unable to perform the delete operation because DMLProc with PID 27675 is currently holding the table lock for session 71

The problem is resolved by doing a kill -TERM on the DMLProc.

The program then runs fine again for some number of block deletes until the problem recurs, and killing the DMLProc again resolves it only temporarily.
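To make the stuck-lock state easier to spot from the client side, the PID named in the IDB-2009 message can be pulled out of the exception text. The following is only a sketch: the class name TableLockError and the method lockHolderPid are hypothetical, and the pattern is derived solely from the one message quoted above, so other ColumnStore versions may word it differently.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the reporter's program.
final class TableLockError {
    // Assumption: the message wording matches the single IDB-2009 example in this report.
    private static final Pattern HOLDER = Pattern.compile(
            "IDB-2009:.*DMLProc with PID (\\d{1,10}) is currently holding the table lock");

    // Returns the PID of the DMLProc holding the table lock, or empty if the
    // message is null or does not look like the IDB-2009 error.
    static OptionalLong lockHolderPid(String message) {
        if (message == null) {
            return OptionalLong.empty();
        }
        Matcher m = HOLDER.matcher(message);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        String msg = "Internal error: IDB-2009: Unable to perform the delete operation because "
                + "DMLProc with PID 27675 is currently holding the table lock for session 71";
        System.out.println(lockHolderPid(msg));
    }
}
```

In a catch block, the result of lockHolderPid(e.getMessage()) could be logged so that the operator knows which process to signal; the sketch deliberately does not send any signal itself.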

The relevant section of code that performs the delete follows. The whole program is available if needed:

{{ private static void doDelete(Connection connection) {
    String partitionDrop = "SELECT calDropPartitionsByValue('" + database + "','" + table + "','" + column + "'," + //
            "'" + columnValue + "','" + columnValue + "')";

    String deleteRecords = "DELETE FROM " + database + "." + table + //
            " WHERE " + column + "='" + columnValue + "'" + //
            " LIMIT " + batchSize;

    log.info("Creating statement ...");

    try (Statement statement = connection.createStatement()) {
        int numberDeleted = 0;

        // See if we can get rid of a lot using the fast method, then take care
        // of the rest using the slow method

        log.info("Executing » " + partitionDrop);
        try (ResultSet rs = statement.executeQuery(partitionDrop)) {
            // don't care about result
        } catch (Exception e) {
            log.info("Ignoring » " + e);
        }

        // Now get rid of the rest the hard way

        do {
            log.info("Executing » " + deleteRecords);
            numberDeleted = statement.executeUpdate(deleteRecords);
            if (numberDeleted > 0)
                log.info("Deleted block of " + numberDeleted + " records.");
        } while (numberDeleted > 0);
    } catch (SQLException e) {
        exitCode = 1;
        logSQLException(e);
    } catch (Exception e) {
        exitCode = 1;
        log.error("Deletion Failure", e);
    }
}}}



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2017-05-04 ]

This sounds like it could be MCOL-529. Can you please generate a report for us with this tool so that we can get more information?:
https://mariadb.com/kb/en/mariadb/system-troubleshooting-mariadb-columnstore/#mariadb-columnstore-support-tool

In the meantime, if it is MCOL-529, the following should stop it from happening. Run these as root and restart ColumnStore:

echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
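
To record the values before and after applying the commands above, the three files can simply be read back. The sketch below is an illustration only: the class name TcpSettings is hypothetical, the directory is passed in so the code can be tried against any path, and a file that cannot be read (some kernels do not expose every one of these settings) is reported as "unavailable" rather than treated as an error.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for inspecting the settings named in the comment above.
final class TcpSettings {
    static final String[] NAMES = {"tcp_fin_timeout", "tcp_tw_recycle", "tcp_tw_reuse"};

    // Reads each setting from the given directory; unreadable or missing
    // files are mapped to the string "unavailable".
    static Map<String, String> read(Path dir) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String name : NAMES) {
            try {
                byte[] raw = Files.readAllBytes(dir.resolve(name));
                values.put(name, new String(raw, StandardCharsets.UTF_8).trim());
            } catch (IOException e) {
                values.put(name, "unavailable");
            }
        }
        return values;
    }

    public static void main(String[] args) {
        read(Paths.get("/proc/sys/net/ipv4"))
                .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```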

Comment by Allan [ 2017-05-04 ]

Update: Adding a one-second pause between DELETE calls seems to make the problem go away. The test is still running after an hour; I never got this far before.
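
The pause described here can be sketched as a small loop that sleeps after every non-empty batch. This is not the reporter's actual change, which was not posted: the names PacedDelete, BatchDelete and run are hypothetical, and the delete itself is passed in as a callback so the loop can be exercised without a database.

```java
import java.sql.SQLException;

// Hypothetical sketch of a paced batch-delete loop.
final class PacedDelete {
    // One DELETE ... LIMIT call; returns the number of rows removed.
    interface BatchDelete {
        int deleteBatch() throws SQLException;
    }

    // Calls deleteBatch until it reports no rows, sleeping pauseMillis after
    // each non-empty batch. Returns the total number of rows deleted.
    static long run(BatchDelete delete, long pauseMillis)
            throws SQLException, InterruptedException {
        long total = 0;
        while (true) {
            int deleted = delete.deleteBatch();
            if (deleted <= 0) {
                return total;
            }
            total += deleted;
            Thread.sleep(pauseMillis);
        }
    }
}
```

With the code from the description, a call might look like PacedDelete.run(() -> statement.executeUpdate(deleteRecords), 1000), where 1000 ms corresponds to the one-second pause mentioned above.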

Comment by Allan [ 2017-05-04 ]

[root@dev2 ~]# /usr/local/mariadb/columnstore/bin/columnstoreSupport -a
Get software report data for pm1
Get config report data for pm1

Note: This output shows SysV services only and does not include native
systemd services. SysV configuration data might be overridden by native
systemd configuration.

If you want to list systemd services use 'systemctl list-unit-files'.
To see services enabled on particular target use
'systemctl list-dependencies [target]'.

Note: This output shows SysV services only and does not include native
systemd services. SysV configuration data might be overridden by native
systemd configuration.

If you want to list systemd services use 'systemctl list-unit-files'.
To see services enabled on particular target use
'systemctl list-dependencies [target]'.

Get log report data for pm1
Get log config data for pm1
Get hardware report data for pm1
Get resource report data for pm1
Get dbms report data for pm1
NOTE: MariaDB Columnstore root user password is set
NOTE: No password provide on command line or found uncommented in my.cnf
columnstoreSupportReport.columnstore-1.tar.gz

      • Enter MariaDB Columnstore password >

Columnstore Support Script Successfully completed, files located in columnstoreSupportReport.columnstore-1.tar.gz

Comment by Andrew Hutchings (Inactive) [ 2017-05-04 ]

This definitely makes it sound like MCOL-529. With the TCP/IP settings in my earlier comment the problem should go away.

Comment by David Thompson (Inactive) [ 2017-05-08 ]

Please reopen if 1.0.9, which includes the fix for MCOL-529, does not fix this.

Comment by Allan [ 2017-05-08 ]

Do you have a reference to where I can download 1.0.9 to try it out?

Comment by David Thompson (Inactive) [ 2017-05-08 ]

We are working on final bug fixes and stabilization, so hopefully within the next week.

Comment by Allan [ 2017-05-08 ]
