[MCOL-353] with 6 PMs, cpimport has an long lag time Created: 2016-10-10  Updated: 2017-03-07  Resolved: 2017-03-01

Status: Closed
Project: MariaDB ColumnStore
Component/s: cpimport
Affects Version/s: 1.0.1
Fix Version/s: 1.1.0

Type: Bug Priority: Major
Reporter: David Hall (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS


Attachments: Text File MCOL-353-pmp.txt    
Issue Links:
Duplicate
duplicates MCOL-388 Latency after cpimport job has been c... Closed
Relates
relates to MCOL-611 1.0.8 backport MCOL-353 with 6 PMs, c... Closed
Sprint: 2017-01, 2017-2, 2017-3, 2017-4, 2017-5

 Description   

Reported by a customer of InfiniDB. When cpimport is run, it reports complete rather quickly (most times) – 1 -2 seconds. Then there is a 19-30 second delay before it completes. This happens with LDI and cpimport direct. During the wait, we see the following which may or may not be related or significant:
Oct 7 21:50:28 ps3651 ProcessMonitor[874]: 28.645948 |0|0|0| D 18 CAL0000: ***read error, close create thread: InetStreamSocket::readToMagic: Remote is closed.

This doesn't appear on a single server system.



 Comments   
Comment by Dipti Joshi (Inactive) [ 2016-11-11 ]

What cpimport mode is this happening ?

Comment by way [ 2016-11-15 ]

Hi,

I'am the customer who reported this issue, cpimport is used in default mode :

mode 1 - rows will be loaded in a distributed manner across PMs.

2016-10-07 22:11:14 (87677) INFO : Running distributed import (mode 1) on all PMs...
2016-10-07 22:11:16 (87677) INFO : For table modulo_0005.p1029: 105 rows processed and 1 rows inserted.
2016-10-07 22:11:16 (87677) INFO : Bulk load completed, total run time : 2.25727 seconds

So 1 ligne I have to wait near 20 secondes, as you can see : (I cut useless)

Comment by Andrew Hutchings (Inactive) [ 2017-01-06 ]

Can someone please provide the full debug log at the time this occurs so we can look into a way of reproducing it?

Comment by Andrew Hutchings (Inactive) [ 2017-01-16 ]

Reproduced the problem. Attached a PMP output during the delay. Appears to be the message queue waiting for data and timing out.

Runtime data:

2017-01-16 22:12:54 (27859) INFO : Running distributed import (mode 1) on all PMs...
2017-01-16 22:13:28 (27859) INFO : For table test.numbers: 100000000 rows processed and 100000000 rows inserted.
2017-01-16 22:13:28 (27859) INFO : Bulk load completed, total run time : 34.6954 seconds
 
real	0m58.931s
user	0m7.145s
sys	0m1.320s

Comment by Andrew Hutchings (Inactive) [ 2017-01-17 ]

At the end of a cpimport there is a read timeout added which is 1 sec * PM count on each connection. There is then a race where we could (very likely) enter a socket poll with that timeout before we check to see if the connection has actually ended.

This accumulates at the end of the run due to the serial thread joining which again is a race between thread joining and polling. Maximum delay would be in the region of 'number of PMs' squared.

There is no obvious reason for the delay to be there since the loop will re-enter the recv() on the default 20ms so the extra timeout only appears to save a little CPU time on the loops at the end. This fix removes the extra timeout.

This has been tested with a full speed VirtualBox network and an artificial 500ms delay added using 'tc qdisc add dev enp0s3 root netem delay 500ms' to make sure there are no obvious side-effects.

Comment by David Hall (Inactive) [ 2017-01-17 ]

A one time test with multiple PMs to see if the delay is gone should suffice. Regression test should catch the unlikely scenario that a problem was introduced.

Comment by Daniel Lee (Inactive) [ 2017-03-01 ]

Build tested: Github source

[root@localhost columnstore]# cd mariadb-columnstore-server/
[root@localhost mariadb-columnstore-server]# git show
commit 3da188e5c8a2630019ea810fb8c1bd3ece5e058b
Merge: 5d9686c 53c1df7
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Fri Feb 10 15:07:31 2017 +0000

Merge pull request #31 from jbfavre/fix_deb_package_dependency

MCOL-562 Fix Debian package dependencies

[root@localhost mariadb-columnstore-server]# cd mariadb-columnstore-engine/
[root@localhost mariadb-columnstore-engine]# git show
commit 16cef50caedd9ec7585b04c096996a9441bdf2d5
Author: David Hill <david.hill@mariadb.com>
Date: Wed Mar 1 10:39:11 2017 -0600

change the check for prompt back to the previous code

Verified on a 1UM4PM stack.

[root@localhost ~]# date /usr/local/mariadb/columnstore/bin/cpimport
cpimport cpimport.bin
[root@localhost ~]# date /usr/local/mariadb/columnstore/bin/cpimport mytest lineitem /data/qa/source/dbt3/1g/lineitem.tbl
date: extra operand ‘mytest’
Try 'date --help' for more information.
[root@localhost ~]# time /usr/local/mariadb/columnstore/bin/cpimport mytest lineitem /data/qa/source/dbt3/1g/lineitem.tbl
2017-03-01 22:34:31 (11804) INFO : Running distributed import (mode 1) on all PMs...
2017-03-01 22:34:43 (11804) INFO : For table mytest.lineitem: 6001215 rows processed and 6001215 rows inserted.
2017-03-01 22:34:43 (11804) INFO : Bulk load completed, total run time : 12.2397 seconds

real 0m12.622s
user 0m1.220s
sys 0m1.612s
You have mail in /var/spool/mail/root

Generated at Thu Feb 08 02:20:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.