[MCOL-1289] Python bulk load is slower than expected Created: 2018-03-21  Updated: 2023-10-26  Resolved: 2021-07-08

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: None
Fix Version/s: N/A, 1.4.5

Type: Bug Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None

Issue Links:
Relates
relates to MCOL-808 mcsapi needs to use async write calls. Closed
Sprint: 2018-07, 2018-08, 2018-09, 2018-10, 2018-11, 2018-12, 2018-13, 2018-14, 2018-15, 2018-16, 2018-17, 2018-18, 2018-19, 2018-20, 2018-21, 2019-01, 2019-02, 2019-03, 2019-04

 Description   

A user has reported that bulk loading data with the Python API is slower than expected. He said that neither the network nor the WriteEngine seems to be the bottleneck, so he suspects that performance can be improved.

One suggestion provided was to have some functions in the Python API where he could pass the data already formatted for the table, and then have all the setColumn calls and data casting performed in the underlying C API instead of Python. He suspects that this would probably help to speed up the load process.
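To make the suggestion concrete, here is a minimal sketch of the per-cell call pattern it aims to avoid. FakeBulkInsert is a stand-in written for this example (it only mimics the setColumn/writeRow names and counts calls; it is not pymcsapi), and write_rows is a hypothetical helper, not an existing API. It shows that loading N rows of C columns costs N * (C + 1) wrapper calls when the loop runs in Python.

```python
# Stand-in for pymcsapi.ColumnStoreBulkInsert so the sketch runs without
# ColumnStore installed. It only records rows and counts wrapper calls.
class FakeBulkInsert(object):
    def __init__(self):
        self.calls = 0
        self.rows = []
        self._current = {}

    def setColumn(self, index, value):
        self.calls += 1
        self._current[index] = value
        return self

    def writeRow(self):
        self.calls += 1
        self.rows.append(tuple(self._current[i] for i in sorted(self._current)))
        self._current = {}
        return self


def write_rows(bulk, rows):
    """Hypothetical helper: today this loop runs in Python, with one
    wrapper call per cell. The suggestion is to hand `rows` over once
    and run the equivalent loop inside the C/C++ layer instead."""
    for row in rows:
        for index, value in enumerate(row):
            bulk.setColumn(index, value)
        bulk.writeRow()


rows = [(i, "name-%d" % i) for i in range(1000)]
bulk = FakeBulkInsert()
write_rows(bulk, rows)
print("wrapper calls: %d" % bulk.calls)  # rows * (columns + 1)
```

With a real bulk insert object, each of those calls crosses the Python/C boundary, which is where the reporter suspects the time goes.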



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2018-03-21 ]

MCOL-808 will improve the situation quite a bit. I've linked that as a related issue.

Comment by Andrew Hutchings (Inactive) [ 2018-03-22 ]

This has been assigned and scheduled for a sprint so that we can investigate where the bottleneck is before we look into where we can make improvements.

Comment by Jens Röwekamp (Inactive) [ 2018-03-24 ]

Time to ingest 5000000 rows (seconds)

                          local            remote
Python 2, Swig 3.0.13     193.642560334    178.118374129
Python 3, Swig 3.0.13     385.190170498    386.924084597

Python 2, Swig 2.0.10     198.347457877    202.099096384
Python 3, Swig 2.0.10     395.491980514    406.624436263

Benchmarked on my laptop (local machine: CentOS 7, CS 1.1.3-2) and on a remote machine (Debian 9, CS 1.1.3-1).
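For easier comparison, the timings above can be converted into rows per second. The short script below is only arithmetic on the reported numbers; the variable and function names are made up for this example.

```python
ROWS = 5000000

# (interpreter, SWIG version) -> (local seconds, remote seconds),
# copied from the benchmark results above.
timings = {
    ("Python 2", "3.0.13"): (193.642560334, 178.118374129),
    ("Python 3", "3.0.13"): (385.190170498, 386.924084597),
    ("Python 2", "2.0.10"): (198.347457877, 202.099096384),
    ("Python 3", "2.0.10"): (395.491980514, 406.624436263),
}


def rows_per_second(seconds):
    return ROWS / seconds


for (interp, swig), (local, remote) in sorted(timings.items()):
    print("%s, Swig %s: %.0f rows/s local, %.0f rows/s remote"
          % (interp, swig, rows_per_second(local), rows_per_second(remote)))

# How much longer Python 3 takes than Python 2 (local run, Swig 3.0.13).
slowdown = timings[("Python 3", "3.0.13")][0] / timings[("Python 2", "3.0.13")][0]
print("Python 3 / Python 2 local time ratio: %.2f" % slowdown)
```

The ratio comes out close to 2 for both SWIG versions, so the SWIG version itself does not appear to explain the gap between the interpreters.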

Comment by Jens Röwekamp (Inactive) [ 2018-04-03 ]

One reason why Python 3 is slower than Python 2 is that

_pymcsapi.ColumnStoreBulkInsert_setColumn,
_pymcsapi.ColumnStoreBulkInsert_writeRow,
_pymcsapi.ColumnStoreBulkInsert_setNull, and
_pymcsapi.ColumnStoreBulkInsert_resetRow are

calling pymcsapi.ColumnStoreBulkInsert.__setattr__, which triggers _swig_setattr()

_pymcsapi compiled for Python 2 doesn't show this behaviour.

The optional Swig flag -py3 didn't help either.

Callgrind logs added to the log files.

-------------------------------------------------------------------------------------------

Renaming the overloaded ColumnStoreBulkInsert.setColumn() functions into unique functions (e.g. ColumnStoreBulkInsert.setColumn_int32()) increased overall ingestion speed by about 34% for Python 2 and 29% for Python 3.

Verified on the example of setColumn_int32().
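As a rough Python-level analogy for why unique names help: an overloaded entry point has to probe the argument type on every call before it can pick an implementation, while a type-specific function goes straight to the work. The functions below are illustrative stand-ins written for this example (they are not the pymcsapi functions, and the real dispatch happens inside the generated C wrapper).

```python
import struct


# Type-specific entry points: no type probing needed.
def set_column_int32(column, value):
    return (column, struct.pack("<i", value))


def set_column_double(column, value):
    return (column, struct.pack("<d", value))


def set_column_string(column, value):
    return (column, value.encode("utf-8"))


def set_column(column, value):
    """Overload-style entry point: checks candidate types in order on
    every call, then forwards to the matching implementation."""
    if isinstance(value, str):
        return set_column_string(column, value)
    if isinstance(value, float):
        return set_column_double(column, value)
    if isinstance(value, int):
        return set_column_int32(column, value)
    raise TypeError("unsupported type: %r" % type(value))


# Both paths produce the same result; the direct call skips the probing.
print(set_column(0, 42) == set_column_int32(0, 42))
```

The trade-off is that callers must know the column type up front and pick the matching function themselves.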
