[MCOL-1287] Small batches of loads are not distributed across PMs with Python bulk load API Created: 2018-03-21  Updated: 2021-01-17  Resolved: 2021-01-17

Status: Closed
Project: MariaDB ColumnStore
Component/s: N/A
Affects Version/s: None
Fix Version/s: N/A

Type: New Feature Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: Todd Stoffel (Inactive)
Resolution: Won't Do Votes: 0
Labels: None

Issue Links:
Blocks
is blocked by MCOL-808 mcsapi needs to use async write calls. Closed

 Description   

A user has reported that when he loads a relatively small batch of rows (100,000 or fewer) using the Python bulk load API, the rows are not distributed across PMs. Instead, all rows go to the first PM.



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2018-03-21 ]

This is because the batch size is set to 100,000 rows. What would be the expected behaviour here?

Comment by patrice [ 2018-03-21 ]

I was under the impression that cpimport balances the data across the PMs automatically, even when there are multiple small imports, and that mcsapi would behave the same way. In fact, I understand that if the 8 million row extent is not full it keeps pushing to the same PM, but I expected it to then create the next extent on the next PM, round-robin style. That way, after multiple small batch loads, the data would still be balanced.

Comment by Andrew Hutchings (Inactive) [ 2018-03-22 ]

No, the API currently round-robins every 100,000 rows. Until MCOL-808 is implemented, doing otherwise would be a performance hit. We are also looking into having an instance of the API bulk insert per PM, which would let you control which data goes into which PM.

I'll change this to a feature request for now and we will review it after MCOL-808.
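
The round-robin behaviour described above can be modelled with a short sketch. This is a toy simulation in plain Python, not mcsapi code: the function names, the PM count, and the assumption that every load starts at the first PM (inferred from the report in the description) are illustrative only.

```python
def rows_per_pm(total_rows, num_pms, batch_size=100_000):
    """Model one bulk load: rows are handed out in batches of
    `batch_size`, round robin, starting at the first PM (assumption)."""
    counts = [0] * num_pms
    pm = 0
    remaining = total_rows
    while remaining > 0:
        chunk = min(batch_size, remaining)
        counts[pm] += chunk
        remaining -= chunk
        pm = (pm + 1) % num_pms  # move to the next PM after each batch
    return counts


def repeated_loads(load_sizes, num_pms, batch_size=100_000):
    """Model several independent loads, each restarting at the first PM."""
    totals = [0] * num_pms
    for size in load_sizes:
        for i, n in enumerate(rows_per_pm(size, num_pms, batch_size)):
            totals[i] += n
    return totals


print(rows_per_pm(100_000, 3))          # one batch only: a single PM gets everything
print(rows_per_pm(250_000, 3))          # three batches: rows spread over three PMs
print(repeated_loads([50_000] * 4, 3))  # repeated small loads pile up on the first PM
```

Under these assumptions, any load of 100,000 rows or fewer never reaches the point where the writer switches PMs, which matches the reported symptom.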

Comment by Dipti Joshi (Inactive) [ 2018-04-09 ]

LinuxJedi While the API round-robins every 100,000 rows by default, I thought the batch size could be changed programmatically by the API user. Is that not correct?

Comment by Andrew Hutchings (Inactive) [ 2018-04-09 ]

dshjoshi The API call to do it is there, but it is not hooked up yet. It would be trivial to hook it up, but with the caveat that I'm not sure what would happen if you set it too high or too low (> 8 million will very likely do bad things). Since I have not had time to test it, this has been left disabled in 1.1 so far and documented as such.
