[MCOL-1177] SparkConnector runs out of memory for large datasets, JDBC can handle the datasets just fine Created: 2018-01-25  Updated: 2023-10-26  Resolved: 2018-01-26

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.1.2
Fix Version/s: 1.1.3

Type: Bug Priority: Major
Reporter: Jens Röwekamp (Inactive) Assignee: Andrew Hutchings (Inactive)
Resolution: Fixed Votes: 0
Labels: None


 Description   

Both the Scala and Python benchmark scripts run out of memory while exporting large datasets to ColumnStore.

In contrast, JDBC handles the same datasets well and writes them to ColumnStore without a major increase in memory demand.

We need to investigate the source of the SparkConnector's high memory demand and reduce it if possible.



 Comments   
Comment by Jens Röwekamp (Inactive) [ 2018-01-25 ]

Fixed two bugs in the benchmark's command-line result output.

Changed the number of rows to write to 7,000,000 to be comparable with MCOL-1176.

The memory issue was a configuration matter: neither the Scala nor the PySpark benchmark was executed with enough heap allocation. Fixed that by setting a 10 GiB maximum in the execution scripts; the benchmark now runs successfully.

Previously, in the Scala case, only around 2.5 GiB of maximum heap was set, while the dataframe written to ColumnStore alone occupied around 2 GiB of memory.
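A minimal sketch of how such a heap-limit fix can look in the execution scripts. The script and jar names here are assumptions (the ticket does not show the actual scripts); the memory flags themselves are standard Spark options, and the 10 GiB figure comes from the comment above:

```shell
# Hypothetical benchmark invocations; Benchmark / benchmark.jar / benchmark.py
# are placeholder names, not the actual files from this ticket.

# Scala benchmark: raise the driver JVM heap to 10 GiB.
spark-submit \
  --driver-memory 10g \
  --class Benchmark \
  benchmark.jar

# PySpark benchmark: the same flag applies to the JVM backing PySpark.
spark-submit \
  --driver-memory 10g \
  benchmark.py
```

In local mode the driver heap is the one that matters, since the dataframe being written to ColumnStore is held there; equivalently, `spark.driver.memory` can be set via `--conf` or in `spark-defaults.conf`.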

Generated at Thu Feb 08 02:26:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.