[MCOL-1362] Add an export function that utilizes (sequential) write from Spark workers - Jira

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.2.2
Component/s: None
Labels: None
Sprint: 2018-20

Description

The current export function calls collect() on the DataFrame, which loads the entire DataFrame into the memory of the Spark driver. Depending on the DataFrame's size, this can require very large amounts of driver memory. On the other hand, this approach only needs the bulk write SDK installed on the Spark driver, and it avoids concurrency problems because the driver is the only process writing to ColumnStore (CS).
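The memory difference can be illustrated without a cluster. The following is a minimal sketch in plain Python, assuming a DataFrame can be modelled as a list of partitions; the function names are hypothetical and only count how many rows the driver would have to hold at once.

```python
from typing import List, Tuple

Row = Tuple[int, int]
Partitions = List[List[Row]]  # stand-in for a partitioned DataFrame


def peak_rows_with_collect(partitions: Partitions) -> int:
    """collect()-style: every row is materialised on the driver before writing."""
    all_rows = [row for part in partitions for row in part]
    return len(all_rows)


def peak_rows_partition_at_a_time(partitions: Partitions) -> int:
    """Streaming-style: the driver buffers only one partition at a time."""
    peak = 0
    for part in partitions:
        buffered = list(part)  # rows of the current partition only
        peak = max(peak, len(buffered))
    return peak


# Eight partitions with 1000 rows each.
partitions = [[(i, p) for i in range(1000)] for p in range(8)]

print(peak_rows_with_collect(partitions))         # 8000
print(peak_rows_partition_at_a_time(partitions))  # 1000
```

In this model the driver's peak footprint grows with the total row count under collect(), but only with the largest partition when partitions are processed one at a time.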

Another option is to export each partition of the DataFrame directly from the worker that holds it. This would reduce memory usage on the driver. On the downside, every worker needs to have the CS bulk write API installed, and we might run into concurrency problems if multiple processes try to write to the same table simultaneously.
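A per-partition export could look roughly like the sketch below. InMemoryBulkWriter is a hypothetical stand-in for a bulk write session, not the actual CS bulk write API, and the demo runs locally on plain lists. In Spark, a function of this shape would be passed to DataFrame.foreachPartition, which calls it on each worker with an iterator over that partition's rows.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[int, str]


class InMemoryBulkWriter:
    """Hypothetical stand-in for a bulk write session (not the real CS API)."""

    def __init__(self, table: str, sink: List[Row]) -> None:
        self.table = table
        self._sink = sink
        self._pending: List[Row] = []

    def write_row(self, row: Row) -> None:
        self._pending.append(row)

    def commit(self) -> None:
        self._sink.extend(self._pending)
        self._pending = []

    def rollback(self) -> None:
        self._pending = []


def make_partition_exporter(table: str, sink: List[Row]) -> Callable[[Iterable[Row]], None]:
    """Build the function that each worker would run once per partition."""

    def export_partition(rows: Iterable[Row]) -> None:
        writer = InMemoryBulkWriter(table, sink)  # one session per partition
        try:
            for row in rows:
                writer.write_row(row)
            writer.commit()
        except Exception:
            writer.rollback()  # leave no partial partition behind
            raise

    return export_partition


# Local demo: two "partitions" exported one after the other.
# A shared Python list only works in a single process; on a real cluster
# the sink would be the database itself.
sink: List[Row] = []
export = make_partition_exporter("demo_table", sink)
for part in [[(1, "a"), (2, "b")], [(3, "c")]]:
    export(iter(part))

print(len(sink))  # 3
```

Because each partition opens its own session, the driver never sees the rows, but every worker needs the writer library available, which matches the trade-off described above.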

This ticket covers the export from worker nodes and not the driver.

Depending on the concurrency problem, we might want to consider writing sequentially from each worker, or writing in parallel to different tables and joining them afterwards.
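Both strategies can be sketched in plain Python, with threads standing in for Spark workers. This is only an illustration under stated assumptions: the lock models whatever mechanism would serialise writers on a real cluster (a thread lock does not span machines), the staging table names are hypothetical, and "joining" is interpreted here as combining the staging tables into one result.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Tuple[int, str]

# Four partitions with three rows each.
partitions: List[List[Row]] = [
    [(p * 10 + i, f"p{p}") for i in range(3)] for p in range(4)
]

# Strategy 1: one shared table, writes serialised by a lock.
table_lock = threading.Lock()
shared_table: List[Row] = []
active_writers = 0
max_active_writers = 0


def write_sequentially(rows: List[Row]) -> None:
    global active_writers, max_active_writers
    with table_lock:  # only one writer may touch the table at a time
        active_writers += 1
        max_active_writers = max(max_active_writers, active_writers)
        shared_table.extend(rows)
        active_writers -= 1


# Strategy 2: one staging table per partition, combined afterwards.
staging_tables: Dict[str, List[Row]] = {}


def write_to_staging(index: int, rows: List[Row]) -> None:
    staging_tables[f"export_stage_{index}"] = list(rows)


with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(write_sequentially, partitions))
    list(pool.map(write_to_staging, range(len(partitions)), partitions))

# Combine the staging tables into the final result.
merged = [row for name in sorted(staging_tables) for row in staging_tables[name]]

print(max_active_writers)                      # 1
print(sorted(shared_table) == sorted(merged))  # True
```

The first strategy keeps a single target table but gives up write parallelism; the second keeps writers independent at the cost of an extra combine step.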

Issue Links

relates to

MCOL-1852 Spark Exporter uses collect() instead of toLocalIterator() on DataFrames to export and therefore uses too much memory on the Driver (Closed)

MCOL-1077 Two applications using Bulk Insert API (Closed)

People

Assignee: David Thompson (Inactive)

Reporter: Jens Röwekamp (Inactive)

Votes: 0

Watchers: 2

Dates

Created: 2018-04-24 00:58

Updated: 2023-10-26 13:17

Resolved: 2018-11-30 23:03
