Details
- Type: New Feature
- Status: Closed
- Priority: Major
- Resolution: Fixed
- 2018-20
Description
The current export function calls collect() on the DataFrame, which materializes the entire DataFrame in the Spark driver's memory. Depending on the size of the DataFrame, this can lead to excessive memory usage. The upside of the current approach is that only the Spark driver needs the bulk write SDK installed, and it avoids concurrency problems because the driver is the only process writing to CS.
Another option is to export each DataFrame partition directly from the worker that holds it. This would reduce memory usage on the driver. On the downside, every worker needs the CS bulk write API installed, and we might run into concurrency problems if multiple processes write to the same table simultaneously.
This ticket covers the export from the worker nodes, not from the driver.
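A minimal sketch of what the worker-side export could look like, assuming a hypothetical BulkWriter interface standing in for the CS bulk write API (the client, its method names, and the batch size are assumptions, not the actual SDK):

```scala
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical stand-in for the CS bulk write SDK; the real client and its
// method names are assumptions.
trait BulkWriter extends Serializable {
  def writeRows(table: String, rows: Seq[Row]): Unit
  def close(): Unit
}

object PartitionExport {
  // Export each partition from the executor that owns it, so rows are never
  // collected on the driver.
  def exportFromWorkers(df: DataFrame, table: String, mkWriter: () => BulkWriter): Unit = {
    df.foreachPartition { (rows: Iterator[Row]) =>
      val writer = mkWriter()                    // one client per partition/task
      try {
        rows.grouped(1000).foreach { batch =>    // batch size is an assumption
          writer.writeRows(table, batch)
        }
      } finally {
        writer.close()
      }
    }
  }
}
```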
Depending on how severe the concurrency problem turns out to be, we might want to write sequentially from each worker, or write in parallel to separate tables and join them afterwards.
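A sketch of the "parallel to separate tables" variant, reusing the hypothetical BulkWriter from the sketch above; the staging-table naming and the follow-up merge step are assumptions, not part of the existing export code:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.{DataFrame, Row}

object StagedPartitionExport {
  // Each partition writes to its own staging table, so no two executors write
  // to the same table concurrently. The staging tables still have to be merged
  // into the target table afterwards (not shown; depends on what CS offers).
  def exportToStagingTables(df: DataFrame, baseTable: String, mkWriter: () => BulkWriter): Seq[String] = {
    val numPartitions = df.rdd.getNumPartitions
    df.foreachPartition { (rows: Iterator[Row]) =>
      val stagingTable = s"${baseTable}_part_${TaskContext.getPartitionId()}"  // illustrative name
      val writer = mkWriter()
      try rows.grouped(1000).foreach(batch => writer.writeRows(stagingTable, batch))
      finally writer.close()
    }
    // Hand the staging table names back so a follow-up job can join/union them.
    (0 until numPartitions).map(i => s"${baseTable}_part_$i")
  }
}
```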