  1. MariaDB ColumnStore
  2. MCOL-1362

Add an export function that uses (sequential) writes from Spark workers

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.2.2
    • Component/s: mcsapi
    • Labels: None
    • Sprint: 2018-20

      Description

      The current export function calls collect() on the DataFrame and thereby loads the entire DataFrame into the Spark driver's memory. Depending on the DataFrame's size, this can consume a very large amount of memory. On the upside, this approach only needs the bulk write SDK installed on the Spark driver and avoids concurrency problems, as the driver is the only process writing to ColumnStore.

      Another option is to export each of the DataFrame's partitions directly from the worker that holds it. This would reduce memory usage on the driver. On the downside, every worker needs to have the ColumnStore bulk write API installed, and we might run into concurrency problems if multiple processes try to write to the same table simultaneously.
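
      A minimal sketch of the per-partition approach, assuming PySpark. The BulkWriter interface and the open_writer factory are hypothetical stand-ins for the bulk write API and are not taken from mcsapi; only DataFrame.foreachPartition is an actual Spark call. Each worker opens its own writer, streams its partition row by row, and commits or rolls back.

```python
from typing import Callable, Iterable, Protocol, Sequence


class BulkWriter(Protocol):
    """Hypothetical stand-in for a bulk-write handle (not the mcsapi API)."""

    def write_row(self, row: Sequence) -> None: ...
    def commit(self) -> None: ...
    def rollback(self) -> None: ...


def export_partition(rows: Iterable[Sequence],
                     open_writer: Callable[[], BulkWriter]) -> int:
    """Stream one partition to the table and return the number of rows written."""
    writer = open_writer()  # opened on the worker, not on the driver
    written = 0
    try:
        for row in rows:  # rows are consumed lazily, one at a time
            writer.write_row(row)
            written += 1
        writer.commit()
    except Exception:
        writer.rollback()  # discard the partial write for this partition
        raise
    return written


def export_from_workers(df, open_writer: Callable[[], BulkWriter]) -> None:
    """Export every partition from its worker instead of calling df.collect()."""

    def run(rows):
        export_partition(rows, open_writer)

    df.foreachPartition(run)
```

      Because open_writer is called inside the partition function, the connection is created on the worker. The sketch does nothing about concurrent writes to the same table.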

      This ticket covers the export from the worker nodes, not from the driver.

      Depending on the concurrency problem, we might want to consider writing sequentially from each worker, or writing in parallel to different tables and joining them afterwards.
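
      The "different tables" variant could be sketched as follows, assuming each partition writes to its own staging table, which is then appended to the target with INSERT ... SELECT. The naming scheme and both helper functions are hypothetical, and the sketch only builds the SQL strings; it does not execute them.

```python
from typing import List


def staging_table_name(target: str, partition_id: int) -> str:
    """Per-partition staging table name (hypothetical naming scheme)."""
    return f"{target}_stage_{partition_id}"


def merge_statements(target: str, num_partitions: int) -> List[str]:
    """SQL that appends each staging table to the target and then drops it."""
    statements = []
    for pid in range(num_partitions):
        stage = staging_table_name(target, pid)
        statements.append(f"INSERT INTO {target} SELECT * FROM {stage}")
        statements.append(f"DROP TABLE {stage}")
    return statements
```

      On the Spark side, the partition index needed for the table name is available through RDD.mapPartitionsWithIndex. The merge statements would run once, after all workers have finished.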

              People

              Assignee:
              dthompson David Thompson (Inactive)
              Reporter:
              jens.rowekamp Jens Röwekamp (Inactive)
