[MCOL-1119] spark connector for publishing dataframe results using mcsapi to columnstore. Created: 2017-12-18 Updated: 2023-10-26 Resolved: 2018-04-02 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | None |
| Affects Version/s: | 1.1.2 |
| Fix Version/s: | 1.1.3 |
| Type: | New Feature | Priority: | Major |
| Reporter: | David Thompson (Inactive) | Assignee: | David Thompson (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Sprint: | 2017-25, 2018-01, 2018-02, 2018-03, 2018-04, 2018-05, 2018-06, 2018-07 |
| Description |
|
We should support a data adapter that bridges Spark (both Scala and PySpark) to ColumnStore. The intended use case is publishing ML results to ColumnStore, both as a system of record for those results and to enable easier consumption of that data with SQL alongside other data already stored in MariaDB. Broadly speaking, the goal is to take a DataFrame object and serialize it to a ColumnStore table using mcsapi, which requires new code bridging the Spark world to mcsapi. A first implementation can assume that an appropriate table already exists, but it would be valuable to create or adapt code that generates appropriate ColumnStore CREATE TABLE statements, to be run as stage 1 before writing the data. |
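The "stage 1" table generation mentioned above could look roughly like the following. This is a minimal sketch for illustration only, not the connector's actual code: the type mapping and the `spark_schema_to_create_table` helper are assumptions, and a plain list of (name, type) pairs stands in for a real Spark StructType.

```python
# Hypothetical sketch: derive a ColumnStore CREATE TABLE statement from a
# DataFrame-like schema. The Spark-type-to-ColumnStore mapping below is an
# assumption for demonstration, not the connector's actual mapping.

SPARK_TO_COLUMNSTORE = {
    "IntegerType": "INTEGER",
    "LongType": "BIGINT",
    "DoubleType": "DOUBLE",
    "StringType": "VARCHAR(255)",   # assumed default width
    "BooleanType": "TINYINT",
    "TimestampType": "DATETIME",
}

def spark_schema_to_create_table(table, schema):
    """Build a CREATE TABLE statement targeting the ColumnStore engine.

    `schema` is a list of (column_name, spark_type_name) pairs standing in
    for a Spark StructType.
    """
    cols = ",\n  ".join(
        "`%s` %s" % (name, SPARK_TO_COLUMNSTORE[spark_type])
        for name, spark_type in schema
    )
    return "CREATE TABLE `%s` (\n  %s\n) ENGINE=columnstore" % (table, cols)

# Example: a schema for published ML results.
schema = [("id", "LongType"), ("score", "DoubleType"), ("label", "StringType")]
print(spark_schema_to_create_table("ml_results", schema))
```

Running the generated statement against MariaDB before the bulk write would make the export self-contained for tables that do not yet exist. |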
| Comments |
| Comment by Jens Röwekamp (Inactive) [ 2018-01-17 ] |
|
Added a spark-connector that uses mcsapi to export a DataFrame to ColumnStore. Python 2/3 and Scala are supported. An automatic build and basic tests have been included in CMakeLists.txt and run successfully on Ubuntu 16.04. |
| Comment by Jens Röwekamp (Inactive) [ 2018-01-17 ] |
|
Attached my Docker test environment build file (cd spark-dev). Jupyter can then be accessed at http://localhost:8888. |
| Comment by Andrew Hutchings (Inactive) [ 2018-01-18 ] |
|
Looks great! Moved to DT to test, as he understands the requirements for this better than I do. |