[MCOL-383] R Programming In-Database: MariaDB ColumnStore ENGINE Created: 2016-10-31  Updated: 2022-11-05  Resolved: 2022-11-05

Status: Closed
Project: MariaDB ColumnStore
Component/s: N/A
Affects Version/s: 1.0.4
Fix Version/s: Icebox

Type: New Feature Priority: Major
Reporter: Juan Telleria Assignee: Todd Stoffel (Inactive)
Resolution: Won't Do Votes: 1
Labels: None
Environment:

Windows, Linux.


Attachments: Microsoft Word R Statistical Programming using MariaDB as the background database.docx    

 Description   

I would like to suggest that MariaDB implements into its new Engine "ColumnStore" Built-In Database R Programming (Based on Microsoft Open R), similar to the one that
"Microsoft SQL Server 2016" already offers.



 Comments   
Comment by David Thompson (Inactive) [ 2016-10-31 ]

Some details on sql servers offering: https://msdn.microsoft.com/en-us/library/mt604845.aspx

Currently use of R requires a separate installation and use of the RMySQL package.

Comment by Juan Telleria [ 2017-03-04 ]

¿Any news on this topic?

I personally think that future of Databases goes somehow in this direction. In fact, SQL Server 2016 was considered to be one of the best database from 2016 (http://db-engines.com/en/blog_post/67), and one of the features it included was this.

The point is to avoid having to move big amounts of data from MariaDB to Open R (With RStudio GUI), and instead analyze it in the database itself.

Comment by David Thompson (Inactive) [ 2017-03-04 ]

Today you could deploy a single larger server system running both columnstore and R which would help reduce network round traffic time. To do this you would need to tune columnstores memory settings down to leave adequate memory for R.

Going forward we are looking into adding lower level api's to the PM level for higher performance direct access. We are prioritizing the write path first but the intention of the read api would be to support higher performance access (but with much simpler query capabilities) from systems like R or Spark.

We also plan to add a native regression function and support for custom aggregate functions that would allow you to develop other solutions.

Probably the main benefit of the microsoft r / sql server support is the scale-r libraries which i believe are proprietary not open source.

Do you have a specific use case in mind that does not currently work with columnstore and R installed either remotely or on a shared server?

Comment by Juan Telleria [ 2017-03-04 ]

I did no thought of installing R and Columstore on the same server and adequate memory. But It's a brilliant idea! By doing so network traffic for data transmission can be avoided.

In fact, it is possible to use RStudio Server (Open Source) to access directly to a remote R server connection.

I think it might work for my purpose of alarm log analysis. I'll give it a try!

Thank you

Comment by David Thompson (Inactive) [ 2017-03-05 ]

Let me know how it goes! As a reminder here are the memory settings to play with to reduce the amount of memory columnstore will consume to ensure you have enough for R:
https://mariadb.com/kb/en/mariadb/mariadb-columnstore-performance-related-configuration-settings/#memory-management-numblockspct-and-totalummemory

Comment by Juan Telleria [ 2017-05-31 ]

My review on the topic:

I installed MariaDB and "Microsft R Open" in the same machine, in order to perform Big Data Analytics, and what I can say about it is that:

  • RMySQL library has to many bugs, and I normally loose connection when I am transferring multi-million rows from R to MariaDB, and viceversa.
  • Therefore, I use RODBC with MariaDB ODBC plugin, which allows to slice data in chunks and, although it offers a low data transfer speed, it performs robustly as expected, and I do not loose connection.

Therefore, my recommendation are:

  • To use RODBC for big data transfer between R and MariaDB (Not RMySQL).
  • To reserve some memory in the Server for R.
  • In R, a must is "tidyverse" library.

And some solutions could be:

  • MariaDB, when used along with R, innodb / aria dynamic memory allocation, up to a maximum amount of memory.
  • ¿New RMariaDB R library? For high speed big data transfers, in chunks, between R and MariaDB.

And a disruptive innovation could be:

  • ¿Allow some R functions to be executed within user defined functions in MariaDB? ¿With single value return? Mmm... too slow I guess.
Comment by David Thompson (Inactive) [ 2017-05-31 ]

Thanks for your research. Would you be ok if i converted this to a knowledge base article?

It looks like the RMySQL community is fairly active so if you have some specific test cases you can file with them maybe we can make it bettter?
https://github.com/rstats-db/RMySQL/issues

Comment by David Thompson (Inactive) [ 2017-05-31 ]

I checked with some developers internally and one suggestion was to try increasing the mariadb server variable net_write_timeout. The default is 60s but potentially larger queries could be inactive for longer. Also it's possible RMySQL has its own timeout but i can't find any reference to such a setting yet.

Comment by Juan Telleria [ 2017-06-02 ]

Yes, but I would prefer to prepare a Word document and pass it to you through here so that you can check it before releasing anything in MariaDB Knowledge Base.

I learned how to program in R recently by means of 2 excellent books concept-oriented:

  • R Cookbook.
  • R Graphics Cookbook.

Also, my main development GUI is RStudio.

And in my personal PC, I use the general distribution of R (From R Project Foundation), but for a Production Environment I use "Microsoft R Open", which uses a fixed snapshot of CRAN.

The best webpage for searching R related information is:
http://rseek.org/

And explanations on how to use RODBC can be found here:
https://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf

Kind regards,
Juan

Comment by Juan Telleria [ 2017-06-02 ]

If I worked in MariaDB, I would also recompile RMySQL from scrach, and:

  • Modify its code for enabling massive and secure data transfers (GPL-2).
  • Distribute it through CRAN, and MariaDB Foundation.
  • Do some vignetes on how to use it.

Thank you.

Comment by Juan Telleria [ 2017-06-05 ]

There is a very interesting github project for using R formulas in Excel, which could also be extended to MariaDB:

https://bert-toolkit.com/bert-quick-start

Although I have not used it and I do not know how well it works.

Comment by Juan Telleria [ 2017-08-09 ]

Just discovered a package in R (Based on RODBC) which allows IBM in-Database Analytics for R (ibmdbR):

CRAN ibmdbR

Porting such package to MariaDB is extremely simple, as it is all Open Source (GPL-v3).

This would be really useful for performing Machine Learning with R using MariaDB as the background database.

Comment by Juan Telleria [ 2017-08-13 ]

Dear Mr. Thompson,

I attach to this issue, as you requested some time ago, a Word document containing some basic notions on "How-To" start using R Statistical Programming along with MariaDB.

If appropriate, this document could be added to the MariaDB Knowledge Base.

Although simple, hope it is useful.

Kind regards,
Juan Telleria

R Statistical Programming using MariaDB as the background database.docx

Comment by Juan Telleria [ 2017-08-14 ]

Just added the article to the MariaDB Knowledge Base:

R Statistical Programming using MariaDB as the background database

Kind regards,
Juan

Comment by David Thompson (Inactive) [ 2017-08-15 ]

Awesome - many thanks and looks great! I'll hopefully have some time to run through an example later this week.

Comment by Juan Telleria [ 2019-06-09 ]

Current Microsoft Implementation I think is no adequate for a Database Level.

I feel that the integration between R with MariaDB shall be understood only as Scheduled Events, or Triggers, that are able to call from within MariaDB Server certain RMarkdown Documents or RScripts. You could directly contact RStudio Company for collaborating with them in this sense.

Comment by Todd Stoffel (Inactive) [ 2022-11-05 ]

Item is out of date. Closing due to inactivity. If you feel this was done in error please open a new ticket.

Generated at Thu Feb 08 02:20:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.