Details
-
Bug
-
Status: Open (View Workflow)
-
Major
-
Resolution: Unresolved
-
10.6.10
-
None
-
Linux
Description
I have a CONNECT to a remote MySQL table (named, say, remotetbl} containing "many" rows (several hundered million). When I issue a simpe SELECT count( * ) FROM tbl_connect, I would expect this statement to be passed through to the remote server, which would send back exactly one line as a result.
What does happen, however, is that the whole table is copied over the link from the remote server (SELECT * FROM remotetbl), and that the count is then done locally. This involves the transfer of data in the gigabyte range and takes very long.
Remarkably, for a statement like SELECT count( * ) FROM tbl_connect WHERE a=42, the count is still not done on the remote server, but at least not all the columns are copied, the effective statement reduces to SELECT a FROM remotetebl WHERE a=42. Still highly suboptimal, but quite a bit more efficient than without the WHERE.
SImilarly, SELECT a, count( * ) FROM tbl_connect GROUP BY a leads to the statement SELECT a FROM remotetebl, with all the grouping and counting done on the local host, ignoring all indexes that are in place to help on the remote server, and again transferring big amounts of data.
I appreciate this is not strictly a bug, since I will eventually get the correct result (unless I have a timeout first, which I see happening), but it might be considered a "severe performance bug". (Which will, of course, not really be noticeable for small tables.)
I also appreciate that the CONNECT engine does not only connect to "intelligent" remote data sources like MySQL/MariaDB tables/servers but also to "dumb" remote sources like CSV tables, where no handing down is possible. I would assume, however, that the CONNECT engine would be capable of differentiating dumb from intelligent data sources.
Finally, I appreciate that for statements where a CONNECTed table is joined to a local table, data just have to be transferred. I guess it would be a task for the query optimizer to differentiate between queries that can be delegated, and those that can't.