  MariaDB Server / MDEV-30497

CONNECT ignores "intelligence" of underlying MySQL table

Details

    Description

      I have a CONNECT table pointing to a remote MySQL table (named, say, remotetbl) containing "many" rows (several hundred million). When I issue a simple SELECT count( * ) FROM tbl_connect, I would expect this statement to be passed through to the remote server, which would send back exactly one row as a result.

      What happens instead, however, is that the whole table is copied over the link from the remote server (SELECT * FROM remotetbl) and the count is then done locally. This involves transferring data in the gigabyte range and takes a very long time.

      Remarkably, for a statement like SELECT count( * ) FROM tbl_connect WHERE a=42, the count is still not done on the remote server, but at least not all columns are copied: the effective statement reduces to SELECT a FROM remotetbl WHERE a=42. Still highly suboptimal, but quite a bit more efficient than without the WHERE.

      Similarly, SELECT a, count( * ) FROM tbl_connect GROUP BY a leads to the statement SELECT a FROM remotetbl, with all the grouping and counting done on the local host, ignoring all indexes that are in place to help on the remote server, and again transferring large amounts of data.
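      In summary, the observed translations are (local query first, followed by the statement seen on the remote server):

          -- Local:  SELECT count( * ) FROM tbl_connect;
          -- Remote: SELECT * FROM remotetbl;

          -- Local:  SELECT count( * ) FROM tbl_connect WHERE a=42;
          -- Remote: SELECT a FROM remotetbl WHERE a=42;

          -- Local:  SELECT a, count( * ) FROM tbl_connect GROUP BY a;
          -- Remote: SELECT a FROM remotetbl;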

      I appreciate this is not strictly a bug, since I will eventually get the correct result (unless I hit a timeout first, which I do see happening), but it might be considered a "severe performance bug". (Which will, of course, not really be noticeable for small tables.)

      I also appreciate that the CONNECT engine connects not only to "intelligent" remote data sources like MySQL/MariaDB tables/servers but also to "dumb" remote sources like CSV tables, where no push-down is possible. I would assume, however, that the CONNECT engine is capable of differentiating dumb from intelligent data sources.

      Finally, I appreciate that for statements where a CONNECTed table is joined to a local table, data simply have to be transferred. I guess it would be a task for the query optimizer to differentiate between queries that can be delegated and those that cannot.
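      For what it's worth, a per-query workaround may be possible with the SRCDEF option of CONNECT's MYSQL table type, which (if I understand the documentation correctly) pins the exact statement executed on the remote server. A minimal sketch, assuming a remote database db on host remotehost with credentials user/pwd (all of these names are placeholders):

          -- Hypothetical names; only the SRCDEF mechanism itself is documented.
          CREATE TABLE tbl_remote_count ENGINE=CONNECT TABLE_TYPE=MYSQL
            CONNECTION='mysql://user:pwd@remotehost/db'
            SRCDEF='SELECT count(*) AS cnt FROM remotetbl';

          -- The counting now happens on the remote server:
          SELECT cnt FROM tbl_remote_count;

      This covers only statements known in advance, of course, and is no substitute for genuine push-down.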


        Activity

          gwselke Gisbert W. Selke added a comment:

          Here's how to reproduce:
          On the remote server, create a table like this:

          CREATE TABLE remotetbl ( a int, b int, c int, d int, key idx1 (a) );

          Populate this table with 100 million random records (one way to do this is sketched after these steps).
          On the local host, create a CONNECTed table tbl_connect to remotetbl.
          On the local host, run SELECT count( * ) FROM tbl_connect.
          On the remote host, run SHOW FULL PROCESSLIST.
          See what command was started from your local host.
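          A concrete sketch of the populate and CONNECT-table steps (untested; user, pwd, remotehost and db are placeholders, and the population step assumes the Sequence engine's seq_1_to_N tables are available):

          -- On the remote server: 100 million rows of random data.
          INSERT INTO remotetbl
            SELECT FLOOR(RAND()*1000), FLOOR(RAND()*1000),
                   FLOOR(RAND()*1000), FLOOR(RAND()*1000)
            FROM seq_1_to_100000000;

          -- On the local host: the CONNECT table pointing at remotetbl.
          CREATE TABLE tbl_connect ENGINE=CONNECT TABLE_TYPE=MYSQL
            CONNECTION='mysql://user:pwd@remotehost/db'
            TABNAME='remotetbl';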


          gwselke Gisbert W. Selke added a comment:

          Similarly, the query SELECT * FROM tbl_connect LIMIT 10 translates, for the underlying table on the remote database, into a simple SELECT * FROM remotetbl (note: no LIMIT).
          Needless to say, this may transfer multiple gigabytes where a few hundred bytes (depending on the table structure) would be enough.
          Using the LIMIT here would probably not require much intelligence on the side of the engine accessing the underlying table. For example, it could be implemented even for a CSV table, and certainly for more intelligent engines like MySQL/MariaDB.
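          In the same shorthand as above (the "expected" line is what a push-down of the LIMIT would send):

              -- Local:    SELECT * FROM tbl_connect LIMIT 10;
              -- Remote:   SELECT * FROM remotetbl;            -- observed: no LIMIT
              -- Expected: SELECT * FROM remotetbl LIMIT 10;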


          People

            Assignee: Unassigned
            Reporter: gwselke Gisbert W. Selke
            Votes: 0
            Watchers: 3

