Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Version: 10.0.10
Description
Hi,
I've just tested the CONNECT engine on a "big" table (24GB) on my local server:
CREATE TABLE `url_list_connect` (
  `id` int(10) unsigned NOT NULL,
  `url` varchar(2048) NOT NULL DEFAULT '',
  `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `id_license_domain` int(10) unsigned NOT NULL DEFAULT '0',
  `id_license` int(10) unsigned NOT NULL DEFAULT '0',
  `dom` mediumint(8) unsigned NOT NULL DEFAULT '0',
  `loadtime` mediumint(8) unsigned NOT NULL DEFAULT '0',
  `gen` mediumint(8) unsigned NOT NULL DEFAULT '0',
  `script` varchar(2048) NOT NULL DEFAULT '',
  `database_time` mediumint(8) unsigned NOT NULL DEFAULT '0'
) ENGINE=CONNECT DEFAULT CHARSET=latin1 `table_type`=MYSQL `dbname`=test `tabname`=url_list `option_list`='user=root,host=localhost,password=xxxx';
The original table is a tokudb table with a tokudb_cache_size of 12GB
(I can see the same behaviour with an InnoDB table).
Then I've tested the following:
SELECT * FROM url_list_connect LIMIT 10;
This query seems to execute a full scan of the table and load the whole table into memory before returning the first 10 results.
Hence the mysql server eventually crashes because it runs out of memory (even with a small innodb_buffer_pool_size / tokudb_cache_size / tmp_table_size).
I don't see any variable like the one in the Cassandra SE which allows controlling the full scan behaviour:
"Full table scans are performed in a memory-efficient way. Cassandra SE performs a full table scan as a series of batches, each of which reads not more than cassandra_rnd_batch_size records."
To reproduce: create a table with a size greater than the total memory on the server, then select from it without a WHERE clause.
Thanks and regards,
Jocelyn Fournier
Normally, CONNECT does not make a full table scan when retrieving query results. However, there are exceptions:
1 - XML tables, because the libraries used (DOMDOC or libxml2) are run in DOM mode rather than SAX mode. This means that they parse the whole document and build a memory image that is later used by CONNECT.
2 - MYSQL tables.
What is done for MYSQL tables is to construct a SELECT statement specifying only the columns used in the query and adding a WHERE clause retrieved by cond_push (unlike FEDERATED, which retrieves the whole table except when indexing is used). The latest version of CONNECT can also use "REMOTE" indexing.
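For instance, for a query on the table from the description such as (the column choice and value are only illustrative):

SELECT id, url FROM url_list_connect WHERE id_license = 42;

CONNECT would send to the remote server something like:

SELECT id, url, id_license FROM url_list WHERE id_license = 42;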
Unfortunately, it seems difficult to handle the LIMIT clause, firstly because engines have no way to know about it (except by re-parsing the original query), but chiefly because of queries like this:
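For instance, a query of this shape (the column name is only illustrative):

SELECT id_license, COUNT(*) FROM t1 GROUP BY id_license LIMIT 10;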
The engine is called by MariaDB to return the table data, without knowing for what purpose. Here CONNECT will construct the query:
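With the illustrative GROUP BY query above, that would be something like:

SELECT id_license FROM tab;

i.e. selecting only the columns actually used.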
(tab being the t1 source table) However, adding the LIMIT clause would be wrong because it applies to the GROUP BY result, not to the table that is grouped.
Perhaps a solution could be to retrieve the table data row by row, or by a fixed number of rows as CASSANDRA does. This is a little more complicated to implement but feasible if the MySQL API permits doing so. I am not sure whether or not it builds the result completely in memory in any case. Concerning ODBC tables, they are retrieved by CONNECT in blocks whose size is given by the BLOCK_SIZE option.
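As an illustration of the ODBC case, the block size is given on the table definition; a minimal sketch, assuming an ODBC DSN named mydsn pointing to the source database (names and credentials are only placeholders):

CREATE TABLE url_list_odbc (
  `id` int(10) unsigned NOT NULL,
  `url` varchar(2048) NOT NULL
) ENGINE=CONNECT `table_type`=ODBC
  `connection`='DSN=mydsn;UID=root;PWD=xxxx'
  `tabname`=url_list `block_size`=100;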
Meanwhile, you can browse the first rows of a big MYSQL table using a view or a SRCDEF option that includes the LIMIT clause, for instance replacing:
`tabname`=url_list
by
`srcdef`='SELECT * FROM url_list LIMIT 10'
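Applied to the table definition from the description, the workaround would give roughly:

CREATE TABLE `url_list_connect` (
  `id` int(10) unsigned NOT NULL,
  `url` varchar(2048) NOT NULL DEFAULT '',
  `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `id_license_domain` int(10) unsigned NOT NULL DEFAULT '0',
  `id_license` int(10) unsigned NOT NULL DEFAULT '0',
  `dom` mediumint(8) unsigned NOT NULL DEFAULT '0',
  `loadtime` mediumint(8) unsigned NOT NULL DEFAULT '0',
  `gen` mediumint(8) unsigned NOT NULL DEFAULT '0',
  `script` varchar(2048) NOT NULL DEFAULT '',
  `database_time` mediumint(8) unsigned NOT NULL DEFAULT '0'
) ENGINE=CONNECT DEFAULT CHARSET=latin1 `table_type`=MYSQL `dbname`=test
  `srcdef`='SELECT * FROM url_list LIMIT 10'
  `option_list`='user=root,host=localhost,password=xxxx';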