Details
Type: New Feature
Status: Closed
Priority: Critical
Resolution: Fixed
Description
Hi,
Index condition pushdown was added in MySQL 5.7.3 for partitioned tables with commit https://github.com/mysql/mysql-server/commit/67d2e4ef917e49b0c14cee50e47a498f3f3d95d1
From what I have found so far, this feature is not yet available in MariaDB and would be a great addition.
Thanks,
Markus
Attachments
Issue Links
- causes: MDEV-33792 ANALYZE for statement shows incorrect selectivity of pushed index condition for table with partitions (Closed)
- is duplicated by: MDEV-32227 Partition Table do not using (Using index condition) ICP (Closed)
- relates to: MDEV-21625 Index condition pushdown is not used with ref access of partitioned tables (Closed)
Activity
The problem:
Some functions in class handler assume the record is read to table->record[0]:
virtual int read_range_first(const key_range *start_key,
                             const key_range *end_key,
                             bool eq_range, bool sorted);
virtual int read_range_next();
virtual int multi_range_read_next(range_id_t *range_info);
Some functions put the record into the passed buffer:
int ha_index_read_map(uchar *buf, const uchar *key,
                      key_part_map keypart_map,
                      enum ha_rkey_function find_flag);
int ha_index_next_same(uchar *buf, const uchar *key, uint keylen);
IndexConditionPushdown assumes that the index column values are unpacked into table->record[0]. This is because the pushed condition (Item *pushed_index_cond) reads the column values from there.
Typically, there's no difference as buf == table->record[0].
There are exceptions: handler::get_auto_increment(), handler::ha_check_overlaps, partitioning.
We hit the issue with partitioning. ha_partition::handle_ordered_index_scan() uses a Priority Queue to merge ordered streams of records it has read from different partitions.
It has calls like:
case partition_index_read:
  error= file->ha_index_read_map(rec_buf_ptr,
                                 m_start_key.key,
                                 m_start_key.keypart_map,
                                 m_start_key.flag);
...
case partition_index_first:
  error= file->ha_index_first(rec_buf_ptr);
  reverse_order= FALSE;
  break;
When the API can only read to table->record[0], it does so and copies the record:
case partition_read_range:
{
  /*
    This can only read record to table->record[0], as it was set when
    the table was being opened. We have to memcpy data ourselves.
  */
  error= file->read_range_first(m_start_key.key ? &m_start_key : NULL,
                                end_range, eq_range, TRUE);
  if (likely(!error))
    memcpy(rec_buf_ptr, table->record[0], m_rec_length);
  reverse_order= FALSE;
  break;
}
Currently, this sequence of SE API calls:
h->push_index_cond(cond);
h->ha_index_read_map(some_buffer, ...);
does NOT produce a valid result, for both MyISAM and InnoDB.
Both will unpack the index columns into some_buffer and then call handler_index_cond_check() which will read index columns from table->record[0].
Possible ways out
1. Adopt and document the limitation
2. Disallow reads to a side buffer
3. Make index_read_map and co. work with ICP and side buffers
Adopt and document the limitation
Document that one must not call ha_index_read_map(buffer != table->record[0], ...) when ICP is enabled, and add an assertion in the function.
ha_partition will copy the record when necessary.
Disallow reads to buffer != table->record[0]
This makes the API uniform but incurs some record copying where previously there was none.
Make index_read_map and co. work with ICP and side buffers
Let them unpack the index columns into table->record[0] for checking. If the check succeeds and we are reading into a side buffer, unpack into *buffer.
ha_partition will not need any modification; some extra copying will be done in the engine.
... We decided to use yet another solution, based on table->move_fields(). Dave hit a roadblock with it: something crashes inside InnoDB.
This fixes the InnoDB problem for me:
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index 93127bb1c3a..986056dade1 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -817,7 +817,8 @@ get_field_offset(
 	const TABLE*	table,
 	const Field*	field)
 {
-	return field->offset(table->record[0]);
+//	return field->offset(table->record[0]);
+	return field->offset(field->record_ptr());
 }
Full diff: https://gist.github.com/spetrunia/abbe5daa0899c85b79f5f3d66c59cd50 .
Gosselin, can you continue with the move_fields solution using this?
A side note.
void TABLE::move_fields(Field **ptr, const uchar *to, const uchar *from)

Would it make sense to add an assert that the fields actually point into the record pointed to by "from"? ... This is outside the scope of this MDEV.
Better to change field.h as follows:
--- a/sql/field.h
+++ b/sql/field.h
@@ -1528,8 +1528,7 @@ class Field: public Value_source
   inline void move_field(uchar *ptr_arg) { ptr=ptr_arg; }
   inline uchar *record_ptr() // record[0] or wherever the field was moved to
   {
-    my_ptrdiff_t offset= table->s->field[field_index]->ptr - table->s->default_values;
-    return ptr - offset;
+    return ptr - offset();
   }
   virtual void move_field_offset(my_ptrdiff_t ptr_diff)
   {
@@ -1636,9 +1635,9 @@ class Field: public Value_source
   { return max_length;}
   virtual bool is_packable() const { return false; }

-  uint offset(const uchar *record) const
+  uint offset() const
   {
-    return (uint) (ptr - record);
+    return table->s->field[field_index]->ptr - table->s->default_values;
   }
   void copy_from_tmp(int offset);
   uint fill_cache_field(struct st_cache_field *copy);
psergei, thank you for taking a look, but your patch doesn't pass the parts suite when applied on top of my changes.
serg, I'm not sure what your patch is trying to accomplish, but it doesn't compile: the change to the offset method changes its contract, which affects callers. Of course I can debug, but what is your intent?
Separately, both of your patches are changes independent of (although uncovered by) my feature change. IMHO they should be delivered apart from my change for the sake of (1) sound engineering practice and (2) future software archaeology. If they introduce a bug and land together with my feature, a future maintainer will have lots of extra work to determine where in the commit the problem lies. If we deliver them apart from my feature, the commit history will be clearer.
Additionally, I don't understand what bugs they are fixing, because there is no description of the fault.
Confirmed, there were many tests failing in suite/parts.
The fix in InnoDB was incomplete. This seems to fix them all:
diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index 93127bb1c3a..1d8e1c1bf1c 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -5507,7 +5508,7 @@ innobase_vcol_build_templ(
 	if (field->real_maybe_null()) {
 		templ->mysql_null_byte_offset =
-			field->null_offset();
+			field->null_offset(field->record_ptr());

 		templ->mysql_null_bit_mask = (ulint) field->null_bit;
 	} else {
@@ -7221,7 +7222,7 @@ build_template_field(
 	if (field->real_maybe_null()) {
 		templ->mysql_null_byte_offset =
-			field->null_offset();
+			field->null_offset(field->record_ptr());

 		templ->mysql_null_bit_mask = (ulint) field->null_bit;
 	} else {
Let me get a buildbot run and document this.
Asserts are also needed.
Will also look at incorporating Serg's input.
serg, I tried implementing your suggestion and it didn't work. Temporary (aka "work") tables have table->s->field == NULL, and there are many calls to offset() for the fields of temporary tables.
A patch from attempt to implement it: https://gist.github.com/spetrunia/4e8d8bdded2f5c698bf3ff385bb272cd .
Yes, this means record_ptr() won't work for temporary tables either:

inline uchar *record_ptr() // record[0] or wherever the field was moved to
{
  my_ptrdiff_t offset= table->s->field[field_index]->ptr - table->s->default_values;
  return ptr - offset;
}
The function is only used in a few places and was introduced in:
commit d137b4dbbac1ce53906ca15817334e3a4daa2655
Author: Sergei Golubchik <serg@mariadb.org>
Date:   Wed Nov 23 17:33:40 2016 +0100

    MDEV-5800 MyISAM support for indexed vcols
so it is just not invoked for temporary tables.
Also note that the following doesn't always hold, even when fields are pointing into table->record[0]:
table->record[0] + table->s->null_bytes == table->field[0]->ptr
so it's not trivial to find which record the fields are pointing to.
Gosselin, extra asserts we've discussed:
https://github.com/MariaDB/server/pull/3123
Merged the above with ANALYZE support:
bb-11.4-MDEV-18478-v4-MDEV-12404

Testing done. OK to push. Please don't forget the changes from branch bb-11.4-MDEV-18478-v4-MDEV-12404.
I rebased to 11.5 and will merge once the buildbot checks pass on the PR: https://github.com/MariaDB/server/pull/3040
A simple testcase:
shows
{
"query_block": {
"select_id": 1,
"cost": 0.02196592,
"nested_loop": [
{
"table": {
"table_name": "t1",
"partitions": ["p0", "p1", "p2", "p3"],
"access_type": "range",
"possible_keys": ["a"],
"key": "a",
"key_length": "5",
"used_key_parts": ["a"],
"loops": 1,
"rows": 9,
"cost": 0.02196592,
"filtered": 100,
"attached_condition": "t1.a < 10 and t1.b + 1 > 3"
}
}
]
}
}
Note that the conditions are in attached_condition.
For comparison, let's run the same on a non-partitioned table:
shows
{
"query_block": {
"select_id": 1,
"cost": 0.0146548,
"nested_loop": [
{
"table": {
"table_name": "t2",
"access_type": "range",
"possible_keys": ["a"],
"key": "a",
"key_length": "5",
"used_key_parts": ["a"],
"loops": 1,
"rows": 9,
"cost": 0.0146548,
"filtered": 100,
"index_condition": "t2.a < 10 and t2.b + 1 > 3"
}
}
]
}
}
Now, the conditions are in the index_condition.