[MDEV-15140] Implement Partial / Filtered Indexes Created: 2018-01-31  Updated: 2024-01-18

Status: Open
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Data Definition - Create Table, Optimizer, Storage Engine - InnoDB
Fix Version/s: None

Type: New Feature Priority: Major
Reporter: Pavel Cibulka Assignee: Ralf Gebhardt
Resolution: Unresolved Votes: 32
Labels: index

Issue Links:
Blocks
is blocked by MDEV-17567 Atomic DDL Closed
is blocked by MDEV-17598 InnoDB index option for per-record tr... Open
Relates
relates to MDEV-11655 Transactional data dictionary Open

 Description   

It would be nice, if MariaDB has some way to index only subset of values / rows.

For example:
We have 1% of rows value set to "active" and 99% of rows set to "inactive". There is no need to index "inactive" rows. It would be similar to table scan. But indexing "active" rows would work well.

This feature is called Partial Index or Filtered Index in other databases.
PostgreSQL:
https://www.postgresql.org/docs/current/static/indexes-partial.html
Microsoft SQL Server:
https://docs.microsoft.com/en-us/sql/relational-databases/indexes/create-filtered-indexes
Oracle (it can be created by Function-based index and fact that Oracle doesn't indexing NULL values):
http://dba-presents.com/index.php/databases/oracle/41-filtered-index-equivalent-in-oracle

There is no way to create this index in MySQL / Mariadb. Closest thing would be using another table and trigger. This complicate a lot of things and isn't much useful.

Feature request in MySQL: https://bugs.mysql.com/bug.php?id=76631



 Comments   
Comment by Nuno [ 2019-06-18 ]

Was just looking to see if this existed in MariaDB.
It's an absolute waste to store 100% of the data, when I just want to index 1% of the data, based on a column value.

Not only that, SQL Server (at least) also has a very useful thing with this, where you can create a UNIQUE index with a WHERE clause, so that the uniqueness only applies for that criteria. This is great if we want, for example, only 1 active entry per group, while all other entries for those groups are disabled.

Hopefully this will be considered. Thank you!

Comment by Federico Razzoli [ 2019-08-17 ]

Please note how this feature makes life easier with temporal tables. An index could be on:

  • current data
  • historical data not older than X
  • all historical data

All these make sense to optimise certain analytics queries and would potentially avoid a relevant waste of disk space.

Comment by Dean Trower [ 2019-09-01 ]

Nuno, for your 2nd case (UNIQUE index with a WHERE clause), you can already achieve that by creating a generated column (either VIRTUAL or STORED is ok) that evaluates to a unique index key for those rows matching your desired WHERE clause, and evaluates to NULL for all other rows. Then create a UNIQUE index on that column: The uniqueness constraint doesn't apply to the NULL rows, but does apply to all the others.

It's still a waste of space, though, as the index still does reference every row.

(BTW in MySQL since v8.0.13 you can now index directly on functions/expressions, so you don't even need to create the virtual column in that DB).

Comment by Nuno [ 2019-09-01 ]

Thank Dean.

However,

> It's still a waste of space, though, as the index still does reference every row.

This is exactly what I want to avoid

But yes, VIRTUAL/PERSISTENT is perfect for the condition I was mentioning of "1 active entry per group", etc...

Comment by David Taylor [ 2020-04-08 ]

Curious if any progress has been made on this. I have utilized this multiple times with MSSQL. It is very useful when you need some type of work queue, so for example if you have a nullable processed_time (or similar) field that indicates it still needs to be processed if the value is NULL. You can create a filtered index on this property to only pick up the NULL rows so that your query will be very performant when querying for new tasks.

Comment by Marko Mäkelä [ 2020-09-16 ]

I would like to ‘dumb down’ InnoDB if possible and rather reduce, not increase, the number of special index types. But, this feature seems to be simply make a secondary index cover a subset of the rows in the table. It sounds fairly simple, especially once we have per-record transaction identifiers (MDEV-17598) and can simplify the purge of history accordingly. I have been toying with a similar idea in the past, for omitting NULL values from an index.

Because I would like to eliminate the internal InnoDB data dictionary in some form (MDEV-11655) at some point of the future, I would rather not extend the InnoDB data dictionary with this. The metadata would have to be added to the .frm file. InnoDB cannot really trust .frm files for internal operations (such as rollback and purge) until MDEV-17567 makes all operations on .frm files atomic (crash-safe).

I am pretty sure that also the query executor and planner would have to be aware of this feature.

Comment by Marko Mäkelä [ 2022-02-24 ]

Even though MDEV-17567 was fixed, the .frm files are not available during InnoDB startup when incomplete transactions are being rolled back, which may be before any DDL recovery takes place. The metadata on indexes needs to be written in transactional storage, and I would not want to extend the current InnoDB SYS_INDEXES or SYS_FIELDS tables to duplicate any new type of metadata.

Comment by Johannes B [ 2022-02-25 ]

Hi @marko , what are the implications of:

> The metadata on indexes needs to be written in transactional storage, and I would not want to extend the current InnoDB `SYS_INDEXES` or `SYS_FIELDS` tables to duplicate any new type of metadata.

Does this mean that this issue is not to be implemented any more and rather becomes a WONTFIX?
Or do you have an alternative solution in mind?

Comment by Tom Conlon [ 2022-03-13 ]

Interbase/Firebird was first DB to support blobs and other early items (domains, cursors, etc) and usually closely folllows ansi standards.
Their implementation seems straightforward and no need for virtual or other columns,
If the where clause matches the index expression then it is used
Is this a possible route to go down to implement partial/filtered indexes?
https://firebirdsql.org/rlsnotesh/indexing-expression.html

CREATE [UNIQUE] [ASC[ENDING] | DESC[ENDING]] INDEX <index name> ON <table name> COMPUTED BY ( <value expression> )
eg.
CREATE INDEX IDX2 ON T2 COMPUTED BY ( EXTRACT(YEAR FROM COL2) || EXTRACT(MONTH FROM COL2) );
...
SELECT * FROM T2 ORDER BY EXTRACT(YEAR FROM COL2) || EXTRACT(MONTH FROM COL2)
– PLAN (T2 ORDER IDX2)

Comment by VAROQUI Stephane [ 2022-03-14 ]

The more i think on it the more i tend to think that some tables should force ROW FORMAT login. A new CDC plugin binlog synchronus or assynchronus consumer would update an extra materialized table in case of partial index , the parser rewrite queries by injecting extra subquery given by the plugin in relation to a specific table primary key. This could work for materialized view, partial index and external multi table index like sphinx where the consumer would update rt_index using such plugin. Table definition would store a list of CDC plugin to call with row based images of the records trigger could be a specific implementation of such plugins

Comment by Michael Widenius [ 2022-07-18 ]

Benefit of a partial index:

  • Smaller disk space
  • Smaller memory usage/easier to cache (not that relevant in today words when memory is cheap)

Disadvantages:

  • Makes optimizer much more complex
  • Helps mostly (only?) when the indexed columns are compared with constants: 'index_column=1000' but cannot be used with 'index_column=other_column'
  • Most databases seam to support only trivial expression for partial indexes.
  • Would have to be implemented separately in each engine (may be a problem for
    engines that needs this information during recovery where the .frm file (and thus the partial index expression is not available).

The one case where partial indexes makes sense is to remove NULL values.
The reason for this is that the optimizer can in this case still use the index for
'index_column=other_column' as this would never match a NULL value.
This would require almost no changes in the optimizer and very small changes
in the engines that should support this.

I suggest that we start by adding support for:
CREATE INDEX ... WHERE column is not null.
The engine supporting this would just need to store one bit per index part to flag
if nulls should be stored in the index.

Comment by Federico Razzoli [ 2022-07-18 ]

I wouldn't say that NULL is the one case. It wouldn't cover (most of the times) the most important use case I can think of: data warehousing.

Fact tables may have "optional dimensions". Of course they could be represented with NULL, but that would makes dwh queries even longer than they usually are. Instead they correctly add a nully row into dimensions:

INSERT INTO dim_city (city_id, city_name, country_id) VALUES (1, '', NULL);

In this case, in fact tables I'd want to exclude rows WHERE city_id = 1 OR device_id = 1 OR ...

Comment by Nuno [ 2022-07-18 ]

Yep - totally agree. 90% of a table might sometimes have a "city_id = 1" (using Frederico's example), but I want to index the other 10% of that table that has other city_id values.

I've just read a comment I wrote back in 2019, saying:
"But yes, VIRTUAL/PERSISTENT is perfect for the condition I was mentioning of "1 active entry per group", etc..."

The problem with VIRTUAL/PERSISTENT is that is totally breaks ALTER ONLINE for that table, which is a big problem, so I'm actually avoiding using this in the future, and migrating away from using it completely.

Generated at Thu Feb 08 08:18:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.