[MDEV-25546] LIMIT partitioning does not respect ROLLBACK - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.3(EOL), 10.4(EOL), 10.5, 10.6
Fix Version/s: 10.7.4, 10.3.35, 10.4.25, 10.5.16, 10.6.8, 10.8.3
Component/s: Partitioning, Versioned Tables
Labels:
None

Description

LIMIT partition temporarily populated in transaction is not used anymore after ROLLBACK

Note: There can be a race condition/non-determinism during MTR execution of the test case, as statistics on an InnoDB table is not precise (described in not-a-bug ~~MDEV-23640~~). I can't put ANALYZE in the test case, as it terminates the transaction. For reproducing, just re-run or put sleeps if necessary. For a regression suite, something smarter would need to be done to maintain determinism.

If we have several empty (not full) LIMIT partitions, rows are put in the first one first, when it's overflows then next, etc – that's the expected behavior.
Same happens when partition population happens inside a transaction.
However if the transaction is rolled back and partitions again become empty, the orderly population does not happen anymore – partitions which were once (over)filled remain ignored, and rows are put into the next ones.

It affects both explicit and implicit (including auto-created) partitions. The test case below is for explicit ones, to be applicable to all versions which support system versioning.

--source include/have_partition.inc

--source include/have_sequence.inc

--source include/have_innodb.inc

create or replace table t1 (pk int primary key) with system versioning

partition by system_time limit 100 (

  partition p0 history,

  partition p1 history,

  partition pn current);

insert into t1 select seq from seq_1_to_90;

start transaction;

# Puts 80 rows into p0

replace into t1 select seq from seq_1_to_80;

# Puts another 70 rows into p0

replace into t1 select seq from seq_1_to_70;

# Puts 60 rows into p1

replace into t1 select seq from seq_1_to_60;

select partition_name, table_rows

from information_schema.partitions

where table_name = 't1';

rollback;

select partition_name, table_rows

from information_schema.partitions

where table_name = 't1';

# Should put 10 rows into the empty partition p0,

# but instead they go to p1

replace into t1 select seq from seq_1_to_10;

select partition_name, table_rows

from information_schema.partitions

where table_name = 't1';

 # Cleanup

drop table t1;

10.3 4d412e98
replace into t1 select seq from seq_1_to_80;
replace into t1 select seq from seq_1_to_70;
replace into t1 select seq from seq_1_to_60;
select partition_name, table_rows from information_schema.partitions where table_name = 't1';
partition_name table_rows
p0 150
p1 60
pn 90
rollback;
select partition_name, table_rows from information_schema.partitions where table_name = 't1';
partition_name table_rows
p0 0
p1 0
pn 90
replace into t1 select seq from seq_1_to_10;
select partition_name, table_rows from information_schema.partitions where table_name = 't1';
partition_name table_rows
p0 0
p1 10
pn 90

Attachments

Activity

Ascending order - Click to sort in descending order

Aleksey Midenkov added a comment - 2022-03-28 15:04

Proposed fix

vers_info->hist_part must be reverted on rollback for each table that was changed by transaction. vers_info->hist_part is initalized as first history partition in check_vers_constants() and then recalculated in vers_set_hist_part(). vers_info is stored in TABLE object (in part_info).

Processing each TABLE on rollback is not possible without help of storage engine: ha_rollback_trans() can access only handlertons and the more detailed data about what was changed by transaction is known only by storage engine.

InnoDB stores knowledge about what tables were modified into trx->mod_tables. This data is processed on partial rollback for trx-versioning needs (0b89a42ffc7). The structure is map with key dict_table_t *.

Proposed InnoDB changes:

1. trx_rollback_to_savepoint_low() calls mod_tables.rollback() not only for partial rollback but for full rollback as well (savept == NULL).
2. mod_tables.rollback() accesses TABLE object and calls TABLE::rollback().
3. TABLE::rollback() updates vers_info->hist_part.

Changes for p. 2:

1. Generally there is no stored connection between dict_table_t and TABLE. The connection is done only by innodb_find_table_for_vc() for vcol needs.

handler::ha_open() has TABLE object, handler::ha_open() calls handler::open() without TABLE object. That is virtual interface implemented by ha_innobase::open(). The goal is to supply TABLE to ha_innobase::open() and assign it to dict_table_t. That can be done like that in handler::ha_open():

  if (unlikely((error=open2(table_arg, name,mode,test_if_locked))))

...

  if (error == HA_NOT_IMPLEMENTED &&

      unlikely((error=open(name,mode,test_if_locked))))

    if ((error == EACCES || error == EROFS) && mode == O_RDWR &&

	(table->db_stat & HA_TRY_READ_ONLY))

      table->db_stat|=HA_READ_ONLY;

      error=open(name,O_RDONLY,test_if_locked);

handler::ha_open() first tries to call open2() which by default returns HA_NOT_IMPLEMENTED. ha_innobase has this interface: we rename ha_innobase::open() to ha_innobase::open2() and pass table_arg argument.

2. innodb_find_table_for_vc() now is not needed. We remove that function and table->vc_templ->mysql_table member. As far as I'm concerned this condition holds all tests except one below case:

--- a/storage/innobase/handler/ha_innodb.cc

+++ b/storage/innobase/handler/ha_innodb.cc

@@ -20951,6 +20951,8 @@ static TABLE* innodb_find_table_for_vc(THD* thd, dict_table_t* table)

        TABLE* mysql_table = find_fk_open_table(thd, db_buf, db_buf_len,

                                                tbl_buf, tbl_buf_len);

+       ut_ad(table->vc_templ->mysql_table_query_id == ~0ULL ||

+               table->vc_templ->mysql_table == mysql_table);

        table->vc_templ->mysql_table = mysql_table;

        table->vc_templ->mysql_table_query_id = thd_get_query_id(thd);

        return mysql_table;

The condition it cannot hold is this:

--- a/mysql-test/suite/gcol/t/innodb_virtual_fk.test

+++ b/mysql-test/suite/gcol/t/innodb_virtual_fk.test

@@ -687,6 +687,7 @@ INSERT INTO email_stats (id, email_id,  date_sent) VALUES (1,1,'Jan');

 INSERT INTO emails_metadata VALUES (1);

 UPDATE emails SET id=2;

+flush tables;

 DELETE FROM emails;

 DROP TABLE email_stats;

But this is OK, since TABLE member will be updated at each ha_innobase::open2() call.

Aleksey Midenkov added a comment - 2022-03-28 15:04 Proposed fix vers_info->hist_part must be reverted on rollback for each table that was changed by transaction. vers_info->hist_part is initalized as first history partition in check_vers_constants() and then recalculated in vers_set_hist_part() . vers_info is stored in TABLE object (in part_info ). Processing each TABLE on rollback is not possible without help of storage engine: ha_rollback_trans() can access only handlertons and the more detailed data about what was changed by transaction is known only by storage engine. InnoDB stores knowledge about what tables were modified into trx->mod_tables . This data is processed on partial rollback for trx-versioning needs (0b89a42ffc7). The structure is map with key dict_table_t * . Proposed InnoDB changes: 1. trx_rollback_to_savepoint_low() calls mod_tables.rollback() not only for partial rollback but for full rollback as well ( savept == NULL ). 2. mod_tables.rollback() accesses TABLE object and calls TABLE::rollback() . 3. TABLE::rollback() updates vers_info->hist_part . Changes for p. 2: 1. Generally there is no stored connection between dict_table_t and TABLE . The connection is done only by innodb_find_table_for_vc() for vcol needs. handler::ha_open() has TABLE object, handler::ha_open() calls handler::open() without TABLE object. That is virtual interface implemented by ha_innobase::open() . The goal is to supply TABLE to ha_innobase::open() and assign it to dict_table_t . That can be done like that in handler::ha_open() : if (unlikely((error=open2(table_arg, name,mode,test_if_locked)))) { ... } if (error == HA_NOT_IMPLEMENTED && unlikely((error=open(name,mode,test_if_locked)))) { if ((error == EACCES || error == EROFS) && mode == O_RDWR && (table->db_stat & HA_TRY_READ_ONLY)) { table->db_stat|=HA_READ_ONLY; error=open(name,O_RDONLY,test_if_locked); } } handler::ha_open() first tries to call open2() which by default returns HA_NOT_IMPLEMENTED. ha_innobase has this interface: we rename ha_innobase::open() to ha_innobase::open2() and pass table_arg argument. 2. innodb_find_table_for_vc() now is not needed. We remove that function and table->vc_templ->mysql_table member. As far as I'm concerned this condition holds all tests except one below case: --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -20951,6 +20951,8 @@ static TABLE* innodb_find_table_for_vc(THD* thd, dict_table_t* table) TABLE* mysql_table = find_fk_open_table(thd, db_buf, db_buf_len, tbl_buf, tbl_buf_len); + ut_ad(table->vc_templ->mysql_table_query_id == ~0ULL || + table->vc_templ->mysql_table == mysql_table); table->vc_templ->mysql_table = mysql_table; table->vc_templ->mysql_table_query_id = thd_get_query_id(thd); return mysql_table; The condition it cannot hold is this: --- a/mysql-test/suite/gcol/t/innodb_virtual_fk.test +++ b/mysql-test/suite/gcol/t/innodb_virtual_fk.test @@ -687,6 +687,7 @@ INSERT INTO email_stats (id, email_id, date_sent) VALUES (1,1,'Jan'); INSERT INTO emails_metadata VALUES (1); UPDATE emails SET id=2; +flush tables; DELETE FROM emails; DROP TABLE email_stats; But this is OK, since TABLE member will be updated at each ha_innobase::open2() call.

Aleksey Midenkov added a comment - 2022-03-28 15:06

marko Please review the above proposed changes to InnoDB.

Aleksey Midenkov added a comment - 2022-03-28 15:06 marko Please review the above proposed changes to InnoDB.

Marko Mäkelä added a comment - 2022-03-29 05:59

As far as I understand, there is no prototype implementation of the proposed idea yet. I would like to see a prototype both for the earliest suggested version and for 10.6, where the InnoDB transaction commit was heavily refactored.

The InnoDB dict_table_t actually corresponds to TABLE_SHARE in the SQL layer and not TABLE. The TABLE is a table handle or a cursor, which would roughly correspond to row_prebuilt_t in InnoDB.

It is unclear to me how the proposed idea would work on incomplete transactions that were recovered after the server was killed and restarted. A quick look in the code suggests that trx_t::mod_tables are only being updated for recovered transactions that are in the XA PREPARE state (and will remain in that state until an explicit commit or rollback statement).

When it comes to innodb_find_table_for_vc(), nikitamalyavin should know this code. I think that it is unfortunate that the MySQL 5.7 implementation was imported to MariaDB 10.2 without thorough review. Implementing MDEV-17598 and MDEV-22361 would allow us to eventually remove that code.

When it comes to partitioned tables, nayuta-yanagisawa is the subsystem owner and should be asked to review those changes.

If this functionality involves the dynamic creation or dropping of partitions during the execution of DML transactions, I think that it is problematic. Until ~~MDEV-25506~~ and ~~MDEV-25919~~ it was impossible to mix DDL and DML operations in a single InnoDB transaction, and I would say that it still is a bad idea, because it would cause locking conflicts on the InnoDB dictionary tables.

Marko Mäkelä added a comment - 2022-03-29 05:59 As far as I understand, there is no prototype implementation of the proposed idea yet. I would like to see a prototype both for the earliest suggested version and for 10.6, where the InnoDB transaction commit was heavily refactored. The InnoDB dict_table_t actually corresponds to TABLE_SHARE in the SQL layer and not TABLE . The TABLE is a table handle or a cursor, which would roughly correspond to row_prebuilt_t in InnoDB. It is unclear to me how the proposed idea would work on incomplete transactions that were recovered after the server was killed and restarted. A quick look in the code suggests that trx_t::mod_tables are only being updated for recovered transactions that are in the XA PREPARE state (and will remain in that state until an explicit commit or rollback statement). When it comes to innodb_find_table_for_vc() , nikitamalyavin should know this code. I think that it is unfortunate that the MySQL 5.7 implementation was imported to MariaDB 10.2 without thorough review. Implementing MDEV-17598 and MDEV-22361 would allow us to eventually remove that code. When it comes to partitioned tables, nayuta-yanagisawa is the subsystem owner and should be asked to review those changes. If this functionality involves the dynamic creation or dropping of partitions during the execution of DML transactions, I think that it is problematic. Until MDEV-25506 and MDEV-25919 it was impossible to mix DDL and DML operations in a single InnoDB transaction, and I would say that it still is a bad idea, because it would cause locking conflicts on the InnoDB dictionary tables.

Aleksey Midenkov added a comment - 2022-03-29 10:23

It is unclear to me how the proposed idea would work on incomplete transactions that were recovered after the server was killed and restarted. A quick look in the code suggests that trx_t::mod_tables are only being updated for recovered transactions that are in the XA PREPARE state (and will remain in that state until an explicit commit or rollback statement).

In this bug we solve stale data in TABLE object. On server restart of course any data is reloaded.

Aleksey Midenkov added a comment - 2022-03-29 10:23 It is unclear to me how the proposed idea would work on incomplete transactions that were recovered after the server was killed and restarted. A quick look in the code suggests that trx_t::mod_tables are only being updated for recovered transactions that are in the XA PREPARE state (and will remain in that state until an explicit commit or rollback statement). In this bug we solve stale data in TABLE object. On server restart of course any data is reloaded.

Aleksey Midenkov added a comment - 2022-03-31 10:50

The InnoDB dict_table_t actually corresponds to TABLE_SHARE in the SQL layer and not TABLE. The TABLE is a table handle or a cursor, which would roughly correspond to row_prebuilt_t in InnoDB.

You right. Moreover lifetimes of TABLE_SHARE and dict_table_t are independent by design. So we have to acquire TABLE_SHARE each time we need it in InnoDB.

Aleksey Midenkov added a comment - 2022-03-31 10:50 The InnoDB dict_table_t actually corresponds to TABLE_SHARE in the SQL layer and not TABLE. The TABLE is a table handle or a cursor, which would roughly correspond to row_prebuilt_t in InnoDB. You right. Moreover lifetimes of TABLE_SHARE and dict_table_t are independent by design. So we have to acquire TABLE_SHARE each time we need it in InnoDB.

Aleksey Midenkov added a comment - 2022-04-01 06:16

Then the simplest solution is to process partitions each time from start for LIMIT in vers_set_hist_part().

Aleksey Midenkov added a comment - 2022-04-01 06:16 Then the simplest solution is to process partitions each time from start for LIMIT in vers_set_hist_part().

Aleksey Midenkov added a comment - 2022-04-01 07:05

Please review bb-10.3-midenok

Aleksey Midenkov added a comment - 2022-04-01 07:05 Please review bb-10.3-midenok

Nikita Malyavin added a comment - 2022-04-19 14:47

ok to push

Nikita Malyavin added a comment - 2022-04-19 14:47 ok to push

People

Assignee:: Aleksey Midenkov

Reporter:: Elena Stepanova

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2021-04-27 18:38

Updated:: 2022-04-22 13:25

Resolved:: 2022-04-22 13:25

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server