[MDEV-13820] trx_id_check() fails during row_log_table_apply() Created: 2017-09-15 Updated: 2017-12-12 Resolved: 2017-12-07 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.3 |
| Fix Version/s: | 10.3.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Elena Stepanova | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Description |
|
The test case is not deterministic. If it does not fail right away, try running it with --repeat. In the original stress tests on disk, running with innodb_flush_log_at_trx_commit=2 significantly increased the probability of the race condition (about once in a few minutes, versus once in many hours), and running in shm increased it further, to about once in a few seconds. Currently the MTR test below fails nearly every time for me, even on disk and without innodb_flush_log_at_trx_commit=2, so I cannot say whether those options have any effect on it.
Does not fail on 10.2. |
| Comments |
| Comment by Marko Mäkelä [ 2017-10-24 ] | ||||||||||||||||||||||||||
|
This assertion only exists in 10.3.
There is a race condition in the test, but in my case this transaction ID was assigned right before the transaction for ALTER TABLE t2 started. It looks like the ROLLBACK completed execution. In t2.ibd (and in the buffer pool copy of its root page 5:3) there is no user record. As part of The assertion fails on the intermediate copy that is being created for t2.ibd (page 6:3). That record must have been copied during the initial copy phase. At that point, the transaction was active. The proper fix would seem to be to relax this assertion. After all, we are applying the delete operation related to rolling back the INSERT. Currently row_merge_read_clustered_index() only resets the DB_TRX_ID for records that belonged to transactions that were committed before the ALTER TABLE started:
| ||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2017-10-24 ] | ||||||||||||||||||||||||||
|
We cannot simply relax the assertion.
Special consideration is needed when the operation involves DROP PRIMARY KEY, ADD PRIMARY KEY. | ||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2017-12-07 ] | ||||||||||||||||||||||||||
|
Here is a more deterministic version of the test. It is important that the transaction that performed the concurrent modifications was started (and assigned a transaction ID) before the ALTER TABLE was started:
This certainly highlights a thinking mistake that I made when developing the following fix:
The assertion trx_id_check() was introduced in the above fix. It is valid by design in row_merge_read_clustered_index(), because there we are performing REPEATABLE READ, so we will not see any records that were written after the ALTER TABLE. Also, thanks to the exclusive lock during ha_innobase::prepare_inplace_alter_table(), there cannot exist any older transactions that have already modified the table at the time the ALTER TABLE started the copying. So, at that point, REPEATABLE READ cannot see rows from active but earlier-started transactions. | ||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2017-12-07 ] | ||||||||||||||||||||||||||
|
When logging ROW_T_INSERT or ROW_T_UPDATE records, we did not normalize the DB_TRX_ID of the current transaction to 0 if the current transaction had started (modifying other tables) before the ALTER TABLE started. |
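The normalization described in the last comment can be illustrated with a small model. This is only a sketch and not InnoDB source code: transaction IDs are plain integers, the helper names normalize_trx_id and trx_id_check_ok are hypothetical, and the exact comparison operators are an assumption made for illustration. It shows why a writer that obtained its transaction ID before the ALTER TABLE, but modified the table afterwards, would trip the check unless its DB_TRX_ID is logged as 0.

```python
# Simplified model, not InnoDB source: transaction IDs are plain integers.
# Both helper names below are hypothetical and exist only for illustration.


def normalize_trx_id(row_trx_id: int, alter_trx_id: int) -> int:
    """Return the DB_TRX_ID to write to the online ALTER TABLE log.

    Assumption: an ID older than the ALTER TABLE's own ID is logged as 0.
    """
    return 0 if row_trx_id < alter_trx_id else row_trx_id


def trx_id_check_ok(db_trx_id: int, alter_trx_id: int) -> bool:
    """Model of the invariant that trx_id_check() asserts (assumed form):
    the stored ID is either 0 or not older than the ALTER TABLE."""
    return db_trx_id == 0 or db_trx_id >= alter_trx_id


if __name__ == "__main__":
    alter_trx_id = 10
    older_writer = 7   # got its ID before the ALTER TABLE, wrote to t2 later
    newer_writer = 12  # started after the ALTER TABLE

    # Without normalization, the older writer's ID violates the invariant.
    assert not trx_id_check_ok(older_writer, alter_trx_id)

    # With normalization, both writers satisfy it.
    for writer in (older_writer, newer_writer):
        logged = normalize_trx_id(writer, alter_trx_id)
        assert trx_id_check_ok(logged, alter_trx_id)
```

In this model the older writer corresponds to the transaction in the deterministic test: it modified other tables first, so its ID precedes that of the ALTER TABLE.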