[MDEV-13654] Various crashes due to DB_TRX_ID mismatch in table-rebuilding ALTER TABLE…LOCK=NONE
| Created: | 2017-08-26 | Updated: | 2020-07-23 | Resolved: | 2017-09-01 |
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Data Definition - Alter Table, Storage Engine - InnoDB |
| Affects Version/s: | 10.3.1 |
| Fix Version/s: | 10.3.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Elena Stepanova | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Description |
|
NOTE: This test case is for reproducing only, don't put it into the regression suite!
|
| Comments |
| Comment by Marko Mäkelä [ 2017-08-28 ] |
|
This is definitely caused by There are a few options:
elenst, can you observe any other failures in row0log.cc when there are no concurrent DELETE operations? |
| Comment by Elena Stepanova [ 2017-08-28 ] |
|
> Elena Stepanova, can you observe any other failures in row0log.cc when there are no concurrent DELETE operations?

I haven't yet, but there hasn't been enough testing to say for sure that there are none. I'll keep watching. A side note (unrelated to |
| Comment by Marko Mäkelä [ 2017-08-31 ] |
|
The innodb.innodb-table-online test that I imported into 10.0 (and merged up to 10.2 so far) as part of
|
| Comment by Marko Mäkelä [ 2017-08-31 ] |
|
I made a simple change that makes row_merge_read_clustered_index() reset DB_TRX_ID,DB_ROLL_PTR in the row->fields[] when the history is not needed (when the DB_TRX_ID refers to a no-longer-active transaction). The remaining question is: Do the callers of row_log_table_open() need any adjustment?

A further observation is that ha_innobase::prepare_inplace_alter_table() acts as a barrier. During the execution of this function, both an MDL_EXCLUSIVE lock on the table name and a LOCK_X on the dict_table_t will be held for some time. These locks cannot be granted until any concurrent transactions that accessed these tables have finished (the MDL_EXCLUSIVE blocks even non-locking MVCC reads). So, any active transaction that was observed during ha_innobase::inplace_alter_table() in row_merge_read_clustered_index() or thereafter would necessarily have been started after ha_innobase::prepare_inplace_alter_table() returned and the exclusive lock was downgraded.

Note: After a table-rebuilding ALTER TABLE…LOCK=NONE, it could be unavoidable to have some nonzero DB_TRX_ID columns in the table. Especially with DROP PRIMARY KEY, ADD PRIMARY KEY, the apply logic partly identifies the rows by the combination of PRIMARY KEY and DB_TRX_ID. These nonzero DB_TRX_ID values would necessarily refer to concurrent DML operations that were started during ha_innobase::inplace_alter_table(). |
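For illustration only, the reset rule described above can be sketched in standalone C++. This is not the actual patch: `RowSysCols`, `reset_if_history_not_needed()`, the set of active transaction IDs, and the placeholder roll pointer value `kResetRollPtr` are all hypothetical names and assumptions made for this sketch, and the real on-disk encoding of a reset DB_ROLL_PTR is not described in this ticket.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical stand-ins for illustration; these are NOT InnoDB's real types.
using trx_id_t = std::uint64_t;
using roll_ptr_t = std::uint64_t;

// The two hidden system columns of a clustered index record.
struct RowSysCols {
    trx_id_t db_trx_id;
    roll_ptr_t db_roll_ptr;
};

// Placeholder for "no history"; the real reset encoding is an assumption here.
constexpr roll_ptr_t kResetRollPtr = 0;

// Reset DB_TRX_ID,DB_ROLL_PTR when the history is not needed, that is, when
// DB_TRX_ID refers to a transaction that is no longer active.
// Returns true if the columns were changed.
bool reset_if_history_not_needed(RowSysCols& row,
                                 const std::unordered_set<trx_id_t>& active) {
    if (row.db_trx_id == 0) {
        return false;  // already reset
    }
    if (active.count(row.db_trx_id) > 0) {
        return false;  // concurrent DML: the history is still needed
    }
    row.db_trx_id = 0;
    row.db_roll_ptr = kResetRollPtr;
    return true;
}
```

In this model, a row written by a long-finished transaction is copied into the rebuilt table with DB_TRX_ID=0, while a row last modified by a transaction that is still active keeps its original system columns.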
| Comment by Marko Mäkelä [ 2017-08-31 ] |
|
One more consideration is what happens in an online table-rebuilding ALTER TABLE after upgrading from MariaDB 10.3.0 or earlier. With the above-mentioned change to row_merge_read_clustered_index(), the initial rebuild of the table would reset DB_TRX_ID,DB_ROLL_PTR in the table.

For row_log_table_delete(), the fix seems simple: treat the parameter sys=NULL as a request to reset the DB_TRX_ID,DB_ROLL_PTR in the log record. The only caller with a non-NULL value is the ROLLBACK of an UPDATE, which needs to reset the original value. (As noted above, such a rollback can only be a rollback of a transaction that was started after ha_innobase::prepare_inplace_alter_table() returned.) If the update that is now being rolled back was operating on a record whose DB_TRX_ID referred to a purged transaction (from before the upgrade), then we should reset the fields in the log record.

For row_log_table_update(), there is the parameter old_pk=row_log_table_get_pk() that is used during ADD PRIMARY KEY. The DB_TRX_ID in the old_pk parameter could refer to a purged transaction. If that is the case, we should reset the DB_TRX_ID,DB_ROLL_PTR in old_pk.

With these changes to the logging, row_log_table_apply() should always find a match between the rebuilt table (which would contain DB_TRX_ID=0 most of the time) and the online_log.

Again, the problem only affects a table-rebuilding ALTER TABLE…LOCK=NONE. There is also row_log_apply() for ADD INDEX and ADD UNIQUE INDEX when the table is not being rebuilt, and that logging is unaffected by this, because secondary indexes do not contain DB_TRX_ID columns (only the PAGE_MAX_TRX_ID in leaf pages). |
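A simplified model of the logging rule proposed above, again as standalone C++ and not as the real row0log.cc code. `LogSysCols`, `DeleteLogRec`, `log_delete()` and `apply_matches()` are hypothetical names invented for this sketch, the integer primary key and the zero reset value are assumptions, and the match test only mirrors the statement that rows are partly identified by the combination of PRIMARY KEY and DB_TRX_ID.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical stand-ins for illustration; NOT the real row0log.cc structures.
using trx_id_t = std::uint64_t;

struct LogSysCols {
    trx_id_t trx_id;
    std::uint64_t roll_ptr;
};

struct DeleteLogRec {
    int pk;          // simplified single-column PRIMARY KEY
    LogSysCols sys;  // DB_TRX_ID,DB_ROLL_PTR written to the online log
};

// Assumed "reset" value for this sketch.
constexpr LogSysCols kResetSys{0, 0};

// Model of the proposed rule: sys == nullptr requests a reset, and a non-null
// sys whose DB_TRX_ID is no longer active (purged) is reset as well.
DeleteLogRec log_delete(int pk, const LogSysCols* sys,
                        const std::unordered_set<trx_id_t>& active) {
    if (sys == nullptr || active.count(sys->trx_id) == 0) {
        return DeleteLogRec{pk, kResetSys};
    }
    return DeleteLogRec{pk, *sys};
}

// Model of the apply step: a log record matches a row of the rebuilt table
// when both PRIMARY KEY and DB_TRX_ID agree.
bool apply_matches(const DeleteLogRec& rec, int table_pk,
                   const LogSysCols& table_sys) {
    return rec.pk == table_pk && rec.sys.trx_id == table_sys.trx_id;
}
```

Under these assumptions, a log record that kept a stale DB_TRX_ID from before the rebuild would fail to match a rebuilt row carrying DB_TRX_ID=0, which is the kind of mismatch the ticket title refers to; resetting the fields at logging time keeps the two sides consistent.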
| Comment by Marko Mäkelä [ 2017-08-31 ] |
| Comment by Jan Lindström (Inactive) [ 2017-09-01 ] |
|
ok to push. |