[MDEV-23244] ALTER TABLE…ADD PRIMARY KEY fails to flag duplicate key error from concurrent DML Created: 2020-07-21 Updated: 2020-07-23 Resolved: 2020-07-23 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.3.1, 10.3, 10.4, 10.5 |
| Fix Version/s: | 10.3.24, 10.4.14, 10.5.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | ddl, not-10.1, not-10.2, regression-10.3, rr-profile-analyzed | ||
| Issue Links: |
|
||||||||||||||||
| Description |
|
This failure had been reported in There are multiple assertions log->blobs in the function row_log_table_apply_update(). The one that fails in the rr replay trace appears to be specific to 10.3 and later, likely caused by
In the trace that I analyzed, old_trx_id->data is reset_trx_id. The value was assigned earlier:
The code in row_log_table_get_pk() is directly related to this:
We should adjust the condition to take log->min_trx into account. At the time of the mismatch, we have mach_read_from_6(rec_trx_id) == 6388 and log->min_trx == 6385. It seems that we are correctly treating this as a mismatch of the DB_TRX_ID, because in row_log_table_get_pk() the original DB_TRX_ID must have been smaller than 6385. So, this might be a 'harmless' debug assertion failure. This should only affect cases where the PRIMARY KEY is being redefined while concurrent DML is being executed. In the trace that I analyzed, ALTER TABLE t2 ADD PRIMARY KEY (col2) was being executed. |
| Comments |
| Comment by Marko Mäkelä [ 2020-07-21 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
mleich, you posted a RQG grammar to I just realized that applying the reset_trx_id should not affect the outcome of the memcmp(), and the assertion should still fail. So, this might not be caused by | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-07-22 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I checked the shorter trace (with 31666 events until the crash). It is executing ADD PRIMARY KEY on a ROW_FORMAT=COMPRESSED table. I think that I must first parse the log and then try to understand what exactly went wrong in the INSERT part of applying a ROW_T_UPDATE record:
Partially decoded: The 0xc014 indicate BLOB pointers (0x14=20 bytes of pointer written to the clustered index record). On a quick look, the immediate problem seems to be that both ROW_T_INSERT of the old primary key values DB_ROW_ID=0x205 and DB_ROW_ID=0x206 are using reset_trx_id while the ROW_T_UPDATE are for DB_TRX_ID=0x4f. The log->min_trx is 0x4b. I do not yet see any evidence that | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-07-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Even though the rr replay trace is relatively short, it includes many irrelevant statements, because the RQG grammar has not been simplified for this error. Both the original table and the copy of the table are initially empty at the time the first ROW_T_INSERT is logged. The sequence of interesting rr replay events (identifed by when) follows:
Based on the above, I created a reasonably reduced test case that repeats the crash. 10.1 and 10.2 are not affected:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2020-07-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The RQG test simplifier was used. But it generated only two bug replays on > 3000 RQG test runs. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-07-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This problem is caused by the fix of I simplified the test case even further and analyzed the contents of the log in both 10.2 and 10.3.
Notably, in 10.3, if I append LIMIT 1 to the UPDATE statement, the test will no longer crash, but it will not report the duplicate PRIMARY KEY either, like it should. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-07-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The following partial revert of the
It is not an actual fix yet, and we would definitely be missing a change to row_log_table_low_redundant() if this is the right direction. I believe that the root cause is that applying the second ROW_T_INSERT did not flag a duplicate key error. Here is a simpler test for that, causing no crash, but a wrong result (the ALTER TABLE wrongly succeeds, and the table will contain 1 row, even though 2 were inserted):
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-07-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Finally, after some debugging on 10.2 and 10.3, I found the actual root cause. The following fix should have limited negative impact: more ‘fake’ duplicate key errors could be reported.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-07-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Also the following would fail to flag a duplicate key error from an INSERT (or possibly UPDATE) transaction that was started before the ALTER TABLE but did not access the table in question before the ALTER TABLE started:
and
Starting with MariaDB 10.2, a simple ALTER TABLE t2 DROP b would be refused. In MariaDB 10.0 and 10.1, it would imply the PRIMARY KEY operations. |