[MDEV-29144] ER_TABLE_SCHEMA_MISMATCH or InnoDB: Failing assertion: index != 0 upon IMPORT TABLESPACE Created: 2022-07-20 Updated: 2023-01-09 Resolved: 2022-12-09 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Data Definition - Alter Table, Locking, Storage Engine - InnoDB |
| Affects Version/s: | 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10 |
| Fix Version/s: | 10.11.2, 10.3.38, 10.4.28, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Elena Stepanova | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | corruption, rr-profile-analyzed | ||
| Issue Links: |
|
||||||||||||||||||||
| Description |
|
The test case is highly non-determinstic, run with a high value of repeat=N. It usually fails for me within several dozen attempts, but it can vary a lot on different machines and builds. The compound block in the test case is irrelevant to the problem, it is used to get better concurrency within MTR. Instead, the same results can be achieved by running ALTER concurrently with a single UPDATE (send ALTER => run UPDATE => reap ALTER), but then it takes longer to reproduce. The failure happens only when ALTER ends with ER_LOCK_WAIT_TIMEOUT. However, it doesn't always happen when ALTER ends with the timeout; apparently, a much thiner concurrency is involved, at least I couldn't serialize it neither on the SQL level nor via existing sync points in sql_table/sql_base.
So, the test creates/populates a table, then runs some DML on it and inplace ALTER .. ADD INDEX in parallel. ADD INDEX fails, the table seemingly doesn't get the index (neither show create nor I_S tables show it), but when the tablespace is imported in an identical table, it complains about an extra index in meta-data:
If we modify the test case slightly, removing this two lines:
– that is, instead of re-creating the table from scratch, we re-import the tablespace into the original one – it causes an assertion failure instead:
On 10.3 the same test case variation causes a different assertion failure:
|
| Comments |
| Comment by Elena Stepanova [ 2022-07-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
On a related note, InnoDB is generally very verbose in the error log upon discard/import tablespace, here is the log of one repetition of the above when it ends normally:
But when a problem occurs, it isn't really helpful at all:
It could be useful if instead of (or at least in addition to) printing all the lines when things go well, it were more elaborate when things go wrong. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-07-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
elenst, can you please produce one https://rr-project.org trace of a crash? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-07-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
After I removed the two statements from the test, I got a nice crash on 10.3, also with ./mtr --rr. I determined the root cause of it to be a thread that will fail to release a table handle:
From this, I got a long log. In my trace, the interesting part starts when the reference count was initially incremented from 0 to 1 and never back again. For me, the last thread to increment the count from 0 to 1 correspondended to connection default in the test, executing the INSERT. The table handle was not released, because ha_innobase::close() was never called between that and the next statement of that connection:
This statement would fail due to a metadata lock timeout:
I do not know the purpose of the NOWAIT, but I can imagine that it is to request that the operation be refused if a metadata lock is not immediately available. The timeout will occur after the index was already successfully created, when attempting to acquire an exclusive metadata lock for replacing the table definition. There was no call to ha_innobase::close() in the entire run from the start to the crash. If I add a FLUSH TABLES statement right after the ALTER TABLE t NOWAIT, the assertion failure on the table reference count would go away, but occasionally the assertion index != 0 would fail during IMPORT TABLESPACE. It took several attempts to reproduce that under rr. The reason why the assertion fails is that dict_table_get_index_on_name() will pretend that the index c does not exist, because the index was marked as uncommitted, in that very same ALTER TABLE t NOWAIT ADD INDEX (c) that had timed out. There was a call handler::ha_commit_inplace_alter_table(…, commit=false) to roll back the change in InnoDB, but that call would not remove the stub for the uncommitted index, because the reference count is nonzero. The index != 0 assertion fails because row_quiesce_write_indexes() wrongly writes information about the uncommitted index c. I think that it would be cleanest to assert that no references to the table exist at that point, and to remove any stubs for uncommitted indexes before writing to the file. But, I do not think that it can be fixed before the missing handler::close() call has been added. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2022-09-13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I will for now assume that other variations of ER_TABLE_SCHEMA_MISMATCH happening upon IMPORT have the same root cause, e.g.
etc. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2022-10-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
With breakpoints on open and close (in rr trace and the original test case), I see
If the test is modified to trigger an assert (removing the line with DROP) then, indeed, there will be no ha_innobase::close(). I've added this close(), but the behavior hasn't changed, still the error (or the assert) happens as before.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-11-15 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
If I add the call that serg highlighted (with the last parameter bool kill_delayed_threads=true because I believe that it is more appropriate), then a number of tests will start to fail like this:
On any other form of DDL or DML on the table than ALTER TABLE…IMPORT TABLESPACE or DROP TABLE it makes sense to issue these warnings. How could IMPORT TABLESPACE be detected by ha_innobase::open()?
Note: There are two problems that we have to fix here: The never-released table reference, and the incorrect metadata written by row_quiesce_write_indexes(). For fixing the table->get_ref_count() assertion, some change outside InnoDB is needed. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-11-15 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The following addresses the reference-count problem without affecting the output of too many tests:
We already issue such warnings in many ha_innobase member functions that will be invoked on DML. To compensate for the removal of the warning from ha_innobase::open(), we issue the warning on TRUNCATE and native ALTER TABLE, which would always be preceded by a call to ha_innobase::open(). | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-11-15 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The above patch indeed seems to make the crash go away. Occasionally the test fails as follows, which should be easy to fix:
The following additional patch fixes this:
After this, the only failure seems to be the following (due to the nondeterministic nature of the test):
After I disabled that part of the test output, it passed 15×1000 rounds:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2022-11-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I don't quite understand why DISCARD TABLESPACE have to do TDC_RT_REMOVE_NOT_OWN there and, for example, IMPORT doesn't. But anyway, it's InnoDB-only feature at the moment, so let's do what InnoDB needs. If we ever get another storage engine that can discard/import tablespaces — then we can reconsider, but until then it's pointless to try to guess what it'll need. sql_table.cc change is ok to push | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-12-09 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thank you. The call to ha_innobase::close() is only required on ALTER TABLE…DISCARD TABLESPACE. Omitting the condition if (discard) would not cause any failures of the regression test suite (nor repeated runs of the test of this fix), but I wanted to play it safe and only implement the necessary change. I hope that one day MDEV-11658 will provide an alternative to the problematic InnoDB data file import workflow, which in its current form requires the user to break the referential integrity of the data dictionary. |