[MDEV-28923] atomic.rename_table occasionally fails Created: 2022-06-22 Updated: 2022-06-23 Resolved: 2022-06-23 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6, 10.7, 10.8, 10.9, 10.10 |
| Fix Version/s: | 10.6.9, 10.7.5, 10.8.4, 10.9.2, 10.10.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | recovery | ||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Description |
|
While
I would need at least a copy of a data directory that failed to recover. Preferrably from 10.8 or a later branch, because the |
| Comments |
| Comment by Marko Mäkelä [ 2022-06-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I checked a copy of the data directory for a 10.9 kvm-ubsan failure that had failed like this:
data.tar.xz The first mention of the name test/t5 in the ib_logfile0 looks like a WRITE|0x80 record for SYS_TABLES at LSN (and file byte offset) 0x10e71, presumably updating an earlier record:
The second mention is a FILE_RENAME mini-transaction at 0x10f00 (the last 5 bytes are the end marker and the checksum):
Note: At this point, the tablespace ID was 5. There is quite a bit of renaming going on. It is best to search the ib_logfile0 for test/t5.ibd and not test/t5 (which would be updates of SYS_TABLES.NAME). There is also a FILE_RENAME for renaming test/t5.ibd back to test/t1.ibd as tablespace 5. The last record that mentions test/t5.ibd is a FILE_RENAME of test/t4.ibd to test/t5.ibd:
The tablespace identifier here is 8, just like in the message. All data files test/t1.ibd, test/t2.ibd, test/t4.ibd, test/t5.ibd in the data directory are 6*16384 bytes long and consist of NUL bytes. The last checkpoint is 0xec63, and it seems to be straight at the end of the ./mtr --bootstrap. All of the test/ tables were created after that, so the data directory should be fully recoverable. thiru, it looks like the deferred recovery logic of | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-06-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I debugged an rr replay trace of the failed recovery, checking all occurrences of fil_name_process for space_id==8. Recovery starts scanning the log at the following records that were written by fil_names_clear() corresponding to the latest checkpoint:
Then it rewinds to the checkpoint LSN (0xec63) and scans all records. There is quite a bit of renaming back and forth between t4 and t5. Here are all scanned FILE_ records for tablespace 8:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-06-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The test is executing a multi-table RENAME, with crashes injected after different numbers of performed operations. The log file of the execution said:
This indicates that we successfully recovered from earlier injected crashes, and the final one, an injected crash after the very last RENAME operation, failed to recover:
In case no crash was injected, the operation would be rolled back by executing the following:
This operation would swap t1 with t2 and ‘increment’ the names of t3 and t4. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-06-23 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks to some debugging by thiru, who was able to reproduce the issue locally, the problem was that fil_name_process() was wrongly avoiding a call to fil_ibd_load() when the tablespace identifier occurred in deferred_spaces. |