[MDEV-15608] Crash during transaction rollback when using optimistic parallel replication, few threads, non-durable configuration. Created: 2018-03-20 Updated: 2023-03-20 Resolved: 2021-04-13 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.2.12 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Juan | Assignee: | Marko Mäkelä |
| Resolution: | Cannot Reproduce | Votes: | 3 |
| Labels: | assertion, innodb, optimistic, parallelreplication | ||
| Environment: |
CentOS 7.4.1708 |
||
| Issue Links: |
|
| Description |
|
A user is seeing this crash frequently when testing optimistic parallel replication. Note that the crash happens reliably, possibly on rollback, with fewer replication threads (4) and a non-durable configuration (sync_binlog = 0 and innodb_flush_log_at_trx_commit = 2), with the binary log and log_slave_updates enabled. This could be a duplicate of MDEV-13800.
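For anyone trying to reproduce this, the setup described above would correspond roughly to the following replica settings. This is only a sketch derived from the description, not the reporter's actual my.cnf: it assumes that "trx_commit" means innodb_flush_log_at_trx_commit, that "slave-updates" means log_slave_updates, and that the 4 threads are set through slave_parallel_threads.

```ini
[mysqld]
# Assumed spelling of the settings named in the description (not the reporter's file)
slave_parallel_mode = optimistic
slave_parallel_threads = 4
# Non-durable: neither the binlog nor the redo log is synced to disk at each commit
sync_binlog = 0
innodb_flush_log_at_trx_commit = 2
# Binary logging on the replica, including replicated changes
log_bin
log_slave_updates
```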
Note also the rollbacks:
|
| Comments |
| Comment by Marko Mäkelä [ 2018-03-21 ] |
|
|
| Comment by Elena Stepanova [ 2018-03-23 ] |
|
See also comment in the linked issue |
| Comment by Marko Mäkelä [ 2018-07-26 ] |
|
If the workload involved InnoDB temporary tables, this could be a duplicate of |
| Comment by Jean-François Gagné [ 2018-07-26 ] |
|
Hi Marko, sorry, I cannot test whether the problem repeats with a newer version, as I no longer have access to the system where I initially saw it. |
| Comment by Chris Calender (Inactive) [ 2020-05-05 ] |
|
We've just seen the same crash and stack trace in a 10.2.27 instance. |
| Comment by Chris Calender (Inactive) [ 2020-05-05 ] |
|
|
| Comment by Marko Mäkelä [ 2020-05-06 ] |
|
I resolved the numeric addresses in the MariaDB 10.2.27 stack trace that ccalender posted:
I do not have the matching libc.so.6 code that this resolved part of the stack trace is invoking. The return address in the stack is for the following function call:
It corresponds to the following source code:
I would guess that len is negative (or, when interpreted as an unsigned value, close to 16 EiB), that is, we have a corrupted undo log page, or maybe even more likely, a corrupted or stale DB_ROLL_PTR that is not pointing to the start of a valid undo log record. This stack trace does not look like the originally reported bug, but rather something that is likely related to
Normally, the undo log records are written ahead of the clustered index leaf page changes, and on rollback or purge, the DB_ROLL_PTR references in the clustered index leaf page would be invalidated before the undo record becomes invalid. All these operations are written ahead to the redo log.
If InnoDB ever reported an LSN mismatch (such as FIL_PAGE_LSN being in the future), that could explain why the clustered index B-tree leaf page and the undo log page got out of sync. This would be a sign of something being seriously wrong with the database, and it would be a sign of a bug in InnoDB redo logging, crash recovery, or Mariabackup. (If innodb_force_recovery=6 is ever used, or the ib_logfile* files are discarded or renamed before starting the server, or data files are copied while the server is running, then it could be a user error.) |
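To make the "negative (or close to 16 EiB)" remark concrete: a small negative length, once reinterpreted as an unsigned 64-bit size, becomes a value just under 2^64 bytes, which is 16 EiB. The arithmetic can be sketched as follows, in Python for brevity. This is not InnoDB code; the helper names as_size_t and in_eib are made up for illustration, and a 64-bit size_t is assumed.

```python
# Sketch only: models a 64-bit unsigned size_t; this is not InnoDB code.
SIZE_T_BITS = 64  # assumption: 64-bit platform


def as_size_t(signed_len: int) -> int:
    """Reinterpret a signed length as an unsigned SIZE_T_BITS-bit value."""
    return signed_len % (1 << SIZE_T_BITS)


def in_eib(n_bytes: int) -> float:
    """Convert a byte count to exbibytes (1 EiB = 2**60 bytes)."""
    return n_bytes / (1 << 60)


if __name__ == "__main__":
    # A sane length stays as it is; negative lengths wrap to just under 2**64.
    for signed_len in (100, -1, -4096):
        unsigned_len = as_size_t(signed_len)
        print(f"{signed_len:>6} -> {unsigned_len} bytes "
              f"({in_eib(unsigned_len):.3f} EiB)")
```

A copy routine handed such a length would try to read far beyond the undo log page, which would be consistent with a crash inside the libc copy function.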
| Comment by Marko Mäkelä [ 2020-12-28 ] |
|
|
| Comment by Marko Mäkelä [ 2021-04-13 ] |
|
Both reported failures in this ticket could have been the result of corruption that was caused by the bug that was present in all InnoDB implementations until
That bug can corrupt anything in the InnoDB system tablespace (including undo logs) as well as corrupt secondary indexes. |
| Comment by Ian Gilfillan [ 2023-03-14 ] |
|
Note the possibly-related discussion at https://mariadb.zulipchat.com/#narrow/stream/118759-general/topic/InnoDB.20Rollback.20Bug |
| Comment by Marko Mäkelä [ 2023-03-15 ] |
|
When it comes to the corruption that causes the crash in trx_undo_rec_copy(), in
The rollback of some operations involves a purge of older history, nowadays in the function row_undo_mod_must_purge(). It can access undo log records for not yet purged committed transactions. Also in the posted mariadb-10.2.27 stack trace, we see a similar check, via the call row_vers_old_has_index_entry(). Such undo log records could become garbage due to the bug that was fixed in |
| Comment by Marko Mäkelä [ 2023-03-20 ] |
|
For the discussion that greenman linked to, I filed |
| Comment by Jean-François Gagné [ 2023-03-20 ] |
|
Nice, thanks Marko. I think we got to the root cause of this. Optimistic parallel replication needs to do some rollback when something run optimistically ends up conflicting with something that should have run before it, so with the possibility of a rollback causing a crash, things make sense. I will continue to follow in
This MDEV ( |