[MDEV-33363] CI failure: innodb.import_corrupted: Assertion failed: oldest_lsn > log_sys.last_checkpoint_lsn Created: 2024-02-02  Updated: 2024-02-07

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.8.1, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2, 11.3, 11.4
Fix Version/s: 10.11, 11.0, 11.1, 11.2, 11.3, 11.4

Type: Bug Priority: Critical
Reporter: Debarun Banerjee Assignee: Debarun Banerjee
Resolution: Unresolved Votes: 0
Labels: debug

Issue Links:
Relates
relates to MDEV-22027 Assertion oldest_lsn >= log_sys.last_... Closed
relates to MDEV-14425 Change the InnoDB redo log format to ... Closed

 Description   

The failure is observed in st-10.11-MDEV-33361. Since the assert looks unrelated to the change in the branch, the issue must be from the base branch.

The assert is introduced in MDEV-14425 and likely to have introduced the issue. I am setting the affected version as 10.8.1 when MDEV-14425 was merged.

https://buildbot.mariadb.org/#/builders/234/builds/26196/steps/9/logs/stdio
Test: innodb.import_corrupted
Branch: st-10.11-MDEV-33361
Commit: 93189df44
Platform: amd64-windows

innodb.import_corrupted 'innodb'         w6 [ fail ]
        Test ended at 2024-02-02 10:01:21
 
CURRENT_TEST: innodb.import_corrupted
mysqltest: At line 65: query 'ALTER TABLE t2 IMPORT TABLESPACE' failed with wrong errno <Unknown> (2013): 'Lost connection to server during query', instead of ER_NOT_KEYFILE (1034)...

Server [mysqld.1 - pid: 47680, winpid: 47680, exit: 0] failed during test run
Server log from this test:
----------SERVER LOG START-----------
2024-02-02 10:01:20 124 [Note] InnoDB: Sync to disk of `test`.`t1` started.
2024-02-02 10:01:20 124 [Note] InnoDB: Stopping purge
2024-02-02 10:01:21 124 [Note] InnoDB: Writing table metadata to './test/t1.cfg'
2024-02-02 10:01:21 124 [Note] InnoDB: Table `test`.`t1` flushed to disk
2024-02-02 10:01:21 124 [Note] InnoDB: Deleting the meta-data file './test/t1.cfg'
2024-02-02 10:01:21 124 [Note] InnoDB: Resuming purge
2024-02-02 10:01:21 124 [Note] InnoDB: Importing tablespace for table 'test/t1' that was exported from host 'WIN-5PHL1LR4M8G'
2024-02-02 10:01:21 124 [Note] InnoDB: Phase I - Update all pages
2024-02-02 10:01:21 124 [Note] InnoDB: Sync to disk
2024-02-02 10:01:21 124 [Note] InnoDB: Sync to disk - done!
2024-02-02 10:01:21 124 [Note] InnoDB: Phase II - Purge records from index `PRIMARY`
Assertion failed: oldest_lsn > log_sys.last_checkpoint_lsn, file C:\Buildbot\amd64-windows\build\storage\innobase\buf\buf0flu.cc, line 1982

server.dll!my_sigabrt_handler()[my_thr_init.c:465]
ucrtbase.dll!raise()
ucrtbase.dll!abort()
ucrtbase.dll!_get_wpgmptr()
ucrtbase.dll!_wassert()
server.dll!log_checkpoint_low()[buf0flu.cc:1982]
server.dll!log_checkpoint()[buf0flu.cc:2042]
server.dll!buf_flush_page_cleaner()[buf0flu.cc:2509]



 Comments   
Comment by Marko Mäkelä [ 2024-02-02 ]

Here is some code up to the failing assertion in log_checkpoint_low():

10.11 93189df44eeb431110bf9ca759bafa856ae90c82

  if (oldest_lsn == log_sys.last_checkpoint_lsn /* || … */) {
    log_sys.latch.wr_unlock();
    return true;
  }
 
  ut_ad(!recv_no_log_write);
  ut_ad(oldest_lsn > log_sys.last_checkpoint_lsn);

This would seem to guarantee that the equality cannot hold. So, we must have had an oldest_lsn that is smaller than the log_sys.last_checkpoint_lsn. That would seem to be a sign of a severe error in the checkpoint logic.

The parameter oldest_lsn was computed by the following in log_checkpoint():

10.11 93189df44eeb431110bf9ca759bafa856ae90c82

  log_sys.latch.wr_lock(SRW_LOCK_CALL);
  const lsn_t end_lsn= log_sys.get_lsn();
  mysql_mutex_lock(&buf_pool.flush_list_mutex);
  const lsn_t oldest_lsn= buf_pool.get_oldest_modification(end_lsn);
  mysql_mutex_unlock(&buf_pool.flush_list_mutex);
  return log_checkpoint_low(oldest_lsn, end_lsn);

The field log_sys.last_checkpoint_lsn is protected by log_sys.latch (and write_lock and flush_lock) when log_t::write_checkpoint() is modifying it. Also the end_lsn=log_sys.lsn is protected by the exclusive log_sys.latch. Here are some local variables from the stack trace:

10.11 93189df44eeb431110bf9ca759bafa856ae90c82

ucrtbase!wassert
Unable to enumerate locals, Win32 error 0n87
Private symbols (symbols.pri) are required for locals.
Type ".hh dbgerr005" for details.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
server!log_checkpoint_low [C:\Buildbot\amd64-windows\build\storage\innobase\buf\buf0flu.cc @ 1982]
unsigned int64 oldest_lsn = 0xdbbfe9
unsigned int64 end_lsn = 0xdc0a71
unsigned int64 flush_lsn = 0x00000216`d35ab7b0
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
server!log_checkpoint [C:\Buildbot\amd64-windows\build\storage\innobase\buf\buf0flu.cc @ 2042]
unsigned int64 oldest_lsn = 0xdbbfe9
unsigned int64 end_lsn = 0xdc0a71
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
server!buf_flush_page_cleaner [C:\Buildbot\amd64-windows\build\storage\innobase\buf\buf0flu.cc @ 2509]
unsigned int64 n = 0
unsigned int64 dirty_blocks = 0x1b5
unsigned int64 n_flushed = 0
unsigned int64 oldest_lsn = 0xdbbfe9
double pct_lwm = 0
unsigned int64 soft_lsn_limit = 0
double dirty_pct = 87.2255489021956123
unsigned int64 lsn_limit = 0
struct timespec abstime = struct timespec
unsigned int64 last_pages = 0x71a
unsigned int64 last_activity_count = 0
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

The variable flush_lsn, whose scope had not started before the failing assertion was evaluated, is garbage, but the rest looks reasonable.

The oldest modified block in the buf_pool.flush_list must have an buf_page_t::oldest_modification_ of 0xdbbfe9. For some reason, the latest checkpoint LSN (log_sys.last_checkpoint_lsn) is larger than this. This means that the log checkpoint had been advanced prematurely on a previous invocation of log_checkpoint_low(), while some older page modifications had not yet been written back from the buffer pool.

Comment by Marko Mäkelä [ 2024-02-07 ]

MDEV-22027 was a bug like this.

Generated at Thu Feb 08 10:38:21 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.