[MDEV-15912] InnoDB: Failing assertion: purge_sys.tail.commit <= purge_sys.rseg->last_commit upon upgrade from 10.0 or 10.1 to 10.3 Created: 2018-04-17  Updated: 2023-09-27  Resolved: 2021-06-21

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.3, 10.4
Fix Version/s: 10.3.30, 10.4.20, 10.5.11, 10.6.3

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: affects-tests, compat56, compat57, regression, upgrade

Attachments: File ib_logfile0.bz2     File ib_logfile1.bz2     File mdev15912_data.bgz    
Issue Links:
Duplicate
is duplicated by MDEV-15464 Assertion `purge_sys.purge_queue.empt... Closed
Problem/Incident
causes MDEV-25981 Upgrade results in InnoDB failures Closed
causes MDEV-26465 Race condition in trx_purge_rseg_get_... Closed
is caused by MDEV-12288 Reset DB_TRX_ID when the history is r... Closed
is caused by MDEV-15370 Upgrade fails when both insert_undo a... Closed
Relates
relates to MDEV-16952 Introduce SET GLOBAL innodb_max_purge... Closed
relates to MDEV-27437 Galera snapshot transfer fails to upg... Closed
relates to MDEV-29475 trx_undo_rseg_free() does not write r... Closed
relates to MDEV-18454 Assertion `0' failed in ReadView::che... Closed
relates to MDEV-18966 Transaction recovery may be broken af... Closed
relates to MDEV-27800 upgrade from MariaDB 10.2 to 10.5.13 ... Closed

 Description   

10.3 server crashes with a non-debug assertion failure when it starts on the attached datadir.
The datadir was created on the current 10.0 server as a part of the undo-upgrade scenario:

  • start the current 10.0 server;
  • create some tables and run some DML on them;
  • kill the server during operation (with SIGKILL);
  • restart the server with innodb-force-recovery=3, no client activity;
  • shutdown the server normally;

Then the current 10.3 server is started on the same datadir. It starts, but crashes immediately afterwards as below.

10.3 b52bb6eb82db8

2018-06-12 18:15:28 0x7fb1809ea700  InnoDB: Assertion failure in file /data/src/10.3/storage/innobase/trx/trx0purge.cc line 121
InnoDB: Failing assertion: purge_sys.tail.commit <= purge_sys.rseg->last_commit
 
#5  0x00007fb1a1634448 in __GI_abort () at abort.c:89
#6  0x00007fb1a44f44fb in ut_dbg_assertion_failed (expr=0x7fb1a4b6b0c8 "purge_sys.tail.commit <= purge_sys.rseg->last_commit", file=0x7fb1a4b6af80 "/data/src/10.3/storage/innobase/trx/trx0purge.cc", line=121) at /data/src/10.3/storage/innobase/ut/ut0dbg.cc:61
#7  0x00007fb1a44c2dfb in TrxUndoRsegsIterator::set_next (this=0x7fb1a530c0a0 <purge_sys+416>) at /data/src/10.3/storage/innobase/trx/trx0purge.cc:121
#8  0x00007fb1a44c0d4b in trx_purge_choose_next_log () at /data/src/10.3/storage/innobase/trx/trx0purge.cc:1213
#9  0x00007fb1a44c0fa4 in trx_purge_get_next_rec (n_pages_handled=0x7fb1809e9dd0, heap=0x7fb1a85d2780) at /data/src/10.3/storage/innobase/trx/trx0purge.cc:1286
#10 0x00007fb1a44c11f5 in trx_purge_fetch_next_rec (roll_ptr=0x7fb17400d8f0, n_pages_handled=0x7fb1809e9dd0, heap=0x7fb1a85d2780) at /data/src/10.3/storage/innobase/trx/trx0purge.cc:1356
#11 0x00007fb1a44c151c in trx_purge_attach_undo_recs (n_purge_threads=4) at /data/src/10.3/storage/innobase/trx/trx0purge.cc:1429
#12 0x00007fb1a44c19ff in trx_purge (n_purge_threads=4, truncate=false) at /data/src/10.3/storage/innobase/trx/trx0purge.cc:1559
#13 0x00007fb1a4492d44 in srv_do_purge (n_total_purged=0x7fb1809e9ed0) at /data/src/10.3/storage/innobase/srv/srv0srv.cc:2583
#14 0x00007fb1a4493172 in srv_purge_coordinator_thread (arg=0x0) at /data/src/10.3/storage/innobase/srv/srv0srv.cc:2714
#15 0x00007fb1a3293064 in start_thread (arg=0x7fb1809ea700) at pthread_create.c:309
#16 0x00007fb1a16e662d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

This current test was run with --innodb-page-size=8K --loose-innodb_log_compressed_pages=on --loose-innodb-change-buffering=none, I'm not sure whether any of them important. Naturally, to reproduce the crash on the attached datadir, the server needs to be also started with --innodb-page-size=8K, other two options don't make a difference; otherwise all defaults.

ib_logfile-s are compressed and attached separately just to overcome the 10M limitation in JIRA. I don't know if they are needed, the crash happens with and without them.

Similar-looking crashes upon upgrade from 10.1 have also been observed before.
10.2 doesn't crash on the same datadir.



 Comments   
Comment by Ben Anderson [ 2018-07-25 ]

We're also seeing this issue under similar circumstances.

We've got a 10.0 data dir which should have been shutdown cleanly but may not have in certain situations. Installing mariadb-server 10.3 and running mysql_upgrade works fine, and the server will boot. If the server is running enough or a certain action is performed (i'm not sure which, don't know enough), a purge will start. This will cause the crash.

With some help from `dragonheart` on IRC, I installed debug symbols and gdb'd. Here is some clumsy GDB info that may help.

(gdb) up
#1  0x00007ffff5e7402a in __GI_abort () at abort.c:89
89  abort.c: No such file or directory.
(gdb) up
#2  0x00005555559f34e2 in ut_dbg_assertion_failed (
    expr=expr@entry=0x5555563c2098 "purge_sys.tail.commit <= purge_sys.rseg->last_commit",
    file=file@entry=0x5555563c1f68 "/home/buildbot/buildbot/build/mariadb-10.3.8/storage/innobase/trx/trx0purge.cc",
    line=line@entry=121) at /home/buildbot/buildbot/build/mariadb-10.3.8/storage/innobase/ut/ut0dbg.cc:61
61  /home/buildbot/buildbot/build/mariadb-10.3.8/storage/innobase/ut/ut0dbg.cc: No such file or directory.
(gdb) p purge_sys.tail.commit
$1 = 127992
(gdb) purge_sys.rseg->last_commit
Undefined command: "purge_sys".  Try "help".
(gdb) p purge_sys.rseg->last_commit
$2 = 127991

I then did some reading around and noticed we had missed a step in the MariaDB upgrade notes that for one of the version (i've lost the page now, but it might be have been 10.0 -> 10.1) it recommends doing a shutdown with innodb_fast_shutdown=0 before doing the upgrade. I tried this, and booted back up with 10.3, ran the mysql_upgrade and then booted 10.3, left it running under some workloads (very minor) and it appears to be stable.

So - this is a case of user error, but I don't think MariaDB should be crashing quite this badly in this case. I'm keen to help debug any more if it means a better error message or something handled. I've watched this thread, if anyone needs more info.

Comment by Marko Mäkelä [ 2018-07-25 ]

ben.anderson, MariaDB upgrade is supposed to work even if the undo logs are not empty. Tests for that were implemented in MDEV-13269.

In MariaDB 10.3, there are two major changes to the undo log format.

  1. MDEV-12288 eliminated the separate insert_undo log; the undo log records of a transaction will be written to a single persistent undo log, which was formerly called update_undo. On upgrade, old_insert logs of any incomplete transactions will be handled. There was a fix related to this in MDEV-15370. Because insert_undo logs were never visible to purge, this change should not be causing the issue.
  2. MDEV-15132 and MDEV-15158 stopped writing anything to the TRX_SYS page during transaction commit. This could be what is causing the reported problem. We could be retrieving the maximum transaction ID in the wrong way.

There was also a minor change to the undo log format, in MDEV-14717 (making RENAME TABLE crash-safe inside InnoDB; the .frm operations still are not crash-safe).

Note: A shutdown after SET GLOBAL innodb_fast_shutdown=0 may fail to empty some undo logs. This was fixed in MariaDB 10.3.6 only:
MDEV-13603 innodb_fast_shutdown=0 may fail to purge all history
It is not feasible to backport that to 10.1 or earlier. It might not be feasible to backport to 10.2 either.

Comment by Marko Mäkelä [ 2018-07-25 ]

In the mdev15912_data.bgz when the assertion fails, both purge_sys.rseg->last_commit and purge_sys.tail.commit are 340491, and purge_sys.tail.commit is one more, causing the assertion to fail. Here are some interesting messages:

10.3 a86a02a8447233940cc3b0f40e2b403691997807

2018-07-25 12:58:45 0 [Note] InnoDB: Trx id counter is 170241
2018-07-25 12:58:45 0 [Note] InnoDB: Rolled back recovered transaction 169859
2018-07-25 12:58:45 0 [Note] InnoDB: Rollback of non-prepared transactions completed
Version: '10.3.9-MariaDB-debug-log'  socket: '/dev/shm/10.3/mysql-test/var/tmp/mysqld.1.sock'  port: 16000  Source distribution
2018-07-25 13:03:45 0x7fffe0c10700  InnoDB: Assertion failure in file /mariadb/10.3/storage/innobase/trx/trx0purge.cc line 121

If I start with innodb_force_recovery=3, then the server will not crash. But, this will not only prevent the rollback of recovered transactions but also the purge of old history (which is where the assertion would fail). With a source code modification we can only disable the rollback. This will let the server start up successfully:

diff --git a/storage/innobase/trx/trx0roll.cc b/storage/innobase/trx/trx0roll.cc
index c0864e40622..bd2e23716c4 100644
--- a/storage/innobase/trx/trx0roll.cc
+++ b/storage/innobase/trx/trx0roll.cc
@@ -862,7 +862,6 @@ DECLARE_THREAD(trx_rollback_all_recovered)(void*)
 	if (trx_sys.rw_trx_hash.size()) {
 		ib::info() << "Starting in background the rollback of"
 			" recovered transactions";
-		trx_rollback_recovered(true);
 		ib::info() << "Rollback of non-prepared transactions"
 			" completed";
 	}

So, the failure must be somehow related to the rollback of the recovered transaction.

Comment by Marko Mäkelä [ 2018-07-25 ]

ben.anderson reported:

purge_sys.tail.commit = 127992
purge_sys.rseg->last_commit = 127991

Also in mdev15912_data.bgz, the difference happens to be exactly 1:

purge_sys.tail.commit = 340492
purge_sys.rseg.last_commit = 340491

The two-component last_commit field (trx->no << 1 | !old_insert) was introduced in MDEV-15370, in an attempt to relax this assertion. But, even if we ignore the low-order bit, the difference remains. It looks like the fix in MDEV-15370 was incomplete. Purge should not need to process any old_insert logs; we only have to make sure that the undo log pages are made available for reuse.

Comment by Marko Mäkelä [ 2018-07-25 ]

MDEV-13495 changed the logic so that the recovered old_insert undo logs would be moved to the purge queue. Before MDEV-12288, the TRX_UNDO_TO_FREE state was used. I tried various things, but this turns out to be more difficult to fix than I thought. I would prefer to avoid reintroducing the TRX_UNDO_TO_FREE state.

diff --git a/storage/innobase/include/trx0rseg.h b/storage/innobase/include/trx0rseg.h
index dbd80486b71..794f4e66ca3 100644
--- a/storage/innobase/include/trx0rseg.h
+++ b/storage/innobase/include/trx0rseg.h
@@ -174,6 +174,8 @@ struct trx_rseg_t {
 
 	/** @return the commit ID of the last committed transaction */
 	trx_id_t last_trx_no() const { return last_commit >> 1; }
+	/** @return whether the last committed transaction is an update */
+	bool last_is_update() const { return bool(last_commit & 1); }
 
 	void set_last_trx_no(trx_id_t trx_no, bool is_update)
 	{
diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
index 32e050e9b91..b3c6f41827e 100644
--- a/storage/innobase/trx/trx0purge.cc
+++ b/storage/innobase/trx/trx0purge.cc
@@ -118,9 +118,11 @@ inline bool TrxUndoRsegsIterator::set_next()
 	ut_ad(purge_sys.rseg->space->id == TRX_SYS_SPACE
 	      || srv_is_undo_tablespace(purge_sys.rseg->space->id));
 
-	ut_a(purge_sys.tail.commit <= purge_sys.rseg->last_commit);
+	if (purge_sys.rseg->last_is_update()) {
+		ut_a(purge_sys.tail.commit <= purge_sys.rseg->last_commit);
+		purge_sys.tail.commit = purge_sys.rseg->last_commit;
+	}
 
-	purge_sys.tail.commit = purge_sys.rseg->last_commit;
 	purge_sys.hdr_offset = purge_sys.rseg->last_offset;
 	purge_sys.hdr_page_no = purge_sys.rseg->last_page_no;
 
@@ -307,10 +309,13 @@ trx_purge_add_undo_to_history(const trx_t* trx, trx_undo_t*& undo, mtr_t* mtr)
 				 MLOG_2BYTES, mtr);
 	}
 
-	if (rseg->last_page_no == FIL_NULL) {
+	const bool is_update = undo == trx->rsegs.m_redo.undo;
+
+	if (rseg->last_page_no == FIL_NULL
+	    || (is_update && !rseg->last_is_update())) {
 		rseg->last_page_no = undo->hdr_page_no;
 		rseg->last_offset = undo->hdr_offset;
-		rseg->set_last_trx_no(trx->no, undo == trx->rsegs.m_redo.undo);
+		rseg->set_last_trx_no(trx->no, is_update);
 		rseg->needs_purge = true;
 	}
 
@@ -1048,9 +1053,11 @@ static void trx_purge_truncate_history()
 	purge_sys_t::iterator& head = purge_sys.head.commit
 		? purge_sys.head : purge_sys.tail;
 
-	if (head.trx_no() >= purge_sys.view.low_limit_no()) {
+	const trx_id_t low_limit = purge_sys.view.low_limit_no();
+	if (head.trx_no() >= low_limit) {
 		/* This is sometimes necessary. TODO: find out why. */
-		head.reset_trx_no(purge_sys.view.low_limit_no());
+		ib::info() << "reset " << head.trx_no() << " to " << low_limit;
+		head.reset_trx_no(low_limit);
 		head.undo_no = 0;
 	}
 

The last hunk is not emitting any message, and the dataset is still crashing. (Note: it is necessary to restore the data directory from mdev15912_data.bgz before every startup attempt, because the rollback of the recovered transaction would change the persistent undo log data structures, triggering the assertion failure on every subsequent startup even if the code for the rollback code path was later fixed.)

Comment by Marko Mäkelä [ 2018-08-31 ]

We may have to introduce trx_rseg_t::old_insert_cached and some separate mechanism that will guarantee that old insert_undo pages from before the upgrade to 10.3 (or later) will eventually be freed, without interfering with the purge of transaction history.

Comment by Marko Mäkelä [ 2021-05-25 ]

As described in MDEV-23755, we can prevent these failures by refusing InnoDB startup if any incomplete transactions have undo logs in the pre-MDEV-12288 format, with an error message saying that the older server version must be shut down cleanly, without using innodb_force_recovery=3.

It might be possible to remove the old_insert data members altogether.

Comment by Marko Mäkelä [ 2021-06-21 ]

We will refuse an upgrade from older versions than MariaDB 10.3 if a clean shutdown was not performed, so that the undo logs would be emptied:

2021-06-13 16:42:45 0 [ERROR] InnoDB: upgrade from older version than MariaDB 10.3 requires clean shutdown

There will be only one persistent undo log per transaction. No main-memory data structures related to the partitioned persistent undo log before MDEV-12288 are retained.

Comment by Hans Borresen [ 2021-06-21 ]

It looks like this just got pulled into the 10.3 branch, but it breaks upgrades for anyone trying to go from 5.7 -> 10.3.30, or even 5.7 -> 10.3.29 -> 10.3.30

I started a topic for this in the zulip chat.

Log snippet, just in case it is helpful:

Version: '10.3.29-MariaDB'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server
2021-06-21 20:58:29 0 [Note] /usr/sbin/mysqld (initiated by: unknown): Normal shutdown
...
2021-06-21 20:58:29 0 [Note] InnoDB: FTS optimize thread exiting.
2021-06-21 20:58:29 0 [Note] InnoDB: Starting shutdown...
2021-06-21 20:58:29 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool
2021-06-21 20:58:29 0 [Note] InnoDB: Buffer pool(s) dump completed at 210621 20:58:29
2021-06-21 20:58:31 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2021-06-21 20:58:31 0 [Note] InnoDB: Shutdown completed; log sequence number 3281751; transaction id 3852
2021-06-21 20:58:31 0 [Note] /usr/sbin/mysqld: Shutdown complete
...
(after updating to 10.3.30 via yum)
2021-06-21 20:58:31 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2021-06-21 20:58:31 0 [ERROR] InnoDB: upgrade from older version than MariaDB 10.3 requires clean shutdown
2021-06-21 20:58:31 0 [Note] InnoDB: Retry with innodb_force_recovery=5
2021-06-21 20:58:31 0 [ERROR] InnoDB: Plugin initialization aborted with error Data structure corruption
2021-06-21 20:58:31 0 [Note] InnoDB: Starting shutdown...
2021-06-21 20:58:31 0 [ERROR] Plugin 'InnoDB' init function returned error.
2021-06-21 20:58:31 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2021-06-21 20:58:31 0 [Note] Plugin 'FEEDBACK' is disabled.
2021-06-21 20:58:31 0 [ERROR] Could not open mysql.plugin table. Some plugins may be not loaded
2021-06-21 20:58:31 0 [ERROR] Unknown/unsupported storage engine: InnoDB

Edit: I went ahead and filed MDEV-25981 for this issue.

Comment by Marko Mäkelä [ 2021-06-23 ]

hborresen, thank you. I should have actually tested the upgrade. As I expected, fixing MDEV-25981 involved only minor changes to the validation logic.

Generated at Thu Feb 08 08:24:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.