Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Version/s: 10.3(EOL), 10.4(EOL)
Description
10.3 server crashes with a non-debug assertion failure when it starts on the attached datadir.
The datadir was created on the current 10.0 server as a part of the undo-upgrade scenario:
- start the current 10.0 server;
- create some tables and run some DML on them;
- kill the server during operation (with SIGKILL);
- restart the server with innodb-force-recovery=3, no client activity;
- shut down the server normally.
Then the current 10.3 server is started on the same datadir. It starts, but crashes immediately afterwards as below.
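The scenario above can be sketched as a list of command lines. This is only an illustration: the helper name `build_undo_upgrade_plan`, the `bin/mysqld` and `bin/mysqladmin` locations, and all paths are assumptions and are not taken from the actual test harness. Only the SIGKILL step and `innodb-force-recovery=3` come from the description. The plan is built and printed, not executed.

```python
import signal
from typing import List, NamedTuple, Sequence


class Step(NamedTuple):
    description: str
    argv: List[str]


def build_undo_upgrade_plan(old_basedir: str, new_basedir: str, datadir: str,
                            extra_args: Sequence[str] = ()) -> List[Step]:
    """Return the reproduction steps as command lines (nothing is executed).

    All directories and binary locations are hypothetical placeholders.
    """
    old_mysqld = f"{old_basedir}/bin/mysqld"
    new_mysqld = f"{new_basedir}/bin/mysqld"
    common = [f"--datadir={datadir}", *extra_args]
    return [
        Step("start the 10.0 server", [old_mysqld, *common]),
        Step("create tables and run DML (workload not shown)", []),
        # SIGKILL cannot be caught, so the server has no chance to shut down cleanly.
        Step("kill the server during operation",
             ["kill", f"-{int(signal.SIGKILL)}", "<mysqld pid>"]),
        Step("restart 10.0 with forced recovery, no client activity",
             [old_mysqld, *common, "--innodb-force-recovery=3"]),
        Step("shut down the server normally",
             [f"{old_basedir}/bin/mysqladmin", "shutdown"]),
        Step("start the 10.3 server on the same datadir", [new_mysqld, *common]),
    ]


plan = build_undo_upgrade_plan("/opt/mariadb-10.0", "/opt/mariadb-10.3", "/tmp/datadir")
for step in plan:
    print(step.description, "->", " ".join(step.argv))
```

Keeping the steps as data rather than running them directly makes it easy to review the exact options before pointing the script at a real datadir.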
10.3 b52bb6eb82db8
2018-06-12 18:15:28 0x7fb1809ea700 InnoDB: Assertion failure in file /data/src/10.3/storage/innobase/trx/trx0purge.cc line 121
InnoDB: Failing assertion: purge_sys.tail.commit <= purge_sys.rseg->last_commit

#5 0x00007fb1a1634448 in __GI_abort () at abort.c:89
#6 0x00007fb1a44f44fb in ut_dbg_assertion_failed (expr=0x7fb1a4b6b0c8 "purge_sys.tail.commit <= purge_sys.rseg->last_commit", file=0x7fb1a4b6af80 "/data/src/10.3/storage/innobase/trx/trx0purge.cc", line=121) at /data/src/10.3/storage/innobase/ut/ut0dbg.cc:61
#7 0x00007fb1a44c2dfb in TrxUndoRsegsIterator::set_next (this=0x7fb1a530c0a0 <purge_sys+416>) at /data/src/10.3/storage/innobase/trx/trx0purge.cc:121
#8 0x00007fb1a44c0d4b in trx_purge_choose_next_log () at /data/src/10.3/storage/innobase/trx/trx0purge.cc:1213
#9 0x00007fb1a44c0fa4 in trx_purge_get_next_rec (n_pages_handled=0x7fb1809e9dd0, heap=0x7fb1a85d2780) at /data/src/10.3/storage/innobase/trx/trx0purge.cc:1286
#10 0x00007fb1a44c11f5 in trx_purge_fetch_next_rec (roll_ptr=0x7fb17400d8f0, n_pages_handled=0x7fb1809e9dd0, heap=0x7fb1a85d2780) at /data/src/10.3/storage/innobase/trx/trx0purge.cc:1356
#11 0x00007fb1a44c151c in trx_purge_attach_undo_recs (n_purge_threads=4) at /data/src/10.3/storage/innobase/trx/trx0purge.cc:1429
#12 0x00007fb1a44c19ff in trx_purge (n_purge_threads=4, truncate=false) at /data/src/10.3/storage/innobase/trx/trx0purge.cc:1559
#13 0x00007fb1a4492d44 in srv_do_purge (n_total_purged=0x7fb1809e9ed0) at /data/src/10.3/storage/innobase/srv/srv0srv.cc:2583
#14 0x00007fb1a4493172 in srv_purge_coordinator_thread (arg=0x0) at /data/src/10.3/storage/innobase/srv/srv0srv.cc:2714
#15 0x00007fb1a3293064 in start_thread (arg=0x7fb1809ea700) at pthread_create.c:309
#16 0x00007fb1a16e662d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
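When checking whether another crash is the same failure, it can help to pull the source location and the failing expression out of the error log. The following is a minimal sketch, assuming the two-line message format shown above; the function name `parse_assertion` and the regular expressions are illustrative and not part of any MariaDB tool.

```python
import re
from typing import Optional, Tuple

# Patterns modelled on the two log lines quoted in this report.
ASSERT_LOCATION = re.compile(
    r"InnoDB: Assertion failure in file (?P<file>\S+) line (?P<line>\d+)")
ASSERT_EXPR = re.compile(r"InnoDB: Failing assertion: (?P<expr>.+)")


def parse_assertion(log_text: str) -> Optional[Tuple[str, int, str]]:
    """Return (file, line, expression) for the first assertion found, else None."""
    loc = ASSERT_LOCATION.search(log_text)
    expr = ASSERT_EXPR.search(log_text)
    if not loc or not expr:
        return None
    return loc.group("file"), int(loc.group("line")), expr.group("expr").strip()


sample = (
    "2018-06-12 18:15:28 0x7fb1809ea700 InnoDB: Assertion failure in file "
    "/data/src/10.3/storage/innobase/trx/trx0purge.cc line 121\n"
    "InnoDB: Failing assertion: purge_sys.tail.commit <= purge_sys.rseg->last_commit\n"
)
print(parse_assertion(sample))
```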
This test was run with --innodb-page-size=8K --loose-innodb_log_compressed_pages=on --loose-innodb-change-buffering=none; I'm not sure whether any of them is important. Naturally, to reproduce the crash on the attached datadir, the server also needs to be started with --innodb-page-size=8K. The other two options don't make a difference; otherwise, all defaults.
The ib_logfile-s are compressed and attached separately just to overcome the 10M limitation in JIRA. I don't know if they are needed; the crash happens with and without them.
Similar-looking crashes upon upgrade from 10.1 have also been observed before.
10.2 doesn't crash on the same datadir.
Issue Links
- causes
  - MDEV-25981 Upgrade results in InnoDB failures (Closed)
  - MDEV-26465 Race condition in trx_purge_rseg_get_next_history_log() (Closed)
- is caused by
  - MDEV-12288 Reset DB_TRX_ID when the history is removed, to speed up MVCC (Closed)
  - MDEV-15370 Upgrade fails when both insert_undo and update_undo exist for recovered transactions (Closed)
- is duplicated by
  - MDEV-15464 Assertion `purge_sys.purge_queue.empty() || purge_sys.purge_queue.top() != m_rsegs' failed in TrxUndoRsegsIterator::set_next upon upgrade from 10.0/10.1 (Closed)
- relates to
  - MDEV-16952 Introduce SET GLOBAL innodb_max_purge_lag_wait (Closed)
  - MDEV-27437 Galera snapshot transfer fails to upgrade between some major versions (Closed)
  - MDEV-29475 trx_undo_rseg_free() does not write redo log (Closed)
  - MDEV-18454 Assertion `0' failed in ReadView::check_trx_id_sanity upon crash-upgrade from 10.2.6 (Closed)
  - MDEV-18966 Transaction recovery may be broken after upgrade to 10.3 (Closed)
  - MDEV-27800 upgrade from MariaDB 10.2 to 10.5.13 results in [ERROR] InnoDB: corrupted TRX_NO (Closed)
We're also seeing this issue under similar circumstances.
We've got a 10.0 data dir which should have been shut down cleanly but may not have been in certain situations. Installing mariadb-server 10.3 and running mysql_upgrade works fine, and the server will boot. If the server runs long enough or a certain action is performed (I'm not sure which, I don't know enough), a purge will start. This will cause the crash.
With some help from `dragonheart` on IRC, I installed debug symbols and gdb'd. Here is some clumsy GDB info that may help.
(gdb) up
#1 0x00007ffff5e7402a in __GI_abort () at abort.c:89
89 abort.c: No such file or directory.
(gdb) up
#2 0x00005555559f34e2 in ut_dbg_assertion_failed (
expr=expr@entry=0x5555563c2098 "purge_sys.tail.commit <= purge_sys.rseg->last_commit",
file=file@entry=0x5555563c1f68 "/home/buildbot/buildbot/build/mariadb-10.3.8/storage/innobase/trx/trx0purge.cc",
line=line@entry=121) at /home/buildbot/buildbot/build/mariadb-10.3.8/storage/innobase/ut/ut0dbg.cc:61
61 /home/buildbot/buildbot/build/mariadb-10.3.8/storage/innobase/ut/ut0dbg.cc: No such file or directory.
(gdb) p purge_sys.tail.commit
$1 = 127992
(gdb) purge_sys.rseg->last_commit
Undefined command: "purge_sys". Try "help".
(gdb) p purge_sys.rseg->last_commit
$2 = 127991
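To make the failed check concrete, the two values printed above can be plugged into the asserted expression. This is a toy model in Python, not InnoDB code: the class and function names are hypothetical, and the only things taken from the report are the expression `purge_sys.tail.commit <= purge_sys.rseg->last_commit` and the two numbers from the gdb session.

```python
from dataclasses import dataclass


@dataclass
class PurgeState:
    """Hypothetical stand-in for the two fields compared by the assertion."""
    tail_commit: int       # purge_sys.tail.commit
    rseg_last_commit: int  # purge_sys.rseg->last_commit


def invariant_holds(state: PurgeState) -> bool:
    # Mirrors the expression asserted at trx0purge.cc line 121.
    return state.tail_commit <= state.rseg_last_commit


# Values taken from the gdb output above.
observed = PurgeState(tail_commit=127992, rseg_last_commit=127991)
print(invariant_holds(observed))  # False: 127992 > 127991, so the assertion fires
```

The two values differ by exactly one, which is all this sketch shows; it says nothing about why the datadir ended up in that state.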
I then did some reading around and noticed we had missed a step in the MariaDB upgrade notes: for one of the versions (I've lost the page now, but it might have been 10.0 -> 10.1), they recommend doing a shutdown with innodb_fast_shutdown=0 before doing the upgrade. I tried this, booted back up with 10.3, ran mysql_upgrade, then booted 10.3 again and left it running under some (very minor) workloads, and it appears to be stable.
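The slow-shutdown step could look roughly like the sketch below, run against the old server before the binaries are swapped. `mysql -e` and `mysqladmin shutdown` are the standard client invocations, but the helper name is hypothetical and credentials or socket options are deliberately left out, so adapt it to the installation. The commands are only assembled and printed here, not run.

```python
from typing import List


def slow_shutdown_commands(client: str = "mysql",
                           admin: str = "mysqladmin") -> List[List[str]]:
    """Build the commands for a slow InnoDB shutdown (nothing is executed).

    innodb_fast_shutdown=0 asks InnoDB to finish purge and change buffer
    merge before shutting down, as the upgrade notes recommend.
    """
    return [
        [client, "-e", "SET GLOBAL innodb_fast_shutdown = 0"],
        [admin, "shutdown"],
    ]


for argv in slow_shutdown_commands():
    print(" ".join(argv))
```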
So, this is a case of user error, but I don't think MariaDB should be crashing quite this badly in this case. I'm keen to help debug further if it leads to a better error message or more graceful handling. I'm watching this issue, if anyone needs more info.