[MDEV-29422] InnoDB crashes when dict_load_table_one() notices a corrupted table Created: 2022-08-31  Updated: 2022-09-12  Resolved: 2022-09-12

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.8, 10.7.4, 10.8.3, 10.9.1, 10.10.1
Fix Version/s: 10.6.10, 10.7.6, 10.8.5, 10.9.3, 10.10.2

Type: Bug Priority: Major
Reporter: Nuno Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File gdb.18109.txt     Text File gdb.18394.txt     Text File gdb.19812.txt    
Issue Links:
Duplicate
duplicates MDEV-29440 InnoDB instant ALTER TABLE recovery w... Closed

 Description   

10.8.4

Yesterday, after starting to get the same issue described in MDEV-29364 (after a few days of not getting that following a restart),
I restarted MariaDB and then created a fresh backup.

Transferred to the other server, and ran the prepare.

Prepare seems to run OK (this runs in a docker container btw), then it runs a number of restarts and SQL queries to sort out a new random root password (same process that's been running for years)

However, with yesterday's backup, it consistently crashes after a while, with:

80 Segmentation fault (core dumped) mysqld -u root

And a core file is created. The core file seems to be unreadable if I open with Notepad++,
but the "mysql.err" file contains possibly more readable information.

Here are some excerpts of the "mysql.err":

2022-08-30  8:40:24 0 [Note] InnoDB: Rolled back recovered transaction 45987095395
2022-08-30  8:40:24 0 [Note] InnoDB: Rolled back recovered transaction 45987095393
2022-08-30  8:40:24 0 [Note] InnoDB: Rolled back recovered transaction 45987095399
2022-08-30  8:40:24 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2022-08-30  8:40:25 0 [Note] Server socket created on IP: '0.0.0.0'.
2022-08-30  8:40:25 1 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './mysql/gtid_slave_pos.ibd' page [page id: space=20, page number=3]. You may have to recover from a backup.
2022-08-30  8:40:25 1 [Note] InnoDB: Page dump (16384 bytes):
2022-08-30  8:40:25 1 [Note] InnoDB: .............. [huge dump] ..............
2022-08-30  8:40:25 1 [Note] InnoDB: End of page dump
2022-08-30  8:40:25 1 [ERROR] InnoDB: File './mysql/gtid_slave_pos.ibd' is corrupted
2022-08-30  8:40:25 1 [Note] InnoDB:  You can use CHECK TABLE to scan your table for corruption. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2022-08-30  8:40:25 1 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './mysql/gtid_slave_pos.ibd' page [page id: space=20, page number=3]. You may have to recover from a backup.
2022-08-30  8:40:25 1 [Note] InnoDB: Page dump (16384 bytes):
2022-08-30  8:40:25 1 [Note] InnoDB: .............. [huge dump] ..............
2022-08-30  8:40:25 1 [Note] InnoDB: End of page dump
2022-08-30  8:40:25 1 [Note] InnoDB:  You can use CHECK TABLE to scan your table for corruption. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2022-08-30  8:40:25 1 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './mysql/gtid_slave_pos.ibd' page [page id: space=20, page number=3]. You may have to recover from a backup.
2022-08-30  8:40:25 1 [Note] InnoDB: Page dump (16384 bytes):
2022-08-30  8:40:25 1 [Note] InnoDB: .............. [huge dump] ..............
2022-08-30  8:40:25 1 [Note] InnoDB: End of page dump
2022-08-30  8:40:25 1 [Note] InnoDB:  You can use CHECK TABLE to scan your table for corruption. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2022-08-30  8:40:25 1 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './mysql/gtid_slave_pos.ibd' page [page id: space=20, page number=3]. You may have to recover from a backup.
2022-08-30  8:40:25 1 [Note] InnoDB: Page dump (16384 bytes):
2022-08-30  8:40:25 1 [Note] InnoDB: .............. [huge dump] ..............
2022-08-30  8:40:25 1 [Note] InnoDB: End of page dump
2022-08-30  8:40:25 1 [Note] InnoDB:  You can use CHECK TABLE to scan your table for corruption. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2022-08-30  8:40:25 1 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './mysql/gtid_slave_pos.ibd' page [page id: space=20, page number=3]. You may have to recover from a backup.
2022-08-30  8:40:25 1 [Note] InnoDB: Page dump (16384 bytes):
2022-08-30  8:40:25 1 [Note] InnoDB: .............. [huge dump] ..............
2022-08-30  8:40:25 1 [Note] InnoDB: End of page dump
2022-08-30  8:40:25 1 [Note] InnoDB:  You can use CHECK TABLE to scan your table for corruption. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2022-08-30  8:40:25 1 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './mysql/gtid_slave_pos.ibd' page [page id: space=20, page number=3]. You may have to recover from a backup.
2022-08-30  8:40:25 1 [Note] InnoDB: Page dump (16384 bytes):
2022-08-30  8:40:25 1 [Note] InnoDB: .............. [huge dump] ..............
2022-08-30  8:40:25 1 [Note] InnoDB: End of page dump
2022-08-30  8:40:25 1 [Note] InnoDB:  You can use CHECK TABLE to scan your table for corruption. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2022-08-30  8:40:25 1 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './mysql/gtid_slave_pos.ibd' page [page id: space=20, page number=3]. You may have to recover from a backup.
2022-08-30  8:40:25 1 [Note] InnoDB: Page dump (16384 bytes):
2022-08-30  8:40:25 1 [Note] InnoDB: .............. [huge dump] ..............

(many of the same)

At some point it starts showing my own databases/tables:

2022-08-30  8:41:05 0 [ERROR] InnoDB: Failed to read page 4631 from file './my_db/my_table.ibd': Page read from tablespace is corrupted.
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './my_db/my_table.ibd' page [page id: space=393, page number=5773]. You may have to recover from a backup.
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: Page dump (16384 bytes):

...

2022-08-30  8:41:05 0 [Note] InnoDB: End of page dump
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB:  You can use CHECK TABLE to scan your table for corruption. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [ERROR] InnoDB: Failed to read page 5210 from file './my_db/my_table.ibd': Page read from tablespace is corrupted.
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [Note] InnoDB: [small dump]
2022-08-30  8:41:05 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './my_db/my_table.ibd' page [page id: space=393, page number=5774]. You may have to recover from a backup.

...
...
continues with a lot of my databases/tables
...
...

and then the final bit:

2022-08-30 11:07:20 0 [Note] InnoDB: End of page dump
2022-08-30 11:07:20 0 [Note] InnoDB:  You can use CHECK TABLE to scan your table for corruption. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
220830 11:07:20 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.8.4-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=2
max_threads=12
thread_count=2
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 919329 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x55ea6b2d75b8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f9879e60be8 thread_stack 0x49000
addr2line: 'mysqld': No such file
Printing to addr2line failed
mysqld(my_print_stacktrace+0x2e)[0x55ea67ebfd9e]
mysqld(handle_fatal_signal+0x307)[0x55ea6792aef7]
sigaction.c:0(__restore_rt)[0x7f9db45f4630]
addr2line: 'mysqld': No such file
mysqld(+0xeab38c)[0x55ea67db138c]
mysqld(+0xeb8c09)[0x55ea67dbec09]
mysqld(+0xebafe5)[0x55ea67dc0fe5]
mysqld(+0xebbb66)[0x55ea67dc1b66]
mysqld(+0xeaf98c)[0x55ea67db598c]
mysqld(+0xdd6baf)[0x55ea67cdcbaf]
mysqld(+0xd91c08)[0x55ea67c97c08]
mysqld(+0xdf6480)[0x55ea67cfc480]
mysqld(_ZN5tpool10task_group7executeEPNS_4taskE+0xa6)[0x55ea67e4a536]
mysqld(_ZN5tpool19thread_pool_generic11worker_mainEPNS_11worker_dataE+0x61)[0x55ea67e48ee1]
??:0(std::this_thread::__sleep_for(std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> >))[0x7f9db418e330]
pthread_create.c:0(start_thread)[0x7f9db45ecea5]
??:0(__clone)[0x7f9db3b07b0d]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): (null)
Connection ID (thread ID): 0
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
 
We think the query pointer is invalid, but we will try to print it anyway. 
Query: 
 
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            1048576              1048576              files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       127822               127822               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: |/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e %P %I %h
 
Kernel version: Linux version 3.10.0-1160.62.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Tue Apr 5 16:57:59 UTC 2022

Do you think this could be caused by MDEV-29383?

Thanks.

Today's backup, however, seems to have gone well. I have a stable Snapshot, finally.
(on other days, it's MDEV-28974 that keeps affecting my daily snapshots)

It's really hard to get valid backups...

.
.



 Comments   
Comment by Nuno [ 2022-09-01 ]

Today's backup got the same issue again (with core dump).

Comment by Marko Mäkelä [ 2022-09-02 ]

nunop, because of MDEV-29374 (which I think that is a more likely cause of trouble) and MDEV-29383, I cannot recommend using MariaDB Server 10.6.9, 10.7.5, 10.8.4, 10.9.2. Not only do you get bad backups, but your database could end up unrecoverable after any crash. I have advocated for an out-of-schedule release to address this, but I am not aware of any update.

From your description it looks like you may have multiple page dumps interleaved in the output, reported concurrently by different threads.

I think that the cause of this is one of the tickets. If you are willing to try it, I can create a custom build for you. Meanwhile, can you please try to provide a full stack trace of the crash? After all, the goal of MDEV-13542 (which ironically introduced these regressions) was to avoid crashes when the database is corrupted.

You will likely have to install some dbgsym or debuginfo package that goes along with the server so that the function and parameter names and values will be resolved by gdb. In MDEV-27983 I see that the correct package name for that server version and environment was MariaDB-server-debuginfo-10.6.8-1.el8.x86_64.rpm.

You could invoke the debugger something like this:

gdb /usr/sbin/mariadbd core

In GDB, you can use the following commands to have the stack traces of all threads to be dumped to a file.

set height 0
set log file /tmp/gdb.txt
set log on
thread apply all backtrace
quit

Based on the incompletely resolved stack trace, I have the feeling that the crash would occur in an InnoDB purge task. You might be able to prevent that crash by starting the server with innodb_force_recovery=2.

Comment by Nuno [ 2022-09-02 ]

Hi marko

your database could end up unrecoverable after any crash

Ok.. that's frightening...!!
I can't have this risk in my production database.

If you are willing to try it, I can create a custom build for you.

Just to clarify, you mean like a release candidate for 10.8.5 ?

Would it be safer to downgrade to 10.8.3? (if it's possible to downgrade minor versions)

If you believe your custom build is stable enough for me to use, I'm ok to try it - just want to be very sure it won't corrupt my database.

can you please try to provide a full stack trace of the crash?

Ok. I'll try to figure out how to get the stacktrace. This will be on the Snapshot server (docker container), where it's failing.
I'll do this later today.

Let me know about the questions above, for the build.
Thank you very much.

Comment by Nuno [ 2022-09-02 ]

marko

Am I right to say that if I install MariaDB-server-debuginfo, then I don't need to build mysqld/mariadbd in debug mode? It's not clear in the documentation

I restored the database from the 30th,
let it crash with the segmentation fault,
ran the "gdb" command following your instructions,
and I've attached stack traces for 2 of the crashes.

Let me know if this helps.
(please also answer my other questions above, if possible)
Thank you!

gdb.18394.txt gdb.18109.txt

Comment by Nuno [ 2022-09-02 ]

Here's another attachment. I saved one of the cores (hopefully the "main" one) when it crashed on the 30th originally.

Hopefully these are helpful.

gdb.19812.txt

Comment by Daniel Black [ 2022-09-03 ]

> Am I right to say that if I install MariaDB-server-debuginfo, then I don't need to build mysqld/mariadbd in debug mode?

yes.

> It's not clear in the documentation

ack. I will try to clear that up.

gdb.18394 - holding a lock in Thread 34,28,22 and 21 writing a log error message.

none of the gdb traces have the segv in them. I notice the core_pattern is abrt. Can you access the backtrace with abrt-backtrace (or a similar abrt family tool) of the original segv?

Comment by Nuno [ 2022-09-04 ]

danblack

Thank you. Sorry for the delay.

I'm not having luck figuring out how to install abrt-backtrace. I can find some webpages mentioning it, but doesn't seem to be available anywhere to install.

I can however use abrt-action-generate-core-backtrace which is even installed on my system already.

However...

# mkdir core
# cp /usr/sbin/mariadbd core/executable
# cp core.19812 core/coredump
# abrt-action-generate-core-backtrace -d .
Generating core_backtrace
Error: Input file too big (25217744). Maximum size is 20000000.

I think it relates to mariadbd which is 24MB.
The coredump is 888MB.

Comment by Nuno [ 2022-09-05 ]

marko danblack

Based on the stacktraces I was able to produce for you,
are you still confident that 10.8.5 will fix the issues I've been getting?

Or maybe you can't tell?

Thank you.

Comment by Daniel Black [ 2022-09-06 ]

gdb.19812.txt doesn't appear to be the original crash (no SEGV in backtrace).

The locks around the error messages are a false alarm. I'm assuming because of the volume of error logs and the container injestion mechanism is slowing these down.

As the backtraces assumedly came from core dumps, and there's no signal handling there, I'm assuming these are sigkill, which I would assume to be OOM and therefore more MDEV-29374 related.

MDEV-29374 hasn't quite been merged into 10.8 yet, but when it does and shows up in the 10.8 on just below the commit message on this github page, a mariadb 10.8 container should be build (look at update time and org.opencontainers.image.revision from podman run quay.io/mariadb-foundation/mariadb-devel:10.8 cat /manifest.txt to ensure its in the history of the image). You'll be able to use quay.io/mariadb-foundation/mariadb-devel:10.8 container until the next release.

Comment by Marko Mäkelä [ 2022-09-06 ]

nunop, downgrades within a major release are always supposed to work.

There used to be a crash during the purge of committed transaction history in one of our crash-injection tests, but I just verified that the problem must have been fixed in MDEV-13542:

diff --git a/mysql-test/suite/innodb/t/leaf_page_corrupted_during_recovery.test b/mysql-test/suite/innodb/t/leaf_page_corrupted_during_recovery.test
index ddf0090c9e2..21d5336528a 100644
--- a/mysql-test/suite/innodb/t/leaf_page_corrupted_during_recovery.test
+++ b/mysql-test/suite/innodb/t/leaf_page_corrupted_during_recovery.test
@@ -49,8 +49,8 @@ EOF
 --error ER_UNKNOWN_STORAGE_ENGINE
 SELECT * FROM t1 WHERE PK = 1;
 
-# We will disable also purge, to not let it request the corrupted page.
-let $restart_parameters=--innodb-force-recovery=2;
+# Force recovery to ignore the corrupted page.
+let $restart_parameters=--innodb-force-recovery=1;
 --source include/restart_mysqld.inc
 SELECT * FROM t1 WHERE PK = 1;
 --error ER_NOT_KEYFILE
@@ -58,5 +58,3 @@ SELECT * FROM t1 WHERE pk = 12;
 CHECK TABLE t1;
 
 DROP TABLE t1;
-let $restart_parameters=;
---source include/restart_mysqld.inc

Because I have no idea how to reproduce the crash, it would be very important to get a resolved stack trace.

Which package was the MariaDB Server 10.8.4 executable installed from? It could be possible for me to resolve the numeric addresses in the original stack traces.

Comment by Nuno [ 2022-09-06 ]

Hi guys - thank you for your replies.

danblack

> "no SEGV in backtrace"

Do you find anything on the other 2 crashes?
If I get a new fresh crash, I'll try to generate the stacktraces again using the first method.

These days I've been getting the other crashes on prepare itself, which is irrelevant to this ticket. But I've been monitoring anyway, for another seg fault.

> "I'm assuming because of the volume of error logs and the container injestion mechanism is slowing these down."

I've recently moved to an HDD server because my database is growing too big and the SSD server has a small disk. The HDD server is in fact slow, so I wonder if that's the reason.
I'll do an experiment today and see if I can restore the database of the 30th in the SSD server and see if I get the same crash / seg fault.

> "You'll be able to use quay.io/mariadb-foundation/mariadb-devel:10.8 container until the next release."

Ok, thank you very much.
Although, I'm not using the official MariaDB container anywhere (except for smaller services with smaller databases).
For my production database, I'm using the actual host (AlmaLinux), with MariaDB installed by yum/dnf.
For the snapshot, I'm using a container, but the container is CentOS with MariaDB installed by yum.

My understanding from what you said is that MDEV-29374 is affecting my snapshot and not my actual prod database, right? So I guess I can workout something to test the devel version against my snapshot and see how it behaves. Thanks much.

marko

> "Which package was the MariaDB Server 10.8.4 executable installed from?"

Production (AlmaLinux 8.6):

$ curl -sS https://downloads.mariadb.com/MariaDB/mariadb_repo_setup | bash
$ yum install -y MariaDB-server MariaDB-client MariaDB-backup MariaDB-shared MariaDB-devel

(MariaDB-shared & MariaDB-devel are installed for compatibility with Sphinx Search)

Snapshot (CentOS 7.9 container - https://hub.docker.com/_/centos/):

$ curl -sS https://downloads.mariadb.com/MariaDB/mariadb_repo_setup | bash -s -- --mariadb-server-version="mariadb-10.8"
$ yum install -y MariaDB-server MariaDB-client MariaDB-backup

Hopefully this helps!

(I'll come back to you regarding the experiment with the backup of the 30th on the SSD server)

Comment by Daniel Black [ 2022-09-06 ]

I didn't see anything in the other two backtraces apart from assuming a OOM event killed the server.

A slow HDD isn't the reason here.

Latest devel packages from recent merge. ci number on packages is from the tarball-docker builder build number for the 10.8 branch build.

Comment by Nuno [ 2022-09-08 ]

danblack, marko

Hi guys
I hope you're keeping well.

I haven't had a good backup for several days now.
I'll be very afraid to upgrade or downgrade without a proper backup prior to doing that.

If I do the following:

1) put my website offline

2) restart MariaDB

3) backup

4) put my website back online

Will this guarantee a good backup?

My understanding is that the backups are corrupted because of "crash recovery" and/or DDL changes,
so I'm assuming that doing a Checkpoint just before the backup, and create the backup while it's not being used, should make it valid, is that correct?

I know SQL Server has a "CHECKPOINT" command/query that we can run, which writes current in-memory dirty pages and transaction log records to the disk.
Is there any equivalent in MariaDB? If not, will the above steps do something equivalent?

Thank you very much.
Have a great day.

Comment by Daniel Black [ 2022-09-09 ]

I'd suggest the following rough plan without having an idea of acceptable down time or the time taking for any of the backup/copy operations:

  1. put website offline
  2. mariadb-dump / mydumper to get logicalbackup
  3. shutdown mariadb
  4. copy datadir to new location
  5. start newer/older mariadb version as you see fit
  6. if there's a problem starting/restoring, use the mariadb-dump to load into re-initialize it all in the new version
  7. put website online

> Will this guarantee a good backup?

The guarantee of a good backup is the successful restore. I've tried to cultivate multiple options above.

backup stages appear to be the "CHECKPOINT" equivalent facilitating an online copy of the datadir once in "BACKUP STAGE BLOCK_COMMIT".

Comment by Nuno [ 2022-09-09 ]

Hi danblack

Thank you very much. I appreciate your reply.

Yeah, I was afraid the solution would be a dump. The databases are quite large for a SQL dump (and that was the reason I had to look for alternatives and eventually found xtrabackup at the time, now mariabackup).
And my website has significant traffic 24 hours a day.

I can afford a 5-minute downtime here and there, but hours becomes a user & revenue loss.
And I never had so much downtime has I've been having recently, restarting MariaDB often trying to get a proper valid backup (MDEV-28974, MDEV-29364), or because it's using too much swap (MDEV-29097)

A few days ago, I actually did try to use "mariadb-dump"/mysqldump against each database (while online - but I know it can cause foreign key issues if I try to restore later), but I also included information_schema in it, but when it was dumping one of the tables in information_schema, the website went completely offline and MariaDB was hanging completely. This never happened to me when I was using mysqldump many years ago (probably in MariaDB 10.1 or around that).
I had to restart MariaDB in emergency and discarded the idea of using mysqldump !!

Thanks for mentioning about BACKUP STAGE. Yeah, I was assuming that mariabackup does the checkpoint, but then if that's the case, I shouldn't need to run those commands if I already run mariabackup.

So, I did an experiment:

1) put my website offline

2) restart MariaDB

3) run mariabackup

4) run mariabackup again
(which does another checkpoint, but this time without any history of DDL changes, tables swapped, etc... MDEV-28974)

5) put my website back online

And FINALLY I got a good backup, that's not crashing on prepare nor seg-faulting...!!

So at least that's something I can do whenever MariaDB 10.8.5 is releasing.
Do you have any date for 10.8.5 ?

Comment by Marko Mäkelä [ 2022-09-09 ]

nunop, there is no decision on unscheduled releases yet.

You can find packages of development snapshots of the 10.8 branch at http://hasky.askmonty.org/archive/10.8/. The earliest build where this problem should be fixed is build-50104.

Just today, I happened to encounter this failure where preparing a backup succeeds but a table is reported as corrupted.

10.10 c0a6ce61d88104db14f7107cb16edf475b46bffd

mariabackup.huge_lsn 'strict_full_crc32' w3 [ fail ]
        Test ended at 2022-09-06 10:55:11
[01] 2022-09-06 10:55:09 Copying ./mysql/transaction_registry.ibd to /dev/shm/var/3/mysqld.1/data/mysql/transaction_registry.ibd
[01] 2022-09-06 10:55:09         ...done
[00] 2022-09-06 10:55:09 completed OK!
mysqltest: At line 92: query 'SELECT * FROM t' failed: ER_TABLE_CORRUPT (1877): Table test/t is corrupted. Please drop the table and recreate.

This was with a code revision that did not include the fixes of MDEV-29383 or MDEV-29374.

Comment by Marko Mäkelä [ 2022-09-09 ]

nunop, one more thing:

The HDD server is in fact slow, so I wonder if that's the reason.

Slower storage could improve the chances of hitting MDEV-29383, which I think can allow the checkpoint LSN to advance prematurely. That is, some log a little before the incorrect checkpoint LSN would not be applied by crash recovery or when preparing a backup.

Comment by Nuno [ 2022-09-09 ]

Thank you marko for your reply.

you can find packages of development snapshots of the 10.8 branch

My problem with nightly builds is that I don't know what has been tested or not, from the new changes. I see every day there are new commits.
Is there any build you consider stable enough, that you'd be happy to release as "the" 10.8.5 version? Or is it that only what's been tested is merged to the 10.8 branch?

I happened to encounter this failure where preparing a backup succeeds but a table is reported as corrupted.

Oh, that's another frightening news !!

I've just opened phpMyAdmin against that Snapshot database that restored well today.... clicked to open the tables list, boom!

65 Segmentation fault (core dumped) mysqld -u root

Slower storage could improve the chances of hitting MDEV-29383

Ok, I'll give it a try on the SSD server then.
Thank you.

Comment by Marko Mäkelä [ 2022-09-09 ]

I have some news: There might be new releases coming out "soon".

I see that danblack used the word "checkpoint" in a different meaning than InnoDB log checkpoint. I do not think that mariadb-backup will trigger a log checkpoint. It simply copies log from the latest checkpoint LSN that is available when backup starts. The backup lock stages will only block operations at the SQL layer, not any lower-level I/O, such as buffer pool page writes and InnoDB log checkpoints. If you want a log checkpoint to occur, you can issue the statement

SET GLOBAL innodb_max_dirty_pages_pct=0;

and wait some time, and then start a backup. You would want to restore it (and the connected innodb_max_dirty_pages_pct_lwm) afterwards, to restore acceptable write performance again.

Comment by Nuno [ 2022-09-09 ]

Thanks marko
I can consider doing that. I assume that if innodb_max_dirty_pages_pct_lwm = 0, I don't have to worry about it, and just innodb_max_dirty_pages_pct = 0, from what I read in the documentation.

Anyway - I restored today's backup in the SSD server, and it works well!
At least that's a relief.
The HDD server isn't appropriate for this database, or at least not with 10.8.5.

Ok. I guess I can go have a good nap now. These days have been really stressful to me!

Thank you both.

Comment by Marko Mäkelä [ 2022-09-12 ]

nunop, I am glad that you were able to tweak something that works for you until 10.8.5 is available.

When it comes to this bug report, I think that this is about avoiding a server crash when the data is corrupted in a particular way. To fix that, I would need to resolve the following stack trace that you posted earlier:

mysqld(+0xeab38c)[0x55ea67db138c]
mysqld(+0xeb8c09)[0x55ea67dbec09]
mysqld(+0xebafe5)[0x55ea67dc0fe5]
mysqld(+0xebbb66)[0x55ea67dc1b66]
mysqld(+0xeaf98c)[0x55ea67db598c]
mysqld(+0xdd6baf)[0x55ea67cdcbaf]
mysqld(+0xd91c08)[0x55ea67c97c08]
mysqld(+0xdf6480)[0x55ea67cfc480]
mysqld(_ZN5tpool10task_group7executeEPNS_4taskE+0xa6)[0x55ea67e4a536]

I will download the package that you used, and try to resolve this manually.

Comment by Marko Mäkelä [ 2022-09-12 ]

My attempt to find and download the correct package files into Debian apparently failed. I can resolve the short addresses, but they do not make any sense, that is, they were resolving to unrelated functions that are not calling each other.

Here is what I did:

  1. Download MariaDB-server-10.8.4-1.el8.x86_64.rpm from one of the mirrors that mariadb.org points to.
  2. Extract the server executable usr/sbin/mariadbd, for example, using the GNOME file-roller.
  3. Download MariaDB-server-debuginfo-10.8.4-1.el8.x86_64.rpm from the same mirror.
  4. Extract all of it to the current directory.
  5. Invoke the debugger:

    gdb --ex 'set debug-file-directory usr/lib/debug' --ex 'file mariadbd'
    

  6. Inside the debugger, type commands like list *0xdf6480 to resolve the addresses one at a time.
Comment by Marko Mäkelä [ 2022-09-12 ]

I repeated the exercise with MariaDB-server-10.8.4-1.el7.x86_64.rpm and the corresponding debuginfo. The stack traces are suddenly making some sense:

mariadb-10.8.4

dict_sys_t::remove(dict_table_t*, bool, bool) (/usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0dict.cc:1885)
dict_load_table_one(st_::span<char const> const&, dict_err_ignore_t, dict_names_t&) (/usr/src/debug/MariaDB-/src_0/storage/innobase/include/mem0mem.inl:407)
dict_sys_t::load_table(st_::span<char const> const&, dict_err_ignore_t) (/usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0load.cc:2495)
dict_load_table_on_id(unsigned long, dict_err_ignore_t) (/usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0load.cc:2571)
dict_table_open_on_id<true>(unsigned long, bool, dict_table_op_t, THD*, MDL_ticket**) (/usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0dict.cc:864)
row_purge_step(que_thr_t*) (/usr/src/debug/MariaDB-/src_0/storage/innobase/row/row0purge.cc:944)
que_run_threads(que_thr_t*) (/usr/src/debug/MariaDB-/src_0/storage/innobase/que/que0que.cc:653)
purge_worker_callback(void*) (/usr/src/debug/MariaDB-/src_0/include/mysql/psi/mysql_thread.h:745).

Yes, sometimes the current instruction in a stack frame is referring to some inlined code. The important part here is that dict_load_table_one() was invoking dict_sys_t::remove() to evict the definition of a corrupted table. Let us check the relevant part of the output of disassemble dict_load_table_one in GDB:

   0x0000000000eb8bec <+3084>:	je     0xeb8ef0 <dict_load_table_one(st_::span<char const> const&, dict_err_ignore_t, dict_names_t&)+3856>
   0x0000000000eb8bf2 <+3090>:	mov    -0x1c00(%rbp),%rsi
   0x0000000000eb8bf9 <+3097>:	lea    0xb88200(%rip),%rdi        # 0x1a40e00 <dict_sys>
   0x0000000000eb8c00 <+3104>:	xor    %ecx,%ecx
   0x0000000000eb8c02 <+3106>:	xor    %edx,%edx
   0x0000000000eb8c04 <+3108>:	call   0xeab250 <_ZN10dict_sys_t6removeEP12dict_table_tbb>
   0x0000000000eb8c09 <+3113>:	mov    -0x1c28(%rbp),%rax

Because we have the debug information, we can check which code is associated with that preceding conditional branch (je):

(gdb) list *0xeb8bec
0xeb8bec is in dict_load_table_one(st_::span<char const> const&, dict_err_ignore_t, dict_names_t&) (/usr/src/debug/MariaDB-/src_0/storage/innobase/dict/dict0load.cc:2348).
2343	
2344		mem_heap_t* heap = mem_heap_create(32000);
2345	
2346		dict_load_tablespace(table, ignore_err);
2347	
2348		if (dict_load_columns(table, heap) || dict_load_virtual(table)) {
2349	evict:
2350			dict_sys.remove(table);
2351			mem_heap_free(heap);
2352			DBUG_RETURN(nullptr);

At this point, the table had not been added to dict_sys yet, and therefore the call dict_sys.remove(table) is incorrect. Instead, we would need dict_mem_table_free(). This is something that was recently fixed as part of MDEV-29440.

Comment by Nuno [ 2022-09-12 ]

Hey marko

Strange that with el7 you find a match that makes more sense.

I'm definitely using the el8 release, since I'm on AlmaLinux 8.6.

# yum list installed | grep maria
MariaDB-backup.x86_64                       10.8.4-1.el8                             @mariadb-main
MariaDB-client.x86_64                       10.8.4-1.el8                             @mariadb-main
MariaDB-common.x86_64                       10.8.4-1.el8                             @mariadb-main
MariaDB-devel.x86_64                        10.8.4-1.el8                             @mariadb-main
MariaDB-server.x86_64                       10.8.4-1.el8                             @mariadb-main
MariaDB-shared.x86_64                       10.8.4-1.el8                             @mariadb-main
galera-4.x86_64                             26.4.12-1.el8                            @mariadb-main
 
# yum info mariadb
Name         : MariaDB
Version      : 10.8.4
Release      : 1.el8
Architecture : src
Size         : 91 M
Source       : None
Repository   : mariadb-main
...

Comment by Nuno [ 2022-09-12 ]

marko ignore what I said above. The crash happened on the Docker container which runs CentOS 7. So el7 is the right package to look at! Sorry.

Generated at Thu Feb 08 10:08:27 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.