[MDEV-29422] InnoDB crashes when dict_load_table_one() notices a corrupted table Created: 2022-08-31 Updated: 2022-09-12 Resolved: 2022-09-12
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6.8, 10.7.4, 10.8.3, 10.9.1, 10.10.1 |
| Fix Version/s: | 10.6.10, 10.7.6, 10.8.5, 10.9.3, 10.10.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Nuno | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Issue Links: | |
| Description |
10.8.4. Yesterday, after starting to get the same issue described in a previous report, I transferred the backup to the other server and ran the prepare. The prepare seems to run OK (this runs in a Docker container, by the way); then the process performs a number of restarts and SQL queries to set a new random root password (the same process that has been running for years). However, with yesterday's backup, it consistently crashes after a while, with:

80 Segmentation fault (core dumped) mysqld -u root

And a core file is created. The core file seems to be unreadable if I open it with Notepad++. Here are some excerpts of the "mysql.err":
(many of the same) At some point it starts showing my own databases/tables:
...
... and then the final bit:
Do you think this could be caused by a known issue? Thanks. Today's backup, however, seems to have gone well. I finally have a stable Snapshot. It's really hard to get valid backups...
| Comments |
| Comment by Nuno [ 2022-09-01 ] |

Today's backup got the same issue again (with core dump).
| Comment by Marko Mäkelä [ 2022-09-02 ] |

nunop, from your description it looks like you may have multiple page dumps interleaved in the output, reported concurrently by different threads. I think that the cause of this is one of the linked tickets. If you are willing to try it, I can create a custom build for you. Meanwhile, can you please try to provide a full stack trace of the crash? You will likely have to install some dbgsym or debuginfo package that goes along with the server, so that the function and parameter names and values will be resolved by gdb. You could invoke the debugger something like this:

In GDB, you can use the following commands to have the stack traces of all threads dumped to a file.

Based on the incompletely resolved stack trace, I have the feeling that the crash occurs in an InnoDB purge task. You might be able to prevent that crash by starting the server with innodb_force_recovery=2.
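The debugger invocation and in-GDB commands described above could look like the following sketch. The executable path /usr/sbin/mariadbd, the core file name, and the log file name are assumptions for a typical RPM install, not values taken from this ticket:

```shell
# Load the core dump against the matching server executable (paths are examples).
# The matching debuginfo package must be installed so gdb can resolve symbols.
gdb /usr/sbin/mariadbd /var/lib/mysql/core.12345 \
    -batch \
    -ex 'set logging file gdb_all_threads.txt' \
    -ex 'set logging on' \
    -ex 'thread apply all bt full'
```

`-batch` makes gdb exit after running the `-ex` commands, and `set logging on` duplicates all output into gdb_all_threads.txt, so the resulting file can be attached to the ticket.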
| Comment by Nuno [ 2022-09-02 ] |

Hi marko

Ok... that's frightening!

Just to clarify, do you mean something like a release candidate for 10.8.5? Would it be safer to downgrade to 10.8.3 (if it's possible to downgrade minor versions)? If you believe your custom build is stable enough for me to use, I'm OK with trying it; I just want to be very sure it won't corrupt my database.

Ok, I'll try to figure out how to get the stack trace. This will be on the Snapshot server (Docker container), where it's failing. Let me know about the questions above, for the build.
| Comment by Nuno [ 2022-09-02 ] |

Am I right to say that if I install MariaDB-server-debuginfo, then I don't need to build mysqld/mariadbd in debug mode? It's not clear in the documentation. I restored the database from the 30th. Let me know if this helps.
| Comment by Nuno [ 2022-09-02 ] |

Here's another attachment. I saved one of the cores (hopefully the "main" one) when it crashed on the 30th originally. Hopefully these are helpful.
| Comment by Daniel Black [ 2022-09-03 ] |

> Am I right to say that if I install MariaDB-server-debuginfo, then I don't need to build mysqld/mariadbd in debug mode?

Yes.

> It's not clear in the documentation

Ack. I will try to clear that up.

gdb.18394: threads 34, 28, 22 and 21 are holding a lock while writing a log error message. None of the gdb traces have the SIGSEGV in them. I notice the core_pattern is abrt. Can you access the backtrace of the original SIGSEGV with abrt-backtrace (or a similar abrt-family tool)?
| Comment by Nuno [ 2022-09-04 ] |

Thank you. Sorry for the delay. I'm not having luck figuring out how to install abrt-backtrace. I can find some webpages mentioning it, but it doesn't seem to be available anywhere to install. I can, however, use abrt-action-generate-core-backtrace, which is already installed on my system. However...

I think it relates to mariadbd, which is 24 MB.
| Comment by Nuno [ 2022-09-05 ] |

Based on the stack traces I was able to produce for you, can you tell anything? Or maybe you can't tell? Thank you.
| Comment by Daniel Black [ 2022-09-06 ] |

gdb.19812.txt doesn't appear to be the original crash (no SEGV in the backtrace). The locks around the error messages are a false alarm; I'm assuming the volume of error logs and the container ingestion mechanism are slowing these down. As the backtraces presumably came from core dumps, and there's no signal handling there, I'm assuming these are SIGKILL, which I would assume to be OOM and therefore more
| Comment by Marko Mäkelä [ 2022-09-06 ] |

nunop, downgrades within a major release are always supposed to work. There used to be a crash during the purge of committed transaction history in one of our crash-injection tests, but I just verified that the problem must have been fixed in

Because I have no idea how to reproduce the crash, it would be very important to get a resolved stack trace. Which package was the MariaDB Server 10.8.4 executable installed from? It could be possible for me to resolve the numeric addresses in the original stack traces.
| Comment by Nuno [ 2022-09-06 ] |

Hi guys, thank you for your replies.

> "no SEGV in backtrace"

Did you find anything in the other 2 crashes? These days I've been getting the other crashes on prepare itself, which is irrelevant to this ticket. But I've been monitoring anyway, for another seg fault.

> "I'm assuming because of the volume of error logs and the container ingestion mechanism is slowing these down."

I've recently moved to an HDD server because my database is growing too big and the SSD server has a small disk. The HDD server is in fact slow, so I wonder if that's the reason.

> "You'll be able to use the quay.io/mariadb-foundation/mariadb-devel:10.8 container until the next release."

Ok, thank you very much. My understanding from what you said is that

> "Which package was the MariaDB Server 10.8.4 executable installed from?"

Production (AlmaLinux 8.6): (MariaDB-shared & MariaDB-devel are installed for compatibility with Sphinx Search)

Snapshot (CentOS 7.9 container - https://hub.docker.com/_/centos/):

Hopefully this helps! (I'll come back to you regarding the experiment with the backup of the 30th on the SSD server.)
| Comment by Daniel Black [ 2022-09-06 ] |

I didn't see anything in the other two backtraces, apart from assuming an OOM event killed the server. A slow HDD isn't the reason here. The latest devel packages are from a recent merge; the ci number on the packages is the build number of the tarball-docker builder for the 10.8 branch.
| Comment by Nuno [ 2022-09-08 ] |

Hi guys

I haven't had a good backup for several days now. If I do the following:

1) put my website offline
2) restart MariaDB
3) backup
4) put my website back online

will this guarantee a good backup? My understanding is that the backups are corrupted because of "crash recovery" and/or DDL changes. I know SQL Server has a "CHECKPOINT" command/query that we can run, which writes current in-memory dirty pages and transaction log records to the disk. Thank you very much.
| Comment by Daniel Black [ 2022-09-09 ] |

I'd suggest the following rough plan, without having an idea of the acceptable downtime or the time taken by any of the backup/copy operations:

> Will this guarantee a good backup?

The guarantee of a good backup is the successful restore. I've tried to cultivate multiple options above. BACKUP STAGE appears to be the "CHECKPOINT" equivalent, facilitating an online copy of the datadir once in "BACKUP STAGE BLOCK_COMMIT".
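As a sketch of the BACKUP STAGE sequence just mentioned: the stages must all run in a single client session, because the backup lock is released when the connection closes. The datadir and destination paths below are assumptions, and mariadb-backup performs these stages itself; this only illustrates the manual equivalent:

```shell
mariadb -u root <<'SQL'
BACKUP STAGE START;
BACKUP STAGE FLUSH;
BACKUP STAGE BLOCK_DDL;
BACKUP STAGE BLOCK_COMMIT;
-- Copy the datadir while commits are blocked; \! runs a shell
-- command from the same client session (paths are placeholders).
\! cp -a /var/lib/mysql /backup/mysql-snapshot
BACKUP STAGE END;
SQL
```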
| Comment by Nuno [ 2022-09-09 ] |

Hi danblack

Thank you very much; I appreciate your reply. Yeah, I was afraid the solution would be a dump. The databases are quite large for a SQL dump (that was the reason I had to look for alternatives and eventually found xtrabackup at the time, now mariabackup). I can afford a 5-minute downtime here and there, but hours

A few days ago, I actually did try to use "mariadb-dump"/mysqldump against each database (while online, although I know it can cause foreign key issues if I try to restore later), and I also included information_schema in it. But when it was dumping one of the tables in information_schema, the website went completely offline and MariaDB was hanging completely. This never happened to me when I was using mysqldump many years ago (probably in MariaDB 10.1 or around that).

Thanks for mentioning BACKUP STAGE. Yeah, I was assuming that mariabackup does the checkpoint, but if that's the case, I shouldn't need to run those commands if I already run mariabackup. So, I did an experiment:

1) put my website offline
2) restart MariaDB
3) run mariabackup
4) run mariabackup again
5) put my website back online

And FINALLY I got a good backup that's not crashing on prepare nor seg-faulting!

So at least that's something I can do until MariaDB 10.8.5 is released.
| Comment by Marko Mäkelä [ 2022-09-09 ] |

nunop, there is no decision on unscheduled releases yet. You can find packages of development snapshots of the 10.8 branch at http://hasky.askmonty.org/archive/10.8/. The earliest build where this problem should be fixed is build-50104. Just today, I happened to encounter a failure where preparing a backup succeeds but a table is reported as corrupted.

This was with a code revision that did not include the fixes of
| Comment by Marko Mäkelä [ 2022-09-09 ] |

nunop, one more thing:

Slower storage could improve the chances of hitting
| Comment by Nuno [ 2022-09-09 ] |

Thank you marko for your reply.

My problem with nightly builds is that I don't know which of the new changes have been tested. I see there are new commits every day.

Oh, that's more frightening news! I've just opened phpMyAdmin against the Snapshot database that restored well today... clicked to open the tables list, and boom!

65 Segmentation fault (core dumped) mysqld -u root

Ok, I'll give it a try on the SSD server then.
| Comment by Marko Mäkelä [ 2022-09-09 ] |

I have some news: there might be new releases coming out "soon".

I see that danblack used the word "checkpoint" in a different meaning than an InnoDB log checkpoint. I do not think that mariadb-backup will trigger a log checkpoint; it simply copies the log from the latest checkpoint LSN that is available when the backup starts. The backup lock stages will only block operations at the SQL layer, not any lower-level I/O, such as buffer pool page writes and InnoDB log checkpoints. If you want a log checkpoint to occur, you can issue the statement

and wait some time, and then start a backup. You would want to restore the parameter (and the connected innodb_max_dirty_pages_pct_lwm) afterwards, to restore acceptable write performance again.
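The statement itself did not survive in this export. Assuming it adjusts innodb_max_dirty_pages_pct (an assumption based on the mention of the connected innodb_max_dirty_pages_pct_lwm), the flush-before-backup procedure could look like this:

```shell
# Force the page cleaner to flush aggressively
# (assumed statement; not quoted from the ticket).
mariadb -u root -e "SET GLOBAL innodb_max_dirty_pages_pct=0"

# Give the page cleaner some time to write out dirty pages.
sleep 60

# ... start the backup here ...

# Restore the defaults afterwards to regain acceptable write performance.
mariadb -u root -e "SET GLOBAL innodb_max_dirty_pages_pct=DEFAULT,
                    innodb_max_dirty_pages_pct_lwm=DEFAULT"
```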
| Comment by Nuno [ 2022-09-09 ] |

Thanks marko

Anyway, I restored today's backup on the SSD server, and it works well! Ok, I guess I can go have a good nap now. These days have been really stressful for me! Thank you both.
| Comment by Marko Mäkelä [ 2022-09-12 ] |

nunop, I am glad that you were able to tweak something that works for you until 10.8.5 is available. When it comes to this bug report, I think that this is about avoiding a server crash when the data is corrupted in a particular way. To fix that, I would need to resolve the following stack trace that you posted earlier:

I will download the package that you used, and try to resolve this manually.
| Comment by Marko Mäkelä [ 2022-09-12 ] |

My attempt to find and download the correct package files on Debian apparently failed. I can resolve the short addresses, but they do not make any sense; that is, they resolve to unrelated functions that are not calling each other. Here is what I did:
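The original command transcript was lost in this export. A manual resolution attempt of this kind could be sketched as follows; the package file names and the raw address 0x1234567 are placeholders, not values from this ticket:

```shell
# Extract the server executable and its debug info from the downloaded packages.
rpm2cpio MariaDB-server-10.8.4-1.el8.x86_64.rpm | cpio -idm ./usr/sbin/mariadbd
rpm2cpio MariaDB-server-debuginfo-10.8.4-1.el8.x86_64.rpm | cpio -idm

# Map a raw stack-trace address to a function name and source line,
# demangling C++ names (-C) and showing inlined frames (-i).
addr2line -e usr/sbin/mariadbd -f -C -i 0x1234567
```

If the addresses resolve to functions that could not plausibly call each other, the binary does not match the one that produced the trace, which is what happened here.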
| Comment by Marko Mäkelä [ 2022-09-12 ] |

I repeated the exercise with MariaDB-server-10.8.4-1.el7.x86_64.rpm and the corresponding debuginfo. The stack traces suddenly make some sense:

Yes, sometimes the current instruction in a stack frame refers to some inlined code. The important part here is that dict_load_table_one() was invoking dict_sys_t::remove() to evict the definition of a corrupted table. Let us check the relevant part of the output of disassemble dict_load_table_one in GDB:

Because we have the debug information, we can check which code is associated with the preceding conditional branch (je):

At this point, the table had not been added to dict_sys yet, and therefore the call dict_sys.remove(table) is incorrect. Instead, we would need dict_mem_table_free(). This is something that was recently fixed as part of
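The GDB inspection described above could be run non-interactively like this; the address operand of `info line` is a placeholder for an address taken from the actual trace:

```shell
# Disassemble the function with interleaved source lines (/s requires
# the debuginfo package), then map one address back to a source line.
gdb -batch usr/sbin/mariadbd \
    -ex 'disassemble /s dict_load_table_one' \
    -ex 'info line *0x1234567'
```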
| Comment by Nuno [ 2022-09-12 ] |

Hey marko

Strange that with el7 you find a match that makes more sense. I'm definitely using the el8 release, since I'm on AlmaLinux 8.6.
| Comment by Nuno [ 2022-09-12 ] |

marko, ignore what I said above. The crash happened on the Docker container, which runs CentOS 7. So el7 is the right package to look at! Sorry.