[MDEV-33260] Crash at startup when unclean shutdown Created: 2024-01-16 Updated: 2024-01-25 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB, Virtual Columns |
| Affects Version/s: | 10.11.6 |
| Fix Version/s: | 10.6, 10.11 |
| Type: | Bug | Priority: | Major |
| Reporter: | Cuchac | Assignee: | Nikita Malyavin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | crash, recovery, startup | ||
| Environment: | official mariadb docker image on K8S cluster |
| Description |
|
When an unclean shutdown happens and recovery kicks in during startup, the server crashes. This happens on version 10.11.6; version 10.11.5 is unaffected. Unfortunately, I'm unable to produce a backtrace, because the quay.io/mariadb-foundation/mariadb-debug:10.11 image does not seem to contain correct debug symbols. I tried to post this to Zulip, so far without a response. I have a core dump but no debug symbols. I worked around the crash by downgrading to 10.11.5, doing a clean shutdown, and upgrading to 10.11.6 again. After that, 10.11.6 starts correctly. |
| Comments |
| Comment by Sergei Golubchik [ 2024-01-17 ] | ||||||||||||||||
|
Do you mean it was an unclean shutdown of 10.11.6 followed by recovery in the same 10.11.6? That is, not an upgrade after the unclean shutdown? | ||||||||||||||||
| Comment by Tomas Leypold [ 2024-01-17 ] | ||||||||||||||||
|
Yeah, the image quay.io/mariadb-foundation/mariadb-debug seems to have a broken main/debug repo in /etc/apt/sources.list.d/mariadb.list, as you can see below:
EDIT: I see that it is being worked on at MDBF-637 | ||||||||||||||||
| Comment by Sergei Golubchik [ 2024-01-17 ] | ||||||||||||||||
|
Okay. So packaging aside (it's being worked on, indeed), this boils down to "10.11.6 crashes when started after an unclean shutdown, 10.11.5 recovers the same datadir fine", correct? What is "unclean shutdown": a crash/kill, or a normal shutdown with innodb_fast_shutdown>0? | ||||||||||||||||
| Comment by Cuchac [ 2024-01-17 ] | ||||||||||||||||
|
"10.11.6 crashes when started after an unclean shutdown, 10.11.5 recovers the same datadir fine" - yes, that is correct. The affected instance running 10.11.6 was the cluster master, and an OOM happened inside the container. We are using a K8s cluster. | ||||||||||||||||
| Comment by Sergei Golubchik [ 2024-01-17 ] | ||||||||||||||||
|
You forgot to answer what you mean by "unclean shutdown". Was it a crash/kill or a normal shutdown with innodb_fast_shutdown>0? | ||||||||||||||||
| Comment by Cuchac [ 2024-01-17 ] | ||||||||||||||||
|
It was an OOM kill. I managed to produce a stack trace from yesterday's core file. Attached. But I'm unable to reproduce the issue. | ||||||||||||||||
| Comment by Daniel Black [ 2024-01-17 ] | ||||||||||||||||
|
Thanks. As it was a purge thread, it's going to be rather dependent on timing. Was this 10.11.6, or an image after that (if so, do you have the source revision hash from the logs)? It seems to come down to innobase_report_computed_value_failed. If you can keep the core, that would be appreciated; some deeper pointer values and their contents might be needed:
| ||||||||||||||||
| Comment by Cuchac [ 2024-01-18 ] | ||||||||||||||||
|
Yes, it was the official 10.11.6. I also tried mariadb-debug:10.11 approximately 3 days ago and it crashed as well. The core file is quite small, several hundred MB. I am keeping it and can upload it. There should not be any customer data in memory, because no query was executed, so that is no problem. | ||||||||||||||||
| Comment by Sergei Golubchik [ 2024-01-18 ] | ||||||||||||||||
|
marko, could you guess, by looking at the 10.11.5..10.11.6 InnoDB changes, what could be the reason for this bug? A crash on recovery, where 10.11.5 recovers the same datadir just fine. | ||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-22 ] | ||||||||||||||||
|
Theoretically, any corruption (cuchac also filed MDEV-33178) could be due to the bug Both this and MDEV-33178 seem to be related to indexed virtual columns, on which we have a number of open bugs, I am afraid. One of them would be MDEV-29181 (there is a testing status update in MDEV-30869). In the mariadbd_full_bt_all_threads.txt
I will need more details to be able to debug the exact cause of the failure. cuchac, can you post the output of the following GDB commands on this crash?
| ||||||||||||||||
| Comment by Cuchac [ 2024-01-23 ] | ||||||||||||||||
|
Hello,
attached | ||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-23 ] | ||||||||||||||||
|
Yes, the m_loc_kind == FIELD_LOC_KIND_BITPOS assertion failure in GDB is probably a problem in the GDB shipped with the latest Debian or Ubuntu releases. I think that it is related to bit-fields in structs. I don’t remember seeing this crash in my Debian Sid (unstable) for a while. The disassembler output does not include the current address 0x000056455cd081ec; it would end at 0x000056455cd0048b, which is several kilobytes earlier. That does not help us, but index->table->vc_templ seems to be a null pointer, according to the following line in gdb.log
In debug builds, we would have an assertion to catch this:
This could be related to the open ticket MDEV-26263. | ||||||||||||||||
| Comment by Cuchac [ 2024-01-23 ] | ||||||||||||||||
|
Thanks for the analysis. I can send you the core dump if it would help to diagnose the problem better. It is quite small. | ||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-24 ] | ||||||||||||||||
|
cuchac, thank you. A core dump would only work if the executable and all the shared libraries (info sharedlibrary in GDB) are saved, and passed to the "receiving" GDB by using the commands set solib-search-path or set solib-absolute-prefix. It has been somewhat of a nuisance in the past, and in one case (Wind River Linux) the only thing that worked was to debug the core dump in an equivalent environment. I haven’t had such trouble when debugging core dumps from other RPM- and AMD64-based GNU/Linux distributions on my Debian system. However, if I have understood it correctly, containers would make debugging core dumps easy, because everything except the Linux kernel should be contained in the Docker image. For the purpose of testing this hypothesis, I think that it would be very useful if you could upload the core dump. I’d recommend compressing it with xz to make it smaller. Please arrange the details with danblack. | ||||||||||||||||
| Comment by Daniel Black [ 2024-01-24 ] | ||||||||||||||||
|
You can upload it to https://mariadb.com/kb/en/meta/mariadb-ftp-server/. | ||||||||||||||||
| Comment by Cuchac [ 2024-01-24 ] | ||||||||||||||||
|
Hello, I used 'mariadb:10.11.6' with the following command added to install debugging symbols (from the wiki):
MDEV-33260_core.xz is now in the "private" FTP folder. | ||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-25 ] | ||||||||||||||||
|
Just for my education, I successfully analyzed the core dump in the Docker container. I see the correct address for the crash highlighted in the disassembly output (also in display/i $pc):
According to info registers, %rdx contains 0, so we are dereferencing a null pointer. The offset 0x10 corresponds to dict_vcol_templ_t::vtempl. If we had not crashed here, a few instructions later, at offset +207 from the start of the function, we would dereference dict_vcol_templ_t::n_col at the start of the structure. |
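The offset reasoning in the last comment can be illustrated with a small sketch. The struct below is a hypothetical, simplified stand-in for dict_vcol_templ_t, not the actual InnoDB definition: the names vcol_templ_sketch, row_templ_sketch and member_address are invented here, and the member types and the middle member are assumptions. It only shows why a load at offset 0x10 from a null base pointer would hit the third pointer-sized member on a 64-bit build, while n_col would sit at offset 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Opaque placeholder for the element type of the template array (assumption).
struct row_templ_sketch;

// Hypothetical, simplified layout; only the member order matters here.
struct vcol_templ_sketch {
  std::size_t n_col;          // start of the structure (offset 0)
  std::size_t n_v_col;        // assumed second member
  row_templ_sketch **vtempl;  // third member: offset 0x10 on an LP64 target
};

// Address that a load touches when it reads a member located
// `offset` bytes into an object that starts at `base`.
constexpr std::uintptr_t member_address(std::uintptr_t base,
                                        std::size_t offset) {
  return base + offset;
}

// n_col is the first member, so reading it dereferences the base itself.
static_assert(offsetof(vcol_templ_sketch, n_col) == 0,
              "n_col is expected at the start of the structure");

// vtempl follows two size_t members.
static_assert(offsetof(vcol_templ_sketch, vtempl) == 2 * sizeof(std::size_t),
              "vtempl is expected after two size_t members");

// With a null base pointer, the faulting address equals the member offset.
static_assert(member_address(0, offsetof(vcol_templ_sketch, vtempl)) ==
                  offsetof(vcol_templ_sketch, vtempl),
              "null base + offset == offset");
```

On an LP64 target such as AMD64, 2 * sizeof(std::size_t) is 0x10, which matches the displacement described above for a base register holding 0.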