[MDEV-33260] Crash at startup when unclean shutdown Created: 2024-01-16  Updated: 2024-01-25

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB, Virtual Columns
Affects Version/s: 10.11.6
Fix Version/s: 10.6, 10.11

Type: Bug Priority: Major
Reporter: Cuchac Assignee: Nikita Malyavin
Resolution: Unresolved Votes: 0
Labels: crash, recovery, startup
Environment:

official mariadb docker image on K8S cluster


Attachments: File gdb.log     Text File mariadbd_full_bt_all_threads.txt    
Issue Links:
Relates
relates to MDBF-637 missing debuginfo packages due to rep... Open
relates to MDEV-26263 Investigate possible race on vc_templ... Open

 Description   

When unclean shutdown happens and recovery kicks in during launch, the server crashes.

This happens on 10.11.6 version, version 10.11.5 is unaffected.

Unfortunately I'm unable to produce backtrace, because quay.io/mariadb-foundation/mariadb-debug:10.11 image does not seem to contain correct debug symbols. Tried to post this to Zulip, so far without response. I have core dump but no debug symbols.

I fixed the crash by downgrading to 10.11.5, doing clean shutdown and upgrading to 10.11.6 again. Then 10.11.6 correctly starts.



 Comments   
Comment by Sergei Golubchik [ 2024-01-17 ]

Do you mean it was unclean shutdown of 10.11.6 and then recovery in the same 10.11.6? That is, not an upgrade after the unclean shutdown

Comment by Tomas Leypold [ 2024-01-17 ]

Yeah, image quay.io/mariadb-foundation/mariadb-debug seems to have broken main/debug repo in /etc/apt/sources.list.d/mariadb.list as you can see below:

root@807996c0b05e:/# apt update
Ign:1 https://ci.mariadb.org/42419/amd64-ubuntu-2204-deb-autobake jammy InRelease
Hit:2 http://security.ubuntu.com/ubuntu jammy-security InRelease    
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease              
Hit:4 https://ci.mariadb.org/42419/amd64-ubuntu-2204-deb-autobake jammy Release
Ign:5 https://buildbot.mariadb.net/archive/builds/mariadb-4.x/latest/kvm-deb-jammy-amd64-gal/debs ./ InRelease
Ign:6 https://ci.mariadb.org/42419/amd64-ubuntu-2204-deb-autobake jammy Release.gpg
Hit:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Ign:8 https://buildbot.mariadb.net/archive/builds/mariadb-4.x/latest/kvm-deb-jammy-amd64-gal/debs ./ Release
Hit:9 https://buildbot.mariadb.net/archive/builds/mariadb-4.x/latest/kvm-deb-jammy-amd64-gal/debs ./ Packages
Hit:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
3 packages can be upgraded. Run 'apt list --upgradable' to see them.
W: Skipping acquire of configured file 'main/debug/binary-amd64/Packages' as repository 'https://ci.mariadb.org/42419/amd64-ubuntu-2204-deb-autobake jammy InRelease' doesn't have the component 'main/debug' (component misspelt in sources.list?)

EDIT: I see that it is being worked on at MDBF-637

Comment by Sergei Golubchik [ 2024-01-17 ]

Okay. So packaging aside (it's being worked on, indeed), this goes down to "10.11.6 crashes when started after an unclean shutdown, 10.11.5 recovers the same datadir fine" — correct?

What is "unclean shutdown" — a crash/kill? Or a normal shutdown with innodb_fast_shutdown>0 ?

Comment by Cuchac [ 2024-01-17 ]

"10.11.6 crashes when started after an unclean shutdown, 10.11.5 recovers the same datadir fine" - yes, that is correct.

Affected instance running 10.11.6 was cluster master and OOM happened inside container. Using K8s cluster.

Comment by Sergei Golubchik [ 2024-01-17 ]

You forgot to answer what do you mean under "unclean shutdown". What it a crash/kill or a normal shutdown with innodb_fast_shutdown>0 ?

Comment by Cuchac [ 2024-01-17 ]

It was OOM kill.

I managed to produce stack trace from core file from yesterday. Attached.

But I'm unable to reproduce the issue It has happened yesterday on every launch and every crash-recovery but now ..... it does not crash with completely same container image.

Comment by Daniel Black [ 2024-01-17 ]

Thanks. At it was a purge thread its going to be rather dependent on timing.

This was 10.11.6? or a image after that (if so do you have source revision has from logs?).

Seems down to innobase_report_computed_value_failed.

If you can keep the core that would be appreciated, seems some deeper pointer values and their contents might be needed:

  • col->v_pos
  • index->table
  • index->table->vc_templ
  • index->table->vc_templ->vtempl
Comment by Cuchac [ 2024-01-18 ]

Yes, it was official 10.11.6. I tried also mariadb-debug:10.11 approx 3 days ago and it crashed as well.

Core file is quite small, several hundreds MB. I keep it and I can upload it. There should not be any customer data in memory, because no query was executed, so no problem.

Comment by Sergei Golubchik [ 2024-01-18 ]

marko, could you guess by looking at 10.11.5..10.11.6 InoDB changes, what could be a reason for this bug? A crash on recovery, where 10.11.5 recovers the same datadir just fine.

Comment by Marko Mäkelä [ 2024-01-22 ]

Theoretically, any corruption (cuchac also filed MDEV-33178) could be due to the bug MDEV-33275. To work around that one, you could set innodb_adaptive_flushing=OFF. This would make most sense if you can reproduce the crash after loading the database from a logical dump (SQL statements).

Both this and MDEV-33178 seem to be related to indexed virtual columns, on which we have a number of open bugs, I am afraid. One of them would be MDEV-29181 (there is a testing status update in MDEV-30869). In the mariadbd_full_bt_all_threads.txt it looks like we have a crash on the following source code line of innobase_get_computed_value():

mariadb-10.11.6

#0  0x000056455cd081ec in innobase_get_computed_value (row=row@entry=0x7f3dec366150, col=0x7f3e015a5920, index=index@entry=0x7f3e015a6690, local_heap=local_heap@entry=0x7f3dff7fcaf0, heap=heap@entry=0x7f3dec3660c8, ifield=ifield@entry=0x0, thd=0x564560eeb648, mysql_table=0x7f3dec03deb8, mysql_rec=0x7f3dec3665c0 "", old_table=0x0, update=0x0, ignore_warnings=false) at ./storage/innobase/handler/ha_innodb.cc:20213

	const mysql_row_templ_t*
			vctempl =  index->table->vc_templ->vtempl[
				index->table->vc_templ->n_col + col->v_pos];

I will need more details to be able to debug the exact cause of the failure. cuchac, can you post the output of the following GDB commands on this crash?

print index->name
print *index->table
print *index->table->vc_templ
print *col
disassemble innobase_get_computed_value

Comment by Cuchac [ 2024-01-23 ]

Hello,
command "print *index->table" break GDB (core dumped) so I put it as last command.
Output from

gdb /usr/sbin/mariadbd core.1 --batch --eval-command="print index->name" --eval-command="print *index->table->vc_templ" --eval-command="print *col"  --eval-command="disassemble innobase_get_computed_value" --eval-command="print *index->table" > gdb.log 2>&1

attached

Comment by Marko Mäkelä [ 2024-01-23 ]

Yes, the m_loc_kind == FIELD_LOC_KIND_BITPOS assertion failure in GDB is something that probably is in the GDB in the latest Debian or Ubuntu releases. I think that it is related to bit-fields in structs. I don’t remember seeing this crash in my Debian Sid (unstable) for a while.

The disassembler output does not include the current address 0x000056455cd081ec; it would end at 0x000056455cd0048b, which is several kilobytes earlier. That does not help us, but index->table->vc_templ seems to be a null pointer, according to the following line in gdb.log:

Cannot access memory at address 0x0

In debug builds, we would have an assertion to catch this:

	ut_ad(index->table->vc_templ);

This could be related to the open ticket MDEV-26263.

Comment by Cuchac [ 2024-01-23 ]

Thanks for the analysis, I can send you the core dump, if it will help to diagnose the problem better. It is quite small.

Comment by Marko Mäkelä [ 2024-01-24 ]

cuchac, thank you. A core dump would only work if the executable and all the shared libraries (info sharedlibrary in GDB) are saved, and passed to the "receiving" GDB by using the commands set solib-search-path or set solib-absolute-prefix. It has been somewhat of a nuisance in the past, and in one case (Wind River Linux) the only thing that worked was to debug the core dump in an equivalent environment. I haven’t had such trouble when debugging core dumps from other rpm and AMD64 based GNU/Linux distributions on my Debian system.

However, if I have understood it correctly, containers would make debugging core dumps easy, because everything except the Linux kernel should be contained in the Docker image. For the purpose of testing this hypothesis, I think that it would be very useful if you could upload the core dump. I’d recommend to compress it with xz to make it smaller. Please arrange the details with danblack.

Comment by Daniel Black [ 2024-01-24 ]

Can upload to https://mariadb.com/kb/en/meta/mariadb-ftp-server/.

Comment by Cuchac [ 2024-01-24 ]

Hello,
coredump from container definitely has to work in another instance of same container. I destroyed problematic container several times and created again to inspect core file.

I used 'mariadb:10.11.6' with following command added to install debugging symbols (from wiki):

apt-get update  && apt-get install -y gdb mariadb-server-core-dbgsym=$(dpkg -s mariadb-server-core |  awk '/Version/{print $NF}') ; 

MDEV-33260_core.xz is now in "private" ftp folder.

Comment by Marko Mäkelä [ 2024-01-25 ]

Just for my education, I successfully analyzed the core dump in the Docker container. I see the correct address for the crash highlighted in the disassembly output (also in display/i $pc):

   0x000056455cd081cc <+156>:	mov    $0x200,%ebx
   0x000056455cd081d1 <+161>:	shl    %cl,%ebx
   0x000056455cd081d3 <+163>:	mov    %rbx,-0xc78(%rbp)
   0x000056455cd081da <+170>:	mov    0x258(%rax),%rdx
   0x000056455cd081e1 <+177>:	mov    0x98(%rax),%rbx
   0x000056455cd081e8 <+184>:	mov    0x28(%r15),%eax
=> 0x000056455cd081ec <+188>:	mov    0x10(%rdx),%rcx
   0x000056455cd081f0 <+192>:	mov    %rbx,-0xc80(%rbp)
   0x000056455cd081f7 <+199>:	shr    $0xa,%eax
   0x000056455cd081fa <+202>:	and    $0x3ff,%eax
   0x000056455cd081ff <+207>:	add    (%rdx),%rax
   0x000056455cd08202 <+210>:	cmpq   $0x0,-0xc88(%rbp)
   0x000056455cd0820a <+218>:	mov    (%rcx,%rax,8),%rax
   0x000056455cd0820e <+222>:	mov    %rax,-0xc90(%rbp)

According to info registers, %rdx contains 0, so we are dereferencing a null pointer. The offset 0x10 corresponds to dict_vcol_templ_t::vtempl. If we had not crashed here, a few instructions later, at offset +207 from the start of the function, we would dereference dict_vcol_templ_t::n_col at the start of the structure.

Generated at Thu Feb 08 10:37:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.