[MDEV-29969] Random crashes (signal 8) when restoring mariadb-server memory state using CRIU (OpenVZ 7) Created: 2022-11-07  Updated: 2022-11-07  Resolved: 2022-11-07

Status: Closed
Project: MariaDB Server
Component/s: N/A
Affects Version/s: 10.9.3
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Philippe Assignee: Unassigned
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Host : OpenVZ 7 (7.0.18)
Container : Debian 10 (Buster)


Attachments: Text File config_CT.txt     Text File error_log.txt     Text File gbd_log-1.txt    

 Description   

Good evening!

Long story short: MariaDB randomly crashes after being restored (using a backup) and having its memory state restored by CRIU.

Command that triggers the bug: "vzctl resume <ctid>"

MariaDB 10.9.3 is installed inside an OpenVZ 7 container (Debian 10).

When resuming/restoring this CT using OpenVZ 7 commands, MariaDB sometimes crashes inside the container (mysqld got signal 8):

#0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=8) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#1  0x00005610ab644a47 in my_write_core (sig=sig@entry=8) at ./mysys/stacktrace.c:424
#2  0x00005610ab13e5c0 in handle_fatal_signal (sig=8) at ./sql/signal_handler.cc:355
#3  <signal handler called>
#4  0x00007f11b97db1e4 in __difftime (time1=1667850110, time0=0) at difftime.c:114
#5  0x00005610ab4d1ba1 in srv_monitor () at ./storage/innobase/srv/srv0srv.cc:1194
#6  srv_monitor_task () at ./storage/innobase/srv/srv0srv.cc:1281
#7  0x00005610ab5d9178 in tpool::thread_pool_generic::timer_generic::run (this=0x5610aea78b20) at ./tpool/tpool_generic.cc:343
#8  tpool::thread_pool_generic::timer_generic::execute (arg=0x5610aea78b20) at ./tpool/tpool_generic.cc:363
#9  0x00005610ab5d9e0b in tpool::task::execute (this=0x5610aea78b60) at ./tpool/task.cc:37
#10 tpool::task::execute (this=0x5610aea78b60) at ./tpool/task.cc:27
#11 0x00005610ab5d7a0f in tpool::thread_pool_generic::worker_main (this=0x5610ae6e2610, thread_var=0x5610ae6e2cc0) at ./tpool/tpool_generic.cc:580
#12 0x00007f11b9b26b2f in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#13 0x00007f11b9bfbfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#14 0x00007f11b981f06f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Please find attached MariaDB "error log", "gdb log" and "config CT" (contains all commands to reproduce environment).

If needed, I can also provide the CRIU "dump.log" and "restore.log" and the MariaDB core dump (246 MB).

Have a great evening!



 Comments   
Comment by Daniel Black [ 2022-11-07 ]

Signal 8 is a SIGFPE generated by the __difftime.

Looks like https://bugzilla.kernel.org/show_bug.cgi?id=4532, but seems too old. Probably not our bug.

Comment by Philippe [ 2022-11-07 ]

Hello Daniel.

Many thanks for your time and your help.

My knowledge of C is limited but, if I understand correctly, there is an inconsistency with the "last_monitor_time" value? (0, not set or undefined?)

OpenVZ 7.0.18 uses the RHEL7 kernel branch (3.10)

— uname -r
3.10.0-1160.53.1.vz7.185.3

— cat /etc/os-release
NAME="Virtuozzo"
VERSION="7.0.18"
ID="virtuozzo"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="OpenVZ release 7.0.18"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:virtuozzoproject:vz:7"
HOME_URL="http://www.virtuozzo.com"
BUG_REPORT_URL="https://bugs.openvz.org/"

Comment by Daniel Black [ 2022-11-07 ]

Sorry, I was wrong about the C code I originally posted. monitor_state.last_monitor_time is set to 0 a little later in the code.

difftime is a quite simple calculation implemented in glibc. It requires the floating point unit (FPU) of the processor to be enabled (as its arguments are actually double precision numbers). The SIGFPE is the result of attempting to use that processor feature before it has been initialized. Either a) the kernel should have fully initialized this before passing control to userspace code, or b) the kernel should enable the FPU and allow mariadb to continue.
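For reference, the call in frame #4 of the backtrace can be sketched in isolation. This is a minimal sketch, not MariaDB code: `elapsed_seconds` and `check_backtrace_values` are hypothetical names, and the two argument values are the ones shown in the backtrace (time1=1667850110, time0=0). On a system where floating point works normally, the call simply returns the difference as a double.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper (not MariaDB code): elapsed seconds as a double,
   obtained through difftime(), the function seen in frame #4. */
double elapsed_seconds(time_t now, time_t last_monitor_time)
{
    return difftime(now, last_monitor_time);
}

/* Self-check using the argument values shown in the backtrace. */
void check_backtrace_values(void)
{
    double d = elapsed_seconds((time_t)1667850110, (time_t)0);
    assert(d == 1667850110.0);
}
```

Calling `check_backtrace_values()` from a small `main` inside the restored container would be one way to see whether the same call fails outside mariadbd; that use is a suggestion, not something tested in this report.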

I've done a brief search on https://bugs.openvz.org/issues/?jql=text%20~%20%22SIGFPE%22 and https://bugzilla.redhat.com and have been unable to find this bug. So I recommend reporting it to OpenVZ with the reproducer attachments from here.

While the code at ./storage/innobase/srv/srv0srv.cc +1194 could be done with a non-double based subtraction, there are other parts of the codebase that use double numbers which could easily trigger it.
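A non-double based subtraction of the kind mentioned above could look roughly like the following. This is only a sketch under stated assumptions: `elapsed_seconds_int`, `interval_elapsed` and `threshold_seconds` are hypothetical names, not taken from srv0srv.cc, and it assumes `time_t` is an integer type (true on Linux with glibc, not guaranteed by ISO C). As noted, it would only sidestep floating point at this one call site.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper (not MariaDB code): elapsed whole seconds computed
   with integer arithmetic only. Assumes time_t is an integer type. */
int64_t elapsed_seconds_int(time_t now, time_t last)
{
    return (int64_t)now - (int64_t)last;
}

/* Hypothetical interval check; threshold_seconds is an illustrative
   parameter, not a value taken from srv0srv.cc. */
bool interval_elapsed(time_t now, time_t last, int64_t threshold_seconds)
{
    return elapsed_seconds_int(now, last) >= threshold_seconds;
}

/* Self-check using the argument values shown in the backtrace. */
void check_integer_elapsed(void)
{
    assert(elapsed_seconds_int((time_t)1667850110, (time_t)0) == 1667850110);
}
```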

Comment by Daniel Black [ 2022-11-07 ]

Closing as "Not our bug".

Thanks for the well-written bug report, which hopefully the OpenVZ folks can parse and correct.

Thanks for using MariaDB and reporting bugs.

Comment by Philippe [ 2022-11-07 ]

Again, many many thanks!

Seems related to CRIU project.

I will contact OpenVZ devs as you suggested.

MariaDB is a wonderful project, and devs like you responding to "normal" people is just awesome.

Have a great day!

Generated at Thu Feb 08 10:12:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.