[MDEV-25806] SIGILL on FreeBSD Aarch64 Created: 2021-05-28  Updated: 2021-08-02

Status: Open
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.4.19
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Vincent Milum Jr Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: crash
Environment:

FreeBSD 13.0-RELEASE on Aarch64



 Description   

kernel: pid 12177 (mysqld), jid 0, uid 88: exited on signal 4

Fresh install of MariaDB 10.4.19 from pkg, or compiling fresh from Ports has the same result. Enabling debug information isn't really adding much.

This happens when initially launching MariaDB with the default basic DBs installed.

MariaDB 10.5.10 is not having this issue.

Core was generated by `/usr/local/libexec/mysqld --defaults-extra-file=/var/db/mysql/my.cnf --basedir=/'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x000000004234a360 in ?? ()
[Current thread is 1 (LWP 101757)]
(gdb) bt 
#0  0x000000004234a360 in ?? ()
#1  0x000000004234ca80 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)



 Comments   
Comment by Daniel Black [ 2021-05-28 ]

can you disassemble at the signal location in the debugger? Can you objdump -d (or otherwise disassemble) of mysqld around the 4234a360 address?

Comment by Daniel Black [ 2021-05-28 ]

Also its potentially MDEV-25807 if you are running something pre Armv8 try this https://salsa.debian.org/mariadb-team/mariadb-10.5/-/blob/master/debian/patches/revert-isb-assembly-instruction.patch#L1

Comment by Vincent Milum Jr [ 2021-05-28 ]

ARMv7 and earlier are 32-bit. As noted in the bug report title, this is specially on 64-bit ARM. The CPU in question is the Apple M1 that I'm using for debugging, but I first experienced the issue on a Raspberry Pi 4.

I'll try to get some more debugging information later today. But so far, as the backtrace notes, it looks like memory corruption of some kind.

Comment by Vincent Milum Jr [ 2021-05-29 ]

When running MariaDB from gdb, I get SIGSEGV, but when run it outside and it core dumps, I get SIGILL.

(gdb) file /usr/local/libexec/mysqld
Reading symbols from /usr/local/libexec/mysqld...
(gdb) run
Starting program: /usr/local/libexec/mysqld 
2021-05-29 22:33:56 0 [Note] /usr/local/libexec/mysqld (mysqld 10.4.19-MariaDB-debug) starting as process 1260 ...                                  
[New LWP 100631 of process 1260]
 
Thread 2 received signal SIGSEGV, Segmentation fault.
[Switching to LWP 100631 of process 1260]
0x000000004234a360 in ?? () from /lib/libgcc_s.so.1
(gdb) bt
#0  0x000000004234a360 in ?? () from /lib/libgcc_s.so.1
#1  0x000000004234ca80 in ?? () from /lib/libgcc_s.so.1
#2  0x0000000042349d20 in ?? () from /lib/libgcc_s.so.1
#3  0x0000000042348664 in ?? () from /lib/libgcc_s.so.1
#4  0x000000004234750c in _Unwind_Backtrace () from /lib/libgcc_s.so.1
#5  0x0000000041d38c98 in backtrace () from /usr/lib/libexecinfo.so.1
#6  0x00000000019382b0 in sf_malloc (size=208, my_flags=24) at safemalloc.c:141
#7  0x00000000019234a4 in my_malloc (size=208, my_flags=24) at my_malloc.c:101
#8  0x0000000001912d90 in my_multi_malloc (myFlags=24) at mulalloc.c:51
#9  0x000000000192f024 in safe_mutex_lazy_init_deadlock_detection (mp=0x2639ba8 <LOCK_timer>) at thr_mutex.c:159
#10 0x000000000192ed28 in safe_mutex_lock (mp=0x2639ba8 <LOCK_timer>, my_flags=0, 
    file=0x55ef7e "/usr/ports/databases/mariadb104-server/work/mariadb-10.4.19/mysys/thr_timer.c", line=273) at thr_mutex.c:319
#11 0x0000000001930f30 in inline_mysql_mutex_lock (that=0x2639ba8 <LOCK_timer>, 
    src_file=0x55ef7e "/usr/ports/databases/mariadb104-server/work/mariadb-10.4.19/mysys/thr_timer.c", src_line=273)
    at /usr/ports/databases/mariadb104-server/work/mariadb-10.4.19/include/mysql/psi/mysql_thread.h:717
#12 0x0000000001930a70 in timer_handler (arg=0x0) at thr_timer.c:273
#13 0x0000000042176fe0 in ?? () from /lib/libthr.so.3
#14 0x0000000042176b38 in pthread_create () from /lib/libthr.so.3
#15 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Comment by Sergei Golubchik [ 2021-06-05 ]

You show a crash in the debug build of the server. It seems that it's not really representative of what is happening in optimized builds. This one tries to allocate memory, this goes to our builtin malloc debugger safemalloc, it tries to grab a stack to remember where the allocation took place, invokes backtrace() from system library libexecinfo. And that crashes.

In the optimized build there's no safemalloc. But we do call backtrace() on SIGSEGV. So if backtrace() is for some reason broken, then you'll get a crash in crash signal handler. Which could explain a SIGILL.

You can try to pretend that there is no backtrace() with

cmake -DHAVE_BACKTRACE=OFF

it won't fix the crash, but it might stop obscuring the first crash with the second crash from a signal handler

Comment by Vincent Milum Jr [ 2021-06-05 ]

Actually, funny enough, just adding cmake -DHAVE_BACKTRACE=OFF allows MariaDB to start up and run normally, no crashing.

On FreeBSD...

A) installing from pkg: crashes (SIGILL)
B) compiling from ports: crashes (SIGILL)
C) adding HAVE_BACKTRACE: works normally

Comment by Sergei Golubchik [ 2021-06-17 ]

That suggests that either backtrace() from libexecinfo on FreeBSD aarch64 is somehow broken or we use it incorrectly in a way that works everywhere but the FreeBSD aarch64. Given that libexecinfo causes a crash even when not used, it suggests that something is wrong with the library indeed.

Comment by Vincent Milum Jr [ 2021-06-17 ]

I'm more thinking the latter rather than the former. Earlier versions of MariaDB 10.4 worked fine, and MariaDB 10.5 works fine. This was a regression in more resent versions of 10.4.

Comment by Sergei Golubchik [ 2021-06-17 ]

Do you think you could track it down to the exact version where it appeared? We don't have FreeBSD aarch64, but may be I'll be able to spot the change if I'll know where to look.

Comment by Vincent Milum Jr [ 2021-06-18 ]

I've been able to narrow things down quite a bit so far.

MariaDB 10.4.14 through 10.4.19 all have the issue. I've yet to test 10.4.13 and prior because it requires additional patches I'd need to go and hunt down in order to compile on FreeBSD Aarch64, which is getting too complex at the moment.

I tested with 10.3.29, and the same SIGILL issue happens there.

Sadly, 10.2 has been removed from the FreeBSD Ports collection last year, so I don't have a simple test case for that.

10.5.10 behaves correctly.
10.6.2 behaves correctly.

Now, for the interesting bits: mysql_install_db is what triggers the SIGILL, but not until after it populates the data directory. This may be why other users (myself included) have not caught this issue previously. Starting the service with an empty data directory on FreeBSD will automatically call mysql_install_db first, followed by starting the actual service. Since the directory was properly populated, attempting to start the service a second time succeeds. I'm sure myself and others just went "oh, it didn't fully launch the first time, let's try again" and it just worked at that point.

Alternatively users had a data directory from a previous version or copied from another server.

Either way, having a proper populated data directory misses the SIGILL.

This was just highly annoying from a Galera stand point, since on FreeBSD, the data directory is populated via mysql_install_db prior to starting the WSREP code, so it would SIGILL and fail very early on... and this was the scenario I was in when I discovered this issue.

Comment by Vincent Milum Jr [ 2021-06-18 ]

I ran truss mysql_install_db and narrowed it down to this interesting section.

Installing MariaDB/MySQL system tables in '/var/db/mysql' ...
write(1,"Installing MariaDB/MySQL system "...,62) = 62 (0x3e)
read(10,"daemon with:"\n  echo\n  echo " "...,1024) = 1024 (0x400)
pipe2(0xffffffffe7c8,0)				 = 0 (0x0)
fork()						 = 30674 (0x77d2)
close(4)					 = 0 (0x0)
pipe2(0xffffffffe7c8,0)				 = 0 (0x0)
fork()						 = 30676 (0x77d4)
close(3)					 = 0 (0x0)
close(5)					 = 0 (0x0)
fork()						 = 30677 (0x77d5)
close(4)					 = 0 (0x0)
wait4(-1,{ SIGNALED,sig=SIGPIPE },0x0,0x0)	 = 30676 (0x77d4)
wait4(-1,{ SIGNALED,sig=SIGPIPE },0x0,0x0)	 = 30674 (0x77d2)
wait4(-1,{ SIGNALED,sig=SIGILL },0x0,0x0)	 = 30677 (0x77d5)
fstatat(AT_FDCWD,"/usr/share/nls/C.UTF-8/libc.cat",0xffffffffe4e0,0x0) ERR#2 'No such file or directory'
fstatat(AT_FDCWD,"/usr/share/nls/libc/C.UTF-8",0xffffffffe4e0,0x0) ERR#2 'No such file or directory'
fstatat(AT_FDCWD,"/usr/local/share/nls/C.UTF-8/libc.cat",0xffffffffe4e0,0x0) ERR#2 'No such file or directory'
fstatat(AT_FDCWD,"/usr/local/share/nls/libc/C.UTF-8",0xffffffffe4e0,0x0) ERR#2 'No such file or directory'
Illegal instruction
write(2,"Illegal instruction\n",20)		 = 20 (0x14)

which is narrowed down to:

# Create the system and help tables by passing them to "mysqld --bootstrap"
s_echo "Installing MariaDB/MySQL system tables in '$ldata' ..."
if cat_sql | eval "$filter_cmd_line" | mysqld_install_cmd_line > /dev/null

Comment by Sergei Golubchik [ 2021-06-20 ]

The "interesting" part in mysql_install_db is just the line that invokes mysqld. This is where it had to crash, because it crashes inside mysqld, so this was very much expected.

10.5.10 doesn't crash, but 10.4.19 does — this should've been very helpful. But turned out quite confusing, I diffed all suspicious files between mariadb-10.4.19 and mariadb-10.5.10 and didn't find anything that could've explained the fix.

But as 10.3 crashes and 10.4.13-to-10.4.19 crashes, I would think it'd be a waste of time to try earlier 10.4 versions. If you still want to narrow it down — try earlier 10.5 versions. I could diff the whole tree if versions are close enough.

Comment by Vincent Milum Jr [ 2021-08-02 ]

Why was this closed? I think a "MariaDB crashes on startup" is a pretty serious error to keep tracking.

Comment by Sergei Golubchik [ 2021-08-02 ]

The issue was closed as incomplete, because it was waiting for feedback for more than a month.
Closing an issue isn't final, I would've reopened it if you'd provided more info.

See above — you said that 10.4.19 crashes and 10.5.10 doesn't. I looked at the delta, but it was huge and I wasn't able to see anything related. If would help a lot if you could narrow it down to two subsequent versions, it would produce a much smaller delta and I would be able to read every single line in it to see if it could've been the culprit.

Comment by Vincent Milum Jr [ 2021-08-02 ]

That body of work was already done

https://jira.mariadb.org/browse/MDEV-25806?focusedCommentId=192316&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-192316

It is a significantly larger body of work getting older MariaDB versions running on FreeBSD, because MariaDB has not been including our patch sets upstream in your repo, so the patches to make older versions even compile would have to be hunted down from old commits that have since been overwritten.

It sounds like you're essentially asking me to do a major body of work, not being an expert in the MariaDB codebase itself, instead of just having a MariaDB internel dev just install and run MariaDB on FreeBSD ARM who would have all the knowledge and debugging tools at their disposal.

Comment by Sergei Golubchik [ 2021-08-02 ]

Yes, I thought it would be easy for you to test few 10.5.x releases, given that you've tested a range of 10.3.x and 10.4.x versions.

But if not, we'll of course, have to do it. I'll mark this issue as no longer waiting for a feedback.

As we currently have no freebsd builders for any architecture in any of our CIs, it might take some time before we create a FreeBSD/ARM builder. And then I hope we'll be able to repeat the crash.

Generated at Thu Feb 08 09:40:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.