[MDEV-25806] SIGILL on FreeBSD Aarch64 Created: 2021-05-28 Updated: 2021-08-02 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Server |
| Affects Version/s: | 10.4.19 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Vincent Milum Jr | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | crash | ||
| Environment: |
FreeBSD 13.0-RELEASE on Aarch64 |
||
| Description |
|
kernel: pid 12177 (mysqld), jid 0, uid 88: exited on signal 4

A fresh install of MariaDB 10.4.19 from pkg and a fresh build from Ports both give the same result. Enabling debug information doesn't add much. The crash happens when MariaDB is first launched with the default basic databases installed. MariaDB 10.5.10 does not have this issue.
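The kernel message reports the exit as a bare signal number. As a quick cross-check, this minimal Python sketch maps a signal number to its symbolic name with the standard `signal` module. The helper name `signal_name` is only illustrative, and the numbering is that of the machine running the script (4 is SIGILL on FreeBSD and Linux).

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name (e.g. 'SIGILL') for a signal number on this platform."""
    return signal.Signals(number).name


if __name__ == "__main__":
    # The kernel line in the description says "exited on signal 4".
    print(signal_name(4))
```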
|
| Comments |
| Comment by Daniel Black [ 2021-05-28 ] | |||||||||||||||||||||||||||||||
|
Can you disassemble at the signal location in the debugger? Can you run objdump -d on mysqld (or otherwise disassemble it) around address 4234a360? | |||||||||||||||||||||||||||||||
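One way to limit the disassembly to the area around that address is objdump's `--start-address` and `--stop-address` options. The sketch below only builds the command line; it is an illustration, not something taken from the report. The helper name `objdump_window`, the 0x40-byte window, and the binary path are assumptions, and the address only lines up with the file's own addresses if the binary was not relocated at load time.

```python
import shlex


def objdump_window(binary: str, address: int, span: int = 0x40) -> list[str]:
    """Build an objdump command that disassembles `span` bytes either side of `address`."""
    start = max(address - span, 0)
    stop = address + span
    return [
        "objdump", "-d",
        f"--start-address={start:#x}",
        f"--stop-address={stop:#x}",
        binary,
    ]


if __name__ == "__main__":
    # Address from the comment above; the binary path is an assumed install location.
    cmd = objdump_window("/usr/local/libexec/mysqld", 0x4234A360)
    print(shlex.join(cmd))
```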
| Comment by Daniel Black [ 2021-05-28 ] | |||||||||||||||||||||||||||||||
|
Also, it's potentially | |||||||||||||||||||||||||||||||
| Comment by Vincent Milum Jr [ 2021-05-28 ] | |||||||||||||||||||||||||||||||
|
ARMv7 and earlier are 32-bit. As noted in the bug report title, this is specifically on 64-bit ARM. The CPU in question is the Apple M1 that I'm using for debugging, but I first experienced the issue on a Raspberry Pi 4. I'll try to get some more debugging information later today. But so far, as the backtrace notes, it looks like memory corruption of some kind. | |||||||||||||||||||||||||||||||
| Comment by Vincent Milum Jr [ 2021-05-29 ] | |||||||||||||||||||||||||||||||
|
When I run MariaDB under gdb, I get SIGSEGV, but when I run it outside gdb and it dumps core, I get SIGILL.
| |||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-06-05 ] | |||||||||||||||||||||||||||||||
|
You show a crash in the debug build of the server. It doesn't seem representative of what happens in optimized builds. The debug build tries to allocate memory; the allocation goes through safemalloc, our built-in malloc debugger, which tries to capture a stack trace to remember where the allocation took place, and so invokes backtrace() from the system library libexecinfo. And that crashes. In the optimized build there's no safemalloc, but we do call backtrace() on SIGSEGV. So if backtrace() is broken for some reason, you'll get a crash inside the crash signal handler, which could explain the SIGILL. You can try to pretend that there is no backtrace() with
It won't fix the crash, but it might stop the second crash, the one from the signal handler, from obscuring the first. | |||||||||||||||||||||||||||||||
| Comment by Vincent Milum Jr [ 2021-06-05 ] | |||||||||||||||||||||||||||||||
|
Actually, funnily enough, just adding cmake -DHAVE_BACKTRACE=OFF allows MariaDB to start up and run normally, with no crash. On FreeBSD... A) installing from pkg: crashes (SIGILL) | |||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-06-17 ] | |||||||||||||||||||||||||||||||
|
That suggests that either backtrace() from libexecinfo on FreeBSD aarch64 is somehow broken, or we use it incorrectly in a way that works everywhere except FreeBSD aarch64. Given that libexecinfo causes a crash even when not used, it suggests that something is indeed wrong with the library. | |||||||||||||||||||||||||||||||
| Comment by Vincent Milum Jr [ 2021-06-17 ] | |||||||||||||||||||||||||||||||
|
I'm leaning toward the latter rather than the former. Earlier versions of MariaDB 10.4 worked fine, and MariaDB 10.5 works fine. This was a regression in more recent versions of 10.4. | |||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-06-17 ] | |||||||||||||||||||||||||||||||
|
Do you think you could track it down to the exact version where it appeared? We don't have FreeBSD aarch64, but maybe I'll be able to spot the change if I know where to look. | |||||||||||||||||||||||||||||||
| Comment by Vincent Milum Jr [ 2021-06-18 ] | |||||||||||||||||||||||||||||||
|
I've been able to narrow things down quite a bit so far. MariaDB 10.4.14 through 10.4.19 all have the issue. I've yet to test 10.4.13 and earlier because they require additional patches that I'd need to hunt down in order to compile on FreeBSD Aarch64, which is getting too complex at the moment. I tested with 10.3.29, and the same SIGILL happens there. Sadly, 10.2 was removed from the FreeBSD Ports collection last year, so I don't have a simple test case for that. 10.5.10 behaves correctly. Now, for the interesting bits: mysql_install_db is what triggers the SIGILL, but not until after it populates the data directory. This may be why other users (myself included) have not caught this issue previously. Starting the service with an empty data directory on FreeBSD automatically calls mysql_install_db first and then starts the actual service. Since the directory was properly populated, attempting to start the service a second time succeeds. I'm sure I and others just went "oh, it didn't fully launch the first time, let's try again" and it just worked at that point. Alternatively, users had a data directory from a previous version or copied from another server. Either way, having a properly populated data directory avoids the SIGILL. This was just highly annoying from a Galera standpoint, since on FreeBSD the data directory is populated via mysql_install_db prior to starting the WSREP code, so it would SIGILL and fail very early on... and this was the scenario I was in when I discovered this issue. | |||||||||||||||||||||||||||||||
| Comment by Vincent Milum Jr [ 2021-06-18 ] | |||||||||||||||||||||||||||||||
|
I ran truss mysql_install_db and narrowed it down to this interesting section.
which is narrowed down to:
| |||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-06-20 ] | |||||||||||||||||||||||||||||||
|
The "interesting" part in mysql_install_db is just the line that invokes mysqld. This is where it had to crash, because it crashes inside mysqld, so this was very much expected. 10.5.10 doesn't crash, but 10.4.19 does: this should've been very helpful, but it turned out to be quite confusing. I diffed all suspicious files between mariadb-10.4.19 and mariadb-10.5.10 and didn't find anything that could've explained the fix. But as 10.3 crashes and 10.4.14 through 10.4.19 crash, I would think it'd be a waste of time to try earlier 10.4 versions. If you still want to narrow it down, try earlier 10.5 versions. I could diff the whole tree if the versions are close enough. | |||||||||||||||||||||||||||||||
| Comment by Vincent Milum Jr [ 2021-08-02 ] | |||||||||||||||||||||||||||||||
|
Why was this closed? I think "MariaDB crashes on startup" is a serious enough error to keep tracking. | |||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-08-02 ] | |||||||||||||||||||||||||||||||
|
The issue was closed as incomplete because it had been waiting for feedback for more than a month. See above: you said that 10.4.19 crashes and 10.5.10 doesn't. I looked at the delta, but it was huge and I wasn't able to see anything related. It would help a lot if you could narrow it down to two consecutive versions; that would produce a much smaller delta, and I would be able to read every single line in it to see if it could've been the culprit. | |||||||||||||||||||||||||||||||
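Narrowing a behaviour change down to two adjacent releases is a binary search over the ordered release list. The sketch below is only an illustration of that idea, not a procedure anyone in this thread used. The `crashes` callback is hypothetical (for example, something that builds a release and runs mysql_install_db), and the search assumes that the first release in the list crashes, the last one does not, and the fix stays in once it appears.

```python
from typing import Callable, Sequence, Tuple


def last_bad_first_good(versions: Sequence[str],
                        crashes: Callable[[str], bool]) -> Tuple[str, str]:
    """Binary-search an ordered release list for the adjacent (crashing, fixed) pair.

    Assumes versions[0] crashes, versions[-1] does not, and that the fix,
    once present, stays in all later releases.
    """
    # Invariant: versions[lo] crashes, versions[hi] does not.
    lo, hi = 0, len(versions) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[lo], versions[hi]
```

For a list of eleven candidate releases this needs at most four test builds beyond the two endpoints.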
| Comment by Vincent Milum Jr [ 2021-08-02 ] | |||||||||||||||||||||||||||||||
|
That body of work was already done. It is a significantly larger body of work getting older MariaDB versions running on FreeBSD, because MariaDB has not been including our patch sets upstream in your repo, so the patches to make older versions even compile would have to be hunted down from old commits that have since been overwritten. It sounds like you're essentially asking me to do a major body of work, not being an expert in the MariaDB codebase itself, instead of just having a MariaDB internal dev, who would have all the knowledge and debugging tools at their disposal, install and run MariaDB on FreeBSD ARM. | |||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-08-02 ] | |||||||||||||||||||||||||||||||
|
Yes, I thought it would be easy for you to test a few 10.5.x releases, given that you've tested a range of 10.3.x and 10.4.x versions. But if not, we'll, of course, have to do it. I'll mark this issue as no longer waiting for feedback. As we currently have no FreeBSD builders for any architecture in any of our CIs, it might take some time before we create a FreeBSD/ARM builder. And then I hope we'll be able to repeat the crash. |