[MDEV-30125] "[ERROR] mysqld got signal 11 ;" only on AMD EPYC 7773X servers versus intel servers, and suspected poor performance/stability on AMD servers versus intel servers Created: 2022-11-29 Updated: 2023-02-20 Resolved: 2023-02-20 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | N/A |
| Affects Version/s: | 10.6.11 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | F K | Assignee: | Daniel Black |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | None |
| Environment: | Red Hat Enterprise Linux release 8.7 (Ootpa) |
| Description |
|
Hello, Two major issues were noted only on AMD servers. First, we noticed that the AMD servers perform much worse than the Intel servers (when used as master; the slaves are not really necessary and carry little load): they become nearly unresponsive under high load, while Intel servers with much lower specs were much more stable (when used as master). Then with Both options had really poor performance exactly when both CPUs are used due to load increase. Of note, we did not use such options on the Intel servers, because we never had any problems and did not know that such options are recommended in the official MariaDB documentation. We had much better stability with Still, we suspect from several metrics and from the stability under high load that the server has many more slow queries and spikes than the Intel servers. Second, we ran an OPTIMIZE command on one large table (about 120GB) on all 5 slaves. Here is the error log:
Of note, we used one of the AMD servers to binary copy (after shutting down MariaDB completely) and rsync mirror all the database data to the Intel and other AMD servers a week ago, and had no replication errors. So the Intel server is actually a copy of the AMD server data and had no error, so perhaps some bug is related to AMD servers and MariaDB. |
| Comments |
| Comment by F K [ 2022-11-29 ] |
|
Additional note: The AMD servers are also run with "ExecStart=numactl --cpunodebind=1 --interleave=all ...." due to suspected severe performance problems when 2 CPUs are used. Hyperthreading was turned off the whole time on the AMD servers (as they have a total of 128 cores, we didn't think it would help much). |
| Comment by Marko Mäkelä [ 2022-11-29 ] |
|
Can you please try to identify the crashing instruction, similar to how it was done in As a starting point, you should install the debug symbols for the server. I think that in rpm-based packages, the package name has -debuginfo in it. Then, try to produce a full stack trace using a debugger. In GDB, I would additionally suggest
in the crashing thread (usually but not always it is thread 1). Also, which location did you download the package from? I think that it will be necessary to examine the exact same binary code to figure out what is going on. |
| Comment by F K [ 2022-11-29 ] |
|
> Can you please try to identify the crashing instruction, similar to how it was done > Also, which location did you download the package from? I think that it will be necessary to examine the exact same binary code to figure out what is going on. One question: do you have any opinion or comments regarding the performance problems we suspect on the dual AMD EPYC 7773X servers? Do you think we would have better performance on single EPYC CPU servers, or would you recommend just using Intel servers? Thank you. |
| Comment by Marko Mäkelä [ 2022-11-29 ] |
|
FK, the problem is that the built-in stack trace reporter is garbage, at least in anything above mysql_alter_table(). Something apparently crashes during a table rebuild that is being executed during OPTIMIZE TABLE. InnoDB used to crash on corrupted data until I downloaded the following two packages and will figure out how to extract the files on my Debian system: MariaDB-server-10.6.11-1.el8.x86_64.rpm But, I will need the correct address of the crashing instruction as a starting point. I am afraid that the hexadecimal addresses in the log output are subject to ASLR. |
| Comment by F K [ 2022-11-29 ] |
|
I deleted one comment that I found to be unrelated to this problem and caused by the oom-killer. The bug in this ticket had no such error. Maybe this will help: the following log was noted in /var/log/messages (of note, the segfault line and the next kernel Code line are the same on both servers): Nov 29 16:48:27 f95 kernel: mariadbd[4095472]: segfault at 7efa413d6000 ip 0000564074ccff0c sp 00007efb10076ab0 error 4 in mariadbd[56407 |
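The fields of the kernel segfault line quoted above can be pulled apart mechanically. The following is a minimal sketch, not part of any MariaDB or kernel tooling: `parse_segfault` and `decode_error` are hypothetical helper names, and the bit meanings assume the x86 page-fault error code layout (bit 0: protection violation rather than page not present, bit 1: write access, bit 2: fault raised in user mode). The line is used exactly as quoted, including its truncation after `mariadbd[56407`.

```python
import re

# Log line as quoted in the comment above (truncated in the report itself).
LINE = (
    "Nov 29 16:48:27 f95 kernel: mariadbd[4095472]: segfault at 7efa413d6000 "
    "ip 0000564074ccff0c sp 00007efb10076ab0 error 4 in mariadbd[56407"
)

# Hypothetical pattern for the "segfault at ... ip ... sp ... error ..." part.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
)


def parse_segfault(line):
    """Return the process name, pid and hexadecimal fields of a segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "addr": int(m["addr"], 16),    # faulting memory address
        "ip": int(m["ip"], 16),        # instruction pointer at the fault
        "sp": int(m["sp"], 16),        # stack pointer at the fault
        "error": int(m["error"], 16),  # page-fault error code (printed in hex)
    }


def decode_error(code):
    """Decode the three low bits of an x86 page-fault error code."""
    return {
        "protection_violation": bool(code & 1),  # False: page was not present
        "write": bool(code & 2),                 # False: the access was a read
        "user_mode": bool(code & 4),             # True: fault happened in user space
    }


info = parse_segfault(LINE)
print(info["comm"], info["pid"], hex(info["ip"]), decode_error(info["error"]))
```

Under these assumptions, `error 4` reads as a user-mode read of an address that was not mapped.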
| Comment by Daniel Black [ 2022-11-30 ] |
|
I think I can get somewhere with that: ip - base_address (0x564073f96000) = 0xD39F0C
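The subtraction above can be checked with a few lines of Python. This is only a sketch of the arithmetic: `image_offset` is a hypothetical helper name, the instruction pointer is taken from the kernel log line in the previous comment, and the base address is the one stated in this comment. How the resulting offset is then resolved against the binary and its debuginfo is outside this sketch.

```python
def image_offset(ip, base):
    """Offset of an instruction pointer relative to the mapping base address."""
    if ip < base:
        raise ValueError("instruction pointer lies below the base address")
    return ip - base


IP = 0x0000564074CCFF0C  # "ip" field of the kernel segfault line
BASE = 0x564073F96000    # base address given in the comment above

offset = image_offset(IP, BASE)
print(hex(offset))  # prints 0xd39f0c
```

Because the absolute addresses change between runs under ASLR, the offset is the value worth carrying into a debugging session; the raw `ip` is not.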
On the crash it appears systemd-coredump has captured the coredump. After installing debuginfo packages (dnf install MariaDB-server-debuginfo; note: they don't need to be installed at the time of the crash, only at debugging), and assuming gdb is installed too, we'll look at the backtrace and the code generated. Then we can look for AMD-incompatible code and our build chain.
On performance measurements, I suggest using perf record -g -p $(pidof mariadbd) / perf report and comparing the two, maybe with some flame graphs. Note systemd v243 onwards has better NUMA options - https://www.freedesktop.org/software/systemd/man/systemd.exec.html#NUMAPolicy= |
| Comment by F K [ 2022-11-30 ] |
|
Just a note: no core dump file. [root@[REDACTED] ~]# coredumpctl Coredump entry has no core attached (neither internally in the journal nor externally on disk). |
| Comment by F K [ 2022-11-30 ] |
|
Additional note: The same DB and table were binary copied from the non-crashed Intel server (with more binlogs applied, and after the optimize had completed successfully on the Intel server). We track all config changes through a monitoring tool. |
| Comment by F K [ 2022-11-30 ] |
|
One additional note regarding "First of all we noticed the AMD servers have much poor performance compared to Intel servers (when used as master, the slave are really unnecessary and have little load), nearly being unresponsive with high load, while Intel servers having much lower specs were much more stable (when used as master)." When changing from Intel to AMD servers, we made three major changes other than the hardware and a bigger InnoDB buffer. However, the "[ERROR] mysqld got signal 11" error was not found on any of the 3 Intel servers running 10.6.11, only on both AMD servers running 10.6.11. I do not know whether that is a coincidence. |
| Comment by Marko Mäkelä [ 2022-11-30 ] |
|
FK, thanks to danblack’s help, we now know that the crash occurs during an online table rebuild, while applying log that was written by concurrently running DML operations. Somehow the log file is corrupted. I do not think that we have seen this in our internal testing. Have you tried to run sudo memtester on the affected system? This could still be a software bug, requiring some (mis)fortune to reproduce it. Were you able to identify the OPTIMIZE TABLE statement that was running during the crash? Could we get the SHOW CREATE TABLE output for that table? You can obfuscate the names of the table, the indexes, and the columns, but not the data types and the indexes. MDEV-28880 might be a duplicate of |
| Comment by F K [ 2022-11-30 ] |
|
> Have you tried to run sudo memtester on the affected system? This is the schema with redacted names: CREATE TABLE `[TABLENAME]` ( This is the command used: |
| Comment by F K [ 2022-11-30 ] |
|
Another note: I made a typo above; it should be https://jira.mariadb.org/browse/MDEV-29988 |
| Comment by Marko Mäkelä [ 2022-11-30 ] |
|
FK, the table structure looks rather basic. Only NOT NULL integer columns, no virtual columns. I hope that you can enable core dumps so that when it occurs again, we will have something more to debug on. Note: ever since |
| Comment by F K [ 2022-11-30 ] |
|
Do you have a recommended method for doing so for MariaDB (as I am not familiar with core dumps)? |
| Comment by Marko Mäkelä [ 2022-12-01 ] |
|
danblack should be able to guide you through reproducing the problem. As you can find in https://mariadb.org/contribute/, a better option than this ticket system could be Zulip chat. His approximate time zone is UTC+1100. I would really appreciate the effort, so that we can solve the mystery. |
| Comment by F K [ 2023-02-20 ] |
|
I just tested with the backup data I saved during the crash.
This ticket can be closed, as I failed to reproduce the issue. |
| Comment by Daniel Black [ 2023-02-20 ] |
|
You too FK. Thanks for reporting back. |