[MDEV-30125] "[ERROR] mysqld got signal 11 ;" only on AMD EPYC 7773X servers versus Intel servers, and suspected poor performance/stability on AMD servers versus Intel servers


Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 10.6.11
Fix Version/s: N/A
Component/s: N/A
Labels: None
Environment:
Red Hat Enterprise Linux release 8.7 (Ootpa)
MariaDB-server-10.6.11-1.el8.x86_64

Description

Hello,
We had exactly the same MariaDB data (copied from the slaves as a full binary copy) synced to three Intel servers and two AMD servers acting as slaves, plus one AMD server acting as master.
The three Intel servers have differing specs (all lower than the AMD machines), while the AMD servers are all dual EPYC 7773X with 1 TB of RAM.
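Since the comparison rests on every server holding byte-identical data, one way to confirm that the copies really match is to hash every file in the data directory on each server (with MariaDB stopped) and compare the results. Below is a minimal Python sketch of that idea; `tree_digests` and `diff_trees` are hypothetical helpers, not part of any MariaDB tooling.

```python
import hashlib
from pathlib import Path


def tree_digests(root: str, chunk_size: int = 1 << 20) -> dict[str, str]:
    """Return {relative path: SHA-256 hex digest} for every regular file under root."""
    digests: dict[str, str] = {}
    base = Path(root)
    for path in sorted(base.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                # Read in chunks so multi-GB tablespace files never need to fit in memory.
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
            digests[path.relative_to(base).as_posix()] = h.hexdigest()
    return digests


def diff_trees(a: dict[str, str], b: dict[str, str]) -> list[str]:
    """List relative paths that exist on only one side or whose contents differ."""
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

Running `tree_digests("/db/mysql")` on two servers and passing both results to `diff_trees` lists the files that differ. Files that legitimately differ per server (relay logs, replication state, temporary tablespaces and so on) would have to be filtered out by hand; which ones those are depends on the setup, so the sketch does not guess.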

We noticed two major issues, both only on the AMD servers.

First, the AMD servers performed much worse than the Intel servers when used as master (the slaves see little load, so they are not really relevant here). Under high load an AMD master became nearly unresponsive, while the Intel servers, despite much lower specs, were much more stable as master.
We noticed that the problem starts exactly when rising load brings both CPUs into use.
We tested in production with
kernel.numa_balancing=0

Then with
kernel.numa_balancing=1
and ExecStart=numactl --interleave=all /usr/sbin/mariadbd......

Both configurations still performed very poorly exactly when rising load brought both CPUs into use. Note that we did not use such options on the Intel servers, because we never had any problems there and were not aware that the official MariaDB documentation recommends them.
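Because the slowdown appears exactly when load spreads across both sockets, it helps to see how the kernel groups CPUs into NUMA nodes and whether automatic NUMA balancing is on. The Python sketch below reads the standard Linux files `/sys/devices/system/node/node*/cpulist` and `/proc/sys/kernel/numa_balancing`; the helper names are hypothetical. On EPYC systems the number of NUMA nodes per socket can depend on BIOS settings, so a dual-socket machine may report more than two nodes.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as '0-3,8,10-11' into individual CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # memory-only nodes have an empty cpulist
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map each NUMA node id to the CPU ids the kernel assigns to it."""
    nodes: dict[int, list[int]] = {}
    for node_dir in Path(sysfs).glob("node[0-9]*"):
        node_id = int(node_dir.name[len("node"):])
        nodes[node_id] = parse_cpulist((node_dir / "cpulist").read_text())
    return dict(sorted(nodes.items()))


def numa_balancing(path: str = "/proc/sys/kernel/numa_balancing") -> str:
    """Return the current kernel.numa_balancing value as text ('0' means disabled)."""
    return Path(path).read_text().strip()
```

On an affected server, `numa_nodes()` shows which CPU ids a `--cpunodebind` value confines mariadbd to. `numactl --hardware` prints similar information, including per-node memory, from the command line.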

Stability was much better with
ExecStart=numactl --cpunodebind=1..... with or without the interleave option.
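When a numactl prefix like the ExecStart= lines above is applied through systemd, it would typically go in a drop-in file rather than an edited copy of the packaged unit. The Python sketch below only generates the text of such a drop-in. The function name, the default binary paths and a target file such as /etc/systemd/system/mariadb.service.d/numa.conf are assumptions. The report elides the remaining mariadbd arguments ("....."), so they are left for the caller to supply.

```python
def numactl_dropin(numactl_args: list[str], mariadbd_args: list[str],
                   numactl: str = "/usr/bin/numactl",
                   mariadbd: str = "/usr/sbin/mariadbd") -> str:
    """Build the text of a systemd drop-in that starts mariadbd under numactl.

    The empty 'ExecStart=' line clears the packaged command first; systemd only
    accepts several ExecStart= entries for Type=oneshot services.
    """
    cmd = " ".join([numactl, *numactl_args, mariadbd, *mariadbd_args])
    return "[Service]\nExecStart=\nExecStart=" + cmd + "\n"
```

For example, `numactl_dropin(["--cpunodebind=1"], [])` yields a `[Service]` section whose last line is `ExecStart=/usr/bin/numactl --cpunodebind=1 /usr/sbin/mariadbd`. After writing the file, `systemctl daemon-reload` and a service restart would be needed for systemd to pick it up.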

Even so, several metrics and the behavior under high load lead us to suspect that the AMD server still has many more slow queries and spikes than the Intel servers.

Second, we ran OPTIMIZE TABLE on one large table (about 120 GB) on all five slaves.
All the Intel servers finished relatively quickly, while both AMD servers crashed with
[ERROR] mysqld got signal 11 ;

Here is the error log:

221129 16:48:25 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.6.11-MariaDB-log
key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=29
max_threads=65537
thread_count=13
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 142902243 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7ef7fc0008e8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7efb1007bb18 thread_stack 0x49000

??:0(my_print_stacktrace)[0x564074e6533e]
??:0(handle_fatal_signal)[0x564074945295]
??:0(__restore_rt)[0x7f7c8e59ccf0]
??:0(void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag))[0x564074ccff0c]
??:0(void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag))[0x564074cd191c]
??:0(void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag))[0x564074cd2243]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x564074c31d2d]
??:0(mysql_alter_table(THD*, st_mysql_const_lex_string const*, st_mysql_const_lex_string const*, HA_CREATE_INFO*, TABLE_LIST*, Alter_info*, unsigned int, st_order*, bool, bool))[0x5640747ce4fc]
??:0(mysql_recreate_table(THD*, TABLE_LIST*, bool))[0x5640747cfea0]
??:0(MDL_ticket::~MDL_ticket())[0x56407483a38d]
??:0(fill_check_table_metadata_fields(THD*, List<Item>*))[0x56407483c851]
??:0(Sql_cmd_optimize_table::execute(THD*))[0x56407483d7ed]
??:0(mysql_execute_command(THD*, bool))[0x564074732519]
??:0(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x564074724703]
??:0(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool))[0x56407472e7a6]
??:0(do_command(THD*, bool))[0x56407472fed0]
??:0(tp_callback(TP_connection*))[0x5640748d48d3]
??:0(get_event(worker_thread_t*, thread_group_t*, timespec*))[0x564074ac30b0]
??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x564074b6713d]
??:0(start_thread)[0x7f7c8e5921ca]
:0(__GI___clone)[0x7f7c8d8e2e73]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7ef7fc013130): optimize table xe_comments_list
Connection ID (thread ID): 1790600
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off

The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /db/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             4127465              4127465              processes
Max open files            300000               300000               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       4127465              4127465              signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e

Kernel version: Linux version 4.18.0-425.3.1.el8.x86_64 (mockbuild@x86-vm-08.build.eng.bos.redhat.com) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-15) (GCC)) #1 SMP Fri Sep 30 11:45:06 EDT 2022

2022-11-29 16:48:33 0 [Note] mariadbd: Aria engine: starting recovery
recovered pages: 0% 14% 48% 64% 100% (0.0 seconds); tables to flush: 1 0
 (0.0 seconds);
2022-11-29 16:48:33 0 [Note] mariadbd: Aria engine: recovery done
2022-11-29 16:48:33 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2022-11-29 16:48:33 0 [Note] InnoDB: Number of pools: 1
2022-11-29 16:48:33 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2022-11-29 16:48:33 0 [Note] InnoDB: Using Linux native AIO
2022-11-29 16:48:33 0 [Note] InnoDB: Initializing buffer pool, total size = 549755813888, chunk size = 134217728
2022-11-29 16:48:35 0 [Note] InnoDB: Completed initialization of buffer pool
2022-11-29 16:48:36 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=35451806789271,35458406739008
2022-11-29 16:48:50 0 [Note] InnoDB: Read redo log up to LSN=35453331680768
2022-11-29 16:49:05 0 [Note] InnoDB: Read redo log up to LSN=35454931545600
2022-11-29 16:49:20 0 [Note] InnoDB: Read redo log up to LSN=35456493792768
2022-11-29 16:49:35 0 [Note] InnoDB: Read redo log up to LSN=35458012786176
2022-11-29 16:49:39 0 [Note] InnoDB: Starting final batch to recover 306852 pages from redo log.
2022-11-29 16:49:50 0 [Note] InnoDB: To recover: 232421 pages from log
2022-11-29 16:50:05 0 [Note] InnoDB: To recover: 205875 pages from log
2022-11-29 16:50:20 0 [Note] InnoDB: To recover: 186210 pages from log
2022-11-29 16:50:35 0 [Note] InnoDB: To recover: 169740 pages from log
2022-11-29 16:50:50 0 [Note] InnoDB: To recover: 155305 pages from log
2022-11-29 16:51:05 0 [Note] InnoDB: To recover: 142354 pages from log
2022-11-29 16:51:20 0 [Note] InnoDB: To recover: 130471 pages from log
2022-11-29 16:51:35 0 [Note] InnoDB: To recover: 119584 pages from log
2022-11-29 16:51:50 0 [Note] InnoDB: To recover: 109353 pages from log
2022-11-29 16:52:05 0 [Note] InnoDB: To recover: 99638 pages from log
2022-11-29 16:52:20 0 [Note] InnoDB: To recover: 90445 pages from log
2022-11-29 16:52:35 0 [Note] InnoDB: To recover: 81628 pages from log
2022-11-29 16:52:50 0 [Note] InnoDB: To recover: 73197 pages from log
2022-11-29 16:53:05 0 [Note] InnoDB: To recover: 65106 pages from log
2022-11-29 16:53:20 0 [Note] InnoDB: To recover: 57337 pages from log
2022-11-29 16:53:35 0 [Note] InnoDB: To recover: 49803 pages from log
2022-11-29 16:53:50 0 [Note] InnoDB: To recover: 42528 pages from log
2022-11-29 16:54:05 0 [Note] InnoDB: To recover: 35449 pages from log
2022-11-29 16:54:20 0 [Note] InnoDB: To recover: 28553 pages from log
2022-11-29 16:54:35 0 [Note] InnoDB: To recover: 21831 pages from log
2022-11-29 16:54:50 0 [Note] InnoDB: To recover: 15281 pages from log
2022-11-29 16:55:05 0 [Note] InnoDB: To recover: 8870 pages from log
2022-11-29 16:55:20 0 [Note] InnoDB: To recover: 2608 pages from log
2022-11-29 16:55:24 0 [Note] InnoDB: Last binlog file '/db/mysql/mysql-bin.001023', position 155517573
2022-11-29 16:55:24 0 [Note] InnoDB: 128 rollback segments are active.
2022-11-29 16:55:24 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2022-11-29 16:55:24 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2022-11-29 16:55:24 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2022-11-29 16:55:24 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2022-11-29 16:55:24 0 [Note] InnoDB: 10.6.11 started; log sequence number 35458407103175; transaction id 79095061155
2022-11-29 16:55:24 0 [Note] InnoDB: Loading buffer pool(s) from /db/mysql/ib_buffer_pool
2022-11-29 16:55:24 0 [Note] Plugin 'FEEDBACK' is disabled.
2022-11-29 16:55:24 0 [Note] Recovering after a crash using /db/mysql/mysql-bin
2022-11-29 16:55:24 0 [Note] Starting table crash recovery...
2022-11-29 16:55:24 0 [Note] Crash table recovery finished.
2022-11-29 16:55:25 0 [Note] DDL_LOG: Crash recovery executed 1 entries
2022-11-29 16:55:25 0 [Note] Server socket created on IP: '0.0.0.0'.
2022-11-29 16:55:25 0 [Note] Server socket created on IP: '::'.
2022-11-29 16:55:25 5 [Note] Slave I/O thread: Start asynchronous replication to master 'slave_user@[REDACTED]:3306' in log 'mysql-bin.000176' at position 331827117
2022-11-29 16:55:25 5 [Note] Slave I/O thread: connected to master 'slave_user@[REDACTED]:3306',replication started in log 'mysql-bin.000176' at position 331827117
2022-11-29 16:55:25 0 [Note] /usr/sbin/mariadbd: ready for connections.
Version: '10.6.11-MariaDB-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MariaDB Server
2022-11-29 16:55:25 6 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000176' at position 331827117, relay log '/db/mysql/relay-bin.000266' position: 238805436
2022-11-29 16:56:22 0 [Note] InnoDB: Buffer pool(s) load completed at 221129 16:56:22
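The "could use up to" line in the crash log is a theoretical worst case, not measured usage. The small Python function below evaluates the formula exactly as printed. The log does not show sort_buffer_size, so the example assumes the MariaDB default of 2 MiB (2097152 bytes) and reads "K" as 1024-byte units; the function name is hypothetical.

```python
def worst_case_kib(key_buffer_size: int, read_buffer_size: int,
                   sort_buffer_size: int, max_threads: int) -> int:
    """key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads, in KiB (floored)."""
    total_bytes = key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads
    return total_bytes // 1024


# Values from the crash log; sort_buffer_size is an assumed default, not a logged value.
estimate = worst_case_kib(
    key_buffer_size=33554432,
    read_buffer_size=131072,
    sort_buffer_size=2097152,
    max_threads=65537,
)
print(estimate, "K")  # prints 142641280 K
```

Under that assumption the result is slightly below the logged 142902243 K, so the server's own calculation appears to include something the printed formula does not show. Either way, the figure is driven by max_threads=65537 rather than the thread_count=13 at crash time, and it does not include the 512 GiB InnoDB buffer pool (total size = 549755813888) reported at restart.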

Of note, a week ago we used one of the AMD servers as the source for a binary copy: after shutting MariaDB down completely, we rsync-mirrored all the database data to the Intel servers and the other AMD server, and we have had no replication errors since. So the Intel servers actually hold a copy of the AMD server's data and had no errors, which suggests the bug may be related to running MariaDB on AMD servers.
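Both AMD slaves crashed during the same OPTIMIZE TABLE, so one useful check is whether their backtraces match frame by frame. The Python sketch below extracts the symbol and address from lines in the "??:0(symbol)[address]" form used in the crash log, where "??:0" means no source file and line were resolved. It compares function names rather than addresses on purpose: the 0x5640... addresses come from a position-independent binary and will generally differ between runs and hosts. The helper names are hypothetical.

```python
import re

# Matches frames such as "??:0(mysql_recreate_table(THD*, TABLE_LIST*, bool))[0x5640747cfea0]":
# a "file:line" prefix ("??:0" when unresolved; the file part may be missing),
# the symbol text, then the hex address in brackets.
FRAME_RE = re.compile(r"^[^\s(]*:\d+\((?P<symbol>.*)\)\[(?P<addr>0x[0-9a-fA-F]+)\]$")


def parse_backtrace(lines: list[str]) -> list[tuple[str, int]]:
    """Return (symbol, address) pairs for every backtrace frame found in the log lines."""
    frames: list[tuple[str, int]] = []
    for line in lines:
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group("symbol"), int(match.group("addr"), 16)))
    return frames


def function_names(frames: list[tuple[str, int]]) -> list[str]:
    """Drop argument lists so two crashes can be compared by function name only."""
    return [symbol.split("(", 1)[0] for symbol, _ in frames]
```

Applying `function_names(parse_backtrace(...))` to the lines of each AMD server's error log and comparing the two lists would show whether both crashes took the same code path.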


People

Assignee: Daniel Black

Reporter: F K

Votes: 1

Watchers: 5

Dates

Created: 2022-11-29 11:14

Updated: 2023-02-20 10:32

Resolved: 2023-02-20 10:32
