Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Incomplete
Affects Version/s: 10.3.29, 10.5.9, 10.5
Fix Version/s: None
Description
Hi.
We are doing some testing with more load on our current MariaDB multi-master (3-node) setup. However, we have seen multiple times that the database stopped responding on all 3 nodes. On closer inspection, it was always one node stuck with system user processes like these, which stay there sleeping indefinitely:
| 1 | system user | | NULL | Sleep | 531 | Write_rows_log_event::write_row(17936337) | insert into volume ... REST OF SQL QUERY | 0.000 |
| 5 | system user | | NULL | Sleep | 3849 | Update_rows_log_event::find_row(17819732) | UPDATE `volume` SET ... REST OF SQL QUERY | 0.000 |
Locks then started to build up and processes began to hang. There was nothing we could do to unblock this other than restarting the affected node; after the restart, everything started working again. We have checked the queries involved and they differ each time. Primary keys are defined on the table(s).
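While a node is stuck like this, it may help to capture which transaction is blocking which. A minimal diagnostic sketch, assuming the standard information_schema InnoDB tables available in MariaDB 10.5 (run on the affected node while the hang is in progress):

-- Sketch: map each waiting transaction to the transaction blocking it,
-- using the information_schema lock-wait tables (names as in MariaDB 10.5).
SELECT
    r.trx_mysql_thread_id AS waiting_thread,
    r.trx_query           AS waiting_query,
    b.trx_mysql_thread_id AS blocking_thread,
    b.trx_query           AS blocking_query,
    b.trx_started         AS blocking_since
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;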
Our current setup config:

mariadb.cnf:

[client]
default-character-set = utf8

[mysqld]
core-file
unix_socket = OFF
performance_schema = ON
character-set-server = utf8
collation-server = utf8_general_ci
ignore-db-dirs = lost+found
max_connections = 350
interactive_timeout = 450
wait_timeout = 450
join_buffer_size = 524288
key_buffer_size = 600MB

# InnoDB tuning
innodb_log_file_size = 600MB
innodb_buffer_pool_size = 5800MB
galera.cnf:

[galera]
user = mysql
bind-address = 0.0.0.0
default_storage_engine = InnoDB
binlog_format = ROW
innodb_autoinc_lock_mode = 2
innodb_flush_log_at_trx_commit = 2
query_cache_size = 0
query_cache_type = 0

# MariaDB Galera settings
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_sst_method=rsync
wsrep_slave_threads=8
wsrep_sync_wait=7

# Cluster settings (automatically updated)
wsrep_cluster_address=gcomm://
wsrep_cluster_name=mysql
wsrep_node_address=127.0.0.1
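To rule out configuration drift between the nodes, the effective values can be double-checked at runtime on each node, for example:

-- Sketch: verify the settings most relevant here on each node.
SHOW GLOBAL VARIABLES
WHERE Variable_name IN ('wsrep_slave_threads', 'wsrep_sync_wait',
                        'wsrep_cluster_address', 'innodb_autoinc_lock_mode');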
Is there some specific case in which this can happen, or a reason why we keep hitting this error?
Are there options we could set so that this does not happen and the cluster does not get stuck?
I have tried to dig deeper into this, but I am not really sure why it happens or how to eliminate it. It appears random, but it has started to happen quite often.
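For reference, the standard wsrep status counters can be sampled on each node while a hang is in progress, which may show whether flow control or a growing receive queue is involved; a sketch of what to collect:

-- Sketch: sample the Galera status counters on every node during a hang.
SHOW GLOBAL STATUS
WHERE Variable_name IN ('wsrep_local_state_comment', 'wsrep_flow_control_paused',
                        'wsrep_local_recv_queue_avg', 'wsrep_local_send_queue_avg',
                        'wsrep_cert_deps_distance');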
UPDATE: another case today:
MariaDB [(none)]> show full processlist;
+----+-------------+------+------+---------+-------+-------------------------------------------+----------------------+----------+
| Id | User        | Host | db   | Command | Time  | State                                     | Info                 | Progress |
+----+-------------+------+------+---------+-------+-------------------------------------------+----------------------+----------+
|  1 | system user |      |      | Busy    | 76698 | wsrep aborter idle                        | NULL                 |    0.000 |
|  2 | system user |      | NULL | Sleep   | 34574 | Write_rows_log_event::write_row(18047602) | insert into snapshot |    0.000 |
|  7 | system user |      | NULL | Sleep   | 34602 | Write_rows_log_event::write_row(18047598) | insert into volume   |    0.000 |
|  6 | system user |      | NULL | Sleep   | 34620 | wsrep applier committed                   | NULL                 |    0.000 |
| 10 | system user |      | NULL | Sleep   | 34573 | Write_rows_log_event::write_row(18047603) | insert into snapshot |    0.000 |
|  8 | system user |      | NULL | Sleep   | 34608 | wsrep applier committed                   | NULL                 |    0.000 |
|  5 | system user |      | NULL | Sleep   | 34602 | Write_rows_log_event::write_row(18047599) | insert into volume(  |    0.000 |
|  9 | system user |      | NULL | Sleep   | 34620 | wsrep applier committed                   | NULL                 |    0.000 |
| 11 | system user |      | NULL | Sleep   | 34603 | Write_rows_log_event::write_row(18047596) | insert into volume   |    0.000 |
When I checked the logs, it seems there was a signal 11 around that time:
2021-04-22 19:23:02 59446 [Warning] Aborted connection 59446 to db: 'storage' user: 'root' host: '10.255.5.166' (Got an error reading communication packets)
2021-04-22 19:25:02 59566 [Warning] Aborted connection 59566 to db: 'storage' user: 'root' host: '10.255.5.166' (Got an error reading communication packets)
2021-04-22 19:26:02 59614 [Warning] Aborted connection 59614 to db: 'storage' user: 'root' host: '10.255.5.166' (Got an error reading communication packets)
2021-04-22 19:29:01 59804 [Warning] Aborted connection 59804 to db: 'storage' user: 'root' host: '10.255.5.166' (Got an error reading communication packets)
2021-04-22 19:30:01 59868 [Warning] Aborted connection 59868 to db: 'storage' user: 'root' host: '10.255.5.166' (Got an error reading communication packets)
210422 19:33:18 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.5.9-MariaDB-1:10.5.9+maria~focal
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=33
max_threads=252
thread_count=43
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 685804 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...

stack_bottom = 0x0 thread_stack 0x49000
mysqld(my_print_stacktrace+0x32)[0x55a4a3daeaf2]
mysqld(handle_fatal_signal+0x485)[0x55a4a38046f5]
However, at that time it did not create a crash dump. I have since sent signal 11 to the mysqld process, which generated a crash dump. Hopefully it will help with this case: https://file.io/yZfXkbIUYe0w