[MDEV-25469] transactions blocked by system user rows_log_event process that never finish Created: 2021-04-21  Updated: 2022-02-01  Resolved: 2022-01-31

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB, wsrep
Affects Version/s: 10.3.29, 10.5.9, 10.5
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Jaroslav Assignee: Seppo Jaakola
Resolution: Incomplete Votes: 3
Labels: None

Attachments: File error_20210702.log.gz     Text File system.log    

 Description   

Hi.
We are doing some testing with more load on our current mariadb multi-master (3) setup. However we've seen multiple times where our database stopped responding on all 3 nodes. When we checked closer it was always one node stuck with system user process like these which sleep there never-ending:

|     1 | system user  |                    | NULL       | Sleep   |   531 | Write_rows_log_event::write_row(17936337) | insert into volume ... REST OF SQL QUERY |    0.000 |
|      5 | system user |                    | NULL                | Sleep   |  3849 | Update_rows_log_event::find_row(17819732) | UPDATE `volume` SET ... REST OF SQL QUERY |    0.000 |

Then LOCKS started to buildout and processes to hang. There was nothing we could do to unblock this only restart the affected node. After restart of affected node everything seems to start again. We've checked the queries and they differ. The primary keys are set on the table(s).
Our current setup config

  mariadb.cnf: |
    [client]
    default-character-set = utf8
    [mysqld]
    core-file
    unix_socket = OFF
    performance_schema = ON
    character-set-server = utf8
    collation-server = utf8_general_ci
    ignore-db-dirs = lost+found 
    max_connections = 350
    interactive_timeout = 450 
    wait_timeout = 450
    join_buffer_size = 524288
    key_buffer_size = 600MB
    # InnoDB tuning
    innodb_log_file_size = 600MB
    innodb_buffer_pool_size = 5800MB
 galera.cnf: |
    [galera]
    user = mysql
    bind-address = 0.0.0.0
    default_storage_engine = InnoDB
    binlog_format = ROW
    innodb_autoinc_lock_mode = 2
    innodb_flush_log_at_trx_commit = 2
    query_cache_size = 0
    query_cache_type = 0
 
    # MariaDB Galera settings
    wsrep_on=ON
    wsrep_provider=/usr/lib/galera/libgalera_smm.so
    wsrep_sst_method=rsync
    wsrep_slave_threads=8
    wsrep_sync_wait=7
 
    # Cluster settings (automatically updated)
    wsrep_cluster_address=gcomm://
    wsrep_cluster_name=mysql
    wsrep_node_address=127.0.0.1

Is there some specific case when this can happen or why we keep hitting this error?
Are there maybe some specific options we can specify for this to not happen and get stuck?
I've tried to dig deeper into this but not really sure why it happens and how to eliminate this. It's "random" but starts to happen quite often.

UPDATE another case today:

MariaDB [(none)]> show full processlist ;
+-------+--------------+--------------------+---------------------+---------+-------+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
| Id    | User         | Host               | db                  | Command | Time  | State                                     | Info                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             | Progress |
+-------+--------------+--------------------+---------------------+---------+-------+-------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|     1 | system user  |                    |                     | Busy    | 76698 | wsrep aborter idle                        | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |    0.000 |
|     2 | system user  |                    | NULL                | Sleep   | 34574 | Write_rows_log_event::write_row(18047602) | insert into snapshot |    0.000 |
|     7 | system user  |                    | NULL                | Sleep   | 34602 | Write_rows_log_event::write_row(18047598) | insert into volume |    0.000 |
|     6 | system user  |                    | NULL                | Sleep   | 34620 | wsrep applier committed                   | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |    0.000 |
|    10 | system user  |                    | NULL                | Sleep   | 34573 | Write_rows_log_event::write_row(18047603) | insert into snapshot |    0.000 |
|     8 | system user  |                    | NULL                | Sleep   | 34608 | wsrep applier committed                   | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |    0.000 |
|     5 | system user  |                    | NULL                | Sleep   | 34602 | Write_rows_log_event::write_row(18047599) | insert into volume( |    0.000 |
|     9 | system user  |                    | NULL                | Sleep   | 34620 | wsrep applier committed                   | NULL                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |    0.000 |
|    11 | system user  |                    | NULL                | Sleep   | 34603 | Write_rows_log_event::write_row(18047596) | insert into volume|    0.000 |

When I checked logs it seems there was signal 11 around that time

2021-04-22 19:23:02 59446 [Warning] Aborted connection 59446 to db: 'storage' user: 'root' host: '10.255.5.166' (Got an error reading communication packets)
2021-04-22 19:25:02 59566 [Warning] Aborted connection 59566 to db: 'storage' user: 'root' host: '10.255.5.166' (Got an error reading communication packets)
2021-04-22 19:26:02 59614 [Warning] Aborted connection 59614 to db: 'storage' user: 'root' host: '10.255.5.166' (Got an error reading communication packets)
2021-04-22 19:29:01 59804 [Warning] Aborted connection 59804 to db: 'storage' user: 'root' host: '10.255.5.166' (Got an error reading communication packets)
2021-04-22 19:30:01 59868 [Warning] Aborted connection 59868 to db: 'storage' user: 'root' host: '10.255.5.166' (Got an error reading communication packets)
210422 19:33:18 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.9-MariaDB-1:10.5.9+maria~focal
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=33
max_threads=252
thread_count=43
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 685804 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
mysqld(my_print_stacktrace+0x32)[0x55a4a3daeaf2]
mysqld(handle_fatal_signal+0x485)[0x55a4a38046f5]

However at that time it didn't create crashdump. I've send signal 11 to mysqld process which generated crash. Hopefully will help for this case https://file.io/yZfXkbIUYe0w



 Comments   
Comment by Gavin Pidgley [ 2021-04-23 ]

Hi,

We also see the same issue on MariaDB 10.3.28 (10.3.28+maria~bionic).

Example of a problematic query:

| 2 | system user | | NULL | Sleep | 624 | Write_rows_log_event::write_row(21328484) | INSERT INTO `tm_options` (`option_name`, `option_value`, `autoload`) VALUES ('_transient_doing_cron', '1619186903.2794630527496337890625', 'yes') ON DUPLICATE KEY UPDATE `option_name` = VALUES(`option_name`), `option_value` = VALUES(`option_value`), `autoload` = VALUES(`autoload`) | 0.000 |

Same symptoms of queries backing up on all nodes until mysqld is killed on the node in question.

Comment by Jan de Lalène [ 2021-07-05 ]

Hello,
I think we have the same issue here, (from "show processlist") we see the following process stuck
(here the sample from 2021-07-02 09:07:01):
1 system user NULL Sleep 50 Write_rows_log_event::write_row(115259450) INSERT INTO ..

attached the extract from /var/log/mysql/error.log error_20210702.log.gz

currently we cannot reproduce (force) the error, but observe it 1-5 times a week
any need for additional logs we are glad to help

Comment by Seppo Jaakola [ 2021-12-30 ]

To troubleshoot this further, more information is needed of the execution state in the hanging node. In jaroslav scenario, the table definitions of the related tables 'volume' and 'snapshot' are of interest, and the SQL statements executed in the hanging transactions.
Note, that if the table definitions contain foreign key constraints, the problem could relate to recently fixed issue: MDEV-26803

For all the above incarnations of this type of issue, configuring wsrep_provider_options with: 'cert.optimistic_pa=NO', would be effective workaround.

Comment by Jan de Lalène [ 2022-02-01 ]

Just as a hint - we have "fixed" the issue (at least is has never occurred since):
we have increased wsrep_slave_threads (from default 1) to 12 according to the "calculation" indicated in https://galeracluster.com/library/kb/parallel-slave-threads.html

Comment by Steve Baroti [ 2022-02-01 ]

Hello Jan de Lalene,
Thank you for your hint, unfortunately this used to happen on our cluster, 2x CPU VMs, with 8 galera slave threads, the setting based on the calculation from documentation: wsrep_slave_threads = 8. The updated MariaDB-10.3.32 cluster is stable again - won't upgrade anytime soon Thanks, SB.

Generated at Thu Feb 08 09:37:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.