[MDEV-30718] Cluster hanging regularly on Update_rows_log_event Created: 2023-02-23  Updated: 2023-02-23

Status: Open
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.10.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Marc Bachmann Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Ubuntu 22.04


Attachments: Text File journalctl.txt     Text File mycnf.txt     PNG File processlist.png    

 Description   

Hi experts,

We regularly face outages (at least once a week, three times today) in our 5-node cluster (plus 1 additional node for backup tasks).

In every case, one of the nodes is stuck at this point:

MariaDB [(none)]> show processlist;
+--------+-------------+--------------------+---------------------------------+---------+-------+----------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id     | User        | Host               | db                              | Command | Time  | State                                                    | Info                                                                                                 | Progress |
+--------+-------------+--------------------+---------------------------------+---------+-------+----------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
|      1 | system user |                    | NULL                            | Sleep   | 85248 | wsrep aborter idle                                       | NULL                                                                                                 |    0.000 |
|      2 | system user |                    | ****                                            | Sleep   |     0 | Update_rows_log_event::find_row(46124856) on table `ocr` | UPDATE ocr SET ascii='7jgoqHVSDCCQfQt/17uwcDkwcjq428o5yt+adtVkQtUvbGI5vNN3F7T9OMFD5Td7TS9RW00gbJU/It |    0.000 |
|      6 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|      5 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|      7 | system user |                    | NULL                            | Sleep   |     4 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|      9 | system user |                    | NULL                            | Sleep   |     0 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|      8 | system user |                    | NULL                            | Sleep   |     0 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     13 | system user |                    | NULL                            | Sleep   |     0 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     11 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     12 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     15 | system user |                    | NULL                            | Sleep   |     0 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     14 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     16 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL      

All other queries then remain in the Updating or Waiting for certification state (see the attached screenshot, processlist.png).
Every time, we have to find out which node has the hanging Update_rows_log_event and then restart that VM. Stopping the database service almost never works; often we have to reboot the VM. After that, all other nodes work normally, and once restarted, the affected node syncs again.
However, we lose the queued queries on that machine.
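
The manual step of finding the affected node could be sketched roughly as follows. This is only an illustration, not something we run in production. It assumes the SHOW PROCESSLIST rows of each node have already been fetched into dictionaries keyed by the column names shown above. The function names `find_stuck_appliers` and `nodes_with_stuck_applier` are hypothetical, and how the rows are collected from each node is left open. A single match is not proof of a hang, so the check would presumably have to be repeated over several snapshots.

```python
# Sketch only: each row is a dict using the SHOW PROCESSLIST column names.
# Fetching the rows from every node is outside the scope of this example.
STUCK_STATE_PREFIX = "Update_rows_log_event::find_row"


def find_stuck_appliers(rows):
    """Return the Ids of system-user threads whose State starts with find_row."""
    stuck = []
    for row in rows:
        state = row.get("State") or ""  # State may be NULL/None
        if row.get("User") == "system user" and state.startswith(STUCK_STATE_PREFIX):
            stuck.append(row["Id"])
    return stuck


def nodes_with_stuck_applier(processlists):
    """Map node name -> matching thread Ids, omitting nodes without a match.

    `processlists` maps a (hypothetical) node name to its processlist rows.
    """
    result = {}
    for node, rows in processlists.items():
        ids = find_stuck_appliers(rows)
        if ids:
            result[node] = ids
    return result
```

Called with one processlist snapshot per node, `nodes_with_stuck_applier` returns only the nodes that currently show the find_row state, which is the node we would otherwise look for by hand.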

I think this happens when there is a lot of traffic on the different nodes,
but that is only an assumption.

Is there any hint as to how we can find out what is happening? How can we investigate this?
We cannot find any hint in journalctl. How should we proceed?

Regards

Marc


Generated at Thu Feb 08 10:18:23 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.