[MDEV-30718] Cluster hanging regularly on Update_rows_log_event - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Incomplete
Affects Version/s: 10.10.2
Fix Version/s: N/A
Component/s: None
Labels: None
Environment: Ubuntu 22.04

Description

Hi experts,

We regularly face (at least once a week, three times today) an outage in our 5-node cluster (plus 1 additional node for backup tasks).

In every case, one of the nodes is stuck at this point:

MariaDB [(none)]> show processlist;
+--------+-------------+--------------------+---------------------------------+---------+-------+----------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id     | User        | Host               | db                              | Command | Time  | State                                                    | Info                                                                                                 | Progress |
+--------+-------------+--------------------+---------------------------------+---------+-------+----------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
|      1 | system user |                    | NULL                            | Sleep   | 85248 | wsrep aborter idle                                       | NULL                                                                                                 |    0.000 |
|      2 | system user |                    | ****                            | Sleep   |     0 | Update_rows_log_event::find_row(46124856) on table `ocr` | UPDATE ocr SET ascii='7jgoqHVSDCCQfQt/17uwcDkwcjq428o5yt+adtVkQtUvbGI5vNN3F7T9OMFD5Td7TS9RW00gbJU/It |    0.000 |
|      6 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|      5 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|      7 | system user |                    | NULL                            | Sleep   |     4 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|      9 | system user |                    | NULL                            | Sleep   |     0 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|      8 | system user |                    | NULL                            | Sleep   |     0 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     13 | system user |                    | NULL                            | Sleep   |     0 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     11 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     12 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     15 | system user |                    | NULL                            | Sleep   |     0 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     14 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL                                                                                                 |    0.000 |
|     16 | system user |                    | NULL                            | Sleep   |     1 | wsrep applier committed                                  | NULL

All other queries then remain in the Updating or Waiting for certification state (see the attached screenshot).
Each time, we have to find out which node has the hanging Update_rows_log_event and then restart that VM. Stopping the database service almost never works, so we often have to reboot the VM. After that, all other nodes work normally, and once restarted, the affected node syncs again.
But we lose the queries that were queued on that machine.
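
The manual search for the affected node could be scripted. The following is only a minimal sketch, not a fix and not anything taken from MariaDB itself: it assumes the output of `show processlist` has been captured from each node in the boxed table form shown above, and it flags `system user` threads whose State starts with `Update_rows_log_event::find_row`. The function names (`parse_processlist`, `find_stuck_appliers`, `persistent_stuck`) are hypothetical. Because the Time column of the hanging thread shows 0 in the output above, the sketch does not rely on it and instead compares two samples taken some time apart.

```python
# State prefix of the hanging applier thread, as seen in the report.
STUCK_STATE_PREFIX = "Update_rows_log_event::find_row"


def parse_processlist(text):
    """Parse the boxed table printed by `show processlist` into dicts.

    Separator lines, blank lines and truncated rows (no closing '|')
    are skipped. Limitation: a literal '|' inside the Info column
    changes the cell count, so such a row is dropped as well.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # first table row holds the column names
            continue
        if len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows


def find_stuck_appliers(rows):
    """Return system threads currently inside find_row()."""
    return [
        row for row in rows
        if row.get("User") == "system user"
        and row.get("State", "").startswith(STUCK_STATE_PREFIX)
    ]


def persistent_stuck(sample_a, sample_b):
    """Return threads that show the identical find_row state in both samples.

    Assumption: an applier that is still reported with the same Id and the
    same State in two samples taken some time apart is treated as hanging.
    """
    before = {
        row["Id"]: row["State"]
        for row in find_stuck_appliers(parse_processlist(sample_a))
    }
    after = find_stuck_appliers(parse_processlist(sample_b))
    return [row for row in after if before.get(row["Id"]) == row["State"]]
```

Running `persistent_stuck` on two captures from every node would point at the node to restart. How far apart the samples need to be is a judgment call that depends on how long a legitimate row lookup takes on the table in question.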

I think this happens when there is heavy traffic on the different nodes,
but that is only an assumption.

Is there any hint on how we can find out what is happening? How can we investigate it?
We cannot find any hint in journalctl. How should we proceed?

Regards

Marc

Attachments

processlist.png (44 kB, 2023-02-23 15:38, Marc Bachmann)
mycnf.txt (2 kB, 2023-02-23 15:53, Marc Bachmann)
journalctl.txt (5 kB, 2023-02-23 15:50, Marc Bachmann)

Issue Links

relates to

MDEV-29346 update_rows_log_event hung causing galera cluster failure

Closed

People

Assignee: Jan Lindström

Reporter: Marc Bachmann

Votes: 0

Watchers: 3

Dates

Created: 2023-02-23 15:57

Updated: 2024-04-29 11:30

Resolved: 2024-04-29 11:30
