[MDEV-27308] 3 problems encountered when a node fails while a Galera fragmented transaction is running Created: 2021-12-19  Updated: 2022-06-15

Status: Open
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: 10.5.12
Fix Version/s: 10.5

Type: Bug Priority: Major
Reporter: William Wong Assignee: Alexey
Resolution: Unresolved Votes: 0
Labels: None
Environment: Red Hat 7 on VMware

Attachments: Text File galera-donor-desync.txt    

 Description   

Hi,

In our production environment, Galera transaction fragmentation (streaming replication) is used for running batch jobs. In several incidents (tmp directory full, VM reboot during a hardware memory issue), we encountered the following 3 problems.

Problem #1: SST is triggered to recover the failed node, although IST is expected.
Problem #2: In some tests, the failed node crashes repeatedly with signal 11 until node 1 commits.
Problem #3: The local node state of the donor node unexpectedly changes to "Donor/Desynced" after the failed node has recovered.
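
Problem #3 can be observed through the wsrep_local_state and wsrep_local_state_comment status variables on the donor. The following is a minimal sketch, not part of the attached test case, assuming the rows of SHOW GLOBAL STATUS LIKE 'wsrep_local_state%' have already been fetched as (name, value) pairs. The helper name describe_local_state and the sample rows are hypothetical.

```python
# State codes as documented for the wsrep_local_state status variable.
WSREP_LOCAL_STATES = {1: "Joining", 2: "Donor/Desynced", 3: "Joined", 4: "Synced"}


def describe_local_state(status_rows):
    """Return (code, comment, is_synced) from (name, value) status rows.

    status_rows is assumed to come from
    SHOW GLOBAL STATUS LIKE 'wsrep_local_state%'.
    """
    status = dict(status_rows)
    code = int(status["wsrep_local_state"])
    # Prefer the comment reported by the server; fall back to the table above.
    comment = status.get(
        "wsrep_local_state_comment", WSREP_LOCAL_STATES.get(code, "Unknown")
    )
    return code, comment, code == 4


# Hypothetical rows, as a donor stuck in the state described in Problem #3 might report.
rows = [
    ("wsrep_local_state", "2"),
    ("wsrep_local_state_comment", "Donor/Desynced"),
]
print(describe_local_state(rows))
```

A donor that has finished serving a state transfer would normally be expected to report code 4 (Synced) again.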

The workaround is to restart the node manually. However, Galera should recover automatically after a hardware issue, using IST in most cases.

A repeatable test case (galera-donor-desync.txt) is attached.

