[MDEV-26803] Galera crash - Assertion. Possible parallel writeset problem - Jira

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.4.21, 10.4.22, 10.5(EOL), 10.6
Fix Version/s: 10.4.23, 10.5.14, 10.6.6, 10.7.2
Component/s: Galera
Labels: None
Environment: Ubuntu Bionic, using community packages from the MariaDB repo. Also reproduced with build_43208 of 10.4.22.

Description

We are experiencing a crash of all Galera nodes that receive write sets. The operation is a "last resort" clean-up stored procedure that deletes many rows from the same set of related tables. With our data size it generally takes 4-5 minutes to run, but when it goes wrong, the crash happens within 10-20 seconds.

We have used this stored procedure reasonably regularly, without problems, on 10.1 for several years. As suggested by Enterprise support, I have also tried it on the latest 10.4 build, for which they provided a URL. That build also exhibits the problem.

Unfortunately, I have been unable to reproduce the fault either with simplified reproduction steps or on a different system of ours. However, I have been able to take a "mariabackup" (i.e. physical) backup and reproduce the fault on 2 other clusters. The original cluster and the first reproduction were on VMware machines; the third system is an AWS EC2 setup. All 3 have the same MariaDB configuration. I suspect the problem is exposed by the particular on-disk data.

Attached is the log of one of the nodes receiving the writeset.

In the first round of testing, I found that autocommit needs to be ON for the crash to occur.

Because I suspected the data, and knew that our QA team had been trying to delete rows, I started my test again and ran "OPTIMIZE TABLE" on the tables that are touched. This caused

[ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table mediator.SEQUENCE; Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213; handler error HA_ERR_LOCK_DEADLOCK; the event's master log FIRST, end_log_pos 276, Internal MariaDB error code: 1213

to appear in the log, at an unusual point in the crash logging.

Because of that finding, I have now set wsrep_slave_threads = 1, and with this setting the procedure completes successfully. Previously the value was 12. I have also tested with 4, which also crashed.
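For anyone wanting to try the same workaround, a minimal sketch of the runtime change follows. It assumes a MariaDB version in which the variable is spelled wsrep_slave_threads (as in the title of MDEV-27547) and can be changed at runtime with SET GLOBAL. The statements would be run on each node that applies writesets.

```sql
-- Check how many applier threads this node currently uses
-- (the report mentions 12 as the previous value).
SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads';

-- Workaround from the report: apply replicated writesets serially.
SET GLOBAL wsrep_slave_threads = 1;

-- The report found the crash only with autocommit ON; check the session value.
SELECT @@autocommit;
```

A value set with SET GLOBAL does not survive a server restart, so it would also need to go into the server configuration file to persist. Applying writesets with a single thread is likely to reduce replication throughput, so this is a mitigation rather than a fix.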

With this additional knowledge, I presume that Galera is assuming it can apply certain writesets in parallel when it cannot.

Attachments

gdb.txt (2021-12-01 18:20, 205 kB, Jan Lindström)
mysql-receivingNode.log (2021-10-11 15:51, 6 kB, Brendon Abbott)
second-of-crash.combined.log (2021-12-02 10:06, 110 kB, Walter Doekes)
storedProcedure.sql (2021-10-12 12:08, 1.0 kB, Brendon Abbott)
table-structure.sql (2021-10-12 11:52, 2 kB, Brendon Abbott)
unable-to-read-page.fatal.log (2021-11-30 16:16, 4.96 MB, Walter Doekes)

Issue Links

relates to

MDEV-27115 10.4.22 segfault at SELECT RELEASE_LOCK() in ull_get_key (bad MDL_ticket) (Closed)

MDEV-27547 Galera node INCONSISTENT state on DELETE with FKs having wsrep_slave_threads > 1 (Closed)

Activity

People

Assignee: Jan Lindström (Inactive)

Reporter: Brendon Abbott

Votes: 1

Watchers: 12

Dates

Created: 2021-10-11 15:52

Updated: 2024-07-07 21:01

Resolved: 2021-12-20 13:13

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 8d 35m
