[MDEV-33014] Transaction only partially applied Created: 2023-12-13  Updated: 2024-01-05

Status: Open
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.11.6
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Xan Charbonnet Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Debian, using MariaDB from mariadb.org



 Description   

We have a Galera cluster with 5 total machines. 4 of them perform overnight processing each night. This consists of a Perl script that starts a transaction and then touches many tables.

There are on the order of 1000 of these transactions each night.

Each of the last two nights, 1 of these transactions was only partially committed.

I've added a lot of logging inside of these transactions to try to catch this happening. Each transaction modifies many rows in a particular table, typically incrementing one value in that table, and then often creating a corresponding entry in a log table.

For the incident last night, I have logging showing that:

  • The row was considered.
  • An updated value was computed.
  • The row was updated (the return value of my $dbh->do() call was 1).
  • A fresh SELECT for that row showed the new value.
  • A row in our log table was created for the event, and I have the value of the AUTO_INCREMENT for that log row.

For the particular account that was affected last night, there were 296 of these rows that were updated in this way, all in one transaction. At the same time, each of the 4 servers was processing 4 such transactions, so a total of 16 of these transactions were underway at the time.

The first 94 updated rows did not have their updates committed. The remaining 202 rows did. The log table has no entries corresponding to the row updates, even though I have the AUTO_INCREMENT values in syslog. The updates for the 94 rows, including the log table insertion, do not appear in the binary log. There isn't any kind of cutoff or anything interesting between the 94 and the 202: certainly no commit or rollback. We were looping through the rows, and some of the rows had their changes committed and some did not.

What should I do to diagnose this further?

Thank you!



 Comments   
Comment by Xan Charbonnet [ 2023-12-28 ]

Update: it seems that an example of this problem was happening most nights. To make the logs more useful, I started processing the rows in order (so that my logging would help me identify affected ranges of row IDs). Since making that change, this issue has not been observed.

I'm planning to revert the change and see if it crops back up, probably next week.

Comment by Xan Charbonnet [ 2024-01-04 ]

I reverted the change yesterday, and the overnight processing did demonstrate the problem. Among other inserts and updates are insertions into my log table, from which rows are never deleted. I have auto-increment IDs in the syslog for each of these created rows.

There are two transactions from last night where some of the log entries exist and some do not. All events in the transaction before a certain point failed to commit, but all events after that point were successfully committed.

This problem seems to disappear when I process the rows in primary key order, which is what I will go back to doing. It's fairly distressing that this can happen, though, and I'm left hoping that it isn't happening anywhere else.

Comment by Xan Charbonnet [ 2024-01-05 ]

Well, it does seem to be less common when sorted this way, but it did happen again last night: there was a transaction where the first 18 of 103 rows failed to be updated.

Generated at Thu Feb 08 10:35:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.