Details
-
Bug
-
Status: Open (View Workflow)
-
Critical
-
Resolution: Unresolved
-
10.11.6
-
None
-
None
-
Debian, using MariaDB from mariadb.org
Description
We have a Galera cluster with 5 total machines. 4 of them perform overnight processing each night. This consists of a Perl script that starts a transaction and then touches many tables.
There are on the order of 1000 of these transactions each night.
Each of the last two nights, 1 of these transactions was only partially committed.
I've added a lot of logging inside of these transactions to try to catch this happening. Each transaction modifies many rows in a particular table, typically incrementing one value in that table, and then often creating a corresponding entry in a log table.
For the incident last night, I have logging showing that:
- The row was considered.
- An updated value was computed.
- The row was updated (the return value of my $dbh->do() call was 1).
- A fresh SELECT for that row showed the new value.
- A row in our log table was created for the event, and I have the value of the AUTO_INCREMENT for that log row.
For the particular account that was affected last night, there were 296 of these rows that were updated in this way, all in one transaction. At the same time, each of the 4 servers was processing 4 such transactions, so a total of 16 of these transactions were underway at the time.
The first 94 updated rows did not have their updates committed. The remaining 202 rows did. The log table has no entries corresponding to the row updates, even though I have the AUTO_INCREMENT values in syslog. The updates for the 94 rows, including the log table insertion, do not appear in the binary log. There isn't any kind of cutoff or anything interesting between the 94 and the 202: certainly no commit or rollback. We were looping through the rows, and some of the rows had their changes committed and some did not.
What should I do to diagnose this further?
Thank you!