[MDEV-33014] Transaction only partially applied Created: 2023-12-13 Updated: 2024-01-05 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.11.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Xan Charbonnet | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Debian, using MariaDB from mariadb.org |
||
| Description |
|
We have a Galera cluster with 5 total machines. 4 of them perform overnight processing each night. This consists of a Perl script that starts a transaction and then touches many tables. There are on the order of 1000 of these transactions each night. Each of the last two nights, 1 of these transactions was only partially committed. I've added a lot of logging inside of these transactions to try to catch this happening. Each transaction modifies many rows in a particular table, typically incrementing one value in that table, and then often creating a corresponding entry in a log table. For the incident last night, I have logging showing that:
For the particular account that was affected last night, there were 296 of these rows that were updated in this way, all in one transaction. At the same time, each of the 4 servers was processing 4 such transactions, so a total of 16 of these transactions were underway at the time. The first 94 updated rows did not have their updates committed. The remaining 202 rows did. The log table has no entries corresponding to the row updates, even though I have the AUTO_INCREMENT values in syslog. The updates for the 94 rows, including the log table insertion, do not appear in the binary log. There isn't any kind of cutoff or anything interesting between the 94 and the 202: certainly no commit or rollback. We were looping through the rows, and some of the rows had their changes committed and some did not. What should I do to diagnose this further? Thank you! |
| Comments |
| Comment by Xan Charbonnet [ 2023-12-28 ] |
|
Update: it seems that an example of this problem was happening most nights. To make the logs more useful, I started processing the rows in order (so that my logging would help me identify affected ranges of row IDs). Since making that change, this issue has not been observed. I'm planning to revert the change and see if it crops back up, probably next week. |
| Comment by Xan Charbonnet [ 2024-01-04 ] |
|
I reverted the change yesterday, and the overnight processing did demonstrate the problem. Among other inserts and updates are insertions into my log table, from which rows are never deleted. I have auto-increment IDs in the syslog for each of these created rows. There are two transactions from last night where some of the log entries exist and some do not. All events in the transaction before a certain point failed to commit, but all events after that point were successfully committed. This problem seems to disappear when I process the rows in primary key order, which is what I will go back to doing. It's fairly distressing that this can happen, though, and I'm left hoping that it isn't happening anywhere else. |
| Comment by Xan Charbonnet [ 2024-01-05 ] |
|
Well, it does seem to be less common when sorted this way, but it did happen again last night: there was a transaction where the first 18 of 103 rows failed to be updated. |