[MDEV-32896] Unstable XA + binlog tests, with possible MDEV-32830 caused issues Created: 2023-11-28  Updated: 2024-02-03

Status: Stalled
Project: MariaDB Server
Component/s: Binary Protocol, Tests, XA
Affects Version/s: N/A
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Roel Van de Paar Assignee: Roel Van de Paar
Resolution: Unresolved Votes: 0
Labels: affects-tests

Issue Links:
Blocks
is blocked by MDEV-32830 refactor XA binlogging for better int... In Review
Relates

 Description   

The following tests: binlog_xa_recover, binlog_xa_prepared_disconnect, and binlog_empty_xa_prepared have proven to be unstable.

The issue is that they fail differently, and in a variety of ways, on base 10.6 and on the MDEV-32830 patch trees. As such, MTR stress testing of XA + binlog on the MDEV-32830 patch is not possible, and MDEV-32830 may be causing different or additional issues.

Given this, these tests will need to be fixed and stabilized before signoff on the MDEV-32830 patch can happen.

The failures occur even when run in single-thread instances (verified), but various issues can be made to show quickly (< ~1 minute) using:

rm -Rf /dev/shm/var_auto*; MTR_MEM=/dev/shm ./mysql-test-run --repeat=35 --parallel=30 --mem --force binlog_xa_recover{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} | tee mtr_output.txt
rm -Rf /dev/shm/var_auto*; MTR_MEM=/dev/shm ./mysql-test-run --repeat=35 --parallel=30 --mem --force binlog_xa_prepared_disconnect{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} | tee mtr_output.txt
rm -Rf /dev/shm/var_auto*; MTR_MEM=/dev/shm ./mysql-test-run --repeat=35 --parallel=30 --mem --force binlog_empty_xa_prepared{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} | tee mtr_output.txt
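
Note that each of the three commands above writes to the same mtr_output.txt, so a later run overwrites the log of an earlier one. A minimal sketch of a wrapper that keeps one log per test follows. The helper names repeat_name and stress_one, the log file naming, and the copy count of 30 (chosen here to match --parallel=30, and possibly not identical to what the brace expansion above produces) are assumptions for illustration and are not part of the original report; the MTR options are copied unchanged from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print NAME repeated COUNT times, space-separated.
# This mimics what the name{,,,...} brace expansion yields
# (N commas expand to N+1 copies of the name).
repeat_name() {
  name=$1
  count=$2
  out=""
  i=0
  while [ "$i" -lt "$count" ]; do
    out="${out:+$out }$name"
    i=$((i + 1))
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: stress one test and log to a per-test file.
# COPIES defaults to 30, an assumed value matching --parallel=30.
stress_one() {
  test_name=$1
  copies=${2:-30}
  rm -Rf /dev/shm/var_auto*
  # The unquoted substitution is intentional: each copy must become
  # a separate argument to mysql-test-run.
  MTR_MEM=/dev/shm ./mysql-test-run --repeat=35 --parallel=30 --mem --force \
    $(repeat_name "$test_name" "$copies") | tee "mtr_output_${test_name}.txt"
}

# Only attempt the runs when started from a directory containing mysql-test-run.
if [ -x ./mysql-test-run ]; then
  for t in binlog_xa_recover binlog_xa_prepared_disconnect binlog_empty_xa_prepared; do
    stress_one "$t"
  done
fi
```

Keeping a separate log per test should also make it easier to paste the relevant trace for a single test into a comment.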



 Comments   
Comment by Roel Van de Paar [ 2023-11-28 ]

Regrettably I am also seeing, more sporadically, issues with binlog_xa_checkpoint, binlog_xa_handling and xa_binlog, though the last of these thus far on base only.

binlog_xa_checkpoint and binlog_xa_handling will thus also need to be checked. For binlog_xa_handling, issues have been seen only on the patch tree thus far.

Additionally, and possibly of interest, xa_binlog is considerably faster (17 seconds versus 70 seconds for 1085 tests) on the patch tree than on base. For this testcase, there seems to be clear parallelism at work in the patch tree, unlike on base.

Comment by Andrei Elkin [ 2023-11-28 ]

roel, I can't confirm by running them locally the way you did. On both bb-10.6-MDEV-31949 and the vanilla 10.6.
Crossref is really slow but I could query out of it a list of binlog_xa_recover failures which may confirm the test is unstable.

Let me ask you to paste the 10.6 and bb-10.6-MDEV-31949 traces in two separate comments, so that I can try to match them or explain any difference.
Let's start with binlog_xa_recover and binlog_xa_prepared_disconnect.

Comment by Andrei Elkin [ 2023-11-28 ]

> I am also seeing, more sporadically, issues with binlog_xa_checkpoint
In which branch?
The test has been altered in 31949 in 9de57a483e7. Previously it must've been non-deterministic.
So let's do the same as above: since I could not reproduce on 31949, I need traces.

Please always paste them - even for your own records - as apparently mtr invocation references may not suffice for one with different env.

Comment by Roel Van de Paar [ 2023-11-29 ]

> I can't confirm by running them locally the way you did. On both bb-10.6-MDEV-31949 and the vanilla 10.6.
On bb-10.6-MDEV-31949, binlog_xa_recover looks stable, testing others.

> In which branch?
bb-10.6-MDEV-32830-qa before, but now testing bb-10.6-MDEV-31949.

> The test has been altered in 31949 in 9de57a483e7. Previously it must've been non-deterministic.
Understood, it looks like it.

> Please always paste them - even for your own records - as apparently mtr invocation references may not suffice for one with different env.
Agreed, and I generally would. In this case it wasn't crash/assert stack traces, but various "somewhat random errors" scrolling for many pages.

Comment by Roel Van de Paar [ 2024-01-19 ]

This is waiting for MDEV-32830 for the moment, so I have reversed the blocker direction. Retesting is required once MDEV-32830 and MDEV-31949 are ready for testing.

Generated at Thu Feb 08 10:34:51 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.