[MDEV-15918] Configure buildbot so that Galera suites are run with lower concurrency Created: 2018-04-18  Updated: 2023-09-28  Resolved: 2023-09-28

Status: Closed
Project: MariaDB Server
Component/s: Galera, Tests
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Jan Lindström (Inactive) Assignee: Unassigned
Resolution: Won't Fix Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-19937 Galera test failures on 10.2/10.3 Closed

 Description   

Affected suites:

  • wsrep
  • galera
  • galera_3nodes

These should be run so that no other suite is run in parallel.



 Comments   
Comment by Elena Stepanova [ 2018-04-27 ]

I have no problem doing it technically, but we need to have justification for it first. Running tests without parallelism is an expensive exercise and it shouldn't be done just because it seems easier than fixing badly written tests. For example, if it's just a hope that the tests will run faster and thus will hit fewer timeouts, that's not a good enough reason.

Also, please note that before we do it, you need to remove them from the list of default suites. If they can't run in a normal fashion, they can't be a part of the default set.

Comment by Seppo Jaakola [ 2018-05-17 ]

With the galera and galera_3nodes suites, mtr deploys a synchronous MariaDB cluster, where nodes need to maintain consensus. In a highly loaded test environment, nodes may not get enough CPU cycles to keep up communicating with each other, and this may lead to a cluster split. Some sporadic test failure logs suggest that this scenario has happened in buildbot testing.
There is nothing wrong with this in itself: a cluster split is the natural reaction of a synchronous cluster to isolate a badly behaving part of the cluster and prevent a service break. However, a cluster split is not an expected result in mtr testing, and will lead to test failures.
It is possible to extend cluster timeouts to tolerate longer communication breakages, but I would not go too far in this. Extending timeouts may lead to other side effects, which would also have to be worked around somehow. For example, the cluster has a strict commit order, and clients cannot commit until the cluster has re-configured itself. This may trigger client-side timeouts or lock wait timeouts, and thus, again, test failures.
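
To make the timeout-extension idea concrete, here is a minimal sketch that scales a set of Galera EVS timeouts by a common factor and renders them as a provider options string. The baseline values, the choice of parameters, and the helper names (`iso_seconds`, `provider_options`) are assumptions made for illustration. They are not taken from this ticket or from the eventual pull request, and should be checked against the Galera version in use.

```python
# Illustrative sketch only: scale Galera EVS timeouts by a common factor and
# render them in the "name=value;name=value" form used by wsrep_provider_options.
from typing import Dict

# Assumed baseline values in seconds (hypothetical; verify for your Galera version).
BASE_TIMEOUTS: Dict[str, float] = {
    "evs.suspect_timeout": 5.0,
    "evs.inactive_timeout": 15.0,
    "evs.install_timeout": 7.5,
}


def iso_seconds(seconds: float) -> str:
    """Format a number of seconds as an ISO 8601 duration, e.g. PT10S or PT7.5S."""
    return f"PT{seconds:g}S"


def provider_options(factor: float, base: Dict[str, float] = BASE_TIMEOUTS) -> str:
    """Multiply every baseline timeout by `factor` and join them with ';'."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return ";".join(
        f"{name}={iso_seconds(value * factor)}" for name, value in base.items()
    )


if __name__ == "__main__":
    # Doubling the assumed baselines.
    print(provider_options(2))
```

The resulting string would be placed in `wsrep_provider_options` in the suite configuration. Scaling all values by the same factor keeps their relative order, but as noted above, longer cluster timeouts may in turn require longer client-side and lock wait timeouts.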

Comment by Elena Stepanova [ 2018-05-17 ]

It doesn't make sense to me. If test cases are designed so that they fail just because the machine is slow, reducing parallelism is not a solution to anything. There will still be slow builders, there might still be delays of various sorts, and failures will still be happening.

That said, if dbart and serg both agree to it, I can make it happen, but it absolutely means that all affected Galera tests must be excluded from the default test set before we reduce parallelism for them.

Comment by Sergei Golubchik [ 2018-05-17 ]

I agree with elenst and I'd rather increase timeouts. You cannot know how slow the builder is. Even with no parallelism in the builder, buildslave aidi still runs 30 builders in parallel. A test has no control over that.

Comment by Seppo Jaakola [ 2018-05-30 ]

OK then, I will extend galera timeouts in the galera and galera_3nodes suites, and create a pull request with that. It is probable that more tests will fail after this, but we can fix them one by one.

Comment by Alexey Bychko (Inactive) [ 2019-07-02 ]

I'll check if we have this on Azure

Comment by Elena Stepanova [ 2023-09-28 ]

We have already reduced parallelism for galera tests significantly (sadly).
I don't think the remaining parallel=2 is an important factor in the current galera problems. When/If it is, Galera people will raise it again, I expect.
