[MDEV-25835] mariadb 10.3.29 galera cluster crashes with errors like: "[ERROR] WSREP: Trx 236236 tries to abort slave trx 236238." Created: 2021-06-01 Updated: 2022-02-10 Resolved: 2021-10-27 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.2.36, 10.3.29 |
| Fix Version/s: | 10.2.41, 10.3.32, 10.4.22, 10.5.13, 10.6.5 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jan Horstmann | Assignee: | Seppo Jaakola |
| Resolution: | Fixed | Votes: | 8 |
| Labels: | None |
| Environment: |
KVM machine with ubuntu bionic cloud image. Dockerised deployment of mariadb 10.3.29 from mariadb repository based on ubuntu bionic docker image: |
| Description |
|
Starting with MariaDB 10.3.29, when deploying a three-node Galera cluster, we are seeing crashes with errors like those in [1]. [1]
|
| Comments |
| Comment by Walter Doekes [ 2021-06-03 ] |
|
Okay, so we notice this too:
It went into a crashloop, with the same: Trx 54088471 tries to abort slave trx 54088527. First core dump looks like:
Went back to 10.3.25 because of Let me know if I can get you more debug info from the dump. |
| Comment by Jan Horstmann [ 2021-06-04 ] |
|
@wdoekes would you consider it safe to go back from 10.3.29 to 10.3.28 or lower? Did you do a full SST per node, and which method do you use? |
| Comment by Walter Doekes [ 2021-06-04 ] |
|
If you're not affected by I'm not aware what ops did when reverting. SSTs generally start automatically here after certain failures, more often than appears necessary. So if there was an SST, I'm not sure if it was intentional either. The crash loop started on the second node, apparently. ("Rebooted, started maria on the 2nd node. Crashloop. Downgrading made it happy again.") |
| Comment by Patrick Schlirf [ 2021-06-25 ] |
|
Hey guys, are there any updates? We downgraded after the first three times this happened and froze the patch cycle for a large number of database servers to avoid hitting this bug again. Kind regards, |
| Comment by Jan Lindström (Inactive) [ 2021-07-26 ] |
|
This issue will be fixed when (if) |
| Comment by Jan Lindström (Inactive) [ 2021-09-20 ] |
|
I need more information. Can you run with --wsrep-debug=1 and provide the error logs? Naturally, steps to reproduce would be most useful. |
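For readers who want to enable the requested debug logging persistently rather than on the command line, a minimal sketch follows. It assumes the setting goes in the `[mysqld]` group of the node's option file (the file path is not stated in this report) and that the server is on the affected 10.2/10.3 series; later series may interpret the value differently.

```ini
# Assumption: placed in the [mysqld] group of the node's option file.
[mysqld]
# Option-file form of the --wsrep-debug=1 command-line flag requested above.
wsrep_debug=1
```

The extra WSREP messages are written to the server error log, which is what the comment above asks to be attached.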
| Comment by Seppo Jaakola [ 2021-10-20 ] |
|
This issue seems to cause applier-applier conflicts; configuring only one applier (wsrep_slave_threads=1) should provide immediate relief. |
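The workaround suggested above could be applied roughly as follows. This is only a sketch: it assumes the setting is placed in the `[mysqld]` group of each node's option file (the actual file location in the reporter's Docker deployment is not given here), and it is a mitigation rather than a fix.

```ini
# Assumption: placed in the [mysqld] group of each node's option file.
[mysqld]
# Run a single applier thread so that two appliers cannot conflict
# with each other. Workaround only; parallel applying is disabled.
wsrep_slave_threads=1
```

With a single applier, replicated write sets are applied serially, so a node may fall behind under heavy write load; the setting can be raised again after upgrading to one of the listed fix versions.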
| Comment by Seppo Jaakola [ 2021-10-20 ] |
|
probably a duplicate |
| Comment by Daniel Black [ 2021-10-29 ] |
|
With 47ba5523046094db33e68c92a182491a629bbd56 reverting |
| Comment by Jan Lindström (Inactive) [ 2021-10-29 ] |
|
Yes, |