[MDEV-32974] Member fails to join due to old seqno in GTID Created: 2023-12-08 Updated: 2024-01-12 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Galera, Galera SST, Storage Engine - InnoDB |
| Affects Version/s: | 11.0.1, 11.0.2, 11.0.3, 11.0.4 |
| Fix Version/s: | 11.0, 11.1, 11.2, 11.3, 11.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ihtisham ul Haq | Assignee: | Seppo Jaakola |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | regression, upgrade | ||
| Environment: |
bitnami/mariadb-galera:11.0.4-debian-11-r0 in k8s cluster |
||
| Description |
|
After upgrading from 10.6 -> 10.11 -> 11.0.3 and now to 11.0.4, two of the members start up without any issues, but the third member (db-0) fails to start because of an old sequence number (117376) that I believe is being passed on by the donor. We could not find the old seqno anywhere except in the `ibdata1` file of the donor (by searching for its hex representation), and we are not sure how to get rid of it. Logs from member db-0:
|
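The description mentions locating the old seqno in `ibdata1` by searching for its hex representation. The ticket does not say how that search was done, so the following is only a minimal sketch of one way to do it. It assumes the seqno is stored as an 8-byte integer and, because the byte order is not stated in the ticket, looks for both big-endian and little-endian encodings. The helper names `seqno_patterns` and `find_seqno` are hypothetical, and a hit only shows that the byte pattern occurs in the file, not that it belongs to a wsrep checkpoint.

```python
import struct


def seqno_patterns(seqno):
    """Return candidate 8-byte encodings of seqno.

    Assumption: the value is stored as a 64-bit integer; both byte
    orders are tried because the ticket does not state which is used.
    """
    return {
        "big-endian": struct.pack(">q", seqno),
        "little-endian": struct.pack("<q", seqno),
    }


def find_seqno(path, seqno, chunk_size=1 << 20):
    """Return a list of (byte_order, file_offset) hits, sorted by offset.

    The file is read in chunks so that a large ibdata1 does not have to
    fit in memory. The last 7 bytes of each buffer are carried over so
    that an 8-byte pattern spanning a chunk boundary is still found.
    """
    patterns = seqno_patterns(seqno)
    overlap = 7  # pattern length minus one
    hits = []
    with open(path, "rb") as f:
        consumed = 0  # bytes read from the file so far
        tail = b""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            buf_start = consumed - len(tail)  # file offset of buf[0]
            for name, pattern in patterns.items():
                pos = buf.find(pattern)
                while pos != -1:
                    hits.append((name, buf_start + pos))
                    pos = buf.find(pattern, pos + 1)
            tail = buf[-overlap:]
            consumed += len(chunk)
    return sorted(hits, key=lambda hit: hit[1])
```

For the seqno from this report, `seqno_patterns(117376)["big-endian"].hex()` is `000000000001ca80`. Calling `find_seqno("/path/to/copy/of/ibdata1", 117376)` on a copy of the donor's file (the path is a placeholder) would list the offsets at which either encoding occurs.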
| Comments |
| Comment by Ranjan Ghosh [ 2023-12-08 ] | |
|
I have the same problem, and I didn't even upgrade from a very old version. I can no longer start the second node, even if I wipe /var/lib/mysql completely on the joining node. The rsync SST finishes, but then I get "SST script passed bogus GTID" and the node fails. How can I reset this and get out of this situation? | |
| Comment by Ihtisham ul Haq [ 2023-12-08 ] | |
|
What worked was bootstrapping the cluster and then running a manual SST for the affected node (an SST on startup would probably also work, but we didn't try it). We have other databases with this issue as well (on 11.0.3), and we thought upgrading to 11.0.4 might fix it due to this change. But the issue still occurs even with that change, and we don't want to upgrade or restart those databases because of the risk of running into the same issue. | |
| Comment by Ihtisham ul Haq [ 2023-12-11 ] | |
|
@RanjanGhosh Have you been able to reproduce this issue, by any chance? | |
| Comment by Ranjan Ghosh [ 2023-12-11 ] | |
|
@ihti: I haven't been able to reproduce it, but your tip to bootstrap the whole cluster was also what worked for me. I needed to shut down all nodes and start them up one by one from scratch. It's quite surprising, because it's the first problem I can think of that could not be solved by deleting /var/lib/mysql on the second node while the first node seems to be running without any problems. Or, put differently: you cannot "see" that the first node is in a weird state that prevents the second node from coming up. Everything seems normal. The first node is running and accepts queries. Then you try to start the second node and it just won't work. Even if you delete /var/lib/mysql completely, it is impossible to bring up the second node. I tried it multiple times. Only after restarting/bootstrapping the first node did things start to work again. Googling didn't immediately yield an answer for me, so I hope people searching for this message will now find this issue: SST script passed bogus GTID | |
| Comment by Ihtisham ul Haq [ 2023-12-19 ] | |
|
Steps to reproduce the issue: 1. Run mariadb-galera cluster with 3 peers in version 10.11.6 | |
| Comment by Max Lamprecht [ 2024-01-05 ] | |
|
We have narrowed it down to this commit: https://github.com/MariaDB/server/commit/44dce3b2077e64a1efc607668d0d7b42a7c4ee78 If we set innodb_undo_tablespaces to 0, startup succeeds. | |
| Comment by Max Lamprecht [ 2024-01-12 ] | |
|
We figured out that we cannot change the innodb_undo_tablespaces setting to 3 online when using wsrep_sst_method=mariabackup, because with mariabackup there is no clean shutdown of InnoDB on the donor. As a workaround, we used wsrep_sst_method=mysqldump to migrate our Galera clusters to the new default. | |
| Comment by Marko Mäkelä [ 2024-01-12 ] | |
|
Thank you for the detailed bug report, including clear steps to reproduce.
It is unclear to me why the following error would be caused by a change of innodb_undo_tablespaces:
InnoDB does not store any GTID information in its internal data structures, such as undo logs. For normal replication, there is a table mysql.gtid_slave_pos, which is in InnoDB format by default. I am not familiar with GTIDs or how they interact with Galera. For Galera, we store a wsrep_checkpoint in a rollback segment header. The storage format was last changed in MariaDB Server 10.3, related to
I’m assigning this to the Galera developers for root cause analysis. |