[MDEV-26018] Update breaks cluster Created: 2021-06-24  Updated: 2021-12-22  Resolved: 2021-12-22

Status: Closed
Project: MariaDB Server
Component/s: wsrep
Affects Version/s: 10.4.20
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Tim van Dijen Assignee: Seppo Jaakola
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Red Hat Enterprise Linux Server release 7.9 (Maipo)

Linux 3.10.0-1160.25.1.el7.x86_64 #1 SMP Tue Apr 13 18:55:45 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

3-node wsrep-cluster


Attachments: Text File primary_donor.txt     Text File secondary_joiner.txt     File server.cnf    

 Description   

After updating from 10.4.19 to 10.4.20, I couldn't get any secondary nodes to join the cluster.
I could see the nodes join and immediately leave again. See attached logs.

I was able to resolve this by downgrading back to 10.4.19.



 Comments   
Comment by Julius Goryavsky [ 2021-06-25 ]

tvdijen Hi! I looked at the log, and it looks as if the encryption mode was changed from encrypt = 2 to encrypt = 4 without renaming the tca parameter to tkey (or ssl-ca to ssl-key), which needs to be done when switching from encrypt = 2 to encrypt = 4. Please tell me whether such configuration changes were made, or is this the result of some kind of automatic change?
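For reference, a minimal sketch of the two [sst] naming styles implied by the comment above (file paths are illustrative, and the exact parameter sets per mode should be verified against the wsrep SST documentation):

```ini
# encrypt = 2 style (t* parameter names):
[sst]
encrypt = 2
tca   = /etc/pki/mariadb/certs/mariadb-ca.crt
tcert = /etc/pki/mariadb/certs/mariadb-sst.crt

# encrypt = 4 style (ssl-* parameter names, key listed explicitly):
[sst]
encrypt = 4
ssl-ca   = /etc/pki/mariadb/certs/mariadb-ca.crt
ssl-cert = /etc/pki/mariadb/certs/mariadb-sst.crt
ssl-key  = /etc/pki/mariadb/private/mariadb-sst.key
```

The point of the question is that mixing the two styles (encrypt = 4 with t* names) is a configuration mismatch.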

Comment by Tim van Dijen [ 2021-06-25 ]

Hi @Julius Goryavsky!

Config wasn't changed between updates. It's been like this for years:

encrypt = 4
tca = /etc/somecert.crt
tcert = /etc/some-sst-cert.pem

So maybe I've been doing it all wrong for years, but it used to work.
Do I understand correctly that I need to rename `tca` to `tkey`?

Comment by Tim van Dijen [ 2021-06-25 ]

I've tried setting ssl-ca, ssl-cert and ssl-key instead, but the issue remains.
With those settings in place, 10.4.19 works like a charm, but the update to 10.4.20 breaks it in the same way.

Comment by Jan Lindström (Inactive) [ 2021-06-25 ]

Please provide error logs from latest try.

Comment by Tim van Dijen [ 2021-06-25 ]

OK, so as Julius pointed out, I've changed the config from:

[sst]
encrypt = 4
tca = /etc/pki/mariadb/certs/mariadb-ca.crt
tcert = /etc/pki/mariadb/private/mariadb-sst-1.crt

To:

[sst]
encrypt = 4
ssl-ca = /etc/pki/mariadb/certs/mariadb-ca.crt
ssl-cert = /etc/pki/mariadb/certs/mariadb-sst-1.crt
ssl-key = /etc/pki/mariadb/private/mariadb-sst-1.key

On 10.4.19 I can bootstrap the cluster and join the secondaries. When I update to 10.4.20, it breaks in a similar fashion as with the old config. See attached logs; I've also added my server.cnf.

Comment by Rob Brown [ 2021-07-23 ]

CONFIRMED! I'm seeing the same problem.

MariaDB 10.4.20 is able to successfully join a valid Galera cluster. So that's good.

And MariaDB 10.4.20 works fine as a Galera cluster. So that's good.

Any MariaDB 10.4.20 node that joins a Galera cluster via IST (Incremental State Transfer) can be replicated from using GTID. So that's good.

But any MariaDB 10.4.20 node that has ever joined a Galera cluster via SST (State Snapshot Transfer) can never be replicated from using GTID. BAD!

I was able to duplicate this issue 100% of the time:

Galera Node #1:
Set up MariaDB 10.4.20 with Galera
(wsrep_on, wsrep_sst_auth, wsrep_cluster_address, etc.)
Then launch:

[root@node1 ~]# galera_new_cluster

Galera Node #2:
Upgrade to MariaDB 10.4.20 and configure Galera normally.

[root@node2 ~]# systemctl stop mariadb
[root@node2 ~]# yum install MariaDB-server-10.4.20
[root@node2 ~]# rm -rf /var/lib/mysql/*
[root@node2 ~]# systemctl restart mariadb

Node 2 uses SST to join Node 1 and the cluster links up fine. So that's good.

Slave Node #3:

Configure to replicate from Node #1 using hard-coded position:

MariaDB[3]> CHANGE MASTER TO MASTER_HOST='node1', MASTER_LOG_FILE='mysql-binlog.000003', MASTER_LOG_POS=4;

It works fine. So that's good.

Then switch to GTID mode:

MariaDB[3]> CHANGE MASTER TO MASTER_USE_GTID=slave_pos;

It still works fine. So that's good.

Try slaving from Node 2:

MariaDB[3]> CHANGE MASTER TO MASTER_HOST='node2';

It will break, because Node 2 is a 10.4.20 node that used SST to join the cluster.

Seconds_Behind_Master: NULL
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 1-26-1006, which is not in the master's binlog'
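One way to confirm such a mismatch (a diagnostic sketch of ours, not part of the original report, using standard MariaDB GTID system variables) is to compare what the master's binlog contains with what the slave will request:

```sql
-- On the candidate master (node2): which GTIDs does its binlog actually cover?
SELECT @@gtid_binlog_pos, @@gtid_binlog_state;

-- On the slave (node3): which GTID will it request when it connects?
SELECT @@gtid_slave_pos, @@gtid_current_pos;
```

If the slave's gtid_slave_pos is not covered by the master's gtid_binlog_state, the master refuses the connection with error 1236, as in the log lines above.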

Switch back to the goodness:

MariaDB[3]> CHANGE MASTER TO MASTER_HOST='node1';

And it will replicate perfectly if Node 1 had never used SST:

Seconds_Behind_Master: 0
Last_IO_Errno: 0
Last_IO_Error:

Downgrade Node 2 to 10.4.19

[root@node2 ~]# systemctl stop mariadb
[root@node2 ~]# yum downgrade MariaDB-server-10.4.19
[root@node2 ~]# rm -rf /var/lib/mysql/*
[root@node2 ~]# systemctl restart mariadb

Then Node 3 will suddenly be able to replicate from anywhere again:

MariaDB[3]> CHANGE MASTER TO MASTER_HOST='node2';

Seconds_Behind_Master: 0

MariaDB[3]> CHANGE MASTER TO MASTER_HOST='node1';

Seconds_Behind_Master: 0

If you keep all Galera Master servers on 10.4.19 and DO NOT upgrade to 10.4.20, then you'll be safe.

Replication Slaves are safe to upgrade to 10.4.20 (as long as you never promote them to Master).

Comment by Rob Brown [ 2021-08-07 ]

RESOLUTION CONFIRMED!

10.4.19 = GOOD
10.4.20 = BAD
10.4.21 = GOOD AGAIN

VERIFICATION:

[root@node2 ~]# systemctl stop mariadb
[root@node2 ~]# yum downgrade MariaDB-server-10.4.19
[root@node2 ~]# rm -rf /var/lib/mysql/*
[root@node2 ~]# systemctl restart mariadb

GTID Slaving works.
Thus 10.4.19 SST GOOD.

[root@node2 ~]# systemctl stop mariadb
[root@node2 ~]# yum install MariaDB-server-10.4.20
[root@node2 ~]# systemctl restart mariadb

GTID Slaving works.
Thus 10.4.20 IST GOOD.

[root@node2 ~]# systemctl stop mariadb
[root@node2 ~]# rm -rf /var/lib/mysql/*
[root@node2 ~]# systemctl restart mariadb

GTID Slaving FAILURE!
Seconds_Behind_Master: NULL
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-216-639464, which is not in the master's binlog'
Thus 10.4.20 SST BAD.

[root@node2 ~]# systemctl stop mariadb
[root@node2 ~]# yum install MariaDB-server-10.4.21
[root@node2 ~]# systemctl restart mariadb

GTID Slaving still FAILURE!

[root@node2 ~]# systemctl stop mariadb
[root@node2 ~]# rm -rf /var/lib/mysql/*
[root@node2 ~]# systemctl restart mariadb

GTID Slaving works again.
Thus 10.4.21 SST GOOD.

[root@node2 ~]# systemctl restart mariadb
[root@node2 ~]#

GTID Slaving still works.
Thus 10.4.21 IST GOOD.

Changing the MariaDB version on the slave server (node3) to 10.4.19, 10.4.20, or 10.4.21 has no effect on the success or failure of GTID slaving in each scenario.

You can close this ticket now.

THANKS!

Comment by Tim van Dijen [ 2021-08-26 ]

I'm still experiencing the exact same issue with 10.4.21... The cluster bootstraps just fine, but secondaries won't join. I think my setup is different from Rob's, because I'm running multi-master and never did anything like promoting slaves to master as he does in his comments above.

Comment by Tim van Dijen [ 2021-11-15 ]

We've just tried upgrading to 10.5.x, and that worked.
We suspect we may have forgotten to update the mariadb-backup package on our earlier attempts (and that's what we use as the wsrep SST method).
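A small sanity check along those lines (a sketch of ours, not from the report; the function name is hypothetical, and you would feed it the versions your package manager reports, e.g. via `rpm -q --qf '%{VERSION}' MariaDB-server MariaDB-backup`): the server and the mariadb-backup SST helper should be at the same version.

```shell
# check_sst_versions: warn when the MariaDB server and mariadb-backup
# package versions differ, since a mismatched backup tool can break SST.
check_sst_versions() {
  server="$1"
  backup="$2"
  if [ "$server" = "$backup" ]; then
    echo "OK: server and backup both at $server"
  else
    echo "MISMATCH: server=$server backup=$backup (SST may fail)"
  fi
}

# Versions here are illustrative:
check_sst_versions 10.4.20 10.4.19
# prints: MISMATCH: server=10.4.20 backup=10.4.19 (SST may fail)
```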

This issue may be closed now!

Comment by Ralf Gebhardt [ 2021-12-22 ]

Reported to be fixed in newer versions.

Generated at Thu Feb 08 09:42:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.