[MDEV-20439] WSREP_CLUSTER_SIZE at 0 after rolling update a node Created: 2019-08-28  Updated: 2022-08-04  Resolved: 2020-12-11

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.7
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Levieux stéphane Assignee: Stepan Patryshev (Inactive)
Resolution: Cannot Reproduce Votes: 3
Labels: need_feedback, need_verification
Environment:

Debian 9


Issue Links:
Relates
relates to MDEV-22723 Data loss when performing rolling upg... Closed
relates to MDEV-29246 WSREP_CLUSTER_SIZE at 0 after rolling... Closed

 Description   

Hello,

I'm currently doing a rolling upgrade of my MariaDB Galera cluster from 10.3.17 to 10.4.7.
I upgraded just one node, following the instructions.
Everything seems to be working fine (the last commit is the same, etc.). WSREP_CLUSTER_SIZE is 3 on my two nodes still on 10.3.17, but WSREP_CLUSTER_SIZE is 0 on my node on 10.4.7.

I have this warning in the log but don't know if it's linked to my problem:
"WSREP: View recovered from stable storage was empty. If the server is doing rolling upgrade from previous version which does not support storing view info into stable storage, this is ok. Otherwise this may be a sign of malfunction."
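For anyone comparing nodes during a rolling upgrade like this, the relevant state can be queried on each node with standard Galera status variables (a sketch; run via the mysql client on every node and compare the results):

```sql
-- Run on each node during the rolling upgrade to compare views.
-- wsrep_cluster_size should equal the number of nodes in the primary
-- component; wsrep_ready and wsrep_local_state_comment show whether
-- the node is usable and Synced.
SHOW GLOBAL STATUS WHERE Variable_name IN
    ('wsrep_cluster_size',
     'wsrep_cluster_status',
     'wsrep_ready',
     'wsrep_local_state_comment',
     'wsrep_local_index');
```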

I'm a little afraid to continue the rolling upgrade on another node...

Thanks in advance.



 Comments   
Comment by Levieux stéphane [ 2019-09-02 ]

On Friday the node rebooted, and that changed the node state... the node (10.4.7) saw a cluster size of 3 (good), but wsrep_ready was OFF.
Today I decided to restart the mysql service and all is fine... the node is synced and ready.

So even though I followed the procedure, it seems a reboot followed by a service restart was needed.

I wonder what the rationale is there... is a reboot different from a simple service restart for MariaDB?

Comment by Karl Levik [ 2019-11-11 ]

I've observed the same issue while doing a rolling upgrade from 10.3.19 to 10.4.10. I also see the same warning in the log:

2019-11-11 10:17:28 3 [Warning] WSREP: View recovered from stable storage was empty. If the server is doing rolling upgrade from previous version which does not support storing view info into stable storage, this is ok. Otherwise this may be a sign of malfunction.

Although it says wsrep_cluster_size is 0, it seems that replication to and from the other nodes is still working fine. After a while (maybe after all the nodes have been upgraded?), the wsrep_cluster_size is back to what I would expect, i.e. 3.

Comment by Jesse [ 2020-04-24 ]

Same issue on CentOS 7 when performing rolling upgrade from MariaDB-server-10.3.22-1.el7.centos.x86_64 to MariaDB-compat-10.4.12-1.el7.centos.x86_64

All nodes have wsrep_local_state_comment: Synced.
There is nothing special in the error log except the previously mentioned "[Warning] WSREP: View recovered from stable..." and the cluster is working.

First upgraded node has correct wsrep_cluster_size : 3
Second upgraded node has incorrect wsrep_cluster_size : 0
Third non-upgraded node has correct wsrep_cluster_size : 3

After last node was upgraded all nodes reported correct wsrep_cluster_size of 3

Comment by Chris McGrail [ 2020-07-17 ]

Also experiencing this with a rolling upgrade of a 3 node cluster from 10.3.23 to 10.4.13 on CentOS 7.8.

Upgrading our 2 node dev cluster with the same settings worked fine. We ran with one node 10.4 and one 10.3 for days and the cluster stayed synced.

On production we upgraded one node to 10.4. It will successfully perform IST or SST as needed, declare itself synced, but then it fails to maintain sync during normal replication. The mysql.log is full of lines like this:

Jul 17 14:57:31 prod1 mysqld: 2020-07-17 14:57:31 2 [Warning] WSREP: trx protocol version: 4 does not match certification protocol version: -1

Global status reports that odd value for protocol version as well.

> show global status like 'wsrep_protocol_version';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| wsrep_protocol_version | -1    |
+------------------------+-------+

On the dev server logs, on the first server when it was upgraded to 10.4, you can see it is aware it has two protocols available and it chooses the lower one until the other node is upgraded. Then when the other node was upgraded it automatically reloaded itself to use the newer one. In the prod logs when the first node is upgraded, the startup sequence output is slightly different and you never see that output where it chooses a backwards compatible protocol.

We are considering rolling back the node to 10.3 rather than pushing ahead with the rest of the cluster since the upgraded one is not applying replication. We are OK with 2 of the 3 nodes but can't lose another one for any extended period of time.

Comment by Chris McGrail [ 2020-08-03 ]

We were eventually able to get past this by doing a full cluster shutdown and a non-rolling upgrade.

The first node that didn't stay in sync was shut down. The two nodes that had been healthy on 10.3 were both stopped and upgraded to 10.4. The cluster was bootstrapped on one, and it synced with the other no problem. Then we forced an SST on the first node and it joined them in a normal expected fashion.

Fortunately the nature of the business using the database allowed us a generous window for downtime, long enough that we were able to take cold filesystem backups in case we needed to revert to 10.3.
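For reference, the non-rolling procedure described above can be sketched roughly as follows. This is an operational sketch, not an exact transcript of what was done; it assumes the systemd service name `mariadb` and the default data directory `/var/lib/mysql`, and a full backup should be taken first:

```sh
# Stop MariaDB on every node (stop the intended bootstrap node last,
# so it holds the most advanced state).
systemctl stop mariadb

# On the node stopped last, confirm it is safe to bootstrap from:
grep safe_to_bootstrap /var/lib/mysql/grastate.dat   # expect: 1

# Upgrade packages on all nodes, then bootstrap a new cluster
# on that node:
galera_new_cluster

# Start the remaining healthy nodes normally; they join via IST/SST:
systemctl start mariadb

# To force a full SST on a problem node (as was done above for the
# node that would not stay in sync), remove its grastate.dat before
# starting it:
rm /var/lib/mysql/grastate.dat
systemctl start mariadb
```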

Comment by Stepan Patryshev (Inactive) [ 2020-08-04 ]

I presume it may be the same as MDEV-22723, which may be fixed by using a newer Galera 4 library, such as 26.4.4 (rae24803); see the explanations there by Yurchenko.

Comment by Chris McGrail [ 2020-08-04 ]

We were/are using 26.4.4.

If I read that bug report correctly the fix is in 4.5.

"In any case this bug (and many other) is fixed in 4.5 release tag. All MariaDB 10.4 users should switch to it. It will solve a lot of trouble."

It doesn't look like 4.5 is GA yet. We seem to be good now and will of course apply dot release updates as they are available.

In any event, it is good to see something published about the issue we saw. It is unsettling to have an error that has no matches found in a web search.

Comment by Shi Yan [ 2020-12-02 ]

We are having the same issue when doing a rolling upgrade from 10.3 to 10.4/10.5. The Galera version is 26.4.6.
After the rolling upgrade, strange values are shown on the first (upgraded) node, but the cluster sync looks fine and the info from the other, not-yet-upgraded nodes is still good.

wsrep_cluster_size 0
wsrep_local_index 18446744073709551615

[update]
We found out that when mariadb on another node stops running, these values are refreshed and become correct. For example, we upgrade the 1st node to 10.5; the wrong values then appear on the 1st node, but the cluster still looks synced and the values are good on our 2nd and 3rd nodes. When we then stop mariadb on the 2nd node, the 1st gets the correct values. But the same thing happens on the 2nd node after it is upgraded.
Also, once all three nodes are upgraded, the values are good.

Comment by Stepan Patryshev (Inactive) [ 2020-12-03 ]

julien.fritsch, Shi Yan reported that they do not suffer from these wrong values and everything is good, so I suppose there is no strong reason to worry here. But in the customer ticket I have not seen a fresh reply from the customer on whether it is still a problem for them or not. Waiting...

Comment by Stepan Patryshev (Inactive) [ 2020-12-11 ]

Closing as "Cannot Reproduce" since the customer is not experiencing this anymore.

Generated at Thu Feb 08 08:59:29 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.