[MDEV-20439] WSREP_CLUSTER_SIZE at 0 after rolling upgrade of a node Created: 2019-08-28 Updated: 2022-08-04 Resolved: 2020-12-11

| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.4.7 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Levieux Stéphane | Assignee: | Stepan Patryshev (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 3 |
| Labels: | need_feedback, need_verification |
| Environment: | Debian 9 |
| Issue Links: |
| Description |
|
Hello, I'm currently doing a rolling upgrade of my MariaDB Galera cluster from 10.3.17 to 10.4.7. I have this warning in the log but don't know whether it is linked to my problem. I'm a little afraid to continue the rolling upgrade on another node. Thanks in advance. |
| Comments |
| Comment by Levieux Stéphane [ 2019-09-02 ] |
|
On Friday the node rebooted, and the reboot changed the node state: the node (10.4.7) reported a cluster size of 3 (good), but wsrep_ready was OFF. So instead I followed the procedure; it seems a reboot followed by a restart of the service was needed. I wonder what the rationale is there: is a reboot different from a simple restart of the MariaDB service? |
| Comment by Karl Levik [ 2019-11-11 ] |
|
I've observed the same issue while doing a rolling upgrade from 10.3.19 to 10.4.10. I also see the same warning in the log. Although it says wsrep_cluster_size is 0, replication to and from the other nodes still seems to be working fine. After a while (maybe after all the nodes have been upgraded?), wsrep_cluster_size is back to what I would expect, i.e. 3. |
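The status checks the commenters describe can be run from the command line. A minimal sketch, assuming a local server reachable via the `mysql` client over the default socket with sufficient privileges:

```shell
#!/bin/sh
# Query the Galera status variables discussed in this thread.
# Assumes the mysql client can connect to the local node without a password
# (e.g. via unix_socket authentication as root).
mysql -N -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"
mysql -N -e "SHOW GLOBAL STATUS LIKE 'wsrep_ready'"
mysql -N -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"
```

On a healthy three-node cluster one would expect `wsrep_cluster_size` of 3, `wsrep_ready` ON, and `wsrep_local_state_comment` Synced; the symptom reported here is a size of 0 while replication still appears to work.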
| Comment by Jesse [ 2020-04-24 ] |
|
Same issue on CentOS 7 when performing a rolling upgrade from MariaDB-server-10.3.22-1.el7.centos.x86_64 to MariaDB-compat-10.4.12-1.el7.centos.x86_64. All nodes have wsrep_local_state_comment: Synced. The first upgraded node has the correct wsrep_cluster_size: 3. After the last node was upgraded, all nodes reported the correct wsrep_cluster_size of 3. |
| Comment by Chris McGrail [ 2020-07-17 ] |
|
Also experiencing this with a rolling upgrade of a 3-node cluster from 10.3.23 to 10.4.13 on CentOS 7.8. Upgrading our 2-node dev cluster with the same settings worked fine: we ran with one node on 10.4 and one on 10.3 for days, and the cluster stayed synced.

On production we upgraded one node to 10.4. It will successfully perform IST or SST as needed and declare itself synced, but then it fails to stay in sync during normal replication. The mysql.log is full of lines like this:

Jul 17 14:57:31 prod1 mysqld: 2020-07-17 14:57:31 2 [Warning] WSREP: trx protocol version: 4 does not match certification protocol version: -1

Global status reports that odd value for the protocol version as well:

> show global status like 'wsrep_protocol_version';

In the dev server logs, when the first server was upgraded to 10.4, you can see it is aware that it has two protocols available, and it chooses the lower one until the other node is upgraded; when the other node was upgraded, it automatically reloaded itself to use the newer one. In the prod logs, when the first node is upgraded, the startup sequence output is slightly different, and you never see the output where it chooses a backwards-compatible protocol.

We are considering rolling the node back to 10.3 rather than pushing ahead with the rest of the cluster, since the upgraded one is not applying replication. We are OK with 2 of the 3 nodes but can't lose another one for any extended period of time. |
| Comment by Chris McGrail [ 2020-08-03 ] |
|
We were eventually able to get past this by doing a full cluster shutdown and a non-rolling upgrade. The first node, the one that didn't stay in sync, was shut down. The two nodes that had been healthy on 10.3 were both stopped and upgraded to 10.4. The cluster was bootstrapped on one, and it synced with the other no problem. Then we forced an SST on the first node, and it joined them in the normal expected fashion. Fortunately the nature of the business using the database allowed us a generous window for downtime, long enough that we were able to take cold filesystem backups in case we needed to revert to 10.3. |
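The recovery the comment above describes can be sketched as a sequence of standard MariaDB/Galera administration commands. This is an illustrative outline under assumptions (systemd service name `mariadb`, default datadir `/var/lib/mysql`), not the commenter's exact procedure:

```shell
# Full cluster shutdown: stop MariaDB on every node, out-of-sync node first.
systemctl stop mariadb

# After upgrading packages to 10.4 on the healthy nodes, bootstrap a new
# cluster on one of them (galera_new_cluster is MariaDB's bootstrap wrapper).
galera_new_cluster

# Start the second healthy node normally; it joins and syncs via IST/SST.
systemctl start mariadb

# On the formerly out-of-sync node, force a full SST by removing the saved
# Galera state file before starting (path assumes the default datadir).
rm /var/lib/mysql/grastate.dat
systemctl start mariadb
```

Removing `grastate.dat` makes the node forget its last known cluster position, so on startup it requests a full state snapshot transfer from a donor instead of an incremental transfer.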
| Comment by Stepan Patryshev (Inactive) [ 2020-08-04 ] |
|
I presume it may be the same as |
| Comment by Chris McGrail [ 2020-08-04 ] |
|
We were/are using 26.4.4. If I read that bug report correctly, the fix is in 4.5: "In any case this bug (and many other) is fixed in 4.5 release tag. All MariaDB 10.4 users should switch to it. It will solve a lot of trouble." It doesn't look like 4.5 is GA yet. We seem to be good now and will of course apply dot-release updates as they become available. In any event, it is good to see something published about the issue we saw; it is unsettling to hit an error that has no matches in a web search. |
| Comment by Shi Yan [ 2020-12-02 ] |
|
We are having the same issue when doing a rolling upgrade from 10.3 to 10.4/10.5. The Galera version is 26.4.6. wsrep_cluster_size is 0. [update] |
| Comment by Stepan Patryshev (Inactive) [ 2020-12-03 ] |
|
julien.fritsch: Shi Yan reported that they do not suffer from these wrong values and everything is good, so I suppose there is no strong reason to worry here. But in the customer ticket I have not seen a fresh reply from the customer on whether it is still a problem for them or not. Waiting... |
| Comment by Stepan Patryshev (Inactive) [ 2020-12-11 ] |
|
Closing this as not reproduced, since the customer is no longer experiencing it. |