[MDEV-8975] [PATCH] 10.1 Fails To Join Existing Galera Cluster Created: 2015-10-20 Updated: 2015-11-17 Resolved: 2015-11-06
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Galera SST |
| Affects Version/s: | 10.1.8 |
| Fix Version/s: | 10.1.9 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Vincent Milum Jr | Assignee: | Nirbhay Choubey (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | galera, patch, sst |
| Environment: | Debian 8.2 netinst install |
| Description |
MariaDB 10.1.8 fails to join an existing MariaDB 10.0.21 cluster for migration. With the existing cluster environment up and running, I built two parallel VMs to test joining. Both are fresh installs from the Debian 8.2 netinst ISO; after the ISO installation completed, SSH was installed for ease of access. Each VM was then configured per the instructions on this page: https://downloads.mariadb.org/mariadb/repositories/

The only differences between the two VMs are the "10.0" vs. "10.1" reference in the source repository and the package installed: "mariadb-galera-server" on 10.0 and "mariadb-server" on 10.1.

The following file was created with the contents below on each VM: /etc/mysql/conf.d/galera.cnf

After creating this file, the MariaDB server was restarted on each VM. The MariaDB 10.0.21 VM joined the cluster without issue. The MariaDB 10.1.8 VM failed to join the cluster. Log file attached.
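The galera.cnf contents did not survive this export. As a rough sketch only (the cluster name, provider path, and addresses here are assumptions, not the reporter's actual values; only 192.168.100.10 is named elsewhere in this report), a minimal joiner configuration for a setup like this typically looks like:

```ini
# Hypothetical reconstruction -- the reporter's actual file was not preserved.
[mysqld]
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2

wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
# Assumed cluster address; the report only names 192.168.100.10.
wsrep_cluster_address=gcomm://192.168.100.10
wsrep_cluster_name=example_cluster
wsrep_sst_method=rsync
```

One relevant 10.0-to-10.1 difference: Galera was merged into the standard 10.1 server, so wsrep_on=ON must be set explicitly there, whereas the separate 10.0 mariadb-galera-server packages enabled replication once a provider was configured.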
| Comments |
| Comment by Nirbhay Choubey (Inactive) [ 2015-10-27 ] |

darkain See wsrep_sst_receive_address: https://mariadb.com/kb/en/mariadb/galera-cluster-system-variables/#wsrep_sst_receive_address
| Comment by Vincent Milum Jr [ 2015-10-27 ] |

Added the option to the end of the galera.cnf file, specifying my "admin" node (192.168.100.10). Replication still fails. Specifically this line:

And as far as actual replication goes, the log doesn't seem to be very descriptive at all:
| Comment by Vincent Milum Jr [ 2015-10-27 ] |

As another note: after failing, the MariaDB service automatically retries over and over, every few seconds, creating huge log files that are hard to parse through. Attempting a "service mysql stop" doesn't stop this either.
| Comment by Nirbhay Choubey (Inactive) [ 2015-10-27 ] |

darkain Did you set wsrep_sst_donor? Can you also share the donor logs?
| Comment by Vincent Milum Jr [ 2015-10-27 ] |

Oh crap, you're right: I was getting wsrep_sst_donor and wsrep_sst_receive_address backwards. Okay, after getting receive_address actually set properly, the node "joins" the cluster, just not perfectly.

From the 10.1 node:

From the existing 10.0 nodes (notice the double comma after the first address):

So from both this and the previous issue, it looks like the 10.1 node is unable to determine the local machine's network address. As mentioned previously, this setup was tested on both 10.0 and 10.1 at the same time; both are virtual machines under VMware running off the same Debian 8.2 ISO and config. So something has changed in 10.1 to break network address detection. For reference, there is only one virtual NIC, configured as VMXNET 3.
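For reference, the two SST options point in opposite directions: wsrep_sst_receive_address is the joiner's own address (where it listens for the incoming state transfer), while wsrep_sst_donor names the node that should send it. A sketch of the joiner-side settings, with assumed values (only 192.168.100.10, the "admin" node, is named in this report; the joiner's address and the donor's node name are illustrative):

```ini
# Joiner-side galera.cnf fragment -- values are illustrative assumptions.
[mysqld]
# Address on which THIS (joining) node receives the SST:
wsrep_sst_receive_address=192.168.100.11
# wsrep_node_name of the donor (the "admin" node at 192.168.100.10):
wsrep_sst_donor=admin-node
```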
| Comment by Vincent Milum Jr [ 2015-10-27 ] |

After getting 10.1 to join (albeit in the state listed above, where the other nodes are not aware of it), I let it idle in the cluster for a while. The following log messages pop up quite frequently on the 10.1 node:
| Comment by Nirbhay Choubey (Inactive) [ 2015-10-28 ] |

What do you mean by this? Is the node not part of the cluster?
| Comment by Nirbhay Choubey (Inactive) [ 2015-10-28 ] |

jplindst The errors from the previous comment seem to be related to the use of 10.0 system tables in 10.1. Do you know a fix for this (in a cluster environment)? I believe running a local upgrade should do.
| Comment by Vincent Milum Jr [ 2015-10-28 ] |

The issue of "not being aware" is just as described above. Notice the tables from both the 10.0 nodes and the 10.1 node: there is an empty entry in the IP address field for the 10.1 node. The 10.1 node has trouble determining and broadcasting its own IP address to the other nodes, despite being connected directly to them. This is most likely also directly related to the initial issue, where the 10.1 node wasn't able to automatically determine its own IP address for receiving the SST.
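The double comma in wsrep_incoming_addresses is consistent with one node contributing an empty address string. A toy illustration (not actual Galera code; the addresses are assumptions) of how joining a member list containing an empty entry produces exactly that symptom:

```python
# Toy illustration: a member whose address detection failed
# contributes an empty string to the cluster's address list.
members = ["192.168.100.10:3306", "", "192.168.100.12:3306"]

incoming_addresses = ",".join(members)
print(incoming_addresses)
# -> 192.168.100.10:3306,,192.168.100.12:3306  (note the double comma)
```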
| Comment by Vincent Milum Jr [ 2015-10-30 ] |

The IPv6 patch broke the bind address of "0.0.0.0".
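A minimal sketch (in Python, with hypothetical helper names; the actual fix lives in the server's address-handling code) of the kind of wildcard check that can regress when IPv6 support is added: a rewrite that tests only for the IPv6 wildcard "::" stops recognizing the IPv4 wildcard "0.0.0.0".

```python
import socket

def is_wildcard_naive_ipv6(addr: str) -> bool:
    # A buggy IPv6-era rewrite: only recognizes the IPv6 wildcard.
    return addr == "::"

def is_wildcard(addr: str) -> bool:
    # Family-aware check: the wildcard in either family packs to all-zero bytes.
    family = socket.AF_INET6 if ":" in addr else socket.AF_INET
    try:
        packed = socket.inet_pton(family, addr)
    except OSError:
        return False
    return packed == b"\x00" * len(packed)

print(is_wildcard_naive_ipv6("0.0.0.0"))  # False -- the regression
print(is_wildcard("0.0.0.0"))             # True
print(is_wildcard("::"))                  # True
```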
| Comment by Vincent Milum Jr [ 2015-10-30 ] |

Pull request on GitHub to fix this issue: https://github.com/MariaDB/server/pull/115
| Comment by Jan Lindström (Inactive) [ 2015-10-30 ] |

Yes, you need to run mysql_upgrade after migrating from e.g. 10.0, and before you start using your databases.
| Comment by Vincent Milum Jr [ 2015-11-02 ] |

So the question then: is there no direct support for a rolling upgrade from 10.0 to 10.1? Does it require literally taking the entire cluster offline to perform the needed mysql_upgrade, then re-syncing all of the nodes afterwards?
| Comment by Nirbhay Choubey (Inactive) [ 2015-11-06 ] |

https://github.com/MariaDB/server/commit/5079d69d48e2c1b763d23bdb294297e6d6da43a2
| Comment by Nirbhay Choubey (Inactive) [ 2015-11-16 ] |

darkain http://galeracluster.com/documentation-webpages/upgrading.html#id1
| Comment by Nirbhay Choubey (Inactive) [ 2015-11-16 ] |

jplindst Do you think it's a real problem with the following errors showing up in the error log, or can they be safely ignored?
| Comment by Jan Lindström (Inactive) [ 2015-11-17 ] |

These are not the real problem; it looks more like mysql_upgrade was not run.
| Comment by Vincent Milum Jr [ 2015-11-17 ] |

Should mysql_upgrade be run BEFORE or AFTER a rolling upgrade, or somewhere in the middle? The main question here is how to properly run a mixed environment of 10.0 and 10.1 nodes during a rolling upgrade from one to the other, and whether this is officially supported, or whether the entire cluster has to be taken offline to do the upgrade safely.