[MDEV-8975] [PATCH] 10.1 Fails To Join Existing Galera Cluster Created: 2015-10-20  Updated: 2015-11-17  Resolved: 2015-11-06

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: 10.1.8
Fix Version/s: 10.1.9

Type: Bug Priority: Blocker
Reporter: Vincent Milum Jr Assignee: Nirbhay Choubey (Inactive)
Resolution: Fixed Votes: 1
Labels: galera, patch, sst
Environment:

Debian 8.2 netinst install
MariaDB 10.1.8-MariaDB-1~jessie-log


Attachments: mariadb-10.1-sst.log, mariadb-10.1.log
Issue Links:
Problem/Incident
is caused by MDEV-8034 wsrep_node_address can't be ipv6 Closed

 Description   

MariaDB 10.1.8 fails to join an existing MariaDB 10.0.21 cluster for migration.

With the existing cluster environment up and running, I built two parallel VMs to test joining. Both have fresh installs from the Debian 8.2 netinst.iso. After ISO installation completed, SSH was installed for ease of access.

From here, each of the VMs was configured per the instructions on this page: https://downloads.mariadb.org/mariadb/repositories/

The only difference between the two VMs is the reference to "10.0" vs. "10.1" in the source repository and the package installed: "mariadb-galera-server" on 10.0 and "mariadb-server" on 10.1.

The following file was created on each VM with the contents below: /etc/mysql/conf.d/galera.cnf

[mysqld]
log_slave_updates=1
innodb_buffer_pool_size=768M

[galera]
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_provider_options="gmcast.segment=3"
wsrep_cluster_name="CLUSTERNAME"
wsrep_cluster_address="gcomm://192.168.100.10"
wsrep_sst_method=rsync
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0

After creating this file, the MariaDB server was restarted on each VM.

The MariaDB 10.0.21 VM joined the cluster without issue.

The MariaDB 10.1.8 VM failed to join the cluster. Log file attached.



 Comments   
Comment by Nirbhay Choubey (Inactive) [ 2015-10-27 ]

darkain
Can you try setting wsrep_sst_receive_address and see how it goes?

https://mariadb.com/kb/en/mariadb/galera-cluster-system-variables/#wsrep_sst_receive_address

mysqld[9460]: 2015-10-20  9:45:34 139731981920000 [ERROR] WSREP: Failed to guess address to accept state transfer. wsrep_sst_receive_address must be set manually.
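For reference, a minimal sketch of that setting in the joiner's galera.cnf. The IP below is the 10.1 VM's own address as reported later in this thread; the value must be the joining node's address, where it will accept the state transfer, not the donor's:

```ini
# Illustrative addition to /etc/mysql/conf.d/galera.cnf on the JOINING node.
# wsrep_sst_receive_address must point at this node itself: the address
# where it will accept the incoming state transfer.
[galera]
wsrep_sst_receive_address=192.168.100.161
```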

Comment by Vincent Milum Jr [ 2015-10-27 ]

Added the option to the end of the galera.cnf file, specifying my "admin" node (192.168.100.10).

1) Replication still fails.
2) It is now failing at a different point, however.
3) Despite specifying the specific node to replicate from, this setting is ignored, and Galera still chooses a donor on its own, seemingly at random.

Specifically this line:
requested state transfer from 'any'. Selected 0.3 (maria-3)(SYNCED) as donor.

And as far as actual replication goes, the log doesn't seem to be too descriptive at all:
[Warning] WSREP: 0.3 (maria-3): State transfer to 1.3 () failed: -255 (Unknown error 255)

Comment by Vincent Milum Jr [ 2015-10-27 ]

As another note: after failing, the MariaDB service automatically retries over and over again every few seconds, creating huge log files that are hard to parse. Attempting a "service mysql stop" doesn't stop this either.

Comment by Nirbhay Choubey (Inactive) [ 2015-10-27 ]

darkain Did you set wsrep_sst_donor? Can you also share donor logs?

Comment by Vincent Milum Jr [ 2015-10-27 ]

Oh crap, you're right, I was getting donor and receive_address backwards.

Okay, after getting receive_address actually set properly, the node "joins" the cluster, just not perfectly.
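For anyone else mixing these up, a minimal sketch of the two settings side by side (values taken from the addresses and node names in this report, purely for illustration):

```ini
[galera]
# Address of THIS node, where it accepts the incoming state transfer:
wsrep_sst_receive_address=192.168.100.161
# Which cluster node should act as the SST donor (a node name, not an IP):
wsrep_sst_donor=maria-3
```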

From the 10.1 node:
SELECT * FROM INFORMATION_SCHEMA.WSREP_MEMBERSHIP;

INDEX UUID NAME ADDRESS
0 1db9130f-71e8-11e5-9a9c-1f4b702514cb maria-3 192.168.100.25:3306
1 6a2053d6-7cd1-11e5-bc60-c274c464ddd1    
2 8eb89748-4edc-11e5-bc62-9e5ca3912ec1 maria-2 192.168.100.24:3306
3 904765e1-7695-11e5-a189-5ae848881002 core 192.168.100.23:3306
4 f91c6227-768c-11e5-b489-fe1758fcc489 phpmyadmin 192.168.100.10:3306

From the existing 10.0 nodes: (notice the double-comma after the first address)
SHOW STATUS LIKE 'wsrep_incoming_addresses';

Variable_name Value
wsrep_incoming_addresses 192.168.100.25:3306,,192.168.100.24:3306,192.168.100.23:3306,192.168.100.10:3306
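The empty field is easy to spot programmatically. A small Python sketch (illustrative, not part of MariaDB) that flags members with a missing address in a wsrep_incoming_addresses value:

```python
def missing_entries(incoming_addresses: str) -> list[int]:
    """Return the indexes of cluster members whose address field is empty."""
    return [i for i, addr in enumerate(incoming_addresses.split(","))
            if addr.strip() == ""]

# Value copied from the cluster output above: the double comma marks the
# member (index 1) that failed to advertise its address.
value = ("192.168.100.25:3306,,192.168.100.24:3306,"
         "192.168.100.23:3306,192.168.100.10:3306")
print(missing_entries(value))  # → [1]
```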

So from both this and the previous issue, it looks like the 10.1 node is unable to determine the local machine's network address.

As mentioned previously, this setup was tested on both 10.0 and 10.1 at the same time; both are virtual machines under VMware running off the same Debian 8.2 ISO and config. So something has changed in 10.1 that breaks network address detection.

And for reference, there is only 1 virtual NIC, configured as VMXNET 3

root@debian-maria-10-1:~# ifconfig
eth0 Link encap:Ethernet HWaddr 00:0c:29:e6:9c:29
inet addr:192.168.100.161 Bcast:192.168.100.255 Mask:255.255.255.0
inet6 addr: fe80::20c:29ff:fee6:9c29/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:122047 errors:0 dropped:0 overruns:0 frame:0
TX packets:71413 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:1283820977 (1.1 GiB) TX bytes:14468322 (13.7 MiB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

Comment by Vincent Milum Jr [ 2015-10-27 ]

After getting 10.1 to join (albeit in the state listed above where other nodes are not aware of it), I let it idle for a while in the cluster. The following log messages are popping up quite frequently on the 10.1 node.

InnoDB: Error: Column last_update in table "mysql"."innodb_table_stats" is INT UNSIGNED NOT NULL but should be BINARY(4) NOT NULL (type mismatch).
InnoDB: Error: Fetch of persistent statistics requested for table "my_database"."my_table" but the required system tables mysql.innodb_table_stats and mysql.innodb_index_stats are not present or have unexpected structure. Using transient stats instead.

Comment by Nirbhay Choubey (Inactive) [ 2015-10-28 ]

darkain

albeit in the state listed above where other nodes are not aware of it

What do you mean by this? Is the node not part of cluster?

Comment by Nirbhay Choubey (Inactive) [ 2015-10-28 ]

jplindst The errors from the previous comment seem to be related to the use of 10.0 system tables in 10.1. Do you know a fix for this (in a cluster environment)? I believe running a local upgrade should do.

Comment by Vincent Milum Jr [ 2015-10-28 ]

The issue of "not being aware" is just as described above. Notice that in the tables from both the 10.0 nodes and the 10.1 node, there is an empty entry in the IP address field for the 10.1 node. The 10.1 node has trouble determining and broadcasting its own IP address to the other nodes, despite being connected directly to them. This is most likely also directly related to the initial issue, where the 10.1 node wasn't able to automatically determine its own IP address for receiving SST.

Comment by Vincent Milum Jr [ 2015-10-30 ]

The IPv6 patch (MDEV-8034) broke handling of a bind address of "0.0.0.0".
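As an illustration of why a wildcard bind address interacts badly with address auto-detection (this is a sketch of the general rule, not MariaDB's actual code): 0.0.0.0 and the IPv6 wildcard :: mean "listen on all interfaces" and cannot be advertised to peers, so the server has to fall back to guessing an interface address, which is the step that broke here.

```python
import ipaddress

def advertisable(bind_address: str) -> bool:
    """True only for a concrete host IP that peers could connect to.

    Wildcards (0.0.0.0, ::) select no single interface, so they cannot
    be used as the node address and force the server to guess one.
    """
    try:
        ip = ipaddress.ip_address(bind_address)
    except ValueError:
        return False  # hostname or malformed value; would need resolving
    return not ip.is_unspecified

print(advertisable("192.168.100.161"))  # True: a real interface address
print(advertisable("0.0.0.0"))          # False: IPv4 wildcard
print(advertisable("::"))               # False: IPv6 wildcard
```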

Comment by Vincent Milum Jr [ 2015-10-30 ]

pull request on github to fix this issue: https://github.com/MariaDB/server/pull/115

Comment by Jan Lindström (Inactive) [ 2015-10-30 ]

Yes, you need to run mysql_upgrade after migrating from e.g. 10.0, and before you start using your databases.

Comment by Vincent Milum Jr [ 2015-11-02 ]

Yes, you need to run mysql_upgrade after migrating from e.g. 10.0, and before you start using your databases.

So question then: Is there no direct support for a rolling upgrade from 10.0 to 10.1? Does it require literally taking the entire cluster offline to perform the needed mysql_upgrade, then re-syncing all of the nodes afterwards?

Comment by Nirbhay Choubey (Inactive) [ 2015-11-06 ]

https://github.com/MariaDB/server/commit/5079d69d48e2c1b763d23bdb294297e6d6da43a2

Comment by Nirbhay Choubey (Inactive) [ 2015-11-16 ]

darkain http://galeracluster.com/documentation-webpages/upgrading.html#id1

Comment by Nirbhay Choubey (Inactive) [ 2015-11-16 ]

jplindst Do you think it's a real problem that the following errors show up in the error log, or can they be safely ignored?

InnoDB: Error: Column last_update in table "mysql"."innodb_table_stats" is INT UNSIGNED NOT NULL but should be BINARY(4) NOT NULL (type mismatch).
InnoDB: Error: Fetch of persistent statistics requested for table "my_database"."my_table" but the required system tables mysql.innodb_table_stats and mysql.innodb_index_stats are not present or have unexpected structure. Using transient stats instead.

Comment by Jan Lindström (Inactive) [ 2015-11-17 ]

These are not the real problem; it looks more like mysql_upgrade was not run.

Comment by Vincent Milum Jr [ 2015-11-17 ]

Should mysql_upgrade be run BEFORE or AFTER a rolling upgrade, or somewhere in the middle? The main question here is how to properly run a mixed environment of 10.0 and 10.1 nodes during a rolling upgrade from one to the other, and whether this is officially supported, or whether the entire cluster has to be taken offline to do the upgrade safely.

Generated at Thu Feb 08 07:31:11 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.