Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Description
This is in relation to support ticket CS0666793. Summary:
Starting point:
MySQL 5.7 Galera cluster, 3 nodes, ~35TB of data per node
Upgrade methodology:
- Upgrade individual cluster node
- Join upgraded node to cluster
- Rinse and repeat
Tested migrations:
- MySQL 5.7 -> MariaDB 10.3: Not working; 5.7 uses a newer replication protocol than 10.3.
- MySQL 5.7 -> MariaDB 10.4, 10.5, 10.11: Not working/not safe; replicated events from MariaDB crash the remaining MySQL nodes in the cluster.
Suggested methodology:
- Disable all writes to single node
- Upgrade node
- Disable all writes to another node
- Upgrade second node
== Move writes to MariaDB nodes
== Shut down remaining MySQL node
The final two steps will either cause an application outage OR force an SST rebuild of the final node: it either has to be shut down before the write traffic is moved, or it will crash as soon as the writes are moved.
Additionally, not all writes can be moved at the same instant, so no matter which approach is taken, an application outage will happen and will be significant: due to the nodes' size, stopping and starting any node in the cluster takes anywhere from 5 to 45 minutes.
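The suggested methodology above can be sketched as a dry-run shell script. Hostnames are taken from this ticket; `pkg-upgrade-to-mariadb` and `move-application-writes-to` are hypothetical placeholders, and the mysql/systemctl invocations are illustrative only, not an exact procedure:

```shell
#!/bin/sh
# Dry-run sketch of the suggested per-node migration order.
# run() prints each step instead of executing it.
run() { echo "+ $*"; }

for node in db01 db02; do
    run mysql -h "$node" -e "SET GLOBAL read_only = ON"  # disable writes on this node
    run ssh "$node" systemctl stop mysql                 # stop MySQL 5.7
    run ssh "$node" pkg-upgrade-to-mariadb               # hypothetical: install MariaDB
    run ssh "$node" systemctl start mariadb              # rejoin the cluster as MariaDB
done

# The final two steps are the ones that force an outage or an SST rebuild,
# because the last MySQL 5.7 node crashes on MariaDB-originated writes.
run move-application-writes-to db01 db02                 # hypothetical load-balancer change
run ssh db03 systemctl stop mysql
```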
Suggested solutions, in order of preference:
- Enable MySQL-compatible GTID replication events to be used in a mixed cluster, so that writes on the MariaDB nodes replicate to the remaining MySQL nodes. Obvious caveats apply, such as only MySQL-compatible operations being supported.
- Make the failure mode safer: don't attempt to replicate incompatible events to the MySQL nodes, and don't cause them to crash. Allow for a global flag that guarantees no DDL is accepted on the MariaDB nodes until it is safe to do so.
The former option would enable MariaDB to truly be a drop-in replacement for the EOL MySQL 5.7, as advertised - even in a Galera cluster context.
The latter option would not fully resolve the issue and would still require some outage during the migration, but the risk would be significantly reduced and the failure modes easier to understand and plan for.
In an ideal world, a combination of the above would be preferable.
-- original support ticket text, including logs --
Product: Galera Cluster
Product version: 10.5, 10.11, MySQL 5.7
Database up: No
Description: Hi,
Starting point:
Cluster of 3x MySQL 5.7.42, Galera 25.3.37.
Galera configuration:
- GALERA
wsrep_provider = /usr/local/lib/libgalera_smm.so
wsrep_cluster_name="demo_cluster"
wsrep_cluster_address="gcomm://db01.demo.modirum.com,db02.demo.modirum.com,db03.demo.modirum.com"
wsrep_sst_auth=root:
wsrep_sst_method=xtrabackup-v2
wsrep_provider_options="gcache.size=300M; gcache.page_size=300M; gcache.recover=yes;"
Actions taken:
1. Upgrade db03 to MariaDB 10.11.5, Galera 26.4.16
- NOTE: Also tested with MariaDB 10.5.21, the oldest supported/available version.
2. Edit Galera-related configuration, adding (to the above):
wsrep_provider_options=".....; evs.version=0;"
wsrep_on = 0
3. Start MariaDB for the first time
4. mysql_upgrade; stop MariaDB
5. Edit Galera config setting wsrep_on = 1
6. Fix Linux-ism in wsrep_sst_mariabackup:
--- wsrep_sst_mariabackup.org 2023-10-17 10:03:59.457219000 +0000
+++ wsrep_sst_mariabackup 2023-10-17 09:27:51.578206000 +0000
@@ -798,7 +798,7 @@
local ltcmd="$tcmd"
if [ $tmt -gt 0 ]; then
if [ -n "$(commandex timeout)" ]; then
- if timeout --help | grep -qw -F -- '-k'; then
+ if timeout --help 2>&1 | grep -qw -F -- '-k'; then
ltcmd="timeout -k $(( tmt+10 )) $tmt $tcmd"
else
ltcmd="timeout -s9 $tmt $tcmd"
7. Start MariaDB again
------
At this point, the cluster consists of three nodes:
db01 - MySQL 5.7
db02 - MySQL 5.7
db03 - MariaDB 10.11
Replication from db01/db02 -> db03 works fine. Changes are applied cleanly and there are no issues.
CREATE DATABASE on db03 replicates to db01/02 without problems.
CREATE TABLE on db03 replicates to db01/02.
However: any INSERT (and probably any other write) statement on db03 causes both db01 and db02 to crash. Example:
db03> CREATE DATABASE foobar; USE foobar;
db03> CREATE TABLE foo (id int, bar varchar(255));
db03> INSERT INTO foo VALUES (1, 'baz');
Resulting log on db01 and db02:
2023-10-17T09:29:14.780803Z 0 [Note] WSREP: Member 1.0 (db03.demo.modirum.com) synced with group.
2023-10-17T09:29:17.110847Z 0 [Note] WSREP: (a918972e, 'tcp://0.0.0.0:4567') turning message relay requesting off
2023-10-17T09:45:04.385887Z 1 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 52, event_type: -96
2023-10-17T09:45:04.386483Z 1 [ERROR] WSREP: applier could not read binlog event, seqno: 2099323, len: 85
2023-10-17T09:45:04.386681Z 1 [Warning] WSREP: Failed to apply app buffer: seqno: 2099323, status: 1
at /wrkdirs/usr/ports/databases/galera/work/galera-release_25.3.37/galera/src/trx_handle.cpp:apply():353
Retrying 2th time
2023-10-17T09:45:04.386702Z 1 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 52, event_type: -96
2023-10-17T09:45:04.386714Z 1 [ERROR] WSREP: applier could not read binlog event, seqno: 2099323, len: 85
2023-10-17T09:45:04.386824Z 1 [Warning] WSREP: Failed to apply app buffer: seqno: 2099323, status: 1
at /wrkdirs/usr/ports/databases/galera/work/galera-release_25.3.37/galera/src/trx_handle.cpp:apply():353
Retrying 3th time
2023-10-17T09:45:04.386843Z 1 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 52, event_type: -96
2023-10-17T09:45:04.386853Z 1 [ERROR] WSREP: applier could not read binlog event, seqno: 2099323, len: 85
2023-10-17T09:45:04.386951Z 1 [Warning] WSREP: Failed to apply app buffer: seqno: 2099323, status: 1
at /wrkdirs/usr/ports/databases/galera/work/galera-release_25.3.37/galera/src/trx_handle.cpp:apply():353
Retrying 4th time
2023-10-17T09:45:04.386968Z 1 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 52, event_type: -96
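A note on the `event_type: -96` in the log above: MySQL prints the event type as a signed byte, and reinterpreted as unsigned it is 160. This matches (an assumption based on MariaDB's binlog event numbering, not confirmed from these logs) MariaDB's ANNOTATE_ROWS_EVENT, an event type MySQL 5.7 does not know, which would be consistent with the "Found invalid event in binary log" error:

```shell
# event_type is logged as a signed byte; mask to get the unsigned value.
event_type=-96
echo $(( event_type & 0xFF ))
```

This prints 160.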