Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32497

Enable safe migration from MySQL to MariaDB in a Galera cluster

    XMLWordPrintable

Details

    Description

      This is in relation to support ticket CS0666793. Summary:

      Starting point:
      MySQL 5.7 Galera cluster, 3 nodes, ~35TB of data per node

      Upgrade methodology:

      • Upgrade individual cluster node
      • Join upgraded node to cluster
      • Rinse and repeat

      Tested migrations:

      • MySQL 5.7 -> MariaDB 10.3: Not working; 5.7 uses newer replication protocol than 10.3.
      • MySQL 5.7 -> MariaDB 10.4, 10.5, 10.11: Not working/not safe; events from MariaDB crashes remaining MySQL nodes in cluster

      Suggested methodology:

      • Disable all writes to single node
      • Upgrade node
      • Disable all writes to another node
      • Upgrade second node
        == Move writes to MariaDB nodes
        == Shut down remaining MySQL node

      The final two steps will either cause application outage OR force a SST rebuild of the final node, since it either has to be shut down prior to moving the write traffic or it will crash as soon as the writes are moved.
      Additionally, not all writes can be moved in the same instant, therefore no matter which approach is taken, application outage will happen and will be significant: Stopping and starting any node in the cluster will, due to their size, take anything from 5 to 45 minutes.

      Suggested solutions, in order of preference:

      • Enable MySQL-compatible GTID recplication events to be used in a mixed cluster, enabling writes on the MariaDB nodes to replicate to remaining MySQL nodes. Obvious caveats like only MySQL-compatible operations are supported, etc.
      • Make the failure mode safer: Don't attempt to replicate incompatible events to the MySQL nodes, don't cause them to crash. Allow for a global write-only flag that guarantees no DDL is accepted on the MariaDB nodes until it is safe to do so.

      The former option would enable MariaDB to truly be a drop-in replacement for the EOL MySQL 5.7, as advertised - even in a Galera cluster context.

      The latter option would not fully resolve the issue and would still require some outage during the migration, but the risk would be significantly reduced and the failure modes easier to understand and plan for.

      In an ideal world, a combination of the above would be preferable.

      — original support ticket text, including logs –

      Product: Galera Cluster
      Product version: 10.5, 10.11, MySQL 5.7
      Database up: No

      Description: Hi,

      Starting point:
      Cluster of 3x MySQL 5.7.42, Galera 25.3.37.
      Galera configuration:

      1. GALERA
        wsrep_provider = /usr/local/lib/libgalera_smm.so
        wsrep_cluster_name="demo_cluster"
        wsrep_cluster_address="gcomm://db01.demo.modirum.com,db02.demo.modirum.com,db03.demo.modirum.com"
        wsrep_sst_auth=root:
        wsrep_sst_method=xtrabackup-v2
        wsrep_provider_options="gcache.size=300M; gcache.page_size=300M; gcache.recover=yes;"

      Actions taken:
      1. Upgrade db03 to MariaDB 10.11.5, Galera 26.4.16
      – NOTE: Also tested with MariaDB 10.5.21, the oldest supported/available version.

      2. Edit Galera-related configuration, adding (to the above):
      wsrep_provider_options=".....; evs.version=0;"
      wsrep_on = 0

      3. Start MariaDB for the first time

      4. mysql_upgrade; stop MariaDB

      5. Edit Galera config setting wsrep_on = 1

      6. Fix Linux-ism in wsrep_sst_mariadb:
      — wsrep_sst_mariabackup.org 2023-10-17 10:03:59.457219000 +0000
      +++ wsrep_sst_mariabackup 2023-10-17 09:27:51.578206000 +0000
      @@ -798,7 +798,7 @@
      local ltcmd="$tcmd"
      if [ $tmt -gt 0 ]; then
      if [ -n "$(commandex timeout)" ]; then

      • if timeout --help | grep -qw -F – '-k'; then
        + if timeout --help 2>&1 | grep -qw -F – '-k'; then
        ltcmd="timeout -k $(( tmt+10 )) $tmt $tcmd"
        else
        ltcmd="timeout -s9 $tmt $tcmd"

      7. Start MariaDB again

      ------
      At this point, the cluster consists of three nodes:
      db01 - MySQL 5.7
      db02 - MySQL 5.7
      db03 - MariaDB 10.11

      Replication from db01/db02 -> db03 works fine. Changes are applied cleanly and there are no issues.

      CREATE DATABASE on db03 replicate to db01/02 without problems.
      CREATE TABLE on db03 replicates to db01/02

      However: Any INSERT (and probably any other write) statement on db03 causes both db01 and db02 to crash. Example:

      db03> CREATE DATABASE foobar; USE foobar;

      db03> CREATE TABLE foo (id int, bar varchar(255));

      db03> INSERT INTO foo VALUES (1, 'baz');

      Resulting log on db01 and db02:

      2023-10-17T09:29:14.780803Z 0 [Note] WSREP: Member 1.0 (db03.demo.modirum.com) synced with group.
      2023-10-17T09:29:17.110847Z 0 [Note] WSREP: (a918972e, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2023-10-17T09:45:04.385887Z 1 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 52, event_type: -96
      2023-10-17T09:45:04.386483Z 1 [ERROR] WSREP: applier could not read binlog event, seqno: 2099323, len: 85
      2023-10-17T09:45:04.386681Z 1 [Warning] WSREP: Failed to apply app buffer: seqno: 2099323, status: 1
      at /wrkdirs/usr/ports/databases/galera/work/galera-release_25.3.37/galera/src/trx_handle.cpp:apply():353
      Retrying 2th time
      2023-10-17T09:45:04.386702Z 1 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 52, event_type: -96
      2023-10-17T09:45:04.386714Z 1 [ERROR] WSREP: applier could not read binlog event, seqno: 2099323, len: 85
      2023-10-17T09:45:04.386824Z 1 [Warning] WSREP: Failed to apply app buffer: seqno: 2099323, status: 1
      at /wrkdirs/usr/ports/databases/galera/work/galera-release_25.3.37/galera/src/trx_handle.cpp:apply():353
      Retrying 3th time
      2023-10-17T09:45:04.386843Z 1 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 52, event_type: -96
      2023-10-17T09:45:04.386853Z 1 [ERROR] WSREP: applier could not read binlog event, seqno: 2099323, len: 85
      2023-10-17T09:45:04.386951Z 1 [Warning] WSREP: Failed to apply app buffer: seqno: 2099323, status: 1
      at /wrkdirs/usr/ports/databases/galera/work/galera-release_25.3.37/galera/src/trx_handle.cpp:apply():353
      Retrying 4th time
      2023-10-17T09:45:04.386968Z 1 [ERROR] Error in Log_event::read_log_event(): 'Found invalid event in binary log', data_len: 52, event_type: -96

      Attachments

        Activity

          People

            Unassigned Unassigned
            ltning Eirik Øverby
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.