MariaDB Server / MDEV-22723

Data loss when performing rolling upgrade from 10.3.23-MariaDB to 10.4.13-MariaDB

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.3.23, 10.4.13
    • Fix Version/s: 10.3.26, 10.4.16
    • Component/s: Galera
    • Labels: None
    • OS: CentOS Linux release 7.6.1810 (Core)

    Description

      We created a full Galera cluster of three nodes (mdb1, mdb2, mdb3), all running 10.3.23.
      We shut down mdb3 cleanly to check how writes on 10.3.23 affect 10.4.13 and to force an IST. We also re-tested with all three servers up, with the same result.

      Create a schema and a table on mdb1. Both propagate.

      • stop mdb2 and yum remove the MariaDB and Galera RPMs
      • install MariaDB 10.4 from the new repo and update my.cnf to point to the right wsrep_provider
      • set wsrep_on=OFF in my.cnf
      • start mdb2
      • run mysql_upgrade -s
      • stop mdb2
      • set wsrep_on=ON in my.cnf
      • start mdb2

      At this point the Galera status variables on mdb2 are:

      MariaDB mdb2 [pippo]> show global status like 'wsrep%';
      +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
      | Variable_name                 | Value                                                                                                                                          |
      +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
      | wsrep_local_state_uuid        | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0                                                                                                           |
      | wsrep_protocol_version        | -1                                                                                                                                             |
      | wsrep_last_committed          | 65                                                                                                                                             |
      | wsrep_replicated              | 0                                                                                                                                              |
      | wsrep_replicated_bytes        | 0                                                                                                                                              |
      | wsrep_repl_keys               | 0                                                                                                                                              |
      | wsrep_repl_keys_bytes         | 0                                                                                                                                              |
      | wsrep_repl_data_bytes         | 0                                                                                                                                              |
      | wsrep_repl_other_bytes        | 0                                                                                                                                              |
      | wsrep_received                | 3                                                                                                                                              |
      | wsrep_received_bytes          | 208                                                                                                                                            |
      | wsrep_local_commits           | 0                                                                                                                                              |
      | wsrep_local_cert_failures     | 0                                                                                                                                              |
      | wsrep_local_replays           | 0                                                                                                                                              |
      | wsrep_local_send_queue        | 0                                                                                                                                              |
      | wsrep_local_send_queue_max    | 1                                                                                                                                              |
      | wsrep_local_send_queue_min    | 0                                                                                                                                              |
      | wsrep_local_send_queue_avg    | 0                                                                                                                                              |
      | wsrep_local_recv_queue        | 0                                                                                                                                              |
      | wsrep_local_recv_queue_max    | 1                                                                                                                                              |
      | wsrep_local_recv_queue_min    | 0                                                                                                                                              |
      | wsrep_local_recv_queue_avg    | 0                                                                                                                                              |
      | wsrep_local_cached_downto     | 64                                                                                                                                             |
      | wsrep_flow_control_paused_ns  | 0                                                                                                                                              |
      | wsrep_flow_control_paused     | 0                                                                                                                                              |
      | wsrep_flow_control_sent       | 0                                                                                                                                              |
      | wsrep_flow_control_recv       | 0                                                                                                                                              |
      | wsrep_cert_deps_distance      | 0                                                                                                                                              |
      | wsrep_apply_oooe              | 0.5                                                                                                                                            |
      | wsrep_apply_oool              | 0                                                                                                                                              |
      | wsrep_apply_window            | 1.5                                                                                                                                            |
      | wsrep_commit_oooe             | 0                                                                                                                                              |
      | wsrep_commit_oool             | 0                                                                                                                                              |
      | wsrep_commit_window           | 1                                                                                                                                              |
      | wsrep_local_state             | 4                                                                                                                                              |
      | wsrep_local_state_comment     | Synced                                                                                                                                         |
      | wsrep_cert_index_size         | 0                                                                                                                                              |
      | wsrep_causal_reads            | 0                                                                                                                                              |
      | wsrep_cert_interval           | 0                                                                                                                                              |
      | wsrep_open_transactions       | 0                                                                                                                                              |
      | wsrep_open_connections        | 0                                                                                                                                              |
      | wsrep_incoming_addresses      | AUTO,10.0.1.13:3306                                                                                                                            |
      | wsrep_cluster_weight          | 2                                                                                                                                              |
      | wsrep_desync_count            | 0                                                                                                                                              |
      | wsrep_evs_delayed             |                                                                                                                                                |
      | wsrep_evs_evict_list          |                                                                                                                                                |
      | wsrep_evs_repl_latency        | 0.000325151/0.00176008/0.00607075/0.00193032/7                                                                                                 |
      | wsrep_evs_state               | OPERATIONAL                                                                                                                                    |
      | wsrep_gcomm_uuid              | 7ff14eaf-9ed6-11ea-b98f-8fc2b85537f4                                                                                                           |
      | wsrep_applier_thread_count    | 32                                                                                                                                             |
      | wsrep_cluster_capabilities    |                                                                                                                                                |
      | wsrep_cluster_conf_id         | 18446744073709551615                                                                                                                           |
      | wsrep_cluster_size            | 0                                                                                                                                              |
      | wsrep_cluster_state_uuid      |                                                                                                                                                |
      | wsrep_cluster_status          | Primary                                                                                                                                        |
      | wsrep_connected               | ON                                                                                                                                             |
      | wsrep_local_bf_aborts         | 0                                                                                                                                              |
      | wsrep_local_index             | 18446744073709551615                                                                                                                           |
      | wsrep_provider_capabilities   | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
      | wsrep_provider_name           | Galera                                                                                                                                         |
      | wsrep_provider_vendor         | Codership Oy <info@codership.com>                                                                                                              |
      | wsrep_provider_version        | 26.4.4(r4599)                                                                                                                                  |
      | wsrep_ready                   | ON                                                                                                                                             |
      | wsrep_rollbacker_thread_count | 1                                                                                                                                              |
      | wsrep_thread_count            | 33                                                                                                                                             |
      +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
      65 rows in set (0.001 sec)
      

      NOTE THAT:

      wsrep_cluster_status          | Primary
      wsrep_local_state_comment     | Synced
      wsrep_local_index             | 18446744073709551615
      wsrep_cluster_size            | 0
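
      The combination above (a node reporting Primary/Synced while its view of the cluster is empty) can be checked mechanically. The following is a minimal sketch only: find_view_inconsistencies is a hypothetical helper, not part of MariaDB or Galera, and it assumes the status rows have already been read into a dict of strings. The value 18446744073709551615 is 2**64 - 1, which is what -1 looks like when shown as an unsigned 64-bit integer.

```python
# Hypothetical helper (not a MariaDB/Galera API): flags a node that claims to
# be Primary/Synced although its cluster view looks uninitialised.

UINT64_MAX = 2**64 - 1  # 18446744073709551615, i.e. -1 as unsigned 64-bit


def _as_int(status, name):
    """Return the status value as int, or None if it is missing or empty."""
    value = status.get(name)
    return int(value) if value not in (None, "") else None


def find_view_inconsistencies(status):
    """Return a list of problems found in a dict built from
    SHOW GLOBAL STATUS LIKE 'wsrep%' (variable name -> value as string)."""
    problems = []
    claims_healthy = (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
    )
    if not claims_healthy:
        # Only the "looks healthy but is not" case is of interest here.
        return problems

    if _as_int(status, "wsrep_cluster_size") == 0:
        problems.append("wsrep_cluster_size is 0")
    if _as_int(status, "wsrep_local_index") == UINT64_MAX:
        problems.append("wsrep_local_index is 2**64 - 1 (unset)")
    if _as_int(status, "wsrep_cluster_conf_id") == UINT64_MAX:
        problems.append("wsrep_cluster_conf_id is 2**64 - 1 (unset)")
    protocol = _as_int(status, "wsrep_protocol_version")
    if protocol is not None and protocol < 0:
        problems.append("wsrep_protocol_version is negative")
    return problems
```

      Fed with the values from the output above, this sketch reports all four problems. With the values seen after the restart (cluster size 2, local index 1, conf id 6, protocol version 9) it reports none.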
      

      According to the error log, the server is ready for connections after an IST.

      At this point, a write on the 'master' mdb1 is not replicated to mdb2:

      MariaDB mdb2 [pippo]> select * from evento4;
      +----+---------------+--------+
      | Id | IdDispositivo | kkkk   |
      +----+---------------+--------+
      |  1 |           123 | aaaa   |
      |  3 |           222 | eeeeaa |
      |  4 |      34523452 | e4r4r4 |
      +----+---------------+--------+
      

      WHILE ON THE MASTER:

      MariaDB mdb1 [pippo]> select * from evento4;
      +----+---------------+--------+
      | Id | IdDispositivo | kkkk   |
      +----+---------------+--------+
      |  1 |           123 | aaaa   |
      |  3 |           222 | eeeeaa |
      |  4 |      34523452 | e4r4r4 |
      +----+---------------+--------+
      3 rows in set (0.001 sec)
       
      MariaDB mdb1 [pippo]> insert into evento4 (IdDispositivo,kkkk) values (3,'non tireplic');
      Query OK, 1 row affected (0.015 sec)
       
      MariaDB mdb1 [pippo]> select * from evento4;
      +----+---------------+--------------+
      | Id | IdDispositivo | kkkk         |
      +----+---------------+--------------+
      |  1 |           123 | aaaa         |
      |  3 |           222 | eeeeaa       |
      |  4 |      34523452 | e4r4r4       |
      |  6 |             3 | non tireplic |
      +----+---------------+--------------+
      4 rows in set (0.001 sec)
      

      The fact that the INSERT is not replicated could indeed be caused by wsrep_cluster_size=0 and wsrep_local_index=18446744073709551615.

      At this point we restart mdb2 to fix the status:

      [root@mdb2 my.cnf.d]# systemctl restart  mariadb
      [root@mdb2 my.cnf.d]# mysql
       
      MariaDB md2 [(none)]> show global status like 'wsrep%';
      +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
      | Variable_name                 | Value                                                                                                                                          |
      +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
      | wsrep_local_state_uuid        | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0                                                                                                           |
      | wsrep_protocol_version        | 9                                                                                                                                              |
      | wsrep_last_committed          | 66                                                                                                                                             |
      | wsrep_replicated              | 0                                                                                                                                              |
      | wsrep_replicated_bytes        | 0                                                                                                                                              |
      | wsrep_repl_keys               | 0                                                                                                                                              |
      | wsrep_repl_keys_bytes         | 0                                                                                                                                              |
      | wsrep_repl_data_bytes         | 0                                                                                                                                              |
      | wsrep_repl_other_bytes        | 0                                                                                                                                              |
      | wsrep_received                | 2                                                                                                                                              |
      | wsrep_received_bytes          | 200                                                                                                                                            |
      | wsrep_local_commits           | 0                                                                                                                                              |
      | wsrep_local_cert_failures     | 0                                                                                                                                              |
      | wsrep_local_replays           | 0                                                                                                                                              |
      | wsrep_local_send_queue        | 0                                                                                                                                              |
      | wsrep_local_send_queue_max    | 1                                                                                                                                              |
      | wsrep_local_send_queue_min    | 0                                                                                                                                              |
      | wsrep_local_send_queue_avg    | 0                                                                                                                                              |
      | wsrep_local_recv_queue        | 0                                                                                                                                              |
      | wsrep_local_recv_queue_max    | 1                                                                                                                                              |
      | wsrep_local_recv_queue_min    | 0                                                                                                                                              |
      | wsrep_local_recv_queue_avg    | 0                                                                                                                                              |
      | wsrep_local_cached_downto     | 64                                                                                                                                             |
      | wsrep_flow_control_paused_ns  | 0                                                                                                                                              |
      | wsrep_flow_control_paused     | 0                                                                                                                                              |
      | wsrep_flow_control_sent       | 0                                                                                                                                              |
      | wsrep_flow_control_recv       | 0                                                                                                                                              |
      | wsrep_cert_deps_distance      | 0                                                                                                                                              |
      | wsrep_apply_oooe              | 0                                                                                                                                              |
      | wsrep_apply_oool              | 0                                                                                                                                              |
      | wsrep_apply_window            | 0                                                                                                                                              |
      | wsrep_commit_oooe             | 0                                                                                                                                              |
      | wsrep_commit_oool             | 0                                                                                                                                              |
      | wsrep_commit_window           | 0                                                                                                                                              |
      | wsrep_local_state             | 4                                                                                                                                              |
      | wsrep_local_state_comment     | Synced                                                                                                                                         |
      | wsrep_cert_index_size         | 0                                                                                                                                              |
      | wsrep_causal_reads            | 0                                                                                                                                              |
      | wsrep_cert_interval           | 0                                                                                                                                              |
      | wsrep_open_transactions       | 0                                                                                                                                              |
      | wsrep_open_connections        | 0                                                                                                                                              |
      | wsrep_incoming_addresses      | 10.0.1.13:3306,AUTO                                                                                                                            |
      | wsrep_cluster_weight          | 2                                                                                                                                              |
      | wsrep_desync_count            | 0                                                                                                                                              |
      | wsrep_evs_delayed             |                                                                                                                                                |
      | wsrep_evs_evict_list          |                                                                                                                                                |
      | wsrep_evs_repl_latency        | 0.000853237/0.001923/0.00333681/0.0010427/3                                                                                                    |
      | wsrep_evs_state               | OPERATIONAL                                                                                                                                    |
      | wsrep_gcomm_uuid              | ab80ace4-9ed6-11ea-8cdf-eab063bfbbb6                                                                                                           |
      | wsrep_applier_thread_count    | 32                                                                                                                                             |
      | wsrep_cluster_capabilities    |                                                                                                                                                |
      | wsrep_cluster_conf_id         | 6                                                                                                                                              |
      | wsrep_cluster_size            | 2                                                                                                                                              |
      | wsrep_cluster_state_uuid      | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0                                                                                                           |
      | wsrep_cluster_status          | Primary                                                                                                                                        |
      | wsrep_connected               | ON                                                                                                                                             |
      | wsrep_local_bf_aborts         | 0                                                                                                                                              |
      | wsrep_local_index             | 1                                                                                                                                              |
      | wsrep_provider_capabilities   | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
      | wsrep_provider_name           | Galera                                                                                                                                         |
      | wsrep_provider_vendor         | Codership Oy <info@codership.com>                                                                                                              |
      | wsrep_provider_version        | 26.4.4(r4599)                                                                                                                                  |
      | wsrep_ready                   | ON                                                                                                                                             |
      | wsrep_rollbacker_thread_count | 1                                                                                                                                              |
      | wsrep_thread_count            | 33                                                                                                                                             |
      +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
      65 rows in set (0.002 sec)
      

      NOTE that the status is now OK:

      wsrep_local_index             | 1
      wsrep_cluster_status          | Primary
      wsrep_local_state_comment     | Synced
      wsrep_cluster_size            | 2
      

      But when we check the data, expecting the new row to be present:

      MariaDB mdb2 [pippo]> select * from evento4;
      +----+---------------+--------+
      | Id | IdDispositivo | kkkk   |
      +----+---------------+--------+
      |  1 |           123 | aaaa   |
      |  3 |           222 | eeeeaa |
      |  4 |      34523452 | e4r4r4 |
      +----+---------------+--------+
      3 rows in set (0.001 sec)
      

      The row is not there.

      Writes made after this point are all replicated. So data is lost from the moment the first IST completes until the node is restarted and regains the cluster status.
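
      The comparison between mdb1 and mdb2 above was done by eye. As a minimal sketch, the same check can be expressed as a comparison by primary key. missing_rows is a hypothetical helper and the rows are assumed to have been fetched from each node already; nothing here talks to a server.

```python
# Hypothetical helper: list rows that exist on a reference node but are
# absent on another node, matching on the primary-key column.

def missing_rows(reference_rows, candidate_rows, key_index=0):
    """Return rows of reference_rows whose key is not in candidate_rows."""
    candidate_keys = {row[key_index] for row in candidate_rows}
    return [row for row in reference_rows if row[key_index] not in candidate_keys]


# Rows of pippo.evento4 as shown in this report: (Id, IdDispositivo, kkkk)
mdb1_rows = [
    (1, 123, "aaaa"),
    (3, 222, "eeeeaa"),
    (4, 34523452, "e4r4r4"),
    (6, 3, "non tireplic"),
]
mdb2_rows = [
    (1, 123, "aaaa"),
    (3, 222, "eeeeaa"),
    (4, 34523452, "e4r4r4"),
]

lost_on_mdb2 = missing_rows(mdb1_rows, mdb2_rows)
```

      With the rows from this report, lost_on_mdb2 contains only the row with Id 6, which is the one written on mdb1 while mdb2 was in the broken state.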

      Attachments

        1. 200612_mysqld.1.err (62 kB)
        2. 200612_mysqld.2.err (121 kB)
        3. 200612_mysqld.3.err (70 kB)
        4. 200709_patgal_output.zip (15 kB)
        5. 20200713_MDEV-22723_patgal_no_errors.zip (35 kB)
        6. 20200714_MDEV-22723_mdb_no_errors.zip (32 kB)
        7. 20200714_MDEV-22723_patgal_no_errors.zip (28 kB)
        8. 20200720_MDEV-22723_CentOS_7.5_no_errors.zip (24 kB)
        9. 20200723_MDEV-22723_data_loss.zip (43 kB)
        10. error_log_mdb1 (23 kB)
        11. error_log_mdb2.after_upgrade (87 kB)
        12. mysqld_new.2.cnf (2 kB)
        13. mysqld_old.1.cnf (2 kB)
        14. mysqld_old.2.cnf (2 kB)
        15. mysqld_old.3.cnf (2 kB)
        16. node1_bootsrapped_10.3.23.log (91 kB)
        17. node1_bootsrapped_10.3.23.log.rtf (93 kB)
        18. node2_upgraded_10.4.13.log (14 kB)
        19. node2_upgraded.log.rtf (14 kB)
        20. server.cnf_mdb1 (2 kB)
        21. server.cnf_mdb2 (2 kB)

          Activity

            stepan.patryshev Stepan Patryshev (Inactive) added a comment - edited

            Yurchenko, I hope you are right, but I used Galera 26.4.4(r4599) on 20.07.2020 and there was no data loss.

            Yurchenko Alexey added a comment -

            julien.fritsch
            Yes, it is fixed in later Galera releases.

            stepan.patryshev
            On 20.07.2020 there was a mistake in case reproduction: in Massimo's case node 2 was missing 2 events and had to perform state transfer. In your case it seems there were no updates to the cluster during node 2 upgrade: it was shut down at seqno 7 and was brought back - cluster still had seqno 7. So there was no state transfer and it is a different code path.

            And yes, I found out why in Massimo's case some transactions were lost:

            [Warning] WSREP: trx protocol version: 4 does not match certification protocol version: -1
            

            is a warning because, during the upgrade of the last node and the resulting protocol bump, we can expect to receive a writeset with an old protocol version, and in that case it is simply supposed to fail certification on all nodes. The problem (fixed in the commit I mentioned above) was that the protocol version was not updated in total order (it was not updated at all). As a result, all transactions that failed certification on node 2 (and thus were skipped) passed certification on node 1 and were committed there. In the end both nodes believed that they had successfully processed all events and were on the same page regarding the last seqno. That is why the missing events went unnoticed.

            However when node 2 was restarted, it rejoined the cluster without state transfer, the bug was not triggered, and it could continue to apply transactions.
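
            The divergence described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Galera's certification code: the `Node` class, its fields, and the seqno values 8 and 9 are hypothetical and chosen for illustration. It shows how two nodes can agree on the last seqno while one of them has silently skipped every writeset, because its certification protocol version was never updated from -1.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of a cluster node (hypothetical, not Galera's real data structures)."""
    name: str
    cert_version: int  # certification protocol version the node believes is active
    applied: list = field(default_factory=list)
    last_seqno: int = 0

    def process(self, seqno: int, trx_version: int) -> None:
        # Every node advances its seqno for each ordered event,
        # whether or not the writeset passes certification.
        self.last_seqno = seqno
        if trx_version == self.cert_version:
            self.applied.append(seqno)  # certified and committed
        # Otherwise certification fails and the writeset is skipped.


# node1 has the current protocol version; node2 is stuck at -1 (the bug).
node1 = Node("node1", cert_version=4)
node2 = Node("node2", cert_version=-1)

# Illustrative seqnos: two writesets with trx protocol version 4.
for seqno in (8, 9):
    for node in (node1, node2):
        node.process(seqno, trx_version=4)

for node in (node1, node2):
    print(node.name, "last_seqno =", node.last_seqno, "applied =", node.applied)
```

            In this model both nodes end at the same seqno, so comparing seqnos alone does not reveal that node2 applied nothing.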


            Yurchenko Thank you for the clarifications. But I want to note that rpizzi reproduced it without updating data during Node2 upgrade: steps are here.

            stepan.patryshev Stepan Patryshev (Inactive) added a comment

            I have verified that with Galera 26.4.5(rb3764ab) and 25.3.30(r827e681) there was no data loss or crash. The steps were the same as those that reproduced the bug on 23.07.2020 with 25.3.28(r3875) and 26.4.4(r4599).

            But the strange wsrep values were still present right after the upgraded node joined the cluster for the first time:

            wsrep_local_index 18446744073709551615
            wsrep_cluster_size 0
            stepan.patryshev Stepan Patryshev (Inactive) added a comment
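
            A note on the `wsrep_local_index` value reported above: 18446744073709551615 is exactly 2^64 - 1, which is what -1 looks like when a 64-bit value is printed as unsigned. One plausible reading, which is an assumption and not confirmed anywhere in this thread, is that the status variable held a "not set" sentinel of -1 at that moment. The helper name `as_signed64` below is hypothetical; the sketch only demonstrates the arithmetic.

```python
def as_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed (two's complement) one."""
    return value - (1 << 64) if value >= (1 << 63) else value


reported = 18446744073709551615  # wsrep_local_index as shown in the comment

# The reported value is the largest unsigned 64-bit integer.
print(reported == 2**64 - 1)

# Read as a signed 64-bit integer, the same bit pattern is -1.
print(as_signed64(reported))
```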

            Closing as fixed.

            stepan.patryshev Stepan Patryshev (Inactive) added a comment

            People

              Yurchenko Alexey
              massimo.disaro Massimo
              Votes: 3
              Watchers: 7
