Details
- Type: Bug
- Status: Closed (View Workflow)
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 10.3.23, 10.4.13
- Fix Version/s: None
- OS: CentOS Linux release 7.6.1810 (Core)
Description
We created a full Galera cluster of 3 nodes (mdb1, mdb2, mdb3), all on version 10.3.23.
We gently shut down mdb3 to check the interaction between writes on 10.3.23 and their effect on 10.4.13, and to force an IST. We also re-tested with all 3 servers up, with the same result.
We created a schema and a table on mdb1; everything propagated. Then:
- Stop mdb2; yum remove the MariaDB and Galera RPMs.
- Install from the new MariaDB 10.4 repo and update my.cnf with the correct wsrep_provider.
- Set wsrep_on=OFF in my.cnf.
- Start mdb2.
- Perform mysql_upgrade -s.
- Stop mdb2.
- Set wsrep_on=ON in my.cnf.
- Start mdb2.
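The per-node upgrade sequence above can be sketched as a shell script. This is a dry-run outline only: the package names and the my.cnf edits are illustrative assumptions, echoed rather than executed so the ordering can be inspected safely.

```shell
#!/bin/bash
# Dry-run outline of the per-node upgrade sequence described above.
# Package names and config-edit steps are illustrative assumptions.
upgrade_dry_run() {
    local steps=(
        "systemctl stop mariadb"
        "yum remove MariaDB-server galera"
        "yum install MariaDB-server-10.4 galera-4"
        "set wsrep_on=OFF in /etc/my.cnf.d/server.cnf"
        "systemctl start mariadb"
        "mysql_upgrade -s"
        "systemctl stop mariadb"
        "set wsrep_on=ON in /etc/my.cnf.d/server.cnf"
        "systemctl start mariadb"
    )
    local s
    for s in "${steps[@]}"; do
        echo "STEP: $s"
    done
}
upgrade_dry_run
```

Replace each echoed step with the real command for your environment before use.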
At this point, the Galera status variables on mdb2:
MariaDB mdb2 [pippo]> show global status like 'wsrep%';
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_protocol_version | -1 |
| wsrep_last_committed | 65 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3 |
| wsrep_received_bytes | 208 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | 64 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0.5 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 1.5 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 1 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | AUTO,10.0.1.13:3306 |
| wsrep_cluster_weight | 2 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0.000325151/0.00176008/0.00607075/0.00193032/7 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 7ff14eaf-9ed6-11ea-b98f-8fc2b85537f4 |
| wsrep_applier_thread_count | 32 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 0 |
| wsrep_cluster_state_uuid | |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 18446744073709551615 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+-------+
65 rows in set (0.001 sec)
NOTE THAT:
wsrep_cluster_status      | Primary
wsrep_local_state_comment | Synced
wsrep_local_index         | 18446744073709551615
wsrep_cluster_size        | 0
Looking at the error log, the server is ready for connections after an IST.
At this point, a write on the 'master' mdb1 does not get replicated. On mdb2:
MariaDB mdb2 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk   |
+----+---------------+--------+
|  1 | 123           | aaaa   |
|  3 | 222           | eeeeaa |
|  4 | 34523452      | e4r4r4 |
+----+---------------+--------+
WHILE ON THE MASTER:
MariaDB mdb1 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk   |
+----+---------------+--------+
|  1 | 123           | aaaa   |
|  3 | 222           | eeeeaa |
|  4 | 34523452      | e4r4r4 |
+----+---------------+--------+
3 rows in set (0.001 sec)

MariaDB mdb1 [pippo]> insert into evento4 (IdDispositivo,kkkk) values (3,'non tireplic');
Query OK, 1 row affected (0.015 sec)

MariaDB mdb1 [pippo]> select * from evento4;
+----+---------------+--------------+
| Id | IdDispositivo | kkkk         |
+----+---------------+--------------+
|  1 | 123           | aaaa         |
|  3 | 222           | eeeeaa       |
|  4 | 34523452      | e4r4r4       |
|  6 | 3             | non tireplic |
+----+---------------+--------------+
4 rows in set (0.001 sec)
The fact that the INSERT is not replicated is presumably related to wsrep_cluster_size=0 and wsrep_local_index=18446744073709551615.
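A side note on those two values, since they recur throughout this report: 18446744073709551615 is 2^64 - 1, i.e. what -1 (an "undefined" sentinel) looks like when printed through an unsigned 64-bit status counter. Plain bash shows the correspondence:

```shell
#!/bin/bash
# 18446744073709551615 == 2^64 - 1: the bit pattern of -1 interpreted as an
# unsigned 64-bit integer, which is why wsrep_local_index and
# wsrep_cluster_conf_id show this value when the provider has no real index.
printf '%u\n' -1
```

So the node is effectively reporting "no index / no configuration" while still claiming Synced/Primary.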
At this point we restart mdb2 to fix the status:
[root@mdb2 my.cnf.d]# systemctl restart mariadb
[root@mdb2 my.cnf.d]# mysql
MariaDB md2 [(none)]> show global status like 'wsrep%';
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_protocol_version | 9 |
| wsrep_last_committed | 66 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 200 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | 64 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 0 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 10.0.1.13:3306,AUTO |
| wsrep_cluster_weight | 2 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0.000853237/0.001923/0.00333681/0.0010427/3 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | ab80ace4-9ed6-11ea-8cdf-eab063bfbbb6 |
| wsrep_applier_thread_count | 32 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 6 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+-------+
65 rows in set (0.002 sec)
NOTE now the status is ok:
wsrep_cluster_status      | Primary
wsrep_local_state_comment | Synced
wsrep_local_index         | 1
but when we check the data, the new row we expect to be present is missing:
MariaDB mdb2 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk   |
+----+---------------+--------+
|  1 | 123           | aaaa   |
|  3 | 222           | eeeeaa |
|  4 | 34523452      | e4r4r4 |
+----+---------------+--------+
3 rows in set (0.001 sec)
The row is not there.
Anything written after this moment replicates correctly. So the data-loss window runs from the completion of the first IST until a new restart brings the cluster status back.
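Since the node reports Synced/Primary while silently dropping writes, a monitoring guard for this state seems worthwhile. A minimal sketch (hypothetical helper; in practice the two values would come from the mysql client, e.g. `mysql -Nse "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"`):

```shell
#!/bin/bash
# Hypothetical guard for the broken post-IST state documented above:
# the node claims Synced/Primary, yet wsrep_cluster_size is 0 and
# wsrep_local_index is 2^64-1.
wsrep_state_ok() {
    local size="$1" index="$2"
    # String-compare the index: 18446744073709551615 overflows signed tests.
    if [ "$size" -eq 0 ] || [ "$index" = "18446744073709551615" ]; then
        echo "BROKEN: restart the node before trusting it with writes"
        return 1
    fi
    echo "OK"
}
wsrep_state_ok 0 18446744073709551615   # state seen in this report
wsrep_state_ok 2 1                      # state after the second restart
```

A cron or monitoring hook built on this could at least flag the window, even if it cannot recover the lost writes.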
Attachments
- 200612_mysqld.1.err (62 kB)
- 200612_mysqld.2.err (121 kB)
- 200612_mysqld.3.err (70 kB)
- error_log_mdb1 (23 kB)
- mysqld_new.2.cnf (2 kB)
- mysqld_old.1.cnf (2 kB)
- mysqld_old.2.cnf (2 kB)
- mysqld_old.3.cnf (2 kB)
- server.cnf_mdb1 (2 kB)
- server.cnf_mdb2 (2 kB)
Issue Links
- relates to
  - MDEV-29246 WSREP_CLUSTER_SIZE at 0 after rolling update a node from 10.3 to 10.4 (Closed)
  - MDEV-20439 WSREP_CLUSTER_SIZE at 0 after rolling update a node (Closed)
  - MDEV-22745 node crash on upgrade from 10.3 to 10.4 writing on the 10.4 node (Closed)
Activity
I have managed to reproduce it only partially. I have not observed any data loss during a node upgrade. But I got these strange values: wsrep_local_index = 18446744073709551615 and wsrep_cluster_size = 0.
Release builds 10.3.23 + Galera 25.3.29(rb0f34b0) and 10.4.13 + Galera 26.4.4(rae24803).
Steps:
1. ./mtr --suite=galera_3nodes --start-and-exit
2. Restart all nodes one by one with separate config files: Node1, Node2, Node3.
3. create table evento4 (Id int primary key auto_increment, IdDispositivo int, kkkk varchar(255));
4. insert into evento4(IdDispositivo, kkkk) values(123, 'aaaa');
insert into evento4(IdDispositivo, kkkk) values(222, 'eeeeaa');
insert into evento4(IdDispositivo, kkkk) values(34523452, 'e4r4r4 ');
5. Stop Node 2.
6. Set wsrep-on=OFF and run Node 2 on 10.4.13 binaries with Node2 new config.
7. Perform mysql_upgrade -s.
8. Stop Node 2.
9. Node 3: insert into evento4(IdDispositivo, kkkk) values(777777, 'While Node 2 was upgrading');
select * from evento4;
+----+---------------+----------------------------+
| Id | IdDispositivo | kkkk                       |
+----+---------------+----------------------------+
|  2 | 123           | aaaa                       |
|  5 | 222           | eeeeaa                     |
|  8 | 34523452      | e4r4r4                     |
| 10 | 777777        | While Node 2 was upgrading |
+----+---------------+----------------------------+
10. Start Node 2 with wsrep-on=ON.
11. New data appeared on Node 2:
select * from evento4;
+----+---------------+----------------------------+
| Id | IdDispositivo | kkkk                       |
+----+---------------+----------------------------+
|  2 | 123           | aaaa                       |
|  5 | 222           | eeeeaa                     |
|  8 | 34523452      | e4r4r4                     |
| 10 | 777777        | While Node 2 was upgrading |
+----+---------------+----------------------------+
But:
show global status like 'wsrep%';
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| wsrep_local_state_uuid | be36cf8b-acb6-11ea-aa2c-e3149c2ff908 |
| wsrep_protocol_version | 9 |
| wsrep_last_committed | 6 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3 |
| wsrep_received_bytes | 288 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | 6 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 1 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 1 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 127.0.0.1:16002,127.0.0.1:16000,127.0.0.1:16001 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0.000293552/0.000366098/0.000521759/7.98882e-05/5 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | e05a4078-acc3-11ea-9394-8ba782d6f291 |
| wsrep_applier_thread_count | 32 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 0 |
| wsrep_cluster_state_uuid | |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 18446744073709551615 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(rae24803) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+-------+
65 rows in set (0.001 sec)
wsrep_cluster_status | Primary |
wsrep_local_state_comment | Synced |
wsrep_local_index | 18446744073709551615 |
wsrep_cluster_size | 0 |
12. On node 3: insert into evento4 (IdDispositivo,kkkk) values (3,'non tireplic');
13. New data are replicated to Node 2:
select * from evento4;
+----+---------------+----------------------------+
| Id | IdDispositivo | kkkk                       |
+----+---------------+----------------------------+
|  2 | 123           | aaaa                       |
|  5 | 222           | eeeeaa                     |
|  8 | 34523452      | e4r4r4                     |
| 10 | 777777        | While Node 2 was upgrading |
| 13 | 3             | non tireplic               |
+----+---------------+----------------------------+
14. Restart Node 2.
15. On Node 2:
show global status like 'wsrep%';
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| wsrep_local_state_uuid | be36cf8b-acb6-11ea-aa2c-e3149c2ff908 |
| wsrep_protocol_version | 9 |
| wsrep_last_committed | 7 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 280 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | 6 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 0 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 127.0.0.1:16002,127.0.0.1:16000,127.0.0.1:16001 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | a2c23b72-acc8-11ea-afe5-cbd8cb9a86ed |
| wsrep_applier_thread_count | 32 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 17 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | be36cf8b-acb6-11ea-aa2c-e3149c2ff908 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 2 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(rae24803) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+-------+
65 rows in set (0.001 sec)
wsrep_cluster_status | Primary |
wsrep_local_state_comment | Synced |
wsrep_local_index | 2 |
wsrep_cluster_size | 3 |
Server logs: Node 1, Node 2, Node 3.
I also tried with one node stopped, and without populating data on Node 1 while Node 2 was upgrading, but there was no data loss in any case.
Data loss is there, as documented in the original description. We reproduced it many times.
I have re-tested this in my own lab (the original bug report was from Massimo; I'm on the same team).
I confirm the bug exists, and we don't understand why it is not happening for you.
Exact steps to reproduce:
1. Install 3 nodes with the latest 10.3; I used 10.3.23, wsrep version 25.3.28(r3875).
2. Create a table and insert data into it.
Situation after 2 steps above:
node1>create table dataloss (id int not null auto_increment primary key, value int);
Query OK, 0 rows affected (0.025 sec)

node1>insert into dataloss (value) values (1), (2), (3);
Query OK, 3 rows affected (0.003 sec)
Records: 3  Duplicates: 0  Warnings: 0

node1>select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  2 |     1 |
|  5 |     2 |
|  8 |     3 |
+----+-------+
3 rows in set (0.000 sec)
node1>show global status like 'wsrep%';
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| wsrep_applier_thread_count | 8 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 1.000000 |
| wsrep_causal_reads | 0 |
| wsrep_cert_deps_distance | 1.000000 |
| wsrep_cert_index_size | 5 |
| wsrep_cert_interval | 0.000000 |
| wsrep_cluster_conf_id | 19 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | cf61cf68-aef7-11ea-88db-1bc466429584 |
| wsrep_cluster_status | Primary |
| wsrep_cluster_weight | 3 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.000000 |
| wsrep_connected | ON |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_gcomm_uuid | 66883d21-af01-11ea-a6eb-260a9c0d8490 |
| wsrep_incoming_addresses | AUTO,192.168.2.90:3306,192.168.2.92:3306 |
| wsrep_last_committed | 8 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_cached_downto | 6 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_commits | 1 |
| wsrep_local_index | 1 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_local_state_uuid | cf61cf68-aef7-11ea-88db-1bc466429584 |
| wsrep_open_connections | 0 |
| wsrep_open_transactions | 0 |
| wsrep_protocol_version | 9 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 25.3.28(r3875) |
| wsrep_ready | ON |
| wsrep_received | 4 |
| wsrep_received_bytes | 755 |
| wsrep_repl_data_bytes | 978 |
| wsrep_repl_keys | 9 |
| wsrep_repl_keys_bytes | 144 |
| wsrep_repl_other_bytes | 0 |
| wsrep_replicated | 3 |
| wsrep_replicated_bytes | 1328 |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 9 |
+-------------------------------+-------+
63 rows in set (0.001 sec)
3. On node 2, shut down and upgrade to the latest 10.4; I used 10.4.13, wsrep 26.4.4(r4599).
When you restart that node, you see weird values for wsrep_cluster_size and wsrep_local_index:
MariaDB [(none)]> show global status like 'wsrep%';
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| wsrep_local_state_uuid | cf61cf68-aef7-11ea-88db-1bc466429584 |
| wsrep_protocol_version | -1 |
| wsrep_last_committed | 8 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3 |
| wsrep_received_bytes | 288 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | -1 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 0 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | AUTO,192.168.2.90:3306,192.168.2.92:3306 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0.000567644/0.00112438/0.00173288/0.000348106/7 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 043aaa1a-af04-11ea-9292-9a42c9f9c38d |
| wsrep_applier_thread_count | 8 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 0 |
| wsrep_cluster_state_uuid | |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 18446744073709551615 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 9 |
+-------------------------------+-------+
65 rows in set (0.001 sec)
Recheck the content of table dataloss on the 3 nodes:
node1>select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  2 |     1 |
|  5 |     2 |
|  8 |     3 |
+----+-------+
3 rows in set (0.001 sec)

node2> select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  2 |     1 |
|  5 |     2 |
|  8 |     3 |
+----+-------+
3 rows in set (0.001 sec)

node3>select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  2 |     1 |
|  5 |     2 |
|  8 |     3 |
+----+-------+
3 rows in set (0.000 sec)
Now insert a row on node1, verify it has been added:
node1>insert into dataloss (value) values (4);
Query OK, 1 row affected (0.002 sec)

node1>select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  2 |     1 |
|  5 |     2 |
|  8 |     3 |
| 11 |     4 |
+----+-------+
4 rows in set (0.000 sec)
If you check on node2, that row is not there and it's lost:
node2> select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  2 |     1 |
|  5 |     2 |
|  8 |     3 |
+----+-------+
3 rows in set (0.000 sec)
On node 3, the row is there:
node3>select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  2 |     1 |
|  5 |     2 |
|  8 |     3 |
| 11 |     4 |
+----+-------+
4 rows in set (0.000 sec)
Any other row inserted in this situation never reaches node 2; that is data loss.
Then, if you reboot node2 once more, the wsrep status clears and looks good:
Redirecting to /bin/systemctl stop mariadb.service
[root@docker2 ~]# service mariadb start
Redirecting to /bin/systemctl start mariadb.service
[root@docker2 ~]# mysql -A
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 20
Server version: 10.4.13-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

node2> show global status like 'wsrep_local_index';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| wsrep_local_index | 2     |
+-------------------+-------+
1 row in set (0.001 sec)
Now, if I insert a new row on node1, it is correctly propagated to all nodes, but the row inserted previously is still missing on node2:
node1>insert into dataloss (value) values (5);
Query OK, 1 row affected (0.003 sec)

node1>select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  2 |     1 |
|  5 |     2 |
|  8 |     3 |
| 11 |     4 |
| 16 |     5 |
+----+-------+
5 rows in set (0.000 sec)

node2> select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  2 |     1 |
|  5 |     2 |
|  8 |     3 |
| 16 |     5 |
+----+-------+
4 rows in set (0.000 sec)

node3>select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  2 |     1 |
|  5 |     2 |
|  8 |     3 |
| 11 |     4 |
| 16 |     5 |
+----+-------+
5 rows in set (0.000 sec)
So, please re-test the above scenario to verify that there is actual data loss, and that it's not only a problem of bad variable display.
Thanks
Rick
Tested with the rolling-update method. Three-node cluster where nodes were 10.3.23 (on CentOS 7.6). Node2 upgraded.
On node1:
MariaDB [test]> create table dataloss (id int not null auto_increment primary key, value int);
MariaDB [test]> insert into dataloss (value) values (1), (2), (3);
Query OK, 3 rows affected (0.006 sec)
Records: 3  Duplicates: 0  Warnings: 0

MariaDB [test]> select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  3 |     1 |
|  6 |     2 |
|  9 |     3 |
+----+-------+
3 rows in set (0.001 sec)
Status on node1:
MariaDB [test]> show global status like 'wsrep%cluster_size%';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+
1 row in set (0.002 sec)

MariaDB [test]> show global status like 'wsrep%size%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| wsrep_cert_index_size | 3     |
| wsrep_cluster_size    | 3     |
+-----------------------+-------+
2 rows in set (0.002 sec)
Status on node2 before upgrade:
MariaDB [(none)]> select * from test.dataloss;
+----+-------+
| id | value |
+----+-------+
|  3 |     1 |
|  6 |     2 |
|  9 |     3 |
+----+-------+
3 rows in set (0.001 sec)
Perform node2 upgrade:
# Copy configs to safe place:
mkdir /root/configs/
/bin/cp -p /etc/my.cnf.d/*cnf /root/configs/.
# Stop and remove old rpm's:
systemctl stop mariadb && rpm -qai|grep -e Maria -e galera |grep Name | awk '{print "yum remove " $3 " -y"}'|bash
# Then install new rpm's and SELinux policy files:
yum localinstall rpmsfor10.4.13/*rpm -y && semodule -v -i selinux/*.pp
# Copy configs back:
/bin/cp -p /root/configs/*cnf /etc/my.cnf.d/.
# Add needed link, start MariaDB and run mysql_upgrade:
ln -s /usr/lib64/galera-4 /usr/lib64/galera && systemctl start mariadb && mysql_upgrade -uroot -p --skip-write-binlog
Status after node2 upgrade:
[root@galera2 ~]# mysql -uroot
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 852
Server version: 10.4.13-MariaDB-log MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show global status like 'wsrep%cluster_size%';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| wsrep_cluster_size | 3     |
+--------------------+-------+
1 row in set (0.002 sec)

MariaDB [(none)]> show global status like 'wsrep%size%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| wsrep_cert_index_size | 3     |
| wsrep_cluster_size    | 3     |
+-----------------------+-------+
2 rows in set (0.002 sec)

MariaDB [(none)]>
Inserting data on node1:
MariaDB [test]> insert into dataloss (value) values (4);
Query OK, 1 row affected (0.004 sec)

MariaDB [test]> select * from dataloss;
+----+-------+
| id | value |
+----+-------+
|  3 |     1 |
|  6 |     2 |
|  9 |     3 |
| 12 |     4 |
+----+-------+
4 rows in set (0.000 sec)

Status on node2 after data inserted on node1:
MariaDB [(none)]> select * from test.dataloss;
+----+-------+
| id | value |
+----+-------+
|  3 |     1 |
|  6 |     2 |
|  9 |     3 |
| 12 |     4 |
+----+-------+
4 rows in set (0.000 sec)

MariaDB [(none)]>
No data loss with this method.
If node2 came up with the correct cluster index, it could be that it performed an SST.
Please post logs...
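To support that IST-vs-SST question, the joiner's error log usually says which transfer ran. A rough check (the grep patterns approximate Galera's log wording and may differ between versions; the sample log lines in the usage below are fabricated for illustration, not taken from the attached logs):

```shell
#!/bin/bash
# Report the last state transfer type mentioned in a Galera error log.
# Patterns are approximations of Galera's log wording; adjust per version.
check_transfer_type() {
    grep -ioE 'IST received|SST received' "$1" | tail -n 1
}
# Illustrative usage against a fabricated two-line log:
printf 'WSREP: SST received\nWSREP: IST received: uuid:seqno\n' > /tmp/demo.err
check_transfer_type /tmp/demo.err
```

Running this against the rejoining node's error log would tell the two cases apart without reading the whole file.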
Your log is mangled. I suggest you follow exactly my steps; you should get the same results. We did this in multiple labs with the same result.
rpizzi Thank you for the detailed steps. I have retested with the wsrep version 25.3.28(r3875) you mentioned and these steps, but unfortunately I still have not got any data loss or a server crash.
rpizzi I have followed your steps with the standard installed packages on separate VMs, but still have not managed to reproduce it. I don't know what the key difference is. Can you please share exactly how you update the server, just in case?
The steps are outlined above https://jira.mariadb.org/browse/MDEV-22723?focusedCommentId=156703&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-156703 and are more than detailed.
Can you please post the output of your session when running the above commands here?
rpizzi Here is my session output. There are separate sessions for the MariaDB client and for the console itself.
stepan.patryshev from these files we can't infer whether the correct sequence of steps has been executed.
Can you provide evidence that, reproducing the steps I outlined above, you get different results?
As I already mentioned, two people on my team, on two separate environments, can reproduce it reliably, 100% of the time.
Please, try once again, and provide a single output with all the steps done in sequence, like I did above.
Thanks
Rick
Please do not use the test schema, and add the steps, configuration, and error log of all the nodes. From the log it isn't clear what you have done.
rpizzi I have gone through the steps again without any failures. PFA all logs and cnf files.
Steps:
1. Install 3 nodes with MariaDB 10.3.23 on CentOS Linux release 7.8.2003 (Core), wsrep version 25.3.29(r3902).
2. On Node1 create a table and insert data in it.
[root@patgal1 ~]# mysql -pr -e'CREATE DATABASE d; create table d.dataloss (id int not null auto_increment primary key, value int); insert into d.dataloss (value) values (1), (2), (3);'
[root@patgal1 ~]# mysql -pr -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 4 | 2 |
| 7 | 3 |
+----+-------+

[root@patgal2 ~]# mysql -pr -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 4 | 2 |
| 7 | 3 |
+----+-------+

[root@patgal3 ~]# mysql -pr -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 4 | 2 |
| 7 | 3 |
+----+-------+
Situation after above:
Node1:
[root@patgal1 ~]# mysql -pr -e'show global status like "wsrep%";'
+-------------------------------+-------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+-------------------------------------------------------+
| wsrep_applier_thread_count | 1 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 1.000000 |
| wsrep_causal_reads | 0 |
| wsrep_cert_deps_distance | 1.000000 |
| wsrep_cert_index_size | 5 |
| wsrep_cert_interval | 0.000000 |
| wsrep_cluster_conf_id | 3 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_cluster_status | Primary |
| wsrep_cluster_weight | 3 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.000000 |
| wsrep_connected | ON |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_gcomm_uuid | f1120258-c51e-11ea-8b48-cb8ed6394b53 |
| wsrep_incoming_addresses | 172.20.3.101:3306,172.20.3.102:3306,172.20.3.103:3306 |
| wsrep_last_committed | 24 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_cached_downto | 22 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_commits | 1 |
| wsrep_local_index | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_local_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_open_connections | 0 |
| wsrep_open_transactions | 0 |
| wsrep_protocol_version | 9 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 25.3.29(r3902) |
| wsrep_ready | ON |
| wsrep_received | 4 |
| wsrep_received_bytes | 626 |
| wsrep_repl_data_bytes | 969 |
| wsrep_repl_keys | 8 |
| wsrep_repl_keys_bytes | 136 |
| wsrep_repl_other_bytes | 0 |
| wsrep_replicated | 3 |
| wsrep_replicated_bytes | 1312 |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 2 |
+-------------------------------+-------------------------------------------------------+
Node2:
[root@patgal2 ~]# mysql -pr -e'show global status like "wsrep%";'
+-------------------------------+-------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+-------------------------------------------------------+
| wsrep_applier_thread_count | 1 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 1.000000 |
| wsrep_causal_reads | 0 |
| wsrep_cert_deps_distance | 1.000000 |
| wsrep_cert_index_size | 5 |
| wsrep_cert_interval | 0.000000 |
| wsrep_cluster_conf_id | 3 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_cluster_status | Primary |
| wsrep_cluster_weight | 3 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.000000 |
| wsrep_connected | ON |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_gcomm_uuid | f8c46db5-c51e-11ea-8095-6ffbd7cfa539 |
| wsrep_incoming_addresses | 172.20.3.101:3306,172.20.3.102:3306,172.20.3.103:3306 |
| wsrep_last_committed | 24 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_cached_downto | 22 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_commits | 0 |
| wsrep_local_index | 1 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_local_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_open_connections | 0 |
| wsrep_open_transactions | 0 |
| wsrep_protocol_version | 9 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 25.3.29(r3902) |
| wsrep_ready | ON |
| wsrep_received | 6 |
| wsrep_received_bytes | 1803 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 2 |
+-------------------------------+-------------------------------------------------------+
3. On Node2 set wsrep_on=OFF, shut down and upgrade to 10.4.13, wsrep 26.4.4(r4599).
4. Join upgraded Node2 to the cluster:
[root@patgal2 ~]# mysql -pr -e'show global status like "wsrep%";'
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_protocol_version | 9 |
| wsrep_last_committed | 24 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 280 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | -1 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 0 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | AUTO,172.20.3.101:3306,172.20.3.103:3306 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 332a2e12-c525-11ea-be26-4ed9b6694f67 |
| wsrep_applier_thread_count | 1 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 10 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 0 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 2 |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
wsrep_cluster_size and wsrep_local_index on Node2:
wsrep_cluster_size | 3 |
wsrep_local_index | 0 |
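The two values above are what a healthy rejoin looks like; the failing scenario in this ticket instead shows wsrep_cluster_size 0 and wsrep_protocol_version -1. A small sketch of an automated check; the status sample is inlined for illustration (real `mysql -Ne` output is tab-separated, here spaces are used for readability):

```shell
# Sanity-check a rejoined node from 'show global status like "wsrep%"' output.
status='wsrep_cluster_size 3
wsrep_local_index 0
wsrep_protocol_version 9
wsrep_local_state_comment Synced'
size=$(printf '%s\n' "$status" | awk '$1=="wsrep_cluster_size"{print $2}')
proto=$(printf '%s\n' "$status" | awk '$1=="wsrep_protocol_version"{print $2}')
# A node in the degenerate state from the original report fails both conditions.
if [ "$size" -ge 1 ] 2>/dev/null && [ "$proto" != "-1" ]; then
  health=ok
else
  health=degenerate
fi
echo "node health: $health"
```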
5. Recheck the content of table dataloss on 3 nodes:
[root@patgal1 ~]# mysql -pr -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 4 | 2 |
| 7 | 3 |
+----+-------+

[root@patgal2 ~]# mysql -pr -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 4 | 2 |
| 7 | 3 |
+----+-------+

[root@patgal3 ~]# mysql -pr -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 4 | 2 |
| 7 | 3 |
+----+-------+
6. Insert a row on Node1, verify it has been added and replicated to Node2 and Node3:
[root@patgal1 ~]# mysql -pr -e'insert into d.dataloss (value) values (4);'
[root@patgal1 ~]# mysql -pr -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 4 | 2 |
| 7 | 3 |
| 11 | 4 |
+----+-------+

[root@patgal2 ~]# mysql -pr -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 4 | 2 |
| 7 | 3 |
| 11 | 4 |
+----+-------+

[root@patgal3 ~]# mysql -pr -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 4 | 2 |
| 7 | 3 |
| 11 | 4 |
+----+-------+
As you can see, there are no related errors or data loss here.
You aren't reproducing the issue.
Can you please spell out step 3 in detail?
When you say:
3. On Node2 set wsrep_on=OFF, shut down and upgrade to 10.4.13, wsrep 26.4.4(r4599).
We would like to see the exact steps used to do this, as this is probably where you are doing things differently. Please paste the relevant part of your history file.
Thanks
Rick
OK, by looking at the output of the patgal2 session (both yesterday and the other day) we see this:
[root@patgal2 ~]# systemctl stop mariadb
[root@patgal2 ~]# systemctl start mariadb
[root@patgal2 ~]#
[root@patgal2 ~]# systemctl stop mariadb
Basically, after upgrading node 2 to 10.4 you start the server with wsrep ON, run mysql_upgrade, then shut down, set wsrep to OFF, and start again. This is not what we specified in the ticket.
Please repeat the EXACT steps we posted. In other words: after upgrading packages you need to start with WSREP OFF, not ON.
Thanks
Rick
Re-reading the entire ticket, I see there was some confusion about this WSREP_ON=OFF thing: Massimo (the original bug submitter) said to start with it off, run the upgrade, stop, and start with it on, while in my test I don't touch it at all.
The bottom line of all this is: the FIRST time you start MariaDB on node2 with WSREP enabled, you get that weird cluster index and cluster_size=0, and it is at that moment that any data inserted on the other nodes does not reach node2.
If you start node2 twice with WSREP enabled the problem does not appear, because the second restart (which you always seem to do, see above) "clears" the weird situation.
So, once again, to properly test this DO NOT touch the WSREP_ON variable; leave it on, but after upgrading packages start node2 only once, not twice. You will see the weird cluster index and size values, and in that situation any row inserted on the other nodes is lost (it does not reach node2).
@rpizzi You are wrong here. As you can see in "20200713_patgal2_output.log", on line 165 there is "wsrep_on=OFF" before running the upgraded server. The only difference is that I did it even before the upgrade.
And in "20200713_patgal2.err" the first run of 10.4.13 is on line 494: "2020-07-13 19:13:38 0 [Note] InnoDB: 10.4.13 started", and the first attempt to load the WSREP provider on 10.4.13 is logged later, on line 515: "2020-07-13 19:19:39 0 [Note] WSREP: Loading provider".
And here is the history fragment:
262 systemctl start mariadb
263 mysql -pr -e'select * from d.dataloss;'
264 mysql -pr -e'show global status like "wsrep%";'
265 systemctl stop mariadb
266 vi /etc/my.cnf.d/server2.cnf
267 cat /etc/yum.repos.d/mariadb.repo
268 curl -sS https://downloads.mariadb.com/MariaDB/mariadb_repo_setup | sudo bash -s -- --mariadb-server-version=mariadb-10.4
269 cat /etc/yum.repos.d/mariadb.repo
270 yum list installed | grep galera
271 yum list installed | grep MariaDB
272 sudo yum remove MariaDB-server galera MariaDB-backup MariaDB-client MariaDB-common
273 yum list installed | grep galera
274 yum list installed | grep MariaDB
275 yum install MariaDB-server galera MariaDB-backup MariaDB-client MariaDB-common
276 yum list installed | grep MariaDB
277 yum list installed | grep galera
278 systemctl start mariadb
279 mysql_upgrade -s
280 mysql_upgrade -s -pr
281 systemctl stop mariadb
282 vi /etc/my.cnf.d/server.cnf
283 vi /etc/my.cnf.d/server2.cnf
284 systemctl start mariadb
285 vi /etc/my.cnf.d/server2.cnf
286 systemctl start mariadb
287 mysql -pr -e'show global status like "wsrep%";'
288 mysql -pr -e'select * from d.dataloss;'
Anyway, I will try to follow your steps more closely.
To verify the bug, DO NOT start node2 more than once after upgrading. That's it.
@rpizzi It has not helped. I did not change WSREP_ON at all and started the upgraded server only once, and it passed again without any failures or data loss. Please share the exact steps for how you install and update the packages. PFA all logs and cnf files.
Steps:
1. Install 3 nodes with MariaDB 10.3.23 on CentOS Linux release 7.8.2003 (Core), wsrep version 25.3.29(r3902).
2. On Node1 create a table and insert data in it.
[root@patgal1 ~]# mysql -e'create database d;'
[root@patgal1 ~]# mysql -e'create table d.dataloss (id int not null auto_increment primary key, value int);'
[root@patgal1 ~]# mysql -e'insert into d.dataloss (value) values (1), (2), (3);'

[root@patgal1 ~]# mysql -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 3 | 1 |
| 6 | 2 |
| 9 | 3 |
+----+-------+

2.1. Check that data are propagated successfully to other nodes:
[root@patgal2 ~]# mysql -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 3 | 1 |
| 6 | 2 |
| 9 | 3 |
+----+-------+

[root@patgal3 ~]# mysql -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 3 | 1 |
| 6 | 2 |
| 9 | 3 |
+----+-------+
2.2. Situation after above:
Node1:
[root@patgal1 ~]# mysql -e'show global status like "wsrep%";'
+-------------------------------+-------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+-------------------------------------------------------+
| wsrep_applier_thread_count | 1 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 1.000000 |
| wsrep_causal_reads | 0 |
| wsrep_cert_deps_distance | 1.000000 |
| wsrep_cert_index_size | 5 |
| wsrep_cert_interval | 0.000000 |
| wsrep_cluster_conf_id | 3 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_cluster_status | Primary |
| wsrep_cluster_weight | 3 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.000000 |
| wsrep_connected | ON |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_gcomm_uuid | fed13746-c5b4-11ea-a5fe-a6a8e8ca175a |
| wsrep_incoming_addresses | 172.20.3.102:3306,172.20.3.103:3306,172.20.3.101:3306 |
| wsrep_last_committed | 6 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_cached_downto | 4 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_commits | 1 |
| wsrep_local_index | 2 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_local_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_open_connections | 0 |
| wsrep_open_transactions | 0 |
| wsrep_protocol_version | 9 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 25.3.29(r3902) |
| wsrep_ready | ON |
| wsrep_received | 10 |
| wsrep_received_bytes | 782 |
| wsrep_repl_data_bytes | 969 |
| wsrep_repl_keys | 8 |
| wsrep_repl_keys_bytes | 136 |
| wsrep_repl_other_bytes | 0 |
| wsrep_replicated | 3 |
| wsrep_replicated_bytes | 1312 |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 2 |
+-------------------------------+-------------------------------------------------------+
Node2:
[root@patgal2 ~]# mysql -e'show global status like "wsrep%";'
+-------------------------------+-------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+-------------------------------------------------------+
| wsrep_applier_thread_count | 1 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 1.000000 |
| wsrep_causal_reads | 0 |
| wsrep_cert_deps_distance | 1.000000 |
| wsrep_cert_index_size | 5 |
| wsrep_cert_interval | 0.000000 |
| wsrep_cluster_conf_id | 3 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_cluster_status | Primary |
| wsrep_cluster_weight | 3 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.000000 |
| wsrep_connected | ON |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_gcomm_uuid | 11a7b1fd-c5b5-11ea-9a59-5e4e35dabad1 |
| wsrep_incoming_addresses | 172.20.3.102:3306,172.20.3.103:3306,172.20.3.101:3306 |
| wsrep_last_committed | 6 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_cached_downto | 4 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_commits | 0 |
| wsrep_local_index | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.142857 |
| wsrep_local_recv_queue_max | 2 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_local_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_open_connections | 0 |
| wsrep_open_transactions | 0 |
| wsrep_protocol_version | 9 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 25.3.29(r3902) |
| wsrep_ready | ON |
| wsrep_received | 7 |
| wsrep_received_bytes | 1811 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 2 |
+-------------------------------+-------------------------------------------------------+
3. Shut down Node2 and upgrade it to 10.4.13, wsrep 26.4.4(r4599).
3.1. systemctl stop mariadb
3.2. curl -sS https://downloads.mariadb.com/MariaDB/mariadb_repo_setup | sudo bash -s -- --mariadb-server-version=mariadb-10.4
3.3. yum remove MariaDB galera
3.4. yum install MariaDB galera
3.5. rm /etc/my.cnf.d/server.cnf
3.6. Update "wsrep_provider" value to "/usr/lib64/galera-4/libgalera_smm.so" in "/etc/my.cnf.d/server2.cnf".
3.7. systemctl start mariadb
3.8. mysql_upgrade -s
The --upgrade-system-tables option was used, user tables won't be touched.
Phase 1/7: Checking and upgrading mysql database
Processing databases
mysql
mysql.column_stats OK
mysql.columns_priv OK
mysql.db OK
mysql.event OK
mysql.func OK
mysql.gtid_slave_pos OK
mysql.help_category OK
mysql.help_keyword OK
mysql.help_relation OK
mysql.help_topic OK
mysql.host OK
mysql.index_stats OK
mysql.innodb_index_stats OK
mysql.innodb_table_stats OK
mysql.plugin OK
mysql.proc OK
mysql.procs_priv OK
mysql.proxies_priv OK
mysql.roles_mapping OK
mysql.servers OK
mysql.table_stats OK
mysql.tables_priv OK
mysql.time_zone OK
mysql.time_zone_leap_second OK
mysql.time_zone_name OK
mysql.time_zone_transition OK
mysql.time_zone_transition_type OK
mysql.transaction_registry OK
mysql.user OK
mysql.wsrep_cluster OK
mysql.wsrep_cluster_members OK
mysql.wsrep_streaming_log OK
Phase 2/7: Installing used storage engines... Skipped
Phase 3/7: Fixing views... Skipped
Phase 4/7: Running 'mysql_fix_privilege_tables'
Phase 5/7: Fixing table and database names ... Skipped
Phase 6/7: Checking and upgrading tables... Skipped
Phase 7/7: Running 'FLUSH PRIVILEGES'
OK
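Step 3.6 above edits the cnf by hand; it is easy to sanity-check the provider path before restarting the node. A minimal sketch; the cnf fragment in the here-doc is illustrative, not the attached server2.cnf:

```shell
# Before starting the upgraded node (step 3.7), confirm the cnf now points at
# the galera-4 provider library rather than the old galera-3 one.
cnf=$(mktemp)
cat > "$cnf" <<'EOF'
[galera]
wsrep_on=ON
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_address=gcomm://172.20.3.101,172.20.3.102,172.20.3.103
EOF
provider=$(awk -F= '$1=="wsrep_provider"{print $2}' "$cnf")
case "$provider" in
  */galera-4/*) check=ok ;;
  *)            check='still pointing at galera-3' ;;
esac
echo "provider check: $check"
rm -f "$cnf"
```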
4. Start upgraded Node2 and check its wsrep status:
[root@patgal2 ~]# mysql -e'show global status like "wsrep%";'
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_protocol_version | 9 |
| wsrep_last_committed | 6 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 280 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | -1 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 0 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 172.20.3.103:3306,AUTO,172.20.3.101:3306 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 4a75dc41-c5ba-11ea-a6f4-4b9ef7fb8a13 |
| wsrep_applier_thread_count | 1 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 6 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 499f4d1e-b249-11ea-abeb-764a6a38b248 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 2 |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
wsrep_cluster_size and wsrep_local_index on Node2:
wsrep_cluster_size | 3 |
wsrep_local_index | 1 |
5. Recheck the content of table dataloss on 3 nodes:
[root@patgal1 ~]# mysql -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 3 | 1 |
| 6 | 2 |
| 9 | 3 |
+----+-------+

[root@patgal2 ~]# mysql -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 3 | 1 |
| 6 | 2 |
| 9 | 3 |
+----+-------+

[root@patgal3 ~]# mysql -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 3 | 1 |
| 6 | 2 |
| 9 | 3 |
+----+-------+
6. Insert a row on Node1, verify it has been added and replicated to Node2 and Node3:
[root@patgal1 ~]# mysql -e'insert into d.dataloss (value) values (4);'

[root@patgal1 ~]# mysql -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 3 | 1 |
| 6 | 2 |
| 9 | 3 |
| 12 | 4 |
+----+-------+

[root@patgal2 ~]# mysql -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 3 | 1 |
| 6 | 2 |
| 9 | 3 |
| 12 | 4 |
+----+-------+

[root@patgal3 ~]# mysql -e'select * from d.dataloss;'
+----+-------+
| id | value |
+----+-------+
| 3 | 1 |
| 6 | 2 |
| 9 | 3 |
| 12 | 4 |
+----+-------+
And here is the history fragment for Node2:
211 date
212 ps -ef | grep mysqld
213 systemctl start mariadb
214 mysql -e'select * from d.dataloss;'
215 mysql -e'show global status like "wsrep%";'
216 systemctl stop mariadb
217 cat /etc/yum.repos.d/mariadb.repo
218 curl -sS https://downloads.mariadb.com/MariaDB/mariadb_repo_setup | sudo bash -s -- --mariadb-server-version=mariadb-10.4
219 cat /etc/yum.repos.d/mariadb.repo
220 yum list installed | grep galera
221 yum list installed | grep MariaDB
222 yum remove MariaDB galera
223 yum list installed | grep galera
224 yum list installed | grep MariaDB
225 yum install MariaDB galera
226 yum list installed | grep MariaDB
227 yum list installed | grep galera
228 rm /etc/my.cnf.d/server.cnf
229 vi /etc/my.cnf.d/server2.cnf
230 cat /etc/my.cnf.d/server2.cnf
231 ls -al /usr/lib64/galera-4/libgalera_smm.so
232 systemctl start mariadb
233 mysql_upgrade -s
234 mysql -e'show global status like "wsrep%";'
235 mysql -e'select * from d.dataloss;'
From what I could understand of your steps, you are performing the INSERT while all the nodes are up, no matter which version. There is no IST performed by the node that you upgraded, because you are not writing while node2 is down. You need to see node2 request and perform an IST, because it does not yet have all the data.
It doesn't happen because in this test you do not get the node with cluster_size=0 and the weird index id.
But you originally got that: https://jira.mariadb.org/browse/MDEV-22723?focusedCommentId=156489&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-156489
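The point above can be made concrete: whether a rejoining node requests an IST at all depends on whether its saved seqno lags the group. A sketch with inlined sample values; a real check would read /var/lib/mysql/grastate.dat on the joiner and wsrep_last_committed on the donor (and note a crashed node records seqno -1 and would need an SST):

```shell
# Decide what state transfer a rejoining node should request, by comparing
# its grastate.dat seqno with the donor's last committed seqno.
grastate='# GALERA saved state
version: 2.1
uuid:    499f4d1e-b249-11ea-abeb-764a6a38b248
seqno:   24'
donor_last_committed=24
joiner_seqno=$(printf '%s\n' "$grastate" | awk '$1=="seqno:"{print $2}')
if [ "$joiner_seqno" -lt "$donor_last_committed" ]; then
  expect=IST
else
  expect="no state transfer needed"
fi
# With no writes while node2 was down (as in the test above), the seqnos
# match and no IST is requested, which is exactly massimo.disaro's point.
echo "expected on rejoin: $expect"
```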
stepan.patryshev is there a reason why you don't use the conf file we supplied when trying this test, and instead use a different one that you built yourself? This is not a good way of testing bugs, if you ask me. Please try with the files we have supplied.
Thank you!
massimo.disaro Why should an IST take place if, according to the steps from the description, and especially the more detailed ones by rpizzi (see https://jira.mariadb.org/browse/MDEV-22723?focusedCommentId=156703&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-156703 ), the INSERT is performed when upgraded node 2 is running with WSREP_ON=ON?
Please point me to what exactly I should try differently, if you have any concrete idea.
rpizzi Ok, that is what I was going to try next: using your config files. When I used the "./mtr --suite=galera_3nodes --start-and-exit" simulation during my first tests, I tried to carry over as much related material as possible from the attached configs. But when I moved to the cluster with three VMs and installed packages, I decided to first try only the configs which I had managed to adjust to get the cluster running.
@rpizzi I have gone through the steps again, without any data loss or failures, with the original configs for Node1 and Node2; I changed only the IP addresses. But I see there are some newer config files attached here.
Steps were exactly the same as described in my previous test.
PFA all logs and cnf files.
I'm stumped, especially because you were able to get the cluster size 0 in your first attempt, and now you don't get that anymore.
How is that possible is beyond me.
Maybe that's the difference. Both customer and my lab is on CentOS Linux release 7.5.1804 (Core) .
Can you please retry on that OS version?
Thanks
Rick
I think Massimo used 7.6 but customer has 7.5 so please test on that. Thanks
@rpizzi I have run the steps again without any data loss or failures on CentOS 7.5.1804.
The steps were exactly the same as described here, with only small modifications:
3.3. yum remove MariaDB-server MariaDB-client MariaDB-backup galera
3.4. yum install MariaDB-common MariaDB-compat MariaDB-server MariaDB-backup MariaDB-client galera
PFA all logs and cnf files.
This is really odd.
Do you think you can retry with mtr?
And see if you still got the cluster_size=0 you got at the beginning?
Because that's the situation where data loss happens.
See your comment below:
@rpizzi It's really strange, but I have managed to reproduce the data loss (though not the crash) with my MTR scenario described here. I used Galera 25.3.28(r3875).
PFA all logs and cnf files. Please, ignore errors in mysqld.2.err around 22:17, I just forgot to shutdown a node and tried to run it again.
Here are the detailed steps showing how I reproduced the data loss.
Release builds 10.3.23 + Galera 25.3.28(r3875) and 10.4.13 + Galera 26.4.4(r4599). PFA all logs and cnf files.
Steps:
1. ./mtr --suite=galera_3nodes --start-and-exit
2. Restart all nodes one by one with separate config files from here.
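Before touching the data it helps to wait until each restarted node actually reports itself as Synced with the expected cluster size. A minimal sketch of such a check, assuming the status is fed in as "name value" pairs (e.g. from `mysql -Nse "show global status like 'wsrep%'"`); the helper itself only parses stdin, so it needs no running server:

```shell
#!/bin/sh
# Succeeds (exit 0) only when the status lines on stdin report
# wsrep_local_state_comment = Synced and wsrep_cluster_size equal to the
# expected member count passed as $1.
wsrep_is_synced() {
    awk -v want="$1" '
        $1 == "wsrep_local_state_comment" { state = $2 }
        $1 == "wsrep_cluster_size"        { size  = $2 }
        END { exit !(state == "Synced" && size == want) }
    '
}

# Example with canned input; on a live node you would instead pipe in:
#   mysql -u root -S "$SOCK" -Nse "show global status like 'wsrep%';"
printf 'wsrep_local_state_comment Synced\nwsrep_cluster_size 3\n' \
    | wsrep_is_synced 3 && echo "node is synced"
```

In a loop over the three sockets this gives a simple "wait until all nodes are Synced" gate before step 3.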
The cluster status on Node1 is:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"show global status like 'wsrep%';"
+-------------------------------+-------------------------------------------------+
| Variable_name | Value |
+-------------------------------+-------------------------------------------------+
| wsrep_applier_thread_count | 32 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_causal_reads | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_cert_index_size | 0 |
| wsrep_cert_interval | 0.000000 |
| wsrep_cluster_conf_id | 8 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 335ea557-cd0b-11ea-bce5-1b40dbec53a7 |
| wsrep_cluster_status | Primary |
| wsrep_cluster_weight | 3 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_connected | ON |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_gcomm_uuid | 0f038d23-cd0d-11ea-acd2-b7ff4121c102 |
| wsrep_incoming_addresses | 127.0.0.1:16000,127.0.0.1:16001,127.0.0.1:16002 |
| wsrep_last_committed | 0 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_cached_downto | 18446744073709551615 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_commits | 0 |
| wsrep_local_index | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_local_state_uuid | 335ea557-cd0b-11ea-bce5-1b40dbec53a7 |
| wsrep_open_connections | 0 |
| wsrep_open_transactions | 0 |
| wsrep_protocol_version | 9 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 25.3.28(r3875) |
| wsrep_ready | ON |
| wsrep_received | 2 |
| wsrep_received_bytes | 270 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+-------------------------------------------------+
3. On the Node1 create a database and a table:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"create database d; create table d.evento4 (Id int primary key auto_increment, IdDispositivo int, kkkk varchar(255));"
4. On the Node1 insert 3 rows:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"insert into d.evento4(IdDispositivo, kkkk) values(123, 'aaaa'); insert into d.evento4(IdDispositivo, kkkk) values(222, 'eeeeaa'); insert into d.evento4(IdDispositivo, kkkk) values(34523452, 'e4r4r4 ');"
The data has been propagated to the whole cluster:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"select * from d.evento4;"
+----+---------------+---------+
| Id | IdDispositivo | kkkk |
+----+---------------+---------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
+----+---------------+---------+

/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.2.sock -e"select * from d.evento4;"
+----+---------------+---------+
| Id | IdDispositivo | kkkk |
+----+---------------+---------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
+----+---------------+---------+

/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.3.sock -e"select * from d.evento4;"
+----+---------------+---------+
| Id | IdDispositivo | kkkk |
+----+---------------+---------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
+----+---------------+---------+
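Running the same SELECT against every node and eyeballing the output can be automated: dump each node's result set to a file and compare. This is only a sketch; the socket paths and `node$i.out` file names are illustrative. The comparison helper works on plain files, so it can be exercised without a cluster:

```shell
#!/bin/sh
# Compare the first file against all the others; exit 0 only if every
# node produced a byte-identical result set.
all_identical() {
    ref="$1"; shift
    for f in "$@"; do
        cmp -s "$ref" "$f" || return 1
    done
}

# On a live cluster (paths are illustrative):
#   for i in 1 2 3; do
#       mysql -u root -S "var/tmp/mysqld.$i.sock" \
#           -Nse "select * from d.evento4 order by Id" > "node$i.out"
#   done
#   all_identical node1.out node2.out node3.out && echo "all nodes agree"
```

The `order by Id` matters: without it two nodes could return the same rows in a different order and a byte-wise comparison would report a spurious mismatch.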
5. Stop Node 2.
6. To check that IST works while Node2 is off insert 1 row on the Node1:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"insert into d.evento4(IdDispositivo, kkkk) values(888, 'While Node 2 is OFF');"
The new row is added on the Node1 and Node3:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"select * from d.evento4;"
+----+---------------+---------------------+
| Id | IdDispositivo | kkkk |
+----+---------------+---------------------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
| 11 | 888 | While Node 2 is OFF |
+----+---------------+---------------------+

[stepan@cnt7glr11 mysql-test]$ /home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.3.sock -e"select * from d.evento4;"
+----+---------------+---------------------+
| Id | IdDispositivo | kkkk |
+----+---------------+---------------------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
| 11 | 888 | While Node 2 is OFF |
+----+---------------+---------------------+
7. Start the Node2.
The new row is added on the Node2 successfully:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.2.sock -e"select * from d.evento4;"
+----+---------------+---------------------+
| Id | IdDispositivo | kkkk |
+----+---------------+---------------------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
| 11 | 888 | While Node 2 is OFF |
+----+---------------+---------------------+
8. Check the cluster status on the Node2:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.2.sock -e"show global status like 'wsrep%'"
+-------------------------------+-------------------------------------------------+
| Variable_name | Value |
+-------------------------------+-------------------------------------------------+
| wsrep_applier_thread_count | 32 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 1.000000 |
| wsrep_causal_reads | 0 |
| wsrep_cert_deps_distance | 0.000000 |
| wsrep_cert_index_size | 0 |
| wsrep_cert_interval | 0.000000 |
| wsrep_cluster_conf_id | 10 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 335ea557-cd0b-11ea-bce5-1b40dbec53a7 |
| wsrep_cluster_status | Primary |
| wsrep_cluster_weight | 3 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.000000 |
| wsrep_connected | ON |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_gcomm_uuid | 96685da8-cd17-11ea-be6f-4399d680ab4c |
| wsrep_incoming_addresses | 127.0.0.1:16000,127.0.0.1:16001,127.0.0.1:16002 |
| wsrep_last_committed | 6 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_cached_downto | 18446744073709551615 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_commits | 0 |
| wsrep_local_index | 1 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_local_state_uuid | 335ea557-cd0b-11ea-bce5-1b40dbec53a7 |
| wsrep_open_connections | 0 |
| wsrep_open_transactions | 0 |
| wsrep_protocol_version | 9 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 25.3.28(r3875) |
| wsrep_ready | ON |
| wsrep_received | 3 |
| wsrep_received_bytes | 278 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+-------------------------------------------------+
Note that wsrep_local_index = 1.
9. Stop Node 2.
10. Set wsrep-on=OFF and run Node2 on 10.4.13 binaries with new config containing paths to 10.4.13 resources (cnf files here).
/home/stepan/mariadb/10.4.13/sql/mysqld --defaults-file=/home/stepan/mariadb/10.3.23/mysql-test/var/mysqld_new.2.cnf &
11. Perform mysql_upgrade -s.
12. Stop Node 2.
13. Insert 1 new row on the Node1:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"insert into d.evento4(IdDispositivo, kkkk) values(777777, 'While Node 2 was upgrading');"
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"select * from d.evento4;"
+----+---------------+----------------------------+
| Id | IdDispositivo | kkkk |
+----+---------------+----------------------------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
| 11 | 888 | While Node 2 is OFF |
| 13 | 777777 | While Node 2 was upgrading |
+----+---------------+----------------------------+
14. Set wsrep-on=ON and run Node2.
15. Check that the new row is added to the Node2 also:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.2.sock -e"select * from d.evento4;"
+----+---------------+----------------------------+
| Id | IdDispositivo | kkkk |
+----+---------------+----------------------------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
| 11 | 888 | While Node 2 is OFF |
| 13 | 777777 | While Node 2 was upgrading |
+----+---------------+----------------------------+
16. Check the wsrep variables on the Node2:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.2.sock -e"show global status like 'wsrep%'"
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 335ea557-cd0b-11ea-bce5-1b40dbec53a7 |
| wsrep_protocol_version | -1 |
| wsrep_last_committed | 7 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3 |
| wsrep_received_bytes | 288 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 2 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0.333333 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | 7 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 1 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 1 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 127.0.0.1:16000,127.0.0.1:16001,127.0.0.1:16002 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 11fd46cc-cd1b-11ea-8f5d-7efdb4c94287 |
| wsrep_applier_thread_count | 32 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 0 |
| wsrep_cluster_state_uuid | |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 18446744073709551615 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
Note the anomalous values:
wsrep_cluster_status | Primary |
wsrep_local_state_comment | Synced |
wsrep_local_index | 18446744073709551615 |
wsrep_cluster_size | 0 |
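This combination (Synced and Primary while wsrep_cluster_size is 0 and wsrep_local_index is 18446744073709551615, i.e. -1 as an unsigned 64-bit value) is exactly the broken state to watch for. A minimal sketch of a check that flags it; it parses "name value" pairs from stdin (as produced by `mysql -Nse`), so it can be tested without a server:

```shell
#!/bin/sh
# Exit 0 (i.e. "broken") when the node claims to be Synced yet reports an
# empty cluster or the -1 sentinel (18446744073709551615) as its index.
wsrep_membership_broken() {
    awk '
        $1 == "wsrep_local_state_comment" { state = $2 }
        $1 == "wsrep_cluster_size"        { size  = $2 }
        $1 == "wsrep_local_index"         { idx   = $2 }
        END {
            broken = (state == "Synced" && (size == 0 || idx == "18446744073709551615"))
            exit !broken
        }
    '
}

# The values observed on the upgraded node 2:
printf '%s\n' \
    'wsrep_local_state_comment Synced' \
    'wsrep_cluster_size 0' \
    'wsrep_local_index 18446744073709551615' \
    | wsrep_membership_broken && echo "node 2 membership is broken"
```

Run against each node after the upgrade, this would have caught the silent divergence before any writes were lost.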
17. Insert 1 row on the Node1 again:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"insert into d.evento4 (IdDispositivo,kkkk) values (3,'non tireplic');"
The new row has been replicated to the Node3:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.3.sock -e"select * from d.evento4;"
+----+---------------+----------------------------+
| Id | IdDispositivo | kkkk |
+----+---------------+----------------------------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
| 11 | 888 | While Node 2 is OFF |
| 13 | 777777 | While Node 2 was upgrading |
| 16 | 3 | non tireplic |
+----+---------------+----------------------------+
But it has NOT been replicated to the Node2:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.2.sock -e"select * from d.evento4;"
+----+---------------+----------------------------+
|
| Id | IdDispositivo | kkkk |
|
+----+---------------+----------------------------+
|
| 1 | 123 | aaaa |
|
| 4 | 222 | eeeeaa |
|
| 7 | 34523452 | e4r4r4 |
|
| 11 | 888 | While Node 2 is OFF |
|
| 13 | 777777 | While Node 2 was upgrading |
|
+----+---------------+----------------------------+
|
18. Just one more insert on the Node1 to repeat:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"insert into d.evento4 (IdDispositivo,kkkk) values (666,'Lost data');"
And again the new row has been replicated to the Node3:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.3.sock -e"select * from d.evento4;"
+----+---------------+----------------------------+
| Id | IdDispositivo | kkkk |
+----+---------------+----------------------------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
| 11 | 888 | While Node 2 is OFF |
| 13 | 777777 | While Node 2 was upgrading |
| 16 | 3 | non tireplic |
| 19 | 666 | Lost data |
+----+---------------+----------------------------+
But it has NOT been replicated to the Node2:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.2.sock -e"select * from d.evento4;"
+----+---------------+----------------------------+
| Id | IdDispositivo | kkkk |
+----+---------------+----------------------------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
| 11 | 888 | While Node 2 is OFF |
| 13 | 777777 | While Node 2 was upgrading |
+----+---------------+----------------------------+
19. Restart the Node2.
Check the wsrep variables on the Node2:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.2.sock -e"show global status like 'wsrep%';"
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 335ea557-cd0b-11ea-bce5-1b40dbec53a7 |
| wsrep_protocol_version | 9 |
| wsrep_last_committed | 9 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 280 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | 7 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 0 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 127.0.0.1:16000,127.0.0.1:16001,127.0.0.1:16002 |
| wsrep_cluster_weight | 3 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0/0/0/0/0 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 39969b6c-cd1f-11ea-abde-7b7ed790f75c |
| wsrep_applier_thread_count | 32 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 14 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 335ea557-cd0b-11ea-bce5-1b40dbec53a7 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
All seems ok:
wsrep_cluster_status | Primary |
wsrep_local_state_comment | Synced |
wsrep_local_index | 1 |
wsrep_cluster_size | 3 |
20. Insert the new row on the Node1:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.1.sock -e"insert into d.evento4 (IdDispositivo,kkkk) values (555,'After Node restart');"
And the new row has been successfully replicated to the Node3:
/home/stepan/mariadb/10.3.23/client/mysql -u root -S/home/stepan/mariadb/10.3.23/mysql-test/var/tmp/mysqld.2.sock -e"select * from d.evento4;"
+----+---------------+----------------------------+
| Id | IdDispositivo | kkkk |
+----+---------------+----------------------------+
| 1 | 123 | aaaa |
| 4 | 222 | eeeeaa |
| 7 | 34523452 | e4r4r4 |
| 11 | 888 | While Node 2 is OFF |
| 13 | 777777 | While Node 2 was upgrading |
| 22 | 555 | After Node restart |
+----+---------------+----------------------------+
OK, I think I know what the problem is, or at least where it was solved.
Massimo's node 2 log has the following
wsrep loader: [INFO] wsrep_load(): Galera 26.4.4(r4599) by Codership Oy <info@codership.com> loaded successfully.
...
2020-05-25 22:25:17 19 [Warning] WSREP: trx protocol version: 4 does not match certification protocol version: -1
As you may guess, the last line spells bad news: the node cannot apply writesets. It is caused by a bug that was fixed in commit 02ad0e11 on April 1, well after the 4.4 release was tagged; the fix was merged into the MariaDB Galera fork in commit ae24803 on April 9.
Stepan's log has
wsrep loader: [INFO] wsrep_load(): Galera 26.4.4(rae24803) by Codership Oy <info@codership.com> loaded successfully.
That's why Stepan can't reproduce the bug: he's using a different Galera binary.
In any case, this bug (and many others) is fixed in the 4.5 release tag. All MariaDB 10.4 users should switch to it; it will solve a lot of trouble.
Yurchenko I hope you are right, but I used Galera 26.4.4(r4599) on 20.07.2020 and there was no data loss.
julien.fritsch
Yes, it is fixed in later Galera releases.
stepan.patryshev
On 20.07.2020 there was a mistake in the case reproduction: in Massimo's case node 2 was missing 2 events and had to perform a state transfer. In your case it seems there were no updates to the cluster during the node 2 upgrade: it was shut down at seqno 7 and brought back while the cluster was still at seqno 7. So there was no state transfer, and that is a different code path.
And yes, I found out why in Massimo's case some transactions were lost:
[Warning] WSREP: trx protocol version: 4 does not match certification protocol version: -1
is a warning because, during the upgrade of the last node and the resulting protocol bump, we can expect to receive a writeset with an old protocol version, and in that case it is simply supposed to fail certification - on all nodes. The problem (fixed in the commit I mentioned above) was that the protocol version was not updated in total order (in fact it was not updated at all). As a result, all the transactions that failed certification on node 2 (and thus were skipped) passed certification perfectly well on node 1 and were committed there. In the end both nodes believed that they had successfully processed all events and were on the same page regarding the last seqno. That's why the missing events went unnoticed.
However when node 2 was restarted, it rejoined the cluster without state transfer, the bug was not triggered, and it could continue to apply transactions.
Yurchenko Thank you for the clarifications. But I want to note that rpizzi reproduced it without updating data during the Node2 upgrade: the steps are here.
I have verified that with Galera 26.4.5(rb3764ab) and 25.3.30(r827e681) there was no data loss or crash. The steps were the same ones that reproduced the bug on 23.07.2020 with 25.3.28(r3875) and 26.4.4(r4599).
But the strange wsrep values were still present right after the upgraded node joined the cluster for the first time:
wsrep_local_index | 18446744073709551615 |
wsrep_cluster_size | 0 |
Looks related to https://jira.mariadb.org/browse/MDEV-19983