Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.3.23, 10.4.13
Fix Version/s: None
Environment: OS: CentOS Linux release 7.6.1810 (Core)
Description
We created a full Galera cluster of three nodes (mdb1, mdb2, mdb3), all running version 10.3.23.
We gracefully shut down mdb3 to check the interaction between writes on 10.3.23 and their effect on 10.4.13, and to force an IST. We also re-tested with all three servers up, with the same result.
We created a schema and a table on mdb1; both propagated to all nodes. Then:
- stop mdb2, and `yum remove` the MariaDB and Galera RPMs
- install from the new MariaDB 10.4 repository and update my.cnf with the correct wsrep_provider path
- set wsrep_on=OFF in my.cnf
- start mdb2
- perform mysql_upgrade -s
- stop mdb2
- set wsrep_on=ON in my.cnf
- start mdb2
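For reference, the my.cnf changes in the steps above amount to a fragment like the one below. The provider path, cluster address, and node address here are illustrative assumptions for this sketch, not the exact values from our setup:

```ini
# Illustrative [galera] fragment for the upgraded 10.4 node (mdb2).
# Paths and addresses are assumptions, not the reporter's actual values.
[galera]
wsrep_on = ON                   # set to OFF for the mysql_upgrade step, then back to ON
wsrep_provider = /usr/lib64/galera-4/libgalera_smm.so   # galera-4 library from the 10.4 repo
wsrep_cluster_address = gcomm://mdb1,mdb2,mdb3
wsrep_node_address = 10.0.1.13
```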
At this point the Galera status variables on mdb2 were:
MariaDB mdb2 [pippo]> show global status like 'wsrep%';
+-------------------------------+------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------+
| wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_protocol_version | -1 |
| wsrep_last_committed | 65 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3 |
| wsrep_received_bytes | 208 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | 64 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0.5 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 1.5 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 1 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | AUTO,10.0.1.13:3306 |
| wsrep_cluster_weight | 2 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0.000325151/0.00176008/0.00607075/0.00193032/7 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 7ff14eaf-9ed6-11ea-b98f-8fc2b85537f4 |
| wsrep_applier_thread_count | 32 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 0 |
| wsrep_cluster_state_uuid | |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 18446744073709551615 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+------------------------------------------+
65 rows in set (0.001 sec)
|
NOTE THAT:
wsrep_cluster_status | Primary
wsrep_local_state_comment | Synced
wsrep_local_index | 18446744073709551615
wsrep_cluster_size | 0
Looking at the error log, the server is ready for connections after an IST.
At this point, a write performed on the 'master' mdb1 does not get replicated. Before the write, both nodes show the same data:
MariaDB mdb2 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk |
+----+---------------+--------+
| 1 | 123 | aaaa |
| 3 | 222 | eeeeaa |
| 4 | 34523452 | e4r4r4 |
+----+---------------+--------+
WHILE ON THE MASTER:
MariaDB mdb1 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk |
+----+---------------+--------+
| 1 | 123 | aaaa |
| 3 | 222 | eeeeaa |
| 4 | 34523452 | e4r4r4 |
+----+---------------+--------+
3 rows in set (0.001 sec)

MariaDB mdb1 [pippo]> insert into evento4 (IdDispositivo,kkkk) values (3,'non tireplic');
Query OK, 1 row affected (0.015 sec)

MariaDB mdb1 [pippo]> select * from evento4;
+----+---------------+--------------+
| Id | IdDispositivo | kkkk |
+----+---------------+--------------+
| 1 | 123 | aaaa |
| 3 | 222 | eeeeaa |
| 4 | 34523452 | e4r4r4 |
| 6 | 3 | non tireplic |
+----+---------------+--------------+
4 rows in set (0.001 sec)
|
The fact that the INSERT does not get replicated is obviously consistent with wsrep_cluster_size = 0 and wsrep_local_index = 18446744073709551615.
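Note that 18446744073709551615 is 2^64-1, i.e. -1 stored in an unsigned 64-bit field, which is what Galera reports when the node has no defined index in the cluster. A minimal (hypothetical) health check that would have caught this state is sketched below; the two status values are hard-coded to what we observed on mdb2, whereas in practice they would come from `SHOW GLOBAL STATUS`:

```shell
# Hypothetical wsrep health check. In a real script the two values would be
# read via: mysql -Bse "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'" etc.
UNDEFINED_INDEX=18446744073709551615    # 2^64 - 1 == (uint64)-1, "no index in cluster"
cluster_size=0                          # value observed on mdb2 after the upgrade
local_index=$UNDEFINED_INDEX            # value observed on mdb2 after the upgrade

# A node is only really part of the cluster when it sees at least one member
# and has a defined local index; string comparison avoids 64-bit overflow.
if [ "$cluster_size" -gt 0 ] && [ "$local_index" != "$UNDEFINED_INDEX" ]; then
  echo "JOINED"
else
  echo "NOT JOINED"   # mdb2 claims Synced/Primary, yet it is not really joined
fi
```

With the values above the check prints "NOT JOINED", even though wsrep_local_state_comment and wsrep_cluster_status alone look healthy.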
At this point we restart mdb2 to fix the status:
[root@mdb2 my.cnf.d]# systemctl restart mariadb
[root@mdb2 my.cnf.d]# mysql
MariaDB mdb2 [(none)]> show global status like 'wsrep%';
+-------------------------------+------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------+
| wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_protocol_version | 9 |
| wsrep_last_committed | 66 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 200 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | 64 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 0 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 10.0.1.13:3306,AUTO |
| wsrep_cluster_weight | 2 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0.000853237/0.001923/0.00333681/0.0010427/3 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | ab80ace4-9ed6-11ea-8cdf-eab063bfbbb6 |
| wsrep_applier_thread_count | 32 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 6 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+------------------------------------------+
65 rows in set (0.002 sec)
|
NOTE that now the status is OK:
wsrep_local_index | 1
wsrep_cluster_status | Primary
wsrep_local_state_comment | Synced
wsrep_cluster_size | 2
But when we check the data, where we expect the new row to be present:
MariaDB mdb2 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk |
+----+---------------+--------+
| 1 | 123 | aaaa |
| 3 | 222 | eeeeaa |
| 4 | 34523452 | e4r4r4 |
+----+---------------+--------+
3 rows in set (0.001 sec)
|
The row is not there.
Any write performed after this moment is replicated correctly. So the data-loss window runs from the completion of the first IST until a new restart is done and the cluster status is back to normal.
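A simple way to quantify the loss once the cluster looks healthy again is to diff row dumps of the affected table from the two nodes. The sketch below simulates the dumps with the rows shown above; in practice each file would come from something like `mysql -h <node> -Bse "SELECT * FROM pippo.evento4" | sort`:

```shell
# Simulated sorted TSV dumps of pippo.evento4 from each node (rows as shown above).
printf '1\t123\taaaa\n3\t222\teeeeaa\n4\t34523452\te4r4r4\n6\t3\tnon tireplic\n' > /tmp/mdb1.tsv
printf '1\t123\taaaa\n3\t222\teeeeaa\n4\t34523452\te4r4r4\n' > /tmp/mdb2.tsv

# comm -23 prints lines present only in the first file, i.e. rows that exist
# on mdb1 but are missing on mdb2 -- exactly the lost writes.
comm -23 /tmp/mdb1.tsv /tmp/mdb2.tsv
# prints the single missing row: 6 <TAB> 3 <TAB> non tireplic
```

With real dumps this pinpoints every row lost during the window, not just a row-count mismatch.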
Attachments
Issue Links
- relates to
  - MDEV-29246 WSREP_CLUSTER_SIZE at 0 after rolling update a node from 10.3 to 10.4 (Closed)
  - MDEV-20439 WSREP_CLUSTER_SIZE at 0 after rolling update a node (Closed)
  - MDEV-22745 node crash on upgrade from 10.3 to 10.4 writing on the 10.4 node (Closed)