Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 10.3.23, 10.4.13
- Fix Version/s: None
- Environment: OS: CentOS Linux release 7.6.1810 (Core)
Description
Created a full Galera cluster on 10.3.23 with 3 nodes: mdb1, mdb2, mdb3, all on version 10.3.23.
We gracefully shut down mdb3 to check the interaction between writes on 10.3.23 and their effect on 10.4.13, and to force IST. We also re-tested with all 3 servers up, with the same result.
Created a schema and a table on mdb1; everything propagated. Then:
- stop mdb2; yum remove the MariaDB and Galera RPMs
- install from the new MariaDB 10.4 repo and update my.cnf with the correct wsrep_provider
- set wsrep_on=OFF in my.cnf
- start mdb2
- run mysql_upgrade -s
- stop mdb2
- set wsrep_on=ON in my.cnf
- start mdb2
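For reference, the my.cnf fragment toggled between the two restarts might look like this (the file path and provider library path are assumptions for a typical RPM install, not taken from the report):

```ini
# /etc/my.cnf.d/galera.cnf (hypothetical layout)
[galera]
# OFF for the non-clustered mysql_upgrade run, then back to ON:
wsrep_on=OFF
# 10.4 ships Galera 4, so the provider path changes on upgrade:
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_address=gcomm://mdb1,mdb2,mdb3
```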
At this point the Galera status variables on mdb2 are:
MariaDB mdb2 [pippo]> show global status like 'wsrep%';
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_protocol_version | -1 |
| wsrep_last_committed | 65 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 3 |
| wsrep_received_bytes | 208 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | 64 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0.5 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 1.5 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 1 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | AUTO,10.0.1.13:3306 |
| wsrep_cluster_weight | 2 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0.000325151/0.00176008/0.00607075/0.00193032/7 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | 7ff14eaf-9ed6-11ea-b98f-8fc2b85537f4 |
| wsrep_applier_thread_count | 32 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 0 |
| wsrep_cluster_state_uuid | |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 18446744073709551615 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
65 rows in set (0.001 sec)
NOTE THAT:
wsrep_cluster_status | Primary
wsrep_local_state_comment | Synced
wsrep_local_index | 18446744073709551615
wsrep_cluster_size | 0
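A plain `wsrep_local_state_comment` check would not catch this state: the node reports Synced and Primary, yet `wsrep_cluster_size` is 0 and `wsrep_local_index` is 18446744073709551615 (the unsigned rendering of -1, i.e. undefined). A minimal sketch of a consistency check over these values (the dict stands in for a `SHOW GLOBAL STATUS` result; the function name is made up for illustration):

```python
WSREP_UNDEFINED = 2**64 - 1  # 18446744073709551615: -1 stored in an unsigned 64-bit field

def node_really_joined(status: dict) -> bool:
    """Return True only if the wsrep status values are mutually consistent."""
    return (
        status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        and int(status.get("wsrep_cluster_size", 0)) >= 1
        and int(status.get("wsrep_local_index", WSREP_UNDEFINED)) != WSREP_UNDEFINED
    )

# Values reported by mdb2 after the first restart:
broken = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_cluster_size": "0",
    "wsrep_local_index": "18446744073709551615",
}
print(node_really_joined(broken))  # False: Synced/Primary, but not actually in the group
```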
Looking at the error log, the server is ready for connections after an IST.
At this point a write on the 'master' mdb1 does not get replicated. The current data on mdb2:
MariaDB mdb2 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk   |
+----+---------------+--------+
|  1 |           123 | aaaa   |
|  3 |           222 | eeeeaa |
|  4 |      34523452 | e4r4r4 |
+----+---------------+--------+
WHILE ON THE MASTER:
MariaDB mdb1 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk   |
+----+---------------+--------+
|  1 |           123 | aaaa   |
|  3 |           222 | eeeeaa |
|  4 |      34523452 | e4r4r4 |
+----+---------------+--------+
3 rows in set (0.001 sec)

MariaDB mdb1 [pippo]> insert into evento4 (IdDispositivo,kkkk) values (3,'non tireplic');
Query OK, 1 row affected (0.015 sec)

MariaDB mdb1 [pippo]> select * from evento4;
+----+---------------+--------------+
| Id | IdDispositivo | kkkk         |
+----+---------------+--------------+
|  1 |           123 | aaaa         |
|  3 |           222 | eeeeaa       |
|  4 |      34523452 | e4r4r4       |
|  6 |             3 | non tireplic |
+----+---------------+--------------+
4 rows in set (0.001 sec)
The fact that the INSERT is not replicated is presumably explained by wsrep_cluster_size=0 and wsrep_local_index=18446744073709551615.
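The value 18446744073709551615 is 2^64 - 1, i.e. -1 (an "undefined" marker) printed through an unsigned 64-bit status field; together with wsrep_protocol_version = -1 it suggests the node never actually completed joining the group:

```python
import ctypes

# -1 reinterpreted as an unsigned 64-bit integer gives the value seen
# in wsrep_local_index and wsrep_cluster_conf_id above:
print(ctypes.c_uint64(-1).value)  # 18446744073709551615
print(2**64 - 1)                  # 18446744073709551615, same value
```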
AT THIS point we restart mdb2 to fix the status:
[root@mdb2 my.cnf.d]# systemctl restart mariadb
[root@mdb2 my.cnf.d]# mysql
MariaDB md2 [(none)]> show global status like 'wsrep%';
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| Variable_name | Value |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
| wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_protocol_version | 9 |
| wsrep_last_committed | 66 |
| wsrep_replicated | 0 |
| wsrep_replicated_bytes | 0 |
| wsrep_repl_keys | 0 |
| wsrep_repl_keys_bytes | 0 |
| wsrep_repl_data_bytes | 0 |
| wsrep_repl_other_bytes | 0 |
| wsrep_received | 2 |
| wsrep_received_bytes | 200 |
| wsrep_local_commits | 0 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_max | 1 |
| wsrep_local_send_queue_min | 0 |
| wsrep_local_send_queue_avg | 0 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_max | 1 |
| wsrep_local_recv_queue_min | 0 |
| wsrep_local_recv_queue_avg | 0 |
| wsrep_local_cached_downto | 64 |
| wsrep_flow_control_paused_ns | 0 |
| wsrep_flow_control_paused | 0 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 0 |
| wsrep_apply_oooe | 0 |
| wsrep_apply_oool | 0 |
| wsrep_apply_window | 0 |
| wsrep_commit_oooe | 0 |
| wsrep_commit_oool | 0 |
| wsrep_commit_window | 0 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 0 |
| wsrep_causal_reads | 0 |
| wsrep_cert_interval | 0 |
| wsrep_open_transactions | 0 |
| wsrep_open_connections | 0 |
| wsrep_incoming_addresses | 10.0.1.13:3306,AUTO |
| wsrep_cluster_weight | 2 |
| wsrep_desync_count | 0 |
| wsrep_evs_delayed | |
| wsrep_evs_evict_list | |
| wsrep_evs_repl_latency | 0.000853237/0.001923/0.00333681/0.0010427/3 |
| wsrep_evs_state | OPERATIONAL |
| wsrep_gcomm_uuid | ab80ace4-9ed6-11ea-8cdf-eab063bfbbb6 |
| wsrep_applier_thread_count | 32 |
| wsrep_cluster_capabilities | |
| wsrep_cluster_conf_id | 6 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_index | 1 |
| wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <info@codership.com> |
| wsrep_provider_version | 26.4.4(r4599) |
| wsrep_ready | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count | 33 |
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+
65 rows in set (0.002 sec)
NOTE now the status is ok:
wsrep_local_index | 1
wsrep_cluster_status | Primary
wsrep_local_state_comment | Synced
wsrep_cluster_size | 2
but when we check the data, the new row we expect to be present is missing:
MariaDB mdb2 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk   |
+----+---------------+--------+
|  1 |           123 | aaaa   |
|  3 |           222 | eeeeaa |
|  4 |      34523452 | e4r4r4 |
+----+---------------+--------+
3 rows in set (0.001 sec)
The row is not there.
Everything written after this point does get replicated. So data is lost from the moment the first IST completes until the node is restarted and the cluster status is restored.
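Until the node is restarted, the only reliable way to spot the damage is to compare data between nodes. A minimal sketch of such a comparison (the rows are hard-coded here to mirror the SELECT output above; in practice each set would come from a client connection to the respective node):

```python
# Rows of evento4 as (Id, IdDispositivo, kkkk) tuples on each node:
mdb1_rows = {(1, 123, "aaaa"), (3, 222, "eeeeaa"), (4, 34523452, "e4r4r4"),
             (6, 3, "non tireplic")}
mdb2_rows = {(1, 123, "aaaa"), (3, 222, "eeeeaa"), (4, 34523452, "e4r4r4")}

# Set difference exposes the writes that never reached mdb2:
missing_on_mdb2 = mdb1_rows - mdb2_rows
print(sorted(missing_on_mdb2))  # [(6, 3, 'non tireplic')]
```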
Issue Links
- relates to MDEV-29246: WSREP_CLUSTER_SIZE at 0 after rolling update a node from 10.3 to 10.4 (Closed)
- relates to MDEV-20439: WSREP_CLUSTER_SIZE at 0 after rolling update a node (Closed)
- relates to MDEV-22745: node crash on upgrade from 10.3 to 10.4 writing on the 10.4 node (Closed)
Activity
Field | Original Value | New Value |
---|---|---|
Description |
Creating a full galera cluster of 10.3.23 with 3 nodes
mdb1,mdb2,mdb3 10.3.23 version. We gently showdown mdb3 to check the interaction between writing on 10.3.23 and effect on 10.4. , to enforce IST . We also re-tested with all 3 servers up , same result. Create a schema and a table on mdb1. all propagate - stop mdb2 . yum remove the rpm of Mariadb and galera. - install from new repo of Mariadb 10.4 and update my.cnf to the right wsrep_provider - set wsrep_on=OFF on my.cnf - start mdb2 - perform mysql_upgrade -s - stop mdb2 - set wsrep_on=ON on my.cnf - start mbd2 At this point the status galera variables on mdb2: MariaDB mdb2 [pippo]> show global status like 'wsrep%'; +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | Variable_name | Value | +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 | | wsrep_protocol_version | -1 | | wsrep_last_committed | 65 | | wsrep_replicated | 0 | | wsrep_replicated_bytes | 0 | | wsrep_repl_keys | 0 | | wsrep_repl_keys_bytes | 0 | | wsrep_repl_data_bytes | 0 | | wsrep_repl_other_bytes | 0 | | wsrep_received | 3 | | wsrep_received_bytes | 208 | | wsrep_local_commits | 0 | | wsrep_local_cert_failures | 0 | | wsrep_local_replays | 0 | | wsrep_local_send_queue | 0 | | wsrep_local_send_queue_max | 1 | | wsrep_local_send_queue_min | 0 | | wsrep_local_send_queue_avg | 0 | | wsrep_local_recv_queue | 0 | | wsrep_local_recv_queue_max | 1 | | wsrep_local_recv_queue_min | 0 | | wsrep_local_recv_queue_avg | 0 | | wsrep_local_cached_downto | 64 | | wsrep_flow_control_paused_ns | 0 | | wsrep_flow_control_paused | 0 | | wsrep_flow_control_sent | 0 | | wsrep_flow_control_recv | 0 | | wsrep_cert_deps_distance | 0 | | wsrep_apply_oooe | 0.5 | | wsrep_apply_oool | 0 | | wsrep_apply_window 
| 1.5 | | wsrep_commit_oooe | 0 | | wsrep_commit_oool | 0 | | wsrep_commit_window | 1 | | wsrep_local_state | 4 | | wsrep_local_state_comment | Synced | | wsrep_cert_index_size | 0 | | wsrep_causal_reads | 0 | | wsrep_cert_interval | 0 | | wsrep_open_transactions | 0 | | wsrep_open_connections | 0 | | wsrep_incoming_addresses | AUTO,10.0.1.13:3306 | | wsrep_cluster_weight | 2 | | wsrep_desync_count | 0 | | wsrep_evs_delayed | | | wsrep_evs_evict_list | | | wsrep_evs_repl_latency | 0.000325151/0.00176008/0.00607075/0.00193032/7 | | wsrep_evs_state | OPERATIONAL | | wsrep_gcomm_uuid | 7ff14eaf-9ed6-11ea-b98f-8fc2b85537f4 | | wsrep_applier_thread_count | 32 | | wsrep_cluster_capabilities | | | wsrep_cluster_conf_id | 18446744073709551615 | | wsrep_cluster_size | 0 | | wsrep_cluster_state_uuid | | | wsrep_cluster_status | Primary | | wsrep_connected | ON | | wsrep_local_bf_aborts | 0 | | wsrep_local_index | 18446744073709551615 | | wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: | | wsrep_provider_name | Galera | | wsrep_provider_vendor | Codership Oy <info@codership.com> | | wsrep_provider_version | 26.4.4(r4599) | | wsrep_ready | ON | | wsrep_rollbacker_thread_count | 1 | | wsrep_thread_count | 33 | +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ 65 rows in set (0.001 sec) NOTE THAT : wsrep_cluster_status | Primary wsrep_local_state_comment | Synced wsrep_local_index | 18446744073709551615 wsrep_cluster_size | 0 Looking at the error log, the server is ready for connections after a IST At this point the 'master' mdb1 have a write that are not getting replicate: MariaDB mdb2 [pippo]> select * from evento4; +----+---------------+--------+ | Id | IdDispositivo | kkkk | +----+---------------+--------+ | 1 | 123 | aaaa | 
| 3 | 222 | eeeeaa | | 4 | 34523452 | e4r4r4 | +----+---------------+--------+ WHILE ON THE MASTER: MariaDB mdb1 [pippo]> select * from evento4; +----+---------------+--------+ | Id | IdDispositivo | kkkk | +----+---------------+--------+ | 1 | 123 | aaaa | | 3 | 222 | eeeeaa | | 4 | 34523452 | e4r4r4 | +----+---------------+--------+ 3 rows in set (0.001 sec) MariaDB mdb1 [pippo]> insert into evento4 (IdDispositivo,kkkk) values (3,'non tireplic'); Query OK, 1 row affected (0.015 sec) MariaDB mdb1 [pippo]> select * from evento4; +----+---------------+--------------+ | Id | IdDispositivo | kkkk | +----+---------------+--------------+ | 1 | 123 | aaaa | | 3 | 222 | eeeeaa | | 4 | 34523452 | e4r4r4 | | 6 | 3 | non tireplic | +----+---------------+--------------+ 4 rows in set (0.001 sec) The fact that INSERT not getting replicate could be indeed cause the cluster_size=0 and wsrep_local_index= 18446744073709551615, obviously so AT THIS point we restart mdb2 to fix the status: [root@mdb2 my.cnf.d]# systemctl restart mariadb [root@mdb2 my.cnf.d]# mysql MariaDB md2 [(none)]> show global status like 'wsrep%'; +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | Variable_name | Value | +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 | | wsrep_protocol_version | 9 | | wsrep_last_committed | 66 | | wsrep_replicated | 0 | | wsrep_replicated_bytes | 0 | | wsrep_repl_keys | 0 | | wsrep_repl_keys_bytes | 0 | | wsrep_repl_data_bytes | 0 | | wsrep_repl_other_bytes | 0 | | wsrep_received | 2 | | wsrep_received_bytes | 200 | | wsrep_local_commits | 0 | | wsrep_local_cert_failures | 0 | | wsrep_local_replays | 0 | | wsrep_local_send_queue | 0 | | 
wsrep_local_send_queue_max | 1 | | wsrep_local_send_queue_min | 0 | | wsrep_local_send_queue_avg | 0 | | wsrep_local_recv_queue | 0 | | wsrep_local_recv_queue_max | 1 | | wsrep_local_recv_queue_min | 0 | | wsrep_local_recv_queue_avg | 0 | | wsrep_local_cached_downto | 64 | | wsrep_flow_control_paused_ns | 0 | | wsrep_flow_control_paused | 0 | | wsrep_flow_control_sent | 0 | | wsrep_flow_control_recv | 0 | | wsrep_cert_deps_distance | 0 | | wsrep_apply_oooe | 0 | | wsrep_apply_oool | 0 | | wsrep_apply_window | 0 | | wsrep_commit_oooe | 0 | | wsrep_commit_oool | 0 | | wsrep_commit_window | 0 | | wsrep_local_state | 4 | | wsrep_local_state_comment | Synced | | wsrep_cert_index_size | 0 | | wsrep_causal_reads | 0 | | wsrep_cert_interval | 0 | | wsrep_open_transactions | 0 | | wsrep_open_connections | 0 | | wsrep_incoming_addresses | 10.0.1.13:3306,AUTO | | wsrep_cluster_weight | 2 | | wsrep_desync_count | 0 | | wsrep_evs_delayed | | | wsrep_evs_evict_list | | | wsrep_evs_repl_latency | 0.000853237/0.001923/0.00333681/0.0010427/3 | | wsrep_evs_state | OPERATIONAL | | wsrep_gcomm_uuid | ab80ace4-9ed6-11ea-8cdf-eab063bfbbb6 | | wsrep_applier_thread_count | 32 | | wsrep_cluster_capabilities | | | wsrep_cluster_conf_id | 6 | | wsrep_cluster_size | 2 | | wsrep_cluster_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 | | wsrep_cluster_status | Primary | | wsrep_connected | ON | | wsrep_local_bf_aborts | 0 | | wsrep_local_index | 1 | | wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: | | wsrep_provider_name | Galera | | wsrep_provider_vendor | Codership Oy <info@codership.com> | | wsrep_provider_version | 26.4.4(r4599) | | wsrep_ready | ON | | wsrep_rollbacker_thread_count | 1 | | wsrep_thread_count | 33 | 
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ 65 rows in set (0.002 sec) NOTE now the status is ok: wsrep_local_index | 1 wsrep_cluster_status | Primary wsrep_local_state_comment | Synced wsrep_local_index | 1 but when we check the data we expect the new row should be present: MariaDB mdb2 [pippo]> select * from evento4; +----+---------------+--------+ | Id | IdDispositivo | kkkk | +----+---------------+--------+ | 1 | 123 | aaaa | | 3 | 222 | eeeeaa | | 4 | 34523452 | e4r4r4 | +----+---------------+--------+ 3 rows in set (0.001 sec) The row is not there. If we write after this moment all is getting replicate. So the data loss is after the first IST complete until a new restart is done and got the status of the cluster back. |
Creating a full galera cluster of 10.3.23 with 3 nodes
mdb1,mdb2,mdb3 10.3.23 version. We gently shut mdb3 to check the interaction between writing on 10.3.23 and effect on 10.4. , to enforce IST . We also re-tested with all 3 servers up , same result. Create a schema and a table on mdb1. all propagate - stop mdb2 . yum remove the rpm of Mariadb and galera. - install from new repo of Mariadb 10.4 and update my.cnf to the right wsrep_provider - set wsrep_on=OFF on my.cnf - start mdb2 - perform mysql_upgrade -s - stop mdb2 - set wsrep_on=ON on my.cnf - start mbd2 At this point the status galera variables on mdb2: MariaDB mdb2 [pippo]> show global status like 'wsrep%'; +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | Variable_name | Value | +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 | | wsrep_protocol_version | -1 | | wsrep_last_committed | 65 | | wsrep_replicated | 0 | | wsrep_replicated_bytes | 0 | | wsrep_repl_keys | 0 | | wsrep_repl_keys_bytes | 0 | | wsrep_repl_data_bytes | 0 | | wsrep_repl_other_bytes | 0 | | wsrep_received | 3 | | wsrep_received_bytes | 208 | | wsrep_local_commits | 0 | | wsrep_local_cert_failures | 0 | | wsrep_local_replays | 0 | | wsrep_local_send_queue | 0 | | wsrep_local_send_queue_max | 1 | | wsrep_local_send_queue_min | 0 | | wsrep_local_send_queue_avg | 0 | | wsrep_local_recv_queue | 0 | | wsrep_local_recv_queue_max | 1 | | wsrep_local_recv_queue_min | 0 | | wsrep_local_recv_queue_avg | 0 | | wsrep_local_cached_downto | 64 | | wsrep_flow_control_paused_ns | 0 | | wsrep_flow_control_paused | 0 | | wsrep_flow_control_sent | 0 | | wsrep_flow_control_recv | 0 | | wsrep_cert_deps_distance | 0 | | wsrep_apply_oooe | 0.5 | | wsrep_apply_oool | 0 | | wsrep_apply_window | 
1.5 | | wsrep_commit_oooe | 0 | | wsrep_commit_oool | 0 | | wsrep_commit_window | 1 | | wsrep_local_state | 4 | | wsrep_local_state_comment | Synced | | wsrep_cert_index_size | 0 | | wsrep_causal_reads | 0 | | wsrep_cert_interval | 0 | | wsrep_open_transactions | 0 | | wsrep_open_connections | 0 | | wsrep_incoming_addresses | AUTO,10.0.1.13:3306 | | wsrep_cluster_weight | 2 | | wsrep_desync_count | 0 | | wsrep_evs_delayed | | | wsrep_evs_evict_list | | | wsrep_evs_repl_latency | 0.000325151/0.00176008/0.00607075/0.00193032/7 | | wsrep_evs_state | OPERATIONAL | | wsrep_gcomm_uuid | 7ff14eaf-9ed6-11ea-b98f-8fc2b85537f4 | | wsrep_applier_thread_count | 32 | | wsrep_cluster_capabilities | | | wsrep_cluster_conf_id | 18446744073709551615 | | wsrep_cluster_size | 0 | | wsrep_cluster_state_uuid | | | wsrep_cluster_status | Primary | | wsrep_connected | ON | | wsrep_local_bf_aborts | 0 | | wsrep_local_index | 18446744073709551615 | | wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: | | wsrep_provider_name | Galera | | wsrep_provider_vendor | Codership Oy <info@codership.com> | | wsrep_provider_version | 26.4.4(r4599) | | wsrep_ready | ON | | wsrep_rollbacker_thread_count | 1 | | wsrep_thread_count | 33 | +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ 65 rows in set (0.001 sec) NOTE THAT : wsrep_cluster_status | Primary wsrep_local_state_comment | Synced wsrep_local_index | 18446744073709551615 wsrep_cluster_size | 0 Looking at the error log, the server is ready for connections after a IST At this point the 'master' mdb1 have a write that are not getting replicate: MariaDB mdb2 [pippo]> select * from evento4; +----+---------------+--------+ | Id | IdDispositivo | kkkk | +----+---------------+--------+ | 1 | 123 | aaaa | | 
3 | 222 | eeeeaa | | 4 | 34523452 | e4r4r4 | +----+---------------+--------+ WHILE ON THE MASTER: MariaDB mdb1 [pippo]> select * from evento4; +----+---------------+--------+ | Id | IdDispositivo | kkkk | +----+---------------+--------+ | 1 | 123 | aaaa | | 3 | 222 | eeeeaa | | 4 | 34523452 | e4r4r4 | +----+---------------+--------+ 3 rows in set (0.001 sec) MariaDB mdb1 [pippo]> insert into evento4 (IdDispositivo,kkkk) values (3,'non tireplic'); Query OK, 1 row affected (0.015 sec) MariaDB mdb1 [pippo]> select * from evento4; +----+---------------+--------------+ | Id | IdDispositivo | kkkk | +----+---------------+--------------+ | 1 | 123 | aaaa | | 3 | 222 | eeeeaa | | 4 | 34523452 | e4r4r4 | | 6 | 3 | non tireplic | +----+---------------+--------------+ 4 rows in set (0.001 sec) The fact that INSERT not getting replicate could be indeed cause the cluster_size=0 and wsrep_local_index= 18446744073709551615, obviously so AT THIS point we restart mdb2 to fix the status: [root@mdb2 my.cnf.d]# systemctl restart mariadb [root@mdb2 my.cnf.d]# mysql MariaDB md2 [(none)]> show global status like 'wsrep%'; +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | Variable_name | Value | +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 | | wsrep_protocol_version | 9 | | wsrep_last_committed | 66 | | wsrep_replicated | 0 | | wsrep_replicated_bytes | 0 | | wsrep_repl_keys | 0 | | wsrep_repl_keys_bytes | 0 | | wsrep_repl_data_bytes | 0 | | wsrep_repl_other_bytes | 0 | | wsrep_received | 2 | | wsrep_received_bytes | 200 | | wsrep_local_commits | 0 | | wsrep_local_cert_failures | 0 | | wsrep_local_replays | 0 | | wsrep_local_send_queue | 0 | | 
wsrep_local_send_queue_max | 1 | | wsrep_local_send_queue_min | 0 | | wsrep_local_send_queue_avg | 0 | | wsrep_local_recv_queue | 0 | | wsrep_local_recv_queue_max | 1 | | wsrep_local_recv_queue_min | 0 | | wsrep_local_recv_queue_avg | 0 | | wsrep_local_cached_downto | 64 | | wsrep_flow_control_paused_ns | 0 | | wsrep_flow_control_paused | 0 | | wsrep_flow_control_sent | 0 | | wsrep_flow_control_recv | 0 | | wsrep_cert_deps_distance | 0 | | wsrep_apply_oooe | 0 | | wsrep_apply_oool | 0 | | wsrep_apply_window | 0 | | wsrep_commit_oooe | 0 | | wsrep_commit_oool | 0 | | wsrep_commit_window | 0 | | wsrep_local_state | 4 | | wsrep_local_state_comment | Synced | | wsrep_cert_index_size | 0 | | wsrep_causal_reads | 0 | | wsrep_cert_interval | 0 | | wsrep_open_transactions | 0 | | wsrep_open_connections | 0 | | wsrep_incoming_addresses | 10.0.1.13:3306,AUTO | | wsrep_cluster_weight | 2 | | wsrep_desync_count | 0 | | wsrep_evs_delayed | | | wsrep_evs_evict_list | | | wsrep_evs_repl_latency | 0.000853237/0.001923/0.00333681/0.0010427/3 | | wsrep_evs_state | OPERATIONAL | | wsrep_gcomm_uuid | ab80ace4-9ed6-11ea-8cdf-eab063bfbbb6 | | wsrep_applier_thread_count | 32 | | wsrep_cluster_capabilities | | | wsrep_cluster_conf_id | 6 | | wsrep_cluster_size | 2 | | wsrep_cluster_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 | | wsrep_cluster_status | Primary | | wsrep_connected | ON | | wsrep_local_bf_aborts | 0 | | wsrep_local_index | 1 | | wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: | | wsrep_provider_name | Galera | | wsrep_provider_vendor | Codership Oy <info@codership.com> | | wsrep_provider_version | 26.4.4(r4599) | | wsrep_ready | ON | | wsrep_rollbacker_thread_count | 1 | | wsrep_thread_count | 33 | 
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ 65 rows in set (0.002 sec) NOTE now the status is ok: wsrep_local_index | 1 wsrep_cluster_status | Primary wsrep_local_state_comment | Synced wsrep_local_index | 1 but when we check the data we expect the new row should be present: MariaDB mdb2 [pippo]> select * from evento4; +----+---------------+--------+ | Id | IdDispositivo | kkkk | +----+---------------+--------+ | 1 | 123 | aaaa | | 3 | 222 | eeeeaa | | 4 | 34523452 | e4r4r4 | +----+---------------+--------+ 3 rows in set (0.001 sec) The row is not there. If we write after this moment all is getting replicate. So the data loss is after the first IST complete until a new restart is done and got the status of the cluster back. |
Description |
Creating a full galera cluster of 10.3.23 with 3 nodes
mdb1,mdb2,mdb3 10.3.23 version. We gently shut mdb3 to check the interaction between writing on 10.3.23 and effect on 10.4. , to enforce IST . We also re-tested with all 3 servers up , same result. Create a schema and a table on mdb1. all propagate - stop mdb2 . yum remove the rpm of Mariadb and galera. - install from new repo of Mariadb 10.4 and update my.cnf to the right wsrep_provider - set wsrep_on=OFF on my.cnf - start mdb2 - perform mysql_upgrade -s - stop mdb2 - set wsrep_on=ON on my.cnf - start mbd2 At this point the status galera variables on mdb2: MariaDB mdb2 [pippo]> show global status like 'wsrep%'; +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | Variable_name | Value | +-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ | wsrep_local_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 | | wsrep_protocol_version | -1 | | wsrep_last_committed | 65 | | wsrep_replicated | 0 | | wsrep_replicated_bytes | 0 | | wsrep_repl_keys | 0 | | wsrep_repl_keys_bytes | 0 | | wsrep_repl_data_bytes | 0 | | wsrep_repl_other_bytes | 0 | | wsrep_received | 3 | | wsrep_received_bytes | 208 | | wsrep_local_commits | 0 | | wsrep_local_cert_failures | 0 | | wsrep_local_replays | 0 | | wsrep_local_send_queue | 0 | | wsrep_local_send_queue_max | 1 | | wsrep_local_send_queue_min | 0 | | wsrep_local_send_queue_avg | 0 | | wsrep_local_recv_queue | 0 | | wsrep_local_recv_queue_max | 1 | | wsrep_local_recv_queue_min | 0 | | wsrep_local_recv_queue_avg | 0 | | wsrep_local_cached_downto | 64 | | wsrep_flow_control_paused_ns | 0 | | wsrep_flow_control_paused | 0 | | wsrep_flow_control_sent | 0 | | wsrep_flow_control_recv | 0 | | wsrep_cert_deps_distance | 0 | | wsrep_apply_oooe | 0.5 | | wsrep_apply_oool | 0 | | wsrep_apply_window | 
1.5 |
| wsrep_commit_oooe             | 0 |
| wsrep_commit_oool             | 0 |
| wsrep_commit_window           | 1 |
| wsrep_local_state             | 4 |
| wsrep_local_state_comment     | Synced |
| wsrep_cert_index_size         | 0 |
| wsrep_causal_reads            | 0 |
| wsrep_cert_interval           | 0 |
| wsrep_open_transactions       | 0 |
| wsrep_open_connections        | 0 |
| wsrep_incoming_addresses      | AUTO,10.0.1.13:3306 |
| wsrep_cluster_weight          | 2 |
| wsrep_desync_count            | 0 |
| wsrep_evs_delayed             | |
| wsrep_evs_evict_list          | |
| wsrep_evs_repl_latency        | 0.000325151/0.00176008/0.00607075/0.00193032/7 |
| wsrep_evs_state               | OPERATIONAL |
| wsrep_gcomm_uuid              | 7ff14eaf-9ed6-11ea-b98f-8fc2b85537f4 |
| wsrep_applier_thread_count    | 32 |
| wsrep_cluster_capabilities    | |
| wsrep_cluster_conf_id         | 18446744073709551615 |
| wsrep_cluster_size            | 0 |
| wsrep_cluster_state_uuid      | |
| wsrep_cluster_status          | Primary |
| wsrep_connected               | ON |
| wsrep_local_bf_aborts         | 0 |
| wsrep_local_index             | 18446744073709551615 |
| wsrep_provider_capabilities   | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name           | Galera |
| wsrep_provider_vendor         | Codership Oy <info@codership.com> |
| wsrep_provider_version        | 26.4.4(r4599) |
| wsrep_ready                   | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count            | 33 |
+-------------------------------+------------------------------------------------+
65 rows in set (0.001 sec)

NOTE THAT:

| wsrep_cluster_status      | Primary |
| wsrep_local_state_comment | Synced |
| wsrep_local_index         | 18446744073709551615 |
| wsrep_cluster_size        | 0 |

Looking at the error log, the server is ready for connections after an IST.

At this point, writes on the 'master' mdb1 are not replicated to mdb2:

MariaDB mdb2 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk   |
+----+---------------+--------+
|  1 |           123 | aaaa   |
|  3 |           222 | eeeeaa |
|  4 |      34523452 | e4r4r4 |
+----+---------------+--------+

WHILE ON THE MASTER:

MariaDB mdb1 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk   |
+----+---------------+--------+
|  1 |           123 | aaaa   |
|  3 |           222 | eeeeaa |
|  4 |      34523452 | e4r4r4 |
+----+---------------+--------+
3 rows in set (0.001 sec)

MariaDB mdb1 [pippo]> insert into evento4 (IdDispositivo,kkkk) values (3,'non tireplic');
Query OK, 1 row affected (0.015 sec)

MariaDB mdb1 [pippo]> select * from evento4;
+----+---------------+--------------+
| Id | IdDispositivo | kkkk         |
+----+---------------+--------------+
|  1 |           123 | aaaa         |
|  3 |           222 | eeeeaa       |
|  4 |      34523452 | e4r4r4       |
|  6 |             3 | non tireplic |
+----+---------------+--------------+
4 rows in set (0.001 sec)

The INSERT not being replicated is presumably explained by wsrep_cluster_size=0 and wsrep_local_index=18446744073709551615.

At this point we restart mdb2 to fix the status:

[root@mdb2 my.cnf.d]# systemctl restart mariadb
[root@mdb2 my.cnf.d]# mysql
MariaDB md2 [(none)]> show global status like 'wsrep%';
+-------------------------------+------------------------------------------------+
| Variable_name                 | Value |
+-------------------------------+------------------------------------------------+
| wsrep_local_state_uuid        | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_protocol_version        | 9 |
| wsrep_last_committed          | 66 |
| wsrep_replicated              | 0 |
| wsrep_replicated_bytes        | 0 |
| wsrep_repl_keys               | 0 |
| wsrep_repl_keys_bytes         | 0 |
| wsrep_repl_data_bytes         | 0 |
| wsrep_repl_other_bytes        | 0 |
| wsrep_received                | 2 |
| wsrep_received_bytes          | 200 |
| wsrep_local_commits           | 0 |
| wsrep_local_cert_failures     | 0 |
| wsrep_local_replays           | 0 |
| wsrep_local_send_queue        | 0 |
| wsrep_local_send_queue_max    | 1 |
| wsrep_local_send_queue_min    | 0 |
| wsrep_local_send_queue_avg    | 0 |
| wsrep_local_recv_queue        | 0 |
| wsrep_local_recv_queue_max    | 1 |
| wsrep_local_recv_queue_min    | 0 |
| wsrep_local_recv_queue_avg    | 0 |
| wsrep_local_cached_downto     | 64 |
| wsrep_flow_control_paused_ns  | 0 |
| wsrep_flow_control_paused     | 0 |
| wsrep_flow_control_sent       | 0 |
| wsrep_flow_control_recv       | 0 |
| wsrep_cert_deps_distance      | 0 |
| wsrep_apply_oooe              | 0 |
| wsrep_apply_oool              | 0 |
| wsrep_apply_window            | 0 |
| wsrep_commit_oooe             | 0 |
| wsrep_commit_oool             | 0 |
| wsrep_commit_window           | 0 |
| wsrep_local_state             | 4 |
| wsrep_local_state_comment     | Synced |
| wsrep_cert_index_size         | 0 |
| wsrep_causal_reads            | 0 |
| wsrep_cert_interval           | 0 |
| wsrep_open_transactions       | 0 |
| wsrep_open_connections        | 0 |
| wsrep_incoming_addresses      | 10.0.1.13:3306,AUTO |
| wsrep_cluster_weight          | 2 |
| wsrep_desync_count            | 0 |
| wsrep_evs_delayed             | |
| wsrep_evs_evict_list          | |
| wsrep_evs_repl_latency        | 0.000853237/0.001923/0.00333681/0.0010427/3 |
| wsrep_evs_state               | OPERATIONAL |
| wsrep_gcomm_uuid              | ab80ace4-9ed6-11ea-8cdf-eab063bfbbb6 |
| wsrep_applier_thread_count    | 32 |
| wsrep_cluster_capabilities    | |
| wsrep_cluster_conf_id         | 6 |
| wsrep_cluster_size            | 2 |
| wsrep_cluster_state_uuid      | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 |
| wsrep_cluster_status          | Primary |
| wsrep_connected               | ON |
| wsrep_local_bf_aborts         | 0 |
| wsrep_local_index             | 1 |
| wsrep_provider_capabilities   | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: |
| wsrep_provider_name           | Galera |
| wsrep_provider_vendor         | Codership Oy <info@codership.com> |
| wsrep_provider_version        | 26.4.4(r4599) |
| wsrep_ready                   | ON |
| wsrep_rollbacker_thread_count | 1 |
| wsrep_thread_count            | 33 |
+-------------------------------+------------------------------------------------+
65 rows in set (0.002 sec)

NOTE: now the status is OK:

| wsrep_local_index         | 1 |
| wsrep_cluster_status      | Primary |
| wsrep_local_state_comment | Synced |

but when we check the data, expecting the new row to be present:

MariaDB mdb2 [pippo]> select * from evento4;
+----+---------------+--------+
| Id | IdDispositivo | kkkk   |
+----+---------------+--------+
|  1 |           123 | aaaa   |
|  3 |           222 | eeeeaa |
|  4 |      34523452 | e4r4r4 |
+----+---------------+--------+
3 rows in set (0.001 sec)

The row is not there. Any write issued after this moment is replicated correctly, so the data loss covers the window from the completion of the first IST until the node is restarted again and the cluster status is recovered.
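The broken state described above is mechanically detectable: the node reports wsrep_ready=ON and wsrep_local_state_comment=Synced, yet wsrep_cluster_size is 0 and wsrep_local_index is 18446744073709551615 (i.e. (uint64)-1, meaning the node holds no position in the cluster). As a workaround until the fix, an operator could gate traffic on those values before routing writes to an upgraded node. A minimal sketch of such a check (the function name and its output strings are illustrative, not a MariaDB tool); it reads the tab-separated batch output of `mysql -Nse "SHOW GLOBAL STATUS LIKE 'wsrep%'"` on stdin:

```shell
#!/bin/sh
# Decide whether a node's wsrep status describes a usable cluster member.
# Input: "Variable_name<TAB>Value" lines, as produced by `mysql -Nse ...`.
wsrep_node_healthy() {
    awk -F'\t' '
        $1 == "wsrep_ready"        { ready = $2 }
        $1 == "wsrep_cluster_size" { size  = $2 }
        $1 == "wsrep_local_index"  { idx   = $2 }
        END {
            # 18446744073709551615 is (uint64)-1: the node has no index in
            # the cluster even though it may still claim Synced/Primary.
            if (ready == "ON" && size + 0 > 0 && idx != "18446744073709551615")
                print "healthy"
            else
                print "broken"
        }'
}
```

Usage: `mysql -Nse "SHOW GLOBAL STATUS LIKE 'wsrep%'" | wsrep_node_healthy` on each node after the upgrade, and keep the node out of the write pool while it prints "broken".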
Priority | Major [ 3 ] | Critical [ 2 ] |
Fix Version/s | 10.3 [ 22126 ] | |
Assignee | Jan Lindström [ jplindst ] |
Assignee | Jan Lindström [ jplindst ] | Stepan Patryshev [ stepan.patryshev ] |
Status | Open [ 1 ] | In Progress [ 3 ] |
| | wsrep_local_cert_failures | 0 | | wsrep_local_replays | 0 | | wsrep_local_send_queue | 0 | | wsrep_local_send_queue_max | 1 | | wsrep_local_send_queue_min | 0 | | wsrep_local_send_queue_avg | 0 | | wsrep_local_recv_queue | 0 | | wsrep_local_recv_queue_max | 1 | | wsrep_local_recv_queue_min | 0 | | wsrep_local_recv_queue_avg | 0 | | wsrep_local_cached_downto | 64 | | wsrep_flow_control_paused_ns | 0 | | wsrep_flow_control_paused | 0 | | wsrep_flow_control_sent | 0 | | wsrep_flow_control_recv | 0 | | wsrep_cert_deps_distance | 0 | | wsrep_apply_oooe | 0 | | wsrep_apply_oool | 0 | | wsrep_apply_window | 0 | | wsrep_commit_oooe | 0 | | wsrep_commit_oool | 0 | | wsrep_commit_window | 0 | | wsrep_local_state | 4 | | wsrep_local_state_comment | Synced | | wsrep_cert_index_size | 0 | | wsrep_causal_reads | 0 | | wsrep_cert_interval | 0 | | wsrep_open_transactions | 0 | | wsrep_open_connections | 0 | | wsrep_incoming_addresses | 10.0.1.13:3306,AUTO | | wsrep_cluster_weight | 2 | | wsrep_desync_count | 0 | | wsrep_evs_delayed | | | wsrep_evs_evict_list | | | wsrep_evs_repl_latency | 0.000853237/0.001923/0.00333681/0.0010427/3 | | wsrep_evs_state | OPERATIONAL | | wsrep_gcomm_uuid | ab80ace4-9ed6-11ea-8cdf-eab063bfbbb6 | | wsrep_applier_thread_count | 32 | | wsrep_cluster_capabilities | | | wsrep_cluster_conf_id | 6 | | wsrep_cluster_size | 2 | | wsrep_cluster_state_uuid | 86a3014e-9e9d-11ea-8f7d-829b023fcaf0 | | wsrep_cluster_status | Primary | | wsrep_connected | ON | | wsrep_local_bf_aborts | 0 | | wsrep_local_index | 1 | | wsrep_provider_capabilities | :MULTI_MASTER:CERTIFICATION:PARALLEL_APPLYING:TRX_REPLAY:ISOLATION:PAUSE:CAUSAL_READS:INCREMENTAL_WRITESET:UNORDERED:PREORDERED:STREAMING:NBO: | | wsrep_provider_name | Galera | | wsrep_provider_vendor | Codership Oy <info@codership.com> | | wsrep_provider_version | 26.4.4(r4599) | | wsrep_ready | ON | | wsrep_rollbacker_thread_count | 1 | | wsrep_thread_count | 33 | 
+-------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------+ 65 rows in set (0.002 sec) {noformat} NOTE now the status is ok: {noformat} wsrep_local_index | 1 wsrep_cluster_status | Primary wsrep_local_state_comment | Synced wsrep_local_index | 1 {noformat} but when we check the data we expect the new row should be present: {noformat} MariaDB mdb2 [pippo]> select * from evento4; +----+---------------+--------+ | Id | IdDispositivo | kkkk | +----+---------------+--------+ | 1 | 123 | aaaa | | 3 | 222 | eeeeaa | | 4 | 34523452 | e4r4r4 | +----+---------------+--------+ 3 rows in set (0.001 sec) {noformat} The row is not there. If we write after this moment all is getting replicate. So the data loss is after the first IST complete until a new restart is done and got the status of the cluster back. |
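The value 18446744073709551615 reported for wsrep_local_index and wsrep_cluster_conf_id above is 2^64 - 1, i.e. -1 reinterpreted as an unsigned 64-bit integer, which the provider uses as an "undefined" sentinel when the node has no valid position in the cluster membership. A small sketch of that reinterpretation (illustration only, not server code):

```python
import struct

# Pack -1 as a signed 64-bit integer and read it back as unsigned:
# this yields the "undefined" sentinel seen in the status output.
UNDEFINED = struct.unpack("<Q", struct.pack("<q", -1))[0]

print(UNDEFINED)               # 18446744073709551615
print(UNDEFINED == 2**64 - 1)  # True
```

So a node printing this value for its local index effectively has index -1, i.e. no membership, even while reporting itself as Synced.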
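The report shows that wsrep_ready=ON, wsrep_cluster_status=Primary and wsrep_local_state_comment=Synced are not by themselves enough to decide a node is write-safe: the broken node reported all three while wsrep_cluster_size was 0. A minimal monitoring sketch (hypothetical helper, not part of MariaDB) that also checks wsrep_cluster_size and wsrep_local_index against the sentinel value from this report:

```python
WSREP_UNDEFINED = 2**64 - 1  # -1 as uint64: "no index in the cluster"

def node_is_write_safe(status):
    """status: dict of wsrep_* variable name -> string value,
    e.g. parsed from SHOW GLOBAL STATUS LIKE 'wsrep%'."""
    return (
        status.get("wsrep_ready") == "ON"
        and status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
        # the extra checks this bug shows are necessary:
        and int(status.get("wsrep_cluster_size", "0")) > 0
        and int(status.get("wsrep_local_index", str(WSREP_UNDEFINED))) != WSREP_UNDEFINED
    )

# Snapshot right after the upgrade/IST (broken, yet "Synced"):
broken = {"wsrep_ready": "ON", "wsrep_cluster_status": "Primary",
          "wsrep_local_state_comment": "Synced",
          "wsrep_cluster_size": "0",
          "wsrep_local_index": "18446744073709551615"}
# Snapshot after the extra restart (healthy):
healthy = dict(broken, wsrep_cluster_size="2", wsrep_local_index="1")

print(node_is_write_safe(broken))   # False
print(node_is_write_safe(healthy))  # True
```

A load balancer health check built only on wsrep_ready/Synced would have routed writes to the broken node during the data-loss window described above.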
Attachment | 200612_mysqld.1.err [ 52173 ] | |
Attachment | 200612_mysqld.2.err [ 52174 ] | |
Attachment | 200612_mysqld.3.err [ 52175 ] | |
Attachment | mysqld_new.2.cnf [ 52176 ] | |
Attachment | mysqld_old.3.cnf [ 52177 ] | |
Attachment | mysqld_old.2.cnf [ 52178 ] | |
Attachment | mysqld_old.1.cnf [ 52179 ] |
Assignee | Stepan Patryshev [ stepan.patryshev ] | Seppo Jaakola [ seppo ] |
Assignee | Seppo Jaakola [ seppo ] | Stepan Patryshev [ stepan.patryshev ] |
Status | In Progress [ 3 ] | Stalled [ 10000 ] |
Assignee | Stepan Patryshev [ stepan.patryshev ] | Seppo Jaakola [ seppo ] |
Affects Version/s | 10.4.13 [ 24223 ] | |
Affects Version/s | 10.3.14 [ 23216 ] |
Fix Version/s | 10.4 [ 22408 ] |
Comment | [ [~mihaQ] you are running the wrong test. The insert and the data loss happen while node2 is down; in your test you write when the node has already joined the cluster ]
Attachment | node1_bootsrapped_10.3.23.log.rtf [ 52193 ] | |
Attachment | node2_upgraded.log.rtf [ 52194 ] |
Attachment | node2_upgraded_10.4.13.log [ 52195 ] | |
Attachment | node1_bootsrapped_10.3.23.log [ 52196 ] |
Link |
This issue relates to |
Attachment | 200709_patgal_output.zip [ 52743 ] |
Attachment | 20200713_MDEV-22723_patgal_no_errors.zip [ 52773 ] |
Attachment | 20200714_MDEV-22723_patgal_no_errors.zip [ 52793 ] |
Attachment | 20200714_MDEV-22723_patgal_no_errors.zip [ 52794 ] |
Attachment |
20200714_ |
Attachment |
20200714_ |
Attachment | 20200714_MDEV-22723_patgal_no_errors.zip [ 52796 ] |
Attachment | 20200714_MDEV-22723_mdb_no_errors.zip [ 52806 ] |
Assignee | Seppo Jaakola [ seppo ] | Alexey [ yurchenko ] |
Attachment | 20200720_MDEV-22723_CentOS_7.5_no_errors.zip [ 52884 ] |
Status | Stalled [ 10000 ] | In Progress [ 3 ] |
Attachment | 20200723_MDEV-22723_data_loss.zip [ 52936 ] |
Link |
This issue relates to |
Fix Version/s | 10.3.25 [ 24506 ] | |
Fix Version/s | 10.4.15 [ 24507 ] | |
Fix Version/s | 10.3 [ 22126 ] | |
Fix Version/s | 10.4 [ 22408 ] | |
Resolution | Fixed [ 1 ] | |
Status | In Progress [ 3 ] | Closed [ 6 ] |
Fix Version/s | 10.4.16 [ 25020 ] |
Fix Version/s | 10.4.15 [ 24507 ] |
Fix Version/s | 10.3.26 [ 25021 ] |
Fix Version/s | 10.3.25 [ 24506 ] |
Workflow | MariaDB v3 [ 109176 ] | MariaDB v4 [ 157862 ] |
Link |
This issue relates to |
Zendesk Related Tickets | 183937 |
Looks related to https://jira.mariadb.org/browse/MDEV-19983