Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 10.11.8
- Fix Version/s: None
- Component/s: None
- Environment: Ubuntu 22.04
Description
We have a 3-node cluster holding a data set of about 230 GB across various databases. The cluster is kept reasonably up to date with apt, so we now find ourselves on 10.11.8 / Galera 26.4.18 (ra96793fc), and we have had difficulty maintaining the primary component ever since the first node running 10.11.8 was restarted. To rule out configuration inconsistencies I cloned the master node and rejoined the clones to it. After catching up, things are fine for a few hours until we hit an inconsistent state and the "slave" nodes take themselves out of the cluster.
It looks as if the slave has tried to apply the same replicated write twice. I checked the table and there are no duplicate rows. The table has a multi-column primary key, if that is relevant, but that is valid and supported.
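For reference, the duplicate-key values in the log further down (e.g. '226-5423890-202422-3-4') suggest the table is shaped roughly like the sketch below. The column order and types of the primary key are my assumption, inferred from the INSERT statement in the error log, not taken from the actual schema:
{{-- assumed shape of the affected table (column types and PK order are a guess)
CREATE TABLE wastereporting.product_waste (
  branch_id  INT NOT NULL,
  product_id INT NOT NULL,
  yearweek   INT NOT NULL,
  `day`      TINYINT NOT NULL,
  `value`    DECIMAL(10,2) NOT NULL,
  units      INT NOT NULL,
  reason_id  INT NOT NULL,
  PRIMARY KEY (branch_id, product_id, yearweek, `day`, reason_id)
) ENGINE=InnoDB;}}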
At this stage, you will understand that I cannot be certain whether the problem lies with the table, the config, or the versions, but I can say it only started with 10.11.8. We have another cluster with a very large data set on the same versions that does not appear to have the issue. The difference is that it does not have SST compression defined, but the config is otherwise identical:
[sst]
#compressor="/usr/bin/pigz"
#decompressor="/usr/bin/pigz -d"
I have disabled compression on the problematic cluster (commented out as shown above) to see whether it is the cause of the problem; I now have to wait for the cluster to get back in sync and then see what happens. I ran CHECK TABLE and can dump the table or run mariabackup without errors, so I think the table is fine. Yesterday it was a different table with the same error. Note that the error does not occur often, but on a busy cluster we trip over it regularly. On a test cluster I cannot generate enough data to reproduce the issue.
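To be concrete about the checks mentioned above, this is the sort of thing I mean (CHECK TABLE plus the standard Galera sync status; the table name is taken from the error log below):
{{CHECK TABLE wastereporting.product_waste EXTENDED;
-- the rejoined node should report 'Synced' here before it is tested again
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';}}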
Node config looks like this:
{{[mysqld]
wsrep_provider_options="pc.weight=2;gcache.size=1024M;gcache.recover=yes;evs.inactive_check_period=PT1S;evs.keepalive_period=PT3S;evs.suspect_timeout=P30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;evs.send_window=1024;evs.user_send_window=512;gcs.fc_limit=40;gcs.fc_factor=0.8;"
wsrep_on=ON
wsrep_cluster_name="ffwebc_cluster001"
wsrep_cluster_address='gcomm://node1,node2,node3?pc.wait_prim=no'
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_notify_cmd=/usr/local/sbin/wsrep-notify
wsrep_sst_method=mariabackup
wsrep_sst_auth='galera:galera'
# this node
wsrep_node_address="192.168.80.33"
wsrep_node_name="node1"
wsrep_slave_threads=8
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
query_cache_size=0
innodb_flush_log_at_trx_commit=0
sync_binlog=0
[sst]
#compressor="/usr/bin/pigz"
#decompressor="/usr/bin/pigz -d"}}
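Since keeping the primary component has been the problem, the quickest way to confirm what each node actually sees with this config is to query the standard Galera variables and status (nothing here is specific to this bug):
{{-- provider options actually in effect on the node
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';
-- whether the node still sees a primary component, and the cluster size it sees
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';}}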
and the error looks like:
{{2024-05-30 0:55:07 2 [Warning] WSREP: Ignoring error 'Duplicate entry '226-5423890-202422-3-4' for key 'PRIMARY'' on query. Default database: 'sales_report'. Query: 'INSERT INTO wastereporting.product_waste (branch_id, product_id, yearweek, day, value, units, reason_id)
SELECT branch.id, products.id, 202422, 3, 1, 2, reasons.id
FROM sales_report.branch, sales_report.products, wastereporting.reasons
WHERE branch.branch_code = '541'
AND products.product_code = 'U589'
AND reasons.name = 'Damaged On Delivery'', Error_code: 1062
2024-05-30 0:55:39 7 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table wastereporting.product_waste; Duplicate entry '3-655848-202422-4-3' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 12250620, Internal MariaDB error code: 1062
2024-05-30 0:55:39 7 [Warning] WSREP: Event 53104 Write_rows_v1 apply failed: 121, seqno 1493016
2024-05-30 0:55:39 0 [Note] WSREP: Member 1(node3) initiates vote on 74650a58-1cfb-11ef-91c9-970177d851aa:1493016,e79c311fecd71f20: Duplicate entry '3-655848-202422-4-3' for key 'PRIMARY', Error_code: 1062;
2024-05-30 0:55:39 0 [Note] WSREP: Votes over 74650a58-1cfb-11ef-91c9-970177d851aa:1493016:
0000000000000000: 1/2
e79c311fecd71f20: 1/2
Winner: 0000000000000000
2024-05-30 0:55:39 7 [ERROR] WSREP: Inconsistency detected: Inconsistent by consensus on 74650a58-1cfb-11ef-91c9-970177d851aa:1493016 at ./galera/src/replicator_smm.cpp:process_apply_error():1370
2024-05-30 0:55:39 7 [Note] WSREP: Closing send monitor...
2024-05-30 0:55:39 7 [Note] WSREP: Closed send monitor.
2024-05-30 0:55:39 7 [Note] WSREP: gcomm: terminating thread
2024-05-30 0:55:39 7 [Note] WSREP: gcomm: joining thread
2024-05-30 0:55:39 7 [Note] WSREP: gcomm: closing backend
2024-05-30 0:55:40 7 [Note] WSREP: view(view_id(NON_PRIM,5d3bcc69-9846,12) memb
joined {
} left {
} partitioned
)
2024-05-30 0:55:40 7 [Note] WSREP: PC protocol downgrade 1 -> 0
2024-05-30 0:55:40 7 [Note] WSREP: view((empty))
2024-05-30 0:55:40 7 [Note] WSREP: gcomm: closed}}
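When a node drops out like this, the row named in the duplicate-key error can be looked up directly on the failed node to see whether it really exists there. This is a sketch assuming the primary-key column order is (branch_id, product_id, yearweek, day, reason_id), which is my reading of the key value '3-655848-202422-4-3', not confirmed from the schema:
{{-- assumed PK column order: (branch_id, product_id, yearweek, day, reason_id)
SELECT * FROM wastereporting.product_waste
WHERE branch_id = 3
  AND product_id = 655848
  AND yearweek = 202422
  AND `day` = 4
  AND reason_id = 3;}}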
Attachments
Issue Links
- duplicates MDEV-34269: 10.11.8 cluster becomes inconsistent when using composite primary key and partitioning (Closed)