Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 10.11.8
- Fix Version/s: None
- Component/s: None
- Environment: Ubuntu 22.04
Description
We have a 3-node cluster holding a data set of about 230 GB across various databases. The cluster is kept reasonably up to date with apt, so we now find ourselves on 10.11.8 / Galera 26.4.18 (ra96793fc), and we have had difficulty maintaining the primary component ever since the first node running 10.11.8 was restarted. To rule out configuration inconsistencies I cloned the master node and rejoined the clones to it. After catching up, things are fine for a few hours until we hit an inconsistent state and the "slave" nodes take themselves out of the cluster.
It looks as if the slave has tried to apply the same replicated write twice. I checked the table and there are no duplicate rows. The table has a multi-column primary key, if that is relevant, but that is valid and supported.
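For reference, the duplicate-key values in the log further down (e.g. '226-5423890-202422-3-4') suggest the table is shaped roughly like the sketch below. The column order and types of the primary key are my assumption, inferred from the INSERT statement in the error log, not taken from the actual schema:
{{-- assumed shape of the affected table (column types and PK order are a guess)
CREATE TABLE wastereporting.product_waste (
  branch_id  INT NOT NULL,
  product_id INT NOT NULL,
  yearweek   INT NOT NULL,
  `day`      TINYINT NOT NULL,
  `value`    DECIMAL(10,2) NOT NULL,
  units      INT NOT NULL,
  reason_id  INT NOT NULL,
  PRIMARY KEY (branch_id, product_id, yearweek, `day`, reason_id)
) ENGINE=InnoDB;}}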
At this stage, you will understand that I cannot be certain whether the problem lies with the table, the config, or the versions, but I can say it only started with 10.11.8. We have another cluster with a very large data set on the same versions that does not appear to have the issue. The difference is that it does not have SST compression defined, but the config is otherwise identical:
[sst]
#compressor="/usr/bin/pigz"
#decompressor="/usr/bin/pigz -d"
I have disabled compression on the problematic cluster (commented out as shown above) to see whether it is the cause of the problem; I now have to wait for the cluster to get back in sync and then see what happens. I ran CHECK TABLE and can dump the table or run mariabackup without errors, so I think the table is fine. Yesterday it was a different table with the same error. Note that the error does not occur often, but on a busy cluster we trip over it regularly. On a test cluster I cannot generate enough data to reproduce the issue.
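To be concrete about the checks mentioned above, this is the sort of thing I mean (CHECK TABLE plus the standard Galera sync status; the table name is taken from the error log below):
{{CHECK TABLE wastereporting.product_waste EXTENDED;
-- the rejoined node should report 'Synced' here before it is tested again
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';}}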
Node config looks like this:
{{[mysqld]
wsrep_provider_options="pc.weight=2;gcache.size=1024M;gcache.recover=yes;evs.inactive_check_period=PT1S;evs.keepalive_period=PT3S;evs.suspect_timeout=P30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;evs.send_window=1024;evs.user_send_window=512;gcs.fc_limit=40;gcs.fc_factor=0.8;"
wsrep_on=ON
wsrep_cluster_name="ffwebc_cluster001"
wsrep_cluster_address='gcomm://node1,node2,node3?pc.wait_prim=no'
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_notify_cmd=/usr/local/sbin/wsrep-notify
wsrep_sst_method=mariabackup
wsrep_sst_auth='galera:galera'
# this node
wsrep_node_address="192.168.80.33"
wsrep_node_name="node1"
wsrep_slave_threads=8
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
query_cache_size=0
innodb_flush_log_at_trx_commit=0
sync_binlog=0
[sst]
#compressor="/usr/bin/pigz"
#decompressor="/usr/bin/pigz -d"}}
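Since keeping the primary component has been the problem, the quickest way to confirm what each node actually sees with this config is to query the standard Galera variables and status (nothing here is specific to this bug):
{{-- provider options actually in effect on the node
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';
-- whether the node still sees a primary component, and the cluster size it sees
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';}}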
and the error looks like:
{{2024-05-30 0:55:07 2 [Warning] WSREP: Ignoring error 'Duplicate entry '226-5423890-202422-3-4' for key 'PRIMARY'' on query. Default database: 'sales_report'. Query: 'INSERT INTO wastereporting.product_waste (branch_id, product_id, yearweek, day, value, units, reason_id)
SELECT branch.id, products.id, 202422, 3, 1, 2, reasons.id
FROM sales_report.branch, sales_report.products, wastereporting.reasons
WHERE branch.branch_code = '541'
AND products.product_code = 'U589'
AND reasons.name = 'Damaged On Delivery'', Error_code: 1062
2024-05-30 0:55:39 7 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table wastereporting.product_waste; Duplicate entry '3-655848-202422-4-3' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 12250620, Internal MariaDB error code: 1062
2024-05-30 0:55:39 7 [Warning] WSREP: Event 53104 Write_rows_v1 apply failed: 121, seqno 1493016
2024-05-30 0:55:39 0 [Note] WSREP: Member 1(node3) initiates vote on 74650a58-1cfb-11ef-91c9-970177d851aa:1493016,e79c311fecd71f20: Duplicate entry '3-655848-202422-4-3' for key 'PRIMARY', Error_code: 1062;
2024-05-30 0:55:39 0 [Note] WSREP: Votes over 74650a58-1cfb-11ef-91c9-970177d851aa:1493016:
0000000000000000: 1/2
e79c311fecd71f20: 1/2
Winner: 0000000000000000
2024-05-30 0:55:39 7 [ERROR] WSREP: Inconsistency detected: Inconsistent by consensus on 74650a58-1cfb-11ef-91c9-970177d851aa:1493016 at ./galera/src/replicator_smm.cpp:process_apply_error():1370
2024-05-30 0:55:39 7 [Note] WSREP: Closing send monitor...
2024-05-30 0:55:39 7 [Note] WSREP: Closed send monitor.
2024-05-30 0:55:39 7 [Note] WSREP: gcomm: terminating thread
2024-05-30 0:55:39 7 [Note] WSREP: gcomm: joining thread
2024-05-30 0:55:39 7 [Note] WSREP: gcomm: closing backend
2024-05-30 0:55:40 7 [Note] WSREP: view(view_id(NON_PRIM,5d3bcc69-9846,12) memb
joined {
} left {
} partitioned
)
2024-05-30 0:55:40 7 [Note] WSREP: PC protocol downgrade 1 -> 0
2024-05-30 0:55:40 7 [Note] WSREP: view((empty))
2024-05-30 0:55:40 7 [Note] WSREP: gcomm: closed}}
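When a node drops out like this, the row named in the duplicate-key error can be looked up directly on the failed node to see whether it really exists there. This is a sketch assuming the primary-key column order is (branch_id, product_id, yearweek, day, reason_id), which is my reading of the key value '3-655848-202422-4-3', not confirmed from the schema:
{{-- assumed PK column order: (branch_id, product_id, yearweek, day, reason_id)
SELECT * FROM wastereporting.product_waste
WHERE branch_id = 3
  AND product_id = 655848
  AND yearweek = 202422
  AND `day` = 4
  AND reason_id = 3;}}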
Attachments
Issue Links
- duplicates MDEV-34269: 10.11.8 cluster becomes inconsistent when using composite primary key and partitioning (Closed)