[MDEV-21675] Galera: RQG (mss): Master node disconnects from the cluster after voting on: Duplicate entry '0' for key 'PRIMARY', Error_code: 1062 Created: 2020-02-06  Updated: 2020-03-21  Resolved: 2020-03-21

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.1, 10.4
Fix Version/s: 10.4.13, 10.5.2

Type: Bug Priority: Major
Reporter: Stepan Patryshev (Inactive) Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: galera, rqg
Environment:

CentOS Linux release 7.6.1810 (Core)


Attachments: File galera_stress.yy     File galera_stress.zz     File mysql0.err     File mysql1.err     File mysql2.err     File runall-new.pl    

 Description   

When RQG is used with "--galera=mss", master node disconnects from the cluster after voting on: "Duplicate entry '0' for key 'PRIMARY', Error_code: 1062".

Galera lib 26.4.4(r4774): Repository: mariadb-corporation/es-galera; branch es-mariadb-4.x; Revision af1fa5b5d20c50ec827f45b9724ec964ab13efdd; debug build.

RQG: Repository: MariaDB/randgen; master branch; Revision 3bf9c73c1b010facc78d210dfc875dfbaaf1ccdd

Run:

perl ./runall-new.pl --grammar=conf/galera/galera_stress.yy --gendata=conf/galera/galera_stress.zz --duration=300 --queries=200M --threads=8 --galera=mss --basedir=/home/stepan/ment/10.5 --vardir=/home/stepan/rqg/var "--mysqld=--wsrep-provider=/usr/lib/libgalera_enterprise_smm.so" "--mysqld=--wsrep_sst_method=rsync" "--mysqld=--core" "--mysqld=--general-log" "--mysqld=--general-log-file=queries.log" "--mysqld=--log-output=file" "--mysqld=--wsrep-debug=none" "--mysqld=--wsrep-sync-wait=15" "--mysqld=--wsrep_retry_autocommit=0" "--mysqld=--wsrep_log_conflicts=1" "--mysqld=--wsrep_on=ON" "--mysqld=--default-storage-engine=innodb" "--mysqld=--sort_buffer_size=200M" "--mysqld=--innodb-autoinc-lock-mode=2" "--mysqld=--innodb-lock-wait-timeout=3"

Master 0 error log:
2020-02-06 12:01:40 2 [ERROR] WSREP: Vote 0 (success) on 674921f6-48c7-11ea-b9a9-4a0630e482fa:240 is inconsistent with group. Leaving cluster.

10.5.1-ES 57950ded281731263f6aa358d43c7b9d51f3dbfb debug

2020-02-06 12:00:12 2 [Note] WSREP: Server cnt7glr11.localdomain synced with group
2020-02-06 12:00:12 2 [Note] WSREP: Server status change joined -> synced
2020-02-06 12:00:12 2 [Note] WSREP: Synchronized with group, ready for connections
2020-02-06 12:00:12 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-02-06 12:00:14 0 [Note] WSREP: async IST sender served
2020-02-06 12:00:14 0 [Note] WSREP: 2.0 (cnt7glr11.localdomain): State transfer from 0.0 (cnt7glr11.localdomain) complete.
2020-02-06 12:00:14 0 [Note] WSREP: Member 2.0 (cnt7glr11.localdomain) synced with group.
2020-02-06 12:01:14 8 [Note] WSREP: MDL conflict db=test table=table10000_innodb_key_pk_parts_2_int_autoinc ticket=1 solved by abort
2020-02-06 12:01:14 8 [Note] WSREP: MDL conflict db=test table=table10000_innodb_key_pk_parts_2_int_autoinc ticket=1 solved by abort
2020-02-06 12:01:14 0 [Note] WSREP: MDL conflict db=test table=table10000_innodb_key_pk_parts_2_int_autoinc ticket=1 solved by abort
2020-02-06 12:01:14 0 [Note] WSREP: MDL conflict db=test table=table10000_innodb_key_pk_parts_2_int_autoinc ticket=9 solved by abort
2020-02-06 12:01:14 0 [Note] WSREP: MDL conflict db=test table=table10000_innodb_key_pk_parts_2_int_autoinc ticket=9 solved by abort
2020-02-06 12:01:40 0 [Note] WSREP: Member 2(cnt7glr11.localdomain) initiates vote on 674921f6-48c7-11ea-b9a9-4a0630e482fa:240,fdcab7194c6d731e:  Duplicate entry '0' for key 'PRIMARY', Error_code: 1062;
2020-02-06 12:01:40 0 [Note] WSREP: Votes over 674921f6-48c7-11ea-b9a9-4a0630e482fa:240:
   fdcab7194c6d731e:   1/3
Waiting for more votes.
2020-02-06 12:01:40 2 [Note] WSREP: Got vote request for seqno 674921f6-48c7-11ea-b9a9-4a0630e482fa:240
2020-02-06 12:01:40 0 [Note] WSREP: Member 1(cnt7glr11.localdomain) initiates vote on 674921f6-48c7-11ea-b9a9-4a0630e482fa:240,fdcab7194c6d731e:  Duplicate entry '0' for key 'PRIMARY', Error_code: 1062;
2020-02-06 12:01:40 0 [Note] WSREP: Votes over 674921f6-48c7-11ea-b9a9-4a0630e482fa:240:
   fdcab7194c6d731e:   2/3
Winner: fdcab7194c6d731e
2020-02-06 12:01:40 0 [Note] WSREP: Recovering vote result from history: 674921f6-48c7-11ea-b9a9-4a0630e482fa:240,fdcab7194c6d731e
2020-02-06 12:01:40 2 [ERROR] WSREP: Vote 0 (success) on 674921f6-48c7-11ea-b9a9-4a0630e482fa:240 is inconsistent with group. Leaving cluster.
2020-02-06 12:01:40 2 [Note] WSREP: Closing send monitor...
2020-02-06 12:01:40 2 [Note] WSREP: Closed send monitor.
2020-02-06 12:01:40 2 [Note] WSREP: gcomm: terminating thread
2020-02-06 12:01:40 2 [Note] WSREP: gcomm: joining thread
2020-02-06 12:01:40 2 [Note] WSREP: gcomm: closing backend
2020-02-06 12:01:40 2 [Note] WSREP: view(view_id(NON_PRIM,6747ff9c-a2d2,3) memb {
	6747ff9c-a2d2,0
} joined {
} left {
} partitioned {
	67bcc51c-bea9,0
	7026dfc8-9b76,0
})
2020-02-06 12:01:40 2 [Note] WSREP: PC protocol downgrade 1 -> 0
2020-02-06 12:01:40 2 [Note] WSREP: view((empty))
2020-02-06 12:01:40 2 [Note] WSREP: gcomm: closed
2020-02-06 12:01:40 14 [Warning] WSREP: 0x5634837cfd98 down context(s) not set
2020-02-06 12:01:40 14 [Warning] WSREP: gcs_caused() returned -107 (Transport endpoint is not connected)
2020-02-06 12:01:40 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2020-02-06 12:01:40 0 [Note] WSREP: Flow-control interval: [16, 16]
2020-02-06 12:01:40 0 [Note] WSREP: Received NON-PRIMARY.
2020-02-06 12:01:40 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 240)
2020-02-06 12:01:40 0 [Note] WSREP: New SELF-LEAVE.
2020-02-06 12:01:40 0 [Note] WSREP: Flow-control interval: [0, 0]
2020-02-06 12:01:40 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2020-02-06 12:01:40 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: -1)
2020-02-06 12:01:40 0 [Note] WSREP: RECV thread exiting 0: Success
2020-02-06 12:01:40 14 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
2020-02-06 12:01:40 14 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
2020-02-06 12:01:40 2 [Note] WSREP: recv_thread() joined.
2020-02-06 12:01:40 2 [Note] WSREP: Closing replication queue.
2020-02-06 12:01:40 2 [Note] WSREP: Closing slave action queue.
2020-02-06 12:01:40 2 [Note] WSREP: ================================================
View:
  id: 674921f6-48c7-11ea-b9a9-4a0630e482fa:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
	0: 6747ff9c-48c7-11ea-a2d2-6ee13cc9cdf6, cnt7glr11.localdomain
=================================================
2020-02-06 12:01:40 2 [Note] WSREP: Non-primary view
2020-02-06 12:01:40 2 [Note] WSREP: Server status change synced -> connected
2020-02-06 12:01:40 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-02-06 12:01:40 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-02-06 12:01:40 14 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
2020-02-06 12:01:40 2 [Note] WSREP: ================================================
View:
  id: 674921f6-48c7-11ea-b9a9-4a0630e482fa:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: yes
  own_index: -1
  members(0):
=================================================
2020-02-06 12:01:40 2 [Note] WSREP: Non-primary view
2020-02-06 12:01:40 2 [Note] WSREP: Server status change connected -> disconnected
2020-02-06 12:01:40 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-02-06 12:01:40 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-02-06 12:01:40 0 [Note] WSREP: Service thread queue flushed.
2020-02-06 12:01:40 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 5
2020-02-06 12:01:40 2 [Note] WSREP: Applier thread exiting ret: 0 thd: 2
2020-02-06 12:01:40 2 [Warning] Aborted connection 2 to db: 'unconnected' user: 'unauthenticated' host: '' (This connection closed normally without authentication)
2020-02-06 12:01:40 12 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)
2020-02-06 12:01:40 14 [Warning] WSREP: gcs_caused() returned -103 (Software caused connection abort)

Used runall-new.pl and appropriate yy and zz files are also attached.



 Comments   
Comment by Jan Lindström (Inactive) [ 2020-02-21 ]

Yurchenko Any progress on this ?

Comment by Alexey [ 2020-02-23 ]

Hm, somehow I was not notified of this. Well, I can reproduce it, but it needs some investigation. At the first glance there is an inconsistency between master and the slaves, master originally does not have a a row that is present on slaves. So far it looks like a bug in binlog emulation, as when binlog is enabled, this particular crash is not reproducible.

Generated at Thu Feb 08 09:08:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.