[MDEV-14877] Galera cluster crashed when deleting records Created: 2018-01-05  Updated: 2020-12-21  Resolved: 2020-12-21

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.2.10, 10.1.34, 10.1.35
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Oren Bissick (Inactive) Assignee: Seppo Jaakola
Resolution: Incomplete Votes: 0
Labels: need_feedback


 Description   

Ran pt-archiver to archive records and all the nodes went down with the error below.

2018-01-04 11:36:27 140527176017664 [ERROR] mysqld: Can't find record in 'log_system'
2018-01-04 11:36:28 140527176017664 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table dataworks.log_system; Can't find record in 'log_system', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 541, Internal MariaDB error code: 1032
2018-01-04 11:36:28 140527176017664 [Warning] WSREP: RBR event 3 Delete_rows_v1 apply warning: 120, 397232798
2018-01-04 11:36:28 140527176017664 [Warning] WSREP: Failed to apply app buffer: seqno: 397232798, status: 1
	 at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
2018-01-04 11:36:28 140527176017664 [ERROR] mysqld: Can't find record in 'log_system'
2018-01-04 11:36:28 140527176017664 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table dataworks.log_system; Can't find record in 'log_system', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 541, Internal MariaDB error code: 1032
2018-01-04 11:36:28 140527176017664 [Warning] WSREP: RBR event 3 Delete_rows_v1 apply warning: 120, 397232798
2018-01-04 11:36:28 140527176017664 [Warning] WSREP: Failed to apply app buffer: seqno: 397232798, status: 1
	 at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2018-01-04 11:36:28 140527176017664 [ERROR] mysqld: Can't find record in 'log_system'
2018-01-04 11:36:28 140527176017664 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table dataworks.log_system; Can't find record in 'log_system', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 541, Internal MariaDB error code: 1032
2018-01-04 11:36:28 140527176017664 [Warning] WSREP: RBR event 3 Delete_rows_v1 apply warning: 120, 397232798
2018-01-04 11:36:28 140527176017664 [Warning] WSREP: Failed to apply app buffer: seqno: 397232798, status: 1
	 at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2018-01-04 11:36:28 140527176017664 [ERROR] mysqld: Can't find record in 'log_system'
2018-01-04 11:36:28 140527176017664 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table dataworks.log_system; Can't find record in 'log_system', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 541, Internal MariaDB error code: 1032
2018-01-04 11:36:28 140527176017664 [Warning] WSREP: RBR event 3 Delete_rows_v1 apply warning: 120, 397232798
2018-01-04 11:36:28 140527176017664 [ERROR] WSREP: Failed to apply trx: source: a16066e0-e4f9-11e7-ae4d-665dac15733d version: 3 local: 0 state: APPLYING flags: 1 conn_id: 3358326 trx_id: 822622245 seqnos (l: 5873410, g: 397232798, s: 397232797, d: 397232772, ts: 3278302040065719)
2018-01-04 11:36:28 140527176017664 [ERROR] WSREP: Failed to apply trx 397232798 4 times
2018-01-04 11:36:28 140527176017664 [ERROR] WSREP: Node consistency compromized, aborting...
2018-01-04 11:36:28 140527176017664 [Note] WSREP: Closing send monitor...
2018-01-04 11:36:28 140527176017664 [Note] WSREP: Closed send monitor.
2018-01-04 11:36:28 140527176017664 [Note] WSREP: gcomm: terminating thread
2018-01-04 11:36:28 140527176017664 [Note] WSREP: gcomm: joining thread
2018-01-04 11:36:28 140527176017664 [Note] WSREP: gcomm: closing backend
2018-01-04 11:36:28 140527176017664 [Note] WSREP: view(view_id(NON_PRIM,17fca7cf,203) memb {
	877cbcec,0
} joined {
} left {
} partitioned {
	17fca7cf,0
	6ec5c1cb,0
	a16066e0,0
	cdc1bdfd,0
})
2018-01-04 11:36:28 140527176017664 [Note] WSREP: view((empty))
2018-01-04 11:36:28 140527176017664 [Note] WSREP: gcomm: closed
2018-01-04 11:36:28 140553635206912 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-01-04 11:36:28 140553635206912 [Note] WSREP: Flow-control interval: [64, 80]
2018-01-04 11:36:28 140553635206912 [Note] WSREP: Received NON-PRIMARY.
2018-01-04 11:36:28 140553635206912 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 397232798)
2018-01-04 11:36:28 140553635206912 [Note] WSREP: Received self-leave message.
2018-01-04 11:36:28 140553635206912 [Note] WSREP: Flow-control interval: [0, 0]
2018-01-04 11:36:28 140553635206912 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2018-01-04 11:36:28 140553635206912 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 397232798)
2018-01-04 11:36:28 140553635206912 [Note] WSREP: RECV thread exiting 0: Success
2018-01-04 11:36:28 140527176017664 [Note] WSREP: recv_thread() joined.
2018-01-04 11:36:28 140527176017664 [Note] WSREP: Closing replication queue.
2018-01-04 11:36:28 140527176017664 [Note] WSREP: Closing slave action queue.
2018-01-04 11:36:28 140527176017664 [Note] WSREP: /usr/sbin/mysqld: Terminated.
2018-01-04 11:36:42 140645890877568 [Note] WSREP: Read nil XID from storage engines, skipping position init
2018-01-04 11:36:42 140645890877568 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2018-01-04 11:36:42 140645890877568 [Note] WSREP: wsrep_load(): Galera 25.3.20(r3703) by Codership Oy <info@codership.com> loaded successfully.



 Comments   
Comment by Elena Stepanova [ 2018-01-06 ]

Which version is it?

Comment by Oren Bissick (Inactive) [ 2018-01-08 ]

version 10.2.10

Comment by Oren Bissick (Inactive) [ 2018-01-09 ]

Also record did exist for the delete that was suppose to run.

Comment by Oren Bissick (Inactive) [ 2018-01-09 ]

also the version says log at the end, why is that? 10.2.10-MariaDB-log

Comment by Bernd Lünenschloß [ 2018-01-31 ]

We have quite similar problems at our galera cluster using version 10.1.30 of mariadb.
One of three nodes aborted several times the last days due to compromised node consistency while deleting 20k to 70k rows of data.

+-------------------------+---------------------------------+
| Variable_name           | Value                           |
+-------------------------+---------------------------------+
| innodb_version          | 5.6.36-82.2                     |
| protocol_version        | 10                              |
| slave_type_conversions  |                                 |
| version                 | 10.1.30-MariaDB                 |
| version_comment         | MariaDB Server                  |
| version_compile_machine | x86_64                          |
| version_compile_os      | Linux                           |
| version_malloc_library  | system jemalloc                 |
| version_ssl_library     | OpenSSL 1.0.1e-fips 11 Feb 2013 |
| wsrep_patch_version     | wsrep_25.21                     |
+-------------------------+---------------------------------+

This is what we can find in mysql_error.logs:

BF-BF X lock conflict,mode: 1027 supremum: 0
 
conflicts states: my 0 locked 0
 
RECORD LOCKS space id 12992 page no 124919 n bits 128 index `PRIMARY` of table `w3_mdm`.`devices_actions_commands` trx table locks 10 total table locks 5 trx id 1071748551 lock_mode X locks rec but not gap lock hold time 0 wait time before grant 0
 
2018-01-31 9:56:59 139910114175744 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table w3_mdm.devices_actions_commands; Can't find record in 'devices_actions_commands', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 98348, Internal MariaDB error code: 1032
 
2018-01-31 9:57:00 139910114175744 [Warning] WSREP: RBR event 116 Delete_rows_v1 apply warning: 120, 12430471
 
2018-01-31 9:57:00 139910114175744 [Warning] WSREP: Failed to apply app buffer: seqno: 12430471, status: 1
 
at galera/src/trx_handle.cpp:apply():351
 
Retrying 2th time
 
BF-BF X lock conflict,mode: 1027 supremum: 0
 
conflicts states: my 0 locked 0
 
RECORD LOCKS space id 12992 page no 124833 n bits 136 index `PRIMARY` of table `w3_mdm`.`devices_actions_commands` trx table locks 10 total table locks 4 trx id 1071748551 lock_mode X locks rec but not gap lock hold time 1 wait time before grant 0
 
BF-BF X lock conflict,mode: 1027 supremum: 0
 
conflicts states: my 0 locked 0
 
RECORD LOCKS space id 12992 page no 124919 n bits 128 index `PRIMARY` of table `w3_mdm`.`devices_actions_commands` trx table locks 10 total table locks 4 trx id 1071748551 lock_mode X locks rec but not gap lock hold time 1 wait time before grant 0
 
2018-01-31 9:57:00 139910114175744 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table w3_mdm.devices_actions_commands; Can't find record in 'devices_actions_commands', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 98348, Internal MariaDB error code: 1032
 
2018-01-31 9:57:00 139910114175744 [Warning] WSREP: RBR event 116 Delete_rows_v1 apply warning: 120, 12430471
 
2018-01-31 9:57:00 139910114175744 [Warning] WSREP: Failed to apply app buffer: seqno: 12430471, status: 1
 
at galera/src/trx_handle.cpp:apply():351
 
Retrying 3th time
 
BF-BF X lock conflict,mode: 1027 supremum: 0
 
conflicts states: my 0 locked 0
 
RECORD LOCKS space id 12992 page no 124833 n bits 136 index `PRIMARY` of table `w3_mdm`.`devices_actions_commands` trx table locks 10 total table locks 4 trx id 1071748551 lock_mode X locks rec but not gap lock hold time 1 wait time before grant 0
 
BF-BF X lock conflict,mode: 1027 supremum: 0
 
conflicts states: my 0 locked 0
 
RECORD LOCKS space id 12992 page no 124919 n bits 128 index `PRIMARY` of table `w3_mdm`.`devices_actions_commands` trx table locks 10 total table locks 4 trx id 1071748551 lock_mode X locks rec but not gap lock hold time 1 wait time before grant 0
 
2018-01-31 9:57:00 139910114175744 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table w3_mdm.devices_actions_commands; Can't find record in 'devices_actions_commands', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 98348, Internal MariaDB error code: 1032
 
2018-01-31 9:57:00 139910114175744 [Warning] WSREP: RBR event 116 Delete_rows_v1 apply warning: 120, 12430471
 
2018-01-31 9:57:00 139910114175744 [Warning] WSREP: Failed to apply app buffer: seqno: 12430471, status: 1
 
at galera/src/trx_handle.cpp:apply():351
 
Retrying 4th time
 
BF-BF X lock conflict,mode: 1027 supremum: 0
 
conflicts states: my 0 locked 0
 
RECORD LOCKS space id 12992 page no 124833 n bits 136 index `PRIMARY` of table `w3_mdm`.`devices_actions_commands` trx table locks 10 total table locks 4 trx id 1071748551 lock_mode X locks rec but not gap lock hold time 1 wait time before grant 0
 
BF-BF X lock conflict,mode: 1027 supremum: 0
 
conflicts states: my 0 locked 0
 
RECORD LOCKS space id 12992 page no 124919 n bits 128 index `PRIMARY` of table `w3_mdm`.`devices_actions_commands` trx table locks 10 total table locks 4 trx id 1071748551 lock_mode X locks rec but not gap lock hold time 1 wait time before grant 0
 
2018-01-31 9:57:00 139910114175744 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table w3_mdm.devices_actions_commands; Can't find record in 'devices_actions_commands', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 98348, Internal MariaDB error code: 1032
 
2018-01-31 9:57:00 139910114175744 [Warning] WSREP: RBR event 116 Delete_rows_v1 apply warning: 120, 12430471
 
2018-01-31 9:57:00 139910114175744 [ERROR] WSREP: Failed to apply trx: source: 897fbe8e-0608-11e8-b39d-aabb766c0c43 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 1089 trx_id: 1071493022 seqnos (l: 12691905, g: 12430471, s: 12430470, d: 12430455, ts: 1812714932596988)
 
2018-01-31 9:57:00 139910114175744 [ERROR] WSREP: Failed to apply trx 12430471 4 times
 
2018-01-31 9:57:00 139910114175744 [ERROR] WSREP: Node consistency compromised, aborting...

Comment by Jan Lindström (Inactive) [ 2020-11-19 ]

I would recommend upgrading server to more recent version of 10.2 (10.1 is not maintained anymore) and if problem reproduces, we would need full error logs from all nodes and some instructions how to reproduce.

Generated at Thu Feb 08 08:16:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.