[MDEV-22766] MariaDB 10.4.12 doesn't syncronize: Slave SQL: Could not execute Write_rows_v1 event on table Created: 2020-06-01  Updated: 2022-01-19  Resolved: 2022-01-03

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: 10.2.31, 10.3.22, 10.4.12, 10.4.19
Fix Version/s: 10.4.22

Type: Bug Priority: Critical
Reporter: Nikita Karpin Assignee: Seppo Jaakola
Resolution: Fixed Votes: 6
Labels: None
Environment:

ubuntu 18.04, mariadb galera 3-nodes cluster


Attachments: File logs_with_debug.tar.gz    

 Description   

Steps to reproduce:

1. Run reproduce script (if the issue was not reproduced after first run of the script, drop database tests and try to run script again):

#!/bin/bash
 
set -x
 
mysql_cmd="mysql -h $1 -u $2 -p$3" 
 
$mysql_cmd -e "create database tests;"
 
$mysql_cmd tests << EOF
 
CREATE TABLE test1 (
  id int(11) NOT NULL AUTO_INCREMENT,
  environments varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  test_start datetime NOT NULL,
  test_end datetime DEFAULT NULL,
  test_status varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
  cmd varchar(2000) COLLATE utf8_unicode_ci DEFAULT NULL,
  scheduled_date date DEFAULT NULL,
  scheduled_time time DEFAULT NULL,
  status_message varchar(500) COLLATE utf8_unicode_ci DEFAULT '',
  PRIMARY KEY (id,environments,test_start)
) DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
 
CREATE TABLE test2 (
  test_id int(11) NOT NULL,
  environment varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (test_id,environment),
  CONSTRAINT fk_test_env FOREIGN KEY (test_id) REFERENCES test1 (id)
) DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
 
EOF
 
for j in {1..9}; do
    for i in {10..50}; do
        c_start="2020-05-27 0${j}:${i}:05"
        c_end="2020-05-27 0${j}:${i}:21"
        $mysql_cmd tests -e "INSERT INTO test1(environments,test_start,test_end,test_status,cmd)  VALUES ('[\'test1a\']', '${c_start}', NULL, 'STARTED', '[\'testservices\']');"
        id=$($mysql_cmd -s tests -e "SELECT id from test1 WHERE test_start = '${c_start}';" | tail -1)
        $mysql_cmd tests -e "INSERT INTO test2 (test_id, environment) VALUES(((${id})), 'test1a');"
    done
done

Expected result:
All data is written successfully, all mariadb nodes are synced and in primary state in galera cluster
Actual result:
All data is written successfully, 1 or 2 mariadb nodes have state non-primary
Node(s) which dropped from cluster has following in logs:

2020-06-01 14:50:35 12 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table tests.test2; Cannot add or update a child row: a foreign key constraint fails (`tests`.`test2`, CONSTRAINT `fk_test_env` FOREIGN KEY (`test_id`) REFERENCES `test1` (`id`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 174, Internal MariaDB error code: 1452
2020-06-01 14:50:35 12 [Note] WSREP: ha_rollback_trans(12, FALSE) rolled back: INSERT INTO test2 (test_id, environment) VALUES(((92)), 'test1a')\BB\D5^: Cannot add or update a child row: a foreign key constraint fails (`tests`.`test2`, CONSTRAINT `fk_test_env` FOREIGN KEY (`test_id`) REFERENCES `test1` (`id`)); is_real 0
 
2020-06-01 14:50:35 15 [Note] InnoDB:  SQL1: INSERT INTO test2 (test_id, environment) VALUES(((95)), 'test1a')\BB\D5^ SQL2: NULL
2020-06-01 14:50:35 15 [Note] InnoDB: BF conflict, modes: 1026:1059 idx: PRIMARY table: tests/test1 n_uniq: 3 n_user: 3 SQL1: INSERT INTO test2 (test_id, environment) VALUES(((95)), 'test1a')\BB\D5^ SQL2: NULL

It was found that when setting wsrep_slave_threads variable to 1, the issue cannot be reproduced.

Detailed logs are attached. They contain script for reproduce as well as config, variables and status from all 3 nodes.

UPDATE:

On maria 10.2.31, often such conflicts are resolved by retrying (which is absent in maria 10.4.12), and galera node remains in cluster. This is fragment of error and retry on maria 10.2.31:

2020-06-02  8:55:02 139866759632640 [Note] WSREP: Set WSREPXid for InnoDB:  648b7d16-a49f-11ea-9698-9e3da8a1d73f:251
2020-06-02  8:55:02 139866760247040 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table tests.test2; Cannot add or update a child row: a foreign key constraint fails (`tests`.`test2`, CONSTRAINT `fk_test_env` FOREIGN KEY (`test_id`) REFERENCES `test1` (`id`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 207, Internal MariaDB error code: 1452
2020-06-02  8:55:02 139866760247040 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 254
2020-06-02  8:55:02 139866760247040 [Warning] WSREP: Failed to apply app buffer: seqno: 254, status: 1 at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
2020-06-02  8:55:02 139866760247040 [Note] InnoDB: BF-BF lock conflict, locking: 1 
RECORD LOCKS space id 101 page no 3 n bits 72 index PRIMARY of table "tests"."test1" trx id 6308 lock_mode X locks rec but not gap 
Record lock, heap no 2 PHYSICAL RECORD: n_fields 11; compact format; info bits 0
0: len 4; hex 80000002; asc     ;;
1: len 10; hex 5b27737475623161275d; asc ['test1a'];;
2: len 5; hex 99a6848daf; asc      ;;
3: len 6; hex 0000000018a4; asc       ;;
4: len 7; hex fe000002020110; asc        ;;
5: SQL NULL;
6: len 7; hex 53544152544544; asc STARTED;;
7: len 12; hex 5b277365727669636573275d; asc ['services'];;
8: SQL NULL;
b' 9: SQL NULL;
10: len 0; hex ; asc ;;
 
2020-06-02  8:55:02 139866760247040 [Note] InnoDB:  SQL1: INSERT INTO test2 (test_id, environment) VALUES(((2)), 'test1a')\xe4\x13\xd6^\x13\x01
2020-06-02  8:55:02 139866760247040 [Note] InnoDB:  SQL2: NULL
2020-06-02  8:55:02 139866760247040 [Note] WSREP: waiting for BF, trx order: 254 253
 
2020-06-02  8:55:02 139866760247040 [Note] InnoDB: BF conflict, modes: 1026:1059 idx: PRIMARY table: tests/test1 n_uniq: 3 n_user: 3'
2020-06-02  8:55:02 139866760247040 [Note] InnoDB:  SQL1: INSERT INTO test2 (test_id, environment) VALUES(((2)), 'test1a')\xe4\x13\xd6^\x13\x01
2020-06-02  8:55:02 139866760247040 [Note] InnoDB:  SQL2: NULL
2020-06-02  8:55:02 139866760554240 [Note] InnoDB: BF-BF lock conflict, locking: 1
RECORD LOCKS space id 101 page no 3 n bits 72 index PRIMARY of table "tests"."test1" trx id 6310 lock mode S locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 11; compact format; info bits 0
0: len 4; hex 80000002; asc     ;;
1: len 10; hex 5b27737475623161275d; asc ['test1a'];;

However in some cases, retries are unsuccessful, and node crashes:

2020-06-02  9:21:25 139643550910208 [Note] InnoDB:  SQL2: NULL
2020-06-02  9:21:25 139643950552832 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table tests.test2; Cannot add or update a child row: a foreign key constraint fails (`tests`.`test2`, CONSTRAINT `fk_test_env` FOREIGN KEY (`test_id`) REFERENCES `test1` (`id`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 176, Internal MariaDB error code: 1452
2020-06-02  9:21:25 139643950552832 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 6050
2020-06-02  9:21:25 139643950552832 [Warning] WSREP: Failed to apply app buffer: seqno: 6050, status: 1
\t at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
2020-06-02  9:21:25 139643950552832 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table tests.test2; Cannot add or update a child row: a foreign key constraint fails (`tests`.`test2`, CONSTRAINT `fk_test_env` FOREIGN KEY (`test_id`) REFERENCES `test1` (`id`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 176, Internal MariaDB error code: 1452
2020-06-02  9:21:25 139643950552832 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 6050
2020-06-02  9:21:25 139643950552832 [Warning] WSREP: Failed to apply app buffer: seqno: 6050, status: 1
\t at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2020-06-02  9:21:25 139643950552832 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table tests.test2; Cannot add or update a child row: a foreign key constraint fails (`tests`.`test2`, CONSTRAINT `fk_test_env` FOREIGN KEY (`test_id`) REFERENCES `test1` (`id`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 176, Internal MariaDB error code: 1452
2020-06-02  9:21:25 139643950552832 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 6050
2020-06-02  9:21:25 139643950552832 [Warning] WSREP: Failed to apply app buffer: seqno: 6050, status: 1
\t at galera/src/trx_handle.cpp:apply():351'
Retrying 4th time'
2020-06-02  9:21:25 139643950552832 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table tests.test2; Cannot add or update a child row: a foreign key constraint fails (`tests`.`test2`, CONSTRAINT `fk_test_env` FOREIGN KEY (`test_id`) REFERENCES `test1` (`id`)), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 176, Internal MariaDB error code: 1452
2020-06-02  9:21:25 139643950552832 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 6050
2020-06-02  9:21:25 139643950552832 [ERROR] WSREP: Failed to apply trx: source: 98174c9b-a4a8-11ea-904d-9f9eb4b1de70 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 10274 trx_id: 17928 seqnos (l: 3651, g: 6050, s: 6049, d: 5935, ts: 1562760915024458)
2020-06-02  9:21:25 139643950552832 [ERROR] WSREP: Failed to apply trx 6050 4 times
2020-06-02  9:21:25 139643950552832 [ERROR] WSREP: Node consistency compromized, aborting...



 Comments   
Comment by Mario Karuza (Inactive) [ 2020-06-05 ]

Hi Nikita,

We were able to reproduce problem with provided scripts.
It looks this is FK applying problem and we have already seen them before.Current solution is to set cert.optimistic_pa=false in galera parameters.

You can check what it does here: https://galeracluster.com/library/documentation/galera-parameters.html#cert-optimistic-pa

Comment by Nikita Karpin [ 2020-06-11 ]

Hi Mario,

thanks with setting this parameter to false, the issue cannot be reproduced anymore,
Could you please also clarify if with this setting some performance degradation is expected?

Comment by Mario Karuza (Inactive) [ 2020-06-14 ]

Hi Nikita,

Yes there will be performance penalty when using this parameter.

Comment by Nikita Karpin [ 2020-08-10 ]

Hi, could somebody please explain why this issue was closed as not a bug ?

Comment by Jean-Louis Dupond [ 2020-11-06 ]

Not clear why this was closed indeed! Hitting the same issue here.

Comment by Seppo Jaakola [ 2020-11-06 ]

Re-opened to verify that the bug does not appear anymore with 4.17 and later
Backporting the fix to pre- 10.4 will be hard due to wsrep API change (API 26 needed by the fix, 10.3 has wsrep API 25)

Comment by Seppo Jaakola [ 2021-04-21 ]

Tested the reproducring test script on 10.4.19 (development) build, and cannot reproduce the issue anymore

Comment by Mario Karuza (Inactive) [ 2021-07-08 ]

As mention in previous comment using cert.optimistic_pa=false should be used to prevent this kind of problem. This is not set by default as it comes with performance penalty.

Comment by Nikita Karpin [ 2021-07-08 ]

Ok, and what about 10.4.19 mariadb was it fixed here? Or i need cert.optimistic_pa on this release to?

Comment by Pim Rupert [ 2021-07-08 ]

> Ok, and what about 10.4.19 mariadb was it fixed here? Or i need cert.optimistic_pa on this release to?

We are still using cert.optimistic_pa=false on 10.4.19 to be sure. Will only enable back again once we have absolute confirmation there is no chance of repeat issues.

Comment by Mario Karuza (Inactive) [ 2021-07-09 ]

This is galera library configuration variable - https://galeracluster.com/library/documentation/galera-parameters.html#cert-optimistic-pa

It should be there in 10.4 version.

Comment by Loïc Stevens [ 2021-12-15 ]

We hit this bug too and were able to reproduce with the provided script in 10.4.21. MariaDB support asked us to set cert.optimistic_pa to false. We don't have issue ever since in that version.

We've been asked to upgrade to 10.4.22, we will maybe try to remove that parameter and see what happens.

Comment by Seppo Jaakola [ 2022-01-03 ]

Cannot reproduce the issue anymore with the given script

Generated at Thu Feb 08 09:17:18 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.