[MDEV-33303] slave_parallel_mode=optimistic should not report the mode's specific temporary errors Created: 2024-01-24  Updated: 2024-01-25

Status: Confirmed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.5.16
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: liujia Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: replication, slave_parallel_mode
Environment:

centos7 x86_64


Issue Links:
Blocks
blocks MDEV-20628 If temporary error occurs with optimi... Closed

 Description   

I have server mariadb-10.5.16 with slave_parallel_mode=optimistic and see errors in log:
```
2024-01-24 8:00:46 57 [ERROR] mysqld: Can't find record in 'test'
2024-01-24 10:40:55 56 [ERROR] mysqld: Can't find record in 'test'
2024-01-24 11:24:48 56 [ERROR] mysqld: Can't find record in 'test'
2024-01-24 14:44:33 55 [ERROR] mysqld: Can't find record in 'test'
2024-01-24 15:11:35 58 [ERROR] mysqld: Can't find record in 'test'
2024-01-24 15:11:51 57 [ERROR] mysqld: Can't find record in 'test'
2024-01-24 15:11:51 57 [ERROR] mysqld: Can't find record in 'test'
2024-01-24 15:33:25 57 [ERROR] mysqld: Can't find record in 'test'
```

how to repeat:
```
drop database if exists my_table_test;
create database if not exists my_table_test;
use my_table_test;
drop table if exists test;
create table test(a int, b varchar(123));
insert into test values(1, 'asdf');
update test set b='zxf1' where a=1;
update test set b='\n' where a=1;
```

Replication is not crash and still works.
```
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: liujia2
Master_Port: 3633
Connect_Retry: 60
Master_Log_File: mysql-bin.000052
Read_Master_Log_Pos: 219035686
Relay_Log_File: mysql-develop-0001-relay-bin.000104
Relay_Log_Pos: 168065139
Relay_Master_Log_File: mysql-bin.000052
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 219035209
Relay_Log_Space: 570350217
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 0-1-113030212
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Slave_DDL_Groups: 8
Slave_Non_Transactional_Groups: 2
Slave_Transactional_Groups: 113030236
```



 Comments   
Comment by Kristian Nielsen [ 2024-01-24 ]

Ouch, this is not supposed to occur. Thanks for the report, confirmed that this can occur with the very simple test provided.

Here is a mysql-test-run testcase that reproduces, the below patch makes it easier to trigger:

--source include/have_innodb.inc
--source include/have_binlog_format_row.inc
--source include/master-slave.inc
 
--connection slave
--source include/stop_slave.inc
SET @old_retry=@@GLOBAL.slave_transaction_retries;
SET GLOBAL slave_parallel_threads=4;
SET @old_mode= @@GLOBAL.slave_parallel_mode;
SET GLOBAL slave_parallel_mode= optimistic;
 
--connection master
drop database if exists my_table_test;
create database if not exists my_table_test;
use my_table_test;
drop table if exists test;
create table test(a int, b varchar(123)) ENGINE=InnoDB;
SET gtid_seq_no= 100;
insert into test values(1, 'asdf');
update test set b='zxf1' where a=1;
update test set b='\n' where a=1;
 
--connection slave
--source include/start_slave.inc
 
--connection master
--sync_slave_with_master
 
--connection master
DROP DATABASE my_table_test;
 
--connection slave
--source include/stop_slave.inc
SET GLOBAL slave_transaction_retries= @old_retry;
SET GLOBAL slave_parallel_threads=0;
SET GLOBAL slave_parallel_mode= @old_mode;
--source include/start_slave.inc
 
--source include/rpl_end.inc

diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
index 13a981d4381..426d4ee481e 100644
--- a/sql/log_event_server.cc
+++ b/sql/log_event_server.cc
@@ -7325,6 +7325,7 @@ Rows_log_event::write_row(rpl_group_info *rgi,
     TODO: Add safety measures against infinite looping. 
    */
 
+my_sleep(20000);
   if (table->s->sequence)
     error= update_sequence();
   else while (unlikely(error= table->file->ha_write_row(table->record[0])))

Comment by Kristian Nielsen [ 2024-01-24 ]

This was supposed to be fixed in MDEV-13577.
However, the fix for that overlooked the case where the table has no unique (or primary) key. Then the missing row is caught from an index/table scan returning the error HA_ERR_END_OF_FILE (the patch for MDEV-13577 handled only an index lookup that returns HA_ERR_KEY_NOT_FOUND).

This patch seems to fix the bug and avoid the errorneous messages in the error log:

diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
index 13a981d4381..28b1dd13658 100644
--- a/sql/log_event_server.cc
+++ b/sql/log_event_server.cc
@@ -7814,6 +7815,12 @@ static int row_not_found_error(rpl_group_info *rgi)
          ? HA_ERR_KEY_NOT_FOUND : HA_ERR_RECORD_CHANGED;
 }
 
+static int end_of_file_error(rpl_group_info *rgi)
+{
+  return rgi->speculation != rpl_group_info::SPECULATE_OPTIMISTIC
+         ? HA_ERR_END_OF_FILE : HA_ERR_RECORD_CHANGED;
+}
+
 /**
   Locate the current row in event's table.
 
@@ -8061,6 +8068,8 @@ int Rows_log_event::find_row(rpl_group_info *rgi)
       while ((error= table->file->ha_index_next(table->record[0])))
       {
         DBUG_PRINT("info",("no record matching the given row found"));
+        if (error == HA_ERR_END_OF_FILE)
+          error= end_of_file_error(rgi);
         table->file->print_error(error, MYF(0));
         table->file->ha_index_end();
         goto end;
@@ -8097,6 +8106,7 @@ int Rows_log_event::find_row(rpl_group_info *rgi)
         break;
 
       case HA_ERR_END_OF_FILE:
+        error= end_of_file_error(rgi);
         DBUG_PRINT("info", ("Record not found"));
         table->file->ha_rnd_end();
         goto end;

Comment by liujia [ 2024-01-25 ]

I used sysbench to reproduce the error.
This is the content of the sysbench script, named read_only.lua.
```
require("oltp_common")

function prepare_statements()
end

function event()
con:query("update test set b='zxf1' where a=1;")
con:query("update test set b='\n' where a=1;")
end
```

This is the sysbench command.
```
sysbench ./read_only.lua --db-driver=mysql --mysql-host=127.0.0.1 --mysql-port=3633 --mysql-user=root --mysql-password=123456 --mysql-db=my_table_test --threads=10 --time=0 --report-interval=1 run > my_table_test.log 2>&1 &
```

Run the sysbench command to quickly reproduce the problem.

Generated at Thu Feb 08 10:37:54 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.