[MDEV-20645] Replication consistency is broken as workers miss the error notification from an earlier failed group. Created: 2019-09-23  Updated: 2020-06-15  Resolved: 2019-09-30

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.1, 10.2, 10.3, 10.4
Fix Version/s: 10.2.28, 10.1.42, 10.3.19, 10.4.9

Type: Bug Priority: Major
Reporter: Sujatha Sivakumar (Inactive) Assignee: Sujatha Sivakumar (Inactive)
Resolution: Fixed Votes: 0
Labels: optimistic, parallelslave


 Description   

Enable parallel replication on the slave with slave_parallel_mode='optimistic'.

Table Structure:
CREATE TABLE t2 (a int PRIMARY KEY) ENGINE=InnoDB;

Execute the following DML operations on the master:

INSERT INTO t2 VALUES (32);
INSERT INTO t2 VALUES (33);
INSERT INTO t2 VALUES (34);

The three transactions above are scheduled for parallel execution on the slave.
The first insert fails on the slave with a duplicate key error. On this error the
remaining workers should abort, but the transaction inserting 34 gets committed.
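
The expected behaviour can be illustrated with a small toy model. This is only a sketch and not MariaDB code: the names `Outcome` and `apply_in_order` are hypothetical, the table is modelled as a `std::set<int>`, and it assumes the duplicate key arises because the slave already holds key 32 (the report does not say how the conflict was created). It shows the invariant that once an earlier transaction fails, no later one may commit.

```cpp
#include <set>
#include <vector>

// Toy model only, not MariaDB code. Each "transaction" inserts one key.
enum class Outcome { Committed, Failed, Aborted };

// Apply transactions in commit order. Once one of them fails, every later
// transaction is aborted instead of committed.
std::vector<Outcome> apply_in_order(std::set<int> &table,
                                    const std::vector<int> &keys) {
  std::vector<Outcome> result;
  bool earlier_failed = false;  // the error notification later workers must see
  for (int key : keys) {
    if (earlier_failed) {
      result.push_back(Outcome::Aborted);
      continue;
    }
    if (table.count(key) != 0) {  // duplicate primary key
      earlier_failed = true;
      result.push_back(Outcome::Failed);
      continue;
    }
    table.insert(key);
    result.push_back(Outcome::Committed);
  }
  return result;
}
```

With a table that already contains 32, applying the keys 32, 33, 34 in this model reports the first transaction as failed and the other two as aborted, and the table is left unchanged. The bug described here corresponds to the third transaction being committed anyway.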



 Comments   
Comment by Sujatha Sivakumar (Inactive) [ 2019-09-23 ]

Hello Andrei,

Please review the fix for MDEV-20645.

https://github.com/MariaDB/server/commit/e07caf401c26cf8144899336d103e4c7aafd3d7a

http://buildbot.askmonty.org/buildbot/grid?category=main&branch=bb-10.1-sujatha

Thank you.

Comment by Andrei Elkin [ 2019-09-24 ]

Sujatha, the patch looks nice, and thanks for a good piece of work!

I was only unhappy about cluttering up the code block with the 2nd
simulation

@@ -1096,6 +1102,13 @@ handle_rpl_parallel_thread(void *arg)
       bool did_enter_cond= false;
       PSI_stage_info old_stage;
 
+      DBUG_EXECUTE_IF("hold_worker_on_schedule", {
+          if (rgi->current_gtid.domain_id == 0 &&
+              rgi->current_gtid.seq_no == 100) {
+                  debug_sync_set_action(thd,
+                  STRING_WITH_LEN("now SIGNAL reached_pause WAIT_FOR continue_worker"));
+          }
+        });
       DBUG_EXECUTE_IF("rpl_parallel_scheduled_gtid_0_x_100", {
           if (rgi->current_gtid.domain_id == 0 &&
               rgi->current_gtid.seq_no == 100) {

which largely copies the 1st one. We could actually keep just one generic simulation block
and employ the statement format's user variables to carry various things to the worker,
such as the gtid and the reaction string. Eventually it would be something like this
pseudo-code:

    if ((event_type= qev->ev->get_type_code()) == GTID_EVENT)
    {
      ...
      DBUG_EXECUTE_IF("hold_worker_before_gco",
      {
        if (rgi->current_gtid.seq_no == "@gtid_for_hold_worker_before_gco")
          debug_sync_set_action(thd, "@action_for_hold_worker_before_gco");
      });
    }

I am just throwing in the idea, without urging that we discuss and implement it yet. We would certainly benefit from having this sort of simulation policy, although it is bound to the STATEMENT format a bit too much for my taste (a tentative feeling; should we have user variables logged along with Rows_log_events ...).
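
One way to picture the generic block is as a lookup keyed by the simulation name. The following is a minimal standalone sketch, not server code: `UserVars` and `action_for_hook` are hypothetical names, a plain `std::map` stands in for the user variables, and the variable naming scheme (`gtid_for_<hook>`, `action_for_<hook>`) simply mirrors the pseudo-code above.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for the user variables a test would set.
typedef std::map<std::string, std::string> UserVars;

// Hypothetical helper: if the variables named after `hook` select this
// seq_no, store the action string in *action_out and return true.
bool action_for_hook(const UserVars &vars, const std::string &hook,
                     std::uint64_t seq_no, std::string *action_out) {
  UserVars::const_iterator gtid = vars.find("gtid_for_" + hook);
  UserVars::const_iterator action = vars.find("action_for_" + hook);
  if (gtid == vars.end() || action == vars.end())
    return false;  // this simulation is not configured
  if (gtid->second != std::to_string(seq_no))
    return false;  // configured for a different GTID
  *action_out = action->second;
  return true;
}
```

In this sketch a single block could serve both existing simulations: the test would set the two variables for the hook it wants, and the worker would pass the returned string to `debug_sync_set_action`. How the variables would actually reach the worker is left open here, as it is in the comment.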

Comment by Sujatha Sivakumar (Inactive) [ 2019-09-30 ]

The fix for the issue is implemented in 10.1.42.

The patch has been tested on higher versions.

10.2 patch: There is a minor change in the test for 10.2. The 'enable_connect_log' and 'disable_connect_log' commands are not required, so they are removed.

https://github.com/MariaDB/server/commit/62c05dd14a37b7c4dff3bf9069eca6dd1deb9235

10.3 patch:
https://github.com/MariaDB/server/commit/7d7d741cc33fb51a0f3f226728769255d4e73c1e

10.4 patch:
https://github.com/MariaDB/server/commit/fa0fc3e38af44b685872b7846beb631999ea01b5
