Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-31448

Killing a replica thread awaiting its GCO can hang/crash a parallel replica

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • 10.4(EOL), 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL)
    • 10.4.31
    • Replication

    Description

      Killing a replica thread awaiting its GCO can hang/crash a parallel replica

      If any transactions have started their commit phase while a replica thread that is waiting on its GCO to start is killed, the parallel slave will hang (non-debug) or crash in an assertion error (debug) if the actively committing transactions error. This is because the killed replica thread will perform GCO cleanup on the previous GCO while it has not finished, leading to the same outcome as MDEV-30780.

      For example, on a replica with three worker threads, if we have three transactions, T1, T2, and T3, grouped into two GCOs as GCO1

      {T1, T2}

      and GCO2

      {T3}

      such that T1 is executing, T2 is ready and queued for group commit, and T3 is waiting for its GCO to start, if T3 is killed it will perform GCO cleanup on GCO1 even though T1 and T2 are still active. The following MTR test shows this:

      --source include/master-slave.inc
      --source include/have_innodb.inc
      --source include/have_debug.inc
      --source include/have_binlog_format_row.inc
       
      --echo #
      --echo # Initialize test data
      --connection master
      create table t1 (a int) engine=innodb;
      create table t2 (a int) engine=innodb;
      insert into t1 values (1);
      --source include/save_master_gtid.inc
       
      --connection slave
      --source include/sync_with_master_gtid.inc
      --source include/stop_slave.inc
      --let $save_innodb_lock_wait_timeout= `SELECT @@global.innodb_lock_wait_timeout`
      --let $save_transaction_retries= `SELECT @@global.slave_transaction_retries`
      set @@global.slave_parallel_threads= 3;
      set @@global.slave_parallel_mode= CONSERVATIVE;
      set @@global.innodb_lock_wait_timeout= 2;
      set @@global.slave_transaction_retries= 0;
      BEGIN;
      SELECT * FROM t1 WHERE a=1 FOR UPDATE;
       
      --connection master
      SET @old_dbug= @@SESSION.debug_dbug;
      SET @@SESSION.debug_dbug="+d,binlog_force_commit_id";
       
      # GCO 1
      SET @commit_id= 10000;
      # T1
      update t1 set a=2 where a=1;
      # T2
      insert into t2 values (1);
       
      # GCO 2
      SET @commit_id= 10001;
      # T3
      insert into t1 values (3);
       
      --connection slave
      --source include/start_slave.inc
       
      --let $wait_condition= SELECT count(*)=1 FROM information_schema.processlist WHERE state LIKE 'Update_rows_log_event::find_row(-1)' and  command LIKE 'Slave_worker';
      --source include/wait_condition.inc
      --let $wait_condition= SELECT count(*)=1 FROM information_schema.processlist WHERE state LIKE 'Waiting for prior transaction to commit%' and  command LIKE 'Slave_worker';
      --source include/wait_condition.inc
      --let $wait_condition= SELECT count(*)=1 FROM information_schema.processlist WHERE state LIKE 'Waiting for prior transaction to start commit%' and  command LIKE 'Slave_worker';
      --source include/wait_condition.inc
       
      --let $t3_tid= `SELECT ID FROM INFORMATION_SCHEMA.PROCESSLIST WHERE STATE LIKE 'Waiting for prior transaction to start commit'`
      --eval kill $t3_tid
       
      --echo #
      --echo # Cleanup
      --connection master
      DROP TABLE t1, t2;
      --source include/save_master_gtid.inc
       
      --connection slave
      --source include/sync_with_master_gtid.inc
       
      --source include/rpl_end.inc
      

      Attachments

        Issue Links

          Activity

            The fix for this (in the MDEV-13915 patch) is not correct.
            The real problem seems to be an incorrect unmark_start_commit() in signal_error_to_sql_driver_thread(). This unmark can run after the following GCO has started, and then a later mark_start_commit() will access freed GCO.

            I have implemented what I think is a correct fix here, in my knielsen_faster_stop_slave branch:

            https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave
            https://github.com/MariaDB/server/commit/6ce9c839997c1fc78c2989540dd21155a96fb419

            knielsen Kristian Nielsen added a comment - The fix for this (in the MDEV-13915 patch) is not correct. The real problem seems to be an incorrect unmark_start_commit() in signal_error_to_sql_driver_thread(). This unmark can run after the following GCO has started, and then a later mark_start_commit() will access freed GCO. I have implemented what I think is a correct fix here, in my knielsen_faster_stop_slave branch: https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave https://github.com/MariaDB/server/commit/6ce9c839997c1fc78c2989540dd21155a96fb419
            Elkin Andrei Elkin added a comment - - edited

            knielsen, a test that catches a gco assert in finish_event_group is attached.
            The test is to run on the bare 10.5. (The description test covers a similar issue in the conservative mode; I am finishing its reviewing shortly).
            As a summary (let me place it here; I am repeating that on github pages as well) from the attached one,
            an immediate cause of the killed worker behavior is its unlinking from the commit parent, very much similarly to MDEV-30780.
            MDEV-13915 fixes that to return from wait_for_prior_commit 's killed-but-no-commit-started branch with the parent link intact (as of now, June 15, this is improved to cover all kill target phases).
            And the 2nd protection from OOO is a force_wait flag that the killed T2 (of the test) worker employs inside finish_event_group().

            Elkin Andrei Elkin added a comment - - edited knielsen , a test that catches a gco assert in finish_event_group is attached. The test is to run on the bare 10.5. (The description test covers a similar issue in the conservative mode; I am finishing its reviewing shortly). As a summary (let me place it here; I am repeating that on github pages as well) from the attached one, an immediate cause of the killed worker behavior is its unlinking from the commit parent, very much similarly to MDEV-30780 . MDEV-13915 fixes that to return from wait_for_prior_commit 's killed-but-no-commit-started branch with the parent link intact ( as of now, June 15, this is improved to cover all kill target phases ). And the 2nd protection from OOO is a force_wait flag that the killed T2 (of the test) worker employs inside finish_event_group() .
            Elkin Andrei Elkin added a comment -

            A fixes commit is pushed to bb-10.5-andrei.

            Elkin Andrei Elkin added a comment - A fixes commit is pushed to bb-10.5-andrei.

            As we discussed on Zulip, there are two different approaches here to select between.

            1. Force each event group to wait for the prior event group to commit or fail, even in case of kill. This is the approach that Andrei's patch is taking.

            2. Allow a killed wait_for_prior_commit() to abort the thread out-of-order. Fix finish_event_group() so that it works correctly in the error case where it may complete out-of-order with earlier transactions. I have pushed a patch for this to my branch knielsen_faster_stop_slave: https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave which passes the various tests.

            I think both approaches have merit, still thinking to decide which I prefer.

            knielsen Kristian Nielsen added a comment - As we discussed on Zulip, there are two different approaches here to select between. 1. Force each event group to wait for the prior event group to commit or fail, even in case of kill. This is the approach that Andrei's patch is taking. 2. Allow a killed wait_for_prior_commit() to abort the thread out-of-order. Fix finish_event_group() so that it works correctly in the error case where it may complete out-of-order with earlier transactions. I have pushed a patch for this to my branch knielsen_faster_stop_slave: https://github.com/MariaDB/server/commits/knielsen_faster_stop_slave which passes the various tests. I think both approaches have merit, still thinking to decide which I prefer.

            Pushed to 10.4

            knielsen Kristian Nielsen added a comment - Pushed to 10.4

            People

              knielsen Kristian Nielsen
              bnestere Brandon Nesterenko
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.