Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-7818

Deadlock occurring with parallel replication and FTWRL

Details

    • 10.0.22, 10.1.9-1, 10.1.9-2

    Description

      Thread 5 is locked by global read lock but FTWRL never completes and stays in waiting for commit lock; replication thread stays locked forever until FTWRL is killed

      MariaDB(db-02)[(none)]> show processlist\G
      *************************** 1. row ***************************
            Id: 3
          User: system user
          Host: 
            db: NULL
       Command: Connect
          Time: 12612
         State: Waiting for prior transaction to start commit before starting next transaction
          Info: NULL
      Progress: 0.000
      *************************** 2. row ***************************
            Id: 4
          User: system user
          Host: 
            db: NULL
       Command: Connect
          Time: 12612
         State: Waiting for prior transaction to commit
          Info: NULL
      Progress: 0.000
      *************************** 3. row ***************************
            Id: 5
          User: system user
          Host: 
            db: crypto_data
       Command: Connect
          Time: 12612
         State: Waiting for global read lock
          Info: INSERT INTO `market_orders_ok` (`label`, `marketid`, `ordertype`, `price`, `quantity`, `total`) VALU
      Progress: 0.000
      *************************** 4. row ***************************
            Id: 6
          User: system user
          Host: 
            db: NULL
       Command: Connect
          Time: 12612
         State: Waiting for prior transaction to start commit before starting next transaction
          Info: NULL
      Progress: 0.000
      *************************** 5. row ***************************
            Id: 7
          User: system user
          Host: 
            db: NULL
       Command: Connect
          Time: 1651381
         State: Waiting for master to send event
          Info: NULL
      Progress: 0.000
      *************************** 6. row ***************************
            Id: 8
          User: system user
          Host: 
            db: NULL
       Command: Connect
          Time: 13263
         State: Waiting for room in worker thread event queue
          Info: NULL
      Progress: 0.000
      *************************** 7. row ***************************
            Id: 39
          User: monitoring
          Host: localhost
            db: NULL
       Command: Sleep
          Time: 59
         State: 
          Info: NULL
      Progress: 0.000
      *************************** 8. row ***************************
            Id: 1568378
          User: tanje6acfc0c1213cfb63
          Host: localhost
            db: NULL
       Command: Sleep
          Time: 3
         State: 
          Info: NULL
      Progress: 0.000
      *************************** 9. row ***************************
            Id: 27871929
          User: backup
          Host: localhost
            db: NULL
       Command: Query
          Time: 12612
         State: Waiting for commit lock
          Info: FLUSH TABLES WITH READ LOCK
      Progress: 0.000
      *************************** 10. row ***************************
            Id: 28082858
          User: tanje6acfc0c1213cfb63
          Host: localhost
            db: NULL
       Command: Query
          Time: 0
         State: init
          Info: show processlist
      Progress: 0.000
      10 rows in set (0.00 sec)

      Attachments

        Issue Links

          Activity

            Current idea:

            When FTWRL starts, it first checks all parallel replication worker
            threads. It finds the most recent GTID started by any of them. It then sets
            a flag to tell the threads not to start on any newer GTIDs, and then waits
            for all earlier GTIDs to fully commit. It also sets a flag to tell START
            SLAVE, STOP SLAVE, and the SQL thread to not start any new slave activity.

            Once all worker threads have reached their designated point, FTWLR continues
            to take the global read lock. Once that is obtained, it clears the flags and
            signals worker threads and other slave code that it can proceed. At this
            point, the lock is held, so no real activity will be possible until the lock
            is cleared with UNLOCK TABLES.

            This should hopefully fix the deadlock, at least I got the test case of
            Elena to pass with a preliminary patch along these lines.

            Some care will probably be needed to guard against other deadlocks against
            concurrent START SLAVE / STOP SLAVE, hopefully I can get that solved.

            knielsen Kristian Nielsen added a comment - Current idea: When FTWRL starts, it first checks all parallel replication worker threads. It finds the most recent GTID started by any of them. It then sets a flag to tell the threads not to start on any newer GTIDs, and then waits for all earlier GTIDs to fully commit. It also sets a flag to tell START SLAVE, STOP SLAVE, and the SQL thread to not start any new slave activity. Once all worker threads have reached their designated point, FTWLR continues to take the global read lock. Once that is obtained, it clears the flags and signals worker threads and other slave code that it can proceed. At this point, the lock is held, so no real activity will be possible until the lock is cleared with UNLOCK TABLES. This should hopefully fix the deadlock, at least I got the test case of Elena to pass with a preliminary patch along these lines. Some care will probably be needed to guard against other deadlocks against concurrent START SLAVE / STOP SLAVE, hopefully I can get that solved.
            knielsen Kristian Nielsen added a comment - - edited

            I have a fix for this now (that just removes the deadlock, without breaking
            existing backup tools or otherwise changing user-visible behaviour).

            It is in three parts. The actual fix is the second patch. The first patch is
            just unrelated refactoring of the parallel replication code to make the
            second patch cleaner. The third patch is a test case.

            http://lists.askmonty.org/pipermail/commits/2015-June/008059.html
            http://lists.askmonty.org/pipermail/commits/2015-June/008060.html
            http://lists.askmonty.org/pipermail/commits/2015-June/008061.html

            Elena: You asked some days ago - this patch should be ready, if you want to
            apply it in further testing.

            EDIT: Updated with newest version of patches after fixing MDEV-8318.

            knielsen Kristian Nielsen added a comment - - edited I have a fix for this now (that just removes the deadlock, without breaking existing backup tools or otherwise changing user-visible behaviour). It is in three parts. The actual fix is the second patch. The first patch is just unrelated refactoring of the parallel replication code to make the second patch cleaner. The third patch is a test case. http://lists.askmonty.org/pipermail/commits/2015-June/008059.html http://lists.askmonty.org/pipermail/commits/2015-June/008060.html http://lists.askmonty.org/pipermail/commits/2015-June/008061.html Elena: You asked some days ago - this patch should be ready, if you want to apply it in further testing. EDIT: Updated with newest version of patches after fixing MDEV-8318 .

            See MDEV-8318. I made it a separate report not to pollute this one too much.

            elenst Elena Stepanova added a comment - See MDEV-8318 . I made it a separate report not to pollute this one too much.

            Review of MDEV-7818: Deadlock occurring with parallel replication and FTWRL

            http://lists.askmonty.org/pipermail/commits/2015-June/008059.html
            In this patch you added:

            +/*
            + Do not start parallel execution of this event group until all prior groups
            + have reached the commit phase that are not safe to run in parallel with.
            +*/
            +static bool
            +do_gco_wait(rpl_group_info *rgi, group_commit_orderer *gco,
            + bool *did_enter_cond, PSI_stage_info *old_stage)

            You did however remove this comment:

            • /*
            • Register ourself to wait for the previous commit, if we need to do
            • such registration and that previous commit has not already
            • occured.
              -
            • Also do not start parallel execution of this event group until all
            • prior groups have reached the commit phase that are not safe to run
            • in parallel with.
            • */

            Why did you remove the first part of the comment?
            Is that not true anymore?

            In sql/rpl_parallel.h you added:

            + /*
            + The largest sub_id that has started its transaction. Protected by
            + LOCK_parallel_entry.
            +
            + (Transactions can start out-of-order, so this value signifies that no
            + transactions with larger sub_id have started, but not necessarily that all
            + transactions with smaller sub_id have started).
            + */
            + uint64 largest_started_sub_id;

            But this is not used anywhere.

            http://lists.askmonty.org/pipermail/commits/2015-June/008060.html

            +static void
            +do_ftwrl_wait(rpl_group_info *rgi,
            + bool *did_enter_cond, PSI_stage_info *old_stage)
            +{
            + THD *thd= rgi->thd;
            + rpl_parallel_entry *entry= rgi->parallel_entry;
            + uint64 sub_id= rgi->gtid_sub_id;
            +
            + mysql_mutex_assert_owner(&entry->LOCK_parallel_entry);

            Would be good to have a comment for the following test, like:

            /*
            If ftwrl is activate, wait until ftwrl is finnished if we are a
            new transaction that started after the ftwrl command was given
            */

            Note that if you would store MAX_INT in entry->pause_sub_id when it's not
            active, then the following test would be much easier:

            + if (unlikely(entry->pause_sub_id > 0) && sub_id > entry->pause_sub_id)
            + {
            + thd->ENTER_COND(&entry->COND_parallel_entry, &entry->LOCK_parallel_entry,
            + &stage_waiting_for_ftwrl, old_stage);

            As you don't have an EXIT_COND() matching the above ENTER_COND, please
            add a comment that EXIT_COND() will be called by the caller
            (handle_rpl_parallel_thread())

            <cut>

            Wouldn't a better name for the following function be:
            rpl_unpause_after_ftwrl(THD *thd) ?

            void
            +rpl_unpause_for_ftwrl(THD *thd)
            +{
            + uint32 i;
            + rpl_parallel_thread_pool *pool= &global_rpl_thread_pool;
            +
            + DBUG_ASSERT(pool->busy);
            +
            + for (i= 0; i < pool->count; ++i)
            + {
            + rpl_parallel_entry *e;
            + rpl_parallel_thread *rpt= pool->threads[i];
            +
            + mysql_mutex_lock(&rpt->LOCK_rpl_thread);
            + if (!rpt->current_owner)
            +

            { + mysql_mutex_unlock(&rpt->LOCK_rpl_thread); + continue; Can the above thread ever have e->pause_sub_id != 0 ? For example if the thread did have a current_owner during pause and not an owner now ? + }

            + e= rpt->current_entry;
            + mysql_mutex_lock(&e->LOCK_parallel_entry);
            + mysql_mutex_unlock(&rpt->LOCK_rpl_thread);
            + e->pause_sub_id= 0;
            + mysql_cond_broadcast(&e->COND_parallel_entry);

            Don't you need to unlock mysql_mutex_lock(&e->LOCK_parallel_entry) here ?

            + }

            + mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
            + pool_mark_not_busy(pool);
            + mysql_cond_broadcast(&pool->COND_rpl_thread_pool);
            + mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool);

            As pool_mark_not_busy() is already doing the broadcast, you can remove
            it from above.

            If we always have to take the above mutexes when calling pool_mark_not_busy()
            it would be better if we move the lock and unlock inside pool_mark_not_busy()

            <cut>
            +rpl_pause_for_ftwrl(THD *thd)
            +{
            <cut>

            + for (i= 0; i < pool->count; ++i)
            + {
            + PSI_stage_info old_stage;
            + rpl_parallel_entry *e;
            + rpl_parallel_thread *rpt= pool->threads[i];
            +
            + mysql_mutex_lock(&rpt->LOCK_rpl_thread);
            + if (!rpt->current_owner)
            +

            { + mysql_mutex_unlock(&rpt->LOCK_rpl_thread); + continue; + }

            + e= rpt->current_entry;
            + mysql_mutex_lock(&e->LOCK_parallel_entry);
            + mysql_mutex_unlock(&rpt->LOCK_rpl_thread);
            + ++e->need_sub_id_signal;
            + if (!e->pause_sub_id)
            + e->pause_sub_id= e->largest_started_sub_id;

            Why the above test?
            In which case can pause_sub_id be != 0 here ?
            If it's != 0, what does it mean?

            + thd->ENTER_COND(&e->COND_parallel_entry, &e->LOCK_parallel_entry,
            + &stage_waiting_for_ftwrl_threads_to_pause, &old_stage);
            + while (e->last_committed_sub_id < e->pause_sub_id && !err)
            + {
            + if (thd->check_killed())
            +

            { + thd->send_kill_message(); + err= 1; + break; + }

            + mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry);
            + };
            + -e>need_sub_id_signal;
            + thd->EXIT_COND(&old_stage);
            + if (err)
            + break;
            + }
            +
            + if (err)
            + rpl_unpause_for_ftwrl(thd);
            + return err;
            +}

            @@ -1106,7 +1302,14 @@ rpl_parallel_change_thread_count(rpl_parallel_thread_pool *pool,
            */
            for (i= 0; i < pool->count; ++i)
            {

            • rpl_parallel_thread *rpt= pool->get_thread(NULL, NULL);
              + rpl_parallel_thread *rpt;
              +
              + mysql_mutex_lock(&pool->LOCK_rpl_thread_pool);
              + while ((rpt= pool->free_list) == NULL)
              + mysql_cond_wait(&pool->COND_rpl_thread_pool, &pool->LOCK_rpl_thread_pool);
              + pool->free_list= rpt->next;
              + mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool);
              + mysql_mutex_lock(&rpt->LOCK_rpl_thread);

            I see you used the original code from get_thread(), but you don't reset
            rpt->current_owner or rpt->current_entry anymore. Is that not needed?

            @@ -1496,8 +1703,14 @@ rpl_parallel_thread_pool::get_thread(rpl_parallel_thread **owner,
            rpl_parallel_thread *rpt;

            mysql_mutex_lock(&LOCK_rpl_thread_pool);

            • while ((rpt= free_list) == NULL)
              + for (;
              + { + while (unlikely(busy)) + mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool); + if ((rpt= free_list) != NULL) + break; mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool); + }

            Why not use:

            while (unlikely(busy) || !(rpt= free_list))
            mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool);

            — a/sql/sql_parse.cc
            +++ b/sql/sql_parse.cc
            @@ -4259,6 +4259,17 @@ case SQLCOM_PREPARE:
            break;
            }

            + if (lex->type & REFRESH_READ_LOCK)
            +

            { + /* + We need to pause any parallel replication slave workers during FLUSH + TABLES WITH READ LOCK. Otherwise we might cause a deadlock, as + worker threads eun run in arbitrary order but need to commit in a + specific given order. + */ + if (rpl_pause_for_ftwrl(thd)) + goto error; + }

            /*
            reload_acl_and_cache() will tell us if we are allowed to write to the
            binlog or not.
            @@ -4289,6 +4300,8 @@ case SQLCOM_PREPARE:
            if (!res)
            my_ok(thd);
            }
            + if (lex->type & REFRESH_READ_LOCK)
            + rpl_unpause_for_ftwrl(thd);

            break;
            }

            Why have the above code in sql_parse.cc, instead of in
            reload_acl_and_cache() around the call for

            'thd->global_read_lock.lock_global_read_lock(thd))'

            This would avoid calling rpl_pause_for_ftwrl(thd) if tables are locked
            and in other possible error cases detected early.

            On the other hand, you may also need to take care of the code just above
            that calls

            flush_tables_with_read_lock(thd, all_tables)

            This happens when you give tables as a argument to FLUSH TABLES WITH READ LOCK.

            I would assume that this can also cause a dead lock.

            The easy fix is probably to add rpl_pause / rpl_unpause also to this
            function just before it's calling lock_table_names().

            ------------

            http://lists.askmonty.org/pipermail/commits/2015-June/008061.html

            I didn't see test for doing FLUSH TABLES WITH READ LOCK, 2 in a row.

            This should detect if there ever was any issues with
            rpl_unpause_for_ftwrl() not beeing properly called after
            rpl_pause...().

            monty Michael Widenius added a comment - Review of MDEV-7818 : Deadlock occurring with parallel replication and FTWRL http://lists.askmonty.org/pipermail/commits/2015-June/008059.html In this patch you added: +/* + Do not start parallel execution of this event group until all prior groups + have reached the commit phase that are not safe to run in parallel with. +*/ +static bool +do_gco_wait(rpl_group_info *rgi, group_commit_orderer *gco, + bool *did_enter_cond, PSI_stage_info *old_stage) You did however remove this comment: /* Register ourself to wait for the previous commit, if we need to do such registration and that previous commit has not already occured. - Also do not start parallel execution of this event group until all prior groups have reached the commit phase that are not safe to run in parallel with. */ Why did you remove the first part of the comment? Is that not true anymore? In sql/rpl_parallel.h you added: + /* + The largest sub_id that has started its transaction. Protected by + LOCK_parallel_entry. + + (Transactions can start out-of-order, so this value signifies that no + transactions with larger sub_id have started, but not necessarily that all + transactions with smaller sub_id have started). + */ + uint64 largest_started_sub_id; But this is not used anywhere. http://lists.askmonty.org/pipermail/commits/2015-June/008060.html +static void +do_ftwrl_wait(rpl_group_info *rgi, + bool *did_enter_cond, PSI_stage_info *old_stage) +{ + THD *thd= rgi->thd; + rpl_parallel_entry *entry= rgi->parallel_entry; + uint64 sub_id= rgi->gtid_sub_id; + + mysql_mutex_assert_owner(&entry->LOCK_parallel_entry); Would be good to have a comment for the following test, like: /* If ftwrl is activate, wait until ftwrl is finnished if we are a new transaction that started after the ftwrl command was given */ Note that if you would store MAX_INT in entry->pause_sub_id when it's not active, then the following test would be much easier: + if (unlikely(entry->pause_sub_id > 0) && sub_id > entry->pause_sub_id) + { + thd->ENTER_COND(&entry->COND_parallel_entry, &entry->LOCK_parallel_entry, + &stage_waiting_for_ftwrl, old_stage); As you don't have an EXIT_COND() matching the above ENTER_COND, please add a comment that EXIT_COND() will be called by the caller (handle_rpl_parallel_thread()) <cut> Wouldn't a better name for the following function be: rpl_unpause_after_ftwrl(THD *thd) ? void +rpl_unpause_for_ftwrl(THD *thd) +{ + uint32 i; + rpl_parallel_thread_pool *pool= &global_rpl_thread_pool; + + DBUG_ASSERT(pool->busy); + + for (i= 0; i < pool->count; ++i) + { + rpl_parallel_entry *e; + rpl_parallel_thread *rpt= pool->threads [i] ; + + mysql_mutex_lock(&rpt->LOCK_rpl_thread); + if (!rpt->current_owner) + { + mysql_mutex_unlock(&rpt->LOCK_rpl_thread); + continue; Can the above thread ever have e->pause_sub_id != 0 ? For example if the thread did have a current_owner during pause and not an owner now ? + } + e= rpt->current_entry; + mysql_mutex_lock(&e->LOCK_parallel_entry); + mysql_mutex_unlock(&rpt->LOCK_rpl_thread); + e->pause_sub_id= 0; + mysql_cond_broadcast(&e->COND_parallel_entry); Don't you need to unlock mysql_mutex_lock(&e->LOCK_parallel_entry) here ? + } + mysql_mutex_lock(&pool->LOCK_rpl_thread_pool); + pool_mark_not_busy(pool); + mysql_cond_broadcast(&pool->COND_rpl_thread_pool); + mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool); As pool_mark_not_busy() is already doing the broadcast, you can remove it from above. If we always have to take the above mutexes when calling pool_mark_not_busy() it would be better if we move the lock and unlock inside pool_mark_not_busy() <cut> +rpl_pause_for_ftwrl(THD *thd) +{ <cut> + for (i= 0; i < pool->count; ++i) + { + PSI_stage_info old_stage; + rpl_parallel_entry *e; + rpl_parallel_thread *rpt= pool->threads [i] ; + + mysql_mutex_lock(&rpt->LOCK_rpl_thread); + if (!rpt->current_owner) + { + mysql_mutex_unlock(&rpt->LOCK_rpl_thread); + continue; + } + e= rpt->current_entry; + mysql_mutex_lock(&e->LOCK_parallel_entry); + mysql_mutex_unlock(&rpt->LOCK_rpl_thread); + ++e->need_sub_id_signal; + if (!e->pause_sub_id) + e->pause_sub_id= e->largest_started_sub_id; Why the above test? In which case can pause_sub_id be != 0 here ? If it's != 0, what does it mean? + thd->ENTER_COND(&e->COND_parallel_entry, &e->LOCK_parallel_entry, + &stage_waiting_for_ftwrl_threads_to_pause, &old_stage); + while (e->last_committed_sub_id < e->pause_sub_id && !err) + { + if (thd->check_killed()) + { + thd->send_kill_message(); + err= 1; + break; + } + mysql_cond_wait(&e->COND_parallel_entry, &e->LOCK_parallel_entry); + }; + - e >need_sub_id_signal; + thd->EXIT_COND(&old_stage); + if (err) + break; + } + + if (err) + rpl_unpause_for_ftwrl(thd); + return err; +} @@ -1106,7 +1302,14 @@ rpl_parallel_change_thread_count(rpl_parallel_thread_pool *pool, */ for (i= 0; i < pool->count; ++i) { rpl_parallel_thread *rpt= pool->get_thread(NULL, NULL); + rpl_parallel_thread *rpt; + + mysql_mutex_lock(&pool->LOCK_rpl_thread_pool); + while ((rpt= pool->free_list) == NULL) + mysql_cond_wait(&pool->COND_rpl_thread_pool, &pool->LOCK_rpl_thread_pool); + pool->free_list= rpt->next; + mysql_mutex_unlock(&pool->LOCK_rpl_thread_pool); + mysql_mutex_lock(&rpt->LOCK_rpl_thread); I see you used the original code from get_thread(), but you don't reset rpt->current_owner or rpt->current_entry anymore. Is that not needed? @@ -1496,8 +1703,14 @@ rpl_parallel_thread_pool::get_thread(rpl_parallel_thread **owner, rpl_parallel_thread *rpt; mysql_mutex_lock(&LOCK_rpl_thread_pool); while ((rpt= free_list) == NULL) + for (; + { + while (unlikely(busy)) + mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool); + if ((rpt= free_list) != NULL) + break; mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool); + } Why not use: while (unlikely(busy) || !(rpt= free_list)) mysql_cond_wait(&COND_rpl_thread_pool, &LOCK_rpl_thread_pool); — a/sql/sql_parse.cc +++ b/sql/sql_parse.cc @@ -4259,6 +4259,17 @@ case SQLCOM_PREPARE: break; } + if (lex->type & REFRESH_READ_LOCK) + { + /* + We need to pause any parallel replication slave workers during FLUSH + TABLES WITH READ LOCK. Otherwise we might cause a deadlock, as + worker threads eun run in arbitrary order but need to commit in a + specific given order. + */ + if (rpl_pause_for_ftwrl(thd)) + goto error; + } /* reload_acl_and_cache() will tell us if we are allowed to write to the binlog or not. @@ -4289,6 +4300,8 @@ case SQLCOM_PREPARE: if (!res) my_ok(thd); } + if (lex->type & REFRESH_READ_LOCK) + rpl_unpause_for_ftwrl(thd); break; } Why have the above code in sql_parse.cc, instead of in reload_acl_and_cache() around the call for 'thd->global_read_lock.lock_global_read_lock(thd))' This would avoid calling rpl_pause_for_ftwrl(thd) if tables are locked and in other possible error cases detected early. On the other hand, you may also need to take care of the code just above that calls flush_tables_with_read_lock(thd, all_tables) This happens when you give tables as a argument to FLUSH TABLES WITH READ LOCK. I would assume that this can also cause a dead lock. The easy fix is probably to add rpl_pause / rpl_unpause also to this function just before it's calling lock_table_names(). ------------ http://lists.askmonty.org/pipermail/commits/2015-June/008061.html I didn't see test for doing FLUSH TABLES WITH READ LOCK, 2 in a row. This should detect if there ever was any issues with rpl_unpause_for_ftwrl() not beeing properly called after rpl_pause...().
            knielsen Kristian Nielsen added a comment - Pushed to 10.0 and 10.1: http://lists.askmonty.org/pipermail/commits/2015-November/008627.html http://lists.askmonty.org/pipermail/commits/2015-November/008628.html

            People

              knielsen Kristian Nielsen
              tanj Guillaume Lefranc
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.