[MDEV-8476] Race condition in slave SQL thread shutdown Created: 2015-07-16  Updated: 2015-11-12  Resolved: 2015-11-12

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0, 10.1
Fix Version/s: 10.0.23, 10.1.8

Type: Bug Priority: Critical
Reporter: Sergey Vojtovich Assignee: Michael Widenius
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-7884 rpl_binlog_errors fails sporadically ... Closed

 Description   

During server shutdown slave SQL thread broadcasts "stopped" too early:

  mysql_cond_broadcast(&rli->stop_cond);
  DBUG_EXECUTE_IF("simulate_slave_delay_at_terminate_bug38694", sleep(5););
  mysql_mutex_unlock(&rli->run_lock);  // tell the world we are done
 
  /*
    Deactivate the parallel replication thread pool, if there are now no more
    SQL threads running. Do this here, when we have released all locks, but
    while our THD (and current_thd) is still valid.
  */
  mysql_mutex_lock(&LOCK_active_mi);
  if (opt_slave_parallel_threads > 0 &&
      !master_info_index->any_slave_sql_running())
    rpl_parallel_inactivate_pool(&global_rpl_thread_pool);
  mysql_mutex_unlock(&LOCK_active_mi);
 
  mysql_mutex_lock(&LOCK_thread_count);
  THD_CHECK_SENTRY(thd);
  delete thd;
  mysql_mutex_unlock(&LOCK_thread_count);

This causes enormous number of sporadic replication test failures with crashes, hangs and warnings like "Plugin '...' has ref_count=1 after shutdown".

To reproduce this failure reliably, just add sleep(1) immediately after condition is signaled and run tests as usual:

diff --git a/sql/slave.cc b/sql/slave.cc
index 4129c4c..efc75f3 100644
--- a/sql/slave.cc
+++ b/sql/slave.cc
@@ -4837,6 +4837,7 @@ log '%s' at position %s, relay log '%s' position: %s%s", RPL_LOG_NAME,
   mysql_cond_broadcast(&rli->stop_cond);
   DBUG_EXECUTE_IF("simulate_slave_delay_at_terminate_bug38694", sleep(5););
   mysql_mutex_unlock(&rli->run_lock);  // tell the world we are done
+  sleep(1);
 
   /*
     Deactivate the parallel replication thread pool, if there are now no more

This seem to be appeared after:

commit 682ed005c57dabb4615052a79789e44e5c93525c
Author: Kristian Nielsen <knielsen@knielsen-hq.org>
Date:   Wed Jun 10 11:57:42 2015 +0200
 
    MDEV-8294: Inconsistent behavior of slave parallel threads at runtime
 
    There were some cases where the slave SQL thread could stop without
    the pool of parallel replication worker threads being correctly
    de-activated.



 Comments   
Comment by Elena Stepanova [ 2015-09-25 ]

svoj,

Is it a 10.x-specific problem, as specified in Affected versions, or 5.5 can be affected as well?
Apart from MDEV-7884, we've recently had another flavor of a similar failure in rpl.rpl_binlog_errors on 5.5:
http://buildbot.askmonty.org/buildbot/builders/p8-trusty-bintar-debug/builds/672/steps/test/logs/stdio

rpl.rpl_binlog_errors 'mix,xtradb'       w1 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2015-09-25 00:11:51
line
150925  0:10:46 [ERROR] mysqld: Table './mysql/plugin' is marked as crashed and should be repaired
150925  0:10:46 [Warning] Checking table:   './mysql/plugin'
150925  0:10:46 [ERROR] mysql.plugin: 1 client is using or hasn't closed the table properly
150925  0:10:46 [ERROR] mysqld: Table './mysql/event' is marked as crashed and should be repaired
150925  0:10:46 [Warning] Checking table:   './mysql/event'
150925  0:10:46 [ERROR] mysql.event: 1 client is using or hasn't closed the table properly

Which in fact looks like this in the slave's error log:

150925  0:10:13 [Note] /var/lib/buildbot/maria-slave/power8-vlp04-bintar-debug/build/sql/mysqld: Normal shutdown
 
150925  0:10:43 [Note] /var/lib/buildbot/maria-slave/power8-vlp04-bintar-debug/build/sql/mysqld (mysqld 5.5.45-MariaDB-debug-log) starti
ng as process 7173 ...

So, obviously the slave had hung on shutdown.
Can it be the same problem? (and MDEV-7884 as well).

Comment by Sergey Vojtovich [ 2015-09-28 ]

elenst, this should be different bug. I see no way for this MDEV-8476 to affect 5.5.

Comment by Michael Widenius [ 2015-11-12 ]

Fixed some time ago in 10.1 by adding a variable to detect if all slaves threads has properly ended
Backported fix to 10.0

Generated at Thu Feb 08 07:27:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.