[MDEV-9573] 'Stop slave' hangs on replication slave Created: 2016-02-17 Updated: 2023-06-28 Resolved: 2017-05-22 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.1.10, 10.1.11, 10.1.13 |
| Fix Version/s: | 10.0.30, 10.1.22 |
| Type: | Bug | Priority: | Major |
| Reporter: | Markus Nägele | Assignee: | Michael Widenius |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | replication | ||
| Environment: |
CentOS release 6.7 (Final) x64 on Dell PowerEdge R510 |
||
| Description |
|
Since we switched from MariaDB 10.0.x to MariaDB 10.1 we have been having trouble with the statement 'stop slave;'. It just hangs and doesn't return, even after hours. A 'show slave status\G' statement issued from another connection also blocks at that point and doesn't produce any output. Please let me know if I can provide any more details that might be helpful. |
| Comments |
| Comment by Elena Stepanova [ 2016-02-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
optonaegele, are you able to reproduce it when slave_run_triggers_for_rbr is not enabled? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Markus Nägele [ 2016-03-14 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sorry for the delayed answer. I needed to set up a testing machine first. I can reproduce it as well when slave_run_triggers_for_rbr is not enabled; it seems to make no difference. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2016-03-19 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Is the config the same on master and slave? If you can reproduce it with a debug build and take stack traces of all threads on the hanging slave, that could indeed be helpful. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Markus Nägele [ 2016-04-13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The config on master and slave is basically the same; I attached the configuration of the master as well.
SymmetricDS creates triggers like this on the table (there are two more for this table, basically the same, but for UPDATE and DELETE):
Those triggers update this table:
In the queries shown in the process list and in the triggers you can find a procedure sym_transaction_id_post_5_1_23, which is defined as follows:
pt-table-checksum is running on a separate server, neither on the master nor on the slave. As you can see, I have now tried with the latest version, 10.1.13. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2016-04-15 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thank you for providing the stack traces, that is really useful to find the
I think I see what is happening, seems like an obvious deadlock problem in
In mysql_execute_command() for SQLCOM_SLAVE_STOP, the code takes
But something like INSERT ... SELECT FROM INFORMATION_SCHEMA might also need
The result is that STOP SLAVE is waiting forever for the SQL thread to stop
It doesn't seem that this is specific to parallel replication, though
Frankly, I have no idea how this code was ever supposed to work, I don't see
My compliments to Markus Nägele for a very good report on a complicated bug, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
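For readers following the analysis, the lock ordering described in this thread can be sketched outside the server. This is a minimal Python illustration, not MariaDB code: `lock_active_mi` is only a stand-in for LOCK_active_mi, and the function name, the events, and the timeout are assumptions made for the demo. One thread plays STOP SLAVE and waits for the SQL thread, while the SQL thread needs the same lock before it can stop.

```python
import threading

def simulate_stop_slave(hold_lock_while_waiting, timeout=0.3):
    """Return True if the simulated STOP SLAVE completes within `timeout`."""
    lock_active_mi = threading.Lock()        # stand-in for LOCK_active_mi
    stop_has_lock = threading.Event()        # STOP SLAVE has taken the lock
    sql_thread_stopped = threading.Event()   # SQL thread has exited
    give_up = threading.Event()              # ends the demo instead of hanging

    def sql_thread():
        # The replicated statement needs the lock (as an INFORMATION_SCHEMA
        # read might) before the thread can notice the stop request.
        stop_has_lock.wait()
        while not give_up.is_set():
            if lock_active_mi.acquire(timeout=0.05):
                lock_active_mi.release()
                sql_thread_stopped.set()
                return

    worker = threading.Thread(target=sql_thread)
    worker.start()

    lock_active_mi.acquire()
    stop_has_lock.set()
    if hold_lock_while_waiting:
        # Problematic order: wait for the SQL thread with the lock held.
        finished = sql_thread_stopped.wait(timeout)
        lock_active_mi.release()
    else:
        # Release the lock first, then wait for the SQL thread.
        lock_active_mi.release()
        finished = sql_thread_stopped.wait(timeout)

    give_up.set()
    worker.join()
    return finished
```

Calling `simulate_stop_slave(True)` times out where the real server would wait forever, while `simulate_stop_slave(False)` completes, because the waiting side no longer holds the lock the other thread needs.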
| Comment by Kristian Nielsen [ 2016-04-15 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Yep, easy to reproduce also without parallel replication or triggers or
This triggers a hang with a similar stack trace as in this report. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2016-04-15 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Mailing list thread: https://lists.launchpad.net/maria-developers/msg09486.html
So the thing that seems to trigger the hang here is accessing information_schema.session_status from a slave-replicated query (in this case from a trigger on a table modified during replication). | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Michael Widenius [ 2017-01-29 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Agree with Kristian that the issue is mainly due to keeping LOCK_active_mi active over STOP SLAVE.
I now have a fix that introduces object counting for Master_info and does not take LOCK_active_mi over stop slave or even stop_all_slaves(). This seems to fix this problem and gives us some other benefits:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
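The object-counting idea mentioned above can be sketched roughly as follows. This is a hedged Python illustration and not the actual patch: `MasterInfo`, `MasterInfoIndex`, `stop_slave`, and all method names are hypothetical, and the single condition variable stands in for a short-lived LOCK_active_mi. The lock is held only long enough to look up the object and bump a user count; the long-running stop then happens without it, and removal waits until nobody is using the object.

```python
import threading

class MasterInfo:
    """Hypothetical per-connection replication state."""
    def __init__(self, name):
        self.name = name
        self.users = 0        # how many callers currently hold a reference
        self.running = True

class MasterInfoIndex:
    """Hypothetical registry; its lock is held only for brief lookups."""
    def __init__(self):
        self._cond = threading.Condition()   # stand-in for LOCK_active_mi
        self._by_name = {}

    def add(self, name):
        with self._cond:
            self._by_name[name] = MasterInfo(name)

    def acquire(self, name):
        # Look up and pin the object; the lock is dropped on return.
        with self._cond:
            mi = self._by_name.get(name)
            if mi is not None:
                mi.users += 1
            return mi

    def release(self, mi):
        with self._cond:
            mi.users -= 1
            self._cond.notify_all()

    def remove(self, name, timeout=None):
        # Drop the object only once no caller is still using it.
        with self._cond:
            mi = self._by_name.get(name)
            if mi is None:
                return False
            unused = self._cond.wait_for(lambda: mi.users == 0, timeout)
            if unused:
                del self._by_name[name]
            return unused

def stop_slave(index, name):
    mi = index.acquire(name)
    if mi is None:
        return False
    try:
        mi.running = False    # the slow part runs without the index lock
    finally:
        index.release(mi)
    return True
```

With this shape, two connections can work on different entries at the same time, since neither holds the index lock while stopping.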
| Comment by Michael Widenius [ 2017-02-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Have been working on this and related bugs found by Elena. An added benefit of the new approach is that one will now be able to run start/stop/change master on different connections in parallel. Before, these were serialized with the LOCK_active_mi mutex. Will update this Jira entry with a full set of commit logs when I am done. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Michael Widenius [ 2017-03-13 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
A fix is now pushed into the 10.0 tree that fixes a major part of the
Here follows a list of the commits related to my fixing this issue.
Fixed deadlocks when doing stop slave while slave was starting.
Fixed hang doing FLUSH TABLES WITH READ LOCK and parallel replication
Add protection to not access is_open() without LOCK_log mutex
Don't allow one to kill START SLAVE while the slave's IO_THREAD or SQL_THREAD
Add protection for reinitialization of mutex in parallel replication
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Michael Widenius [ 2017-05-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Pushed to 10.0 at end of January. Should be in all newer releases |