[MDEV-9246] Slave hangs with parallel replication in MariaDB 10.1 Created: 2015-12-08  Updated: 2015-12-09  Resolved: 2015-12-09

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.1.9
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: Unassigned
Resolution: Not a Bug Votes: 0
Labels: replication


 Description   

A MariaDB 10.1.9 slave was running for over 24 hours with no change in Relay_Master_Log_File:

MariaDB [db]> show slave status \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: master
Master_User: replication
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000157
Read_Master_Log_Pos: 161873700
Relay_Log_File: relay_log.000143
Relay_Log_Pos: 654
Relay_Master_Log_File: mysql-bin.000027
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: 
Replicate_Ignore_DB: 
Replicate_Do_Table: 
Replicate_Ignore_Table: 
Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: 
Last_Errno: 0
Last_Error: 
Skip_Counter: 0
Exec_Master_Log_Pos: 366
Relay_Log_Space: 703475834482
Until_Condition: None
Until_Log_File: 
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File: 
Master_SSL_CA_Path: 
Master_SSL_Cert: 
Master_SSL_Cipher: 
Master_SSL_Key: 
Seconds_Behind_Master: 115303
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error: 
Last_SQL_Errno: 0
Last_SQL_Error: 
Replicate_Ignore_Server_Ids: 
Master_Server_Id: 1
Master_SSL_Crl: 
Master_SSL_Crlpath: 
Using_Gtid: No
Gtid_IO_Pos: 
Replicate_Do_Domain_Ids: 
Replicate_Ignore_Domain_Ids: 
Parallel_Mode: optimistic

It was assumed that the slave was stuck, so the user executed STOP SLAVE. This also hung. SHOW FULL PROCESSLIST output was obtained at that time:

MariaDB [db]> show full processlist;
+-----+-----------------+--------------------------------+-------------+---
----------+-------+--------------------------------------------------------
---------------+--------------------------------+----------+
| Id | User | Host | db |
Command | Time | State
| Info | Progress |
+-----+-----------------+--------------------------------+-------------+---
----------+-------+--------------------------------------------------------
---------------+--------------------------------+----------+
| 2 | event_scheduler | localhost | NULL |
Daemon | 99968 | Waiting on empty queue
| NULL | 0.000 |
| 4 | system user | | NULL |
Connect | 81773 | Waiting for work from SQL thread
| NULL | 0.000 |
| 5 | system user | | NULL |
Connect | 99968 | Waiting for master to send event
| NULL | 0.000 |
| 6 | system user | | NULL |
Connect | 0 | Write_rows_log_event::write_row(-1)
| NULL | 0.000 |
| 7 | system user | | NULL |
Connect | 80392 | Waiting for work from SQL thread
| NULL | 0.000 |
| 8 | system user | | NULL |
Connect | 82490 | Waiting for work from SQL thread
| NULL | 0.000 |
| 9 | system user | | NULL |
Connect | 82490 | Waiting for work from SQL thread
| NULL | 0.000 |
| 10 | system user | | NULL |
Connect | 82490 | Waiting for work from SQL thread
| NULL | 0.000 |
| 11 | system user | | NULL |
Connect | 81774 | Waiting for work from SQL thread
| NULL | 0.000 |
| 12 | system user | | NULL |
Connect | 82490 | Waiting for work from SQL thread
| NULL | 0.000 |
| 13 | system user | | NULL |
Connect | 81779 | Waiting for work from SQL thread
| NULL | 0.000 |
| 14 | system user | | NULL |
Connect | 82489 | Waiting for work from SQL thread
| NULL | 0.000 |
| 15 | system user | | NULL |
Connect | 81774 | Waiting for work from SQL thread
| NULL | 0.000 |
| 16 | system user | | NULL |
Connect | 81773 | Waiting for work from SQL thread
| NULL | 0.000 |
| 17 | system user | | NULL |
Connect | 44823 | Reading event from the relay log
| NULL | 0.000 |
| 19 | replication | master:47237 | NULL |
Binlog Dump | 99910 | Master has sent all binlog to slave; waiting for
binlog to be updated | NULL | 0.000 |
| 94 | root | master:50113 | NULL |
Query | 2652 | Filling schema table
| SHOW /*innotop*/ GLOBAL STATUS | 0.000 |
| 95 | root | localhost | db |
Killed | 2378 | init
| show slave status | 0.000 |
| 96 | root | localhost | db |
Query | 2730 | Killing slave
| stop slave | 0.000 |
| 97 | root | localhost | NULL |
Killed | 2421 | Filling schema table
| show status | 0.000 |
| 100 | root | localhost | NULL |
Killed | 2351 | Filling schema table
| show global status | 0.000 |
| 103 | root | localhost | NULL |
Killed | 1670 | Filling schema table
| show status | 0.000 |
| 106 | root | localhost | db |
Killed | 86 | Filling schema table
| show global variables | 0.000 |
| 111 | root | localhost | db |
Query | 0 | init
| show full processlist | 0.000 |
+-----+-----------------+--------------------------------+-------------+---
----------+-------+--------------------------------------------------------
---------------+--------------------------------+----------+

We thought that the one active worker thread may be holding things up, so we stopped that by executing `KILL 6`, but the SQL thread automatically restarted the thread.

We killed each thread one by one to see if that would free up any mutexes, but eventually all replication-related threads were killed, except for the IO thread (which had automatically restarted after being killed), and the user executed STOP SLAVE again, which hung again:

MariaDB [(none)]> show processlist;
+-----+-----------------+--------------------------------+-------------+---
----------+--------+-------------------------------------------------------
----------------+--------------------------------+----------+
| Id  | User            | Host                           | db          |
Command     | Time   | State
                  | Info                           | Progress |
+-----+-----------------+--------------------------------+-------------+---
----------+--------+-------------------------------------------------------
----------------+--------------------------------+----------+
|   2 | event_scheduler | localhost                      | NULL        |
Daemon      | 106027 | Waiting on empty queue
                  | NULL                           |    0.000 |
|   4 | system user     |                                | NULL        |
Killed      |  87833 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|   5 | system user     |                                | NULL        |
Killed      | 106027 | Waiting for slave mutex on exit
                  | NULL                           |    0.000 |
|   6 | system user     |                                | NULL        |
Killed      |      0 | Write_rows_log_event::write_row(-1)
                  | NULL                           |    0.000 |
|   7 | system user     |                                | NULL        |
Killed      |  86452 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|   8 | system user     |                                | NULL        |
Killed      |  88549 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|   9 | system user     |                                | NULL        |
Killed      |  88549 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|  10 | system user     |                                | NULL        |
Killed      |  88549 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|  11 | system user     |                                | NULL        |
Killed      |  87833 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|  12 | system user     |                                | NULL        |
Killed      |  88549 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|  13 | system user     |                                | NULL        |
Killed      |  87838 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|  14 | system user     |                                | NULL        |
Killed      |  88549 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|  15 | system user     |                                | NULL        |
Killed      |  87833 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|  16 | system user     |                                | NULL        |
Killed      |  87833 | Waiting for work from SQL thread
                  | NULL                           |    0.000 |
|  17 | system user     |                                | NULL        |
Killed      |  50883 | Reading event from the relay log
                  | NULL                           |    0.000 |
|  94 | root            | master:50113 | NULL        |
Query       |   8711 | Filling schema table
                  | SHOW /*innotop*/ GLOBAL STATUS |    0.000 |
|  95 | root            | localhost                      | db |
Killed      |   8437 | init
                  | show slave status              |    0.000 |
|  96 | root            | localhost                      | db |
Killed      |   8789 | Killing slave
                  | stop slave                     |    0.000 |
|  97 | root            | localhost                      | NULL        |
Killed      |   8480 | Filling schema table
                  | show status                    |    0.000 |
| 100 | root            | localhost                      | NULL        |
Killed      |   8411 | Filling schema table
                  | show global status             |    0.000 |
| 103 | root            | localhost                      | NULL        |
Killed      |   7729 | Filling schema table
                  | show status                    |    0.000 |
| 106 | root            | localhost                      | db |
Killed      |   6145 | Filling schema table
                  | show global variables          |    0.000 |
| 111 | root            | localhost                      | db |
Killed      |   6027 | Filling schema table
                  | show global status             |    0.000 |
| 115 | root            | localhost                      | NULL        |
Query       |      0 | init
                  | show processlist               |    0.000 |
| 116 | replication     |master:55614 | NULL        |
Binlog Dump |    772 | Master has sent all binlog to slave; waiting for
binlog to be updated | NULL                           |    0.000 |
| 117 | root            | localhost                      | NULL        |
Query       |     64 | init
                  | stop slave                     |    0.000 |
+-----+-----------------+--------------------------------+-------------+---
----------+--------+-------------------------------------------------------
----------------+--------------------------------+----------+



 Comments   
Comment by Geoff Montee (Inactive) [ 2015-12-09 ]

The slave thread actually didn't seem to be hung. It was just taking an unusually long time to apply that particular binlog, and the "Killed" thread eventually did die hours after being killed. I'm not sure if this slowness was caused by a bug, but there's a good chance that it wasn't, so I'm going to close this.

Generated at Thu Feb 08 07:33:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.