[MDEV-8311] Replication slave is stuck without any error Created: 2015-06-12  Updated: 2015-08-11  Resolved: 2015-08-11

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.12
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Zhoukai Lin Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: need_feedback, parallelslave, replication
Environment:

redhat 6.5



 Description   

When execute show slave status, It seems every thing is OK, but actually slave's sql thread was stuck! The Exec_Master_Log_Pos(mysql-bin.000160 2140880879) didn't change for several days.

MariaDB [(none)]> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.113
                  Master_User: repl
                  Master_Port: 5518
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000282
          Read_Master_Log_Pos: 531076432
               Relay_Log_File: relay-bin.000002
                Relay_Log_Pos: 691
        Relay_Master_Log_File: mysql-bin.000160
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 2140880879
              Relay_Log_Space: 124435203348
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 283562
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1895939662
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Current_Pos
                  Gtid_IO_Pos: 0-1895939662-13106380
1 row in set (0.00 sec)

I use parallel replication, and set SLAVE_PARALLEL_THREADS=16, I found one work thread is always 'Waiting for prior transaction to start commit before starting next transaction', other 15 work threads is always 'Waiting for work from SQL thread',and show engine innodb status, there's no locks.
Then check error log, find a restart in log 'mysql-bin.000160' at position 2140880879 :

150606 10:32:34 [Note] /home/mariadb/bin/mysqld: Normal shutdown
recording                                                                                                                                  60385,1       86%
150605 18:42:13 [Warning] IP address '192.168.1.117' could not be resolved: Temporary failure in name resolution
150606 10:32:34 [Note] /home/mariadb/bin/mysqld: Normal shutdown
 
150606 10:32:34 [Note] Event Scheduler: Purging the queue. 0 events
150606 10:36:51 [Note] Slave: received end packet from server, apparent master shutdown:
150606 10:36:51 [Note] Slave I/O thread killed while reading event
150606 10:36:51 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000260', position 353423150; GTID position 0-1895939662-1585424
150606 10:44:24 [Note] InnoDB: Using mutexes to ref count buffer pool pages
150606 10:44:24 [Note] InnoDB: The InnoDB memory heap is disabled
150606 10:44:24 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
150606 10:44:24 [Note] InnoDB: Compressed tables use zlib 1.2.3
150606 10:44:24 [Note] InnoDB: Using Linux native AIO
150606 10:44:24 [Note] InnoDB: Using CPU crc32 instructions
150606 10:44:24 [Note] InnoDB: Initializing buffer pool, size = 78.1G
150606 10:44:27 [Note] InnoDB: Completed initialization of buffer pool
150606 10:44:28 [Note] InnoDB: Highest supported file format is Barracuda.
150606 10:44:28 [Note] InnoDB: The log sequence numbers 70561972035 and 70561972035 in ibdata files do not match the log sequence number 121831407098 in the ib_logfiles!
150606 10:44:28 [Note] InnoDB: Database was not shutdown normally!
150606 10:44:28 [Note] InnoDB: Starting crash recovery.
150606 10:44:28 [Note] InnoDB: Reading tablespace information from the .ibd files...
150606 10:44:28 [Note] InnoDB: Restoring possible half-written data pages
150606 10:44:28 [Note] InnoDB: from the doublewrite buffer...
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 4554 row operations to undo
InnoDB: Trx id counter is 28084736
InnoDB: Last MySQL binlog file position 0 708607516, file name ../binlog/mysql-bin.000179
150606 10:44:29 [Note] InnoDB: 128 rollback segment(s) are active.
InnoDB: Starting in background the rollback of uncommitted transactions
2015-06-06 10:44:29 7f1c2d7ff700  InnoDB: Rolling back trx with id 28084351, 4554 rows to undo
 
InnoDB: Progress in percents: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100150606 10:44:30 [Note] InnoDB: Rollback of trx with id 28084351 completed
2015-06-06 10:44:30 7f1c2d7ff700  InnoDB: Rollback of non-prepared transactions completed
150606 10:44:34 [Note] InnoDB: Waiting for purge to start
150606 10:44:34 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.17-65.0 started; log sequence number 121831407098
150606 10:44:34 [Note] Semi-sync replication initialized for transactions.
150606 10:44:34 [Note] Semi-sync replication enabled on the master.
150606 10:44:34 server_audit: MariaDB Audit Plugin version 1.1.7 STARTED.
150606 10:44:34 [Note] Recovering after a crash using ../binlog/mysql-bin
150606 10:44:34 [Note] Starting crash recovery...
150606 10:44:34 [Note] Crash recovery finished.
150606 10:44:34 [Note] Server socket created on IP: '192.168.1.114'.
150606 10:44:34 [Note] Event Scheduler: Loaded 0 events
150606 10:44:46 [Note] /home/mariadb/bin/mysqld: Normal shutdown
 
150606 10:44:50 [Note] /home/mariadb/bin/mysqld: ready for connections.
Version: '10.0.12-MariaDB-log'  socket: '/home/mariadb/bin/mysql1.sock'  port: 5518  Source distribution
150606 10:44:50 [Note] Event Scheduler: Purging the queue. 0 events
150606 10:44:50 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.1.113:5518' in log 'mysql-bin.000160' at position 2140880879
150606 10:44:50 [Note] Slave I/O thread killed while connecting to master
150606 10:44:50 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000160', position 2140880879; GTID position 0-1895939662-1128069
150606 10:44:50 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000160' at position 2140880879, relay log '../relaylog/relay-bin.000001' position: 4; GTID position '0-1895939662-1128069'
150606 10:44:50 [Note] Slave SQL thread exiting, replication stopped in log 'mysql-bin.000160' at position 2140880879; GTID position '0-1895939662-1128069'
150606 10:44:50 server_audit: STOPPED
150606 10:44:50 [Note] unregister_replicator OK
150606 10:44:50 [Note] InnoDB: FTS optimize thread exiting.
150606 10:44:50 [Note] InnoDB: Starting shutdown...
150606 10:44:55 [Note] InnoDB: Shutdown completed; log sequence number 121833466167
150606 10:44:55 [Note] /home/mariadb/bin/mysqld: Shutdown complete
 
150606 10:45:12 [Note] InnoDB: Using mutexes to ref count buffer pool pages
150606 10:45:12 [Note] InnoDB: The InnoDB memory heap is disabled
150606 10:45:12 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
150606 10:45:12 [Note] InnoDB: Compressed tables use zlib 1.2.3
150606 10:45:12 [Note] InnoDB: Using Linux native AIO
150606 10:45:12 [Note] InnoDB: Using CPU crc32 instructions
150606 10:45:12 [Note] InnoDB: Initializing buffer pool, size = 78.1G
150606 10:45:15 [Note] InnoDB: Completed initialization of buffer pool
150606 10:45:15 [Note] InnoDB: Highest supported file format is Barracuda.
150606 10:45:16 [Note] InnoDB: 128 rollback segment(s) are active.
150606 10:45:16 [Note] InnoDB: Waiting for purge to start
150606 10:45:16 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.17-65.0 started; log sequence number 121833466167
150606 10:45:16 [Note] Semi-sync replication initialized for transactions.
150606 10:45:16 [Note] Semi-sync replication enabled on the master.
150606 10:45:16 server_audit: MariaDB Audit Plugin version 1.1.7 STARTED.
150606 10:45:16 [Note] Server socket created on IP: '192.168.1.114'.
150606 10:45:16 [Note] Event Scheduler: Loaded 0 events
150606 10:45:16 [Note] /home/mariadb/bin/mysqld: ready for connections.
Version: '10.0.12-MariaDB-log'  socket: '/home/mariadb/bin/mysql1.sock'  port: 5518  Source distribution
150606 10:45:16 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.1.113:5518' in log 'mysql-bin.000160' at position 2140880879
150606 10:45:16 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000160' at position 2140880879, relay log '../relaylog/relay-bin.000001' position: 4; GTID position '0-1895939662-1128069'
150606 10:45:16 [Note] Slave I/O thread: connected to master 'repl@192.168.1.113:5518',replication starts at GTID position '0-1895939662-1128069'



 Comments   
Comment by Elena Stepanova [ 2015-06-13 ]

Hi,
Thanks for the report.

10.0.12 is pretty old (1 year old), especially in terms of parallel replication which had a lot of fixes since then. At least these two seem to fit, but there might be others:

MDEV-7326

The user-visible effect in this hang is that at least one replication worker
threads are stuck in state "Waiting for prior transaction to start commit
before starting next transaction", and all other threads are stuck in this
state or the state "Waiting for prior transaction to commit". (As seen in SHOW
PROCESSLIST). Killing the worker threads will stop replication, and it can
then be re-started successfully.

MDEV-7929

The parallel replication worker threads can hang in some cases with
non-transactional event groups. The symptom is that worker threads are stuck
in "waiting for prior transaction to start commit".

I recommend you to consider upgrading to the recent 10.0 version (currently 10.0.19, and there will be 10.0.20 released soon). Until then, because different failures in this area often have identical representation, it is going to be very difficult, if not impossible, to understand whether you are hitting an old bug or a new one.

Comment by Zhoukai Lin [ 2015-06-13 ]

I think it may be different with MDEV-7326, in that case Killing the worker threads will stop replication, and it can
then be re-started successfully.But in my case, I cannot restart replication by execute 'stop slave' or shutdown the MariaDB server, both will stuck. Only kill -9 pid can stop it.

Comment by Elena Stepanova [ 2015-06-13 ]

I think what Kristian's comment means is that one needed to kill the 16 parallel threads directly (with KILL <thread id>) rather than via STOP SLAVE. Did you try that?

There have also been some fixes related to deadlocks on stopping parallel replication and/or parallel replication threads; moreover, we still have some open ones, e.g. MDEV-7980, MDEV-8039 just to name a couple. So it's possible that you hit two different problems at once.

How often does it happen for you? Would you consider upgrading one of your servers to see if the problem goes away?

Comment by Elena Stepanova [ 2015-07-13 ]

linzhoukai,
Are you still experiencing the problem?

Comment by Elena Stepanova [ 2015-08-11 ]

Closing as incomplete for now. Please comment to re-open if you still experience the problem and have new information.

Generated at Thu Feb 08 07:26:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.