[MXS-1589] maxscale stops working when binlogrouter is erroring Created: 2017-12-22  Updated: 2018-01-16  Resolved: 2018-01-16

Status: Closed
Project: MariaDB MaxScale
Component/s: binlogrouter
Affects Version/s: 2.2.0
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Maikel Punie Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

centos7 3.10.0-693.5.2.el7.x86_64



 Description   

when we start the binlog router we always end up in the same situation. It starts connecting to the master, it reads some info and then times-out. On a reconnect it says that there is an error in the binlog.

2017-12-22 10:56:08   notice : (928002) [binlogrouter] BinlogSVC: 'CHANGE MASTER TO executed'. Previous state MASTER_HOST='_none_', MASTER_PORT=3306, MASTER_LOG_FILE='', MASTER_LOG_POS=0, MASTER_USER='repl'. New state is MASTER_HOST='10.255.10.32', MASTER_PORT=3306, MASTER_LOG_FILE='', MASTER_LOG_POS=4, MASTER_USER='repl', MASTER_USE_GTID=Slave_pos
2017-12-22 10:56:11   notice : (928002) [binlogrouter] BinlogSVC: START SLAVE executed by maxscale@138.203.16.242. Trying connection to master [10.255.10.32]:3306, binlog , pos 4, transaction safe pos 4
2017-12-22 10:56:11   notice : [binlogrouter] BinlogSVC: attempting to connect to master server [10.255.10.32]:3306, binlog='', pos=4, GTID=0-1-22539224966
2017-12-22 10:56:11   notice : (928121) [binlogrouter] BinlogSVC: identity seen by the master: Server_id: 9, Slave_UUID: f12fcb7f-b97b-11e3-bc5e-0401152c4c22, Host: maxscale-blr-1
2017-12-22 10:56:11   notice : (928121) [binlogrouter] BinlogSVC: identity seen by the slaves: server_id: 999, uuid: x-f-cc-common, hostname: harvey, MySQL version: 5.6.19-common
2017-12-22 10:56:12   notice : (928002) [MySQLAuth] [BinlogSVC] Loaded 121 MySQL users for listener BinlogLIST.
2017-12-22 10:56:36   notice : [binlogrouter] BinlogSVC: Slave 138.203.16.242, server id 0, disconnected after 55 seconds. 6 SQL commands
2017-12-22 10:57:21   error  : [binlogrouter] No event received from master [10.255.10.32]:3306 in heartbeat period (30 seconds), last event (Transaction ID Event (2 Phase Commit) 16) received 60 seconds ago. Assuming connection is dead and reconnecting.
2017-12-22 10:57:29   error  : (928121) [binlogrouter] BinlogSVC: failed to connect to master server 'binlog_router_master_host', retrying in 10 seconds
2017-12-22 10:57:39   notice : [binlogrouter] BinlogSVC: attempting to connect to master server [10.255.10.32]:3306, binlog='mysql-bin.000036', pos=749450152, GTID=0-1-1064389133
2017-12-22 10:57:39   notice : (929418) [binlogrouter] BinlogSVC: Request binlog records from mysql-bin.000036 at position 749450152 from master server [10.255.10.32]:3306
2017-12-22 10:57:39   notice : (929418) [binlogrouter] BinlogSVC: identity seen by the master: Server_id: 9, Slave_UUID: f12fcb7f-b97b-11e3-bc5e-0401152c4c22, Host: maxscale-blr-1
2017-12-22 10:57:39   notice : (929418) [binlogrouter] BinlogSVC: identity seen by the slaves: server_id: 999, uuid: x-f-cc-common, hostname: harvey, MySQL version: 5.6.19-common
2017-12-22 10:57:42   error  : (929418) [binlogrouter] Error packet in binlog stream.mysql-bin.000030 @ 4423.
2017-12-22 10:57:42   notice : (929418) [binlogrouter] BinlogSVC: Master 10.255.10.32 disconnected after 3 seconds. 6 events read.
2017-12-22 10:57:59   notice : [binlogrouter] BinlogSVC: Slave 138.203.16.242, server id 0, disconnected after 20 seconds. 3 SQL commands

Config for the binlog router:

[BinlogSVC]
type=service
router=binlogrouter
version_string=5.6.17-log
user=XXXXXXX
passwd=XXXXXXX
router_options=uuid=f12fcb7f-b97b-11e3-bc5e-0401152c4c22,
           server_id=9,
           user=XXXXXXX,
           password=XXXXXXX,
           heartbeat=30,
           binlogdir=/binlogs,
           transaction_safety=1,
           master_version=5.6.19-common,
           master_hostname=harvey,
           master_uuid=x-f-cc-common,
           master_id=999,
           mariadb10-compatibility=1,
           semisync=0,
           mariadb10_slave_gtid=1,
           mariadb10_master_gtid=1,
           binlog_structure=tree,
           slave_hostname=maxscale-blr-1,
           master_retry_count=1000,
           connect_retry=60

master is running: Server version: 10.1.29-MariaDB MariaDB Server

The moment this happens maxscale is not accepting any new connections on any listeners. Existing connections are still working



 Comments   
Comment by Massimiliano Pinto (Inactive) [ 2017-12-22 ]

Hi Maikel Punie,

when maxscale reports no heartbeat received would it be possible to issue

SHOW SLAVE STATUS\G in maxscale mysql connection?

If not provide at least SHOW SLAVE STATUS\G
when seeing:
2017-12-22 10:57:42 error : (929418) [binlogrouter] Error packet in binlog stream.mysql-bin.000030 @ 4423.

The CHANGE MASTER I see is without the previous set of
SET @@global.gtid_slave_pos='0-x-yz';

In this case maxscale should get first GTID available in master binlogs.

I see:

2017-12-22 10:57:39 notice : [binlogrouter] BinlogSVC: attempting to connect to master server [10.255.10.32]:3306, binlog='mysql-bin.000036', pos=749450152, GTID=0-1-1064389133

Are GTID and pos the right ones?

Th error 2017-12-22 10:57:42 error : (929418) [binlogrouter] Error packet in binlog stream.mysql-bin.000030 @ 4423

Is related to a binlog file/pos not the same as the previous log (binlog='mysql-bin.000036', pos=749450152)

Would you also mind looking into the GTID database in $binlogdir)
It's a SQLite3 database, gtid_maps.db.

You can easily check whether it keeps the right data.

Additionally, does BLR work well without GTID master registration?

You can check it by setting:

mariadb10_master_gtid=0,
binlog_structure=flat,

Comment by Maikel Punie [ 2017-12-22 ]

in the sql command line:

SET @@global.gtid_slave_pos='0-1-22545514499';
CHANGE MASTER TO MASTER_HOST='10.255.10.32', MASTER_USER='repl', MASTER_PASSWORD='XXXXX', master_use_gtid=slave_pos;

the show slave status at the moment of the error:

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Slave stopped
                  Master_Host: 10.255.10.32
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000030
          Read_Master_Log_Pos: 4423
               Relay_Log_File: mysql-bin.000030
                Relay_Log_Pos: 4
        Relay_Master_Log_File: mysql-bin.000030
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1236
                   Last_Error: The binlog on the master is missing the GTID 0-1-1070679239 requested by the slave (even though both a prior and a subsequent sequence number does exist), and GTID strict mode is enabled
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4
              Relay_Log_Space: 4
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 1
                  Master_UUID: x-f-cc-common
             Master_Info_File: /binlogs/master.ini
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave stopped
           Master_Retry_Count: 1000
                  Master_Bind: 
      Last_IO_Error_TimeStamp: 
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
                   Using_Gtid: Slave_pos
                  Gtid_IO_Pos: 0-1-1070679239

in the gtid db i see the following at the moment of the error:

279|0|1|1070679127|mysql-bin.000038|456907318|456907577
280|0|1|1070679133|mysql-bin.000038|456908862|456909119
281|0|1|1070679135|mysql-bin.000038|456909374|456909630
282|0|1|1070679141|mysql-bin.000038|456910910|456911168
283|0|1|1070679143|mysql-bin.000038|456911424|456911682
284|0|1|1070679149|mysql-bin.000038|456912967|456913224
285|0|1|1070679151|mysql-bin.000038|456913478|456913732
286|0|1|1070679157|mysql-bin.000038|456915014|456915269
287|0|1|1070679159|mysql-bin.000038|456915527|456915785
288|0|1|1070679165|mysql-bin.000038|456917066|456917322
289|0|1|1070679167|mysql-bin.000038|456917575|456917830
290|0|1|1070679173|mysql-bin.000038|456919106|456919361
291|0|1|1070679175|mysql-bin.000038|456919615|456919873
292|0|1|1070679181|mysql-bin.000038|456921151|456921404
293|0|1|1070679183|mysql-bin.000038|456921659|456921914
294|0|1|1070679189|mysql-bin.000038|456923199|456923458
295|0|1|1070679191|mysql-bin.000038|456923723|456923978
296|0|1|1070679197|mysql-bin.000038|456925256|456925511
297|0|1|1070679199|mysql-bin.000038|456925767|456926022
298|0|1|1070679205|mysql-bin.000038|456927307|456927587
299|0|1|1070679207|mysql-bin.000038|456927869|456928155
300|0|1|1070679213|mysql-bin.000038|456929571|456929851
301|0|1|1070679215|mysql-bin.000038|456930133|456930419
302|0|1|1070679221|mysql-bin.000038|456931805|456932084
303|0|1|1070679223|mysql-bin.000038|456932364|456932648
304|0|1|1070679229|mysql-bin.000038|456934061|456934335
305|0|1|1070679231|mysql-bin.000038|456934611|456934891
306|0|1|1070679237|mysql-bin.000038|456936307|456936584
307|0|1|1070679239|mysql-bin.000038|456936863|456937145
308|0|1|0|mysql-bin.000030|4|4

This last line looks strange

Comment by Massimiliano Pinto (Inactive) [ 2017-12-22 ]

Hi Maikel Punie

Indeed last line it's wrong

I can't see the output of "show slave status \G"
There are empty lines

Please add here the last files in your master:

SHOW BINARY LOGS;

and the last events in mysql-bin.000030

MariaDB> show binlog events in 'mysql-bin.000030';

Comment by Maikel Punie [ 2017-12-22 ]

the output of the show slave is aligned more to the right (scroll and then you can see it).

i'm now running in a mode without gtid enabled to see if this reproduces the issue.
If not then i will switch back to gtid enabled mode and get you the output of the show binary logs.

Comment by Massimiliano Pinto (Inactive) [ 2017-12-22 ]

I see now:

Last_Error: The binlog on the master is missing the GTID 0-1-1070679239 requested by the slave (even though both a prior and a subsequent sequence number does exist), and GTID strict mode is enabled

and

Master_Log_File: mysql-bin.000030
Read_Master_Log_Pos: 4423

It looks like the master is sending the information "mysql-bin.000030"

It would be very useful to take the output of ngrep before and just after the error:

  1. root > ngrep -x -q -d lo '' 'port 23240'

replace lo with the right interface name and the port as well

You should be able to copy here the last good transmission from master say, events about GTID=0-1-1064389139 (the last one I see in SHOW SLAVE STATUS\G)

and the transmission with the error.

Comment by Maikel Punie [ 2017-12-22 ]

ok, without gtid is see that maxscale also stops working but i don't see any errors in the logs.

restarting now with gtid enabled again

Comment by Maikel Punie [ 2017-12-22 ]

ok, when gtid is enabled and the slave is stopped and restarted i go into this error.

looking at the binlog file 30 i have the following info in there.

mysqlbinlog mysql-bin.000030mysqlbinlog mysql-bin.000030
/!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1/;
/!40019 SET @@session.max_insert_delayed_threads=0/;
/!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0/;
DELIMITER /!/;

  1. at 4
    #171221 8:59:59 server id 1 end_log_pos 249 Start: binlog v 4, server v 10.1.29-MariaDB created 171221 8:59:59
    BINLOG '
    /2k7Wg8BAAAA9QAAAPkAAAAAAAQAMTAuMS4yOS1NYXJpYURCAGxvZwAAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAA3QAEGggAAAAICAgCAAAACgoKAAAAAAAA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAAAEEwQAABPOflQ=
    '/!/;
  2. at 249
    #171221 8:59:59 server id 1 end_log_pos 288 Gtid list [0-2-22514646005]
  3. at 288
    #171221 8:59:59 server id 1 end_log_pos 327 Binlog checkpoint mysql-bin.000029
  4. at 327
    #171221 8:01:24 server id 1 end_log_pos 4384 Ignorable
  5. Ignorable event type 28 (Ignorable log event)
  6. # at 4384
    #171221 8:01:24 server id 1 end_log_pos 4423 Binlog checkpoint mysql-bin.000030
    DELIMITER ;
  7. End of log file
    ROLLBACK /* added by mysqlbinlog */;
    /!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE/;
    /!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0/;
Comment by Massimiliano Pinto (Inactive) [ 2017-12-22 ]

This is the maxscale binlog file:

as long as GTID can return any position in a file maxscale has written an Ignorable log event in order to keep its binlog file without holes.

It would be interesting to see the master binlog file mysql-bin.000030 content
and last events of the master binlog file before the error occurs.

Is that mysql-bin.000036 ?

As next step I suggest to stop maxscale, wipe off gtid_maps.db and master.ini.

Te you start maxscale, configure master replication and set the GTID you want, ay 0-1-1070679127 with is in mysql-bin.000038

Try to get all the network traffic using "ngrep" until the error is seen

Massimiliano

Comment by Maikel Punie [ 2018-01-02 ]

hmm, i can't seem to reproduce this anymore.
Will watch and try to reproduce this in the coming weeks

Comment by markus makela [ 2018-01-02 ]

Thanks for updating, we'll keep the issue open until we've had time to try and reproduce this on our side.

Comment by Johan Wikman [ 2018-01-16 ]

Please reopen if problem reappears.

Generated at Thu Feb 08 04:07:55 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.