[MDEV-12038] Replication is stuck on a wrongly formated Binlog Created: 2017-02-09  Updated: 2019-08-21  Resolved: 2017-03-24

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.1.18
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Patrick Moiroux Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: need_feedback, replication
Environment:

Centos7 - MariaDB 10.1.18


Issue Links:
Blocks
is blocked by MDEV-20398 Slave sql thread blocked by Update_ro... Closed

 Description   

Sql thread is running on the slave but seems to be blocked on a specific position.

when looking at that position on the master , the format seems to be broken (overlap between #at and end_log_pos lines:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.193.9.31
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.001976
          Read_Master_Log_Pos: 2042964
               Relay_Log_File: slave-relay.000004
                Relay_Log_Pos: 66666208
        Relay_Master_Log_File: mysql-bin.001965
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 66665920

On the master:

 mysqlbinlog -v mysql-bin.001965
 
# at 66665893
#170208 17:51:24 server id 1  end_log_pos 66665920      Xid = 4949069
COMMIT/*!*/;
# at 66665920
#170208 17:51:21 server id 1  end_log_pos 66665958      GTID 0-1-21861609 trans
/*!100001 SET @@session.gtid_seq_no=21861609*//*!*/;
BEGIN
/*!*/;
# at 66665958
#170208 17:51:21 server id 1  end_log_pos 66666035      Table_map: `customers`.`pmig_contracts_contractperiod` mapped to number 832
# at 66666035
# at 66667051
# at 66668067
# at 66669083
# at 66670099
# at 66671115
# at 66672131
# at 66673147
# at 66674163
# at 66675179
# at 66676195
# at 66677211
# at 66678227
# at 66679243
# at 66680259
# at 66681275
# at 66682291
# at 66683307
...
# at 93191763
# at 93192779
# at 93193795
# at 93194811
# at 93195827
# at 93196843
# at 93197859
#170208 17:51:21 server id 1  end_log_pos 66667051      Update_rows: table id 832
#170208 17:51:21 server id 1  end_log_pos 66668067      Update_rows: table id 832
#170208 17:51:21 server id 1  end_log_pos 66669083      Update_rows: table id 832
#170208 17:51:21 server id 1  end_log_pos 66670099      Update_rows: table id 832
#170208 17:51:21 server id 1  end_log_pos 66671115      Update_rows: table id 832
#170208 17:51:21 server id 1  end_log_pos 66672131      Update_rows: table id 832
#170208 17:51:21 server id 1  end_log_pos 66673147      Update_rows: table id 832
#170208 17:51:21 server id 1  end_log_pos 66674163      Update_rows: table id 832
#170208 17:51:21 server id 1  end_log_pos 66675179      Update_rows: table id 832
...
#170208 17:51:21 server id 1  end_log_pos 93196843      Update_rows: table id 832
#170208 17:51:21 server id 1  end_log_pos 93197859      Update_rows: table id 832
#170208 17:51:21 server id 1  end_log_pos 93198671      Update_rows: table id 832 flags: STMT_END_F
...
### UPDATE `customers`.`pmig_contracts_contractperiod`
### WHERE
###   @1=5726323
###   @2=99.00
###   @3=0.00
###   @4=12
### SET
###   @1=5726323
###   @2=99.00
###   @3=8.25
###   @4=12
### UPDATE `customers`.`pmig_contracts_contractperiod`
### WHERE
###   @1=5726325
###   @2=49.00
###   @3=0.00
###   @4=24
### SET
###   @1=5726325
###   @2=49.00
###   @3=2.04
###   @4=24
# at 93198671
#170208 17:51:21 server id 1  end_log_pos 93198698      Xid = 4947268
COMMIT/*!*/;

=> Everything seems to be in the file, but not in the correct order. Slave is not complaining but does nothing

MariaDB [(none)]> show processlist;
+-----+-------------+-------------------+------+-------------+-------+--------------------------------------------------------------------------------+------------------+----------+
| Id  | User        | Host              | db   | Command     | Time  | State                                                                          | Info             | Progress |
+-----+-------------+-------------------+------+-------------+-------+--------------------------------------------------------------------------------+------------------+----------+
|   8 | repl        | 10.193.9.23:51741 | NULL | Binlog Dump | 96271 | Master has sent all binlog to slave; waiting for binlog to be updated          | NULL             |    0.000 |
| 142 | system user |                   | NULL | Connect     | 87110 | Waiting for master to send event                                               | NULL             |    0.000 |
| 340 | system user |                   | NULL | Connect     | 54207 | Waiting for prior transaction to start commit before starting next transaction | NULL             |    0.000 |
| 341 | system user |                   | NULL | Connect     | 54207 | Waiting for prior transaction to start commit before starting next transaction | NULL             |    0.000 |
| 342 | system user |                   | NULL | Connect     | 54207 | Waiting for prior transaction to start commit before starting next transaction | NULL             |    0.000 |
| 343 | system user |                   | NULL | Connect     | 54207 | Waiting for prior transaction to start commit before starting next transaction | NULL             |    0.000 |
| 344 | system user |                   | NULL | Connect     | 54207 | Waiting for prior transaction to start commit before starting next transaction | NULL             |    0.000 |
| 345 | system user |                   | NULL | Connect     | 54207 | Waiting for prior transaction to start commit before starting next transaction | NULL             |    0.000 |
| 346 | system user |                   | NULL | Connect     | 54207 | Waiting for prior transaction to start commit before starting next transaction | NULL             |    0.000 |
| 347 | system user |                   | NULL | Connect     |     5 | Update_rows_log_event::find_row(-1)                                            | NULL             |    0.000 |



 Comments   
Comment by Elena Stepanova [ 2017-02-13 ]

pmoiroux,

Sorry, could you please elaborate on "overlap between #at and end_log_pos lines"? I don't see what you mean at the first glance, maybe I'm missing something obvious.

In fact, I don't see any problem at all from the paste in the description.
Unfortunately, the excerpt from SHOW SLAVE STATUS is missing Seconds_Behind_Master, which is really important here. How long the server has spent on this position?
The event is updating over 1,5 mln rows, and since it's a row event, it basically means 1,5+ mln individual updates, so it's bound to take quite long. How long – it depends. Do you have a primary key on pmig_contracts_contractperiod? If not, it can be insanely long.

Comment by Patrick Moiroux [ 2017-02-16 ]

Hi Elena,

Thank you for your reply. I checked the code of our developer and reproduced the scenario, and this is the current status:

Seconds_Behind_Master: 48170

This is what is done on the master:

drop table if exists pmig_contracts_contractperiod;

create table pmig_contracts_contractperiod as select id, reference_recurring_charge, reference_recurring_charge_monthly, period from contracts_contractperiod; – backup the original table

update pmig_contracts_contractperiod set reference_recurring_charge_monthly = FLOOR(reference_recurring_charge/period*100)/100;

CREATE INDEX pmig_contracts_contractperiod_idx ON pmig_contracts_contractperiod (id);

And on the slave indeed it will be 921954 individual updates like this before creating the index:

      1. UPDATE `customers`.`pmig_contracts_contractperiod`
      2. WHERE
      3. @1=1002
      4. @2=60.00
      5. @3=0.00
      6. @4=24
      7. SET
      8. @1=1002
      9. @2=60.00
      10. @3=2.50
      11. @4=24

Maybe it will be enough to create the INDEX before the update ? Any other suggestion? Thanks again for your time.

Comment by Elena Stepanova [ 2017-02-24 ]

pmoiroux,

I didn't see the structure of the table, but in any case – whenever you use binlog_format=row|mixed, it's crucially important to have a PK/unique index on every table of a considerable size which receives updates, otherwise you will have horrible performance problems. In unfortunate circumstances it can indeed be many hours per update.
That's an infamous problem of row-based replication. You can find more about it, for example, in this upstream bug report. Even though it appears to be fixed, in fact the "fix" was just addition of a warning in the error log, while the problem itself still exists (to my knowledge, in all popular flavors of the server).

So yes, please try to add the PK or unique index and check if it helps.

Comment by Elena Stepanova [ 2017-03-24 ]

If you have further information on this issue, please comment and the report will be re-opened if needed.

Generated at Thu Feb 08 07:54:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.