[MDEV-16685] Replication data drift Created: 2018-07-04  Updated: 2022-10-08  Resolved: 2022-10-08

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.26, 10.0.30, 10.0.33
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Gabriel Garrido (Inactive) Assignee: Andrei Elkin
Resolution: Incomplete Votes: 1
Labels: None
Environment:

CentOS 7.2-7.4
hardware (32 cores, 128Gbs ram, 4.4Tb of data on ssd raid)
10Gbps lan



 Description   

We've configured 4 mariadb 10.0.xx slaves, the master also has the same major-minor version and about the same hardware (32 cores, 128Gbs ram, 4.4Tb of data on ssd raid), everything was running fine until we noticed some data was missing in the slaves (that's when we read all over data drift in mysql/mariadb replication), after reading thousands of articles and keep trying to find what happened we noticed that the data is missing (in the slaves) but the schemas are being updated, for example if I created a new db with some tables it will show up in the slaves but not the data that I could potentially insert, the same it's been happening to the rest of the databases/tables (and there are thousands), however we see that no errors are being displayed with `SHOW SLAVE STATUS\G`, but the replication is partially working, we were analyzing the files using mysqlbinlog in the slaves to get to the conclusion that data wasn't being sent to the slaves, as there is a lot of data it's a big effort to reconfigure/copy and start one slave again (also knowing that it could get into the same issue in a few weeks)

Here are some config parameters of interest:

Master Config:

    [mariadb]
    
    #InnoDB
    innodb_buffer_pool_populate     = 1
    innodb_buffer_pool_size         = 95G
    innodb_file_per_table           = 1
    innodb_corrupt_table_action     = warn
    innodb_support_xa               = 1
    innodb_open_files               = 100000
    innodb_flush_neighbors          = 0
    innodb_write_io_threads         = 64
    innodb_log_file_size            = 1G
    innodb_log_files_in_group       = 4
    innodb_log_buffer_size          = 128M
    innodb_flush_log_at_trx_commit  = 0
    innodb_flush_method             = O_DIRECT
    innodb_io_capacity              = 2000
    innodb_buffer_pool_instances    = 8
    innodb_stats_on_metadata        = 0
    
    wait_timeout                    = 30
 
    datadir                         = /var/lib/mysql/data
    tmpdir                          = /var/lib/mysql/tmp
    socket                          = /var/lib/mysql/mysql.sock
    skip-name-resolve
    sql-mode                        = NO_ENGINE_SUBSTITUTION
    open_files_limit                = 100000
    max_connections                 = 1300
    
    #Disable query cache
    query-cache-size                = 0M
    query-cache-limit               = 1M
    
    # Recommended by Percona
    table_open_cache                = 8192
    thread_cache_size               = 256
    
    ## Replication
    server-id                       = 12345
    log-bin                         = /var/lib/mysqllogs/master-bin-log
    relay-log                       = /var/lib/mysqllogs/master-relay-log
    relay-log-space-limit           = 16G
    slave_parallel_threads          = 8
    read-only                       = 0
    log-slave-updates               = 1
    binlog-format                   = ROW
    skip_slave_start
    
    ## Logging
    log-output                      = FILE
    slow-query-log                  = 1
    slow-query-log-file             = /var/lib/mysql/slow-log
    log-slow-slave-statements
    long-query-time                 = 2

Slave config (only different values, skipping server-id, log-bin, relay-log)

    read-only                       = 1
    sync-binlog                     = 1
    slave_skip_errors               = 1062

Master status (Note that the master status was added several hours after the slave status):

    MariaDB [(none)]> show master status;
    +--------------------+----------+--------------+------------------+
    | File               | Position | Binlog_Do_DB | Binlog_Ignore_DB |
    +--------------------+----------+--------------+------------------+
    | master-bin-log.021135 | 13151964 |              |                  |
    +--------------------+----------+--------------+------------------+
    1 row in set (0.00 sec)

Slave status:

    MariaDB [(none)]> show slave status\G
    *************************** 1. row ***************************
                   Slave_IO_State: Waiting for master to send event
                      Master_Host: 10.10.10.1
                      Master_User: replusr
                      Master_Port: 3306
                    Connect_Retry: 60
                  Master_Log_File: master-bin-log.021114
              Read_Master_Log_Pos: 711412247
                   Relay_Log_File: slave1-relay-log.000044
                    Relay_Log_Pos: 711412536
            Relay_Master_Log_File: master-bin-log.021114
                 Slave_IO_Running: Yes
                Slave_SQL_Running: Yes
                  Replicate_Do_DB: 
              Replicate_Ignore_DB: 
               Replicate_Do_Table: 
           Replicate_Ignore_Table: 
          Replicate_Wild_Do_Table: 
      Replicate_Wild_Ignore_Table: 
                       Last_Errno: 0
                       Last_Error: 
                     Skip_Counter: 0
              Exec_Master_Log_Pos: 711412247
                  Relay_Log_Space: 711412876
                  Until_Condition: None
                   Until_Log_File: 
                    Until_Log_Pos: 0
               Master_SSL_Allowed: No
               Master_SSL_CA_File: 
               Master_SSL_CA_Path: 
                  Master_SSL_Cert: 
                Master_SSL_Cipher: 
                   Master_SSL_Key: 
            Seconds_Behind_Master: 0
    Master_SSL_Verify_Server_Cert: No
                    Last_IO_Errno: 0
                    Last_IO_Error: 
                   Last_SQL_Errno: 0
                   Last_SQL_Error: 
      Replicate_Ignore_Server_Ids: 
                 Master_Server_Id: 95724
                   Master_SSL_Crl: 
               Master_SSL_Crlpath: 
                       Using_Gtid: Slave_Pos
                      Gtid_IO_Pos: 0-95724-49194074113
    1 row in set (0.00 sec)

We also investigated percona tools like pt-table-checksum and pt-table-sync but they are useless if the data is not getting inserted in the binlog, it could be a bug?

The config for the slaves is similar to what's displayed here, so my questions are:

Why mysql keeps replicating schemas but not data?
Is there any way to fix the replication without having to start over again?
Any other idea?

Thanks.



 Comments   
Comment by Elena Stepanova [ 2018-07-04 ]

You said you looked at the slaves using mysqlbinlog and discovered that the data wasn't sent to the slaves.
Did you look into the master binlog, to see if the data was written to the binary log at all?

Comment by Gabriel Garrido (Inactive) [ 2018-07-04 ]

This is from the master's binlog and the insert it's present, what makes this issue so odd is that the schemas are being updated in the slaves, but not the data. Below you can see the extracted insert from the master (that was a test db created while checking this issue):

180703 16:16:22 server id 95724  end_log_pos 597993092         GTID 0-95724-49190481089
/*!100001 SET @@session.gtid_seq_no=49190481089*//*!*/;
BEGIN
/*!*/;
# at 597993092
# at 597993139
#180703 16:16:22 server id 95724  end_log_pos 597993139         Table_map: `test`.`dummy` mapped to number 4871549419
#180703 16:16:22 server id 95724  end_log_pos 597993238         Write_rows: table id 4871549419 flags: STMT_END_F
### INSERT INTO `test`.`dummy`
### SET
###   @1=1 /* INT meta=0 nullable=0 is_null=0 */
###   @2='a' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=2 /* INT meta=0 nullable=0 is_null=0 */
###   @2='b' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=3 /* INT meta=0 nullable=0 is_null=0 */
###   @2='c' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=4 /* INT meta=0 nullable=0 is_null=0 */
###   @2='d' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=5 /* INT meta=0 nullable=0 is_null=0 */
###   @2='e' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=6 /* INT meta=0 nullable=0 is_null=0 */
###   @2='f' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=7 /* INT meta=0 nullable=0 is_null=0 */
###   @2='g' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=8 /* INT meta=0 nullable=0 is_null=0 */
###   @2='h' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=9 /* INT meta=0 nullable=0 is_null=0 */
###   @2='i' /* STRING(5) meta=65029 nullable=1 is_null=0 */
### INSERT INTO `test`.`dummy`
### SET
###   @1=10 /* INT meta=0 nullable=0 is_null=0 */
###   @2='j' /* STRING(5) meta=65029 nullable=1 is_null=0 */
# at 597993238
#180703 16:16:22 server id 95724  end_log_pos 597993265         Xid = 60516562879
COMMIT/*!*/;

Comment by Gabriel Garrido (Inactive) [ 2018-07-05 ]

Let me know if you need anything else from me, I should be able to provide any additional information if needed.

Comment by Elena Stepanova [ 2018-07-05 ]

Thanks.
Next step, when you said you looked at the slaves using mysqlbinlog, did you mean slave's binary log, or relay log? If you meant slave's binary log, could you also check the relay log?

Comment by Gabriel Garrido (Inactive) [ 2018-07-05 ]

I just tested it and dumped the relay log and it is present in the relay log, in the previous comments I was only talking about the binary log.

This is from the relay log in the slave:

# at 711212911
#180705 10:48:55 server id 95724  end_log_pos 711212660         GTID 0-95724-49248341051
/*!100001 SET @@session.gtid_seq_no=49248341051*//*!*/;
BEGIN
/*!*/;
# at 711212949
# at 711212996
#180705 10:48:55 server id 95724  end_log_pos 711212707         Table_map: `test`.`dummy` mapped to number 4982198060
#180705 10:48:55 server id 95724  end_log_pos 711212743         Write_rows: table id 4982198060 flags: STMT_END_F
### INSERT INTO `test`.`dummy`
### SET
###   @1=13
###   @2='m'
# at 711213032
#180705 10:48:55 server id 95724  end_log_pos 711212770         Xid = 62156308933
COMMIT/*!*/;

Thanks.

Comment by Gabriel Garrido (Inactive) [ 2018-07-10 ]

Last night we stopped the master server, and all 4 slaves, after starting the master again we started 2 slaves which throwed this error:

Last_Error: Could not execute Update_rows_v1 event on table example_db.orders; Can't find record in 'orders', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin-log.021343, end_log_pos 10929

After that, as a test I added slave_skip_errors = 1032 in the other 2 slaves and started the slaves, these 2 slaves started to sync correctly and also the data started to get inserted in the tables (only new data), of course this is a totally corrupted slave at this point, but it took a master restart and a slave_skip_errors = 1032 param to recover from the no data issue thing, the problem here is the silent replication break, if we could at least tell when it breaks without having to implement any custom script or checking on a daily basis. And also as we use binlog-format = ROW we cannot make use of pt-table-checksum and pt-table-sync to resync in such case.

Thanks.

Comment by Gabriel Garrido (Inactive) [ 2018-12-28 ]

Hello, it's been a while since this was reported and we were wondering if the root cause of the issue was found, or if you were able to replicate it?, if there is any more information that I can provide please just ask me, We still live with this bug .

Thank you and Happy holidays.

Comment by Andrei Elkin [ 2021-03-23 ]

gabriel.garrido: Perhaps it is already long time irrelevant, but it looks your issue related to the fact the slaves were configured to ignored duplicate errors - 1062. It's a normal thing to do, consequences of such configuration may generally lead to data loss on slave.

Generated at Thu Feb 08 08:30:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.