[MDEV-587] LP:713561 - "Duplicate entry" error and time datatype cause slave provisioning to fail Created: 2011-02-05  Updated: 2013-03-08  Resolved: 2013-03-08

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Reporter: Philip Stoev (Inactive) Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: Launchpad

Attachments: XML File LPexportBug713561.xml    

 Description   

When executing an RQG test that performs non-concurrent INSERTs into various tables and then uses mysqldump to clone a new slave, the new slave diverges from the master. The slave starts properly and applies all binlog events from the master, but when the master and the slave are dumped and diffed, they are no longer identical.
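A minimal sketch of how the dump comparison could be triaged, assuming each dump holds one row per line (for example, produced with mysqldump --skip-extended-insert). The function name classify_dump_difference is hypothetical and is not part of RQG or mysqldump. It only separates "same rows, different order" from "different rows":

```python
from collections import Counter


def classify_dump_difference(master_dump: str, slave_dump: str) -> str:
    """Classify two dump texts as 'identical', 'order-only', or 'content'.

    Assumption: one statement (one row) per line in both dumps.
    """
    master_lines = master_dump.splitlines()
    slave_lines = slave_dump.splitlines()

    # Same lines in the same order: a plain diff would be empty.
    if master_lines == slave_lines:
        return "identical"

    # Same multiset of lines, different order: only row order differs.
    if Counter(master_lines) == Counter(slave_lines):
        return "order-only"

    # At least one line exists on one side only (or a different number of times).
    return "content"


master = "INSERT INTO t VALUES (1);\nINSERT INTO t VALUES (2);"
reordered = "INSERT INTO t VALUES (2);\nINSERT INTO t VALUES (1);"
changed = "INSERT INTO t VALUES (1);\nINSERT INTO t VALUES (3);"

print(classify_dump_difference(master, master))
print(classify_dump_difference(master, reordered))
print(classify_dump_difference(master, changed))
```

An "order-only" result would suggest that a plain diff is flagging row order rather than missing or altered rows, which matters most for tables without a PK.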

Some observations:

  • The situation appears to happen only when duplicate key errors are seen on the master due to the randomness of the workload.
  • The duplicate key error is reported against the table that has a non-auto-increment PK, but the diff reports that the table with no PK at all is the one that has diverged.
  • In some instances, the test reports that the slave thread has failed with the error "Slave SQL: Error 'Table 'test.table1_innodb_int_autoinc' doesn't exist' on opening tables, Error_code: 1146", even though the table does exist on the slave.
  • The issue may be related to the different InnoDB rollback rules that apply depending on the type of error: statement vs. transaction rollback.


 Comments   
Comment by Philip Stoev (Inactive) [ 2011-02-05 ]

Re: "Duplicate entry" error causes slave provisioning using binlog_snapshot_position to fail
This issue is not specific to binlog_snapshot_position; it is also observed with old-style mysqldump. It affects both MariaDB and MySQL, and it requires the use of the time datatype. It may be caused by different time zones on the master and on the slave; however, MTR operates in GMT, so it is not easy to provision a slave with exactly the same timezone configuration.

Comment by Rasmus Johansson (Inactive) [ 2011-02-10 ]

Launchpad bug id: 713561

Comment by Elena Stepanova [ 2013-03-08 ]

While there is not enough information to fully confirm it, bug MDEV-4255 provides a plausible explanation for the data discrepancy reported here.

The origin of the problem in MDEV-4255 is that when the dump is restored on the slave, the rows are written into the tables in a different order than on the master.
The CloneSlave reporter, which was probably used for the tests here, does the same.

There are some cases where the order of rows becomes important for further replication, even on 5.5 and even with MBR, which is supposed to be reasonably safe. One such case is described in MDEV-4255; other, more subtle ones were observed during replication tests.
If the issue reported here was observed on 5.1, or if SBR was used, the explanation could be even simpler, since there are many unsafe statements that might make the data diverge.
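To illustrate why row order can matter, here is a simplified model of a multi-row update on a unique column, along the lines of UPDATE t SET uk = uk + 1. It is only a sketch under stated assumptions: rows are processed in storage order, uniqueness is checked after each row, and a failure leaves the table unchanged. The function update_increment_unique is hypothetical, and this is not necessarily the exact case from MDEV-4255.

```python
def update_increment_unique(rows: list[int]) -> list[int]:
    """Simulate incrementing a unique column row by row in storage order.

    Raises ValueError on the first uniqueness violation; the input list
    is never modified, which models a statement-level rollback.
    """
    current = list(rows)
    for i in range(len(current)):
        new_value = current[i] + 1
        others = current[:i] + current[i + 1:]
        if new_value in others:
            raise ValueError(f"Duplicate entry '{new_value}'")
        current[i] = new_value
    return current


# Same rows, stored in a different order on each server (assumed).
master_rows = [2, 1]
slave_rows = [1, 2]

print(update_increment_unique(master_rows))  # succeeds on the "master"

try:
    update_increment_unique(slave_rows)
except ValueError as err:
    print("slave failed:", err)
```

In this model the same statement succeeds against one row order and fails with a duplicate entry against the other, although both servers hold the same set of rows.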

Regarding the note that the problem was specific to time types, I can only guess that it was caused by the nature of the RQG flow. Date/time fields are most often given invalid literals (like datetime_field = 'a'), so they end up with zero values ('0000-00-00 00:00:00'), which in turn leads to duplicate key errors. I also initially observed the problem described in MDEV-4255 on datetime values, although it turned out to be unrelated to the type as such.
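The effect can be sketched with a small model, assuming a non-strict SQL mode in which an invalid datetime literal is stored as the zero value, and a unique key on the datetime column. The helpers coerce_datetime and insert_all are hypothetical, and the error text only imitates the server's message:

```python
from datetime import datetime

ZERO_DATETIME = "0000-00-00 00:00:00"


def coerce_datetime(literal: str) -> str:
    """Return the literal if it parses as a datetime, else the zero value."""
    try:
        datetime.strptime(literal, "%Y-%m-%d %H:%M:%S")
        return literal
    except ValueError:
        # Assumed non-strict behavior: invalid input becomes the zero value.
        return ZERO_DATETIME


def insert_all(literals: list[str]) -> tuple[set[str], list[str]]:
    """Insert literals into a modeled unique datetime key; collect errors."""
    stored: set[str] = set()
    errors: list[str] = []
    for literal in literals:
        value = coerce_datetime(literal)
        if value in stored:
            errors.append(f"Duplicate entry '{value}' for key 'PRIMARY'")
        else:
            stored.add(value)
    return stored, errors


stored, errors = insert_all(["2011-02-05 10:00:00", "a", "b"])
print(sorted(stored))
print(errors)
```

Two different invalid literals ('a' and 'b') collapse to the same zero value, so the second insert hits a duplicate key even though the inputs differ.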

There may have been other reasons as well, and the theory above doesn't explain the "table doesn't exist" errors mentioned here, but it's impossible to analyze further based on the description alone.
