[MDEV-381] Data inconsistency (missing rows, extra rows, different rows) between a recovered server and a restored binary log after an imitated HW crash Created: 2012-07-05  Updated: 2023-08-15  Resolved: 2012-09-03

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.25, 5.3.7, 5.2.12, 5.1.62
Fix Version/s: 5.5.27

Type: Bug Priority: Critical
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: None
Environment:

Windows 7 runs Oracle VirtualBox VM with Ubuntu 11.10 oneiric inside. The oneiric runs kvm VM based on vm-lucid-amd64-tarbake.qcow2.


Attachments: File data-and-logs.tar.gz     File ext4bug-minimal.c    

 Description   

The test was run on 5.5-mdev232, and the data and logs are from there, but I observe the same problem on the main 5.5 tree also.

Test flow:

  • started kvm as

    runvm \
    --port=2201 \
    --user=elenst \
    --smp=1 \
    --mem=2048 \
    --startup-timeout=600 \
    --shutdown-timeout=300 \
    --cpu=qemu64 \
    --base-image=<base image> \
    <new image> \
    bash

  • inside the VM, started MariaDB server as

    5.5-mdev232/sql/mysqld \
    --no-defaults \
    --basedir=<basedir> \
    --datadir=data \
    --lc-messages-dir=<basedir>/sql/share \ 
    --character-sets-dir=<basedir>/sql/share/charsets \
    --core-file \
    --max-allowed-packet=128Mb \
    --port=19300 \
    --socket=<socket file> \
    --pid-file=<pidfile> \
    --general-log \
    --general-log-file=<general log> \
    --sql-mode=no_engine_substitution \
    --log-bin=master-bin \
    --binlog-format=mixed \
    --sync-binlog=1

  • created test tables (InnoDB), and for a few minutes ran one-thread DML flow (insert, update, delete, start transaction, commit, rollback)
    Note: The thread sets SQL_SAFE_UPDATES=1 at the beginning;
  • while the flow was still running, killed the VM with SIGKILL;
  • stored a copy of the datadir and the binary log(s);
  • started server on the same datadir as

    mysqld \
    --no-defaults \
    --basedir=<basedir> \
    --datadir=<datadir> \
    --lc-messages-dir=<basedir>/sql/share \
    --character-sets-dir=<basedir>/sql/share/charsets \
    --core-file \
    --max-allowed-packet=128Mb \
    --port=19300 \
    --socket=<socket> \
    --pid-file=<pidfile> \
    --general-log \
    --general-log-file=<general log> \
    --sql-mode=no_engine_substitution \
    --log-bin=master-bin \
    --binlog-format=mixed \
    --sync-binlog=1 \
    --myisam-recover-options=FORCE \
    --aria-recover=FORCE

  • ran mysqldump on the server;
  • shutdown the server gracefully;
  • started a new server on an empty datadir, with the same parameters;
  • fed the binary logs from the previous server to the new one;
  • ran mysqldump on the server;
  • sorted and compared two dumps

Here is the result of comparison (the first server is the recovered one, the second server is the one populated from the binlog):

@@ -30,7 +30,6 @@
 INSERT INTO `AA` VALUES (129,NULL,NULL,'0000-00-00','0000-00-00',NULL,NULL,NULL,NULL,NULL,NULL);
 INSERT INTO `AA` VALUES (130,66,NULL,NULL,NULL,NULL,NULL,NULL,'0000-00-00 00:00:00',NULL,NULL);
 INSERT INTO `AA` VALUES (132,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'0000-00-00 00:00:00',NULL,'6');
-INSERT INTO `AA` VALUES (133,NULL,NULL,NULL,NULL,NULL,'00:00:56','0000-00-00 00:00:00',NULL,NULL,NULL);
 INSERT INTO `AA` VALUES (15,NULL,NULL,'0000-00-00',NULL,NULL,NULL,NULL,NULL,NULL,NULL);
 INSERT INTO `AA` VALUES (18,6,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'7');
 INSERT INTO `AA` VALUES (20,NULL,NULL,NULL,'0000-00-00',NULL,'00:00:07',NULL,NULL,NULL,NULL);
@@ -103,7 +102,6 @@
 INSERT INTO `A` VALUES (93,9,NULL,'0000-00-00',NULL,NULL,NULL,NULL,NULL,NULL,NULL);
 INSERT INTO `A` VALUES (94,6,NULL,NULL,NULL,NULL,NULL,'0000-00-00 00:00:00',NULL,NULL,NULL);
 INSERT INTO `A` VALUES (96,NULL,NULL,NULL,NULL,'00:00:11',NULL,NULL,NULL,'1',NULL);
-INSERT INTO `A` VALUES (97,NULL,NULL,NULL,NULL,NULL,'00:00:09','0000-00-00 00:00:00',NULL,NULL,NULL);
 INSERT INTO `BB` VALUES (10,NULL,NULL,NULL,'0000-00-00',NULL,NULL,NULL,NULL,NULL,NULL);
 INSERT INTO `BB` VALUES (20,26,NULL,NULL,NULL,'00:00:36',NULL,NULL,NULL,NULL,NULL);
 INSERT INTO `BB` VALUES (21,7,NULL,NULL,'0000-00-00',NULL,NULL,NULL,NULL,NULL,NULL);
@@ -137,6 +135,7 @@
 INSERT INTO `BB` VALUES (72,NULL,14,NULL,NULL,NULL,NULL,NULL,NULL,'6',NULL);
 INSERT INTO `BB` VALUES (73,NULL,NULL,'0000-00-00',NULL,'00:00:06',NULL,NULL,NULL,NULL,NULL);
 INSERT INTO `B` VALUES (104,6,NULL,NULL,'0000-00-00',NULL,NULL,NULL,NULL,NULL,NULL);
+INSERT INTO `B` VALUES (110,NULL,NULL,'0000-00-00','0000-00-00',NULL,NULL,NULL,NULL,NULL,NULL);
 INSERT INTO `B` VALUES (112,6,NULL,'0000-00-00',NULL,'00:00:06',NULL,NULL,NULL,NULL,NULL);
 INSERT INTO `B` VALUES (113,NULL,NULL,'0000-00-00',NULL,NULL,NULL,NULL,NULL,'5',NULL);
 INSERT INTO `B` VALUES (1,1,7,'1900-01-01','1900-01-01',NULL,NULL,'2001-11-04 19:07:55','2001-11-04 19:07:55','x','x');
@@ -204,7 +203,7 @@
 INSERT INTO `CC` VALUES (65,NULL,NULL,NULL,NULL,'00:00:00',NULL,'0000-00-00 00:00:00',NULL,NULL,NULL);
 INSERT INTO `CC` VALUES (68,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'0000-00-00 00:00:00',NULL,'6');
 INSERT INTO `CC` VALUES (69,16,56,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL);
-INSERT INTO `CC` VALUES (6,NULL,NULL,NULL,NULL,NULL,NULL,'0000-00-00 00:00:00','0000-00-00 00:00:00',NULL,NULL);
+INSERT INTO `CC` VALUES (6,NULL,NULL,NULL,NULL,NULL,NULL,'0000-00-00 00:00:00',NULL,NULL,NULL);
 INSERT INTO `CC` VALUES (71,NULL,NULL,NULL,NULL,'00:00:09','00:00:00',NULL,NULL,NULL,NULL);
 INSERT INTO `CC` VALUES (73,NULL,NULL,NULL,'0000-00-00',NULL,NULL,NULL,NULL,NULL,'1');
 INSERT INTO `CC` VALUES (75,76,NULL,NULL,NULL,NULL,NULL,'0000-00-00 00:00:00',NULL,NULL,NULL);
@@ -345,13 +344,13 @@
 INSERT INTO `DD` VALUES (184,NULL,NULL,'0000-00-00',NULL,NULL,NULL,'0000-00-00 00:00:00',NULL,NULL,NULL);
 INSERT INTO `DD` VALUES (18,5,2,'2005-04-16','2005-04-16','09:21:51','09:21:51','2001-11-15 10:51:46','2001-11-15 10:51:46','v','v');
 INSERT INTO `DD` VALUES (185,NULL,NULL,'0000-00-00',NULL,NULL,'00:00:07',NULL,NULL,NULL,NULL);
-INSERT INTO `DD` VALUES (186,NULL,16,NULL,NULL,NULL,NULL,NULL,NULL,'6',NULL);
 INSERT INTO `DD` VALUES (19,163,103,'2007-05-12','2007-05-12','09:02:21','09:02:21','1900-01-01 00:00:00','1900-01-01 00:00:00','f','f');
 INSERT INTO `DD` VALUES (20,2,3,'2000-07-20','2000-07-20','22:40:03','22:40:03','2000-10-20 17:09:41','2000-10-20 17:09:41','q','q');
 INSERT INTO `DD` VALUES (22,0,2,'2004-08-04','2004-08-04',NULL,NULL,'2008-03-06 19:40:18','2008-03-06 19:40:18','a','a');
 INSERT INTO `DD` VALUES (23,9,7,'2002-06-25','2002-06-25','01:39:40','01:39:40','1900-01-01 00:00:00','0000-00-00 00:00:00','w','w');
 INSERT INTO `DD` VALUES (24,3,2,'2004-04-08','2004-04-08','18:14:11','18:14:11','2004-03-03 15:06:11','2004-03-03 15:06:11','e','e');
 INSERT INTO `DD` VALUES (25,1,2,'2004-12-20','2004-12-20','03:28:55','03:28:55','2009-06-03 03:14:33','2009-06-03 03:14:33',NULL,NULL);
+INSERT INTO `DD` VALUES (29,46,8,'2007-09-24','2007-09-24','00:00:00','00:00:00','2009-07-24 13:01:33','2009-07-24 13:01:33','k','6');
 INSERT INTO `DD` VALUES (30,9,4,'2000-05-07','2000-05-07','09:51:13','09:51:13','2005-04-14 03:38:00','2005-04-14 03:38:00',NULL,NULL);
 INSERT INTO `DD` VALUES (31,8,2,'2002-07-15','2002-07-15','11:25:28','11:25:28',NULL,NULL,'j','j');
 INSERT INTO `DD` VALUES (32,1,8,'2005-09-09','2005-09-09',NULL,NULL,'2002-07-16 03:25:43','2002-07-16 03:25:43','t','t');
@@ -518,6 +517,7 @@
 INSERT INTO `D` VALUES (67,7,8,NULL,NULL,'00:00:00','00:00:00','2003-12-17 21:47:57','2003-12-17 21:47:57',NULL,NULL);
 INSERT INTO `D` VALUES (68,2,5,'2007-02-07','2007-02-07','02:59:31','02:59:31',NULL,NULL,NULL,NULL);
 INSERT INTO `D` VALUES (69,1,8,'2005-01-22','2005-01-22','09:04:17','09:04:17','1900-01-01 00:00:00','1900-01-01 00:00:00','h','h');
+INSERT INTO `D` VALUES (6,NULL,16,NULL,NULL,NULL,NULL,NULL,NULL,NULL,'9');
 INSERT INTO `D` VALUES (70,9,2,'2006-04-25','2006-04-25','19:44:22','19:44:22',NULL,NULL,'k','k');
 INSERT INTO `D` VALUES (71,5,9,'2002-11-13','2002-11-13','00:00:00','00:00:00','2009-06-08 21:21:31','2009-06-08 21:21:31','k','k');
 INSERT INTO `D` VALUES (72,4,7,'2007-10-26','2007-10-26','00:00:00','00:00:00','2008-05-25 23:36:30','2008-05-25 23:36:30',NULL,NULL);

The last steps (recovery, feeding the binlog and comparison) can be reproduced with the provided dataset; the difference is still there.
Here is the recovery log:

120704 20:07:03 InnoDB: Initializing buffer pool, size = 128.0M
120704 20:07:04 InnoDB: Completed initialization of buffer pool
120704 20:07:04 InnoDB: highest supported file format is Barracuda.
InnoDB: Log scan progressed past the checkpoint lsn 2858763
120704 20:07:04  InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Doing recovery: scanned up to log sequence number 2877127
InnoDB: Transaction C1F was in the XA prepared state.
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 0 row operations to undo
InnoDB: Trx id counter is E00
120704 20:07:07  InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
 97 98 99 
InnoDB: Apply batch completed
InnoDB: Last MySQL binlog file position 0 568642, file name ./master-bin.000001
InnoDB: Starting in background the rollback of uncommitted transactions
120704 20:07:08  InnoDB: Rollback of non-prepared transactions completed
120704 20:07:08  InnoDB: Waiting for the background threads to start
120704 20:07:09 Percona XtraDB (http://www.percona.com) 1.1.8-26.0 started; log sequence number 2877127
120704 20:07:12 [Note] Recovering after a crash using master-bin
120704 20:07:14 [Note] Starting crash recovery...
120704 20:07:14  InnoDB: Starting recovery for XA transactions...
120704 20:07:14  InnoDB: Transaction C1F in prepared state after recovery
120704 20:07:14  InnoDB: Transaction contains changes to 1 rows
120704 20:07:14  InnoDB: 1 transactions in prepared state after recovery
120704 20:07:14 [Note] Found 1 prepared transaction(s) in InnoDB
120704 20:07:14 [Note] rollback xid 'MySQLXid\1\0\0\0\0\0\0\0\334\30\0\0\0\0\0\0'
120704 20:07:14 [Note] Crash recovery finished.
120704 20:07:15 [Note] Event Scheduler: Loaded 0 events
120704 20:07:15 [Note] sql/mysqld: ready for connections.

Binlog reading is uneventful (no errors on the way).

The attached archive contains the following:

mysql.err - server error log from the server start and till the VM crash;
mysql.log - server general log from the server start and till the VM crash;
data/master-bin.index and data/master-bin.000001 - binary log from the server start and up to the VM crash;
data/*, data/*/* - usual datadir contents as of the moment of the VM crash.

Test tables are in the `test` schema.



 Comments   
Comment by Elena Stepanova [ 2012-07-05 ]

Attached file contains the datadir, binary log, general and error logs of the server after the VM crash

Comment by Kristian Nielsen [ 2012-08-15 ]

Ok, I got some results now, this definitely looks like a real problem.

The problem seems to be that the fdatasync() call for the binlog does not
correctly flush the new file size, though it does flush the actual data.

On the host side, I see alternate writes to binlog, innodb redo log, binlog,
innodb redo log. The writes to update the ext4 metadata to note the file has
increased are missing.

So this is why we get inconsistency after crash recovery - ext4 has dropped
the last few transactions from the end of the binlog file.

I now need to figure out if this is a bug (misconfiguration?) in mysqld, in
ext4, or in kvm. Though according to `man fdatasync` I don't see any bug in
mysqld:

fdatasync() is similar to fsync(), but does not flush modified metadata
unless that metadata is needed in order to allow a subsequent data
retrieval to be correctly handled. For example, changes to st_atime or
st_mtime (respectively, time of last access and time of last modifica‐
tion; see stat(2)) do not require flushing because they are not neces‐
sary for a subsequent data read to be handled correctly. On the other
hand, a change to the file size (st_size, as made by say ftruncate(2)),
would require a metadata flush.

The mysqld does do an fdatasync, which according to the above should flush
the change in file size.

Even if this is some misconfiguration in kvm, we still need to document it
very clearly, as lots of people these days are running their servers
virtualised.

I will investigate more ...

Comment by Kristian Nielsen [ 2012-08-15 ]

If I change mysqld to use fsync() for binlog instead of fdatasync(), the problem (mostly) goes away - which seems to be contrary to the man page. Also, straces seem to show some problem still remains, though it's harder to reproduce. Will investigate further ...

Comment by Kristian Nielsen [ 2012-08-16 ]

I created a small test program that reproduces the issue.
I will attach, the core is this:

memset(buf2, 0, sizeof(buf2));
for (seq = 0; seq < COUNT; ++seq)

{ snprintf(buf, sizeof(buf), "%7d\n", seq); write(append_fd, buf, RECSIZE); if (use_fdatasync) fdatasync(append_fd) else fsync(append_fd); memcpy(buf2 + (seq * RECSIZE) % BLOCKSIZE, buf, RECSIZE); pwrite(inplace_fd, buf2, BLOCKSIZE, 0); fdatasync(inplace_fd); }

With ext4 and ext3, this loses entries in the append_fd file in case of crash,
when using fdatasync() but not when using fsync().

XFS does not loose entries.

This seems in violation with fdatasync(2) man page. But even if not, it still
looks like a bug, as strace of kvm shows that ext4 flushes the data page for
each fdatasync() but does not flush the metadata for the increase in length -
that just does not make sense.

To test, run the program, then crash the machine hard (power
failure). Remember to use `hdparm -W 0` so disks do real syncs.

Then after machine comes back up, test that files written by the test program
are in sync:

    tail -1 append.log
    perl -lne '$x = $_ if $_ > $x; END {print $x}' inplace.log

The files must have the same number, or append.log can be one ahead of
inplace.log. When the problem shows, append.log is behind inplace.log.

Comment by Kristian Nielsen [ 2012-08-27 ]

I implemented a fix/work-around for this.
The patch makes binlog always use fsync(), not fdatasync().
I think this is a good fix, as it is not likely to be any slower when a file
is extended (metadata needs flushing anyway). And it avoids complicated checks
for if we are running on buggy ext3/ext4.

The patch is pushed to lp:~maria-captains/maria/5.3-knielsen

Comment by Raghavendra Prabhu [ 2012-08-27 ]

I tested with latest linux kernel on kvm with following settings:

-append "root=/dev/vda1 "
-balloon virtio
-chardev socket,path=/qemu/vc/Fsync,server,nowait,id=Fsync
-cpu host
-daemonize
-device virtconsole,chardev=Fsync,name=org.fedoraproject.console.Fsync
-device virtio-serial
-drive file=/qemu/Images/Fsync.img,if=virtio,cache=writeback
-enable-kvm
-initrd /boot/initramfs-linux.img
-kernel /qemu/kvmImage
-m 768
-monitor unix:/qemu/socket/archie.sock,server,nowait
-name Fsync
-net nic,model=virtio,vlan=1,macaddr=DE:AD:BE:EF:68:75
-net vde,vlan=1
-nographic
-serial file:/qemu/logs/console.log
-smp 2,cores=1
-watchdog ib700
-watchdog-action poweroff

I can confirm this. XFS seems to be good but ext4 shows disparity even with nodelalloc. I didn't test with O_DIRECT though, wonder if that makes any difference.

uname -a
Linux archie 3.6.0-rc2-DEB #175 SMP PREEMPT Fri Aug 17 05:00:46 IST 2012 x86_64 GNU/Linux

Comment by Kristian Nielsen [ 2012-08-30 ]

I pushed a work-around to 5.3.
I'd still like to report the ext3/4 bug upstream before closing this issue

Comment by Kristian Nielsen [ 2012-09-03 ]

I reported the ext3/ext4 bug upstream:

http://lkml.indiana.edu/hypermail/linux/kernel/1209.0/00437.html

Comment by Kristian Nielsen [ 2012-09-03 ]

Patch is pushed to 5.3 and merged to 5.5

Comment by Elena Stepanova [ 2012-09-03 ]

I ran several single-thread and multi-thread tests on maria/5.3 revno 3573 and maria/5.5 revno 3520, both versions looked good so far (no failures). I'll run more later, but it will be out of scope of this bug report, I consider the workaround working for the provided test scenario.

Comment by Kristian Nielsen [ 2012-09-04 ]

ext3 maintainer Jan Kara confirmed the bug in ext3/ext4, and proposed a patch:

http://www.spinics.net/lists/linux-ext4/msg33716.html
http://www.spinics.net/lists/linux-ext4/msg33717.html

Comment by Kristian Nielsen [ 2012-09-18 ]

Looks like this will be fixed in older kernels back at least to 3.0:

https://patchwork.kernel.org/patch/1448361/

Comment by Kristian Nielsen [ 2023-08-10 ]

I now pushed to 10.4 the removal of this work-around. The bug was fixed in the Linux kernel since long ago.

Comment by Marko Mäkelä [ 2023-08-15 ]

The removal of the MY_SYNC_FILESIZE flag will be part of the next quarterly releases (MariaDB Server 10.4.32, 10.5.22, and so on) in about 3 months.

Generated at Thu Feb 08 06:28:19 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.