Details
Task
Status: Closed
Trivial
Resolution: Fixed
Description
Overview
Crash recovery is a crucial part of database operation. We have it for
InnoDB/XtraDB, for binlog, for Aria, and other places.
Such crash recovery relies on fsync()/fdatasync()/O_DIRECT to know when data
is durably stored on disk. It is crucial for correct operation that fsync
happens in all the right places.
Unfortunately testing for this is almost entirely missing. The problem is that
fsync is really needed only when the kernel crashes (bug or power or hardware
failure). Such conditions are very inconvenient to include in an automatic
test suite. Normally, just `kill -9` is used. However, `kill -9` does not
lose any data written with write() calls but not fsync'ed, while a kernel
crash does lose such data.
This task is to implement an LD_PRELOAD .so library that will help test
correct fsync placement in automated test suites, without having to crash the
kernel.
The idea is to artificially simulate the loss of data that has been written
with write(), but not synced with fsync(). An LD_PRELOAD .so library will
override the write() and fsync() syscalls. There will be some way to inject a
fault, like a signal or setting a global variable. When triggered, the
LD_PRELOAD will make write() calls do nothing, or perhaps X% of write calls do
nothing. Then when the first fsync() is encountered, it will abort() to crash
the process. (Until triggered, all calls work as normal).
This simulates that a kernel crash causes some data from write() to never
reach the disk. It is crucial to crash at the first fsync, as otherwise we may
lose write()s that would not have been lost in a real-life situation.
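The basic mechanism could be sketched as below. This is a minimal sketch, not a finished design: it assumes a glibc/Linux system where dlsym(RTLD_NEXT, ...) finds the libc function, it drops every write() after the trigger regardless of file type, and it ignores O_DIRECT and read-back. The helper next_symbol() and the typedefs write_fn and fsync_fn are names made up for this illustration. The trigger variable name is the one proposed under Operation.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for RTLD_NEXT on glibc */
#endif
#include <assert.h>
#include <dlfcn.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Trigger flag: a test sets this to 1 to start injecting faults. */
volatile int __fsync_test_crash_injection = 0;

typedef ssize_t (*write_fn)(int, const void *, size_t);
typedef int (*fsync_fn)(int);

/* Hypothetical helper: find the next definition of a symbol after this
   object in the lookup order, which is normally the libc one. */
static void *next_symbol(const char *name)
{
    void *sym = dlsym(RTLD_NEXT, name);
    assert(sym != NULL);
    return sym;
}

ssize_t write(int fd, const void *buf, size_t count)
{
    static write_fn real_write;
    if (!real_write)
        real_write = (write_fn)next_symbol("write");
    if (__fsync_test_crash_injection)
        return (ssize_t)count;  /* report success, but drop the data */
    return real_write(fd, buf, count);
}

int fsync(int fd)
{
    static fsync_fn real_fsync;
    if (!real_fsync)
        real_fsync = (fsync_fn)next_symbol("fsync");
    if (__fsync_test_crash_injection)
        abort();                /* simulated kernel crash at first fsync */
    return real_fsync(fd);
}
```

Built as a shared object (for example with `gcc -shared -fPIC`, plus `-ldl` on older glibc) and listed in LD_PRELOAD, these definitions would take precedence over the libc ones for calls made by the target process.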
Details
There are a number of details to get right to get correct behaviour of such
LD_PRELOAD.
In practice, there are more system calls to wrap than just write() and
fsync(). Here is an attempt at a full list. We attempt to be reasonably
complete, but it is enough to cover what current mysqld uses.
- fsync()
- fdatasync(). In fact, we might want to test that we always use
fdatasync() in favour of fsync(), as the former should normally be
sufficient and can be much faster (I have measured 2.5 times faster on one
system). So we could make fsync() wrapper assert unconditionally.
- fcntl(). We need to know which fd's are opened with O_DIRECT. For such fds,
we must not ignore the write() call (however we might delay it for some
small random time in an attempt to expose a bug). We might also want to
crash after some (random?) N write() calls to O_DIRECT files, as when
using O_DIRECT, fsync may not be used at all by the process to be tested.
- open() - O_DIRECT can also be set in open(2).
- close() - to know when to de-allocate our state when target process closes
a file.
- dup() and dup2() and dup3(), to keep track of opened files.
- write()
- pwrite()
- writev()
- pwritev()
- read(), as if we skip a write() call, we still need to return the data that
would have been written to any subsequent read() call. Otherwise we may
cause spurious failures for the target application.
- pread()
- readv()
- preadv()
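As an illustration of the bookkeeping that the open(), fcntl(), dup() and close() wrappers would feed, here is a small sketch of an O_DIRECT table. All names (track_open, track_setfl, track_dup, track_close, fd_is_direct, MAX_TRACKED_FDS) are hypothetical, and the fixed-size array is a simplification of the per-fd state hash. The wrappers are assumed to call these functions only after the real system call has succeeded.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* glibc exposes O_DIRECT only with this */
#endif
#include <assert.h>
#include <fcntl.h>

#define MAX_TRACKED_FDS 1024    /* hypothetical limit for this sketch */

static unsigned char fd_direct[MAX_TRACKED_FDS];

static int fd_in_range(int fd)
{
    return fd >= 0 && fd < MAX_TRACKED_FDS;
}

/* Called from the open() wrapper after a successful open. */
void track_open(int fd, int flags)
{
    assert(fd >= 0);            /* only successful calls are tracked */
    if (fd_in_range(fd))
        fd_direct[fd] = (flags & O_DIRECT) != 0;
}

/* Called from the fcntl() wrapper after a successful F_SETFL. */
void track_setfl(int fd, int flags)
{
    track_open(fd, flags);
}

/* Called from the dup(), dup2() and dup3() wrappers: the new fd
   starts with the same O_DIRECT status as the old one. */
void track_dup(int oldfd, int newfd)
{
    if (fd_in_range(oldfd) && fd_in_range(newfd))
        fd_direct[newfd] = fd_direct[oldfd];
}

/* Called from the close() wrapper. */
void track_close(int fd)
{
    if (fd_in_range(fd))
        fd_direct[fd] = 0;
}

/* Queried by the write() family of wrappers. */
int fd_is_direct(int fd)
{
    return fd_in_range(fd) && fd_direct[fd];
}
```

One caveat with this simplification: duplicated descriptors share their file status flags, so an F_SETFL through one of them also changes O_DIRECT for the other. A per-fd table like this one does not capture that, and a fuller version would probably need to track open file descriptions instead.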
Operation
How will tests trigger a fault injection? I think the LD_PRELOAD should define
some int variable __fsync_test_crash_injection - and once this is set to 1,
subsequent wrapped calls trigger the fault injection. Then we can in the
server have a debug_dbug setting that will trigger the LD_PRELOAD. Or the test
driver can attach GDB in batch mode to set the trigger.
We need to skip only write() calls to disk files, not to sockets and pipes and
such. I think we can handle this by calling fstat() on the fd inside our
write() wrapper to decide if it is a regular file or not.
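The fstat() check could look roughly like this. The function name fd_is_regular_file is made up for the sketch, and treating an fstat() failure as "not a regular file" (so the write is passed through untouched) is an assumption about the desired behaviour.

```c
#include <sys/stat.h>

/* Returns 1 if fd refers to a regular file, 0 otherwise.
   An fstat() failure (e.g. a bad fd) also yields 0, so the caller
   forwards the call to the real write() and lets it report the error. */
int fd_is_regular_file(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return 0;
    return S_ISREG(st.st_mode) ? 1 : 0;
}
```

The result could be cached in the per-fd state so that fstat() is not repeated on every write() call.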
We need to maintain a hash of state for each open fd in the process. We need
to maintain this hash even before we are triggered, to know the O_DIRECT
status of every file opened. We also need to remember in the hash the data for
every skipped write() call, so we can return that data correctly from read().
We can use a single mutex to protect the hash of state, I think - we are
testing correctness, not scalability. We do not need any locking to protect
the __fsync_test_crash_injection variable though.
If we decide to ignore a write() or writev() call, we must do an lseek() instead
on the fd, so we leave the fd in the correct state for subsequent write()
calls. At least if we want to skip only some write() calls after triggering,
as opposed to all. Skipping only some would be useful to catch more bugs,
where target application makes incorrect assumptions that some bytes written
implies that other bytes will also be written.
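A skipped write() could be emulated along these lines. The name skip_write is hypothetical. The sketch assumes a seekable fd that was not opened with O_APPEND (for O_APPEND the real write() would first move to end of file, which is not modelled here).

```c
#include <sys/types.h>
#include <unistd.h>

/* Pretend to write count bytes to fd: advance the file offset exactly
   as a successful write() would, but transfer no data.
   Returns count on success, or -1 with errno set by lseek(). */
ssize_t skip_write(int fd, size_t count)
{
    if (lseek(fd, (off_t)count, SEEK_CUR) == (off_t)-1)
        return -1;
    return (ssize_t)count;
}
```

For pwrite() and pwritev() no seek is needed, since those calls do not change the file offset. For writev() the count would be the sum of the iov_len fields.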
If we skip a write() call (or similar) that would extend the file, I think we
also need to skip every write() that follows it. I think good file systems
guarantee that a write that extends a file will not become visible without the
data it extended the file with also being visible. Or check what POSIX or
other relevant standards guarantee, and make sure we obey the correct
semantics.
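One possible shape for this rule is sketched below, with hypothetical names (write_would_extend, struct skip_policy, should_skip). It assumes the policy is kept per file and that the caller has already made its own (for example random) decision about whether it wants to skip the current write.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Returns 1 if writing len bytes at offset off would grow the file,
   0 if it would not, and -1 if fstat() fails. */
int write_would_extend(int fd, off_t off, size_t len)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    return off + (off_t)len > st.st_size ? 1 : 0;
}

/* Hypothetical per-file policy state. */
struct skip_policy {
    int skip_all;   /* set once an extending write has been skipped */
};

/* Decide whether the current write is skipped.
   want_skip: the caller's own choice for this write (0 or 1).
   extends:   nonzero if the write would extend the file.
   Once an extending write is skipped, all later writes are skipped. */
int should_skip(struct skip_policy *p, int want_skip, int extends)
{
    if (p->skip_all)
        return 1;
    if (want_skip && extends)
        p->skip_all = 1;
    return want_skip ? 1 : 0;
}
```

Because every write after the first skipped extending write is also skipped, the on-disk size reported by fstat() stays consistent with what this check needs.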
A read(), readv(), pread() or preadv() call on an fd on which we already
skipped a write() call must look up any skipped data in the state hash for the
fd, and place that in the buffer before returning from read().
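The remembering and read-back of skipped data might be sketched as follows. The types and functions (struct skipped_write, struct fd_state, record_skipped, overlay_skipped) are illustrative only, and a linked list stands in for whatever structure the real state hash would use. A single global mutex protects the state, as suggested above.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* One skipped write: len bytes that belong at file offset off. */
struct skipped_write {
    off_t off;
    size_t len;
    unsigned char *data;
    struct skipped_write *next;
};

/* Hypothetical per-fd state; only the skipped-write list is shown. */
struct fd_state {
    struct skipped_write *head, *tail;  /* oldest entry first */
};

static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Remember the data of a skipped write. Returns 0, or -1 if out of memory. */
int record_skipped(struct fd_state *st, off_t off, const void *buf, size_t len)
{
    struct skipped_write *w = malloc(sizeof *w);

    if (!w)
        return -1;
    w->data = malloc(len ? len : 1);
    if (!w->data) {
        free(w);
        return -1;
    }
    if (len)
        memcpy(w->data, buf, len);
    w->off = off;
    w->len = len;
    w->next = NULL;

    pthread_mutex_lock(&state_mutex);
    if (st->tail)
        st->tail->next = w;
    else
        st->head = w;
    st->tail = w;
    pthread_mutex_unlock(&state_mutex);
    return 0;
}

/* buf holds len bytes just read from file offset off. Copy any overlapping
   skipped data over it, oldest first, so the most recent write wins. */
void overlay_skipped(struct fd_state *st, off_t off, void *buf, size_t len)
{
    off_t end = off + (off_t)len;

    pthread_mutex_lock(&state_mutex);
    for (struct skipped_write *w = st->head; w; w = w->next) {
        off_t wend = w->off + (off_t)w->len;
        off_t lo = w->off > off ? w->off : off;
        off_t hi = wend < end ? wend : end;

        if (lo < hi)
            memcpy((unsigned char *)buf + (lo - off),
                   w->data + (lo - w->off), (size_t)(hi - lo));
    }
    pthread_mutex_unlock(&state_mutex);
}
```

This only patches bytes that the real read() returned. If skipped writes extended the file, the read() wrapper would also have to enlarge the returned byte count, which is not shown here.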
Doing the LD_PRELOAD
It needs to be sorted out how to do an LD_PRELOAD, i.e. how to wrap selected
syscalls and augment them with our own operations. I think Stewart's
libeatmydata can be used as a starting point. It already does similar
wrapping, although with a different aim. If needed, the Debian fakeroot
mechanism may be another useful piece of source code to study for ideas.
Limitations
It is also possible to write to files using mmap(). Then simply writing to
memory will eventually cause a disk write. Unfortunately, I did not come up
with a way to handle testing of fsync and mmap in an LD_PRELOAD.
Other ideas
A possible alternative to an LD_PRELOAD is to use virtualisation, like KVM or
similar. If a mysqld process inside the guest does write(), then the data
written will be in the file system buffers in the guest kernel, but not
necessarily in the buffers or on disk in the host. Thus, a `kill -9` of the
kvm (or whatever) process can lose the data not fsync()'ed, in effect doing a
kernel crash on the guest. Using kvm with e.g. "-drive cache=writeback ..."
should give appropriate semantics for this.
The LD_PRELOAD is more convenient (i.e. mysql-test-run could use it
automatically if available), however the virtualisation method is perhaps more
robust and complete, so both could be utilised for maximum coverage.
I wonder if this might be better using something at the 'dm' level in the Linux kernel. It's likely that with an LD_PRELOAD solution, it would end up being Linux-specific anyway. It's entirely possible that dm-flakey already either supports what would be necessary or could be easily modified to do so. This would also be beneficial in trapping other types of I/O which could go wrong and compromise crash safety, such as creation of new files, deletion of files, etc., and would natively support mmap and O_DIRECT without additional work.