  1. MariaDB Server
  2. MDEV-370

LD_PRELOAD to crash-test correctness of fsync().

Details

    • Task
    • Status: Closed
    • Trivial
    • Resolution: Fixed

    Description

      Overview

      Crash recovery is a crucial part of database operation. We have it for
      InnoDB/XtraDB, for binlog, for Aria, and other places.

      Such crash recovery relies on fsync()/fdatasync()/O_DIRECT to know when data
      is durably stored on disk. It is crucial for correct operation that fsync
      happens in all the right places.

      Unfortunately, testing for this is almost entirely missing. The problem is
      that fsync is really needed only when the kernel crashes (bug, power failure,
      or hardware failure). Such conditions are very inconvenient to include in an
      automatic test suite. Normally, just `kill -9` is used. However, `kill -9`
      does not lose data that was written with write() but not yet fsync'ed,
      whereas a kernel crash does lose such data.

      This task is to implement an LD_PRELOAD .so library that will help test
      correct fsync placement in automated test suites, without having to crash the
      kernel.

      The idea is to artificially simulate the loss of data that has been written
      with write(), but not synced with fsync(). An LD_PRELOAD .so library will
      override the libc wrappers for the write() and fsync() syscalls. There will
      be some way to inject a fault, like a signal or setting a global variable.
      When triggered, the LD_PRELOAD library will make write() calls do nothing, or
      perhaps make X% of write() calls do nothing. Then, when the first fsync() is
      encountered, it will abort() to crash the process. (Until triggered, all
      calls work as normal.)

      This simulates a kernel crash that prevents some data from write() from ever
      reaching the disk. It is crucial to crash at the first fsync, as otherwise we
      may lose write()s that would not have been lost in a real-life situation.
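
      The fault model described above can be sketched as a small decision
      function. This is only an illustration under assumed names: the enums and
      decide_action() are hypothetical, and the caller is assumed to supply the
      random roll (for example from rand() % 100).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; nothing here comes from an existing library. */
enum call_kind { CALL_WRITE, CALL_SYNC };
enum fault_action { ACT_PASS_THROUGH, ACT_DROP_WRITE, ACT_ABORT };

/*
 * Decide what a wrapper should do with one intercepted call.
 *   triggered    - has the test injected the fault yet?
 *   kind         - write-like call or fsync()/fdatasync()-like call
 *   drop_percent - share of write calls to drop once triggered (0..100)
 *   roll         - caller-supplied random value in 0..99
 */
enum fault_action decide_action(bool triggered, enum call_kind kind,
                                int drop_percent, int roll)
{
    assert(drop_percent >= 0 && drop_percent <= 100);
    assert(roll >= 0 && roll < 100);

    if (!triggered)
        return ACT_PASS_THROUGH;   /* behave normally until triggered */
    if (kind == CALL_SYNC)
        return ACT_ABORT;          /* first sync after the trigger: crash */
    return roll < drop_percent ? ACT_DROP_WRITE : ACT_PASS_THROUGH;
}
```

      With drop_percent set to 100 every write after the trigger is dropped; a
      lower value gives the "X% of write calls" variant.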

      Details

      There are a number of details to get right to get correct behaviour from such
      an LD_PRELOAD library.

      In practice, there are more system calls to wrap than just write() and
      fsync(). Here is an attempt at a full list. We attempt to be reasonably
      complete, but it is enough to cover what current mysqld uses.

      • fsync()
      • fdatasync(). In fact, we might want to test that we always use
        fdatasync() in favour of fsync(), as the former should normally be
        sufficient and can be much faster (I have measured 2.5 times faster on one
        system). So we could make the fsync() wrapper assert unconditionally.
      • fcntl(). We need to know which fds are opened with O_DIRECT. For such fds,
        we must not ignore the write() call (however, we might delay it for some
        small random time in an attempt to expose a bug). We might also want to
        crash after some (random?) N write() calls to O_DIRECT files, as when
        using O_DIRECT, fsync may not be used at all by the process to be tested.
      • open() - O_DIRECT can also be set in open(2).
      • close() - to know when to de-allocate our state when target process closes
        a file.
      • dup() and dup2() and dup3(), to keep track of opened files.
      • write()
      • pwrite()
      • writev()
      • pwritev()
      • read(), as if we skip a write() call, we still need to return the data that
        would have been written to any subsequent read() call. Otherwise we may
        cause spurious failures for the target application.
      • pread()
      • readv()
      • preadv()
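
      As an illustration of the bookkeeping that the open(), fcntl(), dup() and
      close() wrappers would need, here is a minimal sketch. All names are
      hypothetical, and it uses a fixed-size table indexed by fd rather than a
      real hash. It also copies the O_DIRECT flag on dup(), whereas in the
      kernel the flag is shared between duplicated descriptors, so a later
      F_SETFL on one of them would need to update both.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* O_DIRECT is a Linux extension in <fcntl.h> */
#endif
#include <fcntl.h>
#include <stdbool.h>

#define MAX_TRACKED_FDS 1024  /* assumption: fixed table instead of a hash */

struct fd_state {
    bool in_use;
    bool direct;              /* fd currently has O_DIRECT set */
};

static struct fd_state fd_table[MAX_TRACKED_FDS];

static bool fd_in_range(int fd)
{
    return fd >= 0 && fd < MAX_TRACKED_FDS;
}

/* Call after a successful open(path, flags, ...) returned fd. */
void track_open(int fd, int flags)
{
    if (!fd_in_range(fd))
        return;
    fd_table[fd].in_use = true;
    fd_table[fd].direct = (flags & O_DIRECT) != 0;
}

/* Call after a successful fcntl(fd, F_SETFL, flags). */
void track_setfl(int fd, int flags)
{
    if (fd_in_range(fd) && fd_table[fd].in_use)
        fd_table[fd].direct = (flags & O_DIRECT) != 0;
}

/* Call after a successful dup(), dup2() or dup3(). */
void track_dup(int oldfd, int newfd)
{
    if (fd_in_range(oldfd) && fd_in_range(newfd))
        fd_table[newfd] = fd_table[oldfd];
}

/* Call after a successful close(fd): forget everything about it. */
void track_close(int fd)
{
    if (fd_in_range(fd))
        fd_table[fd] = (struct fd_state){ false, false };
}

/* Used by the write() wrappers: O_DIRECT writes must not be dropped. */
bool fd_is_direct(int fd)
{
    return fd_in_range(fd) && fd_table[fd].in_use && fd_table[fd].direct;
}
```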

      Operation

      How will tests trigger a fault injection? I think the LD_PRELOAD library
      should define some int variable __fsync_test_crash_injection; once this is
      set to 1, subsequent wrapped calls trigger the fault injection. Then we can
      have a debug_dbug setting in the server that will trigger the LD_PRELOAD. Or
      the test driver can attach GDB in batch mode to set the trigger.

      We need to skip only write() calls to disk files, not to sockets and pipes and
      such. I think we can handle this by calling fstat() on the fd inside our
      write() wrapper to decide if it is a regular file or not.
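
      The fstat() check could look roughly like this. The helper name
      fd_is_regular_file() is hypothetical, and the small demo function exists
      only to show the expected behaviour on a temporary file and a pipe.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fileno(), pipe(), S_ISREG */
#endif
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: true only if fd refers to a regular file. */
bool fd_is_regular_file(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return false;             /* unknown fd: leave the call alone */
    return S_ISREG(st.st_mode);
}

/* Small self-check: a temporary file is regular, a pipe is not. */
bool demo_regular_file_check(void)
{
    int pipefd[2];
    FILE *tmp = tmpfile();

    if (tmp == NULL)
        return false;
    if (pipe(pipefd) != 0) {
        fclose(tmp);
        return false;
    }

    bool ok = fd_is_regular_file(fileno(tmp)) &&
              !fd_is_regular_file(pipefd[0]) &&
              !fd_is_regular_file(-1);

    close(pipefd[0]);
    close(pipefd[1]);
    fclose(tmp);
    return ok;
}
```

      Since the file type of an fd does not change while it is open, the result
      could be cached in the per-fd state instead of calling fstat() on every
      write().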

      We need to maintain a hash of state for each open fd in the process. We need
      to maintain this hash even before we are triggered, to know the O_DIRECT
      status of every file opened. We also need to remember in the hash the data for
      every skipped write() call, so we can return that data correctly from read().

      We can use a single mutex to protect the hash of state, I think - we are
      testing correctness, not scalability. We do not need any locking to protect
      the __fsync_test_crash_injection variable though.

      If we decide to ignore a write() or writev() call, we must call lseek() on the
      fd instead, so that we leave the file offset in the correct state for
      subsequent write() calls. At least if we want to skip only some write() calls
      after triggering, as opposed to all. Skipping only some would be useful to
      catch more bugs, where the target application incorrectly assumes that some
      bytes having been written implies that other bytes were also written.
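
      A dropped write() could be replaced by something like the sketch below,
      which only advances the file offset. The name pretend_write() is
      hypothetical, and the sketch assumes a seekable regular file that was not
      opened with O_APPEND. pwrite() and pwritev() do not move the file offset,
      so their wrappers would not need the lseek().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for fileno() and ssize_t */
#endif
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a dropped write(fd, buf, count): move the file
 * offset as the real call would have, then report full success.
 */
ssize_t pretend_write(int fd, size_t count)
{
    if (lseek(fd, (off_t)count, SEEK_CUR) == (off_t)-1)
        return -1;                /* errno was set by lseek() */
    return (ssize_t)count;
}

/* Self-check: the offset moves by 5 bytes, the file size stays at 3. */
bool demo_pretend_write(void)
{
    struct stat st;
    FILE *tmp = tmpfile();

    if (tmp == NULL)
        return false;
    int fd = fileno(tmp);

    bool ok = write(fd, "abc", 3) == 3 &&
              pretend_write(fd, 5) == 5 &&
              lseek(fd, 0, SEEK_CUR) == 8 &&
              fstat(fd, &st) == 0 && st.st_size == 3;

    fclose(tmp);
    return ok;
}
```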

      If we skip a write() call (or similar) that would extend the file, I think we
      also need to skip every write() that follows it. I think good file systems
      guarantee that a write that extends a file will not be seen unless the data
      it extends is also seen. Or check what POSIX or other relevant standards
      guarantee, and make sure we obey the correct semantics.
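
      One possible reading of this rule, as a sketch: keep per-fd state that
      records the file size as it really is on disk, and once a skipped write
      would have extended the file, force every later write to be skipped too.
      The struct, its fields and should_skip_write() are hypothetical, and the
      rule itself is the tentative one from the paragraph above, not a checked
      POSIX guarantee.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-fd state for the "extending write" rule. */
struct skip_state {
    off_t real_size;   /* file size counting only writes really performed */
    bool  skip_rest;   /* set once a skipped write would have extended it */
};

/*
 * Decide whether the write covering [offset, offset + len) is skipped.
 * want_skip is the (random) decision already made for this call; it is
 * overridden once an earlier extending write has been skipped.
 */
bool should_skip_write(struct skip_state *st, off_t offset, size_t len,
                       bool want_skip)
{
    off_t end = offset + (off_t)len;

    if (st->skip_rest)
        return true;                 /* an extending write was lost earlier */

    if (!want_skip) {
        if (end > st->real_size)
            st->real_size = end;     /* this write is really performed */
        return false;
    }

    if (end > st->real_size)
        st->skip_rest = true;        /* skipped write would have extended */
    return true;
}
```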

      A read(), readv(), pread() or preadv() call on an fd on which we already
      skipped a write() call must look up any skipped data in the state hash for the
      fd, and place that in the buffer before returning from read().
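
      The lookup could work roughly as follows, assuming the skipped writes for
      an fd are kept as a simple array ordered oldest first. The struct and
      overlay_skipped_writes() are hypothetical names. The sketch only patches
      bytes that the real read() returned; a read that comes back short because
      a skipped write would have extended the file would additionally need its
      return value adjusted.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical record of one skipped write. */
struct skipped_write {
    off_t  offset;               /* file offset the write was aimed at */
    size_t len;                  /* number of bytes in data */
    const unsigned char *data;   /* copy of the bytes the application wrote */
};

/*
 * buf holds file bytes [buf_off, buf_off + buf_len) as returned by the real
 * read(). Copy every overlapping part of each skipped write over it. Entries
 * are applied in array order, so a later skipped write wins over an earlier
 * one covering the same bytes.
 */
void overlay_skipped_writes(unsigned char *buf, off_t buf_off, size_t buf_len,
                            const struct skipped_write *log, size_t n)
{
    off_t buf_end = buf_off + (off_t)buf_len;

    for (size_t i = 0; i < n; i++) {
        off_t start = log[i].offset;
        off_t end = start + (off_t)log[i].len;

        if (start < buf_off)
            start = buf_off;     /* clip to the start of the read range */
        if (end > buf_end)
            end = buf_end;       /* clip to the end of the read range */
        if (start >= end)
            continue;            /* no overlap with this read */

        memcpy(buf + (start - buf_off),
               log[i].data + (start - log[i].offset),
               (size_t)(end - start));
    }
}
```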

      Doing the LD_PRELOAD

      It needs to be sorted out how to do an LD_PRELOAD, i.e. how to wrap selected
      syscalls and augment them with our own operations. I think Stewart's
      libeatmydata can be used as a starting point; it already does similar
      wrapping, although with a different aim. If needed, the Debian fakeroot
      mechanism may be another useful piece of source code to study for ideas.
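
      For reference, the usual way to wrap a libc function on Linux/glibc is to
      define a function with the same name and look up the original with
      dlsym(RTLD_NEXT, ...). The sketch below wraps only fsync() and is not
      taken from libeatmydata. The trigger variable is the one named in this
      task; the fsync_calls_seen counter is a hypothetical addition to make the
      wrapper observable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* needed for RTLD_NEXT in <dlfcn.h> */
#endif
#include <dlfcn.h>
#include <stdlib.h>
#include <unistd.h>

/* Trigger variable from the task description; set to 1 to inject. */
volatile int __fsync_test_crash_injection = 0;

/* Hypothetical counter, only here to make the sketch observable. */
int fsync_calls_seen = 0;

typedef int (*fsync_fn)(int);

/* Same name and signature as libc's fsync(), so it takes precedence. */
int fsync(int fd)
{
    static fsync_fn real_fsync;

    fsync_calls_seen++;
    if (__fsync_test_crash_injection)
        abort();               /* simulated crash at first fsync */

    if (real_fsync == NULL) {
        /* Find the next definition of fsync, normally the one in libc. */
        real_fsync = (fsync_fn)dlsym(RTLD_NEXT, "fsync");
        if (real_fsync == NULL)
            abort();           /* cannot continue without the real call */
    }
    return real_fsync(fd);
}
```

      Built as a shared object (for example with `gcc -shared -fPIC`, plus
      `-ldl` on older glibc versions) and listed in the LD_PRELOAD environment
      variable, such a wrapper is picked up by the dynamic linker ahead of
      libc's own fsync().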

      Limitations

      It is also possible to write to files using mmap(). Then simply writing to
      memory will eventually cause a disk write. Unfortunately, I did not come up
      with a way to handle testing of fsync and mmap in an LD_PRELOAD.

      Other ideas

      A possible alternative to an LD_PRELOAD is to use virtualisation, like KVM or
      similar. If a mysqld process inside the guest does write(), then the data
      written will be in the file system buffers in the guest kernel, but not
      necessarily in the buffers or on disk in the host. Thus, a `kill -9` of the
      kvm (or whatever) process can lose the data not fsync()'ed, in effect doing a
      kernel crash on the guest. Using kvm with e.g. "-drive cache=writeback ..."
      should give appropriate semantics for this.

      The LD_PRELOAD is more convenient (i.e. mysql-test-run could use it
      automatically if available); however, the virtualisation method is perhaps
      more robust and complete, so both could be utilised for maximum coverage.

        Activity

          jeremycole Jeremy Cole added a comment -

          I wonder if this might be better using something at the 'dm' level in the Linux kernel. It's likely that with an LD_PRELOAD solution, it would end up being Linux-specific anyway. It's entirely possible that dm-flakey already either supports what would be necessary or could be easily modified to do so. This would also be beneficial in trapping other types of I/O which could go wrong and compromise crash safety, such as creation of new files, deletion of files, etc., and would natively support mmap and O_DIRECT without additional work.


          knielsen Kristian Nielsen added a comment -

          Elena actually implemented the "KVM" idea.

          RQG has a test where it runs some replication load inside a KVM guest and then
          kills the KVM process, fairly accurately simulating a power failure or other
          kernel crash in the guest. After crash recovery, the master and slave are
          compared for consistency.

          This was used to test crash recovery, and was actually very successful; it
          found this bug in Linux ext3/ext4 fdatasync():

          http://lkml.indiana.edu/hypermail/linux/kernel/1209.0/00517.html

          (The dm-flakey idea might still be interesting, thanks for the pointer).

          jeremycole Jeremy Cole added a comment -

          Cool! Thanks for the note. If you've already got a working system for testing this, definitely using dm-flakey might be fun to catch other problems...


          People

            Assignee: Unassigned
            Reporter: Kristian Nielsen (knielsen)