[MDEV-17084] Optimize append only files for NVDIMM - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Fix Version/s: 10.5.2
Component/s: Replication, Storage Engine - Aria, Storage Engine - InnoDB
Labels: None

Description

The task is to improve the performance of append-only files with
the help of persistent memory (NVDIMM).

Preamble:

To understand what I would like to do, we first have to consider
the optimal "user" interface for using persistent memory on
append-only files:

- Least possible changes to existing applications.
- Application should work unchanged even if there is no or limited
  persistent memory on the machine.
- Should work on any file, anywhere.

One way to do this would be something like:

- Create log file
- Execute an ioctl(file, IOCTL_OPTIMIZE_FOR_APPEND_ONLY, 1024*1024)
  This should assign 1M to optimize append write performance for
  the file until it's closed.
- Close log file
  This should flush everything from persistent memory to the file.
- Any read, write, sync to the file should work 'normally', as if
  it were a normal file.
- On reboot the persistent memory should be flushed to the file, so
  that the file contains what was written to it.
- There should also be some calls/tools to check how much persistent
  memory exists, who is allocating how much etc.

In other words, to use persistent memory to speed up append-only files,
in most cases only a one-line change would be needed for each log file.

Unfortunately the above approach is not practical as it's hard to get to
work on all platforms.

However, I would like to create something that is as close to the above
as possible. This would allow anyone to adopt the library for their
C or C++ application with a minimal amount of changes.

Implementations:

The suggested library to use is http://pmem.io/ and especially
http://pmem.io/pmdk/libpmem/. It seems to be available on most
modern Linux distributions.
Note that this suggested library should be written so that it works
even if there is no pmem library or persistent memory available. In this
case everything should work exactly like before (with the overhead
of one virtual call per pwrite()).
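
A minimal sketch of how such an optional dependency could be handled, under stated assumptions: HAVE_LIBPMEM is a hypothetical build-time macro, and struct pmem_base_sketch and pmem_base_sketch_init() are illustrative names that are not part of this proposal. Only pmem_map_file() and pmem_unmap() are real libpmem calls. Without the macro, or when mapping fails, the handler stays in the "no persistent memory" state.

```c
#include <assert.h>
#include <stddef.h>
#ifdef HAVE_LIBPMEM              /* hypothetical macro set by the build system */
#include <libpmem.h>
#endif

/* Mirrors the two public fields proposed for the base handler. */
struct pmem_base_sketch {
  void  *map;                    /* NULL if no persistent memory */
  size_t mapped_length;          /* available persistent memory */
};

/* Try to map the device or file at path; on any failure leave the
   handler in the "no persistent memory" state so callers fall back. */
static void pmem_base_sketch_init(struct pmem_base_sketch *base,
                                  const char *path)
{
  assert(base != NULL);
  base->map = NULL;
  base->mapped_length = 0;
#ifdef HAVE_LIBPMEM
  if (path != NULL) {
    size_t len = 0;
    int is_pmem = 0;
    /* len 0 and flags 0 map the whole existing file */
    void *map = pmem_map_file(path, 0, 0, 0, &len, &is_pmem);
    if (map != NULL && is_pmem) {
      base->map = map;
      base->mapped_length = len;
    } else if (map != NULL) {
      pmem_unmap(map, len);      /* mapped, but not real persistent memory */
    }
  }
#else
  (void) path;                   /* built without libpmem: always fall back */
#endif
}
```

Callers would then only need to test base->map to decide between the cached and the plain write path.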

There is a library based on the above that implements support for
append only files
(http://pmem.io/pmdk/manpages/linux/master/libpmemlog/libpmemlog.7.html)
but this assumes that the full log files should be in persistent
memory, which is not optimal for the end user. It's also quite
complex to use with MariaDB, would require a lot of changes in MariaDB
to use and would be hard to get to work with and without persistent
memory.

Because of the above, I suggest we base our work on the
low level http://pmem.io/pmdk/libpmem/ library.

Here is what I envision as an interface:

struct pmem_append_base_handler {
  void *map;               /* 0 if no persistent memory */
  size_t mapped_length;    /* Available persistent memory */
};

/* Initialize pmem_append */
pmem_append_base_handler *pmem_append_base_init(const char *path_to_mem_dev);

/* Write all cached memory to files and free up memory for reuse */
int pmem_append_write_all(pmem_append_base_handler *ptr);

/* End usage of pmem_append */
void pmem_append_close(pmem_append_base_handler *ptr);

/* Allocate a pmem_append_handler for a specific file */
pmem_append_handler *pmem_append_init(pmem_handler, path_to_log_file...);

To use the library one should do something like:

pmem_handler= pmem_append_base_init(path_to_memory_device);
log_file_handler= open(path_to_log_file,...);

/* request to use half of available persistent memory for this file */
handler= pmem_append_init(pmem_handler, path_to_log_file,
                          log_file_handler,
                          pmem_handler->mapped_length / 2);
handler->fsync();    /* Write out memory to file (in case of crash before) */
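
The initial handler->fsync() call is meant to write out data left in persistent memory by an earlier crash. A rough sketch of what such a recovery step might look like, assuming a hypothetical segment layout (struct segment_sketch, SEGMENT_MAGIC and segment_recover() are illustrative and not part of the proposed interface). An ordinary buffer stands in for persistent memory here.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SEGMENT_MAGIC 0x504D4150u      /* hypothetical marker value */

/* Hypothetical per-file segment; data[] holds bytes not yet in the file. */
struct segment_sketch {
  uint32_t magic;
  uint64_t file_offset;                /* where data[0] belongs in the log file */
  uint64_t used;                       /* number of valid bytes in data[] */
  char     data[4096];
};

/* Replay pending bytes into the log file. Returns 0 on success, -1 on error. */
static int segment_recover(struct segment_sketch *seg, int fd)
{
  uint64_t done = 0;
  assert(seg != NULL);
  if (seg->magic != SEGMENT_MAGIC || seg->used > sizeof(seg->data))
    return -1;                         /* uninitialised or corrupt segment */
  while (done < seg->used) {
    ssize_t n = pwrite(fd, seg->data + done, (size_t) (seg->used - done),
                       (off_t) (seg->file_offset + done));
    if (n <= 0)
      return -1;                       /* no retry logic in this sketch */
    done += (uint64_t) n;
  }
  if (fsync(fd) != 0)
    return -1;
  /* With real persistent memory this header update would also have to be
     made durable, for example with libpmem's pmem_persist(); omitted here. */
  seg->file_offset += seg->used;       /* segment is empty again */
  seg->used = 0;
  return 0;
}
```

The segment is only marked empty after the file has been synced, so a crash during recovery would simply replay the same bytes again.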

The handler would be a struct where the public members would be something
like:

struct pmem_append_handler {
  int log_file_handler;
  const char *path_to_log_file;
  /* size of the persistent buffer for this file */
  ulonglong memory_available;
  pmem_append_base_handler *pmem;
  off_t     offset;   /* End of file */
  ssize_t (*append)(int fildes, const void *buf, size_t nbyte);
  ssize_t (*pwrite)(int fildes, const void *buf, size_t nbyte,
                    off_t offset);
  /* buf is not const here: pread() fills it */
  ssize_t (*pread)(int fildes, void *buf, size_t nbyte,
                   off_t offset);
  /* Write persistent memory in file region to file */
  int sync(off_t offset, size_t length);
  /* Write all persistent memory to file and fsync file */
  int fsync(void);
};

If there is no persistent memory, the above calls to pread/pwrite would be
mapped to normal read/writes.
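
As an illustration of that fallback, here is a small sketch in which the function pointers are bound directly to the libc calls. struct append_handler_sketch is a reduced stand-in for the proposed handler, and handler_bind_plain() and handler_append() are hypothetical helper names, not part of the proposal.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Reduced stand-in for the proposed handler; only the fields needed here. */
struct append_handler_sketch {
  int   log_file_handler;
  off_t offset;                                  /* end of file */
  ssize_t (*pwrite)(int fildes, const void *buf, size_t nbyte, off_t offset);
  ssize_t (*pread)(int fildes, void *buf, size_t nbyte, off_t offset);
};

/* No persistent memory: point the hooks straight at the libc calls. */
static int handler_bind_plain(struct append_handler_sketch *h, int fd)
{
  off_t end;
  assert(h != NULL);
  end = lseek(fd, 0, SEEK_END);
  if (end == (off_t) -1)
    return -1;
  h->log_file_handler = fd;
  h->offset = end;
  h->pwrite = pwrite;
  h->pread = pread;
  return 0;
}

/* Append at the tracked end of file and advance it. */
static ssize_t handler_append(struct append_handler_sketch *h,
                              const void *buf, size_t nbyte)
{
  ssize_t n = h->pwrite(h->log_file_handler, buf, nbyte, h->offset);
  if (n > 0)
    h->offset += n;
  return n;
}
```

In this mode the only extra cost compared to calling pwrite() directly is the indirect call through the pointer.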

To use this interface, one would have to make the following changes in the
application:

- Add a call to pmem_append_init() when one opens the log file.
  This call will also flush any cached data to the file.
- Change write calls from pwrite() to handler->pwrite() or handler->append(),
  or add a call to handler->sync() to ensure that the area is already written.
- Change read calls from pread() to handler->pread()
- Change fseek(SEEK_END) to use handler->offset
- Change fsync() to handler->fsync()

Note that any normal reads will work on the file normally. The user can
always call handler->fsync() to be able to use any file operation normally.

Some applications may have their own version of pwrite/pread (like MariaDB).
To allow these to work with the above, there should also be a mapping
through which the library calls pread, pwrite and sync, so that one can use
the application's calls. For example, Aria is using my_pwrite() instead of
pwrite().
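
One possible shape for such a mapping is a table of function pointers with libc defaults, sketched below. All names (struct io_hooks_sketch, io_hooks_install(), library_flush(), app_pwrite()) are hypothetical. app_pwrite() merely stands in for an application wrapper and does not reproduce the signature or behaviour of my_pwrite().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical table through which the library would reach the file. */
struct io_hooks_sketch {
  ssize_t (*pwrite)(int fd, const void *buf, size_t n, off_t off);
  ssize_t (*pread)(int fd, void *buf, size_t n, off_t off);
  int     (*sync)(int fd);
};

/* Defaults used when the application installs nothing. */
static const struct io_hooks_sketch default_hooks = { pwrite, pread, fsync };
static const struct io_hooks_sketch *active_hooks = &default_hooks;

/* Install application hooks; NULL restores the defaults. */
static void io_hooks_install(const struct io_hooks_sketch *hooks)
{
  active_hooks = hooks != NULL ? hooks : &default_hooks;
}

/* Library-internal flush path: only ever calls through active_hooks. */
static int library_flush(int fd, const char *buf, size_t n, off_t off)
{
  assert(buf != NULL);
  while (n > 0) {
    ssize_t w = active_hooks->pwrite(fd, buf, n, off);
    if (w <= 0)
      return -1;
    buf += w;
    n -= (size_t) w;
    off += w;
  }
  return active_hooks->sync(fd);
}

/* Example application wrapper: retries on EINTR and counts calls. */
static unsigned long app_pwrite_calls = 0;

static ssize_t app_pwrite(int fd, const void *buf, size_t n, off_t off)
{
  ssize_t w;
  app_pwrite_calls++;
  do {
    w = pwrite(fd, buf, n, off);
  } while (w < 0 && errno == EINTR);
  return w;
}
```

An application would fill a struct io_hooks_sketch with its own wrappers and pass it to io_hooks_install() once at startup. The library code itself would never call pwrite() or fsync() by name.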

The library would also internally do the following things:

- Create a background thread (in pmem_append_base_init()) that will
  monitor all append files and start flushing as soon as half of the
  memory of the respective cache is used.
- Create a separate segment for each pmem_append_init() call and
  store information about the file there that can be used on restart.
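
The half-full flushing rule could be sketched roughly as follows, using POSIX threads and a single cache. This is only an illustration under simplifying assumptions: an ordinary buffer stands in for the persistent memory segment, only one file is monitored, the lock is held during the write, and all names (struct append_cache_sketch, cache_flusher() and so on) are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CACHE_CAPACITY 4096            /* arbitrary size for the sketch */

/* Per-file cache; buf stands in for the persistent memory segment. */
struct append_cache_sketch {
  pthread_mutex_t lock;
  pthread_cond_t  wake;
  char   buf[CACHE_CAPACITY];
  size_t used;                         /* bytes cached but not yet in the file */
  off_t  file_end;                     /* file offset where buf[0] belongs */
  int    fd;
  int    stop;                         /* set to ask the flusher to exit */
};

static void cache_init(struct append_cache_sketch *c, int fd)
{
  assert(c != NULL);
  pthread_mutex_init(&c->lock, NULL);
  pthread_cond_init(&c->wake, NULL);
  c->used = 0;
  c->file_end = lseek(fd, 0, SEEK_END);
  c->fd = fd;
  c->stop = 0;
}

/* Background thread: sleeps until half the cache is used or stop is set,
   then writes the cached bytes to the file. */
static void *cache_flusher(void *arg)
{
  struct append_cache_sketch *c = arg;
  pthread_mutex_lock(&c->lock);
  for (;;) {
    while (!c->stop && c->used < CACHE_CAPACITY / 2)
      pthread_cond_wait(&c->wake, &c->lock);
    if (c->used > 0) {
      ssize_t n = pwrite(c->fd, c->buf, c->used, c->file_end);
      if (n <= 0)
        break;                         /* give up on I/O errors in this sketch */
      memmove(c->buf, c->buf + n, c->used - (size_t) n);
      c->used -= (size_t) n;
      c->file_end += n;
    }
    if (c->stop && c->used == 0)
      break;
  }
  pthread_mutex_unlock(&c->lock);
  return NULL;
}

/* Append to the cache; returns 0 on success, -1 if there is no room. */
static int cache_append(struct append_cache_sketch *c,
                        const void *data, size_t n)
{
  int rc = -1;
  pthread_mutex_lock(&c->lock);
  if (n <= CACHE_CAPACITY - c->used) {
    memcpy(c->buf + c->used, data, n);
    c->used += n;
    rc = 0;
    if (c->used >= CACHE_CAPACITY / 2)
      pthread_cond_signal(&c->wake);   /* half full: wake the flusher */
  }
  pthread_mutex_unlock(&c->lock);
  return rc;
}

/* Ask the flusher to drain the cache and exit, then wait for it. */
static void cache_stop(struct append_cache_sketch *c, pthread_t thread)
{
  pthread_mutex_lock(&c->lock);
  c->stop = 1;
  pthread_cond_signal(&c->wake);
  pthread_mutex_unlock(&c->lock);
  pthread_join(thread, NULL);
}
```

Starting to flush at the halfway mark leaves the other half of the cache free for new appends while the write to the file is in progress, provided the write does not hold the lock as it does in this simplified version.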

There should also be an external tool that one can use to:

- See which files are cached by a persistent memory file and how much
  is still not written.
- Force the cache to be written to some or all of the files.
- Reset the cache.

With the above library, one should be able to take an application like
MariaDB and convert all append-only files (MariaDB usually has 3
active log files: binary log, InnoDB redo log, Aria redo log) to use
persistent memory in a matter of a few hours, and it would still work
when there is no persistent memory available.

Issue Links

causes
MDEV-32791 MariaDB-client community can't be installed in red hat ubi9 (Closed)

is part of
MDEV-9905 Options for NVDIMM usage in MariaDB (Open)

relates to
MDEV-21534 improve locking/waiting in log_write_up_to (Closed)
MDEV-25124 benchmark 10.6 performance for PMEM enabled builds (Closed)
MDEV-27848 Remove unused wait/io/file/innodb/innodb_log_file (Closed)

Activity

Transition	From	To	Time In Source Status	Execution Times
Ralf Gebhardt made transition - 2019-09-12 15:05	Open	In Progress	380d 1h 36m	1
Sergey Vojtovich made transition - 2019-10-10 09:44	In Progress	In Review	27d 18h 38m	1
Sergey Vojtovich made transition - 2019-11-08 09:23	In Review	Stalled	28d 23h 38m	1
Sergei Golubchik made transition - 2020-11-17 18:02	Stalled	Closed	375d 8h 38m	1