[MDEV-33379] innodb_log_file_buffering=OFF causes corruption on bcachefs Created: 2024-02-03  Updated: 2024-02-05

Status: In Progress
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 11.2.2
Fix Version/s: 10.11, 11.0

Type: Bug Priority: Critical
Reporter: Miroslav Rýzek Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 0
Labels: Compatibility
Environment:

Archlinux, bcachefs. 2 background HDD devices, 2 foreground SSD devices.


Issue Links:
Relates
relates to MDEV-24854 Change innodb_flush_method=O_DIRECT b... Closed
relates to MDEV-30136 Map innodb_flush_method to new settab... Closed

 Description   

Hello,

When I installed MariaDB and try to init databases using
```sh
mariadb-install-db --user=mysql --basedir=/usr --datadir=/var/lib/mysql
```
I receive this kind of crash:
```
mariadb-install-db --user=mysql --basedir=/usr --datadir=/var/lib/mysql --verbose
Installing MariaDB/MySQL system tables in '/var/lib/mysql' ...
2024-02-03 15:30:30 0 [Note] Starting MariaDB 11.2.2-MariaDB source revision 929532a9426d085111c24c63de9c23cc54382259 as process 28383
2024-02-03 15:30:30 0 [Note] InnoDB: The first data file './ibdata1' did not exist. A new tablespace will be created!
2024-02-03 15:30:30 0 [Note] InnoDB: Compressed tables use zlib 1.3
2024-02-03 15:30:30 0 [Note] InnoDB: Number of transaction pools: 1
2024-02-03 15:30:30 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-02-03 15:30:30 0 [Note] InnoDB: Using liburing
2024-02-03 15:30:30 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
2024-02-03 15:30:30 0 [Note] InnoDB: Completed initialization of buffer pool
2024-02-03 15:30:30 0 [Note] InnoDB: Setting file './ibdata1' size to 12.000MiB. Physically writing the file full; Please wait ...
2024-02-03 15:30:30 0 [Note] InnoDB: File './ibdata1' size is now 12.000MiB.
2024-02-03 15:30:30 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
2024-02-03 15:30:30 0 [Warning] InnoDB: 512 bytes should have been written at 12288 from (unknown file), but got only 0. Retrying.
2024-02-03 15:30:30 0 [Warning] InnoDB: 512 bytes should have been written at 12288 from (unknown file), but got only 0. Retrying.
2024-02-03 15:30:30 0 [Warning] InnoDB: 512 bytes should have been written at 12288 from (unknown file), but got only 0. Retrying.
2024-02-03 15:30:30 0 [Warning] InnoDB: 512 bytes should have been written at 12288 from (unknown file), but got only 0. Retrying.
2024-02-03 15:30:30 0 [Warning] InnoDB: 512 bytes should have been written at 12288 from (unknown file), but got only 0. Retrying.
2024-02-03 15:30:30 0 [Warning] InnoDB: 512 bytes should have been written at 12288 from (unknown file), but got only 0. Retrying.
2024-02-03 15:30:30 0 [Warning] InnoDB: 512 bytes should have been written at 12288 from (unknown file), but got only 0. Retrying.
2024-02-03 15:30:30 0 [Warning] InnoDB: 512 bytes should have been written at 12288 from (unknown file), but got only 0. Retrying.
2024-02-03 15:30:30 0 [Warning] InnoDB: 512 bytes should have been written at 12288 from (unknown file), but got only 0. Retrying.
2024-02-03 15:30:30 0 [Warning] InnoDB: 512 bytes should have been written at 12288 from (unknown file), but got only 0. Retrying.
2024-02-03 15:30:30 0 [Warning] InnoDB: Retry attempts for writing partial data failed.
2024-02-03 15:30:30 0 [ERROR] InnoDB: Write to file ib_logfile0 failed at offset 12288, 512 bytes should have been written, only 0 were written. Operating system error number 2. Check that your OS and file system support files of this size. Check also that the disk is not full or a disk quota exceeded.
2024-02-03 15:30:30 0 [ERROR] InnoDB: Error number 2 means 'No such file or directory'
2024-02-03 15:30:30 0 [Note] InnoDB: Some operating system error numbers are described at https://mariadb.com/kb/en/library/operating-system-error-codes/
2024-02-03 15:30:30 0 [ERROR] [FATAL] InnoDB: write("ib_logfile0") returned I/O error
240203 15:30:30 [ERROR] mysqld got signal 6 ;
Sorry, we probably made a mistake, and this is a bug.

Your assistance in bug reporting will enable us to fix this for the next release.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 11.2.2-MariaDB source revision: 929532a9426d085111c24c63de9c23cc54382259
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 468089 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/bin/mariadbd(my_print_stacktrace+0x31)[0x5c8cc8cc4591]
/usr/bin/mariadbd(handle_fatal_signal+0x47a)[0x5c8cc88a690a]
/usr/lib/libc.so.6(+0x3a710)[0x7e1151bdc710]
/usr/lib/libc.so.6(+0x8a83c)[0x7e1151c2c83c]
/usr/lib/libc.so.6(raise+0x18)[0x7e1151bdc668]
/usr/lib/libc.so.6(abort+0xd7)[0x7e1151bc44b8]
/usr/bin/mariadbd(+0x5f0f29)[0x5c8cc8444f29]
/usr/bin/mariadbd(+0x620b7f)[0x5c8cc8474b7f]
/usr/bin/mariadbd(+0xd08b4f)[0x5c8cc8b5cb4f]
/usr/bin/mariadbd(+0xdb04a1)[0x5c8cc8c044a1]
/usr/bin/mariadbd(+0x5f9511)[0x5c8cc844d511]
/usr/bin/mariadbd(+0x5e6d3e)[0x5c8cc843ad3e]
/usr/bin/mariadbd(+0x5eadf6)[0x5c8cc843edf6]
/usr/bin/mariadbd(+0xcc34d1)[0x5c8cc8b174d1]
/usr/bin/mariadbd(_Z24ha_initialize_handlertonP13st_plugin_int+0x8e)[0x5c8cc88a7a2e]
/usr/bin/mariadbd(+0x7d6a13)[0x5c8cc862aa13]
/usr/bin/mariadbd(+0x7da466)[0x5c8cc862e466]
/usr/bin/mariadbd(_Z11plugin_initPiPPci+0xe85)[0x5c8cc8635165]
/usr/bin/mariadbd(+0x6a1821)[0x5c8cc84f5821]
/usr/bin/mariadbd(_Z11mysqld_mainiPPc+0x2c1a)[0x5c8cc85024ea]
/usr/lib/libc.so.6(+0x23cd0)[0x7e1151bc5cd0]
/usr/lib/libc.so.6(__libc_start_main+0x8a)[0x7e1151bc5d8a]
/usr/bin/mariadbd(_start+0x25)[0x5c8cc84bb0a5]
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 0 bytes
Max resident set unlimited unlimited bytes
Max processes 127351 127351 processes
Max open files 32190 32190 files
Max locked memory 8388608 8388608 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 127351 127351 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Core pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h

Kernel version: Linux version 6.7.3-arch1-1 (linux@archlinux) (gcc (GCC) 13.2.1 20230801, GNU ld (GNU Binutils) 2.42.0) #1 SMP PREEMPT_DYNAMIC Thu, 01 Feb 2024 10:30:35 +0000

/usr/bin/mariadb-install-db: line 581: 28383 Aborted (core dumped) "$mysqld_bootstrap" $defaults $defaults_group_suffix "$mysqld_opt" -bootstrap $silent_startup "basedir=$basedir" "datadir=$ldata" --log-warnings=0 --enforce-storage-engine="" "-plugin-dir=${plugindir}" $args --max_allowed_packet=8M --net_buffer_length=16K

Installation of system tables failed! Examine the logs in
/var/lib/mysql for more information.

The problem could be conflicting information in an external
my.cnf files. You can ignore these by doing:

shell> /usr/bin/mariadb-install-db --defaults-file=~/.my.cnf

You can also try to start the mariadbd daemon with:

shell> /usr/bin/mariadbd --skip-grant-tables --general-log &

and use the command line tool /usr/bin/mariadb
to connect to the mysql database and look at the grant tables:

shell> /usr/bin/mariadb -u root mysql
MariaDB> show tables;

Try '/usr/bin/mariadbd --help' if you have problems with paths. Using
--general-log gives you a log in /var/lib/mysql that may be helpful.

The latest information about mariadb-install-db is available at
https://mariadb.com/kb/en/installing-system-tables-mysql_install_db
You can find the latest source at https://downloads.mariadb.org and
the maria-discuss email list at https://launchpad.net/~maria-discuss

Please check all of the above before submitting a bug report
at https://mariadb.org/jira
```

I dont know whether is it mariaDB or bcachefs issue (as used filesystem). But I would like to help here so I found out some rules for working configurations that "solve" this issue. Working configuration is:
```sh
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT_NO_FSYNC
```
`innodb_flush_method` can also be `nosync`. Everything else does not work (`fsync / O_DSYNC / littlesync / O_DIRECT`).

`innodb_flush_log_at_trx_commit` must always be `= 2` regardless of `innodb_flush_method`. If I set 0 or 1 it just does not work also.

With this configuration `mariadb-install-db` just works and I can start `mariadb.service`. I did not try real functionality of the mysql server but I expect that as I started without any errors then it could work.

Could you please navigate me to find out what is wrong? Maybe it is a mariadb issue itself but maybe a filesystem issue. I would like to find out with your support what part of filesystem does not work as expected from mariadb point of view and then I can open an issue in bcachefs if you confirm it is not mariadb issue.

Just for reference, my bcachefs volume is created using:
```sh
bcachefs format --block_size=4k --label=ssd.ssd1 /dev/nvme0n1p2 --label=ssd.ssd2 /dev/nvme1n1p2 --label=hdd.hdd1 /dev/sda --label=hdd.hdd2 /dev/sdb --replicas=2 --foreground_target=ssd --promote_target=ssd --background_target=hdd
```



 Comments   
Comment by Marko Mäkelä [ 2024-02-03 ]

The parameter innodb_flush_method was deprecated in MDEV-30136 (MariaDB Server 11.0).

Can you please try differentn values of the 4 Boolean parameters:

  • innodb_log_file_buffering (disable O_DIRECT on ib_logfile0)
  • innodb_data_file_buffering (disable O_DIRECT on data files)
  • innodb_log_file_write_through (enable O_DSYNC on the log)
  • innodb_data_file_write_through (enable O_DSYNC on persistent data files)

I would expect O_DIRECT to not be supported on many copy-on-write file systems that support transparent compression. Ideally, the file system should return an error from a system call that tries to enable O_DIRECT, instead of corrupting any data. As you can read in issues linked from MDEV-24854, also btrfs has had similar trouble with O_DIRECT.

Comment by Miroslav Rýzek [ 2024-02-03 ]

Thank you for your quick support.

These are results from testing. I used default configuration (removed innodb_flush_log_at_trx_commit and innodb_flush_method).

log_file_buffering data_file_buffering log_file_write_through data_file_write_through mariadb-install-db
OFF OFF OFF OFF ERROR
ON OFF OFF OFF OK
OFF ON OFF OFF ERROR
OFF OFF ON OFF ERROR
OFF OFF OFF ON ERROR
OFF OFF ON ON ERROR
ON OFF ON OFF OK
ON ON ON ON OK

I opened an issue in bcachefs and will link this for reference. It looks like some issue with O_DIRECT.

Comment by Marko Mäkelä [ 2024-02-05 ]

kezyr, thank you for testing this. Your table does not quite look like a truth table with 4 inputs, because there are 5 instead of 4 lines innodb_log_file_buffering=OFF. In any case, the error seems to correlate exactly with that.

One more difference between the log and data files is that the ib_logfile0 is being written synchronously (we may revisit MDEV-26603 some day) while the data files are typically written asynchronously via io_uring or libaio (io_setup(), io_submit(), io_getevents()). But, it does not look like this would be a problem. After all, we are using the simpler (synchronous) API for log file I/O.

I see that on Microsoft Windows, we pass the flags FILE_FLAG_NO_BUFFERING (corresponding to O_DIRECT) and FILE_FLAG_WRITE_THROUGH (corresponding to O_DSYNC or O_SYNC) to the CreateFile() call, which corresponds to open(2) in POSIX environments. Before MDEV-30136, on POSIX like systems we never passed the O_DIRECT flag to open(2). Only the O_DSYNC or O_SYNC were be passed to it. The commit message includes the following:

We will open files straight in O_DSYNC or O_SYNC mode when applicable. Data files we will try to open straight in O_DIRECT mode when the page size is at least 4096 bytes. For atomically creating data files, we will invoke os_file_set_nocache() to enable O_DIRECT afterwards, because O_DIRECT is not supported on some file systems. We will also continue to invoke os_file_set_nocache() on ib_logfile0 when innodb_log_file_buffering=OFF can be fulfilled.

I think that we should be able to fix this in MariaDB Server, by removing the function os_file_set_nocache() and always trying to open the files in O_DIRECT mode when needed. We probably should fix this in 10.11 as well, because it is the newest long-term supported (LTS) release. Fixing this in older LTS (10.5 or 10.6) could require more extensive changes.

Generated at Thu Feb 08 10:38:28 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.