[MDEV-26537] InnoDB corrupts files due to incorrect st_blksize calculation Created: 2021-09-04  Updated: 2023-08-08  Resolved: 2021-09-10

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2.40, 10.3.31, 10.4.21, 10.5.12, 10.6.4
Fix Version/s: 10.2.41, 10.3.32, 10.4.22, 10.5.13, 10.6.5

Type: Bug Priority: Blocker
Reporter: László Károlyi Assignee: Marko Mäkelä
Resolution: Fixed Votes: 2
Labels: None
Environment:

FreeBSD 13.0-RELEASE-p4, running within a FreeBSD jail (amd64)


Attachments: Text File freebsdfix.patch    
Issue Links:
Duplicate
duplicates MDEV-26604 InnoDB tables corruption and crashes Closed
is duplicated by MDEV-26388 assertion (innodb) fseg_inode_try_get... Closed
Problem/Incident
causes MDEV-27016 mysqld: storage/innobase/buf/buf0flu.... Closed
is caused by MDEV-26040 os_file_set_size() may not work on O_... Closed
Relates
relates to MDEV-13542 Crashing on a corrupted page is unhel... Closed
relates to MDEV-26640 InnoDB: Assertion failure in file /wr... Closed
relates to MDEV-28149 10.6.7 on FreeBSD hangs with InnoDB Closed
relates to MDEV-31875 ROW_FORMAT=COMPRESSED table: InnoDB: ... Closed
relates to MDEV-27401 Randomly destroyed tables after migra... Closed

 Description   

Hey,

I have a job that migrates an old database to in a different format (my custom, php based forum migrating to a django-based one).

Recently, I've seen the mariadb server crashing randomly, leaving other databases and their tables corrupted. I can't just pinpoint out the exact issue, but I can provide the traceback and coredump, in hopes of you finding and fixing the bug ASAP.

This posted crashlog is from executing mariadb-check -c --all-databases -S /tmp/mysql.sock, that according to its output, crashes on a certain CHECK TABLE command.

The crash log from mysqld.err:
===
2021-09-04 11:15:41 0x80f1b5d00 InnoDB: Assertion failure in file /wrkdirs/usr/ports/databases/mariadb105-server/work/mariadb-10.5.
12/storage/innobase/btr/btr0btr.cc line 4763
InnoDB: Failing assertion: !page_is_empty(page) || (level == 0 && page_get_page_no(page) == dict_index_get_page(index))
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
210904 11:15:41 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail.

Server version: 10.5.12-MariaDB
key_buffer_size=16777216
read_buffer_size=8388608
max_used_connections=35
max_threads=153
thread_count=34
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 11300560 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x81e206b98
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fffdafdef38 thread_stack 0x49000
0x130cd7c <my_print_stacktrace+0x3c> at /usr/local/libexec/mariadbd
0xc6e6af <handle_fatal_signal+0x28f> at /usr/local/libexec/mariadbd
0x801904e00 <_pthread_sigmask+0x530> at /lib/libthr.so.3

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x81e24def0): CHECK TABLE `forum_event_eventresponse`

Connection ID (thread ID): 138
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off

The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains information that should help you find out what is causing the crash.
Core pattern: %N.core
===

Since I couldn't upload the core file here (max filesize exceeded), please download the gzipped coredump from here: https://wetransfer.com/downloads/58bcaac03b364010532ebddd214c7dbd20210904093632/a91858186386ac9749129b6f2632db3520210904093659/8c9aa0



 Comments   
Comment by László Károlyi [ 2021-09-06 ]

The situation is getting out of hand. My mariadb started crashing left and right, sometimes even with unrecoverable errors where I had to load a dump back for the failing database from the day before.

I've recompiled the package on the production server to see if libraries are at fault (different version which should be unlikely), and can confirm that it is NOT the case.

Comment by Marko Mäkelä [ 2021-09-06 ]

It looks like you are executing CHECK TABLE by invoking the mariadb-check tool. For some table, CHECK TABLE is crashing because an InnoDB index page is corrupted.

Yes, our error reporting could be better. A core dump would not help us much. You can try to determine the table name and the page number from the core dump. A copy of the page and the output of the SHOW CREATE TABLE statement could allow us to diagnose the bug.

As far as I can tell, CHECK TABLE…QUICK would not perform the check that is failing; it is only counting the records in the leaf pages. You could still be able to recover all data by invoking mariadb-dump (mysqldump) or similar.

Comment by László Károlyi [ 2021-09-06 ]

Hey,

thanks for your response. Since this is a production server, I reverted back to 10.5.11 now (compiled it for myself again), which seems to be running safely for now, after dumping back the crashed tables. I am simply unable to test on this server, it is that essential.

As for which table it was, I can't tell you any good news. They were all kinds of tables from my nextcloud database. Sometimes it would start up again successfully with the broken table unavailable, other times (today when I mentioned it got more severe) it wouldn't.

Nothing ever helped but dropping and dumping back the broken tables from the dump which I saved. Luckily I have a daily backup script that dumps all databases.

I can safely say that this bug got introduced with 10.5.12. The crashes started happening after I upgraded to that version.

This is all I can say for now, in hopes of having it fixed with 10.5.13.

Comment by Marko Mäkelä [ 2021-09-06 ]

I would find it very surprising if such corruption was introduced in 10.5.12. We conduct extensive internal testing on Linux, covering also crash recovery. There were several changes to improve the write performance in the 10.5.12 release. While we do not implement any asynchronous I/O interface for FreeBSD, our internal testing does cover the "simulated AIO" code that I think should be the only option outside Linux and Microsoft Windows.

If there is a recent regression, it will not magically fix itself. To diagnose and fix this, I would need a somewhat repeatable test case.

Comment by László Károlyi [ 2021-09-06 ]

Unfortunately, I can't risk crashing mariadb on my production server. The first crash happened on Aug 28th, and I don't have the error log for that.

The recent crashes were on nextcloud's tables, but I've seen other InnoDB related crashes as well, sometimes when loading a dumpfile, sometimes when an ALTER TABLE happened. Hence, I can't provide now a reproducible example since I have many tables/databases.

I'm already overwhelmed with work, but one idea I have is to install a virtual machine and try to load those dumps to see if I can reproduce the InnoDB crash. Don't hold your breath until then though.

I really hope that someone else will be able to reproduce this bug and come up with a similar bug report.

Comment by László Károlyi [ 2021-09-06 ]

So while I was at it (and before I had no time for it anymore), I installed a VM real quick and installed a mariadbd into it. At first it loaded the passed dump nicely, but then it started the same crashes I had on my server. What causes it, unfortunately I can't tell you. But at least it became reproducible. All I can tell is that it happens from loading the aforementioned dump, and that's also not always, it seems.

I can send you the compressed virtualbox VM image privately with instructions on how to reproduce the issue.

Comment by László Károlyi [ 2021-09-06 ]

A bit more info on this: my research shows that it seems to happen with a higher probability when I try and load two dumps at the same time, so that multiple clients are trying to use mariadb on different databases.

Comment by Marko Mäkelä [ 2021-09-07 ]

If you can reproduce the corruption starting from a logical dump with 10.5.12 and not with 10.5.11, it could be related to the recent write performance improvements.

I am happy to diagnose the problem. I would need an environment where I can compile MariaDB from source code and repeat the problem. A virtual machine image could be a good starting point.

Given that reproducing the problem likely is sensitive to the relative performance of CPU cores and I/O, it might be easiest if you can provide ssh access to a suitable environment.

Comment by László Károlyi [ 2021-09-07 ]

I have already prepared an OVA from the box I installed, I'll extend it with the possibility of compiling mariadb for yourself on it. You'll be able to import it in any VM handler as an appliance.

I'll send you the wetransfer link and the instructions to your mariadb.com email address. Meanwhile for instant messages and faster communication, you can contact me on matrix under @karolyi:ksol.io if you like.

Cheers!

Comment by Marko Mäkelä [ 2021-09-08 ]

Motivated by this report, I went through some internal bug reports and diagnosed a fairly recently found race condition in MDEV-25776, which could explain this.
karolyi, can you please test if the bug goes away if you set innodb_change_buffering=none in your configuration?
That bug would not be anything new in 10.5.12. It could have been ‘exposed’ (made more probable to occur) by MDEV-26004 or other write performance improvements in 10.5.12.

Comment by László Károlyi [ 2021-09-08 ]

Hey,

(I'm not sure to use german or english here so I'll continue with english)

As per your last email, I tried setting innodb_flush_neighbors=ON but mariadb said /usr/local/libexec/mariadbd: unknown variable 'innodb_flush_neighbors=OFF.

After changing innodb_change_buffering=none, the issue remains the same, unfortunately.

Meanwhile I joined the zulipchat instance, where do I need to go there?

Comment by László Károlyi [ 2021-09-08 ]

Looking up the option in the documentation from my comment before, the accepted option was innodb_flush_neighbors=0 but that didn't make the crash go away, unfortunately.

Comment by Marko Mäkelä [ 2021-09-08 ]

The comments in this system are public and preferred to be in English. Thank you for narrowing this down. So, this ought to be something else than MDEV-25776. I sent you a private message in the chat.

Comment by Pedro de Oliveira [ 2021-09-08 ]

Hello, I just want to add some information.

The "change" that causes this crash is common in all the latest MariaDB versions for the 10.3, 10.4 and 10.5 trees.
The last version of every tree is crashing on FreeBSD.

https://mariadb.com/kb/en/mariadb-10512-release-notes/
https://mariadb.com/kb/en/mariadb-10421-release-notes/
https://mariadb.com/kb/en/mariadb-10331-release-notes/

In common in all the releases is:

  • InnoDB no longer acquires advisory file locks by default (MDEV-24393)
  • Some fixes from MySQL 5.7.35 (MDEV-26205)
  • And the two CVEs

So the problem was added in one of these three changes probably. My guess is the InnoDB one, but just a guess.
I went back to 10.5.11 to be able to run the service while this is happening.

Comment by Marko Mäkelä [ 2021-09-08 ]

I can repeat a crash on my system in the provided VirtualBox image. But, I would need some additional help regarding the build options and debugging information. I am not seeing complete stack traces in gdb. All InnoDB function names seem to be replaced with ??.

Comment by Pedro de Oliveira [ 2021-09-08 ]

If you are building MariaDB from the ports, probably this will help: https://wiki.freebsd.org/BenWoods/DebuggingPorts

Comment by Marko Mäkelä [ 2021-09-08 ]

falsovsky, thank you for the update. If the 10.3 and 10.4 series are affected, this could be completely unrelated to the write performance improvements in 10.5 and 10.6.
With karolyi, I debugged this a bit. I can repeat it in the FreeBSD image, but not on Linux. The test case was a parallel load of one mysqldump output file into two databases. The data directory size at the end (no crash on Linux, no matter what I tried) is about 7GiB. The buffer pool size for the crash was 128M. karolyi did not repeat the crash when using innodb_buffer_pool_size=7g. That is why I would suspect the write performance improvements.

I was not able to reproduce any crash with the same test case on Linux. I tried the latest 10.5 as well as 10.5.12, even with innodb_use_native_aio=0 to mimic FreeBSD (disabling the io_uring that was in the build).

I think that this needs help from the FreeBSD community because we do not have that platform on our CI systems nor anyone who is developing on the platform. I am happy to help once this has been narrowed down a little further.

I would suggest that you read the source code revision history (git log mariadb-10.3.30..mariadb-10.3.31 storage/innobase and similar). Maybe karolyi could test his dump on the 10.3 and 10.4 series as well? The earlier version reproduces the crash, the easier it should be to narrow down the cause.

Comment by Pedro de Oliveira [ 2021-09-08 ]

Hello Marko!

I just did a git bisect between these two releases.

https://github.com/MariaDB/server/compare/mariadb-10.4.20...mariadb-10.4.21

After building MariaDB a thousand times, I think I identified the evil commit:

https://github.com/MariaDB/server/commit/58252fff15acfe7c7b0452a87e202e3f8e454e19

I created a patch ( freebsdfix.patch ) that reverts that commit, applied it to mariadb-10.4.21 and it seems to no longer crash. I haven't been able to crash it yet

Do you want me to try to build mariadb-10.4.21 with debugging and give a somewhat useful stack trace?

Comment by László Károlyi [ 2021-09-08 ]

Hey Pedro,

thanks for your great work. Just a question, are you allowed to disclose how you were able to crash the server? In my case it was loading a dump but if a more easy way exists (that is not loading a huge dump twice simultaneously), it might help in tracking down the cause as well.

Comment by Pedro de Oliveira [ 2021-09-09 ]

Hey,

The best way that I managed to crash it almost always, without having to wait too long was:

I have a SQL dump (~100MB of my private projects, about 10 databases with a few tables).

In one terminal I run mysql and 'source dump.sql;' on it, and while it is importing, on other three terminals I start running "mysqldump -A > /dev/null" at the same time, and keep repeating them.

MariaDB almost always crashes before the dump is imported.

Sorry if I don't have a "scientific way" of doing it.

Comment by Pedro de Oliveira [ 2021-09-09 ]

I managed to build MariaDB with debugging symbols and ran it inside gdb.
Here's the backtrace when it crashes, don't know if its useful or not.

Version: '10.4.21-MariaDB' socket: '/tmp/mysql.sock' port: 3306 MariaDB Server
[New LWP 321998 of process 9710]
[New LWP 321999 of process 9710]
[New LWP 322000 of process 9710]
[New LWP 322001 of process 9710]
2021-09-09 3:46:39 27 [ERROR] InnoDB: In pages [page id: space=2, page number=23] and [page id: space=2, page number=24] of index `PRIMARY` of table `mysql`.`innodb_index_stats`
InnoDB: broken FIL_PAGE_NEXT or FIL_PAGE_PREV links
2021-09-09 3:46:39 27 [ERROR] InnoDB: In pages [page id: space=2, page number=23] and [page id: space=2, page number=24] of index `PRIMARY` of table `mysql`.`innodb_index_stats`
InnoDB: 'compact' flag mismatch
2021-09-09 3:46:39 27 [ERROR] InnoDB: Page index id 0 != data dictionary index id 19
2021-09-09 03:46:39 0x825303100 InnoDB: Assertion failure in file /home/falso/bleh/storage/innobase/btr/btr0btr.cc line 5008
InnoDB: Failing assertion: !page_is_empty(page) || (level == 0 && page_get_page_no(page) == dict_index_get_page(index))
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.

Thread 34 received signal SIGABRT, Aborted.
[Switching to LWP 322000 of process 9710]
thr_kill () at thr_kill.S:4
4 thr_kill.S: No such file or directory.
(gdb) bt
#0 thr_kill () at thr_kill.S:4
#1 0x000000080175c064 in __raise (s=s@entry=6) at /usr/src/lib/libc/gen/raise.c:52
#0 thr_kill () at thr_kill.S:4 [0/1863]
#1 0x000000080175c064 in __raise (s=s@entry=6) at /usr/src/lib/libc/gen/raise.c:52
#2 0x0000000801810f29 in abort () at /usr/src/lib/libc/stdlib/abort.c:67
#3 0x00000000011d1e0b in ut_dbg_assertion_failed (
expr=0x4af978 "!page_is_empty(page) || (level == 0 && page_get_page_no(page) == dict_index_get_page(index))", file=<optimized out>, line=<optimized out>, line@entry=5008)
at /home/falso/bleh/storage/innobase/ut/ut0dbg.cc:60
#4 0x00000000010567a7 in btr_validate_level (index=0x8248682e8, trx=0x816204688,
level=0, lockout=<optimized out>)
at /home/falso/bleh/storage/innobase/btr/btr0btr.cc:5006
#5 btr_validate_index (index=<optimized out>, index@entry=0x8248682e8,
trx=0x816204688) at /home/falso/bleh/storage/innobase/btr/btr0btr.cc:5350
#6 0x0000000000fe191c in ha_innobase::check (this=0x8341e4820, thd=0x836006448,
check_opt=0x83600b358)
at /home/falso/bleh/storage/innobase/handler/ha_innodb.cc:14822
#7 0x0000000000b5bb44 in handler::ha_check (this=0x8341e4820, thd=0x836006448,
check_opt=0x83600b358) at /home/falso/bleh/sql/handler.cc:4372
#8 0x0000000000e9fd4a in mysql_admin_table (thd=<optimized out>,
thd@entry=0x836006448, tables=tables@entry=0x83604c830,
check_opt=check_opt@entry=0x83600b358, operator_name=<optimized out>,
lock_type=lock_type@entry=TL_READ_NO_INSERT, org_open_for_modify=false,
repair_table_use_frm=<optimized out>, extra_open_options=32, prepare_func=0x0,
operator_func=
(int (handler::)(handler * const, THD *, HA_CHECK_OPT *)) 0xb5bad0 <handler::ha_check(THD, st_ha_check_opt*)>,
view_operator_func=0xe4d800 <view_check(THD*, TABLE_LIST*, st_ha_check_opt*)>,
is_cmd_replicated=<optimized out>) at /home/falso/bleh/sql/sql_admin.cc:873
#9 0x0000000000ea0840 in Sql_cmd_check_table::execute (this=<optimized out>,
thd=0x836006448) at /home/falso/bleh/sql/sql_admin.cc:1447
#10 0x0000000000d7c54b in mysql_execute_command (thd=thd@entry=0x836006448)
at /home/falso/bleh/sql/sql_parse.cc:6192
#11 0x0000000000d77f99 in mysql_parse (thd=thd@entry=0x836006448,
rawbuf=<optimized out>, length=<optimized out>,
parser_state=parser_state@entry=0x7fffdc8b17a0, is_com_multi=<optimized out>,
is_next_command=<optimized out>) at /home/falso/bleh/sql/sql_parse.cc:7995
#12 0x0000000000d7598d in dispatch_command (command=command@entry=COM_QUERY,
thd=thd@entry=0x836006448, packet=<optimized out>, packet@entry=0x83601f489 "",
packet_length=906282784, packet_length@entry=33, is_com_multi=false,
is_next_command=false) at /home/falso/bleh/sql/sql_parse.cc:1857
#13 0x0000000000d784be in do_command (thd=0x836006448)
at /home/falso/bleh/sql/sql_parse.cc:1373
#14 0x0000000000e93b39 in do_handle_one_connection (connect=connect@entry=0x82533cf88)
at /home/falso/bleh/sql/sql_connect.cc:1420
#15 0x0000000000e93916 in handle_one_connection (arg=0x82533cf88)
at /home/falso/bleh/sql/sql_connect.cc:1316
#16 0x000000080157e82b in thread_start (curthread=0x825303100)
at /usr/src/lib/libthr/thread/thr_create.c:292
#17 0x0000000000000000 in ?? ()

Comment by Marko Mäkelä [ 2021-09-09 ]

falsovsky, great work! And very nice that you can repeat this with a much smaller data dump than the 788 MiB (which made the data directory expand to about 6.8 GiB when loaded twice in parallel).

MDEV-26040 was something that I encountered when running tests on a SSD with 4096-byte block size. Obviously we cannot simply revert that changed, but instead the regression needs to be fixed and the fix tested in various environments. The environment with 4096-byte block size is not normally accessible to me.

I remember that posix_fallocate() has been the source of some surprises. On Linux, GNU libc would not only wrap the system call, but also implement a fallback that would write NUL bytes. That fallback is missing in musl libc, which is used in Alpine Linux (MDEV-18462) and in the GNU libc replacement in Docker on Microsoft Windows.

I must say that I only tested MDEV-26040 with GNU libc. It would be very interesting to see what would happen on musl libc.

falsovsky, would you or someone you know be able to contribute a proposed fix?

Comment by Marko Mäkelä [ 2021-09-09 ]

I tried to build MariaDB on Debian GNU/Linux after

sudo apt install musl-dev musl-tools

(and creating a wrapper script for g++ similar to the musl-gcc) but it turned out that I would need a special libncurses.
Next, following danblack’s suggestion, I downloaded a musl libc toolchain targeting "x86-64-core-i7" from https://toolchains.bootlin.com/ which does include libncurses. But I failed to convince CMake to use that library. (And that would be only the first of many potentially needed libraries.)
danblack is now trying a build in a Docker container.

Comment by Marko Mäkelä [ 2021-09-09 ]

On the FreeBSD virtual machine image provided by karolyi, I tried the following with innodb_buffer_pool_size=5m and the release binary of MariaDB 10.5.12:

seq -f 'create table t%g engine=innodb select * from seq_1_to_200000;' 1 100|mysql test&
seq -f 'create table u%g engine=innodb select * from seq_1_to_200000;' 1 100|mysql test&

It took a while, but the server did crash. In the data directory, files for tables t1‥t5 and u1‥u5 had been created. The u5.ibd file was slightly larger than the files for the completed t1‥t4 and u1‥u4, which were all identical size. This would seem to suggest that the file size is wrongly shrinking at some point. InnoDB cannot shrink data files for tables. TRUNCATE and similar operations would create a new smaller file.

Comment by Daniel Black [ 2021-09-09 ]

alpine-3.13 (tldr - no crash)

7c082bcf4bff:/build$ for a in $(seq 100); do echo "create table t$a engine=innodb  select * from seq_1_to_200000;"; done | client/mysql -S /tmp/build.sock test & for b in $(seq 100); do echo "create table u$b engine=innodb  select * from seq_1_to_200000;"; done | client/mysql -S /tmp/build.sock test
 
[1]+  Done                       for a in $(...); do echo "create table t${a} engine=innodb  select * from seq_1_to_200000;"; done | client/mysql -S /tmp/build.sock test
 
 
7c082bcf4bff:/build$ mkdir -p /tmp/${PWD##*/}-datadir && scripts/mysql_install_db --no-defaults --srcdir=$OLDPWD --builddir=$PWD --datadir=/tmp/${PWD##*/}-datadir --verbose && sql/mysqld --no-defaults --skip-networking --datadir=/tmp/${PWD##*/}-datadir --socket=/tmp/${PWD##*/}.sock --plugin-dir=${PWD}/mysql-test/var/plugins/ --innodb-buffer-pool-size=5M  --verbose
Installing MariaDB/MySQL system tables in '/tmp/build-datadir' ...
2021-09-09  8:10:49 0 [Note] /build/sql/mysqld (mysqld 10.5.13-MariaDB) starting as process 67389 ...
2021-09-09  8:10:49 0 [Note] InnoDB: The first innodb_system data file 'ibdata1' did not exist. A new tablespace will be created!
2021-09-09  8:10:49 0 [Note] InnoDB: Uses event mutexes
2021-09-09  8:10:49 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2021-09-09  8:10:49 0 [Note] InnoDB: Number of pools: 1
2021-09-09  8:10:49 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2021-09-09  8:10:49 0 [Note] mysqld: O_TMPFILE is not supported on /tmp (disabling future attempts)
2021-09-09  8:10:49 0 [Note] InnoDB: Using Linux native AIO
2021-09-09  8:10:49 0 [Note] InnoDB: Initializing buffer pool, total size = 134217728, chunk size = 134217728
2021-09-09  8:10:49 0 [Note] InnoDB: Completed initialization of buffer pool
2021-09-09  8:10:49 0 [Note] InnoDB: Setting file './ibdata1' size to 12 MB. Physically writing the file full; Please wait ...
2021-09-09  8:10:49 0 [Note] InnoDB: File './ibdata1' size is now 12 MB.
2021-09-09  8:10:49 0 [Note] InnoDB: Setting log file ./ib_logfile101 size to 100663296 bytes
2021-09-09  8:10:49 0 [Note] InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0
2021-09-09  8:10:49 0 [Note] InnoDB: New log file created, LSN=10317
2021-09-09  8:10:49 0 [Note] InnoDB: Doublewrite buffer not found: creating new
2021-09-09  8:10:49 0 [Note] InnoDB: Doublewrite buffer created
2021-09-09  8:10:49 0 [Note] InnoDB: 128 rollback segments are active.
2021-09-09  8:10:49 0 [Note] InnoDB: Creating foreign key constraint system tables.
2021-09-09  8:10:49 0 [Note] InnoDB: Creating tablespace and datafile system tables.
2021-09-09  8:10:49 0 [Note] InnoDB: Creating sys_virtual system tables.
2021-09-09  8:10:49 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2021-09-09  8:10:49 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2021-09-09  8:10:49 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2021-09-09  8:10:49 0 [Note] InnoDB: 10.5.13 started; log sequence number 0; transaction id 7
OK
2021-09-09  8:10:51 0 [Note] sql/mysqld (mysqld 10.5.13-MariaDB) starting as process 67470 ...
2021-09-09  8:10:51 0 [Note] InnoDB: Uses event mutexes
2021-09-09  8:10:51 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2021-09-09  8:10:51 0 [Note] InnoDB: Number of pools: 1
2021-09-09  8:10:51 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2021-09-09  8:10:51 0 [Note] mysqld: O_TMPFILE is not supported on /tmp (disabling future attempts)
2021-09-09  8:10:51 0 [Note] InnoDB: Using Linux native AIO
2021-09-09  8:10:51 0 [Note] InnoDB: Initializing buffer pool, total size = 5242880, chunk size = 5242880
2021-09-09  8:10:51 0 [Note] InnoDB: Completed initialization of buffer pool
2021-09-09  8:10:51 0 [Note] InnoDB: 128 rollback segments are active.
2021-09-09  8:10:51 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2021-09-09  8:10:51 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2021-09-09  8:10:51 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2021-09-09  8:10:51 0 [Note] InnoDB: 10.5.13 started; log sequence number 45094; transaction id 20
2021-09-09  8:10:51 0 [Note] Plugin 'FEEDBACK' is disabled.
2021-09-09  8:10:51 0 [Note] InnoDB: Loading buffer pool(s) from /tmp/build-datadir/ib_buffer_pool
2021-09-09  8:10:51 0 [Note] InnoDB: Buffer pool(s) load completed at 210909  8:10:51
2021-09-09  8:10:51 0 [Note] Reading of all Master_info entries succeeded
2021-09-09  8:10:51 0 [Note] Added new Master_info '' to hash table
2021-09-09  8:10:51 0 [Note] sql/mysqld: ready for connections.
Version: '10.5.13-MariaDB'  socket: '/tmp/build.sock'  port: 0  Source distribution
 
 
Filesystem                Size      Used Available Use% Mounted on
fuse-overlayfs          390.4G    333.0G     37.5G  90% /

Comment by Marko Mäkelä [ 2021-09-09 ]

Part of the MDEV-26040 fix basically simplified a fstat() call that was previously wrapped in os_file_get_size() and changed some logic to ensure that we will enlarge files in multiples of statbuf.st_blksize. Finally, if posix_fallocate() is not implemented or it fails, we would issue one more fstat() call to determine the file size.

When using innodb_flush_method=O_DIRECT (on systems where it has been implemented), all file reads and writes must be multiples of the block size. That was being violated on the system where the SSD sector size was 4096 bytes and a data file was not a multiple of that, for some reason. The test innodb.log_data_file_size artificially constructs such a scenario by shrinking an .ibd file using some Perl code.

Which block size would be reported on FreeBSD? falsovsky, could you try to debug the root cause of the failure?

Comment by Daniel Black [ 2021-09-09 ]

karolyi, falsovsky, can you try/test the build with fdatasync explicitly disabled - {{ cmake -DHAVE_FDATASYNC=0 -DHAVE_DECL_FDATASYNC=0 .}} and see if the fallback to fsync corrects the behaviour. The theory is fdatasync doesn't updated the size metadata returned by fstat somewhere in the BSD Kernel/libc.

Comment by Pedro de Oliveira [ 2021-09-09 ]

Hi Daniel,

Just built it with those two options, but sadly no change, still crashes.

Comment by Marko Mäkelä [ 2021-09-09 ]

For the record, I tested the following on Linux with GNU libc:

diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
index dde1975ea08..d05f44e142a 100644
--- a/storage/innobase/os/os0file.cc
+++ b/storage/innobase/os/os0file.cc
@@ -3334,7 +3334,7 @@ os_file_set_size(
 		return(success);
 	}
 
-# ifdef HAVE_POSIX_FALLOCATE
+# if 0
 	int err;
 	do {
 		if (fstat(file, &statbuf)) {

This would cause us to always use the fallback code that invokes a pwrite() wrapper in a loop. If FreeBSD supports posix_fallocate(), the above patch would help in narrowing down the problem. If it fixes the corruption, the problem is in the posix_fallocate() code path, else it is in the fallback code. Also, does the following test pass on FreeBSD?

cd mysql-test
./mtr innodb.log_data_file_size

Comment by Pedro de Oliveira [ 2021-09-09 ]

Yo Marko,

Just built it with that patch, it still crashes in my test case.
And the test seems to have ran OK, but after it ran there's a mysqld.core there.

TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
innodb.log_data_file_size '32k,innodb' [ pass ] 1162
innodb.log_data_file_size '64k,innodb' [ pass ] 1174
innodb.log_data_file_size '16k,innodb' [ pass ] 1279
innodb.log_data_file_size '4k,innodb' [ pass ] 1173
innodb.log_data_file_size '8k,innodb' [ pass ] 1164
--------------------------------------------------------------------------
The servers were restarted 4 times
Spent 5.952 of 25 seconds executing testcases

Completed: All 5 tests were successful.

Comment by László Károlyi [ 2021-09-09 ]

Hey @Pedro,

can you please share how you could compile mariadb so that the innodb storage would be compiled with said debug as well? When using the debug flags in /etc/make.conf, although mariadb was compiled with the debug flags, I only got the ?? chars when trying to print a backtrace on the core file from gdb.

It seems to me (us, because we did it together with Marko), that the debug flags specifications don't reach down to the innodb storage level, when using the debug mode specified in the URL you shared.

Comment by Marko Mäkelä [ 2021-09-09 ]

falsovsky, thank you. What if you revert the #if 0 and add #error after the original #ifdef HAVE_POSIX_FALLOCATE line? That is, is the posix_fallocate() code being used at all on FreeBSD?

In the revert freebsdfix.patch, you can see in the last two hunks that the computation of current_size was changed, and that an early return for the condition if (current_size >= size) had been added. I think that that condition only avoids an unnecessary memory allocation, because the following while loop includes the negation of that condition.

The intention was that current_size would be rounded down to a multiple of statbuf.st_blksize, so that we will be writing multiples of that in order to enlarge the file to the desired size. (Some last bytes in the file that do not form a multiple of a block size will be overwritten with NUL bytes.)

It looks like we are using pwrite() anywhere else than Microsoft Windows. So, I would think that the only cause of the corruption should be a funny value of statbuf.st_blksize.

Would the following (untested) patch fix the corruption? And what would it write to the server error log?

diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
index dde1975ea08..0cae9e9db95 100644
--- a/storage/innobase/os/os0file.cc
+++ b/storage/innobase/os/os0file.cc
@@ -3385,7 +3385,12 @@ os_file_set_size(
 		return false;
 	}
 	os_offset_t current_size = statbuf.st_size
+# if 0
 		& ~os_offset_t(statbuf.st_blksize - 1);
+# else
+	;
+	ib::info() << name << ":" << statbuf.st_blksize;
+# endif
 #endif
 	if (current_size >= size) {
 		return true;

(Sorry, I do not have a proper development environment in the FreeBSD VM image, so I have to rely on your help.)

Comment by Pedro de Oliveira [ 2021-09-09 ]

Hi Lazlo,

Building a DEBUG version of MariaDB doesn't work, because it gets ULTRA SLOW, and this bug is probably a race condition, and it doesn't crash while it is build with debug way.

The only way I managed to get some info was to build it outside of the ports tree. Manually on my home.

export CFLAGS="-ggdb -g3"
export CXXFLAGS=-ggdb -g3"
mkdir build
cd build
cmake .. -DBUILD_CONFIG=mysql_release -DWITHOUT_TOKUDB=true -DWITHOUT_ROCKSDB=true -DPLUGIN_MROONGA=NO -DCMAKE_INSTALL_PREFIX=/home/falso/mariadb
make -j16
make install

Comment by Pedro de Oliveira [ 2021-09-09 ]

Hello Marko,

That patch seems to fix the issue, at least I haven't been able to crash it yet. =)
Don't know what the info you want from the log. These appear while importing data to the database:

2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/push_mirror.ibd:65536
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/push_mirror.ibd:98304
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/push_mirror.ibd:114688
2021-09-09 21:17:24 8 [Note] InnoDB: ./mysql/innodb_index_stats.ibd:131072
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/reaction.ibd:131072
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/reaction.ibd:65536
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/reaction.ibd:98304
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/reaction.ibd:114688
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/reaction.ibd:131072
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/reaction.ibd:131072
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/reaction.ibd:131072
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/reaction.ibd:131072
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/reaction.ibd:131072
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/reaction.ibd:131072
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/release.ibd:131072
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/release.ibd:65536
2021-09-09 21:17:24 8 [Note] InnoDB: ./gitea/release.ibd:98304
....

Comment by László Károlyi [ 2021-09-09 ]

Hey Pedro,

thanks, just like I thought, a debug build of mariadb on FreeBSD from ports isn't really a debug build.

Keep in mind that there are patches under `/usr/ports/databases/mariadb105-server/files/`, you might want to apply those to see if they don't break anything.

Comment by Marko Mäkelä [ 2021-09-10 ]

falsovsky, thank you. Could it be that the st_blksize is reporting the current file size? https://www.unix.com/man-page/freebsd/2/stat/ says:

st_blksize The optimal I/O block size for the file.

Okay, it does not promise it to be a power of two (your output includes 0x18000 and 0x1c000 which are not powers of 2), or that it would remain constant over the lifetime of the file handle. That was what I would have expected.

The rounding down to a block size is only needed due to the constraints of O_DIRECT I/O. I see that O_DIRECT is documented in the FreeBSD 13 man 2 fcntl, so I suppose that we do support innodb_flush_method=O_DIRECT on FreeBSD. It is not the default setting before MDEV-24854 in MariaDB Server 10.6, but it was typically configured by everyone who wanted the last bit of performance.

It looks like we could fix the corruption with a preprocessor directive that would avoid reading st_blksize on FreeBSD. But, that would basically revert the fix of MDEV-26040 on FreeBSD. What is the correct way on FreeBSD to determine the block size of the underlying file system or block device, as a power of 2?

When it comes to MDEV-26040, many HDDs and even SSDs support a 512-byte block size for compatibility reasons. I first remember encountering a block device with 2048-byte block size in the mid-1990s. I know that there are some HDDs with 4096-byte block size, as well as some SSDs. But, they may be hard to come by.

Comment by Marko Mäkelä [ 2021-09-10 ]

I am setting this to Blocker status, so that we should address this before the next releases.

I could trivially disable the MDEV-26040 fix on any system that we have not tested, something like this:

diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
index dde1975ea08..3278a921d41 100644
--- a/storage/innobase/os/os0file.cc
+++ b/storage/innobase/os/os0file.cc
@@ -3385,7 +3385,10 @@ os_file_set_size(
 		return false;
 	}
 	os_offset_t current_size = statbuf.st_size
-		& ~os_offset_t(statbuf.st_blksize - 1);
+# ifndef __linux__
+		& ~os_offset_t(statbuf.st_blksize - 1)
+# endif
+		;
 #endif
 	if (current_size >= size) {
 		return true;

But, I would rather like to have a suggestion how to read the actual block size on various BSDs. And I would need your help in that.

Comment by Marko Mäkelä [ 2021-09-10 ]

One more idea: Could the st_blksize be reported differently if the file is in O_DIRECT mode? That is, would the output from my diagnostic patch be different if innodb_flush_method=O_DIRECT had been specified? Maybe someone can check the FreeBSD source code.

Comment by Marko Mäkelä [ 2021-09-10 ]

This appears to affect IBM AIX as well. A safe fix could be to assume that the file system block size is 4096 bytes, while in reality it could be a smaller power of 2. I think that a larger storage device block size is highly unlikely, because 4096 bytes is a very popular virtual memory page size. Only the smallest page size of the UltraSPARC would be larger (8192 bytes).

We also use st_blksize in a few other places, most notably when dealing with page_compressed tables. It remains to be seen in which way those tables are broken on the affected operating systems.

As far as I can tell, the only occasion where os_file_set_size() may currently be invoked on something else than an integer multiple of 4096 bytes is with ROW_FORMAT=COMPRESSED tables that use 1024 or 2048 bytes page size. The test innodb.check_ibd_filesize,4k is even covering that. We will have to make sure that when such files are being extended, we will not overwrite the last 1024, 2048, or 3072 bytes of actual payload with NUL bytes.

Comment by Marko Mäkelä [ 2021-09-10 ]

I conducted some more checks and pushed a tentative fix for testing as well as a port of the fix to 10.5. We will run some internal stress tests to make sure that there is no regression for ROW_FORMAT=COMPRESSED tables.

It looks like INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES.FS_BLOCK_SIZE is reporting the st_blksize. For page_compressed tables, ‘insane’ values of st_blksize>4096 should be mapped to 512 bytes already.

Comment by Pedro de Oliveira [ 2021-09-10 ]

Hello Marko,

A friend of mine made me the following source to be able to test what you said on FreeBSD.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
 
#include <sys/types.h>
#include <sys/stat.h>
 
#define __USE_GNU
#include <fcntl.h>
 
 
int main(void) {
        int fd = 0;
        struct stat st;
 
        if ((fd = open("/etc/group", O_DIRECT)) < 0) {
                puts("open(..., O_DIRECT) failed.");
                exit(EXIT_FAILURE);
        }
 
        if (fstat(fd, &st) < 0) {
                puts("fstat() failed.");
                exit(EXIT_FAILURE);
        }
 
        printf("Block size (with O_DIRECT): %lu\n", st.st_blksize);
 
        close(fd);
 
        if ((fd = open("/etc/group", 0)) < 0) {
                puts("open(..., 0) failed.");
                exit(EXIT_FAILURE);
        }
 
        if (fstat(fd, &st) < 0) {
                puts("fstat() failed.");
                exit(EXIT_FAILURE);
        }
 
        printf("Block size (without O_DIRECT): %lu\n", st.st_blksize);
 
        close(fd);
 
        return 0;
}

falso@bisect:~/test $ uname -a
FreeBSD bisect 13.0-RELEASE-p4 FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021     root@amd64-builder.daemonology.net:/usr/obj/usr/src/amd64.amd64/sys/GENERIC  amd64
falso@bisect:~/test $ gcc -o zbr zbr.c
falso@bisect:~/test $ ./zbr
Block size (with O_DIRECT): 4096
Block size (without O_DIRECT): 4096

Comment by László Károlyi [ 2021-09-10 ]

Just a hunch, but searching around I found getbsize. Not sure if it helps.

Comment by László Károlyi [ 2021-09-10 ]

Also, one should consider that in the case of ZFS, the DB files can be in multiple zfs mounts with multiple blocksizes, possibly. A certain discovery has to be done in this case for the block sizes.

Comment by Marko Mäkelä [ 2021-09-10 ]

falsovsky, yes, the test program delivers a result that I would have expected before this bug was filed. Do you have an idea why the call in InnoDB would return such values? Note that we got a similar corruption on IBM AIX, so it definitely is not FreeBSD specific.

I think that it is easiest to extend the files with 4096-byte aligned writes when posix_fallocate() is not available, and to hope that such writes will be compatible with O_DIRECT. For the rare case that someone uses ROW_FORMAT=COMPRESSED tables with a smaller page size and the table is being extended to an odd number of pages, the writes of those pages may extend the otherwise preallocated file with 1, 2, or 3 kilobytes. In that way, we can avoid adding more operating system specific tweaks.

Comment by Matthias Leich [ 2021-09-10 ]

A source tree origin/bb-10.5-marko commit 716f04a40f225af11fa88e793441da96dd0ddf21
containing a fix for the current MDEV-26537 behaved well in RQG testing (innodb_page_size=4K only).

Comment by Pedro de Oliveira [ 2021-09-10 ]

Will this fix only be applied to 10.5 tree or will it be also applied on the other versions?

Comment by Marko Mäkelä [ 2021-09-11 ]

I have pushed the fix to the 10.2 branch and merged up to 10.5 so far.

Comment by Sergei Golubchik [ 2021-09-21 ]

It was fixed in FreeBSD Port collection on September 12. New, fixed, versions:

  • 10.3.31_1
  • 10.4.21_1
  • 10.5.12_1
Comment by Marko Mäkelä [ 2021-11-10 ]

It turns out that the fix causes a debug assertion failure on crash recovery (MDEV-27016) when using ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=2 or 1, at least on 10.5 and 10.6. I think that the scenario must involve DDL, because I was unable to repeat it with DML (the file size would always jump from 64 KiB to 5 MiB).

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index 583b793aff7..d3b6b1f9371 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -962,7 +962,10 @@ static page_id_t buf_flush_check_neighbors(const fil_space_t &space,
                                            page_id_t &id, bool contiguous,
                                            bool lru)
 {
-  ut_ad(id.page_no() < space.size);
+  ut_ad(id.page_no() < space.size_in_header);
+  ut_ad(id.page_no() < space.size +
+        (space.physical_size() == 2048 ? 1
+         : space.physical_size() == 1024 ? 3 : 0));
   /* When flushed, dirty blocks are searched in neighborhoods of this
   size, and flushed along with the original page. */
   const ulint s= buf_pool.curr_size / 16;

Both space.size and space.committed_size would be 64, while space.size_in_header would be 65 for the copy of the data directory. The space.size is based on the file size, and we’d never extend files in increments smaller than 4096 bytes.

This patch has not gone through our stress tests yet. In any case, only cmake -DCMAKE_BUILD_TYPE=Debug should be affected by this.

Generated at Thu Feb 08 09:46:03 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.