[MDEV-28149] 10.6.7 on FreeBSD hangs with InnoDB
Created: 2022-03-22
Updated: 2022-07-03
Resolved: 2022-07-03

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.7
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: László Károlyi
Assignee: Marko Mäkelä
Resolution: Incomplete
Votes: 0
Labels: None
Environment:

FreeBSD 13.0-RELEASE-p8, running within a jail


Attachments: File dump.sql    
Issue Links:
Relates
relates to MDEV-26537 InnoDB corrupts files due to incorrec... Closed

 Description   

Hey,

this might be related to MDEV-26537, but after having installed 10.6.7 on FreeBSD (in an effort to upgrade from 10.5.x), mariadb starts to hang on certain InnoDB queries. I noticed it happening with my nextcloud executing some queries, completely randomly. 10.5.15 is the one I run now, it doesn't seem to be affected by this.

When this happens, mariadb starts to overwhelm itself, the queries against the DB hang, and I can't even kill them from within the mariadb command line. If I try to restart mariadb, it doesn't shut down by itself; only sending a KILL signal helps, which in turn corrupts other databases. While this is happening, it also eats up memory and drives the disk and CPU at almost the maximum rate.

This happened a couple of times. I even tried dumping and reloading the nextcloud database to see if a freshly loaded DB would help, but it didn't. The last time it happened, my production server almost died: memory was exhausted, swap was full, and the system started killing the mariadb process because it was the one eating up all the memory.

I'm afraid I can't be of much help to you now as I had to forcefully go back to 10.5.x (which resulted in a halfway upgraded DB but I could get around it for now), to get my production server in a stable condition again.

We had a go at MDEV-26537 earlier, that seemed to have been fixed, I think this is related to that bug.

It is frightening to know that a FreeBSD package that a lot of my systems are relying on, is buggy and breaks my server.



 Comments   
Comment by Marko Mäkelä [ 2022-03-22 ]

karolyi, in which way would KILL corrupt the database? InnoDB is supposed to be crash-safe for DML operations, and starting with 10.6 it aims to be that for DDL too, barring some bugs like MDEV-27234.

Do your InnoDB tables contain SPATIAL indexes? If yes, this would be a duplicate of MDEV-26781 and for FreeBSD fixed in MDEV-26476 in the upcoming 10.6.8 release.

Comment by László Károlyi [ 2022-03-22 ]

Hey,

thanks for the quick response. The corrupted databases are MyISAM DBs (revive adserver and my own forum engine from ages ago). Usually the DB can fix it upon next startup, which is good, but the corruption of them (and the server hang itself) is really frightening.

I've attached the current structure of the nextcloud DB. If I'm not mistaken, there are no spatial indexes in there. This is the only DB that kept freezing; the other ones (a lot of them also InnoDB-based) have none of these issues.

dump.sql
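
A quick way to double-check a schema dump for spatial indexes is to search it for the index definitions. This is only a sketch: it assumes the dump uses mysqldump-style `SPATIAL KEY` or `SPATIAL INDEX` clauses, the helper name `count_spatial` is made up for illustration, and the two-table schema below is invented sample data, not the attached dump.sql.

```shell
# Count the lines of a schema dump that define a spatial index.
# "|| true" keeps the exit status at 0 when grep finds no match.
count_spatial() {
  grep -ciE 'SPATIAL[[:space:]]+(KEY|INDEX)' "$1" || true
}

# Self-contained demonstration on a made-up two-table schema.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CREATE TABLE t1 (id INT PRIMARY KEY, KEY k (id)) ENGINE=InnoDB;
CREATE TABLE t2 (g GEOMETRY NOT NULL, SPATIAL KEY sg (g)) ENGINE=InnoDB;
EOF
spatial_lines=$(count_spatial "$sample")
echo "lines with SPATIAL index definitions: $spatial_lines"
rm -f "$sample"
```

Against a real dump, the call would be `count_spatial dump.sql`; a result of 0 would support the statement that no spatial indexes are present.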

Comment by Marko Mäkelä [ 2022-03-22 ]

Thank you. MyISAM never was crash-safe. ENGINE=Aria is mostly like MyISAM, but aiming to be crash-safe.

Would you be able to produce stack traces of all threads during the hang?

Comment by Marko Mäkelä [ 2022-03-22 ]

karolyi, we are in the process of setting up FreeBSD in our continuous integration system. elenst pointed me to a hang that I was able to reproduce on FreeBSD 13 after disabling the MDEV-26476 fix:

diff --git a/storage/innobase/include/rw_lock.h b/storage/innobase/include/rw_lock.h
index 70607b97..cef4ebdc 100644
--- a/storage/innobase/include/rw_lock.h
+++ b/storage/innobase/include/rw_lock.h
@@ -30,7 +30,7 @@ this program; if not, write to the Free Software Foundation, Inc.,
 # define SUX_LOCK_GENERIC /* fall back to generic synchronization primitives */
 #endif
 
-#if !defined SUX_LOCK_GENERIC && 0 /* defined SAFE_MUTEX */
+#if !defined SUX_LOCK_GENERIC
 # define SUX_LOCK_GENERIC /* Use dummy implementation for debugging purposes */
 #endif
 

With this revert, a bash invocation of

./mtr --parallel=8 --repeat=10 encryption.compressed_import_tablespace{,,,,,,,}

would almost immediately hang 3 of the 8 concurrently running server processes, and the 4th server would hang after completing one 3-second test round. This was on a system that was equipped with 8 virtual CPUs.
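
For readers unfamiliar with the `{,,,,,,,}` suffix: in bash, seven commas inside braces yield eight empty alternatives, so the test name is repeated eight times, which matches the eight `--parallel=8` workers. The following sketch only prints the expanded names and counts them; it does not run mtr. It calls `bash` explicitly because brace expansion is a bash feature that a plain POSIX `sh` may not provide.

```shell
# Expand the brace expression in bash, print one name per line, count the lines.
count=$(bash -c 'printf "%s\n" encryption.compressed_import_tablespace{,,,,,,,}' | wc -l | tr -d ' ')
echo "bash expands the pattern into $count test names"
```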

On a retry, it was a little better: 6 servers survived 2 rounds, 5 servers survived 6 rounds, and only 4 servers finished all 10 rounds.

Finally, I successfully ran a test without the above patch:

./mtr --parallel=8 --repeat=100 encryption.compressed_import_tablespace{,,,,,,,}

10.6 35725df6e2791d19bebf0301bb9fcb6200f5b00d

encryption.compressed_import_tablespace 'innodb' w3 [ 100 pass ]   3213
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 2536.165 of 338 seconds executing testcases
 
Completed: All 800 tests were successful.

This suggests that MDEV-26781 affects more than just SPATIAL INDEX. That test does not use any SPATIAL INDEX.

Can you repeat the hang with a recent 10.6 branch?

Also, maybe it would be a good idea for the FreeBSD packagers to run the supplied regression tests? Debian for one does it, and it has resulted in some useful bug reports, such as MDEV-27985.

Comment by László Károlyi [ 2022-03-22 ]

I'm afraid I can't. For starters, it doesn't crash but hangs (maxing out CPU/memory/disk), so there is no crash log when I kill it with SIGKILL.

Secondly, I'm unwilling to risk crashing my production server again, causing outages.

I think the best bet would be to try and reproduce it on a VM. The DB structure used by nextcloud is clearly the suspect in here, but I don't know how to reproduce that bug at this point.

Comment by László Károlyi [ 2022-03-22 ]

re: https://jira.mariadb.org/browse/MDEV-28149?focusedCommentId=217619&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-217619

I can try and set up a new jail on my server with 10.6 packaged, and see what gives with this test.

I have to run now but I'll touch base later with the results.

Comment by Marko Mäkelä [ 2022-03-22 ]

I do not know about FreeBSD, but in Linux it is possible to attach gdb to a running process. Depending on how the kernel was built, it may require super-user privileges even when the owner of the process is the same.

I would also expect it to be possible to kill a process with SIGABRT to obtain a core dump from which stack traces could be extracted. In fact, this is what the mtr test framework does when a test case times out.
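
The two approaches mentioned above could look roughly like the sketch below. It assumes that gdb is installed and that the caller has sufficient privileges; the function names and the process ID 12345 are placeholders. `DRY_RUN` defaults to 1, so the commands are only printed. A core dump after SIGABRT is written only if core dumps are enabled for the server and the system.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Attach gdb to a live process and print a backtrace of every thread.
dump_stacks() {
  run gdb -p "$1" -batch -ex "thread apply all backtrace"
}

# Alternative: send SIGABRT so that the process can leave a core dump.
abort_for_core() {
  run kill -s ABRT "$1"
}

dump_stacks 12345      # 12345 is a placeholder process ID
abort_for_core 12345
```

Running it with `DRY_RUN=0` and the real process ID of the hung server would execute the commands instead of printing them. Note that the printed form drops the quotes around the gdb command string, so it is meant for inspection rather than copy and paste.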

Comment by László Károlyi [ 2022-03-22 ]

I'm trying to execute the test and getting this result:

encryption.compressed_import_tablespace w1 [ skipped ] Needs example_key_management
encryption.compressed_import_tablespace w6 [ skipped ] Needs example_key_management
encryption.compressed_import_tablespace w5 [ skipped ] Needs example_key_management
encryption.compressed_import_tablespace w7 [ skipped ] Needs example_key_management
encryption.compressed_import_tablespace w3 [ skipped ] Needs example_key_management

Is there a way to not skip the tests?

Comment by Marko Mäkelä [ 2022-03-22 ]

karolyi, when I built from the source code with cmake and make, the file_key_management encryption plugin was compiled. I am sorry, but, I do not know what logic might prevent it from being built.

I think that it would be more useful to apply the futex fix to the 10.6.7 source code and test if it fixes the hangs for you.

Comment by László Károlyi [ 2022-03-22 ]

Unfortunately I'm not able to experiment around with patches on a live system that gets deployed from official packages, so I need to figure out a way to a) either have my test mysql compiled that way or b) get the tests running without requiring that option.

Choice B would be preferable as the tests need to run against the binary provided by FreeBSD.

Is there a way for me to get them running without that requirement?

Comment by László Károlyi [ 2022-03-23 ]

This just showed up today:
https://www.freebsd.org/security/advisories/FreeBSD-EN-22:13.zfs.asc

I'm gonna update my server and see if this bug had to do with this. I'm just guessing though, but I've seen this error in other circumstances.

Comment by Marko Mäkelä [ 2022-03-23 ]

karolyi, I do not expect that your hang has anything to do with anything in the file system, but I am of course guessing here, because I have not seen any stack traces of the hang.

If you are unable to apply the futex fix to a copy of MariaDB 10.6.7 and to rebuild the package, maybe the FreeBSD packager could do that. If that is not possible, you should eventually get the fix with the FreeBSD package of 10.6.8. MariaDB does not currently build or provide any packages for FreeBSD; they are created based on our source code releases.

Comment by László Károlyi [ 2022-03-23 ]

Hey,

I still want to investigate and reproduce this issue, it's just I'm kinda busy. I updated the server now with the zfs fix.

Yesterday I tried compiling mariadb manually in a jail (mostly because the mtr tool is not packaged), but the compilation bailed out with an error. That was when I went with the packaged version, which includes some patches that allow it to compile.

I'm not sure what those patches do, but when I have some more time, I'll retry the compilation and see what I can do to get it compiled without the FreeBSD package patches. As for what those patches are, you may want to check them out for yourself here, since as a developer you might be smarter about them than I am: https://cgit.freebsd.org/ports/tree/databases/mariadb106-server/files

It'd be useful to have your insight on this. I hope these patches don't break anything in the packaged MariaDB.

Comment by Marko Mäkelä [ 2022-03-23 ]

karolyi, I see. Last time I checked (for MDEV-26537), the FreeBSD package build was invoked by some make based script that invoked cmake and finally make. I did not figure out how any configuration options would be passed to the cmake. It could be that the FreeBSD packaging is excluding any encryption plugins. It is an optional component.

For checking what I believe to share a root cause with this hang, I did a direct cmake and make on our source code.

I am sorry, but I don’t think I can help you create an updated package. I suggest that you ask some FreeBSD maintainers for help if needed. I see that the packaging includes a few patches, which I think would be better applied directly to our source code. That is something that we should look at once we have set up a FreeBSD builder in our continuous integration (CI) system, hopefully in the not too distant future.

Comment by László Károlyi [ 2022-03-24 ]

Hey,

I've contacted the package maintainer of FreeBSD to come over here in hopes of finding out more about this bug, and also at the same time to improve your integration with FreeBSD.

Here's to hoping he will show up and help us out here.

If nothing else happens, I'll have to wait until 10.6.8 arrives and see if it breaks my systems still.

Comment by Dries Michiels [ 2022-04-23 ]

I have also observed this issue with my Nextcloud instance. Is there anything we can do in the meantime? (Are patches available that fix the issue?)

Comment by Elena Stepanova [ 2022-05-29 ]

10.6.8 is now in ports.

Comment by László Károlyi [ 2022-07-03 ]

To be quite honest, I'm not sure if this is fixed. I'm just afraid to test this on a production system. Last time it wasn't fun.
