[MDEV-24284]  InnoDB: Record field 64 len 18446744... Created: 2020-11-26  Updated: 2023-04-11  Resolved: 2023-04-11

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST, mariabackup, Storage Engine - InnoDB
Affects Version/s: 10.3.25
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Roope Pääkkönen (Inactive) Assignee: Jan Lindström (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Centos 7


Issue Links:
Relates
relates to MDEV-24578 MariaDB 10.5 fails to join Galera clu... Closed

 Description   

We have some mariadb 10.3 instances where the server has log entries similar to below:

[ERROR] InnoDB: Record field 64 len 18446744073709544479

They appear a few times per day at most; I haven't found what the exact trigger is.

The "Record field" number (64 here) may vary, but the "len" is always 18446744073709544479.

We haven't noticed any operational issues on the servers so far, but we are still interested to learn whether this is something to look into more deeply.
For example, running CHECK TABLE against all tables doesn't bring up this message, either.
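For reference, one way to run CHECK TABLE across everything in a single pass is the mysqlcheck client (a hedged example; credentials and connection options will vary per environment):

```shell
# Run CHECK TABLE against every table in every database
# (--check is mysqlcheck's default action).
mysqlcheck --all-databases --check -u root -p
```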

I don't really have much information on when/how these might have begun to happen.
But what seems somewhat related is that the servers where this happens are part of a Galera cluster; single-instance servers do not have these entries.



 Comments   
Comment by Marko Mäkelä [ 2020-11-26 ]

The number corresponds to -7137 in 2's complement 64-bit arithmetic. That might have something to do with the default innodb_page_size=16384.
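As a quick sanity check, the reported length can be reinterpreted as a signed 64-bit value (a minimal sketch; the constant is taken from the error message above):

```python
# The "len" value from the error message, read as an unsigned 64-bit integer.
reported_len = 18446744073709544479

# Reinterpret the same bit pattern in two's complement: values with the
# top bit set (>= 2**63) represent negatives, so subtract 2**64.
signed_len = reported_len - 2**64 if reported_len >= 2**63 else reported_len

print(signed_len)  # -7137
```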

Have the affected systems been initiated by snapshot transfer? If yes, in which way exactly?

Comment by Roope Pääkkönen (Inactive) [ 2020-11-26 ]

Hi, yes, these have been set up with mariabackup SST

Comment by Marko Mäkelä [ 2020-11-26 ]

Thank you. That could limit the problem to Mariabackup or the way it is used in Galera snapshot transfer. Do you execute any ALTER TABLE or OPTIMIZE TABLE statements? If yes, have you tried setting innodb_log_optimize_ddl=OFF (which MDEV-23720 made the default since 10.3.26)?
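The setting mentioned above can be inspected and changed at runtime (a sketch; innodb_log_optimize_ddl exists in MariaDB 10.2 through 10.4 and is a dynamic variable):

```sql
-- Inspect the current value (the default changed to OFF in 10.3.26 per MDEV-23720).
SHOW GLOBAL VARIABLES LIKE 'innodb_log_optimize_ddl';

-- Disable it without a restart; add the same setting under [mysqld]
-- in the server configuration file to make it persistent.
SET GLOBAL innodb_log_optimize_ddl = OFF;
```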

Comment by Roope Pääkkönen (Inactive) [ 2020-11-28 ]

Yes, ALTERs are run every now and then, at minimum as part of software updates; OPTIMIZE is not.
There are also some DDLs run as part of cron or indexing jobs, like CREATE / RENAME / TRUNCATE TABLE, and a lot of CREATE TEMPORARY. Some timestamps of these InnoDB errors correspond to the crons' timestamps, I believe.

But I don't have any exact query entry as of now that would match that error message.

I haven't tried the optimize_ddl setting; could it help?
In fact, it may be that the one test environment we have updated to 10.3.27 hasn't spawned these messages since it was updated, if the default setting changed at the same time.

Comment by Mark Reibert [ 2021-02-11 ]

I am experiencing this too, on a three-node MariaDB Galera cluster (MariaDB version 10.4.14 with Galera 26.4.5).

Note I am performing DDL statements as several tables in my DB are partitioned and there is a nightly cron that executes ALTER TABLE to add/drop partitions. Also, innodb_log_optimize_ddl is enabled as that is the default with MariaDB 10.4.14.

Comment by Mark Reibert [ 2021-02-12 ]

Update: I set innodb_log_optimize_ddl = OFF on my cluster nodes and the "Record field len" messages are still appearing in the logs. Note I made the change without restarting mysqld since innodb_log_optimize_ddl is a dynamic variable.

Comment by Marko Mäkelä [ 2021-02-12 ]

I suspect that this could share a root cause with MDEV-24578. I have not reviewed how the Galera snapshot transfer scripts work, and I do not remember encountering this class of problems with non-Galera use of mariabackup.

Comment by Marko Mäkelä [ 2021-02-12 ]

mreibert, data that has already been permanently corrupted cannot be magically uncorrupted by changing a configuration parameter that should prevent future corruption. I hope that you are following a systematic approach in order to track down this problem: always start with a backup of a known-good cluster. (CHECK TABLE must pass on all tables, to begin with.) Then try to reproduce the problem by replaying some load on the cluster and performing the snapshot transfer. otto was able to narrow down his problem in MDEV-24578 by following such a systematic approach.

Comment by Marko Mäkelä [ 2021-02-12 ]

One more thought: In the upcoming 10.5.9 release, MDEV-24705 introduces a recovery-time check that could catch the problem at an earlier phase if the problem is related to crash recovery or restoring a backup. If this corruption is indeed related to those, then MDEV-24449 and MDEV-24709 might be the fix. (To avoid those bugs, you can set innodb_change_buffering=none for the entire lifetime of the data files.)
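The suggested workaround, expressed as a configuration fragment (a sketch; as noted above, the option must be in effect for the entire lifetime of the data files to avoid those bugs):

```ini
[mysqld]
# Avoid the MDEV-24449 / MDEV-24709 change-buffer bugs by never using it.
innodb_change_buffering = none
```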

Comment by Mark Reibert [ 2021-02-12 ]

marko — Thank you for the feedback. As it turns out, I am dealing with a rather large database (6TB), so it is not easy to debug as systematically as I would like. An SST alone takes 4–6 hours, and CHECK TABLE over all the tables is a very long operation, so it's not entirely practical to just "try things". (This is a production system for which I cannot afford any downtime.)

That being written, I could take one node out of the Galera cluster and ensure its DB is "clean", then re-introduce it to the cluster and let it IST. After that I could SST the other two nodes (now that I have innodb_log_optimize_ddl = OFF).

In any case, I appreciate your thoughts and will continue to "see what I can do" given the size of the DB and restrictions of the environment.

Comment by Roope Pääkkönen (Inactive) [ 2021-02-26 ]

We also tried innodb_log_optimize_ddl=OFF on MariaDB 10.3.25, but it didn't seem to improve things.
However, we had one set of testing instances where these messages have not appeared since we updated to 10.3.27.

And now on other servers that were updated from 10.3.25 to 10.3.28, where these messages previously appeared almost daily, we haven't seen them anymore.
CHECK TABLE has always shown OK for all tables on all nodes, etc.

Comment by Marko Mäkelä [ 2021-02-26 ]

If this problem occurs due to MDEV-24449 or MDEV-24709, then an upgrade to 10.2.37, 10.3.28, 10.4.18, or 10.5.9 should prevent it from occurring again. But old data will not uncorrupt or repair itself. The corruption due to MDEV-24449 is not limited to user tables; the system tablespace can be corrupted as well. I suspect that MDEV-24449 was the root cause of the corruption whose effects we worked around in MDEV-20934.

I agree that with large amounts of data, a systematic approach rather than just "trying things" has to be followed. Hopefully I am right about MDEV-24449, but we will not know that until one of you confirms, after a few months, that the problem is no longer occurring.

Comment by Roope Pääkkönen (Inactive) [ 2021-02-26 ]

What would your recommended course of action be for existing databases, if the issue is fixed with 10.3.28?

Comment by Marko Mäkelä [ 2021-02-26 ]

I think that this type of corruption could be fixed by logical dump and restore. I do not know Galera well, but I have understood that it supports a logical snapshot transfer. That should allow one corruption-free cluster node to be created. If this issue was only caused by MDEV-24449, then the corruption-free node could be safely cloned using physical snapshot transfer.

You could also take a shortcut and assume that the InnoDB system tablespace is not corrupted. In that case, it would suffice to mysqldump, drop, and restore the corrupted table.
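The per-table shortcut described above might look roughly like this; db_name and bad_table are placeholders, and a full backup should be taken first:

```shell
# Dump only the suspect table (mysqldump emits DROP TABLE IF EXISTS
# before CREATE TABLE by default, so the restore re-creates the table).
mysqldump db_name bad_table > bad_table.sql

# Restore from the logical dump, rebuilding the InnoDB pages from scratch.
mysql db_name < bad_table.sql
```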

Comment by Mark Reibert [ 2021-02-26 ]

marko — thank you for the additional information and thoughts. For my large (6TB) Galera cluster I can wipe a node and let it recreate the DB from scratch via a SST, but it seems I should first upgrade to 10.4.18 (I am currently on 10.4.14) on account of MDEV-24449.

Comment by Jan Lindström (Inactive) [ 2021-02-27 ]

Yes, you can recreate the DB from scratch via an SST, but do not use the rsync or mariabackup methods; you need to use the mysqldump method.

Comment by Mark Reibert [ 2021-03-01 ]

Hello jplindst — Unfortunately, with a 6TB database, mariabackup is my only practical option, given that an SST takes 6 hours and the other methods block the donor during that time.

Comment by Jan Lindström [ 2023-04-11 ]

10.3 will reach EOL soon.

Comment by Marko Mäkelä [ 2023-04-11 ]

I think that this corruption should affect MariaDB 10.4 as well. The root cause could be a hard-to-reproduce InnoDB crash recovery failure. In 10.5, the recovery was rewritten as part of MDEV-12353 and many related changes.

Generated at Thu Feb 08 09:28:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.