When data-at-rest encryption is in use, InnoDB currently writes two checksums to the page: one for the contents of the unencrypted page and one for the contents of the encrypted page. There have been cases where these two checksums for a single page had the same value. In those cases, InnoDB would wrongly assume that the page was not encrypted, so it would not decrypt the page. If the page actually was encrypted, then the server would likely crash and the error log would contain error messages like the following:
Here is an analysis from Marko Mäkelä:
The error message was originally introduced in MDEV-11759 (MariaDB 10.1.22, 10.2.5) and slightly changed in MDEV-11939 (MariaDB 10.1.26, 10.2.8). (The original wording was "Page %lu in space %s (%lu) maybe corrupted.")
The presence of this error message in an error log spells big trouble. The logic that was introduced in MDEV-11759 was based on the wrong assumption that a data page whose contents appear to carry a valid ‘before encryption’ checksum cannot have been encrypted. Then, MariaDB would wrongly skip the page decryption and hand over the garbled page contents to the caller of buf_page_get_gen(), which would then crash basically anywhere. My debugging session in MariaDB Server 10.1 showed that buf_page_get_gen() would return NULL in this case. Some of its callers are prepared for that error situation.
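The wrong assumption can be illustrated with a small toy model. This is a sketch only, not InnoDB code: the `Page` class, the `xor_crypt()` stand-in cipher, and the use of `zlib.crc32` as the checksum are all assumptions made for illustration. It shows that once the stored bytes happen to validate against the "before encryption" checksum field, the reader skips decryption and returns ciphertext.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Page:
    stored: bytes        # bytes as they sit on disk (possibly encrypted)
    pre_checksum: int    # field meant to cover the unencrypted contents
    post_checksum: int   # field meant to cover the encrypted contents


def xor_crypt(data: bytes, key: int) -> bytes:
    """Stand-in cipher (hypothetical): XOR each byte; applying it twice undoes it."""
    return bytes(b ^ key for b in data)


def read_page_flawed(page: Page, key: int) -> bytes:
    """Flawed rule: a valid 'before encryption' checksum means 'not encrypted'."""
    if zlib.crc32(page.stored) == page.pre_checksum:
        return page.stored              # decryption is skipped
    return xor_crypt(page.stored, key)


KEY = 0x5A
plaintext = b"root page of a table"
ciphertext = xor_crypt(plaintext, KEY)

# Normal case: each field covers what it is supposed to cover.
normal = Page(ciphertext, zlib.crc32(plaintext), zlib.crc32(ciphertext))

# Collision case: both fields carry the same value, and that value
# matches the bytes stored on disk.
same = zlib.crc32(ciphertext)
colliding = Page(ciphertext, same, same)

normal_result = read_page_flawed(normal, KEY)        # decrypted
colliding_result = read_page_flawed(colliding, KEY)  # still ciphertext
```

In the collision case the caller receives bytes that are still encrypted, with no error reported, which corresponds to the "garbled page contents" described above.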
To make matters worse, there are several checksum algorithm implementations in InnoDB. It sufficed for one of the various algorithms to produce a matching checksum.
MDEV-17957 and MDEV-17958 improved innodb_checksum_algorithm=strict_crc32 so that only one checksum will be attempted.
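The difference between accepting any algorithm and accepting exactly one can be sketched as follows. This is a toy model under stated assumptions: `zlib.crc32` and `zlib.adler32` merely stand in for InnoDB's checksum variants, `NO_CHECKSUM_MAGIC` is a placeholder marker for "checksums disabled", and the function names are hypothetical.

```python
import zlib

# Placeholder marker meaning "no checksum was written" (assumption).
NO_CHECKSUM_MAGIC = 0xDEADBEEF

# Stand-ins for the several checksum algorithms (assumption).
ALGORITHMS = {
    "crc32": lambda data: zlib.crc32(data) & 0xFFFFFFFF,
    "adler32": lambda data: zlib.adler32(data) & 0xFFFFFFFF,
}


def matches_relaxed(data: bytes, stored: int) -> bool:
    """Accept the page if ANY known variant (or the marker) matches."""
    if stored == NO_CHECKSUM_MAGIC:
        return True
    return any(algo(data) == stored for algo in ALGORITHMS.values())


def matches_strict(data: bytes, stored: int, algorithm: str = "crc32") -> bool:
    """Accept the page only if the single configured variant matches."""
    return ALGORITHMS[algorithm](data) == stored


data = b"x" * 64
stored_adler = ALGORITHMS["adler32"](data)

relaxed_accepts_adler = matches_relaxed(data, stored_adler)
strict_accepts_adler = matches_strict(data, stored_adler)
relaxed_accepts_magic = matches_relaxed(data, NO_CHECKSUM_MAGIC)
strict_accepts_magic = matches_strict(data, NO_CHECKSUM_MAGIC)
```

Every additional accepted variant is another chance for arbitrary bytes to pass validation by coincidence, so trying only one algorithm narrows the window for a false match.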
The data-at-rest encryption in MariaDB was developed before I joined the company. I have identified and fixed several problems with it in the past. This issue revealed more problems, and I believe that there is not much more that we could fix without changing the design and the file format.
I believe that the biggest challenge is the design that allows encryption to be enabled or disabled on data files, one page at a time. Because of this, a data file can contain a mix of encrypted and unencrypted pages.
To make things even more challenging, InnoDB did not initialize unused data fields in buffer pool pages until MySQL 5.1.48, where I finally added the initialization in response to an external bug report. I had identified and pointed out this problem years earlier, but Heikki Tuuri (the author of InnoDB) thought that initializing memory could reduce performance.
In MariaDB data-at-rest encryption, the unused (and possibly uninitialized) fields were reused for encryption key version and post-encryption checksum. These fields would usually be zero in unencrypted data files, but they could sometimes be nonzero due to the missing initialization, if an unencrypted page was originally created before MySQL 5.1.48.
I think that MariaDB should use only one checksum variant in the future, and should not allow mixing unencrypted and encrypted pages in a data file. This will be complicated somewhat by the need to support upgrades from older versions. For each data file separately, we should keep track of which scheme is being used. New files would be created in the clean way. Perhaps at the same time, we could improve how key rotation works.
Today, I debugged what exactly is going on. My test creates a fake encrypted page (mostly initialized with NUL bytes, and containing a valid checksum). I wrote the same checksum value to both of the two "before encryption" fields and to the one "after encryption" field. This caused fil_space_verify_crypt_checksum() to identify the page as not encrypted:
I started with a recent 10.2 revision that is before any of the fixes:
The test is primarily checking that mariabackup will detect the corruption. Because I am running it against older revisions, we have to relax that part of the test. The interesting part is what happens when the MariaDB server is restarted and DROP TABLE is executed.
With commit 447e4931795a0ae9525005e8fb37bb7347d8ae52, there is no attempt to decrypt the page. fil_space_verify_crypt_checksum() and buf_page_decrypt_after_read() will return false. buf_page_io_complete() will notice this, set the flag table->file_unreadable, and return DB_DECRYPTION_FAILED. Finally, after buf_page_get_gen() has exhausted its 100 retries to read the page, it will return NULL to the caller. The caller in DROP TABLE is prepared for that return value (not all code paths are; that will eventually be fixed in MDEV-13542), and the DROP TABLE operation on the corrupted table succeeds. My test corrupted the root page of the table.
Then, I moved to test a later commit 7d245083a43e34d94822e580037727bdbb50b6f0, which I believe fixed this issue. It made fil_space_verify_crypt_checksum() only check the post-encryption checksum. The callers would ensure that the page carries the correct checksum after decryption. With this change, fil_space_verify_crypt_checksum() will return true (because the post-encryption checksum matches) without ever issuing any "may be corrupted" message, and buf_page_decrypt_after_read() will invoke fil_space_decrypt() and buf_page_check_corrupt(). In my test case, the decrypted data will be gibberish, and an error message will be output:
After this, DB_DECRYPTION_FAILED will be returned all the way up to buf_page_get_gen(), which will retry the reads, and finally return NULL to the caller.
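The control flow after the fix can be sketched as a toy model. Nothing here is InnoDB code: the XOR cipher, the `KEY` value, the use of `zlib.crc32`, and the function names `read_and_decrypt()` and `get_page()` are assumptions for illustration. A `None` result models DB_DECRYPTION_FAILED on a single read and, from `get_page()`, the NULL returned once the retries are used up. The sketch counts 100 read attempts in total, which is one possible reading of "100 retries".

```python
import zlib
from typing import Callable, Optional, Tuple

MAX_READ_RETRIES = 100  # retry count mentioned in the text
KEY = 0x5A              # hypothetical key for the stand-in XOR cipher


def read_and_decrypt(stored: bytes, pre: int, post: int) -> Optional[bytes]:
    """Return the decrypted page, or None to model DB_DECRYPTION_FAILED."""
    # Step 1: only the post-encryption checksum is checked before decrypting.
    if zlib.crc32(stored) != post:
        return None
    # Step 2: decrypt (stand-in cipher).
    plain = bytes(b ^ KEY for b in stored)
    # Step 3: the decrypted contents must carry a correct checksum.
    if zlib.crc32(plain) != pre:
        return None
    return plain


def get_page(read_once: Callable[[], Optional[bytes]]) -> Tuple[Optional[bytes], int]:
    """Retry the read; give up with None after MAX_READ_RETRIES attempts."""
    for attempt in range(1, MAX_READ_RETRIES + 1):
        page = read_once()
        if page is not None:
            return page, attempt
    return None, MAX_READ_RETRIES


# Fake page as in the test: NUL bytes, same value in both checksum fields.
fake = bytes(16)
same = zlib.crc32(fake)
fake_result = get_page(lambda: read_and_decrypt(fake, same, same))

# A consistent page: each field covers the contents it is meant to cover.
good_plain = b"valid page bytes"
good_stored = bytes(b ^ KEY for b in good_plain)
good_result = get_page(
    lambda: read_and_decrypt(good_stored, zlib.crc32(good_plain), zlib.crc32(good_stored))
)
```

The fake page passes step 1, decrypts to gibberish, fails step 3 on every attempt, and the caller finally receives `None` instead of garbled contents.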
Until my fix of MDEV-13103 (MariaDB 10.2.16 and later), the function buf_page_io_complete() ignored the return value of buf_page_decrypt_after_read(). Let us next see what would happen there:
In this version, buf_page_decrypt_after_read() invokes fil_space_verify_crypt_checksum(), which will wrongly issue the "may be corrupted" message and return false as noted above. Due to this, buf_page_decrypt_after_read() will return false without invoking fil_space_decrypt() to decrypt the page.
Because no decryption was performed, the page contents will still match the "before encryption" checksum, and DB_SUCCESS will be returned to the caller, even though we kind-of decided that the page is corrupted.
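The effect of ignoring the return value can be shown with another toy model. The function names below mirror the ones in the text only for readability; the bodies, the dictionary-based page, the XOR cipher, the logged string, and the `DB_PAGE_CORRUPTED` result are assumptions made for this sketch and are not the actual implementation.

```python
import zlib
from typing import List

KEY = 0x5A  # hypothetical key for the stand-in XOR cipher


def verify_crypt_checksum(stored: bytes, pre: int, post: int, log: List[str]) -> bool:
    """Old, flawed check: a 'before encryption' match is reported and rejected."""
    if zlib.crc32(stored) == pre:
        log.append("may be corrupted")   # placeholder for the real message
        return False
    return zlib.crc32(stored) == post


def decrypt_after_read(page: dict, log: List[str]) -> bool:
    """Decrypt in place; return False (without decrypting) if verification fails."""
    if not verify_crypt_checksum(page["data"], page["pre"], page["post"], log):
        return False
    page["data"] = bytes(b ^ KEY for b in page["data"])
    return True


def io_complete(page: dict, log: List[str], check_result: bool) -> str:
    ok = decrypt_after_read(page, log)
    if check_result and not ok:
        return "DB_DECRYPTION_FAILED"
    # With check_result=False the failure above is silently dropped.
    if zlib.crc32(page["data"]) != page["pre"]:
        return "DB_PAGE_CORRUPTED"
    return "DB_SUCCESS"


def make_fake_page() -> dict:
    """Encrypted bytes with the same checksum value in both fields."""
    stored = bytes(b ^ KEY for b in b"secret row data")
    same = zlib.crc32(stored)
    return {"data": stored, "pre": same, "post": same}


old_log: List[str] = []
old_page = make_fake_page()
old_status = io_complete(old_page, old_log, check_result=False)

new_log: List[str] = []
new_status = io_complete(make_fake_page(), new_log, check_result=True)
```

With the return value ignored, the still-encrypted bytes pass the later checksum comparison and the read is reported as successful even though a warning was logged; with the return value checked, the same page yields a decryption failure.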
Finally, I tested the version that you are running, and saw exactly the same behavior:
I think that with this, we have an exact explanation of what must have happened with the problematic page 5253952. The MariaDB 10.2.12 server never attempted to decrypt it, and it returned the gibberish page contents to the caller of buf_page_get_gen().
Some corruption could also be attributed to the sloppy checksum validation in Mariabackup, which may have caused it to copy inconsistent pages as they were being simultaneously written by the running MariaDB server.
For this particular page, I think that it is possible that the page is actually valid (after all, the new Mariabackup did not report any checksum mismatch after decrypting the page), and the problem was that the server wrongly skipped the decryption step, and returned the gibberish page to the caller without flagging any corruption. This should now work in MariaDB 10.2.20.