Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-17638

Improve error message about corruption of encrypted page

Details

    Description

      See the following error message:

      2018-11-05 3:53:42 139986889774848 [ERROR] InnoDB: Page 227027:2591 may be corrupted. Post encryption checksum 3926119398 stored [3479160576:3479160576] key_version 1
      2018-11-05 03:53:42 0x7f513cd69700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/storage/innobase/rem/rem0rec.cc line 580
      

      This is printed here:

      https://github.com/MariaDB/server/blob/5567a8c936ede50efcdf4a7f898dbaa058106d70/storage/innobase/fil/fil0crypt.cc#L2639

      A lot of people seem to think that 227027 in the above message would refer to the page number, but it actually refers to the tablespace ID. I think we should improve this message. Maybe something like?:

      2018-11-05 3:53:42 139986889774848 [ERROR] InnoDB: Page in tablespace 227027 with offset 2591 may be corrupted. Post encryption checksum 3926119398 stored [3479160576:3479160576] key_version 1
      2018-11-05 03:53:42 0x7f513cd69700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/storage/innobase/rem/rem0rec.cc line 580
      

      Attachments

        Issue Links

          Activity

            MDEV-13542 would ultimately propagate the error to the SQL layer. Even before that, we definitely could improve the messages that are output to the error log.

            marko Marko Mäkelä added a comment - MDEV-13542 would ultimately propagate the error to the SQL layer. Even before that, we definitely could improve the messages that are output to the error log.

            I think that it would be better in these messages to display the file name. I would still keep the tablespace ID in the output.
            Starting with 10.2, many such messages are emitted via

            /** Print the given page_id_t object.
            @param[in,out]	out	the output stream
            @param[in]	page_id	the page_id_t object to be printed
            @return the output stream */
            std::ostream&
            operator<<(
            	std::ostream&		out,
            	const page_id_t		page_id)
            {
            	out << "[page id: space=" << page_id.m_space
            		<< ", page number=" << page_id.m_page_no << "]";
            	return(out);
            }
            

            which produces more verbose output.

            marko Marko Mäkelä added a comment - I think that it would be better in these messages to display the file name. I would still keep the tablespace ID in the output. Starting with 10.2, many such messages are emitted via /** Print the given page_id_t object. @param[in,out] out the output stream @param[in] page_id the page_id_t object to be printed @return the output stream */ std::ostream& operator<<( std::ostream& out, const page_id_t page_id) { out << "[page id: space=" << page_id.m_space << ", page number=" << page_id.m_page_no << "]" ; return (out); } which produces more verbose output.

            The message was originally introduced in MDEV-11759 (MariaDB 10.1.22, 10.2.5) and slightly changed in MDEV-11939 (MariaDB 10.1.26, 10.2.8). (The original wording was "Page %lu in space %s (%lu) maybe corrupted."

            Ultimately, the message and the code path leading to it was removed in MDEV-12112 (MariaDB 10.1.38, 10.2.20, 10.3.12).

            The presence of this message in an error log spells big trouble. The logic that was introduced in MDEV-11759 was based on the wrong assumption that a data page whose contents appears to carry a valid ‘before encryption’ checksum cannot have been encrypted. Then, MariaDB would wrongly skip the page decryption and hand over the garbled page contents to the caller of buf_page_get_gen(), which would then crash basically anywhere. If I remember correctly my debugging session from December 2018, in MariaDB Server 10.1, buf_page_get_gen() would return NULL in this case. Some of its callers are prepared for that error situation.

            To make matters worse, there are several checksum algorithm implementations in InnoDB. It sufficed for one of the various algorithms to produce a matching checksum. MDEV-17957 and MDEV-17958 improved innodb_checksum_algorithm=strict_crc32 so that only one checksum will be attempted.

            In the case of one customer, an encrypted page had the exact same CRC-32C checksum both for the before-encryption and after-encryption checksums.

            MDEV-18025 extended the test mariabackup.encrypted_page_corruption to cover a scenario where an encrypted page appears to contain a valid checksum on the encrypted data. The test ensures that InnoDB will attempt page decryption of an intentionally corrupted encrypted page, and only then detect that the decrypted contents of the page does not match the before-encryption checksum.

            marko Marko Mäkelä added a comment - The message was originally introduced in MDEV-11759 (MariaDB 10.1.22, 10.2.5) and slightly changed in MDEV-11939 (MariaDB 10.1.26, 10.2.8). (The original wording was "Page %lu in space %s (%lu) maybe corrupted." Ultimately, the message and the code path leading to it was removed in MDEV-12112 (MariaDB 10.1.38, 10.2.20, 10.3.12). The presence of this message in an error log spells big trouble. The logic that was introduced in MDEV-11759 was based on the wrong assumption that a data page whose contents appears to carry a valid ‘before encryption’ checksum cannot have been encrypted. Then, MariaDB would wrongly skip the page decryption and hand over the garbled page contents to the caller of buf_page_get_gen() , which would then crash basically anywhere. If I remember correctly my debugging session from December 2018, in MariaDB Server 10.1, buf_page_get_gen() would return NULL in this case. Some of its callers are prepared for that error situation. To make matters worse, there are several checksum algorithm implementations in InnoDB. It sufficed for one of the various algorithms to produce a matching checksum. MDEV-17957 and MDEV-17958 improved innodb_checksum_algorithm=strict_crc32 so that only one checksum will be attempted. In the case of one customer, an encrypted page had the exact same CRC-32C checksum both for the before-encryption and after-encryption checksums. MDEV-18025 extended the test mariabackup.encrypted_page_corruption to cover a scenario where an encrypted page appears to contain a valid checksum on the encrypted data. The test ensures that InnoDB will attempt page decryption of an intentionally corrupted encrypted page, and only then detect that the decrypted contents of the page does not match the before-encryption checksum.

            Ah, OK. It's good to know that the code path was removed.

            Thanks for the analysis!

            It looks like innochecksum still prints a similar message about corruption in the "Page $space_id:$page_num" format that has confused some people if you'd want to change that one too:

            https://github.com/MariaDB/server/blob/7d245083a43e34d94822e580037727bdbb50b6f0/extra/innochecksum.cc#L527

            GeoffMontee Geoff Montee (Inactive) added a comment - Ah, OK. It's good to know that the code path was removed. Thanks for the analysis! It looks like innochecksum still prints a similar message about corruption in the "Page $space_id:$page_num" format that has confused some people if you'd want to change that one too: https://github.com/MariaDB/server/blob/7d245083a43e34d94822e580037727bdbb50b6f0/extra/innochecksum.cc#L527

            Thanks for the clarification. I am lowering the priority from Critical. I would keep the terse format in messages that are only present in instrumented debug builds, but I agree that a more verbose format could be easier for users.

            marko Marko Mäkelä added a comment - Thanks for the clarification. I am lowering the priority from Critical. I would keep the terse format in messages that are only present in instrumented debug builds, but I agree that a more verbose format could be easier for users.
            kevg Eugene Kosov (Inactive) added a comment - Diagnostics from topic message was removed in https://github.com/mariadb/server/commit/8c43f963882a9d5ac4e4289c8dd3dbcaeb40a0ce

            People

              kevg Eugene Kosov (Inactive)
              GeoffMontee Geoff Montee (Inactive)
              Votes:
              2 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.