Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      What does it mean, ACID? Rollback-ability? MVCC?

      I suppose this intuitively means that a DDL could be part of a larger transaction and could be rolled back with it.

    Activity

            monty Michael Widenius added a comment (edited)

            The following link shows which databases support transactional DDL and how:
            https://wiki.postgresql.org/wiki/Transactional_DDL_in_PostgreSQL:_A_Competitive_Analysis

            Ideally we would like to have transactional DDL for

            • CREATE TABLE
            • ALTER TABLE
            • DROP TABLE

            CREATE TABLE:

            • Reasonably easy to do for all storage engines.

            ALTER TABLE:

            • Possible to do for any transactional engine that supports online ALTER TABLE.
            • Changes that don't affect the storage format can be done for all storage engines.

            DROP TABLE:

            • Probably the hardest to do.
              We may want to leave DROP TABLE as the last step.

            That said, I think that we should initially implement transactional DDL only for transactional engines (XtraDB, InnoDB and TokuDB).

            Internal change:

            • Have a version number in the table cache and table share.
            • Have a table cache for each THD (to be able to store table versions that are visible to only one user).
            • Make the necessary changes in InnoDB to be able to have many versions (any number) of the same table.
            • Delay replacing the newly created table in ALTER TABLE until commit is done.
              (This can be very tricky if you do several ALTER TABLEs on the same table while it is in use by another transaction; we could consider making the original table read-only while the transaction is running to simplify this case.)
            • We need an sql_mode that does an autocommit before and after CREATE TABLE, ALTER TABLE and DROP TABLE, to ensure that applications not running in autocommit mode will work as before.
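The per-THD table cache idea above could be sketched roughly as follows. This is only an illustration of the lookup rule (a connection's uncommitted DDL shadows the globally visible table version); all type and member names here are hypothetical, not actual server code:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical: each table share carries a version number that is
// bumped by every committed DDL operation.
struct TableShare {
    std::string name;
    uint64_t version;
};

// Globally visible (committed) table definitions.
struct GlobalTableCache {
    std::map<std::string, TableShare> shares;
};

// Per-connection cache holding versions created by DDL that this
// transaction has not yet committed.
struct ThdTableCache {
    std::map<std::string, TableShare> private_shares;

    // Resolve a table name: the private (uncommitted) version, if any,
    // shadows the global one.
    const TableShare* find(const GlobalTableCache& global,
                           const std::string& name) const {
        auto it = private_shares.find(name);
        if (it != private_shares.end())
            return &it->second;
        auto git = global.shares.find(name);
        return git == global.shares.end() ? nullptr : &git->second;
    }
};
```

On commit, the private versions would be published into the global cache; on rollback, they would simply be discarded, which is what makes the `ROLLBACK` example below work.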

            Example of things that one should be able to do:

            BEGIN;
            CREATE TABLE t1 (a int);
            ALTER TABLE t1 ADD COLUMN b int;
            ALTER TABLE t1 ADD COLUMN c int;
            ROLLBACK;

            This should roll back all changes, including the creation of table t1.


            marko Marko Mäkelä added a comment

            monty, I would like to have some clarification for this:

            Do necessary changes in InnoDB to be able to have many versions (any number) of the same table

            If I understood correctly, your idea is that every InnoDB table (or partition) would store the TABLE_SHARE version number.

            For your scheme to work, changes to this number would have to be written to an undo log. For that to work, we would have to introduce a new undo log record. It would also be challenging to find the storage for the number. You might suggest that we just allocate a few bytes in page 1 of each .ibd file, but what about tables that exist in the system tablespace?

            In order to support tables in the system tablespace with this scheme, we would have to store the version number somehow inline in the clustered index B-tree. The only reasonably available place for that would be the 'default row' record that was introduced in MDEV-11369 for instant ALTER and whose format will be extended further by MDEV-11424 and related tasks. Such a change to the format would prevent export of .ibd files to older MariaDB versions. Do we really want that?

            Furthermore, what about partitioned tables? If a DDL operation only affects one partition, would we have to update the version number for every partition in InnoDB?

            What about DDL operations that InnoDB used to ignore, such as changing the DEFAULT value of a column? Will this also require some metadata to be updated inside InnoDB?

            I believe that a less intrusive change would be to introduce logging that implements crash-safety for DDL operations. This logging should be transactional, and we would need some kind of two-phase commit between the logging and the participating resource managers (storage engine, .frm file, DDL logging). Maybe the DDL log could be implemented inside the Aria storage engine? But then we would need a distributed transaction between (say) InnoDB and Aria.

            When it comes to InnoDB, I believe that the biggest challenge is ALTER TABLE…ALGORITHM=INPLACE. We should probably refactor the transaction commit out of ha_innobase::commit_inplace_alter_table() and instead call explicit commit from the SQL layer.
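The two-phase commit between the DDL log and the participating resource managers could be sketched as below. This is a minimal illustration of the protocol shape only, assuming a simplified participant interface; it is not MariaDB's handlerton API, and the names are invented:

```cpp
#include <cassert>
#include <vector>

// Hypothetical DDL participant: a storage engine, the .frm file,
// or the DDL log itself.
struct Participant {
    bool prepared = false;
    bool committed = false;
    // In reality prepare() can fail (e.g. I/O error, lock conflict).
    bool prepare() { prepared = true; return true; }
    void commit() { assert(prepared); committed = true; }
    void rollback() { prepared = false; }
};

// Phase 1: every participant must prepare; if any fails, roll all back.
// Phase 2: commit everywhere. The commit decision itself would have to
// be made durable (e.g. in the DDL log) between the two phases.
bool two_phase_commit(std::vector<Participant*>& parts) {
    for (Participant* p : parts) {
        if (!p->prepare()) {
            for (Participant* q : parts) q->rollback();
            return false;
        }
    }
    for (Participant* p : parts) p->commit();
    return true;
}
```

Refactoring the transaction commit out of `ha_innobase::commit_inplace_alter_table()` would correspond to splitting that function into the prepare and commit phases above, with the commit decision driven from the SQL layer.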


            marko Marko Mäkelä added a comment

            I think that there is a practical way to introduce a TABLE_SHARE version number to InnoDB.

            For tables in the InnoDB system tablespace, I believe that the only options for storing the version number would be the root page of the clustered index, or the metadata record BLOB introduced in MDEV-15562. If we chose the metadata BLOB, then that would break IMPORT TABLESPACE to MariaDB Server 10.3 or earlier. Better yet, we could repurpose the bytes reserved for the strings "infimum" and "supremum" in the clustered index root page. With this change, we should still be able to import tables into older MariaDB server versions. One of these bytes was already repurposed in MDEV-15562. We can use this location for storing the table definition version also within .ibd files.

            Yes, we would have to introduce new InnoDB undo log record types for updating the table definition version. This would prevent a downgrade to earlier MariaDB server versions. We could add code to MariaDB Server 10.3 to ignore such undo log records. Downgrading from 10.3 to earlier versions is already impossible due to MDEV-12288.

            marko Marko Mäkelä added a comment (edited)

            If the table definition version identifier is longer than 8 bytes, then repurposing some bytes in the clustered index root page could be too tight a fit.

            For tables that are located in the system tablespace (which would be phased out in MDEV-11633), we can store the data definition version number in SYS_TABLES.CLUSTER_NAME inside the InnoDB data dictionary, which also resides in the system tablespace. This is possible, because SYS_TABLES.CLUSTER_NAME was always initialized as NULL, starting from the very first public InnoDB revision.

            .ibd files and the innodb_file_per_table parameter were introduced in MySQL 4.1.12. Back then, InnoDB left some unused bytes uninitialized. Therefore, we must be careful when repurposing previously unused data bytes.

            Unfortunately, in the change buffer bitmap page, the unused page payload was left uninitialized. We cannot repurpose any of the remaining bytes on the page, unless we can somehow determine that the file was created with a newer version of InnoDB that zero-initializes the unused bytes.

            Similarly, in the inode page (page number 2), most of the page contents were left uninitialized, even within the payload area.

            Similarly, in the file system header page (page number 0), all unused fields except the one that was later repurposed as FSP_SPACE_FLAGS were left uninitialized.

            The page initialization was finally added in MySQL 5.1.48 for all data files. The original author of InnoDB opposed the initialization for many years, because he feared that it might cost some performance.

            Pages in ROW_FORMAT=COMPRESSED were always zero-initialized. Otherwise, previously unused bytes are guaranteed to be zero-initialized only if flags for features introduced after MySQL 5.1.48 are present. Such features include a non-default innodb_page_size and the page_compressed format.

            Because we cannot identify the version that created the most commonly used data files (innodb_page_size=16k, uncompressed), we must choose a different mechanism for safely repurposing previously unused bytes in .ibd files.

            Unfortunately, before MariaDB 10.4, nothing can be inferred from the innodb_checksum_algorithm of a page, because pages containing uninitialized garbage are being modified, and when those modifications are written, any currently available checksum algorithm can be used. With MDEV-12026 in MariaDB 10.4, data files created with innodb_checksum_algorithm=full_crc32 are guaranteed to zero-initialize any unused bytes.

            In page 1 (the change buffer bitmap page), the size of the unused area is 466 to 32722 bytes, depending on the physical page size. Let us add an additional page trailer as follows:

            (padding with zeroes)
            4 bytes: server ID
            16 bytes: table dictionary version
            32 bytes: the file creator, NUL-padded, for example "MariaDB 10.4.0"
            4 bytes: CRC-32C checksum of the entire page payload area, excluding the page trailer

            This would introduce an additional checksum within the page, in the hope that the uninitialized garbage in old .ibd files would fail to satisfy it.
            In the same mini-transaction that adds the trailer to page 1, we would also fully zero-initialize page 2, so that the presence of the trailer on page 1 can be used to determine that previously unused bytes on page 2 are not garbage.

            The additional checksum is not necessary for files that use innodb_checksum_algorithm=full_crc32 (MDEV-12026).


            marko Marko Mäkelä added a comment

            Given that there is plenty of unused space in page 1 and that the dictionary version number would have to be updated on every DDL operation, we should also include the secondary index root page numbers in page 1.
            The maximum number of indexes is a compile-time limit, which defaults to 64. InnoDB page numbers are 32 bits, or 4 bytes. The resulting 256 bytes would fit easily even with a 1-kilobyte physical page size (ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=1).
            Only if we increased the maximum number of indexes to (say) 128 would we be unable to fit it all on the smallest ROW_FORMAT=COMPRESSED page size.
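The space estimate above can be checked at compile time. The constants here come from the comment itself (64 indexes, 4-byte page numbers, 1 KiB smallest page), not from the InnoDB source:

```cpp
#include <cstddef>

constexpr size_t kMaxIndexes = 64;      // compile-time default
constexpr size_t kPageNoSize = 4;       // 32-bit InnoDB page numbers
constexpr size_t kRootPageTable = kMaxIndexes * kPageNoSize;
constexpr size_t kSmallestPage = 1024;  // KEY_BLOCK_SIZE=1

static_assert(kRootPageTable == 256, "64 indexes x 4 bytes = 256 bytes");
static_assert(kRootPageTable < kSmallestPage,
              "the root page table fits on a 1 KiB compressed page");
```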

            A smooth upgrade path for removing the InnoDB data dictionary tables (MDEV-11655) and system tablespace (MDEV-11633) could be as follows:

            1. Tables in the InnoDB system tablespace will continue to have their metadata in the system tables only.
            2. We will start issuing deprecation warnings for setting innodb_file_per_table=OFF.
            3. For tables or partitions in .ibd files, we will write the dictionary version number and the secondary index page numbers in page 1.
            4. In some future release, remove the innodb_file_per_table setting, and stop writing anything to the InnoDB system tables. Primarily rely on .frm files for the dictionary. Only for old tables, consult SYS_INDEXES.PAGE for secondary index root page numbers.
            5. Implement an upgrade check that flags anything that is stored in the system tablespace or is missing the dictionary version number.
            6. Some time in the future, remove the code to deal with the system tablespace.

            marko Marko Mäkelä added a comment

            When introducing the new InnoDB subformat that makes .ibd files independent of InnoDB dictionary tables, we should try to implement a conversion that guarantees that all garbage in all pages has been initialized, and that all pages use a simple CRC-32C checksum over all bytes of the page, no matter what innodb_checksum_algorithm says. There is no reason to support multiple checksum algorithms for new data files, and upgrade should be performed in such a way that we can at some point be sure that all pages are following the new format.


            marko Marko Mäkelä added a comment

            In MDEV-25180 (Atomic ALTER TABLE) a suggestion to introduce a .frm file version in InnoDB data files was not implemented, because it would have required a file format change. Instead, we identify whether an ALTER TABLE operation was committed based on the DB_TRX_ID column contents in some InnoDB data dictionary tables.

            For this task (transactional DDL) we would need substantial changes to InnoDB. I have posted some raw ideas in MDEV-11655. The bare minimum is:

            • Explicitly identify secondary index root pages in .ibd files (also needed for MDEV-11658). This would require changing not only the data file format but also the undo log format.
            • Remove or reimplement the FULLTEXT INDEX functionality of InnoDB. It not only depends on the internal InnoDB SQL parser but also on the existence of hidden InnoDB tables that store a partitioned inverted index.
            • Remove the InnoDB data dictionary tables, and assume that all DDL operations with respect to the .frm files are crash safe. DDL operations would no longer use separate internal InnoDB transactions. (This requires a major rewrite of all DDL operations in InnoDB, including redesigning the MDEV-25180 recovery logic.)
            • Remove the InnoDB internal SQL parser.
            maxmether Max Mether added a comment (edited)

            I think that MDEV-11655 will definitely be a higher priority than this task. What is left to be done after MDEV-11655 will then have to be re-evaluated.


            People

              serg Sergei Golubchik
              serg Sergei Golubchik
              Votes: 30
              Watchers: 32
