MDEV-16329: Engine-independent online ALTER TABLE

Details

    Description

      Implement online ALTER TABLE above the storage engine layer by mimicking what InnoDB has done since MariaDB 10.0.

      Intro

      ALTER TABLE can perform many different table metadata alterations, individually or batched (many alterations at once). It supports different algorithms for applying those alterations and different lock levels that restrict access to the table while it is being altered. Which algorithm and lock level are used depends on the storage engine, the requested alterations, and the explicitly specified algorithm and lock level, if any. If neither is specified explicitly, the server is supposed to select the best algorithm/lock combination automatically.

      Certain storage engines (like InnoDB) can perform certain alterations (like adding a column) internally (using the InnoDB-specific ALGORITHM=INSTANT) and without locking the table (LOCK=NONE). However, the most universal ALTER TABLE algorithm, the one that supports arbitrary alterations in arbitrary combinations, is COPY, and it locks the table, allowing only read access for the whole duration of the ALTER TABLE. When the server has to resort to the COPY algorithm (because no other algorithm can perform the requested set of alterations), the application is often essentially down for a long period, because the table cannot be written to.

      The goal of this task is to allow the COPY algorithm to work without restricting the table to read-only access. In other words, this should make the combination ALGORITHM=COPY, LOCK=NONE possible.

      Implementation

      The COPY algorithm for ALTER ONLINE TABLE is supposed to do the following:

      1. Exclusively acquire the table Metadata Lock (MDL).
      2. Acquire the table lock for read (TL_READ).
      3. Read the first record. If the table is empty, the online phase is skipped (go to step 11).
      4. Set up row-based replication (a separate, per-table instance) for tracking changes from concurrent DMLs ("online changes").
      5. Downgrade the MDL lock.
      6. Copy the table contents (using a non-locking read if supported by the storage engine).
      7. Apply the online changes from the replicated contents.
      8. Release the table lock.
      9. Exclusively lock the table MDL (upgrade to MDL_EXCLUSIVE).
      10. Apply any remaining online changes.
      11. Swap the old and new table, unlock, drop the old table.
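
      The steps above can be illustrated with a minimal Python sketch. It is a toy model, not the server implementation: tables are plain dicts keyed by primary key, the change log is a list of (operation, key, row) tuples, and the names online_copy, apply_log and run_dml are hypothetical. Locking is only indicated in comments.

```python
def run_dml(table, log, events):
    """Simulate concurrent writers: change the old table and log each row event."""
    for op, key, row in events:
        if op == "delete":
            table.pop(key, None)
        else:  # "insert" and "update" carry the full new row image
            table[key] = row
        log.append((op, key, row))


def apply_log(new_table, log, convert):
    """Replay the logged online changes onto the new table, then empty the log."""
    for op, key, row in log:
        if op == "delete":
            new_table.pop(key, None)
        else:
            new_table[key] = convert(row)
    log.clear()


def online_copy(old_table, convert, dml_during_copy=(), dml_before_final_lock=()):
    """Rebuild old_table with convert() applied to every row, tolerating writers."""
    if not old_table:
        # Step 3: empty table, the online phase is skipped (no writers admitted).
        return {}
    log = []                                    # step 4: per-table change log
    snapshot = dict(old_table)                  # step 6: stands in for a consistent read
    run_dml(old_table, log, dml_during_copy)    # writers proceed while the copy runs
    new_table = {key: convert(row) for key, row in snapshot.items()}
    apply_log(new_table, log, convert)          # step 7: apply online changes
    run_dml(old_table, log, dml_before_final_lock)  # writers arriving before step 9
    apply_log(new_table, log, convert)          # step 10: under the exclusive MDL
    return new_table                            # step 11: the caller swaps the tables


old = {1: ("a", "10"), 2: ("b", "20")}
to_int = lambda row: (row[0], int(row[1]))      # e.g. changing a column type
new = online_copy(
    old,
    to_int,
    dml_during_copy=[("insert", 3, ("c", "30")), ("update", 1, ("a", "11"))],
    dml_before_final_lock=[("delete", 2, None)],
)
print(new)  # {1: ('a', 11), 3: ('c', 30)}
```

      In this model, the rebuilt table ends up equal to the converted final state of the old table, which is the property the two log-application passes (steps 7 and 10) are meant to guarantee.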

      This would remove some limitations that currently exist with the InnoDB-only online table rebuild. Basically, anything that is supported by ALGORITHM=COPY should 'just work' (but see the Limitations section). The bulk copying could still happen in copy_data_between_tables(). A few examples:

      1. Arbitrary changes of column type will be possible, without duplicating any conversion logic.
      2. It will be possible to add virtual columns (materialized or not) together with adding indexes, while allowing concurrent writes (MDEV-13795, MDEV-14332).
      3. The ENGINE or the partitioning of a table can be changed, just like any other attribute.

      [Not implemented here] We should remove the online table rebuild code from InnoDB (row_log_table_apply() and friends), and just let InnoDB fall back to this. The only ALTER ONLINE TABLE that could better be implemented inside storage engines would be ADD INDEX. Then, ALGORITHM=INPLACE would no longer be misleading, because it would mean exactly the same as the ALGORITHM=NOCOPY that was introduced in MDEV-13134. Before this, we must implement MDEV-515 (bulk load into an empty InnoDB table) to avoid a performance regression.

      Behavior of different engines

      The per-engine behavior depends on what operations can happen concurrently while TL_READ is held.

      • InnoDB can do any DML (except TRUNCATE, I presume). It lazily opens the read view when the first record is read during the copy stage. This means that, in theory, a concurrent transaction could slip in between the moment the table is locked with TL_READ and the moment the first record is read. This is why we first read one record and only then set up the online change buffer.
      • MyISAM/Aria only allow inserts in parallel with reads. The offset of the table's last record is remembered for the table handle, so the copy stage reads only the records that were already there. Other DMLs are blocked until the table lock is released.
      • Online is disabled for temporary tables.
      • For other engines, it depends on whether it is possible to acquire a particular table lock in parallel with TL_READ.

      Limitations

      • The embedded server doesn't support LOCK=NONE until HAVE_REPLICATION is enabled there (or until some finer refactoring).
      • DROP SYSTEM VERSIONING is not currently supported, but support can be added on demand.
      • ALTER TABLE ... ORDER BY is not and cannot be supported.
      • Tables that are referenced by FOREIGN KEYs with CASCADE operations are not supported; see MDEV-29068.
      • ALTER IGNORE TABLE is not supported.
      • Adding AUTO_INCREMENT to an existing column is not supported when NO_AUTO_VALUE_ON_ZERO is not set and no UNIQUE NOT NULL key is left unchanged. A NULLable column can never be changed to AUTO_INCREMENT with online COPY.
      • Sequences are not supported.
      • ADD COLUMN ... AUTO_INCREMENT and ADD COLUMN ... DEFAULT(NEXTVAL(..)) are not supported.
      • MODIFY ... NOT NULL DEFAULT(NEXTVAL(..)) is not supported if the column was initially NULLable.
      • The S3 and CONNECT engines are not supported.

      [Old part] Challenges

      We should replicate the online rebuild on slaves in parallel, so that the master and slaves will be able to commit at roughly the same time. This would be something similar to MDEV-11675, which would still be needed for native online ADD INDEX, which would avoid copying the table.

      In InnoDB, there is some logic for logging the changes when the PRIMARY KEY columns are changed, or a PRIMARY KEY is being added. The 'row event log' online_log will additionally contain the PRIMARY KEY values in the new table, so that the records can easily be found. The online_log will contain INSERT, UPDATE, and DELETE events.

      We will need some interface from ROLLBACK inside the storage engine to the 'row event log', so that BEGIN; INSERT; ROLLBACK will also create a DELETE event. Similarly, we will need an interface that allows CASCADE or SET NULL operations from FOREIGN KEY constraints to be relayed to the 'row event log'.
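
      A minimal sketch of the ROLLBACK requirement, assuming a toy transaction object that logs row events as soon as they are applied. The Transaction class and its methods are hypothetical and do not correspond to any storage engine interface; relaying CASCADE or SET NULL changes is not modeled.

```python
class Transaction:
    """Toy transaction: every applied change, including undo, is logged as a row event."""

    def __init__(self, table, event_log):
        self.table = table          # dict: primary key -> row
        self.event_log = event_log  # list of (kind, key, row) tuples
        self.undo = []              # (key, row image before the change), newest last

    def _apply(self, key, row):
        """Apply one change (row=None means delete) and append the matching event."""
        before = self.table.get(key)
        if row is None:
            self.table.pop(key, None)
            kind = "DELETE"
        else:
            self.table[key] = row
            kind = "INSERT" if before is None else "UPDATE"
        self.event_log.append((kind, key, row))
        return before

    def write(self, key, row):
        """Insert, update, or (with row=None) delete a row, remembering how to undo it."""
        self.undo.append((key, self._apply(key, row)))

    def rollback(self):
        """Undo changes newest-first; each undo step is itself logged as a row event."""
        while self.undo:
            key, before = self.undo.pop()
            self._apply(key, before)


table, event_log = {}, []
trx = Transaction(table, event_log)
trx.write(1, "row")   # BEGIN; INSERT
trx.rollback()        # ROLLBACK produces a compensating DELETE event
print(event_log)      # [('INSERT', 1, 'row'), ('DELETE', 1, None)]
```

      Because the undo path goes through the same logging function as ordinary writes, a consumer that replays the log onto a table copy sees the rolled-back INSERT cancelled by the DELETE.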

      Starting with MariaDB 10.2, there is an optimization that avoids unnecessarily sorting the data by PRIMARY KEY when the sorting does not change. Search for skip_pk_sort. It would be nice if the future MDEV-515 code inside InnoDB could be informed of this, so that it can assume that the data is already sorted by PRIMARY KEY.

      If FOREIGN KEY constraints exist on the table being rebuilt, then this approach should work just as well as the current online table rebuild in InnoDB: the constraints would be enforced on the old copy of the table until the very end, when we switch the tables, and on the new copy of the table from that point on.

      Initially, we could disable ONLINE...ADD FOREIGN KEY. That could be easier to implement after moving the FOREIGN KEY processing from InnoDB to the SQL layer.

          Activity

            bjquinn BJ Quinn added a comment -

            Thanks! Do you think that's what's causing the alternating active/inactive cycles, or is it something that's affecting the overall efficiency of the ALTER?

            marko Marko Mäkelä added a comment -

            MDEV-33094 was filed for further optimizing the online log application. It is currently writing undo log records inside InnoDB, for no good reason.

            bjquinn, I do not have any idea what could be causing the active/inactive cycles. Would it be possible to collect stack traces of all threads (attach a debugger to the running process) while the system is inactive? (Or just something like http://poormansprofiler.org once per second?) Also, a system profiler like perf or offcputime could be helpful, but the latter is tricky because you’d typically need all code to be compiled with -fno-omit-frame-pointer in order to get meaningful stack traces (because the stack unwinder in the Linux kernel requires frame pointers; see 1234). Back in September, I successfully used offcputime in MDEV-32050 to identify one bottleneck that I was completely unaware of.

            bjquinn BJ Quinn added a comment - - edited

            Thanks Marko. I should be able to set it up to capture the stack traces, this system is not yet in production so I should be able to do whatever is necessary. I'll try to get that to you soon.

            Stéphane also had a good suggestion to test mysql -e 'select * from bigtable' > /dev/null and see if I get a similar active/inactive cycle. I did not, but it does settle on about 65% CPU usage over time after starting at 100% CPU usage. Disks are 10% to 25% busy, so I don't think that's the bottleneck here.

            EDIT: I'm going to be out of town the next couple of weeks so some of this data might be delayed.

            bjquinn BJ Quinn added a comment -

            Marko, attached is the result of poormansprofiler.org, run once a second while the CPU was idle. Please let me know if this is helpful or if you need me to collect this data any differently.

            One thing I noticed that Stephane pointed out was that it got better, at least early on, if innodb_log_file_size was set larger, but as the table I'm testing with is much larger than I can reasonably set innodb_log_file_size to, the issue recurs after a while anyway. But it may be related.

            Thanks!! output.txt

            bjquinn BJ Quinn added a comment -

            Marko, please disregard. This may have ended up being a hardware problem that was affecting multiple identical servers. A firmware bug in our SSDs. In case anyone is interested, here was the problem and apparent solution – https://forum-proxmox-com.translate.goog/threads/nvme-qid-timeout.51579/?_x_tr_sl=de&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=sc

            People

              nikitamalyavin Nikita Malyavin
              marko Marko Mäkelä