  1. MariaDB Server
  2. MDEV-11934

MariaRocks: Group Commit with binlog

    Details

    • Sprint:
      10.2.5-1

      Description

      MyRocks implements group commit with the binary log on top of the MySQL API:
      https://github.com/facebook/mysql-5.6/commit/14a0d4a97c09b52fa7450e6a3d56ebe7ed193ab6

      Inside MyRocks/RocksDB:

      • Setting m_rocksdb_tx->GetWriteOptions()->sync to false skips the disk sync for that write.
      • The WAL can be synced to disk with an rdb->SyncWAL() call.
      • RocksDB has its own group commit implementation, which "just works" and is not visible from outside the RocksDB API.
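
The interaction between the sync flag and SyncWAL() can be illustrated with a toy model. This is only a sketch: ToyWal and its members are hypothetical names invented for illustration and are not the RocksDB API. It assumes that an unsynced write may be lost on a crash, while anything written before a sync survives.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of a write-ahead log (hypothetical; NOT the RocksDB API).
struct ToyWal {
  std::vector<std::string> durable;   // records that have been synced to disk
  std::vector<std::string> buffered;  // records written but not yet synced

  // Append a record; sync=true mimics a write with the sync option set.
  void Write(const std::string& rec, bool sync) {
    buffered.push_back(rec);
    if (sync) SyncWAL();
  }

  // Mimics the effect of SyncWAL(): everything written so far becomes durable.
  void SyncWAL() {
    durable.insert(durable.end(), buffered.begin(), buffered.end());
    buffered.clear();
  }

  // A crash loses whatever was never synced.
  void Crash() { buffered.clear(); }
};
```

In this model, a record written with sync=false followed by Crash() is gone, while the same record followed by SyncWAL() and then Crash() is still in durable.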

      == MySQL's Group Commit API ==

      Here is how it works with the safe settings (sync_binlog=1, rocksdb_enable_2pc=ON, rocksdb_write_sync=ON):

      === Prepare ===
      The storage engine checks `thd->durability_property == HA_IGNORE_DURABILITY`.
      If true, it sets sync=false, which causes RocksDB not to persist the Prepare operation to disk.
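
A minimal sketch of that check follows. The enum value names are the ones used in this ticket; ToyThd, ToyWriteOptions, and prepare_write_options() are hypothetical stand-ins for illustration, not the actual MyRocks code.

```cpp
#include <cassert>

// Value names as used in the ticket; the real enum lives in the server headers.
enum durability_properties { HA_REGULAR_DURABILITY, HA_IGNORE_DURABILITY };

// Hypothetical stand-ins for THD and the RocksDB write options.
struct ToyThd { durability_properties durability_property; };
struct ToyWriteOptions { bool sync = true; };

// Hypothetical helper: choose the sync setting for the Prepare write.
// write_sync_enabled plays the role of rocksdb_write_sync=ON.
inline ToyWriteOptions prepare_write_options(const ToyThd& thd,
                                             bool write_sync_enabled) {
  ToyWriteOptions opts;
  opts.sync = write_sync_enabled;
  // The SQL layer promised to flush the WAL later, so skip the sync now.
  if (thd.durability_property == HA_IGNORE_DURABILITY) opts.sync = false;
  return opts;
}
```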

      === Flush logs ===

      Then the SQL layer calls rocksdb_flush_wal(), which makes the effect of
      the rocksdb_prepare() call persistent by calling SyncWAL().

      If we crash at this point, the recovery process will roll back the prepared
      transaction in MyRocks.

      Then the SQL layer writes and flushes the binlog. If we crash after that, recovery
      will commit the prepared MyRocks transaction.
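
The recovery rule described above can be sketched as a small function. This is an illustration under the assumption that recovery sees a list of prepared transaction ids from the engine and the set of ids found in the binlog; recover() and Outcome are hypothetical names, not server code.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

enum class Outcome { Commit, Rollback };

// Hypothetical helper: decide the fate of each transaction that the engine
// reports as prepared, based on whether the binlog recorded it.
inline std::map<std::string, Outcome> recover(
    const std::vector<std::string>& prepared_in_engine,
    const std::set<std::string>& xids_in_binlog) {
  std::map<std::string, Outcome> result;
  for (const auto& xid : prepared_in_engine) {
    // In the binlog: the crash happened after the binlog flush, so commit.
    // Not in the binlog: the crash happened before it, so roll back.
    result[xid] = xids_in_binlog.count(xid) ? Outcome::Commit
                                            : Outcome::Rollback;
  }
  return result;
}
```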

      As far as MyRocks is concerned, each SyncWAL() call is made individually.
      RocksDB has its own Group Commit implementation under the hood.

      === Commit ===

      Then the SQL layer calls rocksdb_commit().

      Commit writes to the WAL too, but does not sync it.
      A sync is not needed here: the effect of rocksdb_prepare() has already been synced, and the binlog, which tells recovery whether to commit or roll back, has been flushed to disk.
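
Putting the three phases together, the per-transaction sequence under the safe settings can be written down as data. This sketch only restates the steps from this description; Step, group_commit_steps(), and count_syncs() are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One step of the commit sequence and whether it syncs to disk.
struct Step {
  std::string name;
  bool syncs_to_disk;
};

// The sequence as described in this ticket (hypothetical representation).
inline std::vector<Step> group_commit_steps() {
  return {
      {"rocksdb_prepare", false},    // sync=false under HA_IGNORE_DURABILITY
      {"rocksdb_flush_wal", true},   // SyncWAL() persists the prepare
      {"binlog write+flush", true},  // sync_binlog=1
      {"rocksdb_commit", false},     // writes to the WAL without syncing
  };
}

// Count how many steps reach the disk synchronously.
inline int count_syncs(const std::vector<Step>& steps) {
  int n = 0;
  for (const auto& s : steps) {
    if (s.syncs_to_disk) ++n;
  }
  return n;
}
```

In this representation only two of the four steps sync to disk: the WAL sync that persists the prepare, and the binlog flush.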

      == MariaDB ==

      MariaDB 10.2 has thd->durability_property, but it is always equal to HA_REGULAR_DURABILITY.

      To actually perform group commit, MariaDB 10.0+ provides new handlerton functions:

      • handlerton->prepare_ordered
      • handlerton->commit_ordered
      • (handlerton->commit is still there and is still used)
      • handlerton->commit_checkpoint_request

        Attachments

        1. _b.test.innodb
          3 kB
        2. _b.test.myrocks
          4 kB
        3. commit-time-histogram.png
          47 kB
        4. commit-time-histogram.png
          47 kB
        5. oct17-benchmark.ods
          40 kB
        6. oct17-benchmark-result-sshot.png
          39 kB
        7. psergey-test-scaling.test
          3 kB
        8. psergey-test-scaling2.test
          4 kB
        9. test-rocksdb-gcommit.tgz
          2 kB

              People

              • Assignee:
                psergey Sergei Petrunia
                Reporter:
                psergey Sergei Petrunia