MariaDB Server / MDEV-11934

MariaRocks: Group Commit with binlog



    • 10.2.5-1


      MyRocks implements group commit with the binary log using the MySQL API:

      Inside MyRocks/RocksDB:

      • One can set m_rocksdb_tx->GetWriteOptions()->sync to false to avoid flushing the WAL on write.
      • One can flush the WAL to disk with the rdb->SyncWAL() call.
      • RocksDB has its own group commit implementation, which "just works" and is not visible from outside the RocksDB API.
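The two knobs above can be modeled with a toy write-ahead log. This is a minimal sketch, not the RocksDB API: the class `ToyWal` and its methods are hypothetical; `write(..., sync)` plays the role of a write with `WriteOptions::sync`, and `sync_wal()` plays the role of `SyncWAL()`. Only the synced prefix of the log is considered to survive a crash.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model of a WAL (not RocksDB code): records are appended
// to an OS buffer and only become durable when the log is synced.
class ToyWal {
 public:
  // Append a record. With sync == true everything written so far is made
  // durable, like a write with WriteOptions::sync set to true.
  void write(const std::string& record, bool sync) {
    records_.push_back(record);
    if (sync) sync_wal();
  }

  // Make every record written so far durable (the role of SyncWAL()).
  void sync_wal() {
    durable_ = records_.size();
    ++fsync_calls_;
  }

  std::size_t written() const { return records_.size(); }
  std::size_t durable() const { return durable_; }  // records surviving a crash
  int fsync_calls() const { return fsync_calls_; }

 private:
  std::vector<std::string> records_;
  std::size_t durable_ = 0;  // length of the durable prefix of records_
  int fsync_calls_ = 0;
};
```

Writing several records with `sync=false` and then calling `sync_wal()` once costs a single fsync instead of one per record, which is the saving the flags above make possible.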

      == MySQL's Group Commit API ==

      Here is how it works with the safe settings (sync_binlog=1, rocksdb_enable_2pc=ON, rocksdb_write_sync=ON):

      === Prepare ===
      The storage engine checks `thd->durability_property == HA_IGNORE_DURABILITY`.
      If true, it sets sync=false, which causes RocksDB not to persist the Prepare operation to disk.
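The Prepare-time decision can be sketched as a small function. This is an illustration under assumptions: the enum is redeclared locally so the example compiles on its own (its values are assumed to mirror the SQL layer's durability enum), and `prepare_should_sync` is a hypothetical helper, not a MyRocks function. Falling back to the `rocksdb_write_sync` setting when durability is not ignored is also an assumption.

```cpp
#include <cassert>

// Local stand-in for the SQL layer's durability enum, redeclared here so the
// sketch compiles alone (values assumed, not copied from the server headers).
enum durability_properties {
  HA_REGULAR_DURABILITY = 0,
  HA_IGNORE_DURABILITY = 1
};

// Hypothetical helper: the value an engine would store into the transaction's
// WriteOptions::sync before writing the Prepare record.
bool prepare_should_sync(durability_properties durability, bool write_sync_setting) {
  // The SQL layer will call the engine's flush-logs hook (rocksdb_flush_wal)
  // later, so the Prepare record does not need its own fsync.
  if (durability == HA_IGNORE_DURABILITY) return false;
  // Otherwise use the engine's own setting (assumed: rocksdb_write_sync).
  return write_sync_setting;
}
```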

      === Flush logs ===

      Then the SQL layer calls rocksdb_flush_wal(), which makes the effect of the
      rocksdb_prepare() call persistent by calling SyncWAL().

      If the server crashes at this point, the recovery process will roll back the
      prepared MyRocks transaction.

      Then the SQL layer writes and flushes the binlog. If the server crashes after that,
      recovery will commit the prepared MyRocks transaction.
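The recovery rule in the two paragraphs above can be sketched as follows. This is a simplified model: transactions are identified by plain integer XIDs (real XIDs are structured), and `recover` and `Outcome` are hypothetical names. Every transaction found prepared in the engine is committed if its XID reached the flushed binlog, and rolled back otherwise.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

enum class Outcome { Commit, Rollback };

// Hypothetical sketch of binlog-coordinated crash recovery: the flushed
// binlog decides the fate of each transaction left prepared in the engine.
std::map<std::uint64_t, Outcome> recover(const std::vector<std::uint64_t>& prepared_xids,
                                         const std::set<std::uint64_t>& binlog_xids) {
  std::map<std::uint64_t, Outcome> decisions;
  for (std::uint64_t xid : prepared_xids) {
    // In the binlog: the commit decision was made durable, so commit.
    // Not in the binlog: the crash happened before the binlog flush, so roll back.
    decisions[xid] = binlog_xids.count(xid) ? Outcome::Commit : Outcome::Rollback;
  }
  return decisions;
}
```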

      As far as MyRocks is concerned, each SyncWAL() call is made individually.
      RocksDB has its own Group Commit implementation under the hood.
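For readers unfamiliar with the idea, here is a generic leader/follower group sync model. It is an illustration of the technique only, not RocksDB's code, and whether RocksDB coalesces concurrent `SyncWAL()` calls in exactly this way is not stated here. All names are hypothetical, and `fake_fsync()` just sleeps in place of a real fsync.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Generic leader/follower group sync (illustration only, not RocksDB code).
class GroupSyncLog {
 public:
  // Append one record and return its log sequence number (LSN).
  std::uint64_t append() {
    std::lock_guard<std::mutex> lk(mu_);
    return ++written_lsn_;
  }

  // Return once every record up to `lsn` is durable. If no sync is running,
  // the caller becomes the leader and syncs everything appended so far, which
  // also covers records that other threads appended in the meantime.
  void sync_up_to(std::uint64_t lsn) {
    std::unique_lock<std::mutex> lk(mu_);
    while (durable_lsn_ < lsn) {
      if (sync_in_progress_) {
        cv_.wait(lk);  // follower: the running sync may cover our record
        continue;
      }
      sync_in_progress_ = true;
      const std::uint64_t target = written_lsn_;  // batch all pending records
      lk.unlock();
      fake_fsync();  // expensive I/O happens without holding the mutex
      lk.lock();
      durable_lsn_ = target;
      sync_in_progress_ = false;
      ++fsync_calls_;
      cv_.notify_all();
    }
  }

  std::uint64_t durable_lsn() {
    std::lock_guard<std::mutex> lk(mu_);
    return durable_lsn_;
  }
  int fsync_calls() {
    std::lock_guard<std::mutex> lk(mu_);
    return fsync_calls_;
  }

 private:
  // Stand-in for fsync(); sleeps so that concurrent callers pile up.
  static void fake_fsync() { std::this_thread::sleep_for(std::chrono::milliseconds(2)); }

  std::mutex mu_;
  std::condition_variable cv_;
  std::uint64_t written_lsn_ = 0;
  std::uint64_t durable_lsn_ = 0;
  bool sync_in_progress_ = false;
  int fsync_calls_ = 0;
};
```

With several threads each calling `sync_up_to(append())`, every record ends up durable, and the number of fsyncs is at most the number of threads (each thread leads at most once) and usually lower; the exact count depends on scheduling.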

      === Commit ===

      Then the SQL layer calls rocksdb_commit().

      The Commit is written to the WAL too, but the WAL is not synced.
      This is safe because the effect of rocksdb_prepare() has already been flushed, and the binlog, which records whether recovery should commit or roll back the transaction, has already been flushed to disk.
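The whole sequence under the safe settings can be written down as a list of steps, marking which ones force data to disk. This sketch is illustrative: `Step`, `commit_sequence`, `after_crash` and `fsyncs_per_transaction` are hypothetical names, and `after_crash` hard-codes that the Prepare record is durable only after step 2 and the binlog only after step 4. It shows why the unsynced Commit record does not change the outcome of recovery.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One step of the two-phase commit described above (illustrative only).
struct Step {
  std::string what;
  bool fsync;  // does this step force data to disk?
};

const std::vector<Step>& commit_sequence() {
  static const std::vector<Step> steps = {
      {"engine: write Prepare record to WAL, sync=false", false},
      {"sql layer: rocksdb_flush_wal() -> SyncWAL()", true},
      {"sql layer: write transaction to binlog", false},
      {"sql layer: fsync binlog (sync_binlog=1)", true},
      {"engine: write Commit record to WAL, no sync", false},
  };
  return steps;
}

enum class AfterCrash { NotCommitted, RolledBack, Committed };

// Outcome if the server crashes right after the first `done` steps.
AfterCrash after_crash(std::size_t done) {
  const bool prepare_durable = done >= 2;  // synced by step 2
  const bool binlog_durable = done >= 4;   // synced by step 4
  // Before step 2 the Prepare record may be lost; even if it survives, its
  // XID is not in the binlog, so the transaction is not committed.
  if (!prepare_durable) return AfterCrash::NotCommitted;
  return binlog_durable ? AfterCrash::Committed : AfterCrash::RolledBack;
}

int fsyncs_per_transaction() {
  int n = 0;
  for (const Step& s : commit_sequence()) n += s.fsync ? 1 : 0;
  return n;
}
```

A crash after step 4 and a crash after step 5 give the same result, which is why the Commit record needs no sync of its own.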

      == MariaDB ==

      MariaDB 10.2 has thd->durability_property, but its value is always HA_REGULAR_DURABILITY.

      To actually implement group commit, MariaDB 10.0+ provides new handlerton functions:

      • handlerton->prepare_ordered
      • handlerton->commit_ordered
      • handlerton->commit (still present and still used)
      • handlerton->commit_checkpoint_request
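The order in which these hooks fire for one binlog group can be sketched with a mock. This is a hypothetical, heavily simplified stand-in: the real handlerton callbacks take `handlerton*`/`THD*` arguments and are driven by the server's binlog group commit code, and the order below is our reading of the comments in MariaDB's `handler.h`, not a verified trace. `commit_checkpoint_request` is not modeled; as we understand it, the binlog uses it to ask the engine to report when earlier commits are durable, so that an old binlog file is no longer needed for crash recovery.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the handlerton hooks listed above.
struct ToyEngineHooks {
  std::function<void(int trx)> prepare;
  std::function<void(int trx)> prepare_ordered;
  std::function<void(int trx)> commit_ordered;
  std::function<void(int trx)> commit;
};

// Assumed call order for one group: prepare (may run in parallel),
// prepare_ordered (serialized, in binlog order), one binlog write and fsync
// for the whole group, commit_ordered (serialized, same order), then commit
// (may run in parallel; run sequentially here for simplicity).
void group_commit(const ToyEngineHooks& h, const std::vector<int>& group,
                  std::vector<std::string>& trace) {
  for (int t : group) h.prepare(t);
  for (int t : group) h.prepare_ordered(t);
  trace.push_back("binlog write+fsync for " + std::to_string(group.size()) + " trx");
  for (int t : group) h.commit_ordered(t);
  for (int t : group) h.commit(t);
}
```

The point of the split is that an engine only needs the ordered hooks to be cheap; the expensive work can stay in `prepare` and `commit`, which do not hold the group commit lock.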


        1. _b.test.innodb
          3 kB
        2. _b.test.myrocks
          4 kB
        3. commit-time-histogram.png
          47 kB
        4. commit-time-histogram.png
          47 kB
        5. oct17-benchmark.ods
          40 kB
        6. oct17-benchmark-result-sshot.png
          39 kB
        7. psergey-test-scaling.test
          3 kB
        8. psergey-test-scaling2.test
          4 kB
        9. test-rocksdb-gcommit.tgz
          2 kB

        Issue Links



              psergei Sergei Petrunia