[MDEV-11934] MariaRocks: Group Commit with binlog Created: 2017-01-29  Updated: 2018-02-21  Resolved: 2018-02-21

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Fix Version/s: 10.2.13

Type: Task Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None

Attachments: File _b.test.innodb     File _b.test.myrocks     PNG File commit-time-histogram.png     PNG File commit-time-histogram.png     PNG File oct17-benchmark-result-sshot.png     File oct17-benchmark.ods     File psergey-test-scaling.test     File psergey-test-scaling2.test     File test-rocksdb-gcommit.tgz    
Issue Links:
PartOf
is part of MDEV-9658 Make MyRocks in MariaDB stable Closed
Relates
relates to MDEV-14103 Testing for group commit in MyRocks Closed
Sprint: 10.2.5-1

 Description   

MyRocks has group commit with the binary log based on MySQL API:
https://github.com/facebook/mysql-5.6/commit/14a0d4a97c09b52fa7450e6a3d56ebe7ed193ab6

Inside MyRocks/RocksDB:

  • One can set m_rocksdb_tx->GetWriteOptions()->sync to false to avoid flushing.
  • One can flush WAL to disk with rdb->SyncWAL() call.
  • RocksDB has its own group commit imlementation which "just works" and is not visible from outside of RocksDB API.

== MySQL's Group Commit API ==

Here is a description of how it works when safe settings are ( sync_binlog=1, rocksdb_enable_2pc=ON, rocksdb_write_sync=ON)

=== Prepare ===
The storage engine checks `thd->durability_property == HA_IGNORE_DURABILITY`.
If true, it sets sync=false, which causes RocksDB not to persist the Prepare operation to disk.

=== Flush logs ===

Then SQL layer calls rocksdb_flush_wal() which makes the effect of
rocksdb_prepare() call persistent by calling SyncWAL().

If we crash at this point, recovery process will roll back the prepared
transaction in MyRocks.

Then, SQL layer writes and flushes the binlog. If we crash after that, recovery
will commit the prepared MyRocks' transaction.

As far as MyRocks is concerned, each SyncWAL() call is made individually.
RocksDB has its own Group Commit implementation under the hood.

=== Commit ===

Then SQL layer calls rocksdb_commit().

Commit writes to WAL too, but does not sync it.
(The effect of rocksdb_prepare() was flushed, the binlog has the information about whether the recovery should commit or roll back, the binlog has been flushed to disk)

== MariaDB ==

MariaDB 10.2 has thd->durability_property but it is always equal to HA_REGULAR_DURABILITY

For actually doing Group Commit, MariaDB 10.0+ has new handlerton functions:

  • handlerton->prepare_ordered
  • handlerton->commit_ordered
  • (handlerton->commit is still there and still used also)
  • handlerton->commit_checkpoint_request


 Comments   
Comment by Sergei Petrunia [ 2017-01-29 ]

Debugging MariaDB/InnoDB with
sync_binlog=1, innodb_flush_logs_at_trx_commit=1 (default),

== Prepare ==

  • innobase_xa_prepare() does the prepare operation and flushes the logs
    (note: it seems there is an innodb-internal group flushing there.. there's a comment about it but didn't find the code so far)

== innobase_commit_ordered ==

  • Fetch the binlog position from the SQL layer
  • set trx->flush_log_later= true (that is, we don't flush here)
  • trx_sys_update_mysql_binlog_offset ...
  • Logical commit happened, but flushing the log didnt
  • Sets trx->active_commit_ordered=1 to remember this
  • Also sets trx->must_flush_log_later=true

== innobase_commit ==

  • Check that we are just doing the final steps after innobase_commit_ordered.
  • Call trx_commit_complete_for_mysql
  • call trx_flush_log_if_needed, which flushes the log

== innobase_checkpoint_request ==

  • This is only called on binlog rotation (that is, rarely)
Comment by Sergei Petrunia [ 2017-01-29 ]

So, MariaRocks could work like this:

  • prepare_ordered() is not implemented (same as in InnoDB)
  • prepare() does trx->prepare(). It will have to sync to disk.
  • commit_ordered() would commit without writing the WAL.
  • commit() (when called after commit_ordered) would just flush the WAL
  • rocksdb_checkpoint_request would flush the WAL

But will this be performant? Reading MariaDB's handler.h:

     The intention is that commit_ordered() should do the minimal amount of
     work that needs to happen in consistent commit order among handlers. To
     preserve ordering, calls need to be serialised on a global mutex, so
     doing any time-consuming or blocking operations in commit_ordered() will
     limit scalability.
 
     Handlers can rely on commit_ordered() calls to be serialised (no two
     calls can run in parallel, so no extra locking on the handler part is
     required to ensure this).

So, if we flush the WAL in rocksdb_commit(), then requests to flush the WAL from different threads will go in parallel and hopefully that will be grouped (although it looks a bit inefficient).

However, the commit itself (making the changes visible to other transactions, etc) will not be done in parallel.

Comment by Sergei Petrunia [ 2017-01-29 ]

Another thing to note:

     Note that commit_ordered() can be called from a different thread than the
     one handling the transaction! So it can not do anything that depends on
     thread local storage, in particular it can not call my_error() and
     friends

Does RocksDB API allow this?

EDIT: most likely yes, because one can find this in facebook/mysql-5.6:

MYSQL_BIN_LOG::process_commit_stage_queue()
{
  ...
  Thread_excursion excursion(thd);
  for (THD *head= first ; head ; head = head->next_to_commit)
  {
    ...
    excursion.try_to_attach_to(head);
    ...
    if (ha_commit_low(head, all, async))

Looks like rocksdb_commit() can be invoked from a different thread.

Comment by Sergei Petrunia [ 2017-01-31 ]

An interesting related issue: https://github.com/facebook/mysql-5.6/issues/474 :

But in MyRocks, commit() phase is much more expensive than InnoDB's commit() phase, because writing to MemTable happens there. If you run pmp/quickstack, you may consistently find only one thread is doing myrocks::Rdb_transaction::commit() and other threads are waiting at MYSQL_BIN_LOG::change_stage(). And you'll find commit() is calling rocksdb::MemTableInserter::PutCF() and other MemTable related functions. Serialized MemTable write calls is pretty inefficient.

So MariaDB has commit and commit_ordered. However, RocksDB does not provide an API call that would be suitable for calling from rocksdb_commit_ordered.

  • rocksdb_commit_ordered() would call something that would just put the COMMIT record into WAL (without flushing)
  • Then, rocksdb_commit would apply the transaction's MemTable contents into the global MemTable.
Comment by Sergei Petrunia [ 2017-02-03 ]

See also MDEV-11900 for the situation with 10.2 and InnoDB

Comment by Sergei Petrunia [ 2017-08-28 ]

Looking at how "Prepare" operation for the storage engine is persisted on disk.

In MariaDB, each Prepare operation tries to flush the logs itself :

  #0  log_write_up_to (lsn=1908254, flush_to_disk=true) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/log/log0log.cc:1234
  #1  0x00005555562f5c08 in trx_flush_log_if_needed_low (lsn=1908254) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/trx/trx0trx.cc:1600
  #2  0x00005555562f5c51 in trx_flush_log_if_needed (lsn=1908254, trx=0x7fffe64c1878) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/trx/trx0trx.cc:1622
  #3  0x00005555562f8fd2 in trx_prepare (trx=0x7fffe64c1878) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/trx/trx0trx.cc:2774
  #4  0x00005555562f9084 in trx_prepare_for_mysql (trx=0x7fffe64c1878) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/trx/trx0trx.cc:2796
  #5  0x000055555610cc78 in innobase_xa_prepare (hton=0x5555579371c0, thd=0x7fff78000b00, prepare_trx=false) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/handler/ha_innodb.cc:17869
  #6  0x0000555555d7923c in prepare_or_error (ht=0x5555579371c0, thd=0x7fff78000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1146
  #7  0x0000555555d79adf in ha_commit_trans (thd=0x7fff78000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1425
  #8  0x0000555555c60990 in trans_commit_stmt (thd=0x7fff78000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/transaction.cc:510
  #9  0x0000555555b0c3ef in mysql_execute_command (thd=0x7fff78000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:6261
  #10 0x0000555555b10aa6 in mysql_parse (thd=0x7fff78000b00, rawbuf=0x7fff780111f8 "insert into t20 values (2)", length=26, parser_state=0x7ffff4463210, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:7886

Concurrent Prepare operations can get grouped (one Prepare will make a flush for many), this is achieved by relying on InnoDB's native group commit mechanism (which predates XA group commit):

See this comment in trx_prepare:

		The idea in InnoDB's group prepare is that a group of
		transactions gather behind a trx doing a physical disk write
		to log files, and when that physical write has been completed,
		one of those transactions does a write which prepares the whole
		group. Note that this group prepare will only bring benefit if
		there are > 2 users in the database. Then at least 2 users can
		gather behind one doing the physical log write to disk.

In fb/mysql-5.6, it works differently.

InnoDB's Prepare calls thd_requested_durability() which returns HA_IGNORE_DURABILITY, and flushing is not done:

(gdb) wher
  #0  thd_requested_durability (thd=0x2bb4cd0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/handler/ha_innodb.cc:1830
  #1  0x000000000150c35a in trx_prepare (trx=0x7fffe0d01e60, async=0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/trx/trx0trx.cc:2365
  #2  0x000000000150c421 in trx_prepare_for_mysql (trx=0x7fffe0d01e60, async=0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/trx/trx0trx.cc:2409
  #3  0x000000000139bea8 in innobase_xa_prepare (hton=0x2949ad0, thd=0x2bb4cd0, prepare_trx=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/handler/ha_innodb.cc:15170
  #4  0x0000000000d85aa5 in ha_prepare_low (thd=0x2bb4cd0, all=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:2264
  #5  0x0000000001237ee0 in MYSQL_BIN_LOG::prepare (this=0x2744000 <mysql_bin_log>, thd=0x2bb4cd0, all=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/binlog.cc:7054
  #6  0x0000000000d83c9c in ha_commit_trans (thd=0x2bb4cd0, all=false, async=false, ignore_global_read_lock=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:1540
  #7  0x000000000106f0a3 in trans_commit_stmt (thd=0x2bb4cd0, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/transaction.cc:452
  #8  0x0000000000f8cf3f in mysql_execute_command (thd=0x2bb4cd0, statement_start_time=0x7ffff430aaa8, post_parse=0x7ffff430ac00) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:6254
  #9  0x0000000000f90913 in mysql_parse (thd=0x2bb4cd0, rawbuf=0x7fffa4005680 "insert into t21 values (4)", length=26, parser_state=0x7ffff430b4b0, last_timer=0x7ffff430ac00, async_commit=0x7ffff430abcc "") at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:7770
 
(gdb) fini
  Run till exit from #0  thd_requested_durability (thd=0x2bb4cd0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/handler/ha_innodb.cc:1830
  0x000000000150c35a in trx_prepare (trx=0x7fffe0d01e60, async=0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/trx/trx0trx.cc:2365
  Value returned is $17 = HA_IGNORE_DURABILITY

and then the SQL layer makes a special call to make the effects of all Prepare transactions persistent:

  Breakpoint 3, ha_flush_logs (db_type=0x0, engine_map=0x7fffa40086b0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:2452
(gdb) wher
  #0  ha_flush_logs (db_type=0x0, engine_map=0x7fffa40086b0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:2452
  #1  0x0000000001238999 in MYSQL_BIN_LOG::process_flush_stage_queue (this=0x2744000 <mysql_bin_log>, total_bytes_var=0x7ffff4308468, rotate_var=0x7ffff430844f, out_queue_var=0x7ffff4308470, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/binlog.cc:7322
  #2  0x0000000001239b85 in MYSQL_BIN_LOG::ordered_commit (this=0x2744000 <mysql_bin_log>, thd=0x2bb4cd0, all=false, skip_commit=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/binlog.cc:7891
  #3  0x00000000012385fd in MYSQL_BIN_LOG::commit (this=0x2744000 <mysql_bin_log>, thd=0x2bb4cd0, all=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/binlog.cc:7223
  #4  0x0000000000d83cda in ha_commit_trans (thd=0x2bb4cd0, all=false, async=false, ignore_global_read_lock=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:1542
  #5  0x000000000106f0a3 in trans_commit_stmt (thd=0x2bb4cd0, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/transaction.cc:452
  #6  0x0000000000f8cf3f in mysql_execute_command (thd=0x2bb4cd0, statement_start_time=0x7ffff430aaa8, post_parse=0x7ffff430ac00) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:6254
  #7  0x0000000000f90913 in mysql_parse (thd=0x2bb4cd0, rawbuf=0x7fffa4005680 "insert into t21 values (4)", length=26, parser_state=0x7ffff430b4b0, last_timer=0x7ffff430ac00, async_commit=0x7ffff430abcc "") at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:7770

Comment by Sergei Petrunia [ 2017-08-31 ]

Created a simple testcase and tried it on both InnoDB and MyRocks. InnoDB results were:

InnoDB

# THIS IS: --concurrency=50 --number-of-queries=10000, rotating disk
Innodb_data_fsyncs= 540
Binlog_commits= 10000
Binlog_group_commits= 2223

# This is  --concurrency=50 --number-of-queries=10000, --mem
Innodb_data_fsyncs = 9912
Binlog_commits = 10000
Binlog_group_commits= 9851

# THIS IS: --concurrency=50 --number-of-queries=10000, aws, ssd
Innodb_data_fsyncs 590
Binlog_commits 10000
Binlog_group_commits 4559

# THIS IS: --concurrency=50 --number-of-queries=10000, aws, memory
Innodb_data_fsyncs 9075  (the value is around 9K)
Binlog_commits 10000
Binlog_group_commits 2517 (the value is 2K -4K)

  • I am not sure why running with --mem makes Binlog_group_commits be ~2K on aws host and ~9.8K on my desktop. (Both numbers are stable and do not fluctuate).
  • There's more grouping on rotating disk than on SSD, makes sense.
Comment by Sergei Petrunia [ 2017-08-31 ]

Trying the same with MyRocks and experimental group commit patch :

# THIS IS: --concurrency=50 --number-of-queries=10000, home, --mem
ROCKSDB_WAL_GROUP_SYNCS 0
ROCKSDB_WAL_SYNCED 8247  (aws, --mem: 7K)
Binlog_commits 10000
Binlog_group_commits 726  (aws, --mem: 800)

# THIS IS: --concurrency=50 --number-of-queries=10000, home, rotating disk
ROCKSDB_WAL_GROUP_SYNCS 0
ROCKSDB_WAL_SYNCED 8540
Binlog_commits 10000
Binlog_group_commits 652
# This takes ~300 seconds.

# THIS IS: --concurrency=50 --number-of-queries=10000, aws, ssd
ROCKSDB_WAL_GROUP_SYNCS 0
ROCKSDB_WAL_SYNCED 6994
Binlog_commits 10000
Binlog_group_commits 659

  • ROCKSDB_WAL_GROUP_SYNCS=0 is expected. Look in the previous comments for " the SQL layer makes a special call to make the effects of all Prepare transactions persistent" and how that call is lacking in MariaDB.
  • The values of Binlog_group_commits show there's some amount of grouping
  • ROCKSDB_WAL_SYNCED is too high for both on-disk workloads.
    • Are these real sync operations or just requests to sync the WAL? (That is, if two Transaction::Prepare() calls issue requests to sync the WAL, and one of them happens to sync it for both, will ROCKSDB_WAL_SYNCED be incremented by 1 or by 2?)
  • Execution time for rotating disk is too high?
Comment by Sergei Petrunia [ 2017-08-31 ]

_b.test.innodb _b.test.myrocks

Comment by Sergei Petrunia [ 2017-08-31 ]

Probably related: MDEV-11937

Comment by Sergei Petrunia [ 2017-09-01 ]

_c.test is a testcase that

  • runs a serial workload
  • runs a parallel workload

the workload is run with {$engine}_flush_log_at_trx_commit=1, log_bin=0.
the results do make sense:

aws ssd:

For --concurrency=1 --number-of-queries=10000
Innodb_os_log_fsyncs=10K
time: 12 sec (1K syncs/sec)
 
For --concurrency=50 --number-of-queries=100000
Innodb_os_log_fsyncs=4735
time: 7 sec (676 syncs/sec.)
 
For --concurrency=1 --number-of-queries=10000
ROCKSDB_WAL_SYNCED: 10000
time: 25 sec (400 syncs/sec)
 
For --concurrency=50 --number-of-queries=100000
ROCKSDB_WAL_SYNCED: 3952
time: 9 sec (439 syncs/sec)

rotating desktop hdd:

For --concurrency=1 --number-of-queries=1000
Innodb_os_log_fsyncs: 1005
time: 30 sec  (33 syncs/sec)
 
For --concurrency=50 --number-of-queries=10000
Innodb_os_log_fsyncs: 450
time: 16 sec (28 syncs/sec).
 
For --concurrency=1 --number-of-queries=1000 
ROCKSDB_WAL_SYNCED: 1000
time: 36 sec (27.7 syncs /sec)
 
For --concurrency=50 --number-of-queries=10000
ROCKSDB_WAL_SYNCED: 354
time: 13 sec (27.2 syncs/ sec)

The numbers are fairly stable and consistent across the InnoDB and MyRocks.

Comment by Sergei Petrunia [ 2017-09-01 ]

Now, trying InnoDB with binlog enabled

Rotating hdd, log_bin=1, sync_binlog=1, innodb_flush_logs_at_trx_commit=1

For--concurrency=1 --number-of-queries=1000
Innodb_os_log_fsyncs 1006
Binlog_commits 1000
Binlog_group_commits 1000
time: 30 sec  

  1. Both InnoDB and binlog manage to do 30 commits/sec.
  2. The interesting part is that they manage to do this together.

For --concurrency=50 --number-of-queries=10000
Innodb_os_log_fsyncs 473
Binlog_commits 10000
Binlog_group_commits 2246
time: 16 sec

  1. Binlog commit numbers do not look realistic: 2246/16=140 commits/sec ?
  2. InnoDB's fsyncs agree with previous observations: 473/16 = 29.6 commits/sec

AWS ssd, log_bin=1, sync_binlog=1, innodb_flush_logs_at_trx_commit=1

# THIS IS: --concurrency=1 --number-of-queries=10000
Innodb_os_log_fsyncs 10008
Binlog_commits 10000
Binlog_group_commits 10000
time: 14 sec

  1. Both InnoDB and binlog do 714 commits/sec.

# THIS IS: --concurrency=50 --number-of-queries=100000
Innodb_os_log_fsyncs 4992
Binlog_commits 100000
Binlog_group_commits 47449
time: 7 sec.

  1. InnoDB does 713 commits/sec (agrees with previous observations)
  2. Binlog does 6778 commits/sec (looks too high also?)
Comment by Sergei Petrunia [ 2017-09-01 ]

Trying the current MyRocks+MariaDB group commit patch

Rotating hdd

# THIS IS: --concurrency=1 --number-of-queries=1000
ROCKSDB_WAL_SYNCED 1000
Binlog_commits  1000
time: 40 sec

25 syncs/sec, slightly lower than without the binlog

# THIS IS: --concurrency=50 --number-of-queries=10000
ROCKSDB_WAL_SYNCED 7647
Binlog_commits 10000
Binlog_group_commits 651
time: 293 sec

  • This is very slow
  • RocksDB commit rate was 7647/293.=26 rocksdb commits/sec. which looks like no commit grouping.

aws ssd

# THIS IS: --concurrency=1 --number-of-queries=10000
ROCKSDB_WAL_SYNCED 10000
Binlog_commits 10000
Binlog_group_commits 10000
time: 25 sec

this gives 400 commits/sec for both binlog and RocksDB. this matches previous numbers we've got for this SSD

# THIS IS: --concurrency=50 --number-of-queries=100000
ROCKSDB_WAL_SYNCED 65737
Binlog_commits 100000
Binlog_group_commits 6535
time: 153 sec

this is 429 WAL syncs/sec. Again looks as if WAL sync operations were not grouped at all?

Comment by Sergei Petrunia [ 2017-09-02 ]

Trying to figure out

  1. if indeed Transaction::Prepare operations do not use RocksDB's internal Group Commit for some reason
  2. If the above is true, why.
  • Found DBOptions::concurrent_prepare parameter. It is off (by default)
    • perphaps, I should have it ON (update: this doesn't seem to be true)
    • attempt to switch it ON caused mysqld to fail an assertion on start
  • Debugged (with concurrent_prepare=off) to the point where PREPARE command
    does sync:

(gdb) wher
  #0  rocksdb::PosixWritableFile::Sync (this=0x555557bbef20) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/env/io_posix.cc:849
  #1  0x00007ffff4aab9f7 in rocksdb::WritableFileWriter::SyncInternal (this=0x555557bb81b0, use_fsync=false) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/util/file_reader_writer.cc:352
  #2  0x00007ffff4aab568 in rocksdb::WritableFileWriter::Sync (this=0x555557bb81b0, use_fsync=false) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/util/file_reader_writer.cc:323
  #3  0x00007ffff48db9fd in rocksdb::DBImpl::WriteToWAL (this=0x555557baee50, write_group=..., log_writer=0x555557bbed90, log_used=0x7fffb4025f58, need_log_sync=true, need_log_dir_sync=false, sequence=10) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/db_impl_write.cc:719
  #4  0x00007ffff48d8eb7 in rocksdb::DBImpl::WriteImpl (this=0x555557baee50, write_options=..., my_batch=0x7fffb4026140, callback=0x0, log_used=0x7fffb4025f58, log_ref=0, disable_memtable=true) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/db_impl_write.cc:234
  #5  0x00007ffff4b0a6ba in rocksdb::TransactionImpl::Prepare (this=0x7fffb4025f50) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/utilities/transactions/transaction_impl.cc:195
  #6  0x00007ffff47ec27a in myrocks::Rdb_transaction_impl::prepare (this=0x7fffb4024940, name="\000\000\000\000\000\000\000\001\030\000MySQLXid\f\000\000\000\000\000\000\000\004\000\000\000\000\000\000") at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2129
  #7  0x00007ffff47c1030 in myrocks::rocksdb_prepare (hton=0x555557b96e70, thd=0x7fffb4000b00, prepare_tx=false) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2733
  #8  0x0000555555d70582 in prepare_or_error (ht=0x555557b96e70, thd=0x7fffb4000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1146
  #9  0x0000555555d70e2d in ha_commit_trans (thd=0x7fffb4000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1425
  #10 0x0000555555c5d5c7 in trans_commit_stmt (thd=0x7fffb4000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/transaction.cc:510
  #11 0x0000555555b09c5b in mysql_execute_command (thd=0x7fffb4000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:6261
  #12 0x0000555555b0e332 in mysql_parse (thd=0x7fffb4000b00, rawbuf=0x7fffb40110f8 "insert into t32 values (101)", length=28, parser_state=0x7fffe6bd7250, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:7886

DBImpl::WriteImpl has some grouping logic but I am not sure what grouping applies to:

  • just putting new data into the MemTable
  • or syncing the WAL also?
Comment by Sergei Petrunia [ 2017-09-02 ]

Re concurrent_prepare: the assertion I am hitting is this one in DBImpl::WriteImpl

      assert(!need_log_sync && !need_log_dir_sync);

I have

 
need_log_sync= true
need_log_dir_sync=true

these come from

  bool need_log_sync = !write_options.disableWAL && write_options.sync;
  bool need_log_dir_sync = need_log_sync && !log_dir_synced_;

It seems unlikely that MyRocks should operate with need_log_sync=off.

Comment by Sergei Petrunia [ 2017-09-05 ]

test-rocksdb-gcommit.tgz Wrote a test program that uses RocksDB to perform:

  • concurrent writes with commits
  • concurrent "Prepare( sync=true|false); commit (sync=true|false)"

AWS ssd:

  sync_prepare=false
  sync_commit=true
  using_2pc=false
TH      trx     n_threads       time    tps
TR      1000    2       4.513832        443.082538
TR      1000    4       4.754442        841.318433
TR      1000    8       4.881946        1638.690906
TR      1000    16      5.181725        3087.774675
TR      1000    32      5.734200        5580.551382
TR      1000    64      5.817391        11001.495775
TR      1000    128     8.551488        14968.155234
TR      1000    256     12.689465       20174.215468
TR      1000    512     22.345159       22913.239996

  sync_prepare=true
  sync_commit=false
  using_2pc=true
TH      trx     n_threads       time    tps
TR      1000    1       2.109577        474.028568
TR      1000    2       4.606521        434.167139
TR      1000    4       5.010855        798.266883
TR      1000    8       6.948192        1151.378641
TR      1000    16      7.309022        2189.075393
TR      1000    32      8.835688        3621.676118
TR      1000    64      10.030643       6380.448559
TR      1000    128     12.595573       10162.300453
TR      1000    256     18.653547       13723.931357
TR      1000    512     29.878734       17135.933763

Rotating HDD:
  sync_prepare=true
  sync_commit=false
  using_2pc=true
TH      trx     n_threads       time    tps
TR      1000    1       31.522334       31.723539
TR      1000    2       63.834870       31.330839
TR      1000    4       76.781369       52.095972
TR      1000    8       110.673415      72.284749
TR      1000    16      116.457815      137.388804
TR      1000    32      98.671802       324.307446
TR      1000    64      125.448014      510.171488
TR      1000    128     123.296987      1038.143781
TR      1000    256     123.317320      2075.945219
TR      1000    512     155.731128      3287.717785

So, RocksDB itself seems to be perfectly capable of providing group commit for Prepare(sync=true)/Commit(sync=false) transactions.

Comment by Sergei Petrunia [ 2017-09-05 ]

Running the same parallel workload on fb-mysql and MariaDB, with and without binlog. Noting counter values:

fb/mysql, no-binlog vs binlog:

@@ -40,8 +36,8 @@
-| rocksdb_memtable_total                | 44704   |
-| rocksdb_memtable_unflushed            | 44704   |
+| rocksdb_memtable_total                | 95096   |
+| rocksdb_memtable_unflushed            | 95096   |
@@ -59,7 +55,7 @@
-| rocksdb_bytes_written                 | 27561   |
+| rocksdb_bytes_written                 | 210849  |
@@ -79,7 +75,7 @@
-| rocksdb_number_keys_written           | 1005    |
+| rocksdb_number_keys_written           | 2005    |
@@ -97,13 +93,13 @@
-| rocksdb_wal_bytes                     | 27561   |
-| rocksdb_wal_group_syncs               | 0       |
+| rocksdb_wal_bytes                     | 149849  |
+| rocksdb_wal_group_syncs               | 200     |
 | rocksdb_wal_synced                    | 204     |
-| rocksdb_write_other                   | 800     |
-| rocksdb_write_self                    | 204     |
+| rocksdb_write_other                   | 26      |
+| rocksdb_write_self                    | 1978    |
 | rocksdb_write_timedout                | 0       |
-| rocksdb_write_wal                     | 2008    |
+| rocksdb_write_wal                     | 4008    |

MariaDB-gcommit-patch, no-binlog vs binlog:

@@ -28,8 +37,8 @@
-| Rocksdb_memtable_total                | 44584  |
-| Rocksdb_memtable_unflushed            | 44584  |
+| Rocksdb_memtable_total                | 95280  |
+| Rocksdb_memtable_unflushed            | 95280  |
@@ -47,7 +56,7 @@
-| Rocksdb_bytes_written                 | 27477  |
+| Rocksdb_bytes_written                 | 206333 |
@@ -65,7 +74,7 @@
-| Rocksdb_number_keys_written           | 1005   |
+| Rocksdb_number_keys_written           | 2005   |
@@ -83,12 +92,13 @@
-| Rocksdb_wal_bytes                     | 27477  |
+| Rocksdb_wal_bytes                     | 145333 |
 | Rocksdb_wal_group_syncs               | 0      |
-| Rocksdb_wal_synced                    | 197    |
-| Rocksdb_write_other                   | 807    |
-| Rocksdb_write_self                    | 197    |
+| Rocksdb_wal_synced                    | 518    |
+| Rocksdb_write_other                   | 569    |
+| Rocksdb_write_self                    | 1435   |
 | Rocksdb_write_timedout                | 0      |
-| Rocksdb_write_wal                     | 2008   |
+| Rocksdb_write_wal                     | 4008   |

Comment by Sergei Petrunia [ 2017-09-05 ]

Most counter values are similar across MySQL and MariaDB.

  • MariaDB has rocksdb_wal_group_syncs=0 in both cases, this is expected because there's no hton->flush() call.
  • There is a difference in rocksdb_wal_synced:
    MySQL has:

     | rocksdb_wal_synced                    | 204     |
    

    MariaDB:

    -| Rocksdb_wal_synced                    | 197    |
    +| Rocksdb_wal_synced                    | 518    |
    

This agrees with the test execution time.
Experiments with test-gcommit.cc show that WAL_FILE_SYNCED grows at a rate of ~30/sec (with total TPS being 50, 200 or 500 for 5, 20, and 50 threads in the run, respectively)

That is, Rocksdb_wal_synced is "grouped commits".
It remains unclear why MariaDB+Binlog makes much more syncs than all other variants.

Comment by Sergei Petrunia [ 2017-09-06 ]

Added a printout in db_impl_write.cc, printing write_group.size right after
this call:

  // Add to log and apply to memtable.  We can release the lock
  // during this phase since &w is currently responsible for logging
  // and protects against concurrent loggers and concurrent writes
  // into memtables
 
  last_batch_group_size_ =
      write_thread_.EnterAsBatchGroupLeader(&w, &write_group);

Trying with mysqlslap --concurrency=20.
With log_bin=off, I get:

  AAA write_group.size=15
  AAA write_group.size=5
  AAA write_group.size=15
  AAA write_group.size=15
  AAA write_group.size=5
  AAA write_group.size=15
  AAA write_group.size=5
  AAA write_group.size=15
  AAA write_group.size=9
  AAA write_group.size=11
  AAA write_group.size=9
  AAA write_group.size=17

with log_bin=ON, I get this:

  AAA write_group.size=1
  AAA write_group.size=4
  AAA write_group.size=1
  AAA write_group.size=1
  AAA write_group.size=1
  AAA write_group.size=1
  AAA write_group.size=1
  AAA write_group.size=1
  AAA write_group.size=1
  AAA write_group.size=1
  AAA write_group.size=1
  AAA write_group.size=1
  AAA write_group.size=1

Comment by Sergei Petrunia [ 2017-09-07 ]

Is there something that prevents Prepare() calls from grouping? I've used gdb and manually had two Prepare calls enter a group, and the leader flushed the WAL file for both:

Follower:

  Thread 12 (Thread 0x7fffe6b8e700 (LWP 31582)):
  #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  #1  0x00007ffff629ad1c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
  #2  0x00007ffff49d9e73 in std::condition_variable::wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::<lambda()> >(std::unique_lock<std::mutex> &, rocksdb::WriteThread::<lambda()>) (this=0x7fffe6b8c080, __lock=..., __p=...) at /usr/include/c++/4.9/condition_variable:98
  #3  0x00007ffff49d803e in rocksdb::WriteThread::BlockingAwaitState (this=0x555557bb0360, w=0x7fffe6b8bff0, goal_mask=30 '\036') at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/write_thread.cc:46
  #4  0x00007ffff49d83d8 in rocksdb::WriteThread::AwaitState (this=0x555557bb0360, w=0x7fffe6b8bff0, goal_mask=30 '\036', ctx=0x7ffff50f59a0 <rocksdb::WriteThread::JoinBatchGroup(rocksdb::WriteThread::Writer*)::ctx>) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/write_thread.cc:166
  #5  0x00007ffff49d8c51 in rocksdb::WriteThread::JoinBatchGroup (this=0x555557bb0360, w=0x7fffe6b8bff0) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/write_thread.cc:299
  #6  0x00007ffff48d866a in rocksdb::DBImpl::WriteImpl (this=0x555557bafc20, write_options=..., my_batch=0x7fffb001a040, callback=0x0, log_used=0x7fffb0019e58, log_ref=0, disable_memtable=true) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/db_impl_write.cc:102
  #7  0x00007ffff4b0a7cc in rocksdb::TransactionImpl::Prepare (this=0x7fffb0019e50) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/utilities/transactions/transaction_impl.cc:195
  #8  0x00007ffff47ec302 in myrocks::Rdb_transaction_impl::prepare (this=0x7fffb0019a00, name="\000\000\000\000\000\000\000\001\030\000MySQLXid\f\000\000\000\000\000\000\000\b\000\000\000\000\000\000") at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2129
  #9  0x00007ffff47c1090 in myrocks::rocksdb_prepare (hton=0x555557b96e70, thd=0x7fffb0000b00, prepare_tx=false) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2733
  #10 0x0000555555d70582 in prepare_or_error (ht=0x555557b96e70, thd=0x7fffb0000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1146
  #11 0x0000555555d70e2d in ha_commit_trans (thd=0x7fffb0000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1425
  #12 0x0000555555c5d5c7 in trans_commit_stmt (thd=0x7fffb0000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/transaction.cc:510
  #13 0x0000555555b09c5b in mysql_execute_command (thd=0x7fffb0000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:6261
  #14 0x0000555555b0e332 in mysql_parse (thd=0x7fffb0000b00, rawbuf=0x7fffb0011058 "INSERT INTO t1 (id, value) VALUES(NULL, 1)", length=42, parser_state=0x7fffe6b8d250, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:7886
  #15 0x0000555555afc33d in dispatch_command (command=COM_QUERY, thd=0x7fffb0000b00, packet=0x7fffb0008911 "INSERT INTO t1 (id, value) VALUES(NULL, 1)", packet_length=42, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:1812
  #16 0x0000555555afacb7 in do_command (thd=0x7fffb0000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:1360
  #17 0x0000555555c47623 in do_handle_one_connection (connect=0x555557cf8ba0) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_connect.cc:1354
  #18 0x0000555555c473b0 in handle_one_connection (arg=0x555557cf8ba0) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_connect.cc:1260
  #19 0x00007ffff64fa6aa in start_thread (arg=0x7fffe6b8e700) at pthread_create.c:333
  #20 0x00007ffff5a02eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

The leader:

  Thread 11 (Thread 0x7fffe6bd8700 (LWP 31575)):
  #0  rocksdb::DBImpl::WriteImpl (this=0x555557bafc20, write_options=..., my_batch=0x7fffac020a60, callback=0x0, log_used=0x7fffac020878, log_ref=0, disable_memtable=true) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/db_impl_write.cc:319
  #1  0x00007ffff4b0a7cc in rocksdb::TransactionImpl::Prepare (this=0x7fffac020870) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/utilities/transactions/transaction_impl.cc:195
  #2  0x00007ffff47ec302 in myrocks::Rdb_transaction_impl::prepare (this=0x7fffac01f260, name="\000\000\000\000\000\000\000\001\030\000MySQLXid\f\000\000\000\000\000\000\000\a\000\000\000\000\000\000") at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2129
  #3  0x00007ffff47c1090 in myrocks::rocksdb_prepare (hton=0x555557b96e70, thd=0x7fffac000b00, prepare_tx=false) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2733
  #4  0x0000555555d70582 in prepare_or_error (ht=0x555557b96e70, thd=0x7fffac000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1146
  #5  0x0000555555d70e2d in ha_commit_trans (thd=0x7fffac000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1425
  #6  0x0000555555c5d5c7 in trans_commit_stmt (thd=0x7fffac000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/transaction.cc:510
  #7  0x0000555555b09c5b in mysql_execute_command (thd=0x7fffac000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:6261
  #8  0x0000555555b0e332 in mysql_parse (thd=0x7fffac000b00, rawbuf=0x7fffac011058 "INSERT INTO t1 (id, value) VALUES(NULL, 1)", length=42, parser_state=0x7fffe6bd7250, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:7886
  #9  0x0000555555afc33d in dispatch_command (command=COM_QUERY, thd=0x7fffac000b00, packet=0x7fffac008911 "INSERT INTO t1 (id, value) VALUES(NULL, 1)", packet_length=42, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:1812
  #10 0x0000555555afacb7 in do_command (thd=0x7fffac000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:1360
  #11 0x0000555555c47623 in do_handle_one_connection (connect=0x555557d86e60) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_connect.cc:1354
  #12 0x0000555555c473b0 in handle_one_connection (arg=0x555557d86e60) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_connect.cc:1260
  #13 0x00007ffff64fa6aa in start_thread (arg=0x7fffe6bd8700) at pthread_create.c:333
  #14 0x00007ffff5a02eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Comment by Sergei Petrunia [ 2017-09-07 ]

How about this explanation:

rocksdb_commit_ordered() is called for each transaction in a sequence (that is, SQL layer will not call it for transaction X+1 until the call for transaction X
has returned).

The code in rocksdb_commit_ordered does this:

  tx->set_sync(false);
  tx->commit_ordered_res= tx->commit();

the commit() call itself does not need a disk sync. However it gets into a write group with Prepare() calls that do.

Since the group does a sync, the commit() call will be delayed until that happens. This gives a total run time of #commits * sync_time.

Comment by Sergei Petrunia [ 2017-09-07 ]

EDIT how does that agree with

  AAA write_group.size=1
  AAA write_group.size=1
...

above?

Comment by Sergei Petrunia [ 2017-09-08 ]
  • Added logging for commit start and end timestamp
  • Ran a concurrent workload (20 clients) on a box with a rotating HDD.
  • total test run time 24.8 sec, time spent in commit calls is 25.6 sec.
  • Distribution of the time spent in commit:
Comment by Sergei Petrunia [ 2017-09-08 ]

Comment by Sergei Petrunia [ 2017-09-11 ]

The current code (unfinished, with the issues described above) is at https://github.com/MariaDB/server/tree/bb-10.2-mdev11934

Comment by Sergei Petrunia [ 2017-09-11 ]

Email thread dedicated to this topic:
https://lists.launchpad.net/maria-developers/msg10862.html
https://groups.google.com/forum/#!topic/myrocks-dev/-IgsKABECug

Comment by Sergei Petrunia [ 2017-09-14 ]

Found this in rocksdb/db/db_impl.h :

 
  WriteThread write_thread_;
  ..
  // The write thread when the writers have no memtable write. This will be used
  // in 2PC to batch the prepares separately from the serial commit.
  WriteThread nonmem_write_thread_;

but nonmem_write_thread_ does not seem to be used by the 2PC when the latter is used by MyRocks?

The above code was added in: https://github.com/facebook/rocksdb/commit/499ebb3ab5ea4207950fc95acf102b8f58add1c5

Comment by Sergei Petrunia [ 2017-10-17 ]

Merged the patch with the current code.
Commit cd7fa0fd62da36f8c6bb88e29fd23fcfd85e3991. It is pushed into bb-10.2-mariarocks branch.

Comment by Sergei Petrunia [ 2017-10-17 ]

Did some benchmarking

Comment by Sergei Petrunia [ 2017-10-17 ]

Looking at the results

  • MariaDB from September (before the latest MyRocks merge) gets no benefit from doing commits in multiple threads.
  • FB/MySQL and current MariaDB do get the benefit
  • * MariaDB is actually faster than FB/MySQL. I am not sure why.
  • * * It's a debug build, so if the slowdown was the CPU, it would be explainable
  • * * but the workload is disk-bound (I'm intentionally running this on a slow rotating disk) , so CPU speed should not have much effect.
Comment by Sergei Petrunia [ 2017-10-17 ]

Re-ran the benchmark while collecting more data:
MariaDB:

 TH CONCURRENCY QUERIES TIME QPS WSYNCS WSYNCED Binlog_bytes Binlog_commits Binlog_group_commits 
TR 4 5000 84.3612 59.2690 0 2500 1196120 5000 4720
TR 8 5000 43.4948 114.9563 0 1251 1201960 5000 2678
TR 16 5000 21.2343 235.4681 0 625 1201682 4992 1499
TR 32 5000 10.8093 462.5646 0 313 1202374 4992 854
TR 64 5000 5.4506 917.3302 0 157 1202630 4992 547
TR 128 5000 2.8370 1762.4251 0 80 1202902 4992 339

FB/MySQL-5.6

TH CONCURRENCY QUERIES TIME QPS WSYNCS WSYNCED Binlog_bytes Binlog_fsync 
TR 4 5000 169.1061 29.5672 2500 2500 995000 2500
TR 8 5000 85.9797 58.1533 1250 1251 995000 1250
TR 16 5000 43.1937 115.7576 624 625 993408 624
TR 32 5000 21.4033 233.6088 312 313 993408 312
TR 64 5000 10.5582 473.5656 156 157 993408 156
TR 128 5000 5.6467 885.4729 79 80 993408 79

Comment by Sergei Petrunia [ 2017-10-17 ]

File used for producing the above: psergey-test-scaling2.test

Comment by Sergei Petrunia [ 2017-10-23 ]

MDEV-14103 is about testing for this task

Comment by Sergei Petrunia [ 2017-10-27 ]

KB page https://mariadb.com/kb/en/library/myrocks-and-group-commit-with-binary-log/

Comment by Sergei Petrunia [ 2018-02-21 ]

Closing as this has been fixed and testing found no issues.

Generated at Thu Feb 08 07:53:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.