|
Debugging MariaDB/InnoDB with
sync_binlog=1, innodb_flush_logs_at_trx_commit=1 (default),
== Prepare ==
- innobase_xa_prepare() does the prepare operation and flushes the logs
(note: it seems there is an innodb-internal group flushing there.. there's a comment about it but didn't find the code so far)
== innobase_commit_ordered ==
- Fetch the binlog position from the SQL layer
- set trx->flush_log_later= true (that is, we don't flush here)
- trx_sys_update_mysql_binlog_offset ...
- Logical commit happened, but flushing the log didnt
- Sets trx->active_commit_ordered=1 to remember this
- Also sets trx->must_flush_log_later=true
== innobase_commit ==
- Check that we are just doing the final steps after innobase_commit_ordered.
- Call trx_commit_complete_for_mysql
- call trx_flush_log_if_needed, which flushes the log
== innobase_checkpoint_request ==
- This is only called on binlog rotation (that is, rarely)
|
|
So, MariaRocks could work like this:
- prepare_ordered() is not implemented (same as in InnoDB)
- prepare() does trx->prepare(). It will have to sync to disk.
- commit_ordered() would commit without writing the WAL.
- commit() (when called after commit_ordered) would just flush the WAL
- rocksdb_checkpoint_request would flush the WAL
But will this be performant? Reading MariaDB's handler.h:
The intention is that commit_ordered() should do the minimal amount of
|
work that needs to happen in consistent commit order among handlers. To
|
preserve ordering, calls need to be serialised on a global mutex, so
|
doing any time-consuming or blocking operations in commit_ordered() will
|
limit scalability.
|
|
Handlers can rely on commit_ordered() calls to be serialised (no two
|
calls can run in parallel, so no extra locking on the handler part is
|
required to ensure this).
|
So, if we flush the WAL in rocksdb_commit(), then requests to flush the WAL from different threads will go in parallel and hopefully that will be grouped (although it looks a bit inefficient).
However, the commit itself (making the changes visible to other transactions, etc) will not be done in parallel.
|
|
Another thing to note:
Note that commit_ordered() can be called from a different thread than the
|
one handling the transaction! So it can not do anything that depends on
|
thread local storage, in particular it can not call my_error() and
|
friends
|
Does RocksDB API allow this?
EDIT: most likely yes, because one can find this in facebook/mysql-5.6:
MYSQL_BIN_LOG::process_commit_stage_queue()
|
{
|
...
|
Thread_excursion excursion(thd);
|
for (THD *head= first ; head ; head = head->next_to_commit)
|
{
|
...
|
excursion.try_to_attach_to(head);
|
...
|
if (ha_commit_low(head, all, async))
|
Looks like rocksdb_commit() can be invoked from a different thread.
|
|
An interesting related issue: https://github.com/facebook/mysql-5.6/issues/474 :
But in MyRocks, commit() phase is much more expensive than InnoDB's commit() phase, because writing to MemTable happens there. If you run pmp/quickstack, you may consistently find only one thread is doing myrocks::Rdb_transaction::commit() and other threads are waiting at MYSQL_BIN_LOG::change_stage(). And you'll find commit() is calling rocksdb::MemTableInserter::PutCF() and other MemTable related functions. Serialized MemTable write calls is pretty inefficient.
So MariaDB has commit and commit_ordered. However, RocksDB does not provide an API call that would be suitable for calling from rocksdb_commit_ordered.
- rocksdb_commit_ordered() would call something that would just put the COMMIT record into WAL (without flushing)
- Then, rocksdb_commit would apply the transaction's MemTable contents into the global MemTable.
|
|
See also MDEV-11900 for the situation with 10.2 and InnoDB
|
|
Looking at how "Prepare" operation for the storage engine is persisted on disk.
In MariaDB, each Prepare operation tries to flush the logs itself :
#0 log_write_up_to (lsn=1908254, flush_to_disk=true) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/log/log0log.cc:1234
|
#1 0x00005555562f5c08 in trx_flush_log_if_needed_low (lsn=1908254) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/trx/trx0trx.cc:1600
|
#2 0x00005555562f5c51 in trx_flush_log_if_needed (lsn=1908254, trx=0x7fffe64c1878) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/trx/trx0trx.cc:1622
|
#3 0x00005555562f8fd2 in trx_prepare (trx=0x7fffe64c1878) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/trx/trx0trx.cc:2774
|
#4 0x00005555562f9084 in trx_prepare_for_mysql (trx=0x7fffe64c1878) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/trx/trx0trx.cc:2796
|
#5 0x000055555610cc78 in innobase_xa_prepare (hton=0x5555579371c0, thd=0x7fff78000b00, prepare_trx=false) at /home/psergey/dev-git/10.2-mariarocks/storage/innobase/handler/ha_innodb.cc:17869
|
#6 0x0000555555d7923c in prepare_or_error (ht=0x5555579371c0, thd=0x7fff78000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1146
|
#7 0x0000555555d79adf in ha_commit_trans (thd=0x7fff78000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1425
|
#8 0x0000555555c60990 in trans_commit_stmt (thd=0x7fff78000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/transaction.cc:510
|
#9 0x0000555555b0c3ef in mysql_execute_command (thd=0x7fff78000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:6261
|
#10 0x0000555555b10aa6 in mysql_parse (thd=0x7fff78000b00, rawbuf=0x7fff780111f8 "insert into t20 values (2)", length=26, parser_state=0x7ffff4463210, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:7886
|
Concurrent Prepare operations can get grouped (one Prepare will make a flush for many), this is achieved by relying on InnoDB's native group commit mechanism (which predates XA group commit):
See this comment in trx_prepare:
The idea in InnoDB's group prepare is that a group of
|
transactions gather behind a trx doing a physical disk write
|
to log files, and when that physical write has been completed,
|
one of those transactions does a write which prepares the whole
|
group. Note that this group prepare will only bring benefit if
|
there are > 2 users in the database. Then at least 2 users can
|
gather behind one doing the physical log write to disk.
|
In fb/mysql-5.6, it works differently.
InnoDB's Prepare calls thd_requested_durability() which returns HA_IGNORE_DURABILITY, and flushing is not done:
(gdb) wher
|
#0 thd_requested_durability (thd=0x2bb4cd0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/handler/ha_innodb.cc:1830
|
#1 0x000000000150c35a in trx_prepare (trx=0x7fffe0d01e60, async=0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/trx/trx0trx.cc:2365
|
#2 0x000000000150c421 in trx_prepare_for_mysql (trx=0x7fffe0d01e60, async=0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/trx/trx0trx.cc:2409
|
#3 0x000000000139bea8 in innobase_xa_prepare (hton=0x2949ad0, thd=0x2bb4cd0, prepare_trx=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/handler/ha_innodb.cc:15170
|
#4 0x0000000000d85aa5 in ha_prepare_low (thd=0x2bb4cd0, all=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:2264
|
#5 0x0000000001237ee0 in MYSQL_BIN_LOG::prepare (this=0x2744000 <mysql_bin_log>, thd=0x2bb4cd0, all=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/binlog.cc:7054
|
#6 0x0000000000d83c9c in ha_commit_trans (thd=0x2bb4cd0, all=false, async=false, ignore_global_read_lock=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:1540
|
#7 0x000000000106f0a3 in trans_commit_stmt (thd=0x2bb4cd0, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/transaction.cc:452
|
#8 0x0000000000f8cf3f in mysql_execute_command (thd=0x2bb4cd0, statement_start_time=0x7ffff430aaa8, post_parse=0x7ffff430ac00) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:6254
|
#9 0x0000000000f90913 in mysql_parse (thd=0x2bb4cd0, rawbuf=0x7fffa4005680 "insert into t21 values (4)", length=26, parser_state=0x7ffff430b4b0, last_timer=0x7ffff430ac00, async_commit=0x7ffff430abcc "") at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:7770
|
|
(gdb) fini
|
Run till exit from #0 thd_requested_durability (thd=0x2bb4cd0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/handler/ha_innodb.cc:1830
|
0x000000000150c35a in trx_prepare (trx=0x7fffe0d01e60, async=0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/storage/innobase/trx/trx0trx.cc:2365
|
Value returned is $17 = HA_IGNORE_DURABILITY
|
and then the SQL layer makes a special call to make the effects of all Prepare transactions persistent:
Breakpoint 3, ha_flush_logs (db_type=0x0, engine_map=0x7fffa40086b0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:2452
|
(gdb) wher
|
#0 ha_flush_logs (db_type=0x0, engine_map=0x7fffa40086b0) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:2452
|
#1 0x0000000001238999 in MYSQL_BIN_LOG::process_flush_stage_queue (this=0x2744000 <mysql_bin_log>, total_bytes_var=0x7ffff4308468, rotate_var=0x7ffff430844f, out_queue_var=0x7ffff4308470, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/binlog.cc:7322
|
#2 0x0000000001239b85 in MYSQL_BIN_LOG::ordered_commit (this=0x2744000 <mysql_bin_log>, thd=0x2bb4cd0, all=false, skip_commit=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/binlog.cc:7891
|
#3 0x00000000012385fd in MYSQL_BIN_LOG::commit (this=0x2744000 <mysql_bin_log>, thd=0x2bb4cd0, all=false, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/binlog.cc:7223
|
#4 0x0000000000d83cda in ha_commit_trans (thd=0x2bb4cd0, all=false, async=false, ignore_global_read_lock=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/handler.cc:1542
|
#5 0x000000000106f0a3 in trans_commit_stmt (thd=0x2bb4cd0, async=false) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/transaction.cc:452
|
#6 0x0000000000f8cf3f in mysql_execute_command (thd=0x2bb4cd0, statement_start_time=0x7ffff430aaa8, post_parse=0x7ffff430ac00) at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:6254
|
#7 0x0000000000f90913 in mysql_parse (thd=0x2bb4cd0, rawbuf=0x7fffa4005680 "insert into t21 values (4)", length=26, parser_state=0x7ffff430b4b0, last_timer=0x7ffff430ac00, async_commit=0x7ffff430abcc "") at /home/psergey/dev-git/mysql-5.6-rocksdb-look400/sql/sql_parse.cc:7770
|
|
|
Created a simple testcase and tried it on both InnoDB and MyRocks. InnoDB results were:
InnoDB
# THIS IS: --concurrency=50 --number-of-queries=10000, rotating disk
|
Innodb_data_fsyncs= 540
|
Binlog_commits= 10000
|
Binlog_group_commits= 2223
|
# This is --concurrency=50 --number-of-queries=10000, --mem
|
Innodb_data_fsyncs = 9912
|
Binlog_commits = 10000
|
Binlog_group_commits= 9851
|
# THIS IS: --concurrency=50 --number-of-queries=10000, aws, ssd
|
Innodb_data_fsyncs 590
|
Binlog_commits 10000
|
Binlog_group_commits 4559
|
# THIS IS: --concurrency=50 --number-of-queries=10000, aws, memory
|
Innodb_data_fsyncs 9075 (the value is around 9K)
|
Binlog_commits 10000
|
Binlog_group_commits 2517 (the value is 2K -4K)
|
- I am not sure why running with --mem makes Binlog_group_commits be ~2K on aws host and ~9.8K on my desktop. (Both numbers are stable and do not fluctuate).
- There's more grouping on rotating disk than on SSD, makes sense.
|
|
Trying the same with MyRocks and experimental group commit patch :
# THIS IS: --concurrency=50 --number-of-queries=10000, home, --mem
|
ROCKSDB_WAL_GROUP_SYNCS 0
|
ROCKSDB_WAL_SYNCED 8247 (aws, --mem: 7K)
|
Binlog_commits 10000
|
Binlog_group_commits 726 (aws, --mem: 800)
|
# THIS IS: --concurrency=50 --number-of-queries=10000, home, rotating disk
|
ROCKSDB_WAL_GROUP_SYNCS 0
|
ROCKSDB_WAL_SYNCED 8540
|
Binlog_commits 10000
|
Binlog_group_commits 652
|
# This takes ~300 seconds.
|
# THIS IS: --concurrency=50 --number-of-queries=10000, aws, ssd
|
ROCKSDB_WAL_GROUP_SYNCS 0
|
ROCKSDB_WAL_SYNCED 6994
|
Binlog_commits 10000
|
Binlog_group_commits 659
|
- ROCKSDB_WAL_GROUP_SYNCS=0 is expected. Look in the previous comments for " the SQL layer makes a special call to make the effects of all Prepare transactions persistent" and how that call is lacking in MariaDB.
- The values of Binlog_group_commits show there's some amount of grouping
- ROCKSDB_WAL_SYNCED is too high for both on-disk workloads.
- Are these real sync operations or just requests to sync the WAL? (That is, if two Transaction::Prepare() calls issue requests to sync the WAL, and one of them happens to sync it for both, will ROCKSDB_WAL_SYNCED be incremented by 1 or by 2?)
- Execution time for rotating disk is too high?
|
|
_b.test.innodb _b.test.myrocks
|
|
Probably related: MDEV-11937
|
|
_c.test is a testcase that
- runs a serial workload
- runs a parallel workload
the workload is run with {$engine}_flush_log_at_trx_commit=1, log_bin=0.
the results do make sense:
aws ssd:
For --concurrency=1 --number-of-queries=10000
|
Innodb_os_log_fsyncs=10K
|
time: 12 sec (1K syncs/sec)
|
|
For --concurrency=50 --number-of-queries=100000
|
Innodb_os_log_fsyncs=4735
|
time: 7 sec (676 syncs/sec.)
|
|
For --concurrency=1 --number-of-queries=10000
|
ROCKSDB_WAL_SYNCED: 10000
|
time: 25 sec (400 syncs/sec)
|
|
For --concurrency=50 --number-of-queries=100000
|
ROCKSDB_WAL_SYNCED: 3952
|
time: 9 sec (439 syncs/sec)
|
rotating desktop hdd:
For --concurrency=1 --number-of-queries=1000
|
Innodb_os_log_fsyncs: 1005
|
time: 30 sec (33 syncs/sec)
|
|
For --concurrency=50 --number-of-queries=10000
|
Innodb_os_log_fsyncs: 450
|
time: 16 sec (28 syncs/sec).
|
|
For --concurrency=1 --number-of-queries=1000
|
ROCKSDB_WAL_SYNCED: 1000
|
time: 36 sec (27.7 syncs /sec)
|
|
For --concurrency=50 --number-of-queries=10000
|
ROCKSDB_WAL_SYNCED: 354
|
time: 13 sec (27.2 syncs/ sec)
|
The numbers are fairly stable and consistent across the InnoDB and MyRocks.
|
|
Now, trying InnoDB with binlog enabled
Rotating hdd, log_bin=1, sync_binlog=1, innodb_flush_logs_at_trx_commit=1
For--concurrency=1 --number-of-queries=1000
|
Innodb_os_log_fsyncs 1006
|
Binlog_commits 1000
|
Binlog_group_commits 1000
|
time: 30 sec
|
- Both InnoDB and binlog manage to do 30 commits/sec.
- The interesting part is that they manage to do this together.
For --concurrency=50 --number-of-queries=10000
|
Innodb_os_log_fsyncs 473
|
Binlog_commits 10000
|
Binlog_group_commits 2246
|
time: 16 sec
|
- Binlog commit numbers do not look realistic: 2246/16=140 commits/sec ?
- InnoDB's fsyncs agree with previous observations: 473/16 = 29.6 commits/sec
AWS ssd, log_bin=1, sync_binlog=1, innodb_flush_logs_at_trx_commit=1
# THIS IS: --concurrency=1 --number-of-queries=10000
|
Innodb_os_log_fsyncs 10008
|
Binlog_commits 10000
|
Binlog_group_commits 10000
|
time: 14 sec
|
- Both InnoDB and binlog do 714 commits/sec.
# THIS IS: --concurrency=50 --number-of-queries=100000
|
Innodb_os_log_fsyncs 4992
|
Binlog_commits 100000
|
Binlog_group_commits 47449
|
time: 7 sec.
|
- InnoDB does 713 commits/sec (agrees with previous observations)
- Binlog does 6778 commits/sec (looks too high also?)
|
|
Trying the current MyRocks+MariaDB group commit patch
Rotating hdd
# THIS IS: --concurrency=1 --number-of-queries=1000
|
ROCKSDB_WAL_SYNCED 1000
|
Binlog_commits 1000
|
time: 40 sec
|
25 syncs/sec, slightly lower than without the binlog
# THIS IS: --concurrency=50 --number-of-queries=10000
|
ROCKSDB_WAL_SYNCED 7647
|
Binlog_commits 10000
|
Binlog_group_commits 651
|
time: 293 sec
|
- This is very slow
- RocksDB commit rate was 7647/293.=26 rocksdb commits/sec. which looks like no commit grouping.
aws ssd
# THIS IS: --concurrency=1 --number-of-queries=10000
|
ROCKSDB_WAL_SYNCED 10000
|
Binlog_commits 10000
|
Binlog_group_commits 10000
|
time: 25 sec
|
this gives 400 commits/sec for both binlog and RocksDB. this matches previous numbers we've got for this SSD
# THIS IS: --concurrency=50 --number-of-queries=100000
|
ROCKSDB_WAL_SYNCED 65737
|
Binlog_commits 100000
|
Binlog_group_commits 6535
|
time: 153 sec
|
this is 429 WAL syncs/sec. Again looks as if WAL sync operations were not grouped at all?
|
|
Trying to figure out
- if indeed Transaction::Prepare operations do not use RocksDB's internal Group Commit for some reason
- If the above is true, why.
- Found DBOptions::concurrent_prepare parameter. It is off (by default)
- perphaps, I should have it ON (update: this doesn't seem to be true)
- attempt to switch it ON caused mysqld to fail an assertion on start
- Debugged (with concurrent_prepare=off) to the point where PREPARE command
does sync:
(gdb) wher
|
#0 rocksdb::PosixWritableFile::Sync (this=0x555557bbef20) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/env/io_posix.cc:849
|
#1 0x00007ffff4aab9f7 in rocksdb::WritableFileWriter::SyncInternal (this=0x555557bb81b0, use_fsync=false) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/util/file_reader_writer.cc:352
|
#2 0x00007ffff4aab568 in rocksdb::WritableFileWriter::Sync (this=0x555557bb81b0, use_fsync=false) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/util/file_reader_writer.cc:323
|
#3 0x00007ffff48db9fd in rocksdb::DBImpl::WriteToWAL (this=0x555557baee50, write_group=..., log_writer=0x555557bbed90, log_used=0x7fffb4025f58, need_log_sync=true, need_log_dir_sync=false, sequence=10) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/db_impl_write.cc:719
|
#4 0x00007ffff48d8eb7 in rocksdb::DBImpl::WriteImpl (this=0x555557baee50, write_options=..., my_batch=0x7fffb4026140, callback=0x0, log_used=0x7fffb4025f58, log_ref=0, disable_memtable=true) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/db_impl_write.cc:234
|
#5 0x00007ffff4b0a6ba in rocksdb::TransactionImpl::Prepare (this=0x7fffb4025f50) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/utilities/transactions/transaction_impl.cc:195
|
#6 0x00007ffff47ec27a in myrocks::Rdb_transaction_impl::prepare (this=0x7fffb4024940, name="\000\000\000\000\000\000\000\001\030\000MySQLXid\f\000\000\000\000\000\000\000\004\000\000\000\000\000\000") at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2129
|
#7 0x00007ffff47c1030 in myrocks::rocksdb_prepare (hton=0x555557b96e70, thd=0x7fffb4000b00, prepare_tx=false) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2733
|
#8 0x0000555555d70582 in prepare_or_error (ht=0x555557b96e70, thd=0x7fffb4000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1146
|
#9 0x0000555555d70e2d in ha_commit_trans (thd=0x7fffb4000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1425
|
#10 0x0000555555c5d5c7 in trans_commit_stmt (thd=0x7fffb4000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/transaction.cc:510
|
#11 0x0000555555b09c5b in mysql_execute_command (thd=0x7fffb4000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:6261
|
#12 0x0000555555b0e332 in mysql_parse (thd=0x7fffb4000b00, rawbuf=0x7fffb40110f8 "insert into t32 values (101)", length=28, parser_state=0x7fffe6bd7250, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:7886
|
DBImpl::WriteImpl has some grouping logic but I am not sure what grouping applies to:
- just putting new data into the MemTable
- or syncing the WAL also?
|
|
Re concurrent_prepare: the assertion I am hitting is this one in DBImpl::WriteImpl
assert(!need_log_sync && !need_log_dir_sync);
|
I have
|
need_log_sync= true
|
need_log_dir_sync=true
|
these come from
bool need_log_sync = !write_options.disableWAL && write_options.sync;
|
bool need_log_dir_sync = need_log_sync && !log_dir_synced_;
|
It seems unlikely that MyRocks should operate with need_log_sync=off.
|
|
test-rocksdb-gcommit.tgz Wrote a test program that uses RocksDB to perform:
- concurrent writes with commits
- concurrent "Prepare( sync=true|false); commit (sync=true|false)"
AWS ssd:
sync_prepare=false
|
sync_commit=true
|
using_2pc=false
|
TH trx n_threads time tps
|
TR 1000 2 4.513832 443.082538
|
TR 1000 4 4.754442 841.318433
|
TR 1000 8 4.881946 1638.690906
|
TR 1000 16 5.181725 3087.774675
|
TR 1000 32 5.734200 5580.551382
|
TR 1000 64 5.817391 11001.495775
|
TR 1000 128 8.551488 14968.155234
|
TR 1000 256 12.689465 20174.215468
|
TR 1000 512 22.345159 22913.239996
|
sync_prepare=true
|
sync_commit=false
|
using_2pc=true
|
TH trx n_threads time tps
|
TR 1000 1 2.109577 474.028568
|
TR 1000 2 4.606521 434.167139
|
TR 1000 4 5.010855 798.266883
|
TR 1000 8 6.948192 1151.378641
|
TR 1000 16 7.309022 2189.075393
|
TR 1000 32 8.835688 3621.676118
|
TR 1000 64 10.030643 6380.448559
|
TR 1000 128 12.595573 10162.300453
|
TR 1000 256 18.653547 13723.931357
|
TR 1000 512 29.878734 17135.933763
|
Rotating HDD:
|
sync_prepare=true
|
sync_commit=false
|
using_2pc=true
|
TH trx n_threads time tps
|
TR 1000 1 31.522334 31.723539
|
TR 1000 2 63.834870 31.330839
|
TR 1000 4 76.781369 52.095972
|
TR 1000 8 110.673415 72.284749
|
TR 1000 16 116.457815 137.388804
|
TR 1000 32 98.671802 324.307446
|
TR 1000 64 125.448014 510.171488
|
TR 1000 128 123.296987 1038.143781
|
TR 1000 256 123.317320 2075.945219
|
TR 1000 512 155.731128 3287.717785
|
So, RocksDB itself seems to be perfectly capable of providing group commit for Prepare(sync=true)/Commit(sync=false) transactions.
|
|
Running the same parallel workload on fb-mysql and MariaDB, with and without binlog. Noting counter values:
fb/mysql, no-binlog vs binlog:
@@ -40,8 +36,8 @@
|
-| rocksdb_memtable_total | 44704 |
|
-| rocksdb_memtable_unflushed | 44704 |
|
+| rocksdb_memtable_total | 95096 |
|
+| rocksdb_memtable_unflushed | 95096 |
|
@@ -59,7 +55,7 @@
|
-| rocksdb_bytes_written | 27561 |
|
+| rocksdb_bytes_written | 210849 |
|
@@ -79,7 +75,7 @@
|
-| rocksdb_number_keys_written | 1005 |
|
+| rocksdb_number_keys_written | 2005 |
|
@@ -97,13 +93,13 @@
|
-| rocksdb_wal_bytes | 27561 |
|
-| rocksdb_wal_group_syncs | 0 |
|
+| rocksdb_wal_bytes | 149849 |
|
+| rocksdb_wal_group_syncs | 200 |
|
| rocksdb_wal_synced | 204 |
|
-| rocksdb_write_other | 800 |
|
-| rocksdb_write_self | 204 |
|
+| rocksdb_write_other | 26 |
|
+| rocksdb_write_self | 1978 |
|
| rocksdb_write_timedout | 0 |
|
-| rocksdb_write_wal | 2008 |
|
+| rocksdb_write_wal | 4008 |
|
MariaDB-gcommit-patch, no-binlog vs binlog:
@@ -28,8 +37,8 @@
|
-| Rocksdb_memtable_total | 44584 |
|
-| Rocksdb_memtable_unflushed | 44584 |
|
+| Rocksdb_memtable_total | 95280 |
|
+| Rocksdb_memtable_unflushed | 95280 |
|
@@ -47,7 +56,7 @@
|
-| Rocksdb_bytes_written | 27477 |
|
+| Rocksdb_bytes_written | 206333 |
|
@@ -65,7 +74,7 @@
|
-| Rocksdb_number_keys_written | 1005 |
|
+| Rocksdb_number_keys_written | 2005 |
|
@@ -83,12 +92,13 @@
|
-| Rocksdb_wal_bytes | 27477 |
|
+| Rocksdb_wal_bytes | 145333 |
|
| Rocksdb_wal_group_syncs | 0 |
|
-| Rocksdb_wal_synced | 197 |
|
-| Rocksdb_write_other | 807 |
|
-| Rocksdb_write_self | 197 |
|
+| Rocksdb_wal_synced | 518 |
|
+| Rocksdb_write_other | 569 |
|
+| Rocksdb_write_self | 1435 |
|
| Rocksdb_write_timedout | 0 |
|
-| Rocksdb_write_wal | 2008 |
|
+| Rocksdb_write_wal | 4008 |
|
|
|
Most counter values are similar across MySQL and MariaDB.
- MariaDB has rocksdb_wal_group_syncs=0 in both cases, this is expected because there's no hton->flush() call.
- There is a difference in rocksdb_wal_synced:
MySQL has:
| rocksdb_wal_synced | 204 |
|
MariaDB:
-| Rocksdb_wal_synced | 197 |
|
+| Rocksdb_wal_synced | 518 |
|
This agrees with the test execution time.
Experiments with test-gcommit.cc show that WAL_FILE_SYNCED grows at a rate of ~30/sec (with total TPS being 50, 200 or 500 for 5, 20, and 50 threads in the run, respectively)
That is, Rocksdb_wal_synced is "grouped commits".
It remains unclear why MariaDB+Binlog makes much more syncs than all other variants.
|
|
Added a printout in db_impl_write.cc, printing write_group.size right after
this call:
// Add to log and apply to memtable. We can release the lock
|
// during this phase since &w is currently responsible for logging
|
// and protects against concurrent loggers and concurrent writes
|
// into memtables
|
|
last_batch_group_size_ =
|
write_thread_.EnterAsBatchGroupLeader(&w, &write_group);
|
Trying with mysqlslap --concurrency=20.
With log_bin=off, I get:
AAA write_group.size=15
|
AAA write_group.size=5
|
AAA write_group.size=15
|
AAA write_group.size=15
|
AAA write_group.size=5
|
AAA write_group.size=15
|
AAA write_group.size=5
|
AAA write_group.size=15
|
AAA write_group.size=9
|
AAA write_group.size=11
|
AAA write_group.size=9
|
AAA write_group.size=17
|
with log_bin=ON, I get this:
AAA write_group.size=1
|
AAA write_group.size=4
|
AAA write_group.size=1
|
AAA write_group.size=1
|
AAA write_group.size=1
|
AAA write_group.size=1
|
AAA write_group.size=1
|
AAA write_group.size=1
|
AAA write_group.size=1
|
AAA write_group.size=1
|
AAA write_group.size=1
|
AAA write_group.size=1
|
AAA write_group.size=1
|
|
|
Is there something that prevents Prepare() calls from grouping? I've used gdb and manually had two Prepare calls enter a group, and the leader flushed the WAL file for both:
Follower:
Thread 12 (Thread 0x7fffe6b8e700 (LWP 31582)):
|
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
|
#1 0x00007ffff629ad1c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
|
#2 0x00007ffff49d9e73 in std::condition_variable::wait<rocksdb::WriteThread::BlockingAwaitState(rocksdb::WriteThread::Writer*, uint8_t)::<lambda()> >(std::unique_lock<std::mutex> &, rocksdb::WriteThread::<lambda()>) (this=0x7fffe6b8c080, __lock=..., __p=...) at /usr/include/c++/4.9/condition_variable:98
|
#3 0x00007ffff49d803e in rocksdb::WriteThread::BlockingAwaitState (this=0x555557bb0360, w=0x7fffe6b8bff0, goal_mask=30 '\036') at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/write_thread.cc:46
|
#4 0x00007ffff49d83d8 in rocksdb::WriteThread::AwaitState (this=0x555557bb0360, w=0x7fffe6b8bff0, goal_mask=30 '\036', ctx=0x7ffff50f59a0 <rocksdb::WriteThread::JoinBatchGroup(rocksdb::WriteThread::Writer*)::ctx>) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/write_thread.cc:166
|
#5 0x00007ffff49d8c51 in rocksdb::WriteThread::JoinBatchGroup (this=0x555557bb0360, w=0x7fffe6b8bff0) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/write_thread.cc:299
|
#6 0x00007ffff48d866a in rocksdb::DBImpl::WriteImpl (this=0x555557bafc20, write_options=..., my_batch=0x7fffb001a040, callback=0x0, log_used=0x7fffb0019e58, log_ref=0, disable_memtable=true) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/db_impl_write.cc:102
|
#7 0x00007ffff4b0a7cc in rocksdb::TransactionImpl::Prepare (this=0x7fffb0019e50) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/utilities/transactions/transaction_impl.cc:195
|
#8 0x00007ffff47ec302 in myrocks::Rdb_transaction_impl::prepare (this=0x7fffb0019a00, name="\000\000\000\000\000\000\000\001\030\000MySQLXid\f\000\000\000\000\000\000\000\b\000\000\000\000\000\000") at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2129
|
#9 0x00007ffff47c1090 in myrocks::rocksdb_prepare (hton=0x555557b96e70, thd=0x7fffb0000b00, prepare_tx=false) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2733
|
#10 0x0000555555d70582 in prepare_or_error (ht=0x555557b96e70, thd=0x7fffb0000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1146
|
#11 0x0000555555d70e2d in ha_commit_trans (thd=0x7fffb0000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1425
|
#12 0x0000555555c5d5c7 in trans_commit_stmt (thd=0x7fffb0000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/transaction.cc:510
|
#13 0x0000555555b09c5b in mysql_execute_command (thd=0x7fffb0000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:6261
|
#14 0x0000555555b0e332 in mysql_parse (thd=0x7fffb0000b00, rawbuf=0x7fffb0011058 "INSERT INTO t1 (id, value) VALUES(NULL, 1)", length=42, parser_state=0x7fffe6b8d250, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:7886
|
#15 0x0000555555afc33d in dispatch_command (command=COM_QUERY, thd=0x7fffb0000b00, packet=0x7fffb0008911 "INSERT INTO t1 (id, value) VALUES(NULL, 1)", packet_length=42, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:1812
|
#16 0x0000555555afacb7 in do_command (thd=0x7fffb0000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:1360
|
#17 0x0000555555c47623 in do_handle_one_connection (connect=0x555557cf8ba0) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_connect.cc:1354
|
#18 0x0000555555c473b0 in handle_one_connection (arg=0x555557cf8ba0) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_connect.cc:1260
|
#19 0x00007ffff64fa6aa in start_thread (arg=0x7fffe6b8e700) at pthread_create.c:333
|
#20 0x00007ffff5a02eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
|
The leader:
Thread 11 (Thread 0x7fffe6bd8700 (LWP 31575)):
|
#0 rocksdb::DBImpl::WriteImpl (this=0x555557bafc20, write_options=..., my_batch=0x7fffac020a60, callback=0x0, log_used=0x7fffac020878, log_ref=0, disable_memtable=true) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/db/db_impl_write.cc:319
|
#1 0x00007ffff4b0a7cc in rocksdb::TransactionImpl::Prepare (this=0x7fffac020870) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/rocksdb/utilities/transactions/transaction_impl.cc:195
|
#2 0x00007ffff47ec302 in myrocks::Rdb_transaction_impl::prepare (this=0x7fffac01f260, name="\000\000\000\000\000\000\000\001\030\000MySQLXid\f\000\000\000\000\000\000\000\a\000\000\000\000\000\000") at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2129
|
#3 0x00007ffff47c1090 in myrocks::rocksdb_prepare (hton=0x555557b96e70, thd=0x7fffac000b00, prepare_tx=false) at /home/psergey/dev-git/10.2-mariarocks/storage/rocksdb/ha_rocksdb.cc:2733
|
#4 0x0000555555d70582 in prepare_or_error (ht=0x555557b96e70, thd=0x7fffac000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1146
|
#5 0x0000555555d70e2d in ha_commit_trans (thd=0x7fffac000b00, all=false) at /home/psergey/dev-git/10.2-mariarocks/sql/handler.cc:1425
|
#6 0x0000555555c5d5c7 in trans_commit_stmt (thd=0x7fffac000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/transaction.cc:510
|
#7 0x0000555555b09c5b in mysql_execute_command (thd=0x7fffac000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:6261
|
#8 0x0000555555b0e332 in mysql_parse (thd=0x7fffac000b00, rawbuf=0x7fffac011058 "INSERT INTO t1 (id, value) VALUES(NULL, 1)", length=42, parser_state=0x7fffe6bd7250, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:7886
|
#9 0x0000555555afc33d in dispatch_command (command=COM_QUERY, thd=0x7fffac000b00, packet=0x7fffac008911 "INSERT INTO t1 (id, value) VALUES(NULL, 1)", packet_length=42, is_com_multi=false, is_next_command=false) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:1812
|
#10 0x0000555555afacb7 in do_command (thd=0x7fffac000b00) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_parse.cc:1360
|
#11 0x0000555555c47623 in do_handle_one_connection (connect=0x555557d86e60) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_connect.cc:1354
|
#12 0x0000555555c473b0 in handle_one_connection (arg=0x555557d86e60) at /home/psergey/dev-git/10.2-mariarocks/sql/sql_connect.cc:1260
|
#13 0x00007ffff64fa6aa in start_thread (arg=0x7fffe6bd8700) at pthread_create.c:333
|
#14 0x00007ffff5a02eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
|
|
|
How about this explanation:
rocksdb_commit_ordered() is called for each transaction in a sequence (that is, SQL layer will not call it for transaction X+1 until the call for transaction X
has returned).
The code in rocksdb_commit_ordered does this:
tx->set_sync(false);
|
tx->commit_ordered_res= tx->commit();
|
the commit() call itself does not need a disk sync. However it gets into a write group with Prepare() calls that do.
Since the group does a sync, the commit() call will be delayed until that happens. This gives a total run time of #commits * sync_time.
|
|
EDIT how does that agree with
AAA write_group.size=1
|
AAA write_group.size=1
|
...
|
above?
|
- Added logging for commit start and end timestamp
- Ran a concurrent workload (20 clients) on a box with a rotating HDD.
- total test run time 24.8 sec, time spent in commit calls is 25.6 sec.
- Distribution of the time spent in commit:
|
|
|
|
The current code (unfinished, with the issues described above) is at https://github.com/MariaDB/server/tree/bb-10.2-mdev11934
|
|
Email thread dedicated to this topic:
https://lists.launchpad.net/maria-developers/msg10862.html
https://groups.google.com/forum/#!topic/myrocks-dev/-IgsKABECug
|
|
Found this in rocksdb/db/db_impl.h :
|
WriteThread write_thread_;
|
..
|
// The write thread when the writers have no memtable write. This will be used
|
// in 2PC to batch the prepares separately from the serial commit.
|
WriteThread nonmem_write_thread_;
|
but nonmem_write_thread_ does not seem to be used by the 2PC when the latter is used by MyRocks?
The above code was added in: https://github.com/facebook/rocksdb/commit/499ebb3ab5ea4207950fc95acf102b8f58add1c5
|
|
Merged the patch with the current code.
Commit cd7fa0fd62da36f8c6bb88e29fd23fcfd85e3991. It is pushed into bb-10.2-mariarocks branch.
|
|
Did some benchmarking
|
|
Looking at the results
- MariaDB from September (before the latest MyRocks merge) gets no benefit from doing commits in multiple threads.
- FB/MySQL and current MariaDB do get the benefit
- * MariaDB is actually faster than FB/MySQL. I am not sure why.
- * * It's a debug build, so if the slowdown was the CPU, it would be explainable
- * * but the workload is disk-bound (I'm intentionally running this on a slow rotating disk) , so CPU speed should not have much effect.
|
|
Re-ran the benchmark while collecting more data:
MariaDB:
TH CONCURRENCY QUERIES TIME QPS WSYNCS WSYNCED Binlog_bytes Binlog_commits Binlog_group_commits
|
TR 4 5000 84.3612 59.2690 0 2500 1196120 5000 4720
|
TR 8 5000 43.4948 114.9563 0 1251 1201960 5000 2678
|
TR 16 5000 21.2343 235.4681 0 625 1201682 4992 1499
|
TR 32 5000 10.8093 462.5646 0 313 1202374 4992 854
|
TR 64 5000 5.4506 917.3302 0 157 1202630 4992 547
|
TR 128 5000 2.8370 1762.4251 0 80 1202902 4992 339
|
FB/MySQL-5.6
TH CONCURRENCY QUERIES TIME QPS WSYNCS WSYNCED Binlog_bytes Binlog_fsync
|
TR 4 5000 169.1061 29.5672 2500 2500 995000 2500
|
TR 8 5000 85.9797 58.1533 1250 1251 995000 1250
|
TR 16 5000 43.1937 115.7576 624 625 993408 624
|
TR 32 5000 21.4033 233.6088 312 313 993408 312
|
TR 64 5000 10.5582 473.5656 156 157 993408 156
|
TR 128 5000 5.6467 885.4729 79 80 993408 79
|
|
|
File used for producing the above: psergey-test-scaling2.test
|
|
MDEV-14103 is about testing for this task
|
|
KB page https://mariadb.com/kb/en/library/myrocks-and-group-commit-with-binary-log/
|
|
Closing as this has been fixed and testing found no issues.
|