[MDEV-18080] Run MyRocks benchmark: MariaDB vs Percona Server vs FB/MySQL Created: 2018-12-25  Updated: 2019-03-28

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Fix Version/s: None

Type: Task Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: None

Attachments: image-2018-12-26-12-14-32-900.png, out-mariadb-10.3-rocksdb.log, out-percona-5.7-rocksdb.log, run-sysbench.sh, screenshot-1.png, screenshot-2.png, screenshot-3.png, setup-mariadb-current.sh, setup-os-ubuntu.sh, setup-percona-current.sh, setup-sysbench-ubuntu.sh
Issue Links:
Relates
relates to MDEV-13845 benchmark RocksDB engine Closed
relates to MDEV-17261 sysbench oltp read only too slow for ... Closed
relates to MDEV-15372 Parallel slave speedup very limited w... Closed

 Description   

I used an AWS c5.2xlarge instance with a 50 GB EBS SSD volume (150 IOPS). The scripts to set up the servers and sysbench, and to run the benchmark, are attached. (One only needs to edit my.cnf and start the servers.)

Servers:

  • MariaDB 10.3 current, revision 2999492c3278528ceb9f37bd6cfca5ca5295ef9a
  • Percona Server 5.7 current, revision 6604e02a4ae73a8d542ba70e71ad91f2af4514cb
  • Facebook/MySQL-5.6 current, revision 5e398eab68dbf58312ab1544f0e42084552967e1

Settings that were added to my.cnf:
MariaDB:

log_bin=1
rocksdb_block_cache_size=2G
binlog_format=row
sync_binlog=1

Percona Server:

rocksdb_block_cache_size=2G
log_bin=1

Facebook/MySQL 5.6:

log-bin=pslp
binlog-format=row
sync_binlog=1
rocksdb_block_cache_size=2G

Sysbench prepare and run commands:

sysbench /usr/share/sysbench/oltp_update_non_index.lua \
  --table-size=1000000 \
  --threads=$threads \
  --time=60 \
  --rand-type=uniform \
  --db-driver=mysql \
  --mysql-socket=/tmp/mysql20.sock \
  --mysql-user=root \
  --mysql_storage_engine=$engine \
  prepare

sysbench /usr/share/sysbench/oltp_update_non_index.lua \
  --table-size=1000000 \
  --threads=$threads \
  --time=60 \
  --rand-type=uniform \
  --db-driver=mysql \
  --mysql-socket=/tmp/mysql20.sock \
  --mysql-user=root \
  --mysql_storage_engine=$engine \
  run 
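
The two commands above differ only in the final action and in $threads, so the command lines for a whole sweep can be generated. This is a minimal sketch, not the attached run-sysbench.sh: the helper name is made up, the thread counts are the ones from the results below, and "rocksdb" as the value of $engine is an assumption.

```python
import shlex

SCRIPT = "/usr/share/sysbench/oltp_update_non_index.lua"

def sysbench_cmd(action, threads, engine):
    """Build the sysbench argv used in this issue for one action and thread count."""
    return [
        "sysbench", SCRIPT,
        "--table-size=1000000",
        f"--threads={threads}",
        "--time=60",
        "--rand-type=uniform",
        "--db-driver=mysql",
        "--mysql-socket=/tmp/mysql20.sock",
        "--mysql-user=root",
        f"--mysql_storage_engine={engine}",
        action,  # "prepare" or "run"
    ]

# Print one "run" command line per thread count (engine name is assumed).
for n in (20, 50, 80, 100, 150):
    print(shlex.join(sysbench_cmd("run", n, "rocksdb")))
```

The sketch only prints the commands; whether the tables are re-created before each run is not stated here and is left to the attached script.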

Results:

Percona 5.7

n_threads, qps
 20,  4117.79
 50,  9487.79
 80, 13952.85
100, 16852.61
150, 21942.59

MariaDB 10.3

n_threads, qps
 20,  3125.01
 50,  7494.81
 80, 11821.79
100, 14749.30
150, 20313.95

FB/MySQL-5.6

n_threads, qps
 20,  3291.02
 50,  7711.92
 80, 11394.20
100, 13300.78
150, 18795.42



 Comments   
Comment by Sergei Petrunia [ 2018-12-25 ]

n_threads	Percona 5.7	MariaDB 10.3	FB/MySQL-5.6
20	4117.79	3125.01	3291.02
50	9487.79	7494.81	7711.92
80	13952.85	11821.79	11394.20
100	16852.61	14749.30	13300.78
150	21942.59	20313.95	18795.42

Comment by Sergei Petrunia [ 2018-12-25 ]

So, MariaDB is slower than Percona Server: Percona's lead starts at about 32% with 20 threads and shrinks to about 8% with 150 threads. I'm not sure about the cause.
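
For reference, the size of the gap at each thread count can be computed directly from the table above. A small sketch (the variable and function names are mine, the qps figures are copied from the table):

```python
# qps figures copied from the results table in this issue (binlog enabled).
threads = [20, 50, 80, 100, 150]
percona = [4117.79, 9487.79, 13952.85, 16852.61, 21942.59]
mariadb = [3125.01, 7494.81, 11821.79, 14749.30, 20313.95]

def lead_pct(faster, slower):
    """By how many percent `faster` exceeds `slower`."""
    return (faster / slower - 1.0) * 100.0

leads = [lead_pct(p, m) for p, m in zip(percona, mariadb)]
for n, lead in zip(threads, leads):
    print(f"{n:>3} threads: Percona is {lead:.1f}% ahead of MariaDB")
```

The lead shrinks monotonically as the thread count grows.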

Comment by Sergei Petrunia [ 2018-12-26 ]

Added current upstream FB/MySQL-5.6 to the chart.

Comment by Sergei Petrunia [ 2018-12-26 ]

Results with log-bin=0:

Percona-5.7

 20,  6574.14
 50, 15140.14
 80, 20138.38
100, 23784.30
150, 34138.96

MariaDB-10.3

 20,  665.07
 50,  641.08
 80,  626.83
100,  607.69
150,  654.11

FBMySQL-5.6

 20,  6340.32
 50, 14219.20
 80, 19054.88
100, 23190.26
150, 30586.72

Comment by Sergei Petrunia [ 2018-12-26 ]

MariaDB is much slower here: roughly 600-670 qps at every thread count, against 6,300-34,000 qps for the other two servers.

This is obviously a bug and should not be happening.

Comment by Sergei Petrunia [ 2018-12-27 ]

Ok, the problem is in non-XA mode. In this mode, rocksdb_prepare and rocksdb_commit_ordered are not called. Only the rocksdb_commit() call is made.

In that call, MyRocks is expected to commit the transaction and make its changes persistent.

After the latest fixes (for XA mode), it does so like this:

  tx->set_sync(false);
  tx->commit();
  rdb->FlushWAL(true);

Looking at the DBImpl::FlushWAL() code, I see that it doesn't participate in RocksDB's group commit (or rather, group flush). It attempts a flush call followed by an fsync/fdatasync call on its own.

This is consistent with the observed performance.

(RocksDB's PessimisticTransaction::Commit() will eventually call DBImpl::FlushWAL(), but it does so inside its group commit implementation, so only one call is made for all transactions that are committing concurrently.)
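
A back-of-envelope model illustrates why a private sync per commit would give flat throughput. The assumptions are mine and were not measured: syncs on the WAL file serialize, each takes a fixed latency, and all other costs are ignored. The latency value is hypothetical, picked so that 1/latency is close to the ~650 qps seen for MariaDB with the binlog off.

```python
def qps_private_sync(n_threads, sync_latency_s):
    """Each commit issues its own sync; syncs are assumed to serialize,
    so adding threads does not add throughput."""
    return 1.0 / sync_latency_s

def qps_group_commit(n_threads, sync_latency_s):
    """Upper bound with group commit: one sync makes the commits of all
    n_threads concurrently waiting clients durable."""
    return n_threads / sync_latency_s

# Hypothetical sync latency (not measured), about 1.5 ms.
SYNC_LATENCY_S = 1.0 / 650

for n in (20, 50, 80, 100, 150):
    print(n,
          round(qps_private_sync(n, SYNC_LATENCY_S)),
          round(qps_group_commit(n, SYNC_LATENCY_S)))
```

The model only reproduces the shape of the curves (flat versus growing with concurrency), not the absolute numbers of the other servers.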

Comment by Sergei Petrunia [ 2018-12-27 ]

Trying on a patched version:

MariaDB-10.3-patch1

 20,  6483.17
 50, 14635.17
 80, 19708.48
100, 23949.44
150, 32377.06

This is on par with other branches.

Comment by Sergei Petrunia [ 2018-12-28 ]

The code that is causing the slowdown here was introduced in MDEV-15372.

That MDEV fixed the performance of the multi-threaded slave (its non-XA variant). The slave wants to commit in the same order as the master does. The idea was to let the transactions run in parallel, but then commit them (call rocksdb_commit) in the order in which they were committed on the master.

This caused the commits to be serialized. The way to un-serialize them was to mimic InnoDB:

  tx->set_sync(false);
  tx->commit(); // this establishes the commit order. It is serialized but it does not flush
 
  // this notifies the SQL layer that subsequent transactions can run:
  thd_wakeup_subsequent_commits(thd, 0);
 
  // this makes the changes persistent:
  rocksdb::Status s= rdb->FlushWAL(true);

Comment by Sergei Petrunia [ 2018-12-28 ]

I'm not sure why this change fixed the performance back then but is killing it now. Maybe something has changed inside RocksDB? (It looks like nothing has.)

Comment by Sergei Petrunia [ 2019-03-28 ]

Disable this for non-slave threads

An obvious thing to do is to disable the new code for non-slave threads (see THD::slave_thread).
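
The shape of that change can be sketched with stand-in objects. This is not the MyRocks code: FakeWal and the function names are hypothetical, and the non-slave branch assumes that "disabling the new code" means falling back to a single syncing commit.

```python
class FakeWal:
    """Stand-in for the transaction/WAL objects; it only records the calls made."""
    def __init__(self):
        self.calls = []

    def commit(self, sync):
        self.calls.append(f"commit(sync={sync})")

    def flush_wal(self, sync):
        self.calls.append(f"FlushWAL({sync})")

def commit_transaction(wal, is_slave_thread, wakeup_subsequent_commits):
    if is_slave_thread:
        # MDEV-15372 path: ordered commit without a sync, then let the next
        # transaction in commit order proceed, then make the changes durable.
        wal.commit(sync=False)
        wakeup_subsequent_commits()
        wal.flush_wal(True)
    else:
        # Assumed fallback for regular connections: one syncing commit, which
        # leaves the grouping of concurrent syncs to RocksDB.
        wal.commit(sync=True)

wal = FakeWal()
commit_transaction(wal, is_slave_thread=False, wakeup_subsequent_commits=lambda: None)
print(wal.calls)
```

With this split, only slave threads pay for the separate FlushWAL(true) call.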

Possible solutions for slave threads:

Hook in RocksDB

Add a hook inside RocksDB somewhere to call thd_wakeup_subsequent_commits().
Problems: there doesn't seem to be any hook for this, so adding it will
require A) finding the right place and B) convincing RocksDB to accept a PR
with a hook.

Non-durable mode for the slave commits

Transactions on the slave come from the binary log, so it is not an issue if
some of them are lost in a crash. They can be replayed from the relay log.

(We only need a guarantee that if transaction #N disappears, then all
subsequent transactions disappear as well. I think we have this property:
writes to RocksDB WAL are done sequentially. Failing to flush may truncate the
WAL, but will not create "gaps" in it)
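
The property in question can be stated as a small check. The sketch below models the assumption from the paragraph above (a crash keeps only a prefix of the sequentially written WAL). It does not prove anything about RocksDB itself, and the function names are made up.

```python
def survivors_after_crash(wal_records, durable_count):
    """Records that survive if only the first durable_count records reached disk."""
    return wal_records[:durable_count]

def has_gap(commit_order, survivors):
    """True if some transaction survived although an earlier one was lost."""
    surviving = set(survivors)
    seen_lost = False
    for trx in commit_order:
        if trx not in surviving:
            seen_lost = True
        elif seen_lost:
            return True
    return False

order = [1, 2, 3, 4, 5, 6]
# Truncating the WAL at any point never produces a gap.
print(all(not has_gap(order, survivors_after_crash(order, k))
          for k in range(len(order) + 1)))
```

A state such as [1, 2, 4] would count as a gap, and that is the case the slave could not recover from by replaying the relay log from a single position.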

One thing to check: when the slave thinks it has applied all events from
a relay log file, it may remove that relay log file. But what if the storage
engine has not persisted the transactions from that log file yet
(assuming they can be replayed)? Can this situation happen and if yes can it
be prevented (e.g. have MyRocks flush its WAL before a relay log file is
removed)?
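
If the answer turns out to be yes, one possible shape for such a guard is sketched below. Everything in it is hypothetical: FakeEngine, the transaction counters and purge_relay_log are illustration only and do not correspond to existing server or MyRocks interfaces.

```python
class FakeEngine:
    """Stand-in for the storage engine: tracks applied vs. durable transactions."""
    def __init__(self, applied_trx, durable_trx):
        self.applied_trx = applied_trx
        self.durable_trx = durable_trx

    def flush_wal(self):
        # After a synced WAL flush, everything applied so far is durable.
        self.durable_trx = self.applied_trx

def purge_relay_log(last_trx_in_file, engine, remove_file):
    """Remove a relay log file only once the engine has persisted its transactions."""
    if engine.durable_trx < last_trx_in_file:
        engine.flush_wal()
    if engine.durable_trx < last_trx_in_file:
        return False  # not everything in the file was applied yet, keep it
    remove_file()
    return True

removed = []
engine = FakeEngine(applied_trx=10, durable_trx=7)
print(purge_relay_log(10, engine, lambda: removed.append("relay-log-file")))
```

The cost of this approach is one extra WAL sync per relay log rotation, which should be rare compared to commits.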

Use XA-mode for slave threads.

(TODO: this looked like a solution but now I'm trying to describe it and
it's not obvious how to achieve both performance and safety?)

Comment by Sergei Petrunia [ 2019-03-28 ]

Taking MariaDB 10.2 as a base, cset 0623cc7c16c3280d1f81b9049e1561d1b4b6c1d0.

Developed a patch to disable the MDEV-15372 code for non-slave threads. Trying it on a c5.2xlarge instance, with log-bin off, other settings being default:

n_threads	MariaDB-10.2-cur	MariaDB-10.2-patched
20	648.16	6195.19
50	630.57	13620.29
80	594.41	18855.85
100	599.48	22311.91
150	670.35	30937.94
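
The speedup from the patch at each thread count, computed from the table above (a small sketch; the variable names are mine, the qps figures are copied from the table):

```python
# qps figures copied from the table above (MariaDB 10.2, log-bin off).
threads = [20, 50, 80, 100, 150]
current = [648.16, 630.57, 594.41, 599.48, 670.35]
patched = [6195.19, 13620.29, 18855.85, 22311.91, 30937.94]

speedups = [p / c for p, c in zip(patched, current)]
for n, s in zip(threads, speedups):
    print(f"{n:>3} threads: patched is {s:.1f}x faster")
```

The ratio grows with concurrency because the unpatched numbers stay flat while the patched ones scale with the thread count.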

Generated at Thu Feb 08 08:41:21 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.