[MDEV-14561] Mariadb drop in sysbench Tps for higher duration run in tmpfs filesystem Created: 2017-12-04  Updated: 2021-09-16  Resolved: 2021-09-16

Status: Closed
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.3.2
Fix Version/s: N/A

Type: Bug Priority: Minor
Reporter: Sandeep sethia Assignee: Axel Schwenke
Resolution: Won't Fix Votes: 0
Labels: innodb
Environment:

Ubuntu 16.04.2


Attachments: PNG File oltp_update_non_index_256.png    
Issue Links:
Blocks
blocks MDEV-14442 Optimization for ARM64 platform. Open
Epic Link: arm64 optimization

 Description   

I want to confirm whether this is the expected behaviour on a tmpfs/ramdisk filesystem.

While benchmarking with the sysbench oltp_update_index.lua script for a 150-second run, I consistently see a drop in TPS after 80 seconds. This happens on both Intel x86 and ARM platforms, but hurts ARM somewhat more, by as much as 10%.

Setup: in-memory filesystem (datadir is kept on tmpfs), buffer pool is 50 GB, innodb_flush_log_at_trx_commit=0, log file size is 2 GB, and buffer pool instances is 30.

Test: 20 tables with 100000 records each, run with 64 threads using sysbench oltp_update_index.lua (update test).

Time: 150 sec
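For reference, the setup above corresponds roughly to the following my.cnf fragment. The values are taken from the description; the datadir path is only a placeholder:

```ini
[mysqld]
# datadir on tmpfs (in-memory filesystem); path is hypothetical
datadir = /mnt/tmpfs/data
innodb_buffer_pool_size = 50G
innodb_buffer_pool_instances = 30
innodb_log_file_size = 2G
innodb_flush_log_at_trx_commit = 0
```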

A few observations:
1) I don't see the issue with pure I/O operation, i.e. when the buffer pool size is kept at the default and datadir is on SSDs instead of tmpfs.
2) As expected, TPS is much lower with pure I/O, so the issue seems to be masked there.
3) Flushing seems to be slow in the in-memory test and unable to keep up with the read speed of tmpfs.
4) In the pure I/O read/write test, innodb_flush_sync=0 is not recommended, as TPS is only about 40% of what it is with async operation.

2017-11-24 9:40:34 76 [Note] Detected table cache mutex contention at instance 1: 20% waits. Additional table cache instance activated. Number of instances after activation: 2.
[ 10s ] thds: 64 tps: 138157.88 qps: 138157.88 (r/w/o: 0.00/138157.88/0.00) lat (ms,95%): 1.14 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 64 tps: 164025.68 qps: 164025.68 (r/w/o: 0.00/164025.68/0.00) lat (ms,95%): 0.94 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 64 tps: 164611.23 qps: 164611.23 (r/w/o: 0.00/164611.23/0.00) lat (ms,95%): 0.94 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 64 tps: 164317.42 qps: 164317.42 (r/w/o: 0.00/164317.42/0.00) lat (ms,95%): 0.94 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 64 tps: 164128.75 qps: 164128.85 (r/w/o: 0.00/164128.85/0.00) lat (ms,95%): 0.94 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 64 tps: 164028.19 qps: 164028.19 (r/w/o: 0.00/164028.19/0.00) lat (ms,95%): 0.94 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 64 tps: 148203.75 qps: 148203.65 (r/w/o: 0.00/148203.65/0.00) lat (ms,95%): 1.23 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 64 tps: 113156.11 qps: 113156.11 (r/w/o: 0.00/113156.11/0.00) lat (ms,95%): 1.58 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 64 tps: 118685.99 qps: 118685.99 (r/w/o: 0.00/118685.99/0.00) lat (ms,95%): 1.52 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 64 tps: 118142.32 qps: 118142.32 (r/w/o: 0.00/118142.32/0.00) lat (ms,95%): 1.55 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 64 tps: 122747.22 qps: 122747.22 (r/w/o: 0.00/122747.22/0.00) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 64 tps: 112475.31 qps: 112475.31 (r/w/o: 0.00/112475.31/0.00) lat (ms,95%): 1.58 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 64 tps: 112336.17 qps: 112336.17 (r/w/o: 0.00/112336.17/0.00) lat (ms,95%): 1.58 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 64 tps: 121541.22 qps: 121541.22 (r/w/o: 0.00/121541.22/0.00) lat (ms,95%): 1.50 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 64 tps: 116951.59 qps: 116951.59 (r/w/o: 0.00/116951.59/0.00) lat (ms,95%): 1.55 err/s: 0.00 reconn/s: 0.00
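The drop in the interval report above can be quantified by parsing the per-interval TPS lines with a small standalone script (my own sketch, not part of sysbench):

```python
import re

def interval_tps(report: str):
    """Extract (seconds, tps) pairs from sysbench interval report lines."""
    pattern = r"\[\s*(\d+)s\s*\]\s*thds:\s*\d+\s*tps:\s*([\d.]+)"
    return [(int(s), float(t)) for s, t in re.findall(pattern, report)]

def drop_after(report: str, cutoff: int) -> float:
    """Relative TPS drop of intervals after `cutoff` seconds versus the
    steady-state average before it (first interval skipped as warm-up)."""
    pairs = interval_tps(report)
    before = [t for s, t in pairs if s <= cutoff][1:]  # skip warm-up interval
    after = [t for s, t in pairs if s > cutoff]
    return 1.0 - (sum(after) / len(after)) / (sum(before) / len(before))

# A few of the interval lines from the run above:
report = """
[ 20s ] thds: 64 tps: 164025.68 qps: 164025.68
[ 70s ] thds: 64 tps: 148203.75 qps: 148203.65
[ 80s ] thds: 64 tps: 113156.11 qps: 113156.11
[ 90s ] thds: 64 tps: 118685.99 qps: 118685.99
"""
print(f"TPS drop after 70s: {drop_after(report, 70):.1%}")
```

Applied to the full 150-second run, this shows steady-state TPS around 164k before the drop and roughly 112k-123k after, matching the ~25-30% degradation visible in the raw output.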

I modified the cnf file to use the two parameters below:

innodb_flush_sync=0
innodb_io_capacity=8000

[ 10s ] thds: 64 tps: 139392.95 qps: 139392.95 (r/w/o: 0.00/139392.95/0.00) lat (ms,95%): 1.08 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 64 tps: 163050.96 qps: 163050.96 (r/w/o: 0.00/163050.96/0.00) lat (ms,95%): 0.90 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 64 tps: 162052.53 qps: 162052.53 (r/w/o: 0.00/162052.53/0.00) lat (ms,95%): 0.90 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 64 tps: 162532.91 qps: 162532.91 (r/w/o: 0.00/162532.91/0.00) lat (ms,95%): 0.92 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 64 tps: 163324.90 qps: 163324.90 (r/w/o: 0.00/163324.90/0.00) lat (ms,95%): 0.90 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 64 tps: 164528.45 qps: 164528.65 (r/w/o: 0.00/164528.65/0.00) lat (ms,95%): 0.89 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 64 tps: 164178.11 qps: 164177.91 (r/w/o: 0.00/164177.91/0.00) lat (ms,95%): 0.89 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 64 tps: 162853.44 qps: 162853.44 (r/w/o: 0.00/162853.44/0.00) lat (ms,95%): 0.90 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 64 tps: 163859.76 qps: 163859.76 (r/w/o: 0.00/163859.76/0.00) lat (ms,95%): 0.89 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 64 tps: 163382.50 qps: 163382.50 (r/w/o: 0.00/163382.50/0.00) lat (ms,95%): 0.89 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 64 tps: 164321.72 qps: 164321.82 (r/w/o: 0.00/164321.82/0.00) lat (ms,95%): 0.87 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 64 tps: 163324.30 qps: 163324.20 (r/w/o: 0.00/163324.20/0.00) lat (ms,95%): 0.90 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 64 tps: 164130.22 qps: 164130.22 (r/w/o: 0.00/164130.22/0.00) lat (ms,95%): 0.89 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 64 tps: 163013.68 qps: 163013.68 (r/w/o: 0.00/163013.68/0.00) lat (ms,95%): 0.90 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 64 tps: 155699.33 qps: 155699.33 (r/w/o: 0.00/155699.33/0.00) lat (ms,95%): 1.08 err/s: 0.00 reconn/s: 0.00



 Comments   
Comment by Sergey Vojtovich [ 2018-01-24 ]

I just reproduced this on Intel, but only after 120 seconds. Normally I run the benchmark for 2 minutes, which is why I hadn't seen this problem.

[ 10s ] thds: 40 tps: 209271.39 qps: 209271.49 (r/w/o: 0.00/209271.49/0.00) lat (ms,95%): 0.30 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 40 tps: 219256.27 qps: 219256.17 (r/w/o: 0.00/219256.17/0.00) lat (ms,95%): 0.28 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 40 tps: 220063.99 qps: 220063.99 (r/w/o: 0.00/220063.99/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 40 tps: 220007.66 qps: 220007.66 (r/w/o: 0.00/220007.66/0.00) lat (ms,95%): 0.28 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 40 tps: 221700.33 qps: 221700.33 (r/w/o: 0.00/221700.33/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 40 tps: 221726.18 qps: 221726.18 (r/w/o: 0.00/221726.18/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 40 tps: 220214.89 qps: 220214.89 (r/w/o: 0.00/220214.89/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 40 tps: 217774.79 qps: 217774.79 (r/w/o: 0.00/217774.79/0.00) lat (ms,95%): 0.27 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 40 tps: 219878.31 qps: 219878.31 (r/w/o: 0.00/219878.31/0.00) lat (ms,95%): 0.28 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 40 tps: 221915.26 qps: 221915.26 (r/w/o: 0.00/221915.26/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 40 tps: 221352.38 qps: 221352.38 (r/w/o: 0.00/221352.38/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 40 tps: 216014.77 qps: 216014.77 (r/w/o: 0.00/216014.77/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 40 tps: 186651.10 qps: 186651.10 (r/w/o: 0.00/186651.10/0.00) lat (ms,95%): 0.27 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 40 tps: 187580.45 qps: 187580.45 (r/w/o: 0.00/187580.45/0.00) lat (ms,95%): 0.27 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 40 tps: 187694.01 qps: 187694.01 (r/w/o: 0.00/187694.01/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 160s ] thds: 40 tps: 190074.07 qps: 190074.17 (r/w/o: 0.00/190074.17/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
[ 170s ] thds: 40 tps: 182377.19 qps: 182377.19 (r/w/o: 0.00/182377.19/0.00) lat (ms,95%): 0.27 err/s: 0.00 reconn/s: 0.00
[ 180s ] thds: 40 tps: 176981.36 qps: 176981.26 (r/w/o: 0.00/176981.26/0.00) lat (ms,95%): 0.28 err/s: 0.00 reconn/s: 0.00
[ 190s ] thds: 40 tps: 171275.16 qps: 171275.16 (r/w/o: 0.00/171275.16/0.00) lat (ms,95%): 0.30 err/s: 0.00 reconn/s: 0.00
[ 200s ] thds: 40 tps: 174915.05 qps: 174915.05 (r/w/o: 0.00/174915.05/0.00) lat (ms,95%): 0.29 err/s: 0.00 reconn/s: 0.00

Comment by Axel Schwenke [ 2021-09-16 ]

It is most likely that you see the effects of a busy log_sys_mutex during checkpoint storms. See attached oltp_update_non_index_256.png for an example. Once the checkpoint age gets critically high, the page flushing blocks the log_sys_mutex and the throughput decreases. That problem got more pronounced with the switch to a single buffer pool instance in 10.5.
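The checkpoint age in question is the distance between the current log sequence number and the last checkpoint LSN; both are printed in the LOG section of SHOW ENGINE INNODB STATUS. A minimal sketch for extracting it from that output (field names as printed by InnoDB; fetching the status text itself is left out):

```python
import re

def checkpoint_age(innodb_status: str) -> int:
    """Compute checkpoint age (bytes of redo not yet checkpointed) from
    the LOG section of SHOW ENGINE INNODB STATUS output."""
    lsn = int(re.search(r"Log sequence number\s+(\d+)", innodb_status).group(1))
    ckpt = int(re.search(r"Last checkpoint at\s+(\d+)", innodb_status).group(1))
    return lsn - ckpt

# Example with the relevant lines from a status dump (numbers made up):
status = """
---
LOG
---
Log sequence number 2734587190
Log flushed up to   2734586800
Last checkpoint at  2600000000
"""
print("checkpoint age:", checkpoint_age(status))
```

When this age approaches the redo log capacity (2 GB in the reported setup), flushing turns synchronous and blocks log_sys_mutex, which is the mechanism described above.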

We are aware of the problem, but it isn't easy to fix.

Comment by Axel Schwenke [ 2021-09-16 ]

There is now MDEV-26620 for the adaptive flushing.

Generated at Thu Feb 08 08:14:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.