[MDEV-21534] improve locking/waiting in log_write_up_to Created: 2020-01-20  Updated: 2022-06-17  Resolved: 2020-03-02

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.5.2

Type: Task Priority: Major
Reporter: Vladislav Vaintroub Assignee: Vladislav Vaintroub
Resolution: Fixed Votes: 1
Labels: None

Attachments: File pprepare.errlog.gz     File pprepare.stack.dump    
Issue Links:
Relates
relates to MDEV-27848 Remove unused wait/io/file/innodb/inn... Closed
relates to MDEV-27917 Some redo log diagnostics is always r... Closed
relates to MDEV-28870 InnoDB: Missing FILE_CREATE, FILE_DEL... Closed
relates to MDEV-17084 Optimize append only files for NVDIMM Closed
relates to MDEV-21925 Redo log mutex competition is serious Closed

 Description   

On multithreaded write-intensive benchmarks with innodb_flush_log_at_trx_commit=1 (e.g. sysbench update_index), log_write_up_to() is one of the hottest functions, and log_sys.mutex is one of the hottest mutexes.

This is partly due to the way log_sys.flush_event is used.
Whenever there is a pending flush, a thread waits on this event, and when the event is signalled, it retries the flush from the start. The problem is that when the flush completes, all waiting threads are woken, resulting in contention on log_sys.mutex, many spurious wakeups, and retries.
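
The wait-and-retry pattern described above can be sketched roughly as follows (simplified pseudocode, not the actual InnoDB source; the flush_in_progress flag is illustrative):

```
/* Simplified pseudocode of the problematic pattern. On every flush
   completion the event is broadcast, waking ALL waiters; each one
   re-acquires the hot log_sys.mutex and retries from the top, often
   only to find that its own LSN is still not durable. */
void log_write_up_to(lsn_t lsn)
{
  for (;;) {
    mutex_enter(&log_sys.mutex);
    if (log_sys.flushed_to_disk_lsn >= lsn) {
      mutex_exit(&log_sys.mutex);
      return;                              /* already durable */
    }
    if (flush_in_progress) {               /* illustrative flag */
      mutex_exit(&log_sys.mutex);
      os_event_wait(log_sys.flush_event);  /* broadcast wakes everyone */
      continue;                            /* retry from the start */
    }
    /* ... become the flusher, write and fsync the log ... */
  }
}
```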

The situation can be improved, e.g. by using a custom synchronization primitive instead of the InnoDB event.

The synchronization primitive could have two operations:

  1. wait(wait_lsn)
  2. set(set_lsn)

where wait() blocks the current thread until set() is called with set_lsn >= wait_lsn.
Note that an occasional "pseudo-spurious" wakeup may still be necessary in order to elect a new group commit leader, but set() should not wake up all of the waiting threads.



 Comments   
Comment by Vladislav Vaintroub [ 2020-01-20 ]

Quick benchmark

Windows:
sysbench 0.4

sysbench --test=oltp --oltp-table-size=1000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql ^
--mysql-table-engine=innodb --max-time=60 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --num-threads=%2 ^
--oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 ^
--oltp-index-updates=1 --oltp-delete-inserts=0 --oltp-non-index-updates=0 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL %1

concurrency   10.5 tps   10.5+patch tps
1                  906              845
2                 1144             1130
4                 2240             2197
8                 4040             3935
16                7149             7483
32               12121            13088
64               17560            20585
128              20937            31696
256               2424            40960
512              23516            45617
1024             25474            42328
Comment by Vladislav Vaintroub [ 2020-01-20 ]

Quick benchmark, Linux (Hyper-V), same box as above

concurrency   10.5 tps   10.5+patch tps
1                  527              558
2                  750              777
4                 1272             1221
8                 1925             2094
16                3037             3406
32                5171             6371
64                8414            12012
128              15166            21650
256              18167            32690
512              21264            35022
1000             20233            36374

== my.cnf ==

[mysqld]
back_log=500
user=root
port=3306
max_connections=4096
max_connect_errors=5000
max_prepared_stmt_count=500000
table_open_cache=10000
transaction_isolation=REPEATABLE-READ
innodb_status_file=0
innodb_flush_method=O_DIRECT
innodb_buffer_pool_size=3G
innodb_log_file_size=4G
innodb_log_buffer_size=100M
innodb_doublewrite=1
innodb_flush_log_at_trx_commit=1
innodb-io-capacity=20000
innodb-io-capacity-max=50000
innodb-use-native-aio=1
loose-enable-named-pipe
innodb-stats-persistent=0
innodb_write_io_threads=4
innodb-buffer-pool-instances=1

Sysbench 1.0 script

==update_index.sh ==
sysbench --test=/usr/share/sysbench/oltp_update_index.lua --table-size=1000000 --mysql-host=localhost \
--mysql-db=sbtest --mysql-user=wlad --mysql-password= --db-driver=mysql --rand-type=uniform \
--max-time=60 --max-requests=0 --num-threads=$2 \
--skip-trx=on --mysql-socket=/tmp/mysql.sock $1 $3

Run the whole benchmark:

for i in 1 2 4 8 16 32 64 128 256 512 1000
do
  ./update_index.sh cleanup 1
  ./update_index.sh prepare 1
  ./update_index.sh run $i
done

Comment by Marko Mäkelä [ 2020-01-20 ]

I tested the changes on my 2-socket system (Intel® Xeon® CPU E5-2630 v4 @ 2.20GHz), that is, with 2×10 CPU cores (40 with hyperthreading).
The base revision was 10.5 ded128aa9b3d40775e626a08a744a93d9ba71b01, and I did cherry-pick bb-10.5-MDEV-21534 fcb5d00816778d249569c2736e33158f0f5af255 on top of it.

I used sysbench version 1.0.18+ds-1 on Debian GNU/Linux unstable and the following update_index.sh script:

#!/bin/sh
exec sysbench --test=/usr/share/sysbench/oltp_update_index.lua --table-size=1000000 --mysql-host=localhost \
--mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --rand-type=uniform \
--max-time=60 --max-requests=0 --num-threads=$2 \
--skip-trx=on --mysql-socket=/tmp/mysql.sock $1

I created the database by running ./mtr innodb.101_compatibility,16k and then copying mysql-test/var/install.db to the root file system (ext4fs on an Intel Optane PCI-X NVMe device SSDPED1D960GAY).
The reported latency was virtually the same, and there is a clear trend that the number of transactions improves. Maybe we should have set innodb_change_buffering=none to reduce randomness further:

version        1       2        4        8       16       32       64      128      256      512     1000
baseline  125518  314452  1144410  2773692  4446403  5448176  4936405  4716960  4332979  3046671  1410304
revised   133663  411896  1119697  2763726  4750950  7090966  7816565  7202068  4307964  2789257  1434679

Sum of latency:

version          1          2          4          8         16          32          64          128          256          512         1000
baseline  59848.80  119676.62  239178.92  478609.91  957364.62  1917152.71  3836725.65   7676649.61  15356284.63  30721409.86  60011809.97
revised   59841.27  119638.88  239227.52  478606.61  957232.89  1916185.21  3835150.35   7675010.65  15356387.95  30705764.06  60004381.35

Note: I used the following kinds of commands to construct the above tables:

sed -ne 's/^ *write: *//p'< baseline.txt|tr \\012 \|; echo
sed -ne 's/^ *sum: *//p'< baseline.txt|tr \\012 \|; echo

Comment by Axel Schwenke [ 2020-01-28 ]

I tested a build from bb-10.5-MDEV-21534 with sysbench and noticed odd behavior. The server would sporadically stop all operations (I waited several minutes before giving up).

And - maybe even more seriously - it was impossible to shut down the server gracefully. It would just print a lot of warnings to the error log (warning: thread xxx did not stop) but not go down. I had to resort to SIGKILL to get rid of it.

Comment by Vladislav Vaintroub [ 2020-01-28 ]

axel, can you describe how to reproduce? Nobody has seen the "odd behaviour" but you. If there is no way to reproduce, please share call stacks of all threads.

Comment by Vladislav Vaintroub [ 2020-01-28 ]

The repository contains several commits for MDEV-21534.
4ce687d6d1f53be3b6928739d11febe9e1b936c7 fixes a potential deadlock, so it is necessary to know which commit was tested.
axel ^

Comment by Axel Schwenke [ 2020-01-28 ]

There is one easy way to cause a server hang: use parallel_prepare.lua (distributed with sysbench-mariadb) to load multiple tables in parallel. I just tried with 16 tables and it didn't even reach the "loading rows" stage.
I pulled stack traces with PMP. I tried to shutdown the server and got the same warnings in the error log. This time I waited a bit longer and InnoDB started printing "long semaphore wait" messages. I attach both the stack dump and the error log.

Files pprepare.stack.dump and pprepare.errlog.gz

I remember that when I spotted this, I reconfigured the sysbench script to load tables sequentially, but then got a hang during a benchmark run. I cannot reproduce it right now, but I keep trying.

As for which commit it was - I cannot say. The build finished on Jan 21st 15:11 UTC and I pulled some minutes before. Wait, looks like I did not pull in this repo after that. It's at commit fcb5d008 now, so probably was at that when I built. And yes, that is before that deadlock fix.

Comment by Marko Mäkelä [ 2020-01-28 ]

I think that a merge of (or rebase onto) the latest 10.5 is needed for the development branch. Some hangs were introduced in 10.5 by MDEV-16678 and MDEV-16264, and I have not seen any hangs since the commit marked with MDEV-21551.

Comment by Axel Schwenke [ 2020-01-28 ]

OK, with commit 09fa2894d9e I did not see any anomalies from branch bb-10.5-MDEV-21534 anymore.

Generated at Thu Feb 08 09:07:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.