[MDEV-25404] read-only performance regression in 10.6 Created: 2021-04-13  Updated: 2021-10-07  Resolved: 2021-04-19

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6
Fix Version/s: 10.6.0

Type: Bug Priority: Blocker
Reporter: Axel Schwenke Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: regression

Attachments: PDF File gdigest.pdf    
Issue Links:
Relates
relates to MDEV-24142 rw_lock_t has unnecessarily complex w... Closed
relates to MDEV-26467 Unnecessary compare-and-swap loop in ... Closed
relates to MDEV-26476 InnoDB is missing futex support on so... Closed
relates to MDEV-25451 TPC-C in-memory performance degradati... Closed

 Description   

I see a heavy performance regression in 10.6 that did not exist ~4 weeks ago. It affects all workloads, even read-only:

--------------------------------------------------------------------------------
Test 't_1K-reads-innodb-multi' - sysbench OLTP readonly
1000 point selects per iteration, no range queries
20 tables, 1 million rows total, engine InnoDB/XtraDB (builtin)
numbers are queries per second
 
#thread count           1       8       16      32      64      128     256
mariadb-10.5.6          17829   121710  198138  323747  322578  325941  320516
mariadb-10.5.7          17909   123676  196655  322730  319521  321345  317861
mariadb-10.5.8          17323   122421  194577  323129  321011  322180  318691
mariadb-10.5.9          17908   121776  195502  319654  316815  321072  315318
mariadb-10.6.0          16571   114360  187503  309040  306141  308083  304082
--------------------------------------------------------------------------------
Test 't_collate_distinct_range_utf8_unicode' - sysbench OLTP readonly
selecting distinct rows from short range, collation utf8_unicode_ci
1 table, 1 million rows, engine InnoDB/XtraDB (builtin)
numbers are queries per second
 
#thread count           1       8       16      32      64      128     256
mariadb-10.5.6          7802.2  52344   90565   143215  143131  143469  142597
mariadb-10.5.7          7661.6  51889   89530   141981  141824  142293  141383
mariadb-10.5.8          7606.1  52009   90159   141161  142194  142386  141560
mariadb-10.5.9          7561.6  51927   90081   142035  142127  142333  141701
mariadb-10.6.0          7121.8  48368   84864   136162  135595  135270  134618



 Comments   
Comment by Axel Schwenke [ 2021-04-13 ]

Full results in attachment gdigest.pdf

Comment by Axel Schwenke [ 2021-04-14 ]

I tested back until e9f33b77605 (from Dec 3rd) but still see this regression.

Comment by Axel Schwenke [ 2021-04-15 ]

I found a good and a bad revision now:

#thread count           64
mariadb-565b0dd17df     162803
mariadb-e9f33b77605     155187

sysbench command line is

sysbench-0.5 --test=lua/oltp.lua --mysql-table-engine=InnoDB --oltp_tables_count=1\
  --oltp-table-size=1000000 --mysql-socket=/tmp/mysqld.sock.sysbench \
  --mysql-user=root prepare
sysbench-0.5 --test=lua/oltp.lua --oltp-read-only=on --oltp_point_selects=0 \
  --oltp_simple_ranges=0 --oltp_sum_ranges=0 --oltp_order_ranges=0 \
  --oltp_distinct_ranges=10 --oltp_range_size=10 --oltp_tables_count=1 \
  --oltp-table-size=1000000 --num-threads=64 --max-requests=0 --max-time=100 \
  --forced-shutdown=60 --report-interval=3 \
  --mysql-socket=/tmp/mysqld.sock.sysbench --mysql-user=root run

I switched back to standard collation (latin1) now.

Comment by Axel Schwenke [ 2021-04-15 ]

Hunted down (bisecting) to commit 03ca6495df3

#thread count           64
mariadb-03ca6495df3     154882
mariadb-d46b42489a6     162230

Comment by Marko Mäkelä [ 2021-04-15 ]

wlad already noticed something similar. To address that, we first made srw_lock a thin wrapper of SRWLOCK on Microsoft Windows, and later disabled the futex-based implementation there altogether.

axel confirmed that defining SRW_LOCK_DUMMY (switching to an implementation with one mutex and two condition variables, with separate condition variables for pending shared and exclusive requests) helps improve performance. The problem with a futex is that both readers and writers will be unnecessarily woken up.

I think that we definitely need a futex-based srw_mutex and a mutex-and-condition-variables based sux_lock (for dict_index_t::lock and buf_block_t::lock). It remains to be seen whether srw_lock must also be changed to use a mutex and two condition variables.

It would be so much nicer to use only 4 bytes instead of 40+2*48 bytes, especially inside buf_block_t::lock.

Comment by Marko Mäkelä [ 2021-04-16 ]

Using http://locklessinc.com/articles/sleeping_rwlocks/ for inspiration, I came up with a simpler composition for srw_lock or sux_lock. Writers (U and X lock holders) will use a combination of a writer mutex and a flag in the atomic readers lock word. If a read lock request sees that the WRITER flag is set, it will enter a loop where it acquires the mutex, increments the readers and checks the WRITER flag.
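The composition described above can be sketched roughly as follows. This is a minimal single-word illustration of the idea (a WRITER flag in the atomic readers word, plus a writer mutex), not the actual InnoDB code; all names here are hypothetical, and the real implementation sleeps on a futex instead of spinning:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

class rw_sketch
{
  static constexpr uint32_t WRITER= 1U << 31;  // flag bit in the lock word
  std::atomic<uint32_t> readers{0};            // reader count | WRITER flag
  std::mutex writer;                           // held by U/X lock holders

public:
  void rd_lock()
  {
    uint32_t lk= readers.fetch_add(1, std::memory_order_acquire);
    if (lk & WRITER)
    {
      // Slow path: undo the increment, then serialize on the writer mutex
      // so that we wait until the writer has cleared the WRITER flag.
      readers.fetch_sub(1, std::memory_order_relaxed);
      std::lock_guard<std::mutex> g(writer);
      // With the writer mutex held, no writer can set WRITER concurrently.
      readers.fetch_add(1, std::memory_order_acquire);
    }
  }
  void rd_unlock() { readers.fetch_sub(1, std::memory_order_release); }

  void wr_lock()
  {
    writer.lock();
    // Announce the writer, then wait for existing readers to drain
    // (spinning here for brevity; the real code would use FUTEX_WAIT).
    uint32_t lk= readers.fetch_or(WRITER, std::memory_order_acquire);
    while (lk & ~WRITER)
      lk= readers.load(std::memory_order_acquire);
  }
  void wr_unlock()
  {
    readers.fetch_and(~WRITER, std::memory_order_release);
    writer.unlock();
  }

  // Accessor for illustration/testing only.
  uint32_t word() const { return readers.load(std::memory_order_relaxed); }
};
```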

I also reimplemented srw_mutex using 31 bits for counting waiting requests and 1 bit for a HOLDER flag. The counter allows us to avoid unnecessary FUTEX_WAKE system calls.
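A rough sketch of that word layout (hypothetical names, and spinning in place of the actual FUTEX_WAIT/FUTEX_WAKE calls) might look like this; the point is that release only needs a wake-up system call when the waiter count is nonzero:

```cpp
#include <atomic>
#include <cstdint>

class srw_mutex_sketch
{
  static constexpr uint32_t HOLDER= 1U << 31;  // top bit: lock is held
  std::atomic<uint32_t> lk{0};                 // HOLDER | 31-bit waiter count

public:
  bool try_lock()
  { return !(lk.fetch_or(HOLDER, std::memory_order_acquire) & HOLDER); }

  void lock()
  {
    if (try_lock()) return;
    // Register as a waiter before sleeping, so that unlock() knows
    // whether a wake-up would be needed at all.
    lk.fetch_add(1, std::memory_order_relaxed);
    while (!try_lock())
      ;  // the real code would invoke FUTEX_WAIT here instead of spinning
    lk.fetch_sub(1, std::memory_order_relaxed);
  }

  // Returns true when a FUTEX_WAKE system call would be needed, that is,
  // when the waiter count is nonzero; an uncontended unlock skips it.
  bool unlock()
  { return (lk.fetch_and(~HOLDER, std::memory_order_release) & ~HOLDER) != 0; }
};
```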

With SRW_LOCK_DUMMY, we will use two mutexes and one condition variable, instead of one mutex and two condition variables.

Comment by Marko Mäkelä [ 2021-04-17 ]

We can only compose the ssux_lock using a writer mutex on systems where the mutex is not re-entrant. There is no such guarantee for the POSIX pthread_mutex_t.

Hence, for systems where only a generic mutex is available, we must retain the old SRW_LOCK_DUMMY implementation that consists of a std::atomic&lt;uint32_t&gt;, a pthread_mutex_t and two pthread_cond_t. Otherwise, when the ownership of a buf_block_t::lock is transferred to a write completion callback thread, the thread that submitted the write could wrongly acquire the writer mutex of the buf_block_t::lock while the previously submitted write is still in progress. This problem was caught on Microsoft Windows, on a system where the tests were run on a relatively slow hard disk.

Using a futex-based srw_mutex writer works correctly, because re-entrant acquisition is not allowed and the mutex does not keep track of the holding thread.

I successfully tested the fix on Microsoft Windows both with and without the following patch:

diff --git a/storage/innobase/include/rw_lock.h b/storage/innobase/include/rw_lock.h
index cf02fe26c2c..7bfce1b62f7 100644
--- a/storage/innobase/include/rw_lock.h
+++ b/storage/innobase/include/rw_lock.h
@@ -22,7 +22,7 @@ this program; if not, write to the Free Software Foundation, Inc.,
 
 #if !(defined __linux__ || defined __OpenBSD__ || defined _WIN32)
 # define SRW_LOCK_DUMMY
-#elif 0 // defined SAFE_MUTEX
+#elif 1 // defined SAFE_MUTEX
 # define SRW_LOCK_DUMMY /* Use dummy implementation for debugging purposes */
 #endif

Comment by Vladislav Vaintroub [ 2021-04-19 ]

Performance seems OK, but the code has grown somewhat complicated. Maybe it makes sense to document the relationship between the different rw-locks in InnoDB.

Comment by Marko Mäkelä [ 2021-04-19 ]

If there was a way to have non-recursive mutexes on all platforms, the fallback implementation for futex-less systems would be simpler. On GNU/Linux (with GNU libc), pthread_mutex_t is non-recursive by default and "just works". On Microsoft Windows, and on some proprietary UNIX systems, mutexes are recursive by default. There is a way to explicitly request a mutex to be recursive, but nothing to request them to be non-recursive. Recursive mutexes are inherently incompatible with "ownership passing", which is a requirement for the asynchronous writes of pages that are protected by buf_block_t::lock.
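On POSIX systems that honour mutex type attributes, one can at least request error-checking (non-recursive) behaviour explicitly; the sketch below (helper name is mine, not from the InnoDB source) shows the behaviour a non-recursive mutex gives, which a recursive mutex would silently allow:

```cpp
#include <errno.h>
#include <pthread.h>

// Returns true when a second, re-entrant lock attempt by the same thread
// is rejected with EDEADLK, as POSIX specifies for PTHREAD_MUTEX_ERRORCHECK.
// A recursive mutex would grant the second acquisition instead.
bool errorcheck_rejects_relock()
{
  pthread_mutexattr_t attr;
  pthread_mutex_t m;
  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
  pthread_mutex_init(&m, &attr);

  pthread_mutex_lock(&m);
  const bool rejected= pthread_mutex_lock(&m) == EDEADLK;
  pthread_mutex_unlock(&m);

  pthread_mutex_destroy(&m);
  pthread_mutexattr_destroy(&attr);
  return rejected;
}
```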

Comment by Marko Mäkelä [ 2021-04-19 ]

SRW_LOCK_DUMMY was renamed to SUX_LOCK_GENERIC, because on Microsoft Windows, srw_lock will always wrap SRWLOCK even if that alternative implementation were enabled.

Comment by Marko Mäkelä [ 2021-05-05 ]

According to https://shift.click/blog/futex-like-apis/ documented futex equivalents do exist on some operating systems beyond Linux, OpenBSD and Microsoft Windows.

Furthermore, C++20 defines std::atomic_wait and std::atomic_notify_one.

Generated at Thu Feb 08 09:37:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.