[MDEV-21423] lock-free trx_sys gets performance regression caused by lf_find and ut_delay Created: 2020-01-05  Updated: 2024-01-02

Status: Stalled
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.0
Fix Version/s: 10.6

Type: Bug Priority: Major
Reporter: zongzhi chen Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 0
Labels: innodb, lock-free, performance
Environment:

linux


Attachments: PNG File 08569298-111E-4929-A922-9E9AE6541BDF.png     PNG File 50A6071F-CF5C-44EB-A8F5-14FAA7EFEFD4.png     PNG File image-2021-10-29-17-26-46-207.png     Text File load_test.cc     PNG File screenshot-1.png    
Issue Links:
Relates
relates to MDEV-28313 InnoDB transactions are not aligned a... Closed
relates to MDEV-28445 Secondary index locking invokes costl... Closed
relates to MDEV-30357 Performance regression in locking rea... Closed
relates to MDEV-13190 InnoDB write-only performance regression Open
relates to MDEV-20630 lf_hash get performance regression si... Confirmed
relates to MDEV-33067 SCN(Sequence Commit Number) based MVCC Open
Epic Link: InnoDB trx_sys improvements

 Description   

Hello, guys

We have ported the lock-free trx_sys, but I find that the oltp_read_write case shows a large performance regression compared with the non-lock-free version.
Especially when the isolation level is "read-committed", the lock-free trx_sys shows about a 40% performance regression.
I guess MariaDB has the same problem.

This is my sysbench test configure

bench_type=oltp_read_write;
threads=560
tables=8
table_size=500000

There is another issue that relates to the lock-free trx_sys:
https://jira.mariadb.org/browse/MDEV-20630?filter=-2

Below is the sysbench result:
You can see that lock-free trx_sys vs non-lock-free trx_sys is:

tps 13405.28 vs 20095.02 
qps 268105.66 vs 401900.40

lock-free trx_sys, isolation level read-committed:
 
SQL statistics:
    queries performed:
        read:                            33803098
        write:                           9658028
        other:                           4829014
        total:                           48290140
    transactions:                        2414507 (13405.28 per sec.)
    queries:                             48290140 (268105.66 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
 
General statistics:
    total time:                          180.1141s
    total number of events:              2414507
 
Latency (ms):
         min:                                    2.96
         avg:                                   41.75
         max:                                 4487.73
         95th percentile:                       92.42
         sum:                            100805088.64
 
Threads fairness:
    events (avg/stddev):           4311.6196/167.34
    execution time (avg/stddev):   180.0091/0.01

non-lock-free trx_sys, isolation level read-committed:
 
SQL statistics:
    queries performed:
        read:                            50672678
        write:                           14477908
        other:                           7238954
        total:                           72389540
    transactions:                        3619477 (20095.02 per sec.)
    queries:                             72389540 (401900.40 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
 
General statistics:
    total time:                          180.1161s
    total number of events:              3619477
 
Latency (ms):
         min:                                    2.47
         avg:                                   27.85
         max:                                  198.43
         95th percentile:                       52.89
         sum:                            100798260.68
 
Threads fairness:
    events (avg/stddev):           6463.3518/107.19
    execution time (avg/stddev):   179.9969/0.01



 Comments   
Comment by Sergey Vojtovich [ 2020-01-06 ]

baotiao, feels like another iterator issue. Was this ARM or Intel? How many cores?

Comment by zongzhi chen [ 2020-01-06 ]

This is Intel, with 64 cores.

I guess the regression is caused by iterating the lf_hash and copying the items to create a new read view.
The non-lock-free hash only needs to take the lock, do a memcpy, then unlock.

When there are few items in the lf_hash, iteration is faster than the lock/memcpy/unlock sequence. However, when there are many items in the lf_hash, iteration takes much more time.

So is there an operation that can directly copy the lf_hash and store it in the read view without a lock?

Comment by Sergey Vojtovich [ 2020-01-06 ]

No locks are acquired during the ReadView snapshot. I'm afraid a direct copy won't work either.

If we compare original:

mutex_enter(&trx_sys.mutex);
memcpy(snapshot, &trx_sys.rw_trx_ids, sizeof(trx_sys.rw_trx_ids));
mutex_exit(&trx_sys.mutex);

versus new:

lf_hash_iterate(copy_one_id);

memcpy() is certainly cheaper than lf_hash iteration. But it requires all 560 threads to acquire trx_sys.mutex, which is certainly more expensive.

I will need to look into that. It feels like the problem is the same as in MDEV-20630. IIRC for 560 threads we must have 1024 dummy nodes, which have to be iterated.

Did you try comparing 500 threads? Or 1000 threads?

Comment by zongzhi chen [ 2020-01-06 ]

Yes, I tried comparing at 560 threads.

If you make lf_hash_iterate have less overhead, then MDEV-20630 won't need to re-init the hash table size.

Comment by Sergey Vojtovich [ 2020-01-06 ]

Right, I guess even if MDEV-20630 can be worked around by re-initing the hash, this one can't be.

Testing 500 and 1000 threads (lf vs non-lf) should prove my guess (I'm expecting lf to be faster with these thread counts).

Comment by Sergey Vojtovich [ 2020-01-08 ]

I kind of reproduced this on my Broadwell
lf (e6a9ce27591abd612dbb1d6b89cf4b5f2cd24a42)
14601.65 (500 threads), 13634.36 (560 threads)

pre-lf (1464f4808c08155fd10ba09eae2bb2c3f177482c)
15786.70 (560 threads)

And I confirm the iterator is high in the profile.

baotiao, are you benchmarking on SkyLake?

Comment by Sergey Vojtovich [ 2020-01-08 ]

Maybe this is worth exploring:

diff --git a/mysys/lf_hash.c b/mysys/lf_hash.c
index a7c0767..76d45b2 100644
--- a/mysys/lf_hash.c
+++ b/mysys/lf_hash.c
@@ -419,7 +419,7 @@ int lf_hash_insert(LF_HASH *hash, LF_PINS *pins, const void *data)
     return 1;
   }
   csize= hash->size;
-  if ((my_atomic_add32(&hash->count, 1)+1.0) / csize > MAX_LOAD)
+  if ((my_atomic_add32(&hash->count, 1)+1.0) / csize > MAX_LOAD && csize < 128)
     my_atomic_cas32(&hash->size, &csize, csize*2);
   return 0;
 }

The above limits the number of dummy nodes. It brings performance back to the pre-lf level.

Comment by Sergei Golubchik [ 2020-01-23 ]

As far as I understand, if you have fewer dummy nodes, you'll have more hash collisions. Basically, all nodes between two dummy nodes are one hash bucket. So with a limit like csize < 128 you pretty much change the hash from O(1) to O(N).

Maybe it'd be better to increase MAX_LOAD? It'll still be O(1), but with a larger constant factor.

Comment by zongzhi chen [ 2020-02-21 ]

Increasing MAX_LOAD also causes more collisions; as you said, it changes the algorithm from O(1) to O(MAX_LOAD).

Comment by Sergey Vojtovich [ 2020-05-16 ]

Looks like my patches for MDEV-20630 did make some difference for this workload: https://github.com/MariaDB/server/commits/bb-10.5-svoj-MDEV-20630
I can see performance is brought almost to the pre-lf level.

Comment by Marko Mäkelä [ 2020-12-17 ]

I think that the lock-free hash table belongs to the runtime team, which sanja is the leader of. InnoDB trx_sys.rw_trx_hash is merely using that code. There are also bigger worries: ThreadSanitizer is giving warnings, and apparently there are occasional crashes on the ARM platform.

Comment by Daniel Black [ 2021-04-22 ]

ARM fixed in MDEV-23510 ( e0ba68ba34f5623cfa3c61e2e1966971703297f5 ). The structure was slightly changed (though the commit message is probably incorrect) and the misuse of volatile as a key was replaced. It's probably not enough to eliminate the pointer-dereference hits krizhanovsky noted in MDEV-20630.

Comment by Marko Mäkelä [ 2021-10-29 ]

In 10.6, thanks to MDEV-24167 having introduced a small rw-lock, we can do something similar to MDEV-20612 and have a regular hash table where we would embed the rw-locks. Possibly we can simply follow the solution of MDEV-22871 and MDEV-26828, and use a spinlock when a futex-like system call (MDEV-26476) is not available.

On ARMv8 we might want to implement MDEV-26609, that is, limit the number of pointers per cache line.

Comment by Vladislav Vaintroub [ 2021-10-29 ]

I'm running a synthetic benchmark, which does

 
UPDATE t1 set i = i+1

from 10 to 10 000 concurrent client connections, with table defined as

CREATE OR REPLACE TABLE t1 SELECT 0 as i;

Problem: if I run 256 clients after 10 000 clients, I get 700 000 queries per minute. If I run it before, I get 1 200 000 queries per minute. That's a 40% regression. The profiler confirms that it is searching for something in lf_hash.
match_pins alone takes 14% of the whole CPU.

Comment by Vladislav Vaintroub [ 2021-10-29 ]

I agree with marko. InnoDB should try another kind of hash that does not suffer garbage-collection issues, especially since there are now high-performance hashes in InnoDB that can be protected by the new scalable slim reader-writer lock, transactional memory lock guards, and what not.

Comment by Sergey Vojtovich [ 2021-10-29 ]

FWIW: the biggest problem with lf_hash is memory fragmentation, which causes iterator to thrash dTLB. I had ~80% performance improvement by allocating nodes close to each other. This solution is under 100 LOC.

Comment by Vladislav Vaintroub [ 2021-10-29 ]

I trust that, and I'd take whatever works, as long as it works. Currently the lf thing suffers GC issues, and a non-lf thing maybe would not. We're not using lock-free because lock-free algorithms are beautiful; it should improve performance, that's all.

Comment by Sergey Vojtovich [ 2021-10-29 ]

wlad uhm, does this issue have anything to do with GC? I thought it was about the poor performance of the lf_hash iterator, which doesn't touch GC at all.

Comment by Sergey Vojtovich [ 2021-10-29 ]

...and you would certainly not see this problem with UPDATE t1 set i = i+1, because it doesn't snapshot a ReadView. It should be something like sysbench oltp_rw, probably in read-committed mode, with backup locks disabled (unless they were fixed in the meantime).

Comment by Vladislav Vaintroub [ 2021-10-29 ]

Nah, I do see this problem exactly with this workload. Below is the full technology demonstration; I did not use sysbench this time. I also reran it, just to make sure I did not make a mistake. match_pins is around 15% CPU.
load_test.cc

Comment by Sergey Vojtovich [ 2021-10-29 ]

wlad, what you see is a different problem. The original problem was about the lf_hash iterator, but your updates unveil a pure lf_hash GC issue. Nevertheless, having GC (match_pins()) consume 15% of CPU is indeed not acceptable.

One thing worth keeping in mind is that these 15% come from 2 lf_hashes: MDL and trx_sys. There's also the lf_hash of the table definition cache on the way, but it has a different load pattern and shouldn't contribute to match_pins().

It is also nice to see numbers like 700 000 QPS and 1 200 000 QPS; IIRC with the pre-lf trx_sys (10.2) we couldn't go over ~100 000 QPS.

Speaking of updates. With the current lf_hash implementation, rw_trx_hash concurrency tends to 100%; that is, threads can perform hash operations without blocking each other. But it has overhead: GC.

If you partition rw_trx_hash and use a mutex/rw_lock for protection, concurrency is going to be lower, roughly (100 - 100 / N)%, where N is the number of partitions. That is, with 4 partitions it will be 75%, with 10 partitions 90%, etc.

I must admit I'm unsure which approach is going to provide better throughput. Partitioned hash sounds promising. Although lf_hash has room for improvement as well: it should be possible to make GC much faster than it currently is.

Speaking of the essence of the problem: the current MVCC implementation would probably never scale well beyond 10 000 connections. It has to snapshot the identifiers of all connections and organise them for quick access (currently sort+bsearch). Obviously, the more connections one has, the more expensive MVCC handling becomes. Dunno, probably there are better MVCC approaches out there.

Comment by Vladislav Vaintroub [ 2021-10-29 ]

svoj, to clarify: the numbers I gave are queries per minute, not per second. It would be fabulous if they were QPS, but in fact I ran the test for a minute and reported the overall number (bear in mind, it is a laptop with 4 real cores, so maybe it is still fine this way). There is also quite a lot of contention on the single row that the test operates on. The 15% comes almost entirely from trx_sys.

Function Name	Total CPU [unit, %]	Self CPU [unit, %]	Module	Category
| + [External Code]	152441 (99,85 %)	9526 (6,24 %)	Multiple modules	IO | Networking | Driver | Graphics | File System | Kernel
|| + tp_callback	128592 (84,22 %)	58 (0,04 %)	server.dll	IO | Networking | Driver | File System | Kernel
||| + threadpool_process_request	126337 (82,75 %)	56 (0,04 %)	server.dll	IO | Networking | Driver | File System | Kernel
|||| + do_command	125615 (82,27 %)	101 (0,07 %)	server.dll	IO | Networking | Driver | File System | Kernel
||||| + dispatch_command	125169 (81,98 %)	398 (0,26 %)	server.dll	IO | Networking | Driver | File System | Kernel
|||||| + mysql_parse	118937 (77,90 %)	98 (0,06 %)	server.dll	IO | Networking | Driver | File System | Kernel
||||||| + mysql_execute_command	112364 (73,60 %)	383 (0,25 %)	server.dll	IO | Networking | Driver | File System | Kernel
|||||||| - mysql_update	59004 (38,65 %)	822 (0,54 %)	server.dll	IO | Networking | File System | Kernel
|||||||| + trans_commit_stmt	46516 (30,47 %)	32 (0,02 %)	server.dll	IO | Networking | Driver | File System | Kernel
||||||||| + ha_commit_trans	46434 (30,41 %)	96 (0,06 %)	server.dll	IO | Networking | Driver | File System | Kernel
|||||||||| + ha_commit_one_phase	40267 (26,37 %)	13 (0,01 %)	server.dll	IO | Networking | Driver | File System | Kernel
||||||||||| + commit_one_phase_2	40254 (26,37 %)	69 (0,05 %)	server.dll	IO | Networking | Driver | File System | Kernel
|||||||||||| + innobase_commit	40037 (26,22 %)	55 (0,04 %)	server.dll	IO | Networking | Driver | File System | Kernel
||||||||||||| + innobase_commit_ordered_2	33424 (21,89 %)	20 (0,01 %)	server.dll	IO | Networking | Kernel
|||||||||||||| + trx_commit_for_mysql	33358 (21,85 %)	29 (0,02 %)	server.dll	IO | Networking | Kernel
||||||||||||||| + trx_t::commit	29979 (19,64 %)	28 (0,02 %)	server.dll	IO | Kernel
|||||||||||||||| + trx_t::commit_persist	29951 (19,62 %)	38 (0,02 %)	server.dll	IO | Kernel
||||||||||||||||| + trx_t::commit_in_memory	28322 (18,55 %)	179 (0,12 %)	server.dll	IO | Kernel
|||||||||||||||||| + lf_hash_delete	22832 (14,95 %)	66 (0,04 %)	server.dll	IO | Kernel
||||||||||||||||||| + l_delete	22701 (14,87 %)	35 (0,02 %)	server.dll	IO | Kernel
|||||||||||||||||||| + lf_pinbox_real_free	22477 (14,72 %)	127 (0,08 %)	server.dll	IO | Kernel
||||||||||||||||||||| + lf_dynarray_iterate	22347 (14,64 %)	17 (0,01 %)	server.dll	Kernel
|||||||||||||||||||||| + recursive_iterate	22330 (14,63 %)	242 (0,16 %)	server.dll	Kernel
||||||||||||||||||||||| + recursive_iterate	21569 (14,13 %)	431 (0,28 %)	server.dll	Kernel
|||||||||||||||||||||||| + match_pins	21138 (13,84 %)	21135 (13,84 %)	server.dll	Kernel
||||||||||||||||||||||||| - [System Call] ntoskrnl.exe	3 (0,00 %)	3 (0,00 %)	ntoskrnl.exe	Kernel
||||||||||||||||||||||| - match_pins	519 (0,34 %)	519 (0,34 %)	server.dll	

Comment by Sergey Vojtovich [ 2021-10-29 ]

oh, 4 cores, c'mon

Note that your profiler refers to lf_dynarray_iterate(), while the bug is about lf_hash_iterate().

But match_pins() is a problem for sure.

BTW, I'd suggest you check whether TLB load is an issue.

Comment by Marko Mäkelä [ 2021-10-31 ]

svoj, for buf_pool.page_hash and lock_sys.rec_hash we have one rw-lock per cache line. That is, on AMD64 with 64-byte cache lines, each rw-lock covers 7 pointers. As a side effect of acquiring the lock, you will also have loaded those 7 pointers.

MDEV-26609 demonstrated that on systems with 128-byte cache lines, it could be useful to use only 64 bytes of each cache line, to limit the lock contention.

The maximum number of concurrent transactions is innodb_page_size/16*128. With the default innodb_page_size=16k and a 64-byte cache line, we might allocate a hash array for all 131,072 elements, or 149,796 elements with the rw-locks (2,340 of them). I think that using a bit over 1 megabyte for the hash array might be acceptable. Using your formula, this would reduce concurrency by less than 5‰.

Comment by Sergey Vojtovich [ 2021-10-31 ]

marko, yeah, it may work and fix the issue described by Wlad. Although it is not evident (to me) whether it is going to fix snapshot performance, where you need to collect all active rw transaction ids as well as aggregate min(trx->no).

Luckily rw_trx_hash_t framework is friendly for experiments. Hope it can get tested on systems that have more than 4 laptop cores too.

Comment by Marko Mäkelä [ 2021-11-18 ]

svoj, to my understanding, our performance testing on Microsoft Windows is currently limited to laptop-grade hardware. On GNU/Linux we have more choice.

The hash table traversal for making a snapshot could acquire a shared lock on one hash array slice at a time. That lock could be elided on processors that support it, as in MDEV-26769. As far as I understand, the read view creation does not care about entries that would be added or removed afterwards.

Somewhat related to this, in MDEV-20612 we introduced a combination of lock_sys.latch and individual rw-locks inside lock_sys.rec_hash.array. An attempt to duplicate that would easily make the trx_sys equivalent of lock_sys.latch a bottleneck.

Comment by Vladislav Vaintroub [ 2021-11-18 ]

The 4-core laptop shows the performance bugs much better than GNU/Linux, where, no matter what patch is applied, all test differences usually fall within 1-2% noise (also, GNU/Linux has a real problem with providing succinct, readable, usable profiler output). Until Linux can provide a profile that can pinpoint a problem, I'll have to stick to Windows laptops, with 4 cores or whatever cores we'll have.

Comment by Marko Mäkelä [ 2022-01-28 ]

I have not been actively working on this, but I keep this "in progress" so that I will not forget about this. Hopefully I will be able to return to this soon.

Like I wrote earlier, if we assume the default innodb_page_size=16k, the new trx_sys.rw_trx_hash could be an array that is a bit over 1 megabyte on a 64-bit system. The first element in each cache line would be a slim mutex or rw-lock (4 or 8 bytes), and the subsequent elements would be pointers to something similar to what we currently have:

struct rw_trx_hash_element_t
{
// constructor and destructor omitted for clarity
  trx_id_t id; /* lf_hash_init() relies on this to be first in the struct */
 
  /**
    Transaction serialization number.
 
    Assigned shortly before the transaction is moved to COMMITTED_IN_MEMORY
    state. Initially set to TRX_ID_MAX.
  */
  Atomic_counter<trx_id_t> no;
  trx_t *trx;
  mysql_mutex_t mutex;
};

We might eliminate the mutex and simply let each hash array mutex protect the contents of all hash bucket chains that reside in that cache line (it would be 7 on AMD64 and up to 15 on IA-32). If we do that, we might just let the transaction commit number be a normal variable. But, we would need a next-bucket pointer. The new structure could be like this:

struct rw_hash_element
{
  trx_id_t id; // duplicate of trx->id, to avoid dereferencing trx
  trx_id_t no; // transaction commit number, for read view creation
  trx_t *trx;
  rw_hash_element *next;
};

This is 32 bytes per element (or 24 on a 32-bit system, or 16 if we omitted the redundant id field). We might try to pack multiple elements per cache line (2 on AMD64) to improve the locality of reference. If the element mutex turns out to be necessary for performance, then we would probably want to have only one element per cache line.

Comment by Marko Mäkelä [ 2022-04-07 ]

An alternative hash table element could be as follows:

struct element
{
  trx_id_t no;
  trx_t *trx;
};

In this variant, the trx->id would not be duplicated, and the next pointer would be in trx_t in the same cache line with n_ref, id, state and possibly also rw_trx_hash_element.

On AMD64, this would allow us to pack 3 elements and one rw-lock in a 64-byte cache line. On ARMv8, it would be 7 elements per 128-byte cache line.

We might go one more step further and move also the commit number to trx_t::no, and directly store trx_t pointers in the hash array (7 to 15 trx_t* per cache line). In trx_t, we would like all of the following to be in a single cache line: n_ref, state, id, no, next, rw_trx_hash_element, mutex. We seem to be fine: 4+4+8+8+ptr+ptr+4=36 or 44 bytes (more than 4 bytes for mutex if SUX_LOCK_GENERIC is needed, that is, the underlying operating system does not support anything futex-like).

If the underlying operating system does not support futex, I would use the 32-bit std::atomic based rw_lock as a pure spinlock, like we do for the buf_pool.page_hash latches.

Comment by Marko Mäkelä [ 2022-04-08 ]

wlad, I posted a rough prototype. It survived the regression tests, but I did not check performance yet. It turns out that thanks to having the commit number directly in trx_t::no, we do not need any back pointer from trx_t to the hash table element. The only extra thing we need is a trx_t::rw_trx_hash pointer to the next trx_t in the hash bucket chain.

Comment by Marko Mäkelä [ 2022-04-13 ]

I refined my prototype, implemented the cache line alignment of trx_t and while doing that, replaced pointless 8×256 bytes of padding with alignment constraints. I tried to determine the optimal compile-time constant size for the hash table. On my dual Intel Xeon E5-2630 v4 system (2×2×10 threads), the sweet spot of throughput (average number of transactions per second) according to my quick testing (30-second sysbench oltp_update_index with 8×100,000 rows and innodb_flush_log_at_trx_commit=0 on fast NVMe storage) appears to be 64 payload cells if we ignore the last column. For higher loads, 256 (and maybe even more) cells would seem to help.

n_cells 20 40 80 160 320 640
256 157663.74 181769.06 167924.72 184753.96 177438.89 173413.46
128 158605.59 182233.57 168804.65 185212.35 174068.23 167462.60
64 158671.71 182796.76 172557.28 184421.19 177325.83 166913.00
32 156305.56 181645.17 166668.97 182027.91 173413.46 165911.88

The performance dip at 80 concurrent connections was due to a checkpoint flush in the middle of my scripted prepare+benchmark run. In each case, the top throughput was reached at 160 concurrent connections (4 times the number of CPU hardware threads).
The compile-time parameter can be changed as follows:

diff --git a/storage/innobase/include/trx0sys.h b/storage/innobase/include/trx0sys.h
index 6caa02fcf6e..51f4bb8fe4f 100644
--- a/storage/innobase/include/trx0sys.h
+++ b/storage/innobase/include/trx0sys.h
@@ -330,7 +330,7 @@ constexpr uint32_t TRX_SYS_DOUBLEWRITE_SPACE_ID_STORED_N= 1783657386;
 class rw_trx_hash_t
 {
   /** number of payload elements in array[] */
-  static constexpr size_t n_cells= 128;
+  static constexpr size_t n_cells= 64;
   /** Hash cell chain */
   struct hash_chain
   {

Comment by Marko Mäkelä [ 2022-04-13 ]

I conducted some further tests on 10.9:

n_cells 20 40 80 160 320 640
512 168633.09 178296.24 133440.01 125156.74 121173.66 128847.09
256 169249.16 178831.63 92456.05 115434.76 121497.26 130257.14
128 168593.40 174614.98 137922.36 122947.22 120110.94 126650.67
64 166168.22 177690.33 137831.50 122340.87 122275.52 124477.58
10.9 167970.59 190672.82 142246.60 138316.21 136684.18 104582.04

Like with the 10.6 test, at 80 concurrent connections there was some concurrent checkpoint flushing activity.

Unfortunately, we can see that at higher concurrency, the lock-free hash table performs better. For 10.6 it did not seem to be the case. The table below duplicates the table of the previous comment, but adds a bottom row for the baseline performance:

n_cells 20 40 80 160 320 640
256 157663.74 181769.06 167924.72 184753.96 177438.89 173413.46
128 158605.59 182233.57 168804.65 185212.35 174068.23 167462.60
64 158671.71 182796.76 172557.28 184421.19 177325.83 166913.00
32 156305.56 181645.17 166668.97 182027.91 173413.46 165911.88
10.6 155038.21 181711.80 169520.11 185770.37 177964.56 165578.40

The different characteristics in 10.9 could be explained by MDEV-14425 and related changes in 10.8. My microbenchmarks are definitely too small for drawing any conclusion.

In any case, I think that the cache line optimized trx_t allocation could be adopted (after some performance testing).

Comment by Marko Mäkelä [ 2022-04-14 ]

I applied some cleanup in MDEV-28313 and rebased the lock-free hash table replacement on it. We can only observe a consistent performance regression:

n_cells 20 40 80 160 320 640
1024 157178.19 184725.49 173541.78 187346.91 177222.87 166366.98
512 158742.81 184503.73 168706.27 187812.76 177468.55 166178.17
256 158204.75 183234.77 171892.55 186343.56 176749.12 166453.43
128 160765.79 184147.99 168384.02 187145.88 176469.76 165900.91
64 160610.13 185161.94 169319.69 187618.77 178563.65 166107.32
10.6 158265.02 187125.55 172186.49 191218.58 182169.38 169287.10

The numbers at 80 concurrent connections are unreliable, because a checkpoint flush occurred during that in my microbenchmark.

Comment by Marko Mäkelä [ 2022-04-14 ]

I merged MDEV-28313 to 10.9 to create a new baseline, and then merged this on top. The results are just as bad as with 10.6. We can only observe an improvement at 20 concurrent connections; there is a regression in all other cases:

n_cells 20 40 80 160 320 640
1024 173175.26 175387.86 132482.90 119908.04 120288.41 128415.88
512 173790.21 175669.04 131650.50 120310.14 120169.48 129179.20
256 173540.89 178958.53 131378.92 122983.30 121310.14 125331.50
128 174969.21 173994.14 123826.27 117920.23 116900.31 145942.48
64 173819.50 132632.64 131318.75 124120.98 123778.56 133658.73
10.9+MDEV-28313 169572.38 191460.20 137424.75 137625.91 141308.08 151053.27

Comment by Marko Mäkelä [ 2022-04-26 ]

Apart from some recovery code, there are two sources of trx_sys.rw_trx_hash traversal, which will be affected by MDEV-20630:

  1. cloning a read view
    • in trx_sys.snapshot_ids() executed via ReadViewBase::snapshot() and ReadView::open()
    • in trx_sys.clone_oldest_view() executed via purge_sys.clone_oldest_view()
  2. determining the minimum transaction identifier in any read view

In MySQL 8.0.29, trx_rw_min_trx_id() was partly removed as redundant and partly replaced with a simpler operation.

I created something similar, replacing trx_sys.get_min_trx_id() with trx_sys.find_same_or_older(). I expect it to improve performance of operations that involve secondary indexes. The new hash table traversal trx_sys.find_same_or_older() employs short circuit evaluation, in a strictly read-only operation (not even acquiring any mutex).

Comment by Marko Mäkelä [ 2022-04-28 ]

I reran the 30-second 8×100,000-row Sysbench oltp_update_index with innodb_flush_log_at_trx_commit=0 to get some quick indication of the impact:

revision 20 40 80 160 320 640
10.6+patch 158434.28 185131.85 170423.99 190336.86 186661.45 179461.32
10.6 161207.93 185221.65 171324.52 189943.30 186307.93 177596.38
10.9+MDEV-28313+patch+MDEV-26603 174544.91 178149.89 110558.41 125144.57 127529.77 147725.99
10.9+MDEV-28313+patch 171584.13 182691.25 136949.96 130384.91 130686.76 144726.54
10.9+MDEV-28313 172770.79 182122.98 110902.51 127673.10 127307.35 143449.37
10.9+MDEV-28313 (previous run) 169572.38 191460.20 137424.75 137625.91 141308.08 151053.27

The last two rows indicate that there is quite a bit of variation in the throughput, in addition to the checkpoint glitch that occurs during the 80-connection test.

The combination with MDEV-26603 must also be tested against a baseline with innodb_flush_log_at_trx_commit=1:

revision 20 40 80 160 320 640
10.9+MDEV-28313+patch+MDEV-26603 38357.41 77825.55 148901.55 159469.58 128778.08 138870.00
10.9+MDEV-28313+patch 45049.02 85527.73 150008.44 160022.04 126585.47 142077.57

So, unfortunately even this fix does not cure the counterintuitive regression revealed by MDEV-26603.
Side note: At 160 concurrent connections, the durable configuration innodb_flush_log_at_trx_commit=1 resulted in better throughput than innodb_flush_log_at_trx_commit=0, possibly thanks to the group commit locks acting as a throttle that prevented more costly contention elsewhere.

axel, can you please run your standard benchmarks on 10.6+patch?

Comment by Marko Mäkelä [ 2022-04-29 ]

I filed a separate ticket MDEV-28445 for the clean-up, because it did not show any difference (for the better or the worse) in axel’s standard test battery. Therefore, we cannot claim that it would fix this performance regression.

I guess that our standard test batteries might not exercise locking conflicts at all, especially on secondary indexes. Something bigger like TPC-C might show a difference.

Comment by Marko Mäkelä [ 2023-03-14 ]

MDEV-28445 caused a performance regression MDEV-30357. As a part of the fix, I would implement a cache that avoids some repeated traversal of trx_sys.rw_trx_hash in repeated invocations of trx_sys_t::find_same_or_older() within the same transaction.

Comment by Marko Mäkelä [ 2023-10-01 ]

MDEV-20630 may be more rewarding to fix first. I see that the lock-free hash table is using std::memory_order_seq_cst, while a less intrusive memory order (or explicit memory barriers) might work. I have not studied that code in much detail. What I attempted so far was to make InnoDB invoke the expensive operations less often (MDEV-28445, MDEV-30357), and to replace the lock-free hash table trx_sys.rw_trx_hash with a locking one, which resulted in worse performance.

Comment by JiraAutomate [ 2023-12-05 ]

Automated message:
----------------------------
Since this issue has not been updated since 6 weeks, it's time to move it back to Stalled.

Generated at Thu Feb 08 09:07:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.