The InnoDB adaptive hash index has performance problems when queries that use it run in parallel (at least on the same tables).
Here is an example of running the same SELECT query with 10-256 concurrent threads
(the query normally takes about 0.36 seconds). Without the adaptive hash index:
Total execution with 10 threads time: 1 seconds
Total execution with 32 threads time: 1 seconds
Total execution with 64 threads time: 3 seconds
Total execution with 128 threads time: 5 seconds
Total execution with 256 threads time: 9 seconds
When using --innodb-adaptive-hash-index=1 we get:
Total execution with 10 threads time: 1 seconds
Total execution with 32 threads time: 5 seconds
Total execution with 64 threads time: 9 seconds
Total execution with 128 threads time: 18 seconds
Total execution with 256 threads time: 50 seconds
MySQL 5.7 does not have this issue; it has roughly the same speed with and without the adaptive hash index.
In the above query, there was a driving table with 400 rows, and for every row from the driving table, 200 lookups were done in a child table on a secondary key.
Using innodb-adaptive-hash-index=1 speeds up the query by 37% when running it in one thread, but it slows down drastically when running 32-256 concurrent threads.
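The actual schema is not shown here; a minimal sketch of the described access pattern (all table and column names invented) might be:

CREATE TABLE driving (id INT PRIMARY KEY, k INT) ENGINE=InnoDB;
CREATE TABLE child (id INT PRIMARY KEY, k INT, KEY sk(k)) ENGINE=InnoDB;

-- one execution performs roughly 400 * 200 secondary index lookups via sk
SELECT COUNT(*) FROM driving JOIN child ON child.k = driving.k;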
Michael Widenius added a comment - The included program tries to emulate this issue. It does not show the same bad ratio as the above, but it shows something:
10.11.9 with adaptive hash index:
Total execution with 1 threads time: 1 seconds
Total execution with 10 threads time: 6 seconds
Total execution with 16 threads time: 11 seconds
Total execution with 32 threads time: 20 seconds
Total execution with 64 threads time: 46 seconds
Total execution with 128 threads time: 106 seconds
Total execution with 256 threads time: 405 seconds
10.11.9 without adaptive hash indexes:
Total execution with 1 threads time: 4 seconds
Total execution with 16 threads time: 26 seconds
Total execution with 32 threads time: 45 seconds
Total execution with 64 threads time: 88 seconds
Total execution with 128 threads time: 178 seconds
Total execution with 256 threads time: 363 seconds
MariaDB 10.4 with adaptive hash indexes:
Total execution with 1 threads time: 3 seconds
Total execution with 16 threads time: 15 seconds
Total execution with 32 threads time: 21 seconds
Total execution with 64 threads time: 43 seconds
Total execution with 128 threads time: 86 seconds
Total execution with 256 threads time: 172 seconds
10.4 shows execution time roughly proportional to the number of threads.
10.11 gets worse at 128-256 threads.
The original workload shows the issue already at 32 threads (a 5x slowdown).
Marko Mäkelä added a comment - perf top -g $(pgrep mariadbd) highlights the following call stack, which I guess could be a contributing factor:
10.11 21b20712a3fe7ac44291b398a3731e514c23c8a4
- 15,89% btr_cur_t::search_leaf(dtuple_t const*, page_cur_mode_t, btr_latch_mode, mtr_t*)
- 13,04% btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, mtr_t*)
4,04% ssux_lock_impl<true>::rd_wait()
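For reference, a profile like this can be captured with standard perf usage along the following lines (the exact flags here are an assumption, not quoted from the report):

perf record -g -p "$(pgrep mariadbd)" -- sleep 10
perf report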
Marko Mäkelä added a comment - I confirmed a potential source of conflict: btr_search_check_free_space_in_heap() is acquiring an adaptive hash index partition latch in exclusive mode, even though it could start with a shared latch and try to upgrade it (MDEV-34791) if needed. Another such place is btr_search_drop_page_hash_index(), which is temporarily releasing the shared latch instead of trying to upgrade it to exclusive. Even after I changed these (https://github.com/MariaDB/server/pull/3562), I am seeing a lot of time being spent in the ssux_lock_impl<true>::rd_wait() spin loop. I see that also btr_search_build_page_hash_index() and btr_search_drop_page_hash_index() are being executed during the workload, but I don’t think that we can avoid the exclusive latch there.
I also observed that more than half the CPU time of btr_search_guess_on_hash() is spent in the ha_search_and_get_data() loop. This could suggest that the hash bucket chains are too long. I was using innodb_buffer_pool_size=128m, and the size of the adaptive hash index bucket array is a fixed portion of the buffer pool size.
Marko Mäkelä added a comment - I realized that the latch upgrade in btr_search_check_free_space_in_heap() can cause deadlocks during DDL operations. What we definitely can do is to release the exclusive latch before invoking buf_block_free(block). That should avoid an obvious bottleneck, which could lead to multiple threads waiting on buf_pool.mutex while holding multiple AHI partition latches and therefore blocking many other threads that would be happy with a shared latch.
Marko Mäkelä added a comment - The latch upgrade logic seems to open many new cans of worms. I see that MySQL 5.7 avoids acquiring any AHI latch in btr_search_check_free_space_in_heap(). In MySQL 8.0 that seems to be done more correctly, using std::atomic. I revised the pull request to implement that idea. I think that it could very well explain the observed regression between MySQL 5.7 and MariaDB.
Marko Mäkelä added a comment - I realized that using atomic memory operations for updating the pointer would reintroduce the race condition bug MDEV-22646. However, the fix for that bug made the locking stricter than necessary, and the following would seem to fix it in 10.4:
diff --git a/storage/innobase/btr/btr0sea.cc b/storage/innobase/btr/btr0sea.cc
index b6f8f46e3f5..a6263861beb 100644
--- a/storage/innobase/btr/btr0sea.cc
+++ b/storage/innobase/btr/btr0sea.cc
@@ -200,7 +200,7 @@ btr_search_check_free_space_in_heap(const dict_index_t* index)
 	hash_table_t*	table;
 	mem_heap_t*	heap;
 
-	rw_lock_x_lock(latch);
+	rw_lock_s_lock(latch);
 
 	if (!btr_search_enabled) {
 		goto func_exit;
@@ -211,12 +211,13 @@ btr_search_check_free_space_in_heap(const dict_index_t* index)
 
 	if (heap->free_block == NULL) {
 		heap->free_block = block;
+		rw_lock_s_unlock(latch);
 	} else {
 func_exit:
+		rw_lock_s_unlock(latch);
 		buf_block_free(block);
 	}
-	rw_lock_x_unlock(latch);
 }
A shared latch is sufficient for preventing a conflict with a concurrent allocation in ha_insert_for_fold(), which would be protected by an exclusive latch.
While working on this, I noticed that there is quite a bit of conditional code in mem0mem.* that is specific to the adaptive hash index. I think that it is better to move that code to btr0sea.cc, specifically to ha_insert_for_fold(). That way, the only additional overhead that users of mem_heap_t will have to pay is the storage for heap->free_block, which would be nullptr everywhere except in the adaptive hash index.
Marko Mäkelä added a comment - monty, can you please test this? For 10.6 and later, I updated https://github.com/MariaDB/server/pull/3562/ with a patch that replaces some srw_spin_lock with srw_lock. With the innodb-hashtest.sh workload, that could make a huge difference, because shared/exclusive conflicts will occur due to ha_delete_hash_node() in the workload.
Marko Mäkelä added a comment - I think that we can refactor this further and create a custom allocator for the adaptive hash index, instead of using mem_heap_t. In that way, the replacement of btr_search_check_free_space_in_heap() could avoid acquiring any rw-latch and simply use std::atomic::compare_exchange_strong().
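A minimal sketch of that idea, assuming the custom allocator caches one spare buffer block per AHI partition (buf_block_t, buf_block_alloc() and buf_block_free() are simplified stand-ins here, not the real InnoDB definitions):

#include <atomic>

struct buf_block_t { unsigned char frame[16384]; }; // stand-in
static buf_block_t *buf_block_alloc() { return new buf_block_t; }
static void buf_block_free(buf_block_t *b) { delete b; }

struct ahi_partition_alloc
{
  std::atomic<buf_block_t*> free_block{nullptr};

  // Make sure one spare block is cached, without holding any rw-latch.
  void prepare_insert()
  {
    if (free_block.load(std::memory_order_acquire))
      return;                    // a spare block is already cached
    buf_block_t *b= buf_block_alloc();
    buf_block_t *expected= nullptr;
    if (!free_block.compare_exchange_strong(expected, b,
                                            std::memory_order_release,
                                            std::memory_order_relaxed))
      buf_block_free(b);         // another thread stashed one first
  }
};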
Marko Mäkelä added a comment - I reverted the MDEV-22646 fix from 10.4 in this experimental change. It is barely good enough for a performance experiment: mariabackup would crash, and the tests innodb.innodb_buffer_pool_fail and innodb.temporary_table would lead to a server crash on shutdown. Still, it should be good enough to assess whether the performance difference between 5.7 and 10.4 is only due to the MDEV-22646 fix, or whether there could be other explanations, such as some adjustments made for instant ALTER TABLE (MDEV-11369 and MDEV-15562).
Marko Mäkelä added a comment - Percona Server 5.7 differs from MySQL Server 5.7 by the introduction of two new parameters, btr_search_sys_constant_mem and btr_search_sys_variable_mem, which seem to affect accounting only.
There also is a simpler algorithm for choosing the AHI partition:
 	const ulint	ahi_slot
-		= ut_fold_ulint_pair(static_cast<ulint>(index_id),
-				     static_cast<ulint>(block->page.id.space()))
-		% btr_ahi_parts;
+		= static_cast<ulint>(index_id) % btr_ahi_parts;
I think that we can employ this idea. For rec_fold() I would be wary of doing that.
Marko Mäkelä added a comment - I filed MDEV-35189 and MDEV-35190 for some optimizations that I came up with while investigating this.
It turns out that in MySQL 8.0.30, something radical was done about the inefficient rec_fold() and dtuple_fold() hash functions. I think that we should investigate something simple and efficient, such as invoking my_crc32c(uint32_t(index_id),rec,size) on the necessary prefix (the first few fields of the record). I am a little worried that the exact key distribution is important. In fact, I was suspecting that MySQL Bug#111538, which mdcallag highlighted in his recent blog post, would be related to that hash function change. However, when I went through the recent MySQL 8.0.40 changes in MDEV-35170, it seemed to be something else, possibly related to the MySQL 8.0 implementation of instant ADD/DROP column, which is quite different from MariaDB. That said, MariaDB could have some adaptive hash index performance regression due to MDEV-11369 and MDEV-15562.
Marko Mäkelä added a comment - Another change in MySQL 8.0 is definitely of interest, to make things well defined:
Bug#33601434 InnoDB: AHI-related fields n_fields, n_bytes, left_side should be atomically set
Before the transition to C++11 in MariaDB Server 10.4, there were no good options to fix this.
Marko Mäkelä added a comment - It occurred to me that we can adapt an idea from the MDEV-20612 lock_sys.rec_hash and partition the adaptive hash index latch further. We would only acquire a partition latch in exclusive mode when we are in the process of enabling or disabling the entire adaptive hash index. Anything else would use a combination of a shared latch and an rw-latch in the hash array cache line that covers the bucket of interest. In fact, if we rewrite the memory management to work without mem_heap_t, we can go further and remove the partition latch as well as the partitioning altogether. This would deprecate and ignore the parameter innodb_adaptive_hash_index_parts, basically going back to 1 partition, like it was up to MySQL 5.6 or MariaDB Server 10.1.
Enabling or disabling the adaptive hash index would have to acquire exclusive latches on all hash array cache lines. A disabled adaptive hash index might be indicated by having nullptr in an std::atomic<hash_chain*> array pointer; this would require some careful logic around disabling the adaptive hash index.
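A rough sketch of the layout idea (types simplified; std::shared_mutex is far too large for a real cache line, and InnoDB would use its own compact rw-latch):

#include <cstddef>
#include <cstdint>
#include <shared_mutex>

struct hash_node { hash_node *next; uint32_t fold; };

// Each aligned block carries its own latch and the buckets it protects,
// so most lookups and inserts never touch a partition-wide latch.
struct alignas(64) hash_line
{
  std::shared_mutex latch; // covers only the buckets below
  hash_node *bucket[4];
};

struct hash_array
{
  hash_line *lines;
  size_t n_lines;

  bool contains(uint32_t fold) const
  {
    hash_line &l= lines[(fold / 4) % n_lines];
    std::shared_lock<std::shared_mutex> s(l.latch); // readers scale
    for (const hash_node *n= l.bucket[fold % 4]; n; n= n->next)
      if (n->fold == fold)
        return true;
    return false;
  }
};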
Mark Callaghan added a comment - Please confirm that the new hash functions in MySQL 8.0.30+ have latency (CPU overhead) similar to the previous versions.
I just spent more time than I wanted tracking down odd perf regressions related to ut::random_from_interval_fast, which is used with ut_delay: https://smalldatum.blogspot.com/2024/10/innodb-busy-wait-loops-gone-bad-in.html
Marko Mäkelä added a comment - MariaDB got rid of the random delays years ago in MDEV-14482, because the global state of the random number generator had been found to be a contention point.
Marko Mäkelä added a comment - There is a rather astonishing amount of overhead related to the adaptive hash index. Today I rewrote the memory allocation so that the adaptive hash index will allocate the hash table nodes directly from buf_page_t::frame without any additional alignment or 200-byte safety margin. There is no need to bloat mem_heap_t with anything that is related to the adaptive hash index. Probably something similar could be done about the lock_sys allocations.
One last thing that I am still missing is pushing down the rw-locks to the hash table array. Once that is done, it should be practical to test these changes. I might leave the hash function alone as part of this task, because the removal of some duplicated hash table traversal could already do wonders.
Marko Mäkelä added a comment - A reasonably efficient and radically different implementation of rec_fold() would be my_crc32c(index_id, rec, n_bytes). For ROW_FORMAT=REDUNDANT, any fixed-length fields that are NULL are filled with NUL bytes (data_write_sql_null() in rec_convert_dtuple_to_rec_old()). For any other format, NULL values will occupy 0 bytes in the payload. For the clustered index, the prefix that the adaptive hash index is covering can never include NULL values, because all PRIMARY KEY columns must be NOT NULL. For a secondary index, the ROW_FORMAT may matter. For example, if we have the following:

CREATE TABLE t(a INT PRIMARY KEY, b INT UNIQUE) ENGINE=InnoDB ROW_FORMAT=REDUNDANT;
INSERT INTO t SET a=1;

then the secondary index record payload would be the 8 octets 0x0000000080000001, corresponding to (b,a)=(NULL,1) with the sign bit inverted for memcmp() compatible comparisons. In any other ROW_FORMAT, the record payload would be the 4 octets 0x80000001. This can be taken care of by piecewise invocation of my_crc32c() in the replacement of dtuple_fold().
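A sketch of such a piecewise fold for the search tuple, reusing the crc32c() stand-in from the earlier sketch (dfield here is a simplified stand-in for InnoDB's dfield_t):

#include <cstddef>
#include <cstdint>

uint32_t crc32c(uint32_t crc, const void *data, size_t len); // see above

struct dfield { const unsigned char *data; size_t len; }; // simplified

// Feed each field to the CRC separately; this lets the tuple-side fold
// mimic the physical format of the record it must match, for example
// NUL-filled fixed-length NULL fields in ROW_FORMAT=REDUNDANT.
inline uint32_t dtuple_fold_crc32c(uint64_t index_id,
                                   const dfield *fields, size_t n_fields)
{
  uint32_t crc= uint32_t(index_id);
  for (size_t i= 0; i < n_fields; i++)
    crc= crc32c(crc, fields[i].data, fields[i].len);
  return crc;
}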
Marko Mäkelä added a comment - I changed the hash function to CRC-32C and am working on making rec_fold() determine the end of the record prefix on the fly, without requiring offsets to be allocated and computed separately. That should help quite a lot.
The functions btr_search_drop_page_hash_index() and btr_search_build_page_hash_index() are allocating some transient memory from the heap. If those could be changed to stack memory allocation, contention should be reduced further. Possibly, if we push down the rw-lock to the hash table cells (similar to buf_pool.page_hash and lock_sys.rec_hash), we could use a "buffer" of only 1 element, instead of first computing the hash values in the array and then acquiring a coarser adaptive hash index latch for the operation.
We need a "buffer" of two last records, because there can be multiple consecutive records with an identical prefix. Depending on the bool left_side parameter, the adaptive hash index will store the first or last such record. I think that the function btr_search_build_page_hash_index() can be rewritten to use just 1-record "buffer", if we retain an ability to update the last inserted record pointer. Similarly, btr_search_drop_page_hash_index() can remember the last fold value and skip the call to ha_remove_all_nodes_to_page() for "duplicate" values.
I also plan to rewrite cmp_dtuple_rec_with_match() to work without precomputed offsets. That should remove the last prominent source of heap memory allocation.
After implementing these changes, and after pushing down the rw-lock to the hash table array, I think that this should be ready for some performance testing. That could take 1 to 2 weeks.
Marko Mäkelä added a comment - We need a "buffer" of the last two records, because there can be multiple consecutive records with an identical prefix. Depending on the bool left_side parameter, the adaptive hash index will store the first or the last such record. I think that the function btr_search_build_page_hash_index() can be rewritten to use just a 1-record "buffer", if we retain the ability to update the last inserted record pointer. Similarly, btr_search_drop_page_hash_index() can remember the last fold value and skip the call to ha_remove_all_nodes_to_page() for "duplicate" values.
I also plan to rewrite cmp_dtuple_rec_with_match() to work without precomputed offsets. That should remove the last prominent source of heap memory allocation.
After implementing these changes, and after pushing down the rw-lock to the hash table array, I think that this should be ready for some performance testing. That could take 1 to 2 weeks.
Marko Mäkelä added a comment (edited) - I rewrote the comparison functions and enabled page_cur_search_with_match_bytes() independently of the adaptive hash index. That one could improve INSERT performance.
Edit: In btr_page_get_split_rec_to_right() there is a comment that refers to sequential inserts (that is, page_cur_search_with_match_bytes()) and the adaptive hash index. The comment was added in MySQL 4.0.19 (MariaDB commit). That change forgot to adjust the condition page_header_get_field(page, PAGE_N_DIRECTION) > 3 to one less in page_cur_search_with_match_bytes().
Marko Mäkelä added a comment - I rewrote page_cur_search_with_match() so that it will not compute offsets. That is actually outside the scope of the adaptive hash index but very similar to page_cur_search_with_match_bytes().
A challenge that is related to pushing down the rw-locks to the hash array is buffer pool resizing. Next, I am going to reinstate innodb_adaptive_hash_index_parts. The performance issue related to btr_search_check_free_space_in_heap() should be mitigated in its replacement btr_sea::partition::prepare_insert(), which would be protected by a separate btr_sea::partition::blocks_mutex instead of an exclusive btr_sea::partition::latch. Some tests should soon indicate whether this would yield acceptable performance.
Marko Mäkelä added a comment (edited) - I took a deeper look at the MySQL hash function change. The change is large because of some excessive renaming (such as replacing "fold" with "hash"). They are not eliminating any offsets computation, and the function rec_hash() (renamed from rec_fold()) is still invoking a hash function separately for each field. Only the low-level hash function was replaced: ut::hash_binary() instead of ut_fold_ulint_pair().
I found https://stackoverflow.com/questions/10953958/can-crc32-be-used-as-a-hash-function which seems to agree with my thinking that CRC-32C could indeed make a suitable non-cryptographic hash function.
I had been wondering about the function ut_hash_ulint(), mainly about the usefulness of the exclusive-or operation. A simplified definition is as follows:

typedef size_t ulint;

inline ulint ut_hash_ulint(ulint key, ulint table_size) noexcept
{
  return (key ^ 1653893711) % table_size;
}

I see that Oracle simplified this to key % table_size, which is much more meaningful. The prime number sizes that we are using should ensure that the values will be spread evenly. Somewhat related to this is MDEV-35247.
As far as I can tell, the main drawback of using CRC-32C is that we would only have 32 bits of entropy before we choose the hash table cell. The size of each adaptive hash index array is innodb_buffer_pool_size/512/innodb_adaptive_hash_index_parts. With the maximum number of partitions (512), we would not exceed 2³² elements per array until the buffer pool size exceeds 2⁵⁰ bytes, or 2 PiB. That is not yet a real limit, because the virtual address space of many contemporary 64-bit processor implementations is only 48 bits (256 TiB). So, I would simply go for the SIMD-accelerated CRC-32C.
Marko Mäkelä added a comment - I ran a reduced version of innodb-hashtest.sh (one million rows in the larger table instead of 10,000,000) on an 8-year-old laptop (2-core Skylake CPU with 2-way hyperthreading), storing the database in /dev/shm. This should be sufficient for finding CPU or memory bottlenecks. With the current version of the code, I see the following:
Total execution with 1 threads time: 3 seconds
Total execution with 2 threads time: 4 seconds
Total execution with 4 threads time: 9 seconds
Total execution with 8 threads time: 12 seconds
Total execution with 16 threads time: 26 seconds
Total execution with 32 threads time: 68 seconds
Anything above 2 threads is really overdoing it. During the 32-thread run, I collected two perf record samples, first without -g:
17,42% mariadbd mariadbd [.] btr_search_guess_on_hash(dict_index_t*, dtuple_t const*, page_cur_mode_t, btr_latch_mode, btr_cur_t*, mtr_t*)
13,45% mariadbd mariadbd [.] ssux_lock_impl<true>::rd_wait()
6,83% mariadbd mariadbd [.] row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)
4,21% mariadbd mariadbd [.] void rec_init_offsets_comp_ordinary<false, false>(unsigned char const*, dict_index_t const*, unsigned short*, unsigned long, dict_col_t::def_t const*, rec_leaf_format)
2,97% mariadbd mariadbd [.] evaluate_join_record(JOIN*, st_join_table*, int)
2,74% mariadbd mariadbd [.] int cmp_dtuple_rec<true>(dtuple_t const&, unsigned char const*, dict_index_t const&, unsigned short*, unsigned long)
2,69% mariadbd mariadbd [.] Row_sel_get_clust_rec_for_mysql::operator()(row_prebuilt_t*, dict_index_t*, unsigned char const*, que_thr_t*, unsigned char const**, unsigned short**, mem_block_info_t**, dtuple_t**, mtr_t*)
2,66% mariadbd [vdso] [.] __vdso_gettimeofday
2,47% mariadbd mariadbd [.] btr_cur_t::search_leaf(dtuple_t const*, page_cur_mode_t, btr_latch_mode, mtr_t*)
2,20% mariadbd mariadbd [.] sub_select(JOIN*, st_join_table*, bool)
1,99% mariadbd mariadbd [.] row_sel_store_mysql_field(unsigned char*, row_prebuilt_t*, unsigned char const*, dict_index_t const*, unsigned short const*, unsigned long, mysql_row_templ_t const*)
1,85% mariadbd libc.so.6 [.] __memcmp_avx2_movbe
1,67% mariadbd mariadbd [.] row_sel_field_store_in_mysql_format_func(unsigned char*, mysql_row_templ_t const*, unsigned char const*, unsigned long)
1,63% mariadbd mariadbd [.] mtr_memo_slot_t::release() const
1,57% mariadbd mariadbd [.] cmp_data_data(unsigned long, unsigned long, unsigned char const*, unsigned long, unsigned char const*, unsigned long)
I included cmp_data_data() because that is what the rewritten adaptive hash index code would invoke. I see that btr_search_guess_on_hash() is spending the bulk of its CPU time in a spin loop on either the adaptive hash index partition latch (there can be at most 2 distinct AHI partitions in this test case) or a buffer page latch. The ssux_lock_impl<true>::rd_wait() could be the slow path for either one. I tried to disable the spin loop again:
diff --git a/storage/innobase/include/btr0sea.h b/storage/innobase/include/btr0sea.h
index 9f910cb13c7..efdf916b994 100644
--- a/storage/innobase/include/btr0sea.h
+++ b/storage/innobase/include/btr0sea.h
@@ -123,7 +123,7 @@ struct btr_sea
   struct partition
   {
     /** latch protecting the hash table */
-    alignas(CPU_LEVEL1_DCACHE_LINESIZE) srw_spin_lock latch;
+    alignas(CPU_LEVEL1_DCACHE_LINESIZE) srw_lock latch;
     /** map of CRC-32C of rec prefix to rec_t* in buf_page_t::frame */
     hash_table_t table;
     /** latch protecting blocks, spare */
It would not be the first time we find out that spin loops are bad at some thread counts:
Total execution with 1 threads time: 3 seconds
Total execution with 2 threads time: 3 seconds
Total execution with 4 threads time: 6 seconds
Total execution with 8 threads time: 12 seconds
Total execution with 16 threads time: 25 seconds
Total execution with 32 threads time: 61 seconds
But as we can see, at higher thread counts it does not help much.
My effort to rewrite the memory allocation seems to have paid off: btr_sea::partition::prepare_insert() (which replaces btr_search_check_free_space_in_heap()) never allocated any blocks from the buffer pool, and it would not acquire an exclusive AHI partition latch either. An exclusive AHI partition latch would be acquired as part of the following, during the workload:
btr_sea::partition::insert(unsigned int, unsigned char const*)
btr_cur_t::search_info_update() const
btr_cur_t::search_leaf(dtuple_t const*, page_cur_mode_t, btr_latch_mode, mtr_t*)
Row_sel_get_clust_rec_for_mysql::operator()(…)
row_search_mvcc(…)
This involves an inline invocation of btr_search_update_hash_ref(). I will have to figure out why this is being invoked; neither the data nor the access pattern should be changing here.
I initially thought that the following would acquire an exclusive latch, but in fact it is only acquiring a shared latch during this workload:
btr_search_drop_page_hash_index(buf_block_t*, bool)
buf_page_get_low(page_id_t, unsigned long, rw_lock_type_t, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)
btr_cur_t::search_leaf(dtuple_t const*, page_cur_mode_t, btr_latch_mode, mtr_t*)
This is something that was originally added in MDEV-22456 for a significant customer. I tried disabling it:
diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
index 50b503eef51..a716f6ab7a7 100644
--- a/storage/innobase/buf/buf0buf.cc
+++ b/storage/innobase/buf/buf0buf.cc
@@ -3051,7 +3051,7 @@ buf_page_get_low(
       mtr->lock_register(mtr->get_savepoint() - 1, MTR_MEMO_BUF_FIX);
       goto corrupted;
     }
-#ifdef BTR_CUR_HASH_ADAPT
+#if 0
     btr_search_drop_page_hash_index(block, true);
 #endif /* BTR_CUR_HASH_ADAPT */
Removing that call slightly improves performance:
Total execution with 1 threads time: 2 seconds
Total execution with 2 threads time: 4 seconds
Total execution with 4 threads time: 4 seconds
Total execution with 8 threads time: 9 seconds
Total execution with 16 threads time: 20 seconds
Total execution with 32 threads time: 63 seconds
This call definitely needs to be moved up in the call stack, to all code paths that may actually access B-tree pages. Those callers only need to access the AHI if the buf_block_t::index is a non-null pointer that is different from the expected index object.
I must also find the reason why btr_search_update_hash_ref() is invoked. That is the main problem here.
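The caller-side shape could be roughly the following sketch (expected_index stands for whichever index the caller is about to use; this is not actual code from the tree):

	/* Only consult the AHI when the page is cached there for some
	other index than the one we are about to use. */
	if (dict_index_t* cached = block->index) {
		if (cached != expected_index) {
			btr_search_drop_page_hash_index(block, true);
		}
	}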
Marko Mäkelä added a comment - I used rr record and rr replay to determine why the adaptive hash index is being rebuilt during a further reduced workload (with 100k instead of 10M rows in the bigger table). Specifically, I ran the following commands in rr replay:
continue
break btr_search_update_hash_ref
reverse-continue
disable 1
watch -l cursor->flag
reverse-continue
to find out why the last occurrence took place.
It turns out that btr_search_guess_on_hash() would have set cursor->flag = BTR_CUR_HASH_FAIL even though the guess was valid and had merely been wrongly deemed invalid. I reviewed my rewrite of btr_search_check_guess(), and the logic looks equivalent. The problem is this part (quoting the code in MariaDB Server 10.6 before my work-in-progress changes):
	if (page_rec_is_supremum(next_rec)) {
		if (!page_has_next(page_align(next_rec))) {
			cursor->up_match = 0;
			success = true;
		}

		goto exit_func;
	}
During handler::ha_index_next_same(), we had a match (cmp==0) for the PRIMARY KEY(a) in table t2. That record was the last one on a page, but not the last one in the table. It would seem that in this case, the correct course of action would have been to set the up_match equal to low_match and declare victory. This sub-optimal logic was there already in MySQL 3.23.53.
It seems that we are missing special treatment for the case cmp == 0 && dict_index_get_n_unique_in_tree(index) == tuple->n_fields_cmp. In that case, we should know that the next record will be larger than the search tuple.
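In the style of the code quoted above, the missing special case could look roughly like this (a sketch, not a tested patch):

	if (cmp == 0
	    && dict_index_get_n_unique_in_tree(index)
	    == tuple->n_fields_cmp) {
		/* The search tuple matches the record on a prefix that
		is unique within the tree, so the successor record must
		be greater than the search tuple: declare victory. */
		cursor->up_match = cursor->low_match;
		success = true;
		goto exit_func;
	}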
Marko Mäkelä added a comment - All InnoDB versions so far are setting cursor->flag = BTR_CUR_HASH_FAIL in some further cases that definitely do not warrant a rebuild of the adaptive hash index:
- if we would have to wait for a buffer page latch; well, this is an "inconclusive" result, not a failure; the same operation executed a bit earlier or later could have succeeded
- if the mode is PAGE_CUR_G or PAGE_CUR_L and we do get a match of the key; well, shouldn’t we avoid the AHI in these modes to begin with?
Marko Mäkelä added a comment - There is some fluctuation in the numbers on my system, but removing some "baseless" BTR_CUR_HASH_FAIL seems to have done the trick:
Total execution with 1 threads time: 3 seconds
Total execution with 2 threads time: 3 seconds
Total execution with 4 threads time: 5 seconds
Total execution with 8 threads time: 11 seconds
Total execution with 16 threads time: 20 seconds
Total execution with 32 threads time: 46 seconds
I checked a 10-second perf record -g sample taken during one 32-thread run, and btr_sea::partition::insert disappeared, along with any rd_wait(). It looks like most if not all context switches are due to preemption (asm_sysvec_reschedule_ipi and irqentry_exit_to_user_mode in the Linux kernel). (Executing 32 CPU-bound threads on only 2 hardware threads of the 8-year-old laptop is asking for exactly this kind of trouble.)
I did not push this change yet, because I need to address some regression test suite failures for the debug instrumented build. Once ready, I think that this latest development should be worth testing on a larger system. I should still figure out a replacement for the btr_search_drop_page_hash_index() call in buf_page_get_low().
Marko Mäkelä added a comment - I did not yet figure out a way to avoid BTR_CUR_HASH_FAIL when the buffer page latch can’t be acquired without waiting. But I fixed the problem that caused bogus failures when we find an exact match at the end of a page using PAGE_CUR_LE. I also merged the 3 parameters left_side,n_bytes,n_fields into a single 32-bit field left_bytes_fields, so that the parameters can be compared and copied more efficiently. There are two such sets of parameters in the block descriptor: one for the "next" AHI and one for the currently built adaptive hash index. I hope that the "next" parameters can be removed, but I did not try it yet.
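A sketch of how the three parameters might pack into a single word (the bit layout below is invented, not taken from the actual patch):

#include <cstdint>

struct ahi_left_bytes_fields
{
  uint32_t word;

  static constexpr uint32_t LEFT_SIDE= 1U << 31;

  // bit 31: left_side; bits 30..16: n_bytes; bits 15..0: n_fields
  static uint32_t pack(bool left_side, uint32_t n_bytes, uint32_t n_fields)
  {
    return (left_side ? LEFT_SIDE : 0) |
      (n_bytes & 0x7fff) << 16 | (n_fields & 0xffff);
  }

  bool left_side() const { return word & LEFT_SIDE; }
  uint32_t n_bytes() const { return (word >> 16) & 0x7fff; }
  uint32_t n_fields() const { return word & 0xffff; }
};

Comparing or copying all three parameters then becomes a single 32-bit load, store or comparison.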
Marko Mäkelä added a comment - I tested the current development using Sysbench oltp_update_non_index, including the removal of BTR_CUR_HASH_FAIL when the page latch cannot be acquired.
connections    throughput/tps
  1            10619.75
  6            13095.61
 12            14040.93
 24            23538.36
 48            59772.14
 72            98002.21
This may not be as good as the innodb-hashtest.sh that monty created, but the adaptive hash index does get some use during this workload. Below is a sample from the 72-connection run:
6.51% mariadbd [.] btr_search_guess_on_hash(dict_index_t*, dtuple_t const*,
0.01% mariadbd [.] btr_search_drop_page_hash_index(buf_block_t*, bool)
0.00% mariadbd [.] btr_sea::partition::insert(unsigned int, unsigned char c
0.00% mariadbd [.] btr_search_build_page_hash_index(dict_index_t*, buf_bloc
I’m still going to see if I can remove one set of AHI parameters from buf_block_t.
Marko Mäkelä added a comment - Removing one set of AHI parameters is indeed possible, and sizeof(buf_block_t) would shrink by about 64 bits, to 152 bytes. One last thing that I want to do is to move the btr_search_drop_page_hash_index(block, true) call from buf_page_get_low() to the callers that really need it, to reduce a performance regression due to MDEV-22456.
Michael Widenius added a comment - Here are the new numbers from the latest MDEV-35049 tree:
Server started with:
sql/mariadbd --innodb-buffer-pool-size=32G --innodb-buffer-pool-load-at-startup=0 --innodb-log-file-size=1G --innodb-fast-shutdown=0 --innodb_adaptive_hash_index=1 --max-connections=300
Data is 2.8G
Total execution with 16 threads time: 1 seconds
Total execution with 32 threads time: 3 seconds
Total execution with 64 threads time: 5 seconds
Total execution with 128 threads time: 10 seconds
Total execution with 256 threads time: 20 seconds
Marko Mäkelä added a comment - I was testing this on my dual-core Skylake laptop with a smaller amount of data (10,000+100,000 rows), which I suppose will lead to ENGINE=MEMORY being used for any temporary tables, instead of ENGINE=Aria:
Total execution with 1 threads time: 1 seconds
Total execution with 2 threads time: 0 seconds
Total execution with 4 threads time: 2 seconds
Total execution with 8 threads time: 2 seconds
Total execution with 16 threads time: 5 seconds
Total execution with 32 threads time: 10 seconds
During the execution at 16 or 32 threads, I collected perf record data for 5 seconds. I do see some writes, which I did not expect to see there:
+ 29.78% 21.89% mariadbd mariadbd [.] btr_search_guess_on_hash(dict_index_t*, dtuple_t const*, bool, btr_latch_mode, btr_cur_t*, mtr_t*)
0.16% 0.02% mariadbd mariadbd [.] btr_search_update_hash_ref(btr_cur_t const&, buf_block_t*, unsigned int)
0.04% 0.01% mariadbd mariadbd [.] btr_search_build_page_hash_index(dict_index_t*, buf_block_t*, btr_sea::partition&, unsigned int)
0.02% 0.01% mariadbd mariadbd [.] btr_search_drop_page_hash_index(buf_block_t*, dict_index_t const*)
0.00% 0.00% mariadbd mariadbd [.] btr_sea::partition::prepare_insert()
Related to this, there also is a tiny portion of futex waits inside InnoDB:
0.14% 0.14% mariadbd mariadbd [.] srw_mutex_impl<true>::wait_and_lock()
0.06% 0.06% mariadbd mariadbd [.] ssux_lock_impl<true>::rd_wait()
0.04% 0.04% mariadbd mariadbd [.] ssux_lock_impl<true>::wr_wait(unsigned int)
I had some trouble reproducing these calls in rr replay, but in the end I did reproduce it under GDB. It turns out that the AHI parameters for the index are alternating between left_side=true,n_fields=1 and left_side=false,n_bytes=1,n_fields=0. Maybe there should be some way to prevent this.
Marko Mäkelä
added a comment - The reason why btr_search_info_update_hash() was changing the adaptive hash index parameters is that it was being invoked on a supremum pseudo-record, which can never be part of any adaptive hash index. As far as I can tell, this deficiency was not introduced in my refactoring. It was always there, maybe harder to hit than the 2 other cases where the AHI is being rebuilt for no reason. After I added an early return for that case, ahi.sh on my 2-core laptop would complete a tiny bit faster:
Total execution with 1 threads time: 1 seconds
Total execution with 2 threads time: 0 seconds
Total execution with 4 threads time: 1 seconds
Total execution with 8 threads time: 3 seconds
Total execution with 16 threads time: 4 seconds
Total execution with 32 threads time: 10 seconds
I did not observe any waits in InnoDB. I found only the following while the test was executing at 16 or 32 concurrent connections:
0.00% 0.00% mariadbd libc.so.6 [.] __futex_abstimed_wait_common
0.00% 0.00% mariadbd [kernel.kallsyms] [k] futex_wait
0.00% 0.00% mariadbd [kernel.kallsyms] [k] __futex_wait
0.00% 0.00% mariadbd [kernel.kallsyms] [k] futex_wait_queue
I did not find any call stacks for these; it might be that I’d need a libc.so that is compiled with -fno-omit-frame-pointer. Next, I’ll test this on a system with lots of hardware threads to see if any context switches would take place. On the poor 2-core laptop, there is plenty of preemption going on (asm_sysvec_apic_timer_interrupt, irqentry_exit_to_user_mode, schedule being executed in the kernel).
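The early return itself is a one-line guard. A minimal sketch, assuming this branch’s btr_cur_t layout (page_cur.rec holding the positioned record); the actual patch may differ:
void btr_cur_t::search_info_update() noexcept
{
  /* The supremum pseudo-record can never be part of the adaptive
  hash index, so it must not influence the recommended prefix. */
  if (page_rec_is_supremum(page_cur.rec))
    return;
  /* ... existing heuristics that update the AHI recommendation ... */
}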
Michael Widenius
added a comment - After the last push today, combined with some fixes in Aria (to be added to 10.5 and up), we have in my tests:
sql/mariadbd --innodb-buffer-pool-size=32G --innodb-buffer-pool-load-at-startup=0 --innodb-log-file-size=1G --innodb-fast-shutdown=0 --innodb_adaptive_hash_index=1 --max-connections=300 --aria_pagecache_buffer_size=512M --max_heap_table_size=256M --tmp_memory_table_size=256M
Total execution with 32 threads time: 2 seconds
Total execution with 64 threads time: 2 seconds
Total execution with 128 threads time: 5 seconds
Total execution with 256 threads time: 11 seconds
Marko Mäkelä
added a comment - ahi.sh on a 16-core, 32-thread Intel® Xeon® Gold 6208U:
Creating tables (will take a while)
Total execution with 1 threads time: 0 seconds
Total execution with 2 threads time: 1 seconds
Total execution with 4 threads time: 1 seconds
Total execution with 8 threads time: 2 seconds
Total execution with 16 threads time: 4 seconds
Total execution with 32 threads time: 6 seconds
I see no signs of preemption and hardly any waits:
0.00% 0.00% mariadbd mariadbd [.] vio_socket_io_wait
0.00% 0.00% mariadbd libpthread-2.31.so [.] pthread_cond_timedwait@@
0.00% 0.00% mariadbd mariadbd [.] thd_wait_begin
I also checked
perf record -e context-switches -g -p $(pgrep mariadbd) -- sleep 4
near the end of the workload (mostly when running 16 concurrent queries):
- 76.48% ha_innobase::general_fetch
- 74.64% row_search_mvcc
- 65.44% Row_sel_get_clust_rec_for_mysql::operator()
- 60.05% btr_cur_t::search_leaf
- 59.53% btr_search_guess_on_hash
- 57.95% 0xffffffff8be00a34
0xffffffff8b203e97
0xffffffff8b203bff
0xffffffff8bce20c2
0xffffffff8bce20c2
- 1.45% btr_cur_t::check_mismatch
- 1.05% 0xffffffff8be00a34
…
I assumed that these are minor page faults (such as TLB misses). I checked again with perf record -e faults, but it did not highlight these; it highlighted some writes instead. It looks like there was some STATS_AUTO_RECALC action going on. After I amended the CREATE TABLE statements with STATS_PERSISTENT=0, those writes went away and perf record -e faults captured no samples. The perf record -e context-switches output remained unaffected by this. The scalability is rather good:
Creating tables (will take a while)
Total execution with 1 threads time: 1 seconds
Total execution with 2 threads time: 1 seconds
Total execution with 4 threads time: 1 seconds
Total execution with 8 threads time: 2 seconds
Total execution with 16 threads time: 3 seconds
Total execution with 32 threads time: 7 seconds
Once we exceed the number of CPU cores (16), the execution time will roughly double (there seems to be almost 1 second of jitter in the reported times). Until then it seems to scale rather well.
Marko Mäkelä
added a comment - We could make the AHI less eager to rebuild itself for unique indexes. The AHI for CREATE UNIQUE INDEX ID_IND ON SYS_TABLES(ID) had initially been built for 1 field (the unique SYS_TABLES.ID), which makes sense. On a DROP TABLE, we are switching to build the adaptive hash index on the entire secondary index (ID, NAME). That is going to ruin most searches:
Thread 18 hit Hardware watchpoint 2: -location $i.search_info.left_bytes_fields.m._M_i
Old value = 0x80000001
New value = 0x2
…
(rr) backtrace
…
#2 btr_search_info_update_hash (
cursor=@0x794b3c16e300: {page_cur = {index = 0x578be8059d68, rec = 0x794b3b0e85af "", offsets = 0x0, block = 0x794b3a80d200}, purge_node = 0x0, thr = 0x0, flag = BTR_CUR_BINARY, tree_height = 0x1, up_match = 0x0, up_bytes = 0x7, low_match = 0x2, low_bytes = 0x0, n_bytes_fields = 0x3c16e360, fold = 0x794b, path_arr = 0x0, rtr_info = 0x0}) at /data/Server/10.6-MDEV-35049B/storage/innobase/btr/btr0sea.cc:588
#3 0x0000578be5e35afb in btr_cur_t::search_info_update (this=this@entry=0x794b3c16e300) at /data/Server/10.6-MDEV-35049B/storage/innobase/btr/btr0sea.cc:1557
#4 0x0000578be5e29ce8 in btr_cur_t::search_leaf (this=this@entry=0x794b3c16e300, tuple=tuple@entry=0x794b200ccfc8, mode=PAGE_CUR_LE, latch_mode=<optimized out>, mtr=mtr@entry=0x794b401650a0)
at /data/Server/10.6-MDEV-35049B/storage/innobase/btr/btr0cur.cc:1547
#5 0x0000578be5e2f987 in btr_pcur_open_with_no_init (mtr=0x794b401650a0, cursor=0x794b3c16e300, latch_mode=<optimized out>, mode=<optimized out>, tuple=0x794b200ccfc8)
at /data/Server/10.6-MDEV-35049B/storage/innobase/include/btr0pcur.inl:322
#6 btr_pcur_t::restore_position (this=this@entry=0x794b3c16e300, restore_latch_mode=<optimized out>, restore_latch_mode@entry=BTR_SEARCH_LEAF, mtr=mtr@entry=0x794b401650a0)
at /data/Server/10.6-MDEV-35049B/storage/innobase/btr/btr0pcur.cc:439
#7 0x0000578be5d94575 in row_sel_restore_pcur_pos (plan=plan@entry=0x794b3c16e2f0, mtr=mtr@entry=0x794b401650a0) at /data/Server/10.6-MDEV-35049B/storage/innobase/row/row0sel.cc:1468
#8 0x0000578be5d9d550 in row_sel (node=node@entry=0x794b3c16e200, thr=thr@entry=0x794b3c1701d0) at /data/Server/10.6-MDEV-35049B/storage/innobase/row/row0sel.cc:1775
#9 0x0000578be5d9e35a in row_sel_step (thr=thr@entry=0x794b3c1701d0) at /data/Server/10.6-MDEV-35049B/storage/innobase/row/row0sel.cc:2435
#10 0x0000578be5d2d0f2 in que_thr_step (thr=thr@entry=0x794b3c1701d0) at /data/Server/10.6-MDEV-35049B/storage/innobase/que/que0que.cc:530
#11 0x0000578be5d2d3f3 in que_run_threads_low (thr=thr@entry=0x794b3c1701d0) at /data/Server/10.6-MDEV-35049B/storage/innobase/que/que0que.cc:610
#12 0x0000578be5d2d4b4 in que_run_threads (thr=0x794b3c1701d0) at /data/Server/10.6-MDEV-35049B/storage/innobase/que/que0que.cc:630
#13 0x0000578be5d2d8df in que_eval_sql (info=info@entry=0x794b080484d8,
sql=sql@entry=0x578be625c248 "PROCEDURE DROP_TABLE() IS\niid CHAR;\nDECLARE CURSOR idx IS\nSELECT ID FROM SYS_INDEXES\nWHERE TABLE_ID=:id FOR UPDATE;\nBEGIN\nDELETE FROM SYS_TABLES WHERE ID=:id;\nDELETE FROM SYS_COLUMNS WHERE TABLE_ID=:i"..., trx=trx@entry=0x794b41407380) at /data/Server/10.6-MDEV-35049B/storage/innobase/que/que0que.cc:669
#14 0x0000578be5e9deef in trx_t::drop_table (this=this@entry=0x794b41407380,
Marko Mäkelä
added a comment - My attempt to optimize the AHI latch release in btr_search_guess_on_hash() caused various race conditions. I was wrongly releasing the latch before checking block->index and block->page.state(). I also implemented a temporary workaround for MDEV-35485, so that I was able to test this. I hope that this will fix the bulk of failures that were observed in stress tests so far.
Vladislav Lesin
added a comment - - edited I did not find obvious errors during code review. The fixed-size allocator, CRC-32C as the hash function for the hash index, hashing a contiguous array of bytes instead of separate fields, `part.blocks_mutex` along with `part.spare`, and other changes should improve hash index performance.
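For illustration only (InnoDB’s actual implementation uses hardware-accelerated CRC instructions where available), folding one contiguous byte range with a bitwise CRC-32C looks like this:
#include <cstddef>
#include <cstdint>

/* Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82f63b78),
applied to a single contiguous buffer rather than field by field */
uint32_t fold_bytes(const void *data, size_t len)
{
  const uint8_t *p= static_cast<const uint8_t*>(data);
  uint32_t crc= ~0U;
  while (len--)
  {
    crc^= *p++;
    for (int i= 0; i < 8; i++)
      crc= (crc >> 1) ^ (0x82f63b78U & -(crc & 1));
  }
  return ~crc;
}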
Marko Mäkelä
added a comment - - edited I analyzed rr replay traces from mleich that displayed 2 problems when innodb_adaptive_hash_index=OFF:
(analysis completed) A race condition MDEV-35508 that was caused by MDEV-34515, unrelated to these changes.
(analysis in progress) 2 traces where the first sign of trouble is that page_validate() reports Records in wrong order for the change buffer ibuf.index. I will need to adjust MDEV-35312, which is part of this branch. This was a pre-existing bug, MDEV-35525.
I’m still waiting for stress test results with innodb_change_buffering=none and innodb_adaptive_hash_index=ON. The tests so far were conducted before I had fixed the race condition in btr_search_guess_on_hash().
Marko Mäkelä
added a comment - I analyzed the 2 rr replay traces where the change buffer got corrupted. Both seem to be caused by something that may have been broken in MDEV-30400 already. That bug would only affect the modes BTR_SEARCH_PREV (backward index scans) and BTR_MODIFY_PREV (only used by the change buffer) when an index tree is at least 2 levels high. I will file a separate bug for that, once the tentative fix has been tested.
Marko Mäkelä
added a comment - For the record, monty mentioned that running the larger innodb-hashtest.sh revealed the performance bugs MDEV-35469 and MDEV-35470 outside InnoDB. I don’t think I noticed any ENGINE=Aria writes while running the smaller ahi.sh.
Marko Mäkelä
added a comment - Hopefully one of the last bugs was that on ROW_FORMAT=REDUNDANT records, rec_fold() determined the record prefix length incorrectly when the last included column was NULL. I wrote storage/innobase/unittest/innodb_ahi-t.cc to cover both dtuple_fold() and rec_fold().
Marko Mäkelä
added a comment - I merged the 10.6-based development branch with the main branch and created the branch bb-11.8-innodb-ahi-cursor for further testing.
The most recent rounds of tests produced some core dumps for debug assertion failures !block->n_pointers, which indicate corruption of the adaptive hash index. Hopefully we can get a simplified grammar for reproducing that, or an rr replay trace.
CREATE TABLE t (pk INT PRIMARY KEY, f VARCHAR(128), FULLTEXT (f)) ENGINE=InnoDB;
INSERT INTO t VALUES (1,'curriculum'),(2,'educator'),(3,'two-thirds'),(4,'needle'),(5,'swimming'),(6,'receiver'),(7,'sole'),(8,'publicly'),(9,'honest'),(10,'bitter'),(11,'staff'),(12,'stroke'),(13,'front'),(14,'primary'),(15,'journal'),(16,'requirement'),(17,'science'),(18,'essay'),(19,'application'),(20,'oak'),(21,'provide'),(22,'citizen'),(23,'bathroom'),(24,'sport'),(25,'successful'),(26,'supportive'),(27,'schedule'),(28,'proceed'),(29,'partial'),(30,'sovereignty'),(31,'promising'),(32,'storage'),(33,'twist'),(34,'capable'),(35,'organize'),(36,'basic'),(37,'statue'),(38,'replace'),(39,'anyway'),(40,'smooth'),(41,'bother'),(42,'orientation'),(43,'halfway'),(44,'them'),(45,'sauce'),(46,'trouble'),(47,'occasional'),(48,'belt'),(49,'craft'),(50,'conscience'),(51,'large'),(52,'recommendation'),(53,'slam'),(54,'around'),(55,'projection'),(56,'hurricane'),(57,'remove'),(58,'year'),(59,'bishop'),(60,'measure'),(61,'roman'),(62,'science'),(63,'grade'),(64,'steam'),(65,'dough'),(66,'genius'),(67,'certainly'),(68,'encounter'),(69,'legislature'),(70,'dangerous'),(71,'pride'),(72,'nobody'),(73,'toilet'),(74,'chapter'),(75,'inform'),(76,'its'),(77,'regional'),(78,'nearly'),(79,'borrow'),(80,'widely'),(81,'overwhelming'),(82,'barely'),(83,'lead'),(84,'personality'),(85,'preparation'),(86,'question'),(87,'broken'),(88,'attend'),(89,'day'),(90,'recently'),(91,'recipient'),(92,'tear'),(93,'lady'),(94,'distribute'),(95,'terrible'),(96,'thank'),(97,'consist'),(98,'distinct'),(99,'amendment'),(100,'discipline'),(101,'billion'),(102,'oxygen'),(103,'lightly'),(104,'resume'),(105,'vast'),(106,'stir'),(107,'life'),(108,'joke'),(109,'faculty'),(110,'connection'),(111,'pleasure'),(112,'hopefully'),(113,'favorite'),(114,'heaven'),(115,'ratio'),(116,'deem'),(117,'kiss'),(118,'medicine'),(119,'wrist'),(120,'quest'),(121,'mixture'),(122,'fatal'),(123,'conference'),(124,'news'),(125,'undergo'),(126,'sunlight'),(127,'queen'),(128,'supplier'),(129,'allow'),(130,'hunt'),(131,'birthday'),(132,'socially'),(133,'sidewalk'),(134,'journey'),(135,'become'),(136,'historian'),(137,'contain'),(138,'establish'),(139,'fleet'),(140,'self'),(141,'publicity'),(142,'treat'),(143,'tool'),(144,'condemn'),(145,'iron'),(146,'luck'),(147,'sea'),(148,'defensive'),(149,'foreigner'),(150,'advance');
Marko Mäkelä
added a comment - In an rr replay trace that elenst provided for a ut_ad(!block->n_pointers) debug assertion failure, the LEFT_SIDE flag of the adaptive hash index for a single-column PRIMARY KEY index is changing: n_bytes_fields is alternating between 1 and LEFT_SIDE|1. This would seem to be yet another performance deficiency. It seems to me that the LEFT_SIDE flag needs to be ignored in case the adaptive hash index covers all unique index fields.
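One possible shape of that rule, as a stand-alone sketch. It assumes the packed parameter word observed in this branch (the most significant bit being buf_block_t::LEFT_SIDE, the low 16 bits n_fields); the helper itself is hypothetical:
#include <cstdint>

constexpr uint32_t LEFT_SIDE= 1U << 31;

/* Hypothetical: consider two AHI parameter words equivalent when
they differ only in LEFT_SIDE and the hashed prefix already covers
all the unique fields, so a flip of the flag alone cannot force a
rebuild of the adaptive hash index. */
bool ahi_params_equivalent(uint32_t a, uint32_t b, uint16_t n_uniq)
{
  if (a == b)
    return true;
  if ((a ^ b) != LEFT_SIDE)
    return false;
  return uint16_t(a) >= n_uniq;
}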
Marko Mäkelä
added a comment - - edited There is a bug in btr_search_drop_page_hash_index(). In the rr replay trace from elenst that I analyzed, we have PAGE_N_RECS 410, and btr_search_build_page_hash_index() is incrementing the debug counter block->n_pointers to that much, in batches of up to 64, when a while (n_cached) loop is invoking ha_insert_for_fold().
btr_search_build_page_hash_index() will invoke ha_remove_all_nodes_to_page() less than 410 times. That function includes a loop similar to the one in btr_search_build_page_hash_index() to fill and process the stack-allocated buffer several times, but something is wrong about that. I have yet to analyze how many 128-entry batches it is processing.
Marko Mäkelä
added a comment - I think I finally figured it out. I wanted to remove any heap memory allocation from the adaptive hash index code. That is why I introduced fixed-size stack-allocated buffers for inserting or deleting multiple hash entries pointing to the records in a page. In the trace that I analyzed, we have two threads that are concurrently trying to remove a LEFT_SIDE attribute of an adaptive hash index that was built on a leaf page of a 1-column PRIMARY KEY. Because the fixed-size buffer is being exhausted in both threads, both of them will release the part.latch and compute the next batch of rec_fold() in the buffer. We have two interleaved executions of btr_search_build_page_hash_index() that invoke btr_search_drop_page_hash_index(). One of them finished; the other one hit the assertion failure.
Before the memory allocation optimization, we would allocate a large enough buffer from the heap, fill in the rec_fold() values for the entire page, then acquire the part.latch and check if the parameters are as expected, and then drop the entries. With this multi-batch operation, things obviously get trickier. It would be unacceptable to hold part.latch across rec_fold() calls when we are only holding a shared buf_page_t::lock. We do check the AHI parameters between batches, but differences of the buf_block_t::LEFT_SIDE flag are deliberately ignored, because the original code did not check that flag. Not masking that flag should fix this race condition:
diff --git a/storage/innobase/btr/btr0sea.cc b/storage/innobase/btr/btr0sea.cc
index f77d70eccb2..20e459fce92 100644
--- a/storage/innobase/btr/btr0sea.cc
+++ b/storage/innobase/btr/btr0sea.cc
@@ -1336,8 +1336,7 @@ void btr_search_drop_page_hash_index(buf_block_t *block,
ut_a(block->index == index);
}
- if ((block->ahi_left_bytes_fields ^ n_bytes_fields) &
- ~buf_block_t::LEFT_SIDE)
+ if (block->ahi_left_bytes_fields != left_bytes_fields)
{
/* Someone else has meanwhile built a new hash index on the page,
with different parameters */
This fix should lead to backing off and trying harder to drop all entries pointing to the page. The performance anomaly of unnecessarily flipping the LEFT_SIDE flag will have to be fixed separately.
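As a stand-alone model of the corrected protocol (the helpers are hypothetical stand-ins; the real code walks page records under a shared buf_page_t::lock and takes part.latch per batch):
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <mutex>

std::mutex part_latch;                  // stands in for part.latch
std::atomic<uint32_t> ahi_params;       // block->ahi_left_bytes_fields

size_t next_fold_batch(uint32_t (&folds)[128]);     // hypothetical
void erase_folds(const uint32_t *folds, size_t n);  // hypothetical

/* Returns false if another thread rebuilt the AHI with different
parameters (now compared exactly, including LEFT_SIDE) between two
batches; the caller then backs off and retries the drop. */
bool drop_in_batches(uint32_t expected_params)
{
  uint32_t folds[128];
  /* Fold computation happens without part.latch on purpose: holding
  it across rec_fold() calls would block all AHI users. */
  while (size_t n= next_fold_batch(folds))
  {
    std::lock_guard<std::mutex> latch(part_latch);
    if (ahi_params.load(std::memory_order_relaxed) != expected_params)
      return false;
    erase_folds(folds, n);
  }
  return true;
}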
Marko Mäkelä
added a comment - The memory leak on shutdown was interesting. With ./mtr --rr I finally figured out a simple fix that works for both the test cases that elenst provided.
Marko Mäkelä
added a comment - I checked why btr_search_info_update_hash() is trying to remove the LEFT_SIDE attribute of the adaptive hash index for the single-column PRIMARY KEY(pk) of the table simple_db.DD. We have the following condition:
else if (uint16_t(left_bytes_fields) >= n_uniq && cursor.up_match >= n_uniq)
/* The search would have succeeded using the recommended prefix */
goto increment_potential;
Only the first 3 bytes of the 4-byte value of pk had matched the successor record in row_purge_reset_trx_id(). The search for the current record was a full match (cursor.low_match=1). Because this is a primary key index, by design the above condition on cursor.up_match can never be fulfilled in this case. If we have PRIMARY KEY(pk), the next key must differ from the current one by at least the last byte, which was the case here (pk=4972 and pk=4973). The btr_search_guess_on_hash() lookup for the key pk=4972 had been inconclusive in btr_cur_t::search_leaf(), because a concurrent INSERT operation prevented the acquisition of the page latch. btr_cur_t::search_leaf() had better avoid invoking btr_cur_t::search_info_update() in such cases.
I also double-checked that the logic in btr_search_info_update_hash() is equivalent to how it worked before my simplification of some data structures. We set the LEFT_SIDE flag if and only if up_match is not smaller than low_match. It seems that this would lead to the most common search mode PAGE_CUR_LE choosing not to set the LEFT_SIDE flag. But the default parameters are LEFT_SIDE|1. It is unclear to me whether that logic makes sense when the adaptive hash index is being built on all the unique fields of the index.
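As a tiny stand-alone model of that rule (the helper and the values are illustrative, not the actual code):
#include <cassert>

/* The recommendation rule as described above: prefer the left side
when the successor matched at least as many fields as the record
that was located. */
static bool choose_left_side(unsigned up_match, unsigned low_match)
{
  return up_match >= low_match;
}

int main()
{
  /* The purge example above: a full match on the 1-field key
  (low_match=1), while the successor pk=4973 matches 0 complete
  fields of pk=4972, so the recommendation drops LEFT_SIDE. */
  assert(!choose_left_side(0, 1));
  return 0;
}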
Marko Mäkelä
added a comment - There was one more observed ut_ad(block->n_pointers) assertion failure. It is the result of a race condition between btr_search_drop_page_hash_index() and btr_search_update_hash_ref() between two SELECT operations. The workload is also frequently rebuilding the adaptive hash index for this index, between 1 field (left side) and 2 fields (right side). It demonstrates that ideally, it should be possible to enable or disable the adaptive hash index not only on a per-index basis, but also on a per-query basis.
I do not have a fix for this yet. Possibly, btr_search_drop_page_hash_index() needs to be made to use a single batch, by allocating a large enough buffer. Multiple batches in btr_search_build_page_hash_index() should not be a problem, because the AHI can always be a proper subset of the contents of the index.
For performance testing, I also created the branch bb-10.11-MDEV-35049, with two variants: something that aims to be a correct merge and a deliberately incorrect merge to assess the impact of MDEV-13756.
Marko Mäkelä
added a comment - I ended up fixing the btr_search_drop_page_hash_index() problem by allocating enough stack to accommodate the maximum number of records per page. I empirically determined the maximum number of records with the following test case:
--source include/have_innodb.inc
--source include/have_innodb_64k.inc
--source include/have_sequence.inc
CREATE TABLE t(a SMALLINT PRIMARY KEY, INDEX (a)) ENGINE=InnoDB;
INSERT INTO t SELECT * FROM seq_1_to_8189;
od -Ax -t x1 -j 0x40000 -N 0x10000 var/mysqld.1/data/test/t.ibd
There is a hard limit of 8191 records per page. InnoDB reserves 13 bits for the heap number in the record header, and the predefined infimum and supremum records consume one heap number each.
So, we would have to allocate 8189*4 = 32756 bytes of stack for the rec_fold() value buffer in btr_search_drop_page_hash_index() in order to be able to compute all rec_fold() values before acquiring an exclusive latch on an adaptive hash index partition. I did consider the alternative of allocating a buffer from the heap; I am afraid that if we failed to allocate it, we would have no option other than to crash the server.
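A back-of-the-envelope check of that arithmetic:
#include <cstdint>
#include <cstdio>

int main()
{
  const unsigned heap_no_bits= 13;               // record header field
  const unsigned max_heap_no= (1U << heap_no_bits) - 1;        // 8191
  const unsigned max_user_recs= max_heap_no - 2; // minus infimum, supremum
  std::printf("%u records, %zu bytes of rec_fold() values\n",
              max_user_recs, max_user_recs * sizeof(uint32_t));
  return 0;
}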
I just realized that there is one more option: keep computing rec_fold() for the "overflow" while holding an exclusive adaptive hash index latch. In that way, for typical pages where we have at most a few hundred records, we’d invoke rec_fold() while not blocking others. In the non-typical case, there would be a bit more blocking going on. This could be the best compromise. We know that Alpine Linux (MDEV-18462) is configuring a small stack size, which is already leading to some problems elsewhere (MDEV-35705, MDEV-35022).
Marko Mäkelä
added a comment - I checked some performance test results. As expected, I see a slight improvement in INSERT for the innodb_adaptive_hash_index=OFF case, due to MDEV-35312 and the fact that we now enable that code even when the adaptive hash index is disabled.
Curiously, for the t_oltp_pointselect_innodb test in the innodb_adaptive_hash_index=ON case, there is hardly any improvement observed. This seems to suggest that we should include something like ahi.sh in our performance regression test suite.
Marko Mäkelä
added a comment - I created the branch bb-10.11-MDEV-35049-rebase (and tested all 10 commits in it separately) as well as pushed the 10 commits to the main (currently 11.8) branch.
The included program tries to emulate this issue. It does not show the same bad ratio as the above, but it shows something:
10.11.9 with adaptive hash index:
Total execution with 1 threads time: 1 seconds
Total execution with 10 threads time: 6 seconds
Total execution with 16 threads time: 11 seconds
Total execution with 32 threads time: 20 seconds
Total execution with 64 threads time: 46 seconds
Total execution with 128 threads time: 106 seconds
Total execution with 256 threads time: 405 seconds
10.11.9 without adaptive hash indexes:
Total execution with 1 threads time: 4 seconds
Total execution with 16 threads time: 26 seconds
Total execution with 32 threads time: 45 seconds
Total execution with 64 threads time: 88 seconds
Total execution with 128 threads time: 178 seconds
Total execution with 256 threads time: 363 seconds
MariaDB 10.4 with adaptive hash indexes:
Total execution with 1 threads time: 3 seconds
Total execution with 16 threads time: 15 seconds
Total execution with 32 threads time: 21 seconds
Total execution with 64 threads time: 43 seconds
Total execution with 128 threads time: 86 seconds
Total execution with 256 threads time: 172 seconds
10.4 shows execution time roughly proportional to the number of threads.
10.11 degrades disproportionately at 128-256 threads.
The original test case shows the problem already at 32 threads (a 5x slowdown).