[MDEV-30182] Optimize open_tables to take O(N) time - Jira

Details

Type: Task
Status: Stalled (View Workflow)
Priority: Major
Resolution: Unresolved
Fix Version/s: 12.1
Component/s: None
Labels:
- contribution

Description

See also ~~MDEV-22180~~ and MDEV-30103 (comment).

find_fk_prelocked_table traverses all opened tables in connection, MDL_context::find_ticket traverses all mdl tickets which is also not less amount of steps. Both functions are called for each table, taking O(tables^2) time total, during opening tables.

Opening a table for an operation like UPDATE or DELETE for a table with 12k child tables takes about 3.5 seconds on my laptop. Making the two mentioned fuinctions noop takes 0.15 seconds.

What should be done

A list traversal should be changed to a hash lookup when a number of tables to open is big.
Hashing is an expensive operation, so for a smaller number of tables (an exact value is to be determined) it should be left as a list traversal.

We can't predict an exact number of tables to open preliminarily: it grows dynamically, once the tables are opened (and triggers/FK relations are checked), so we need a data structure which will also dynamically adapt.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

200_tables_hash_vs_list.png
38 kB
2024-03-14 23:18
hash_perf.test
1 kB
2024-03-14 23:26
hash_vs_list_perf.png
338 kB
2024-03-14 23:18
list_vs_hash_2tables_running_total.png
31 kB
2024-09-25 01:22
mysys_vs_openaddress.png
85 kB
2024-09-25 00:22
old_vs_new_vs_new_opt_1_2.png
65 kB
2024-09-25 01:59
realistic_table_names.dat
3 kB
2024-03-14 23:19

Issue Links

blocks

MDEV-29181 Potential corruption on Foreign key update on a table with vcol index

Stalled

MDEV-31654 Support STORED generated columns with FK cascade changes

Open

MDEV-31931 FK fields cannot be used anymore in generated columns

Closed

MDEV-31942 Online alter: support cascade foreign keys

Open

relates to

MDEV-22180 Planner opens unnecessary tables when updated table is referenced by foreign keys

Closed

MDEV-29181 Potential corruption on Foreign key update on a table with vcol index

Stalled

(1 relates to)

Activity

Ascending order - Click to sort in descending order

Nikita Malyavin created issue - 2022-12-09 15:39

Nikita Malyavin made changes - 2022-12-09 15:39

Field	Original Value	New Value
Link		This issue relates to ~~MDEV-22180~~ [ ~~MDEV-22180~~ ]

Nikita Malyavin made changes - 2022-12-09 15:40

Fix Version/s

10.13 [ 28501 ]

Julien Fritsch made changes - 2022-12-28 13:05

Fix Version/s

11.1 [ 28549 ]

Julien Fritsch made changes - 2022-12-28 13:05

Fix Version/s

10.13 [ 28501 ]

Nikita Malyavin added a comment - 2023-03-06 16:53 - edited

It's now in progress by Sergey Vanislavskiy, a FEFU student. So I don't expect it to be done before July

Nikita Malyavin added a comment - 2023-03-06 16:53 - edited It's now in progress by Sergey Vanislavskiy, a FEFU student. So I don't expect it to be done before July

Nikita Malyavin made changes - 2023-03-06 16:53

Due Date

2023-07-31

Julien Fritsch made changes - 2023-08-17 06:10

Due Date

2023-07-31

Nikita Malyavin made changes - 2023-08-17 13:23

Link

This issue blocks MDEV-31942 [ MDEV-31942 ]

Nikita Malyavin added a comment - 2023-10-19 09:35

As it was found, another use case is DROP DATABASE, which is very slow for tables over 10K table databases

Nikita Malyavin added a comment - 2023-10-19 09:35 As it was found, another use case is DROP DATABASE, which is very slow for tables over 10K table databases

Sergei Golubchik made changes - 2023-11-05 18:22

Link

This issue blocks MDEV-31942 [ MDEV-31942 ]

Nikita Malyavin made changes - 2024-03-13 21:29

Status

Open [ 1 ]

In Progress [ 3 ]

Nikita Malyavin made changes - 2024-03-14 14:48

Fix Version/s		11.5 [ 29506 ]
Fix Version/s	11.1 [ 28549 ]

Nikita Malyavin made changes - 2024-03-14 23:18

Attachment		200_tables_hash_vs_list.png [ 73267 ]
Attachment		hash_vs_list_perf.png [ 73268 ]

Nikita Malyavin made changes - 2024-03-14 23:19

Attachment

realistic_table_names.dat [ 73269 ]

Nikita Malyavin made changes - 2024-03-14 23:26

Attachment

hash_perf.test [ 73270 ]

Nikita Malyavin added a comment - 2024-03-14 23:50 - edited

I have finished the wide adoption of the patch provided by Sergei Vanislavskiy in PR.

The work has been rebased to 11.5. Many things have been changed to the original work: bugs fixed, api straighted up, more tests added, fixed quirks/readability.

Performance measurements

TLRD The performance tests show that a hash lookup outperforms list traversal in all tested cases.

The config is:

OS Linux 6.7.8
gcc 13.2.1
performance schema compiled out
-O3 flags and nothing else

To repeat the test cherry-pick this patch put realistic_table_names.dat in std_data root and run hash_perf.test with $tables sed to desired value.

The patch compares the performance of the changed part of prepare_fk_prelocking_list. Similar outcome is expected for find_ticket.

The tests have been done for 200, 20, 10, 5 and 2 tables with the table names generated by chatgpt.

While the hash lookup obviously outperforms list traversal on 200 tables with a huge gap,

it also shows better results on the smaller number of child tables.

Here the X axis is # of run, and by Y - a total amount of clocks for the run.

I've been observing some jumps in time spent for a few runs - they've been cut out for a smoother display. Probably, that's related to process scheduling.

This is explained by less memory dispatches, better caching (more than one table is typically stored in a single cache line) and hash value reuse, so the number of hash calculation stays the same on both old and new approaches.

In case of more than one foreign key per parent table, some overhead can be expected, since the hash value will not be reused. The effects are expected to be insignificant, and can be further neglected with adopting xxh, as prototyped here

Nikita Malyavin added a comment - 2024-03-14 23:50 - edited I have finished the wide adoption of the patch provided by Sergei Vanislavskiy in PR. The work has been rebased to 11.5. Many things have been changed to the original work: bugs fixed, api straighted up, more tests added, fixed quirks/readability. Performance measurements TLRD The performance tests show that a hash lookup outperforms list traversal in all tested cases. The config is: OS Linux 6.7.8 gcc 13.2.1 performance schema compiled out -O3 flags and nothing else To repeat the test cherry-pick this patch put realistic_table_names.dat in std_data root and run hash_perf.test with $tables sed to desired value. The patch compares the performance of the changed part of prepare_fk_prelocking_list . Similar outcome is expected for find_ticket. The tests have been done for 200, 20, 10, 5 and 2 tables with the table names generated by chatgpt. While the hash lookup obviously outperforms list traversal on 200 tables with a huge gap, it also shows better results on the smaller number of child tables. Here the X axis is # of run, and by Y - a total amount of clocks for the run. I've been observing some jumps in time spent for a few runs - they've been cut out for a smoother display. Probably, that's related to process scheduling. This is explained by less memory dispatches, better caching (more than one table is typically stored in a single cache line) and hash value reuse, so the number of hash calculation stays the same on both old and new approaches. In case of more than one foreign key per parent table, some overhead can be expected, since the hash value will not be reused. The effects are expected to be insignificant, and can be further neglected with adopting xxh, as prototyped here

Nikita Malyavin made changes - 2024-03-14 23:55

Status

In Progress [ 3 ]

In Testing [ 10301 ]

Nikita Malyavin made changes - 2024-03-14 23:55

Status

In Testing [ 10301 ]

Stalled [ 10000 ]

Nikita Malyavin added a comment - 2024-03-14 23:57 - edited

serg, please kindly review this work. I propose to push it to the 11.5 branch, if it's not frozen yet.

The reworked version is available for review at bb-11.5-opentables, currently as commit 4a4fe078d5.

Nikita Malyavin added a comment - 2024-03-14 23:57 - edited serg , please kindly review this work. I propose to push it to the 11.5 branch, if it's not frozen yet. The reworked version is available for review at bb-11.5-opentables , currently as commit 4a4fe078d5 .

Nikita Malyavin made changes - 2024-03-14 23:57

Assignee	Nikita Malyavin [ nikitamalyavin ]	Sergei Golubchik [ serg ]
Status	Stalled [ 10000 ]	In Review [ 10002 ]

Sergei Golubchik made changes - 2024-03-19 18:32

Fix Version/s		11.6 [ 29515 ]
Fix Version/s	11.5 [ 29506 ]

Ralf Gebhardt made changes - 2024-06-21 10:14

Fix Version/s		11.7 [ 29815 ]
Fix Version/s	11.6 [ 29515 ]

Nikita Malyavin made changes - 2024-08-04 23:52

Link

This issue blocks MDEV-12302 [ MDEV-12302 ]

Nikita Malyavin made changes - 2024-08-04 23:52

Link

This issue blocks MDEV-31654 [ MDEV-31654 ]

Nikita Malyavin made changes - 2024-08-04 23:52

Link

This issue blocks ~~MDEV-31931~~ [ ~~MDEV-31931~~ ]

Nikita Malyavin made changes - 2024-08-04 23:52

Link

This issue blocks MDEV-31942 [ MDEV-31942 ]

Nikita Malyavin made changes - 2024-08-04 23:52

Link

This issue blocks MDEV-34151 [ MDEV-34151 ]

Nikita Malyavin made changes - 2024-08-04 23:53

Link

This issue relates to MDEV-29181 [ MDEV-29181 ]

Nikita Malyavin made changes - 2024-08-04 23:53

Priority

Major [ 3 ]

Critical [ 2 ]

Sergei Golubchik made changes - 2024-08-22 11:22

Link

This issue blocks MDEV-31942 [ MDEV-31942 ]

Sergei Golubchik made changes - 2024-08-22 11:22

Link

This issue blocks MDEV-34151 [ MDEV-34151 ]

Sergei Golubchik made changes - 2024-08-22 11:22

Link

This issue blocks MDEV-12302 [ MDEV-12302 ]

Sergei Golubchik made changes - 2024-08-22 11:22

Link

This issue blocks MDEV-31654 [ MDEV-31654 ]

Sergei Golubchik made changes - 2024-08-22 11:23

Priority

Critical [ 2 ]

Major [ 3 ]

Nikita Malyavin made changes - 2024-08-22 15:37

Link

This issue blocks MDEV-31654 [ MDEV-31654 ]

Nikita Malyavin made changes - 2024-08-22 15:40

Link

This issue blocks MDEV-31942 [ MDEV-31942 ]

Sergei Golubchik made changes - 2024-09-24 13:53

Fix Version/s		11.8 [ 29921 ]
Fix Version/s	11.7 [ 29815 ]

Nikita Malyavin made changes - 2024-09-25 00:22

Attachment

mysys_vs_openaddress.png [ 74062 ]

Nikita Malyavin added a comment - 2024-09-25 00:58 - edited

Mysys hash comparison

I've received a question why not to use the hash table that we already have (the mysys hash).

Hash overview

Apparently, to the contrary of what I thought, it doesn't resolve collisions by separate chaining.
It does store the hash bucket, and the link to next, but the buckets are embedded into array, and the link is a 32-bit array index.
The collisions are resolved in a way reminding robinhood algorithm, but I didn't recognize the particular one.

The important part is that it doesn't make holes during the consecutive inserts and keeps the array consecutively filled. It puts a new record by the hash number index, and moves the older conflicting record to the end of the dynamic array.

It also tries to maintain some distribution of the items.

The setup

An i9-13900HX laptop with fan coling.
gcc 14.2
-O3 -msse4.2
Release build type
Perfschema compiled in.
Linux 6.11
The measurement approach is same as before:
cherry-pick this patch put realistic_table_names.dat in std_data root and run hash_perf.test and set $tables.

Performance

To conclude the performance difference, I've prototyped the adoption of mysys hash for my use-case, the approach can be found here.

The results for 100 and 200 tables are not encouraging for mysys hash: it performs around 30% worse that the new (and simple) open address implementation.
As before, the comparison is done on the function extending the prelocked list for referenced tables.

Less is better.

Explanation

Here is a few reasons for a worse result:

Putting a colliding item in the end of the array results in a poor cache locality. The "balancing" algorithm probably didn't help much in this.
The "balancing" algorithm apparently doesn't bring a good multiplier neither.
I didn't use lto, so the code could not be inlined. Mysys hash implementation requires to call 2-3 callbacks per insert or find, which could also bring a bit of difference.

Memory consumption

mysys hash stores a hash buctet per item, which is 2 pointers size on x86_64. Given that it uses a dynamic array with a power of 2 growth, total memory usage is 2-4 pointers per item.

open address hash stores a single pointer per item. It also expands twice, but as soon as a hash fill factor reaches 0.5, resulting in 2-4 pointers per item as well.

The memory usage boundaries are the same.

However, the right part of the mysys hash array may be untouched long enough in many cases, bringing a little bit better real memory use on systems with lazy page allocation.

Conclusion

Worse results do not mean that mysys hash is completely supersede.
Here are the use cases:

C code
Components that require a polymorphic behavior of hash table (that is, substituting search or comparison criterias on the fly).
Use-cases that require updates or iteration (both are not yet implemented in the new hash)
Poor hash functions, resulting in the "crowded" distributions (the "balancing" code can supposedly maintain it well)

Nikita Malyavin added a comment - 2024-09-25 00:58 - edited Mysys hash comparison I've received a question why not to use the hash table that we already have (the mysys hash). Hash overview Apparently, to the contrary of what I thought, it doesn't resolve collisions by separate chaining . It does store the hash bucket, and the link to next, but the buckets are embedded into array, and the link is a 32-bit array index. The collisions are resolved in a way reminding robinhood algorithm, but I didn't recognize the particular one. The important part is that it doesn't make holes during the consecutive inserts and keeps the array consecutively filled. It puts a new record by the hash number index, and moves the older conflicting record to the end of the dynamic array. It also tries to maintain some distribution of the items. The setup An i9-13900HX laptop with fan coling. gcc 14.2 -O3 -msse4.2 Release build type Perfschema compiled in. Linux 6.11 The measurement approach is same as before: cherry-pick this patch put realistic_table_names.dat in std_data root and run hash_perf.test and set $tables. Performance To conclude the performance difference, I've prototyped the adoption of mysys hash for my use-case, the approach can be found here . The results for 100 and 200 tables are not encouraging for mysys hash: it performs around 30% worse that the new (and simple) open address implementation. As before, the comparison is done on the function extending the prelocked list for referenced tables. Less is better. Explanation Here is a few reasons for a worse result: Putting a colliding item in the end of the array results in a poor cache locality. The "balancing" algorithm probably didn't help much in this. The "balancing" algorithm apparently doesn't bring a good multiplier neither. I didn't use lto, so the code could not be inlined. Mysys hash implementation requires to call 2-3 callbacks per insert or find, which could also bring a bit of difference. Memory consumption mysys hash stores a hash buctet per item, which is 2 pointers size on x86_64. Given that it uses a dynamic array with a power of 2 growth, total memory usage is 2-4 pointers per item. open address hash stores a single pointer per item. It also expands twice, but as soon as a hash fill factor reaches 0.5, resulting in 2-4 pointers per item as well. The memory usage boundaries are the same. However, the right part of the mysys hash array may be untouched long enough in many cases, bringing a little bit better real memory use on systems with lazy page allocation. Conclusion Worse results do not mean that mysys hash is completely supersede. Here are the use cases: C code Components that require a polymorphic behavior of hash table (that is, substituting search or comparison criterias on the fly). Use-cases that require updates or iteration (both are not yet implemented in the new hash) Poor hash functions, resulting in the "crowded" distributions (the "balancing" code can supposedly maintain it well)

Nikita Malyavin made changes - 2024-09-25 01:22

Attachment

list_vs_hash_2tables_running_total.png [ 74063 ]

Nikita Malyavin made changes - 2024-09-25 01:59

Attachment

old_vs_new_vs_new_opt_1_2.png [ 74064 ]

Nikita Malyavin added a comment - 2024-09-25 02:05 - edited

1-2 tables case

My previous results were quite marky, showing demonstrating the uncertainty for 2 referenced tables case.
The results were mostly in between 0 and 2 clocks, , so the picture contained a lot of noise. The results were different from run to run.

So I decided to improve the approach to see the more clear picture.

New measurement setup

A have collected the average of 100 probes on each test run, and made 100 runs, collecting the results in a table.
Then I have visualized the "running total" of the probes.

Now we can see that with time the hash table (where the hash table itself is actually not involved) is behaving worse.

Optimization 1

The first observation was that for the empty data structure, find() checks both `first` and `second` slots. This is because erase() didn't move `second`, when it deleted the item in `first`.

I changed that, and also added `likely` for a 1-2 case, and it resulted in a slightly better performance, but still worse that the old code.

Optimization 2

The second observation is that the data structure was accessed twice in a row: first, find() is invoked, than insert().
The third one was that I have an extra MDL key copy, however the MDL key, and a hash number, are not really required for 1-2 case.
I've also noticed that gcc -O3 makes a pretty good inlining, resulting in both find() and insert() inlined, and even mdl_key_init().

This resulted in an idea to combine find and insert in a single query to a data structure. I've created the new insert() method that inserts a new item if a match is not found during the collisions traversal. The creation callback is invoked immediately before insertion.

The assumption was that, once find and insert are combined in a single traversal, the optimizer, once inlined, will be able to understand, that MDL key is only used in insert_into_bucket and move its initialization there. This would be otherwise hard to make it manually without breaking the api into peaces. Another approach would be to make a lazy MDL_key initialisation.

The code is here: link.

Final results

The optimization 2 made the results indistinguishable from the old behavior for 2 tables.
New opt means optimization 1, and new optinsert means optimization 2.

We can see that optimization 1 also brought some improvement, but still was a little bit worse.

Nikita Malyavin added a comment - 2024-09-25 02:05 - edited 1-2 tables case My previous results were quite marky, showing demonstrating the uncertainty for 2 referenced tables case. The results were mostly in between 0 and 2 clocks, , so the picture contained a lot of noise. The results were different from run to run. So I decided to improve the approach to see the more clear picture. New measurement setup A have collected the average of 100 probes on each test run, and made 100 runs, collecting the results in a table. Then I have visualized the "running total" of the probes. Now we can see that with time the hash table (where the hash table itself is actually not involved) is behaving worse. Optimization 1 The first observation was that for the empty data structure, find() checks both `first` and `second` slots. This is because erase() didn't move `second`, when it deleted the item in `first`. I changed that, and also added `likely` for a 1-2 case, and it resulted in a slightly better performance, but still worse that the old code. Optimization 2 The second observation is that the data structure was accessed twice in a row: first, find() is invoked, than insert(). The third one was that I have an extra MDL key copy, however the MDL key, and a hash number, are not really required for 1-2 case. I've also noticed that gcc -O3 makes a pretty good inlining, resulting in both find() and insert() inlined, and even mdl_key_init(). This resulted in an idea to combine find and insert in a single query to a data structure. I've created the new insert() method that inserts a new item if a match is not found during the collisions traversal. The creation callback is invoked immediately before insertion. The assumption was that, once find and insert are combined in a single traversal, the optimizer, once inlined, will be able to understand, that MDL key is only used in insert_into_bucket and move its initialization there. This would be otherwise hard to make it manually without breaking the api into peaces. Another approach would be to make a lazy MDL_key initialisation. The code is here: link . Final results The optimization 2 made the results indistinguishable from the old behavior for 2 tables. New opt means optimization 1, and new optinsert means optimization 2. We can see that optimization 1 also brought some improvement, but still was a little bit worse.

Nikita Malyavin made changes - 2024-09-26 21:29

Assignee

Sergei Golubchik [ serg ]

Vladislav Vaintroub [ wlad ]

Nikita Malyavin made changes - 2024-11-14 17:05

Link

This issue blocks MDEV-29181 [ MDEV-29181 ]

Sergei Golubchik made changes - 2024-12-10 15:11

Fix Version/s		11.9 [ 29945 ]
Fix Version/s	11.8 [ 29921 ]

Vladislav Vaintroub made changes - 2024-12-18 12:18

Status

In Review [ 10002 ]

Stalled [ 10000 ]

Vladislav Vaintroub made changes - 2024-12-18 12:18

Assignee

Vladislav Vaintroub [ wlad ]

Nikita Malyavin [ nikitamalyavin ]

Sergey Vojtovich made changes - 2025-01-29 18:56

Labels

contribution

Sergei Golubchik made changes - 2025-02-25 19:53

Fix Version/s		12.1 [ 29992 ]
Fix Version/s	12.0 [ 29945 ]

People

Assignee:: Nikita Malyavin

Reporter:: Nikita Malyavin

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 2022-12-09 15:39

Updated:: 2025-02-25 19:53

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

What should be done

Attachments

Attachments

Issue Links

Activity

Performance measurements

Mysys hash comparison

Hash overview

The setup

Performance

Explanation

Memory consumption

Conclusion

1-2 tables case

New measurement setup

Optimization 1

Optimization 2

Final results

People

Dates

Git Integration