MariaDB Server / MDEV-30182

Optimize open_tables to take O(N) time

Details

    Description

      See also MDEV-22180 and MDEV-30103 (comment).

find_fk_prelocked_table traverses all tables opened in the connection, and MDL_context::find_ticket traverses all MDL tickets, which is at least as many steps. Both functions are called for each table, so opening tables takes O(tables^2) time in total.
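For illustration, a simplified model of the quadratic pattern (hypothetical types and names, not the actual server code): each newly opened table triggers a linear scan of everything opened so far.

{code:cpp}
#include <cstring>

// Simplified model with hypothetical types; not the actual server code.
struct Table_node
{
  const char *db, *name;
  Table_node *next;
};

// Linear scan over the already-opened tables, as find_fk_prelocked_table does.
static Table_node *find_opened(Table_node *head, const char *db,
                               const char *name)
{
  for (Table_node *t= head; t; t= t->next)    // O(tables opened so far)
    if (!strcmp(t->db, db) && !strcmp(t->name, name))
      return t;
  return nullptr;
}
// Called once per table being opened: N calls x O(N) scans = O(N^2) total.
{code}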

Opening a table for an operation like UPDATE or DELETE on a table with 12K child tables takes about 3.5 seconds on my laptop. Making the two mentioned functions no-ops brings it down to 0.15 seconds.

      What should be done

A list traversal should be changed to a hash lookup when the number of tables to open is large.
Hashing is an expensive operation, so for a smaller number of tables (the exact threshold is to be determined) the list traversal should be kept.

We can't predict the exact number of tables to open in advance: it grows dynamically as tables are opened (and triggers/FK relations are checked), so we need a data structure that also adapts dynamically; see the sketch below.
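A minimal sketch of this adaptive idea, assuming a hypothetical threshold of 8 elements and standard containers; the names and the threshold are illustrative, not the actual patch:

{code:cpp}
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical adaptive lookup structure: linear scan while small,
// hash lookups once the size outgrows the threshold.
class Adaptive_name_set
{
  static constexpr size_t THRESHOLD= 8;   // exact value to be tuned
  std::vector<std::string> small_;        // used below the threshold
  std::unordered_set<std::string> big_;   // used above the threshold

public:
  bool contains(const std::string &name) const
  {
    if (big_.empty())
    {
      for (const auto &n : small_)        // cheap for a few elements
        if (n == name)
          return true;
      return false;
    }
    return big_.count(name) != 0;
  }

  void insert(const std::string &name)
  {
    if (big_.empty() && small_.size() < THRESHOLD)
    {
      small_.push_back(name);
      return;
    }
    if (big_.empty())                     // migrate once the list outgrows
    {
      big_.insert(small_.begin(), small_.end());
      small_.clear();
    }
    big_.insert(name);
  }
};
{code}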

          Activity

Nikita Malyavin added a comment - edited

It's now being worked on by Sergey Vanislavskiy, a FEFU student, so I don't expect it to be done before July.

Nikita Malyavin added a comment

As it turned out, another use case is DROP DATABASE, which is very slow for databases with over 10K tables.

Nikita Malyavin added a comment - edited

I have finished the extensive rework of the patch provided by Sergei Vanislavskiy in the PR.

The work has been rebased onto 11.5. Many things have changed relative to the original work: bugs were fixed, the API was straightened up, more tests were added, and quirks and readability issues were addressed.

            Performance measurements

TL;DR: the performance tests show that a hash lookup outperforms list traversal in all tested cases.

            The config is:

            • OS Linux 6.7.8
            • gcc 13.2.1
            • performance schema compiled out
• -O3 and no other flags

To repeat the test, cherry-pick this patch, put realistic_table_names.dat in the std_data root, and run hash_perf.test with $tables set to the desired value.

The patch compares the performance of the changed part of prepare_fk_prelocking_list. A similar outcome is expected for find_ticket.
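For context, a minimal sketch of one way such a micro-benchmark can count clock cycles on x86_64 (assuming GCC's __rdtsc intrinsic; this is not the actual hash_perf.test harness):

{code:cpp}
#include <cstdint>
#include <x86intrin.h>   // __rdtsc()

// Hypothetical harness: average TSC clocks per call over a number of probes.
template <typename F>
static uint64_t clocks_per_call(F &&f, unsigned probes= 100)
{
  uint64_t start= __rdtsc();
  for (unsigned i= 0; i < probes; i++)
    f();
  return (__rdtsc() - start) / probes;
}

// Usage: uint64_t c= clocks_per_call([]{ /* code path under test */ });
{code}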

The tests have been run for 200, 20, 10, 5, and 2 tables, with table names generated by ChatGPT.

While the hash lookup obviously outperforms list traversal on 200 tables by a huge margin, it also shows better results for smaller numbers of child tables.

Here the X axis is the run number, and the Y axis is the total number of clocks for the run.

I've been observing some jumps in time spent for a few runs; they've been cut out for a smoother display. That is probably related to process scheduling.

This is explained by fewer memory accesses, better caching (more than one table typically fits in a single cache line), and hash value reuse, so the number of hash calculations stays the same in both the old and the new approach.

In the case of more than one foreign key per parent table, some overhead can be expected, since the hash value will not be reused. The effects are expected to be insignificant, and can be further reduced by adopting xxh, as prototyped here.

Nikita Malyavin added a comment - edited

            serg, please kindly review this work. I propose to push it to the 11.5 branch, if it's not frozen yet.

            The reworked version is available for review at bb-11.5-opentables, currently as commit 4a4fe078d5.

Nikita Malyavin added a comment - edited

            Mysys hash comparison

I've received a question asking why not use the hash table that we already have (the mysys hash).

            Hash overview

Apparently, contrary to what I thought, it doesn't resolve collisions by separate chaining.
It does store a hash bucket and a link to the next one, but the buckets are embedded into an array, and the link is a 32-bit array index.
The collisions are resolved in a way reminiscent of the Robin Hood algorithm, but I didn't recognize the particular one.

The important part is that it doesn't make holes during consecutive inserts and keeps the array consecutively filled. It puts a new record at the index derived from the hash value, and moves the older conflicting record to the end of the dynamic array.

            It also tries to maintain some distribution of the items.
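A toy model of the insertion behavior described above (purely illustrative; the real my_hash code in mysys is more involved):

{code:cpp}
#include <cstdint>
#include <vector>

// Toy model of the described policy: buckets live in a flat, hole-free
// dynamic array; on a collision the older record is displaced to the end.
struct Bucket { const void *data; uint32_t next; };  // next: array index

static void toy_insert(std::vector<Bucket> &arr, const void *rec,
                       uint32_t hash_nr)
{
  uint32_t home= hash_nr % (uint32_t) (arr.size() + 1);
  if (home == arr.size())
  {
    arr.push_back({rec, UINT32_MAX});  // home slot is the fresh tail slot
    return;
  }
  arr.push_back(arr[home]);            // displace the older record to the end
  arr[home]= {rec, (uint32_t) (arr.size() - 1)};  // chain to it by index
}
{code}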

            The setup

• An i9-13900HX laptop with fan cooling
• gcc 14.2
• -O3 -msse4.2
• Release build type
• Perfschema compiled in
• Linux 6.11

The measurement approach is the same as before: cherry-pick this patch, put realistic_table_names.dat in the std_data root, run hash_perf.test, and set $tables.

            Performance

To assess the performance difference, I've prototyped the adoption of the mysys hash for my use case; the approach can be found here.

The results for 100 and 200 tables are not encouraging for the mysys hash: it performs around 30% worse than the new (and simple) open-addressing implementation.
As before, the comparison is done on the function that extends the prelocked list with referenced tables.

            Less is better.

            Explanation

Here are a few reasons for the worse result:

• Putting a colliding item at the end of the array results in poor cache locality. The "balancing" algorithm probably didn't help much with this.
• The "balancing" algorithm apparently doesn't bring a good multiplier either.
• I didn't use LTO, so the code could not be inlined across translation units. The mysys hash implementation requires calling 2-3 callbacks per insert or find, which could also account for some of the difference.

            Memory consumption

The mysys hash stores a hash bucket per item, which is two pointers in size on x86_64. Given that it uses a dynamic array with power-of-2 growth, total memory usage is 2-4 pointers per item.

The open-addressing hash stores a single pointer per item. It also grows by doubling, but does so as soon as the fill factor reaches 0.5, likewise resulting in 2-4 pointers per item.

The memory usage boundaries are the same.

However, the right part of the mysys hash array may stay untouched for a long time in many cases, giving slightly better real memory use on systems with lazy page allocation.
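To spell out the arithmetic for the open-addressing side (a sketch, assuming doubling growth and a 0.5 fill factor): N items end up with between 2N and 4N pointer-sized slots, matching the 2-4 pointers-per-item bound above.

{code:cpp}
#include <cstddef>

// Capacity is the smallest power of two >= 2*N, so for N > 0 items
// 2*N <= capacity < 4*N, i.e. 2-4 pointer-sized slots per item.
static size_t capacity_for(size_t n_items)
{
  size_t cap= 1;
  while (cap < 2 * n_items)   // keep the fill factor <= 0.5
    cap<<= 1;
  return cap;
}
{code}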

            Conclusion

Worse results do not mean that the mysys hash is completely superseded.
Here are the use cases where it still fits:

• C code
• Components that require polymorphic behavior of the hash table (that is, substituting the search or comparison criteria on the fly)
• Use cases that require updates or iteration (neither is implemented in the new hash yet)
• Poor hash functions resulting in "crowded" distributions (the "balancing" code can supposedly handle them well)
Nikita Malyavin added a comment - edited

            1-2 tables case

My previous results were quite murky, demonstrating the uncertainty of the 2-referenced-tables case.
The results were mostly between 0 and 2 clocks, so the picture contained a lot of noise, and the results differed from run to run.

So I decided to improve the approach to get a clearer picture.

            New measurement setup

I have collected the average of 100 probes on each test run and made 100 runs, collecting the results in a table.
Then I visualized the "running total" of the probes.

Now we can see that, over time, the hash table version (where the hash table itself is actually not involved) behaves worse.

            Optimization 1

The first observation was that for the empty data structure, find() checks both the `first` and `second` slots. This is because erase() didn't move `second` when it deleted the item in `first`.

I changed that, and also added `likely` for the 1-2 case. This resulted in slightly better performance, but still worse than the old code.
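A minimal sketch of the fix, under assumed names (a hypothetical two-slot inline fast path, not the actual patch): erase() compacts `second` into `first`, so find() on an empty or single-element structure probes only one slot.

{code:cpp}
#include <cstring>

#define likely(x) __builtin_expect(!!(x), 1)  // as in the server headers

struct Elem { const char *key; };             // placeholder element type

struct Inline_pair
{
  Elem *first= nullptr, *second= nullptr;

  Elem *find(const char *k) const
  {
    if (!first)
      return nullptr;                         // empty: one probe, not two
    if (likely(!strcmp(first->key, k)))       // the 1-2 element hot path
      return first;
    if (second && !strcmp(second->key, k))
      return second;
    return nullptr;
  }

  void erase(Elem *e)
  {
    if (e == first)
    {
      first= second;                          // compact: no stale hole left
      second= nullptr;
    }
    else if (e == second)
      second= nullptr;
  }
};
{code}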

            Optimization 2

The second observation was that the data structure was accessed twice in a row: first find() is invoked, then insert().
The third was that I had an extra MDL key copy, although neither the MDL key nor the hash number is really required for the 1-2 case.
I've also noticed that gcc -O3 does pretty good inlining: both find() and insert() end up inlined, and even mdl_key_init().

This led to the idea of combining find and insert into a single query to the data structure. I've created a new insert() method that inserts a new item if no match is found during the collision traversal. The creation callback is invoked immediately before insertion.

The assumption was that once find and insert are combined into a single traversal and inlined, the optimizer will be able to see that the MDL key is only used in insert_into_bucket and move its initialization there. Doing this manually would otherwise be hard without breaking the API into pieces. Another approach would be lazy MDL_key initialization.

            The code is here: link.
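A minimal sketch of the combined operation, under assumed names (growth, deletion, and the MDL specifics are omitted): the creation callback runs only when no match is found, so key construction can sink into the insertion path.

{code:cpp}
#include <cstddef>
#include <cstring>

struct Elem { const char *key; };      // placeholder element type

// Hypothetical open-addressing table of element pointers; growth and
// deletion are omitted, and the table is assumed to never be full.
struct Open_hash
{
  Elem **slots;                        // zero-initialized slot array
  size_t mask;                         // capacity - 1, capacity a power of two

  // Single traversal: return an existing match, or call create() and put
  // its result into the first free slot, returning nullptr (fresh insert).
  template <typename Create>
  Elem *insert(const char *k, size_t hash_nr, Create &&create)
  {
    for (size_t i= hash_nr & mask;; i= (i + 1) & mask)
    {
      if (!slots[i])
      {
        slots[i]= create();            // invoked only right before insertion
        return nullptr;
      }
      if (!strcmp(slots[i]->key, k))
        return slots[i];               // match found: nothing inserted
    }
  }
};
{code}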

            Final results

Optimization 2 made the results indistinguishable from the old behavior for 2 tables.
"New opt" means optimization 1, and "new optinsert" means optimization 2.

We can see that optimization 1 also brought some improvement, but was still a little worse.
