1-2 tables case
My previous results were quite murky and mostly demonstrated the uncertainty for the case of 2 referenced tables.
The results were mostly between 0 and 2 clocks, so the picture contained a lot of noise, and the results differed from run to run.
So I decided to improve the approach to get a clearer picture.
New measurement setup
I collected the average of 100 probes on each test run, and made 100 runs, collecting the results in a table.
Then I visualized the "running total" of the probes.
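The aggregation described above can be sketched roughly as follows. This is only an illustration: it assumes "running total" means the cumulative sum of the per-run averages, and the function names `run_average` and `running_total` are made up for the example, not taken from the actual benchmark.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Average of the probes taken during one test run (e.g. 100 probes).
double run_average(const std::vector<double> &probes)
{
  assert(!probes.empty());
  return std::accumulate(probes.begin(), probes.end(), 0.0) / probes.size();
}

// Running total (assumed meaning): element i is the sum of the
// averages of runs 0..i, so a steady per-run cost shows up as a slope.
std::vector<double> running_total(const std::vector<double> &run_averages)
{
  std::vector<double> totals(run_averages.size());
  std::partial_sum(run_averages.begin(), run_averages.end(), totals.begin());
  return totals;
}
```

Summing the averages smooths out the per-run noise: two variants that differ by a fraction of a clock per probe end up as two lines with visibly different slopes.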

Now we can see that, over time, the hash table version behaves worse (even though in this case the hash table itself is not actually involved).
Optimization 1
The first observation was that for the empty data structure, find() checks both the `first` and `second` slots. This is because erase() didn't move `second` when it deleted the item in `first`.
I changed that, and also added `likely` for the 1-2 case. This resulted in slightly better performance, but it was still worse than the old code.
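A minimal sketch of the erase() fix, under the assumption that the structure keeps up to two items in inline slots. The type `Two_slots`, the `int` payload and the method signatures are hypothetical and only illustrate the invariant; the `likely` hint and the fallback to the real hash table are left out.

```cpp
#include <cassert>

// Hypothetical two-slot container; names are illustrative only.
struct Two_slots
{
  const int *first = nullptr;
  const int *second = nullptr;

  // Invariant: `second` is occupied only when `first` is occupied,
  // so an empty structure is detected with a single check.
  const int *find(int key) const
  {
    if (!first)
      return nullptr;
    if (*first == key)
      return first;
    if (second && *second == key)
      return second;
    return nullptr;
  }

  bool insert(const int *item)
  {
    assert(item != nullptr);
    if (!first)  { first = item;  return true; }
    if (!second) { second = item; return true; }
    return false; // full: the real code would switch to the hash table
  }

  bool erase(int key)
  {
    if (first && *first == key)
    {
      first = second;   // move `second` down to keep the invariant
      second = nullptr;
      return true;
    }
    if (second && *second == key)
    {
      second = nullptr;
      return true;
    }
    return false;
  }
};
```

With the invariant in place, find() on an empty structure returns after looking at `first` only, which is the case the measurement exercises most often.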
Optimization 2
The second observation is that the data structure was accessed twice in a row: first find() is invoked, then insert().
The third was that I make an extra MDL key copy, even though neither the MDL key nor the hash value is really required for the 1-2 case.
I've also noticed that gcc -O3 does a pretty good job of inlining: both find() and insert() get inlined, and even mdl_key_init().
This led to the idea of combining find and insert into a single query to the data structure. I've created a new insert() method that inserts a new item if no match is found while traversing the collisions. The creation callback is invoked immediately before insertion.
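The combined lookup can be sketched like this. It is not the actual patch: `Bucket`, `Item`, the string key and the returned pair are assumptions made for the example, and the collision list is modelled with a plain `std::forward_list`.

```cpp
#include <cassert>
#include <forward_list>
#include <string>
#include <utility>

// Hypothetical item stored in the structure.
struct Item
{
  std::string key;
  int payload;
};

// Hypothetical bucket holding the colliding items.
class Bucket
{
  std::forward_list<Item> items;

public:
  // Single traversal: return the matching item if there is one,
  // otherwise call `create` and insert its result.
  // The bool is true when a new item was inserted.
  template <typename Create>
  std::pair<Item *, bool> insert(const std::string &key, Create create)
  {
    for (Item &it : items)
      if (it.key == key)
        return {&it, false};

    // The callback runs only here, immediately before the insertion.
    items.push_front(create());
    assert(items.front().key == key);
    return {&items.front(), true};
  }
};
```

Because the callback is the only place where the new item is built, any work that is needed just for creation stays off the path taken when a match is found.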
The assumption was that, once find and insert are combined in a single traversal and inlined, the optimizer will be able to see that the MDL key is only used in insert_into_bucket and move its initialization there. It would otherwise be hard to do this manually without breaking the API into pieces. Another approach would be to initialize the MDL_key lazily.
The code is here: link.
Final results
Optimization 2 made the results indistinguishable from the old behavior for 2 tables.
"New opt" means optimization 1, and "new optinsert" means optimization 2.
We can see that optimization 1 also brought some improvement, but it was still slightly worse than the old code.
It is now being worked on by Sergey Vanislavskiy, a FEFU student, so I don't expect it to be done before July.