Details
-
Task
-
Status: In Review (View Workflow)
-
Critical
-
Resolution: Unresolved
Description
MDEV-16329 adds a little per-row overhead related to the handler::binlog_log_row function.
1. A new function call.
2. what's more important, another double dispatch+branch table->s->online_alter_binlog != 0
A patch binlog_log_row (1).diff has been provided by monty however the impact has been put in doubt.
A few tests have been made to measure the performance impact of the original version and the supposed improvement.
Build options: gcc 13.2, DBUG=ON, WSREP=OFF, ASAN=0, -O3, -flto, -fuse-ld=mold
Note: rows_changed++ is included into the scope of measurement to avoid the conflicts during patching.
1. time measurement with std::chrono::high_resolution_clock.
commit is compared against the patched version.
Results:
Monty: 15ns avg
Nikita: 20ns avg
Also an approach was made to tune up likely/unlikely, which didn't change the situation.
An assumption is made that 5ns appears because of serialization that could be caused by high_resolution_clock, especially if it was implemented with a system call.
2. clock_gettime
clock_gettime should be implemented through vdso on linux, that is, avoiding any extra synchronization.
Results:
Monty: 15ns avg
Nikita: 20ns avg
The difference is still suspiciuosly big hinting that a syncronization is been made, or a misprediction constantly happens.
Later turned out that clock_gettime is also not helpful when a non-synchronized measurement is had to be made:
https://bert-hubert.blogspot.com/2017/03/on-linux-vdso-and-clockgettime.html
https://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/
3. rdtsc
In comparison with the rest rdtsc does not make any synchronization.
https://bert-hubert.blogspot.com/2017/03/on-linux-vdso-and-clockgettime.html
https://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/
Monty: 45 cycles
Nikita: 45 cycles
Now we see that a cost of roughly having an extra if + function call is negligible, unless a synchronization follows.
Note: rdtscp is supposed to be a wrong indicator, exactly since it involves some synchronization. We don't need to know how long the operation takes synchronized, but an actual impact rather, which requires the measurement to be unintrusive.
Does the compiler generate an efficient code?
Now let's see what assembly is actually generated by the lto-powered build. We'll look at
objdump -d ../sql/mariadbd | grep -A 300 '_ZN7handler13ha_update_rowEPKhS1_>:'
|
to see if the code is actually effective.
1. Is the function inlined?
The expected answer is yes, why would a compiler need a hint to make an efficient code to offload the cpu's prefetcher. The actual anwer is:
fa7e03: 0f 31 rdtsc
|
fa7e05: 48 83 83 28 02 00 00 addq $0x1,0x228(%rbx) # rows_changed++ is placed interestingly.
|
fa7e0c: 01
|
fa7e0d: 89 c0 mov %eax,%eax
|
fa7e0f: 4c 89 ee mov %r13,%rsi
|
fa7e12: 48 89 df mov %rbx,%rdi
|
fa7e15: 48 c1 e2 20 shl $0x20,%rdx
|
# end of rdtsc
|
|
fa7e19: 48 8d 0d 80 6b fe ff lea -0x19480(%rip),%rcx
|
fa7e20: 48 09 c2 or %rax,%rdx
|
fa7e23: 49 89 d6 mov %rdx,%r14
|
fa7e26: 4c 89 e2 mov %r12,%rdx
|
fa7e29: e8 42 e3 ff ff call fa6170 <_ZN7handler14binlog_log_rowEPKhS1_PFbP3THDP5TABLEP9Event_logP17binlog_cache_databmS1_S1_E>
|
fa7e2e: 89 c5 mov %eax,%ebp
|
no.
Hinting a function with inline keyword works perfect though.
2. Does likely/unlikely work?
After inlining the function we get the following assembly:
fa7b67: 0f 31 rdtsc
|
fa7b69: 48 c1 e2 20 shl $0x20,%rdx
|
fa7b6d: 89 c0 mov %eax,%eax
|
fa7b6f: 48 83 83 28 02 00 00 addq $0x1,0x228(%rbx)
|
fa7b76: 01
|
fa7b77: 48 09 c2 or %rax,%rdx
|
# end rdtsc
|
|
fa7b7a: 80 bb 20 05 00 00 00 cmpb $0x0,0x520(%rbx) # if (row_logging)
|
fa7b81: 48 89 55 80 mov %rdx,-0x80(%rbp) #
|
fa7b85: 0f 85 65 01 00 00 jne fa7cf0 # jump if != 0, good
|
|
# looks like `error` comparison is skipped and moved
|
# to the pessimistic branch. Smart!
|
|
fa7b8b: 48 8b 7b 10 mov 0x10(%rbx),%rdi # this->table
|
fa7b8f: 48 8b 07 mov (%rdi),%rax # table->s
|
fa7b92: 48 83 b8 d8 05 00 00 cmpq $0x0,0x5d8(%rax) # s->online_alter_binlog
|
fa7b99: 00
|
fa7b9a: 74 12 je fa7bae # skip second double dispatch
|
fa7b9c: 48 8b 47 08 mov 0x8(%rdi),%rax # this->table
|
fa7ba0: 48 8b 40 68 mov 0x68(%rax),%rax # table->file
|
fa7ba4: 48 39 43 68 cmp %rax,0x68(%rbx) # ht != file->ht
|
fa7ba8: 0f 84 d2 02 00 00 je fa7e80 #
|
fa7bae: 45 84 ff test %r15b,%r15b # if (likely(count_time))
|
fa7bb1: 0f 84 80 fe ff ff je fa7a37 # ...
|
fa7bb7: 0f 31 rdtsc
|
if (row_logging) is guessed to be unlikely. Considering the common (featureless) case to be prioritized, it's good.
However, we can see that a second double (triple?) dispatch can begin evaluating before table->s->online_alter_binlog.
Interestingly, this is fetched from %rbx sometimes, not %rdi. As systemV abi dictates:
2. If the class is INTEGER, the next available register of the sequence %rdi,
%rsi, %rdx, %rcx, %r8 and %r9 is used
Good to see that error comparison is not made for nothing in the optimistic path!
So, this should be found in %rdi anyway.
So it shows up that a code generated is not ideal. We may try to tune up likely/unlikely:
- if (unlikely(!error && table->s->online_alter_binlog && is_root_handler()))
|
+ if (likely(!error) && unlikely(table->s->online_alter_binlog) && is_root_handler())
|
error= online_alter_log_row(table, before_record, after_record,
|
log_func); |
Now the assembly is perfect:
fa7c57: 0f 31 rdtsc
|
fa7c59: 48 c1 e2 20 shl $0x20,%rdx
|
fa7c5d: 89 c0 mov %eax,%eax
|
fa7c5f: 48 83 83 28 02 00 00 addq $0x1,0x228(%rbx)
|
fa7c66: 01
|
fa7c67: 48 09 c2 or %rax,%rdx
|
# end rdtsc
|
|
fa7c6a: 80 bb 20 05 00 00 00 cmpb $0x0,0x520(%rbx) # if (row_logging)
|
fa7c71: 48 89 55 80 mov %rdx,-0x80(%rbp) #
|
fa7c75: 0f 85 55 01 00 00 jne fa7dd0 # jump if != 0, good
|
|
fa7c7b: 48 8b 7b 10 mov 0x10(%rbx),%rdi # table
|
fa7c7f: 48 8b 07 mov (%rdi),%rax # table->s
|
fa7c82: 48 83 b8 d8 05 00 00 cmpq $0x0,0x5d8(%rax) # s->online_alter_binlog
|
fa7c89: 00 #
|
fa7c8a: 0f 85 e8 03 00 00 jne fa8078 # jump away if != 0
|
|
fa7c90: 45 84 ff test %r15b,%r15b # if (likely(count_time))
|
fa7c93: 0f 84 8e fe ff ff je fa7b27 # continue to exit
|
fa7c99: 0f 31 rdtsc
|