[MDEV-33219] Optimize handler::binlog_log_row usage Created: 2024-01-10  Updated: 2024-01-31

Status: Stalled
Project: MariaDB Server
Component/s: Data Definition - Alter Table
Fix Version/s: 11.2

Type: Task Priority: Critical
Reporter: Nikita Malyavin Assignee: Nikita Malyavin
Resolution: Unresolved Votes: 0
Labels: online-ddl, performance

Attachments: File binlog_log_row (1).diff    

 Description   

MDEV-16329 adds a little per-row overhead related to the handler::binlog_log_row function.

1. A new function call.
2. what's more important, another double dispatch+branch table->s->online_alter_binlog != 0

A patch binlog_log_row (1).diff has been provided by monty however the impact has been put in doubt.
A few tests have been made to measure the performance impact of the original version and the supposed improvement.

Build options: gcc 13.2, DBUG=ON, WSREP=OFF, ASAN=0, -O3, -flto, -fuse-ld=mold

Note: rows_changed++ is included into the scope of measurement to avoid the conflicts during patching.

1. time measurement with std::chrono::high_resolution_clock.

commit is compared against the patched version.

Results:
Monty: 15ns avg
Nikita: 20ns avg

Also an approach was made to tune up likely/unlikely, which didn't change the situation.

An assumption is made that 5ns appears because of serialization that could be caused by high_resolution_clock, especially if it was implemented with a system call.

2. clock_gettime

clock_gettime should be implemented through vdso on linux, that is, avoiding any extra synchronization.

Results:
Monty: 15ns avg
Nikita: 20ns avg

The difference is still suspiciuosly big hinting that a syncronization is been made, or a misprediction constantly happens.
Later turned out that clock_gettime is also not helpful when a non-synchronized measurement is had to be made:
https://bert-hubert.blogspot.com/2017/03/on-linux-vdso-and-clockgettime.html
https://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/

3. rdtsc

In comparison with the rest rdtsc does not make any synchronization.
https://bert-hubert.blogspot.com/2017/03/on-linux-vdso-and-clockgettime.html
https://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/

commit

Monty: 45 cycles
Nikita: 45 cycles

Now we see that a cost of roughly having an extra if + function call is negligible, unless a synchronization follows.

Note: rdtscp is supposed to be a wrong indicator, exactly since it involves some synchronization. We don't need to know how long the operation takes synchronized, but an actual impact rather, which requires the measurement to be unintrusive.

Does the compiler generate an efficient code?

Now let's see what assembly is actually generated by the lto-powered build. We'll look at

objdump -d ../sql/mariadbd | grep -A 300 '_ZN7handler13ha_update_rowEPKhS1_>:'

to see if the code is actually effective.

1. Is the function inlined?

The expected answer is yes, why would a compiler need a hint to make an efficient code to offload the cpu's prefetcher. The actual anwer is:

  fa7e03:	0f 31                	rdtsc
  fa7e05:	48 83 83 28 02 00 00 	addq   $0x1,0x228(%rbx) # rows_changed++ is placed interestingly.
  fa7e0c:	01 
  fa7e0d:	89 c0                	mov    %eax,%eax
  fa7e0f:	4c 89 ee             	mov    %r13,%rsi
  fa7e12:	48 89 df             	mov    %rbx,%rdi
  fa7e15:	48 c1 e2 20          	shl    $0x20,%rdx
# end of rdtsc
 
  fa7e19:	48 8d 0d 80 6b fe ff 	lea    -0x19480(%rip),%rcx
  fa7e20:	48 09 c2             	or     %rax,%rdx
  fa7e23:	49 89 d6             	mov    %rdx,%r14
  fa7e26:	4c 89 e2             	mov    %r12,%rdx
  fa7e29:	e8 42 e3 ff ff       	call   fa6170 <_ZN7handler14binlog_log_rowEPKhS1_PFbP3THDP5TABLEP9Event_logP17binlog_cache_databmS1_S1_E>
  fa7e2e:	89 c5                	mov    %eax,%ebp

no.
Hinting a function with inline keyword works perfect though.

2. Does likely/unlikely work?

After inlining the function we get the following assembly:

  fa7b67:	0f 31                	rdtsc
  fa7b69:	48 c1 e2 20          	shl    $0x20,%rdx
  fa7b6d:	89 c0                	mov    %eax,%eax
  fa7b6f:	48 83 83 28 02 00 00 	addq   $0x1,0x228(%rbx)
  fa7b76:	01 
  fa7b77:	48 09 c2             	or     %rax,%rdx
# end rdtsc
 
  fa7b7a:	80 bb 20 05 00 00 00 	cmpb   $0x0,0x520(%rbx) # if (row_logging)
  fa7b81:	48 89 55 80          	mov    %rdx,-0x80(%rbp) # 
  fa7b85:	0f 85 65 01 00 00    	jne    fa7cf0           # jump if != 0, good
 
# looks like `error` comparison is skipped and moved 
# to the pessimistic branch. Smart!
 
  fa7b8b:	48 8b 7b 10          	mov    0x10(%rbx),%rdi  # this->table
  fa7b8f:	48 8b 07             	mov    (%rdi),%rax      # table->s
  fa7b92:	48 83 b8 d8 05 00 00 	cmpq   $0x0,0x5d8(%rax) # s->online_alter_binlog
  fa7b99:	00 
  fa7b9a:	74 12                	je     fa7bae # skip second double dispatch
  fa7b9c:	48 8b 47 08          	mov    0x8(%rdi),%rax  # this->table
  fa7ba0:	48 8b 40 68          	mov    0x68(%rax),%rax # table->file
  fa7ba4:	48 39 43 68          	cmp    %rax,0x68(%rbx) # ht != file->ht
  fa7ba8:	0f 84 d2 02 00 00    	je     fa7e80 # 
  fa7bae:	45 84 ff             	test   %r15b,%r15b # if (likely(count_time))
  fa7bb1:	0f 84 80 fe ff ff    	je     fa7a37      # ...
  fa7bb7:	0f 31                	rdtsc

if (row_logging) is guessed to be unlikely. Considering the common (featureless) case to be prioritized, it's good.
However, we can see that a second double (triple?) dispatch can begin evaluating before table->s->online_alter_binlog.
Interestingly, this is fetched from %rbx sometimes, not %rdi. As systemV abi dictates:

2. If the class is INTEGER, the next available register of the sequence %rdi,
%rsi, %rdx, %rcx, %r8 and %r9 is used

Good to see that error comparison is not made for nothing in the optimistic path!

So, this should be found in %rdi anyway.

So it shows up that a code generated is not ideal. We may try to tune up likely/unlikely:

-  if (unlikely(!error && table->s->online_alter_binlog && is_root_handler()))
+  if (likely(!error) && unlikely(table->s->online_alter_binlog) && is_root_handler())
     error= online_alter_log_row(table, before_record, after_record,
                                 log_func);

Now the assembly is perfect:

  fa7c57:	0f 31                	rdtsc
  fa7c59:	48 c1 e2 20          	shl    $0x20,%rdx
  fa7c5d:	89 c0                	mov    %eax,%eax
  fa7c5f:	48 83 83 28 02 00 00 	addq   $0x1,0x228(%rbx)
  fa7c66:	01 
  fa7c67:	48 09 c2             	or     %rax,%rdx
# end rdtsc
 
  fa7c6a:	80 bb 20 05 00 00 00 	cmpb   $0x0,0x520(%rbx) # if (row_logging)
  fa7c71:	48 89 55 80          	mov    %rdx,-0x80(%rbp) #
  fa7c75:	0f 85 55 01 00 00    	jne    fa7dd0           # jump if != 0, good
 
  fa7c7b:	48 8b 7b 10          	mov    0x10(%rbx),%rdi  # table
  fa7c7f:	48 8b 07             	mov    (%rdi),%rax      # table->s
  fa7c82:	48 83 b8 d8 05 00 00 	cmpq   $0x0,0x5d8(%rax) # s->online_alter_binlog
  fa7c89:	00                                              #
  fa7c8a:	0f 85 e8 03 00 00    	jne    fa8078           # jump away if != 0
 
  fa7c90:	45 84 ff             	test   %r15b,%r15b # if (likely(count_time))
  fa7c93:	0f 84 8e fe ff ff    	je     fa7b27      # continue to exit
  fa7c99:	0f 31                	rdtsc



 Comments   
Comment by Nikita Malyavin [ 2024-01-10 ]

To address the assembly issue, the patch ccf80f7dd8b (link) is made.

Comment by Michael Widenius [ 2024-01-24 ]

The above does not address the full code in my patch.
I cannot see the effect of the following:

For the common insert, my suggested codes uses 1 jump if and 1 memory access to replace a call+return, 3 ifs and 7 memory accesses (edited)

Also, making the function inline is not good as it causes code blot which will an affect if many users are calling insert/update/delete at the same time.

Also, if we need to add more code later to binlog_row(), the current solution with inline will be even worse.

I rather go with my suggested solution, at least until you show the difference with my patch and yours, without inline.

Comment by Michael Widenius [ 2024-01-24 ]

Not convinced this is a good idea.
I rather go back to the original code before the patch that that added checking if binlog_row_log() needs to be called at all.

Generated at Thu Feb 08 10:37:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.