[MDEV-31273] optimize write to binlog Created: 2023-05-15  Updated: 2024-02-07

Status: In Testing
Project: MariaDB Server
Component/s: Replication
Fix Version/s: 11.5

Type: Task Priority: Critical
Reporter: Andrei Elkin Assignee: Roel Van de Paar
Resolution: Unresolved Votes: 0
Labels: Preview_11.4

Attachments: OLTP-MDEV-31273-cheetah02.pdf, OLTP-MDEV-31273-g5.2.pdf, OLTP-MDEV-31273-g5.pdf, test_binlog_checksum_precompute_stall.pl
Issue Links:
PartOf

 Description   

The current "legacy" method of writing transactions to binlog involves per-event actions, dealing with changes in the event header, which can be optimized away.

In particular, the event checksum can be computed as early as in the event constructor, and the binlog write can then be done with a single cache-to-cache copy.

The expected performance gain may vary, and may need to be estimated before taking on implementation, but intuitively it looks to be greater for transactions made up of small events.



 Comments   
Comment by Andrei Elkin [ 2023-05-17 ]

knielsen, let me introduce you to the filed task. In case you're going to contribute indeed, please stay its assignee.

Comment by Kristian Nielsen [ 2023-05-30 ]

Some notes on implementing this:

  • Compute the checksum when writing to cache, allocating space for it.
  • Read the value of binlog_checksum when the binlog cache is initialized for the transaction. This way we can check at write-to-binlog time if global binlog_checksum changes during the transaction, and recompute/remove checksums as needed (binlog_checksum is a dynamic variable).
  • Most events will have the position field set to 0 to avoid recomputing checksums, and the slave will need to cope with these zero position values.
  • Occasionally the position field will need to be filled in by the binlog send thread on the master (this happens eg. for @@skip_replication). This may also be needed to support replication to old MariaDB slave or MySQL slave if they cannot tolerate zero position in the events (to be investigated).
  • May want to have some option to revert back to old behaviour with having correct position field in each event (and recompute checksums under LOCK_log)? For some backwards compatibility, but not on by default.
  • Check if the slave needs some way to see that the binlog is the new format with zero position field? Could be stored as a flag in FORMAT_DESCRIPTION_EVENT perhaps. But better if this can be avoided completely.
  • Optional idea is to optimize the multiple calls to write_data() in log_event_server.cc to use a single call with varargs for example.
  • Optional idea to also pre-encrypt the events in the cache.
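
The first and third notes above can be illustrated with a minimal sketch. It is not the server code: it assumes the common 19-byte binlog event header layout (timestamp, type, server id, event length, end_log_pos, flags) followed by a trailing 4-byte CRC32, and the function names, type code and byte buffers are hypothetical stand-ins for the transaction cache and the binlog file. The point is only that a checksum over an event with end_log_pos = 0 does not depend on where the event finally lands, so it can be computed once at cache-write time.

```python
import struct
import zlib

# Assumed header layout: timestamp, type, server_id, event_length,
# end_log_pos, flags (little-endian, 19 bytes, no padding).
HEADER_FMT = "<IBIIIH"
HEADER_LEN = struct.calcsize(HEADER_FMT)
CHECKSUM_LEN = 4


def build_event(type_code, server_id, body, end_log_pos=0):
    """Serialize one event and append a CRC32 over header + body."""
    event_length = HEADER_LEN + len(body) + CHECKSUM_LEN
    header = struct.pack(HEADER_FMT, 0, type_code, server_id,
                         event_length, end_log_pos, 0)
    crc = zlib.crc32(header + body) & 0xFFFFFFFF
    return header + body + struct.pack("<I", crc)


def write_to_cache(cache, type_code, server_id, body):
    """Precomputed mode: checksum at cache-write time, end_log_pos left 0."""
    cache.extend(build_event(type_code, server_id, body))


def commit_precomputed(binlog, cache):
    """What would run under the log lock: one copy, no per-event work."""
    binlog.extend(cache)


def commit_legacy(binlog, events, server_id):
    """Legacy mode: every event needs its final position, so the header
    changes and the checksum must be computed under the log lock."""
    for type_code, body in events:
        end_pos = len(binlog) + HEADER_LEN + len(body) + CHECKSUM_LEN
        binlog.extend(build_event(type_code, server_id, body, end_pos))


if __name__ == "__main__":
    cache = bytearray()
    write_to_cache(cache, 30, 1, b"row-1")   # 30 is an arbitrary type code
    binlog = bytearray(256)                  # pretend 256 bytes already exist
    commit_precomputed(binlog, cache)
    print(len(binlog))
```

In the precomputed path the cached bytes are identical wherever they are appended, which is what makes the single copy possible. In the legacy path the same event serializes differently at each file position.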
Comment by Kristian Nielsen [ 2023-08-23 ]

Pre-computing checksums means that events written through a trans/stmt cache will have the end_log_pos set to zero. But such events are always bracketed between GTID and XID/COMMIT events, which will have valid non-zero end_log_pos as they are written directly to the binlog.

So it turns out there's no need to fixup the end_log_pos in the binlog dump thread or elsewhere, not for @@skip_replication or other reasons, as the end_log_pos for GTID and XID/COMMIT will be valid and is what the slave will use. So this simplifies the implementation a bit.

Comment by Kristian Nielsen [ 2023-09-11 ]

Mailing list thread with patch series for review:

https://lists.mariadb.org/hyperkitty/list/commits@lists.mariadb.org/thread/ZTBRQ733MOY5WJSHGYUXR7RX7YCAXY2R/

Comment by Kristian Nielsen [ 2023-10-27 ]

This has now been pushed to 11.4.

Comment by Kristian Nielsen [ 2023-10-27 ]

Here is text for documentation (for once the 11.4 release is out with this feature):

--binlog-legacy-event-pos
 
New option in 11.4
 
Since MariaDB 11.4, writing to the binlog during transaction commit is
optimized to omit (set to 0) the field end_log_pos from binlog events. This
improves the scalability of the binlog. The field is redundant, as the value
can be derived from the position of the event in the file and the event's
`length` field.
 
Enabling the --binlog-legacy-event-pos option (which is off by default)
reverts the server to the pre-11.4 behavior and writes the actual
end_log_pos value to every binlog event. This should normally not be needed,
and can hurt binlog performance. The option is provided for backwards
compatibility in case any 3rd-party tools or applications are expecting the
end_log_pos field to be set.
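
For third-party tools that read the field, the following is a rough sketch of deriving the value as described above: the end position is the event's offset in the file plus its length field. It assumes the same 19-byte common header layout and a 4-byte file magic before the first event. `make_event` and `iter_events` are hypothetical helpers for illustration, not part of any MariaDB API.

```python
import struct

# Assumed header layout: timestamp, type, server_id, event_length,
# end_log_pos, flags (little-endian, 19 bytes).
HEADER_FMT = "<IBIIIH"
HEADER_LEN = struct.calcsize(HEADER_FMT)


def make_event(type_code, body, end_log_pos=0):
    """Build a toy event (no checksum) for demonstration purposes."""
    header = struct.pack(HEADER_FMT, 0, type_code, 1,
                         HEADER_LEN + len(body), end_log_pos, 0)
    return header + body


def iter_events(buf, start=4):
    """Yield (offset, type_code, stored_end_pos, derived_end_pos)."""
    offset = start
    while offset + HEADER_LEN <= len(buf):
        _ts, type_code, _sid, length, stored, _flags = struct.unpack_from(
            HEADER_FMT, buf, offset)
        if length < HEADER_LEN:
            raise ValueError(f"corrupt event length at offset {offset}")
        # Derived value: where this event ends, i.e. where the next begins.
        yield offset, type_code, stored, offset + length
        offset += length


if __name__ == "__main__":
    # Hypothetical file: 4 magic bytes, then two events with end_log_pos = 0.
    data = b"\xfebin" + make_event(30, b"abc") + make_event(31, b"defgh")
    for offset, type_code, stored, derived in iter_events(data):
        print(offset, type_code, stored, derived)
```

A tool written this way does not care whether the stored field is zero, so it should behave the same with or without --binlog-legacy-event-pos.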

Comment by Kristian Nielsen [ 2023-10-27 ]

Roel : The feature is pushed to github in branch knielsen_mdev31273_11.4 , if you want to test it.

Comment by Roel Van de Paar [ 2023-11-10 ]

knielsen_mdev31273_11.4 was previously deleted, testing as 11.4 trunk (which currently contains ~only this patch)

Comment by Roel Van de Paar [ 2023-11-11 ]

axel Can you please verify the performance if not done already, or is binlog performance testing included in your standard release tests? Thank you

Comment by Axel Schwenke [ 2023-11-14 ]

Roel, there is a test in the regression suite with the binlog enabled. However, 11.4 is not tested until we release 11.4.0. I can, however, run the same test on 11.4 outside the normal regression tests.

Comment by Andrei Elkin [ 2023-11-15 ]

axel, I suggest varying the trx sizes. The maximum performance increase is expected for workloads with trx consisting of many short statements (in the patch's BASE, each is checksummed by the leader of the Binlog-Group-Commit (BGC)).

Comment by Kristian Nielsen [ 2023-11-15 ]

Also, we shouldn't necessarily expect any significant throughput improvements from this change alone. The binlog commit throughput is likely to be mostly bottlenecked by the performance of the underlying storage. Most storage systems probably cannot write data faster than the CPU can checksum it. Other motivations for this change are: a general cleanup that reduces the work done under LOCK_log; a cleaner implementation of custom patches in use by some users; and an enabler for future improvements.

Still, very good idea to benchmark it, of course.

Comment by Roel Van de Paar [ 2023-12-05 ]

axel Thank you. Yes please. We're looking for increases as well as unforeseen decreases in performance. See Elkin's input for added setup ideas.

Comment by Axel Schwenke [ 2024-01-14 ]

I have run the suite of available OLTP tests on two different hosts. There is no visible advantage from the new binlog format. I also tested with lazy redo logging (innodb_flush_log_at_trx_commit = 0) to remove an InnoDB bottleneck and reduce IO pressure, but that did not have much effect either.

Cheetah02 (32 threads, 128G RAM, SATA SSD) results: OLTP-MDEV-31273-cheetah02.pdf
g5 (12 threads, 64G RAM, NVMe SSD) results: OLTP-MDEV-31273-g5.pdf

Comment by Axel Schwenke [ 2024-01-14 ]

Ran an additional test on g5.

Comment by Roel Van de Paar [ 2024-01-14 ]

As no performance benefits were found from the implementation, returning this to Elkin.
serg Elkin Please confirm if we will still proceed with implementing this patch given the findings.

Comment by Kristian Nielsen [ 2024-01-15 ]

Erm, no.

As was already explained, we don't necessarily expect any measurable throughput improvement from this patch.

I have attached a simple test script test_binlog_checksum_precompute_stall.pl that measures the time to commit a large transaction. On a fast disk, I measure the stall from the commit of a large transaction to be 30% longer with --binlog-legacy-event-pos than without it:

legacy:
 
Table rows: 200002
Time for all commits: 0.171386
Time for big commits: 0.171134
 
precomputed:
 
Table rows: 200002
Time for all commits: 0.131941
Time for big commits: 0.13176
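
As a quick check of the arithmetic, the sketch below recomputes the relative difference from the two "Time for big commits" values reported above. The variable names are illustrative, and the times are taken as printed by the script, with no unit assumed.

```python
# Values copied from the test output above.
legacy_big = 0.171134
precomputed_big = 0.13176

# How much longer the legacy stall is, relative to the precomputed one.
extra = (legacy_big - precomputed_big) / precomputed_big
# The same difference seen from the other side: time saved versus legacy.
saved = (legacy_big - precomputed_big) / legacy_big

print(f"legacy stall is {extra:.1%} longer than precomputed")
print(f"precomputed stall is {saved:.1%} shorter than legacy")
```

The first ratio comes out just under 30%, matching the figure quoted. Measured against the legacy time, the saving is roughly 23%.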

Comment by Axel Schwenke [ 2024-01-15 ]

I agree with knielsen. If it reduces bloat or removes redundant information from the binlog, then we should make this change to the binlog format, regardless of performance improvements. The only counter-argument would be a performance regression (but I see no chance of that).

I have just discovered that the tests on g5 were run with the default InnoDB redo log size of 4M. I meant to run them with a huge (16G) log, but somehow the trailing 'G' was lost from my.cnf. I will repeat the benchmark on g5.

Comment by Axel Schwenke [ 2024-01-22 ]

I've rerun the OLTP tests on g5 with 16G InnoDB redo log and now I see no differences between new and legacy binlog: OLTP-MDEV-31273-g5.2.pdf

The differences are no bigger than the usual fluctuations.

Generated at Thu Feb 08 10:22:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.