[MDEV-31273] optimize write to binlog Created: 2023-05-15 Updated: 2024-02-07 |
|
| Status: | In Testing |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Fix Version/s: | 11.5 |
| Type: | Task | Priority: | Critical |
| Reporter: | Andrei Elkin | Assignee: | Roel Van de Paar |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | Preview_11.4 | ||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Description |
|
The current "legacy" method of writing transactions to the binlog involves per-event actions to handle changes in the event header, and these can be optimized away. In particular, the event checksum can be computed as early as in the event constructor, and the binlog write can be done with one cache-to-cache copy. The expected performance gain may vary and may need to be estimated before taking on the implementation, but intuitively it looks to be greater for transactions made up of small events. |
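The contrast described above can be sketched roughly as follows. This is a simplified illustration, not the server's code: the function names are hypothetical, the header layout is a reduced model of a binlog event header, and `zlib.crc32` stands in for the checksum routine.

```python
import struct
import zlib

# Simplified event header (assumed layout for illustration):
# timestamp, type, server_id, event_len, end_log_pos, flags
HEADER = struct.Struct("<IBIIIH")
CHECKSUM_LEN = 4


def make_event(type_code, body, end_log_pos=0):
    """Build one event and append its CRC32 at construction time."""
    event_len = HEADER.size + len(body) + CHECKSUM_LEN
    raw = HEADER.pack(0, type_code, 1, event_len, end_log_pos, 0) + body
    return raw + struct.pack("<I", zlib.crc32(raw))


def legacy_write(binlog, cache_events):
    """Per-event path: patch end_log_pos in the header, then re-checksum."""
    for ev in cache_events:
        fields = list(HEADER.unpack_from(ev))
        fields[4] = len(binlog) + len(ev)  # end_log_pos after this event
        raw = HEADER.pack(*fields) + ev[HEADER.size:-CHECKSUM_LEN]
        binlog.extend(raw + struct.pack("<I", zlib.crc32(raw)))


def precomputed_write(binlog, cache_events):
    """New path: checksums are already valid, so copy the cache in one go."""
    binlog.extend(b"".join(cache_events))


def verify(binlog):
    """Walk the log, check every event's checksum, return the event count."""
    pos = count = 0
    while pos < len(binlog):
        event_len = HEADER.unpack_from(binlog, pos)[3]
        ev = bytes(binlog[pos:pos + event_len])
        (stored,) = struct.unpack("<I", ev[-CHECKSUM_LEN:])
        assert stored == zlib.crc32(ev[:-CHECKSUM_LEN])
        pos += event_len
        count += 1
    return count
```

In this model both paths produce a log whose checksums verify; the difference is that the legacy path touches every event at write time, while the precomputed path does a single copy and leaves `end_log_pos` at zero in the cached events.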
| Comments |
| Comment by Andrei Elkin [ 2023-05-17 ] | ||||||||||||||||
|
knielsen, let me introduce you to the filed task. If you are indeed going to contribute, please remain its assignee. | ||||||||||||||||
| Comment by Kristian Nielsen [ 2023-05-30 ] | ||||||||||||||||
|
Some notes on implementing this:
| ||||||||||||||||
| Comment by Kristian Nielsen [ 2023-08-23 ] | ||||||||||||||||
|
Pre-computing checksums means that events written through a trans/stmt cache will have the end_log_pos set to zero. But such events are always bracketed between GTID and XID/COMMIT events, which will have valid non-zero end_log_pos as they are written directly to the binlog. So it turns out there's no need to fixup the end_log_pos in the binlog dump thread or elsewhere, not for @@skip_replication or other reasons, as the end_log_pos for GTID and XID/COMMIT will be valid and is what the slave will use. So this simplifies the implementation a bit. | ||||||||||||||||
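A toy model of the bracketing argument above, under stated assumptions: the `Event` class, the helper names and the event lengths are all made up for illustration, and the way a real slave tracks its position is reduced to "use the last non-zero end_log_pos seen".

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str
    length: int
    end_log_pos: int = 0  # 0 models "not filled in" for cached events


def write_group(binlog_pos, cached):
    """Write GTID + cached events + XID; return (events, new position)."""
    out = []
    pos = binlog_pos

    gtid = Event("GTID", 42)          # written directly: position is known
    pos += gtid.length
    gtid.end_log_pos = pos
    out.append(gtid)

    for ev in cached:                 # copied from the cache unchanged
        pos += ev.length
        out.append(ev)                # end_log_pos stays 0

    xid = Event("XID", 31)            # written directly: position is known
    pos += xid.length
    xid.end_log_pos = pos
    out.append(xid)
    return out, pos


def last_known_pos(events):
    """Position a reader ends at: the last non-zero end_log_pos it saw."""
    pos = 0
    for ev in events:
        if ev.end_log_pos:
            pos = ev.end_log_pos
    return pos
```

Because the closing XID/COMMIT event always carries a valid position, a reader that ignores the zeroed values in between still ends up at the true end of the transaction.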
| Comment by Kristian Nielsen [ 2023-09-11 ] | ||||||||||||||||
|
Mailing list thread with patch series for review: | ||||||||||||||||
| Comment by Kristian Nielsen [ 2023-10-27 ] | ||||||||||||||||
|
This has now been pushed to 11.4. | ||||||||||||||||
| Comment by Kristian Nielsen [ 2023-10-27 ] | ||||||||||||||||
|
Here is text for documentation (for once the 11.4 release is out with this feature):
| ||||||||||||||||
| Comment by Kristian Nielsen [ 2023-10-27 ] | ||||||||||||||||
|
Roel : The feature is pushed to github in branch knielsen_mdev31273_11.4 , if you want to test it. | ||||||||||||||||
| Comment by Roel Van de Paar [ 2023-11-10 ] | ||||||||||||||||
|
knielsen_mdev31273_11.4 was previously deleted, testing as 11.4 trunk (which currently contains ~only this patch) | ||||||||||||||||
| Comment by Roel Van de Paar [ 2023-11-11 ] | ||||||||||||||||
|
axel Can you please verify the performance if not done already, or is binlog performance testing included in your standard release tests? Thank you | ||||||||||||||||
| Comment by Axel Schwenke [ 2023-11-14 ] | ||||||||||||||||
|
Roel, there is a test in the regression suite with the binlog enabled. However, 11.4 is not tested until we release 11.4.0. I can, however, run the same test on 11.4 outside the normal regression tests. | ||||||||||||||||
| Comment by Andrei Elkin [ 2023-11-15 ] | ||||||||||||||||
|
axel, I suggest varying the trx sizes. The maximum performance increase is expected for workloads whose trx consist of many short statements (each is to be checksummed by a leader of Binlog-Group-Commit (BGC) in the patch's BASE). | ||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-15 ] | ||||||||||||||||
|
Also, we shouldn't necessarily expect any significant throughput improvements from this change alone. The binlog commit throughput is likely to be mostly bottlenecked by the performance of the underlying storage. Most storage systems probably cannot write data faster than the CPU can checksum it. Other motivations for this change are: a general cleanup that reduces work done under LOCK_log; a cleaner implementation of custom patches in use by some users; and an enabler for future improvements. Still, it is a very good idea to benchmark it, of course. | ||||||||||||||||
| Comment by Roel Van de Paar [ 2023-12-05 ] | ||||||||||||||||
|
axel Thank you. Yes please. We're looking for increases as well as unforeseen decreases in performance. See Elkin's input for added setup ideas. | ||||||||||||||||
| Comment by Axel Schwenke [ 2024-01-14 ] | ||||||||||||||||
|
I have run the suite of available OLTP tests on two different hosts. There is no advantage visible from the new binlog format. I also tested with lazy redo logging (innodb_flush_log_at_trx_commit = 0) to remove an InnoDB bottleneck and reduce IO pressure, but that also had little effect. Cheetah02 (32 threads, 128G RAM, SATA SSD) results: OLTP-MDEV-31273-cheetah02.pdf | ||||||||||||||||
| Comment by Axel Schwenke [ 2024-01-14 ] | ||||||||||||||||
|
run additional test on g5 | ||||||||||||||||
| Comment by Roel Van de Paar [ 2024-01-14 ] | ||||||||||||||||
|
As no performance benefits were found from the implementation, returning this to Elkin. | ||||||||||||||||
| Comment by Kristian Nielsen [ 2024-01-15 ] | ||||||||||||||||
|
Erm, no. As was already explained, we don't necessarily expect any measurable throughput improvement from this patch. I have attached a simple test script test_binlog_checksum_precompute_stall.pl that measures the time to commit a large transaction. On a fast disk, I measure the stall from the commit of a large transaction to be 30% longer with --binlog-legacy-event-pos than without it:
| ||||||||||||||||
| Comment by Axel Schwenke [ 2024-01-15 ] | ||||||||||||||||
|
I agree with knielsen. If it reduces bloat or removes redundant information from the binlog, then we should make this change in the binlog format, regardless of performance improvements. The only counterargument would be a performance regression (but I see no chance of that). I have just discovered that the tests on g5 were run with the default InnoDB redo log size of 4M. I meant to run them with a huge (16G) log, but somehow the trailing 'G' was lost from my.cnf. I will repeat the benchmark on g5. | ||||||||||||||||
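To illustrate how much a lost suffix changes such a setting, here is a minimal sketch assuming the usual K/M/G convention (multiples of 1024) for numeric option values. The `parse_size` helper is hypothetical, and what the server then does with an out-of-range value is not modelled.

```python
# Binary multipliers for the common size suffixes (assumed convention).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert a value such as '16G' or '16' to a number of bytes."""
    text = text.strip()
    suffix = text[-1:].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)  # no suffix: the number is taken as plain bytes
```

With this convention, "16G" is 17179869184 bytes while a bare "16" is just 16 bytes, so the intended large redo log would not be in effect.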
| Comment by Axel Schwenke [ 2024-01-22 ] | ||||||||||||||||
|
I've rerun the OLTP tests on g5 with a 16G InnoDB redo log, and now I see no significant differences between the new and legacy binlog: OLTP-MDEV-31273-g5.2.pdf. Any differences are no bigger than the usual fluctuations. |