[MDEV-4991] GTID binlog indexing Created: 2013-09-03 Updated: 2024-02-08 |
|
| Status: | In Testing |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Fix Version/s: | 11.4.1 |
| Type: | New Feature | Priority: | Critical |
| Reporter: | Kristian Nielsen | Assignee: | Roel Van de Paar |
| Resolution: | Unresolved | Votes: | 5 |
| Labels: | Preview_11.4, gtid | ||
| Issue Links: |
|
||||||||||||||||
| Description |
|
Current GTID code needs to scan one binlog file from the beginning when a To fix this problem, indexing should be done on the binlog files, allowing to The index could be an extra file master-bin.000001.idx written in parallel The index would be page-based, allowing a connecting slave to do binary search
There is no need to include every position in the index. We can write say one To further reduce the index size, it could be "compressed" by omitting from entries those (domain_id, server_id) combinations that do not change. Typically, there can be many distinct such values in a binlog file, but only a few of them are likely to change within one given file. A work-in-progress high-level design description:
|
| Comments |
| Comment by Kristian Nielsen [ 2015-01-26 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This might also be a nice project for GSoC. It would be very valuable to have, and while not trivial, it is reasonably well defined and reasonably limited to a smaller part of the server code. I have some more detailed notes on a possible design, I will add them later. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2015-02-09 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Jonas Oreland said he had already implemented this: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jonas Oreland [ 2015-04-24 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
fyi...now trying to produce a usable patch... | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-05-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
bnestere, to the design, we can also make use of the slave knowledge of the gtid's binlog coordinates which it could contribute to master | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-06-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Please review bb-10.11-midenok-MDEV-4991 (all last commits authored by me) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-08-02 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
From a quick look at mostly the commit comments, it sounds like the existing work doesn't implement this task, but instead a different task of keeping an in-memory index of parts of the binlog. I think there are a couple of problems with this approach (if I understood correctly). The most serious problem is if this does not solve the serious problem of slow worst-case performance of slave connect. In the worst case, the slave connect needs on the master to fully scan a whole binlog file, even worse if that file is encrypted. IIUC, an in-memory index can improve the average time for this (in case of many slave connects during master server uptime), but it will not help the worst-case latency resulting from this. Another problem is that this incurs a memory usage in the master server that's unused for most of the time, only used when a slave connects, which is rare in many setups. So if we want to implement the in-memory index first, it's very important to design the user-visible part of the feature (eg. configuration options, status variables, etc.) so that it can be seamlessly upgraded to when the real solution with disk-based binlog indexes is implemented. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-08-03 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Please share your insights, if any, on that. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-08-04 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I'll need to look closer at the design then, I'll try to do that at some point. Is there any design description or discussions around this I could refer to? Or is it just the code in branch bb-11.2-midenok-MDEV-4991 ? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-08-04 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen, I'd go to implement the in-memory first, after all but some worst cases access to the slave start gtid is reduced drastically. The disk-based can be considered separately, and unlike rob.schwyzer@mariadb.com's proposal I'd consider how to turn it on automatically after a memory limit is reached. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-08-07 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen As far is user interface is concerned you may not go far into design. Commit message explanation should be enough. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-08-17 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Ok, looks like the two configuration variables can just be made no-op and the instrumentation show zeros, once the proper implementation gets done. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-08-30 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
FTR, I have now started to work on this. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-08-31 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen, so you're adding the index persistency layer to bb-11.2-midenok-MDEV-4991? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-08-31 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Elkin, no, persistent index is the proper implementation of this, as described in the task description. It will replace any in-memory index that's in the code when the implementation is done. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-08-31 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen, you say 'replace', but could your ultimate goal materialize with enriching the Alekey's design with a layer to write memory index records to a file? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-08-31 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I don't believe an in-memory index serves any useful purpose. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by VAROQUI Stephane [ 2023-08-31 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Why not using an extra aria system table partition by range of binlog name ( fix record size) , add partition on binlog file create, drop partition when binlog is purged | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-09-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen, master-bin.000001 offset_1 offset_2 ... offset_200 (and still "minimally" impact on the current business logics of that index maintenance) I don't insist on this particular approach. After all it's ad hoc at this very moment. My point is obviously try to exploit Aleksey (well his work efforts | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-09-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Elkin I think the index should be written incrementally, in parallel with the binlog file. But I'll share my thoughts as soon as I have something concrete,. For now I'm still working on ideas. I'm not sure how having the offsets in the .index would help? You need the complete binlog state associated with each index to be able to locate correctly a GTID position, the same info that's stored in GTID_LIST_EVENT. And this can be potentially fairly large on setups where many domain_ids or server_ids were configured in the past. But agree, I also just want this fixed properly. Looking now, this bug is 10 years since I created it. Maybe it was a mistake that I pushed GTID without an implementation of this included. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-09-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
> I'm not sure how having the offsets in the .index would help? You need the complete binlog state associated with each index to be able to locate correctly a GTID position I had in mind a simple one domain case. And offsets are offsets in the binlog files for each say 1000th gtid logged into it. To | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-09-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Ok, I took a quick look at the bb-11.2-midenok-MDEV-4991 branch. So this only addresses the lookup from offset to GTID position. It doesn't try to speed up slave connect, where it needs to lookup from GTID position to offset. That's why it never needs the complete binlog state. It also still processes every GTID in the binlog file up to offset (IIUC), just doing so from an in-memory GTID list rather than reading full events from a file. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-09-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Excuse me, but binlog_gtid_pos() call is what used in SHOW MASTER STATUS upon slave connection. Isn't it? https://mariadbcorp.atlassian.net/browse/SAMU-142 is the original task for the above branch to speed up SHOW MASTER STATUS. I guess, rob.schwyzer@mariadb.com had this requirement based on production cases. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-09-02 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
midenok, it depends on whether the connection slave is using GTID position to connect or not. If using GTID position, the slave needs to look up the file/offset of where to start in the binlog. If not using GTID position, the slave calls binlog_gtid_pos(), but this is just to record a GTID position that can be used for a later connect after CHANGE MASTER TO master_use_gtid=slave_pos. Both of these currently requires scanning a binlog file from the start, which can be slow. I'm not sure about how SHOW MASTER STATUS would be involved? Hope this helps, - Kristian. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-09-04 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen You right, SHOW MASTER STATUS is not involved here. In any case, the use case of binlog_gtid_pos() suffers from significant slowdowns as SAMU-142 testifies. I'm not sure on how much important the opposite speedup though. Btw, the patches for review are in bb-10.11-midenok-MDEV-4991 (there are more patches) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-09-04 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The speedup for a slave connecting with Master_use_gtid=slave_pos is the most critical one, as for GTID position, the search for the place to start is required. The speedup for binlog_gtid_pos() is simpler to solve. This is called by a slave connecting with Master_use_gtid=no, and it is only done so that the slave can compute a correct @@gtid_slave_pos in case the user later wants to switch to GTID automatically. An option could be implemented for the user to disable that to avoid the slowdown (if they don't plan to switch to GTID at the moment), which would be even simpler than your patch and could probably even be backported to stable releases. Both cases of slowdown are of similar magnitude and both very important of course. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-09-07 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Branch knielsen_mdev4991 is still very much work in progress. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-09-09 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Now the code in branch knielsen_mdev4991 is more mature. It speeds up both GTID and non-GTID slave connect, as well as BINLOG_GTID_POS(). | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-09-11 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Added a link to the mailing list thread discussing the design | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-10 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The GTID index feature is now complete and ready for testing. The code, based on 11.3, is pushed to knielsen_mdev4991_11.3: https://github.com/MariaDB/server/commits/knielsen_mdev4991_11.3 (Taking this back, I feel the original MDEV was kind of hi-jacked here for something entirely different.) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-10 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is some documentation for the testing and eventually for the KB:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jukka Pihl [ 2023-11-13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
If I understand main purpose of this feature is make BINLOG_GTID_POS() faster. (arggh... I think this could enable some other new functionality too:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
bluebike, Yes, the functionality for GTID=>FILE:POS is already in the code, used when a slave connects with Master_use_gtid=slave_pos. It's just not exposed through a user function currently. Interesting idea about backtracking. Indeed this should be possible to implement using the indexes, by following the previous key in the GTID index. (The main purpose of GTID indexes is to speed up when a slave connects. This can otherwise use up a lot of CPU, disk I/O, and/or wallclock time.) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2023-11-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Roel, at the Unconference it was discussed that the test team might want to test the gtid indexes before it's pushed to 11.4. Elena suggested to assign tentatively to you and "request testing". The code is available in the branch knielsen_mdev4991_11.3 . - Kristian. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2023-11-15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen Yes, thank you. And from the discussion I understand you will also co-test, which will be helpful as there are a fair number of things to test in the pipeline for the coming weeks and months. Thank you | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramesh Sivaraman [ 2023-12-13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Roel Functional testing looks good. Did not see any related/specific issues. Also tested in combination with binlog encryption. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2024-01-09 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I have updated the patch:
Main visibility is that the options --binlog-gtid-index-sparse and --binlog-gtid-index-span-max have been removed to simplify the user configuration, as these options were mostly redundant. The current code is pushed to the branch: bb-11.4-knielsen-mdev4991 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2024-01-27 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Binlog GTID indexes has now been merged to 11.4.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2024-01-29 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Not sure why this was merged; no OK to push was provided. As there are still errors, observed during testing, to be evaluated, as well as a backlog of issues and tasks, this will not make it into 11.4. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2024-01-29 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen Please revert https://github.com/MariaDB/server/commit/d039346a7acac7c72f264377a8cd6b0273c548df | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2024-01-29 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Roel, the patch will not be reverted, unless I learn of a technical reason to do so. I have not been informed of any errors observed during test, what are you referring to? In fact, I have not seen hardly any discussions anywhere, IRC, Zulip, mailing list, of the 11.4 release. Does such discussion take place on internal communication channels not accessible to all developers? Or is there just no coordination whatsoever? This task has been completed and testing has been done and merge has been agreed with the reviewer. All is as it should be. Roel, it's great if you want to contribute testing of new features, that's much appreciated. But you can't come months after a patch is complete, with no communication at all except "Functional testing looks good", and suddenly say "no OK to push was provided". That's simply not workable. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2024-01-31 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
> I have not been informed of any errors observed during test, what are you referring to? > In fact, I have not seen hardly any discussions anywhere, IRC, Zulip, mailing list, of the 11.4 release. > testing has been done > All is as it should be. > it's great if you want to contribute testing of new features | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2024-01-31 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
"The current regular procedure is..." "Management can override ..." - there are MariaDB corporation procedures. But this is not a MariaDB corporation work. Maybe that's the misunderstanding here. Since this is not MariaDB corporation work, you are not under any obligation to test it. But of course if you want to contribute testing, that's much appreciated. That is why I wrote in November: "if you want to test, please be sure to let me know of any problems that I need to fix in time for me to do it before the 11.4 release." Already back then, you informed me that it was uncertain if you would have the time. Therefore I planned to do the required testing otherwise. And this is what happened, the required testing has been completed outside of MariaDB corporation, that is what is meant by "testing has been done". | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2024-02-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thank you. I understand where the confusion comes from now. However, please note that wherever a patch or feature comes from, testing and signoff by MariaDB testers is standard procedure, alike to review signoff. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2024-02-05 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen Hi! An issue that I regularly see in the runs is this assertion upon shutdown:
The only bug I could locate with a similar assert is an old closed Galera bug ( The issue has proven to be not reducible into a testcase, and always happens after shutdown. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2024-02-05 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Roel , do you have a stack trace showing where this happens? This is an assertion about memory usage accounting, the stack trace should show which memory allocation is involved. Can you get a core file and show (with gdb) the values of the variables is_thread_specific, mysql_server_initialized, and thd? Apparently this is a memory allocation that should be accounted on a specific thread, but either the server is not yet initialized, or it's called from a place where the thread is unknown. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2024-02-06 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
knielsen Thank you for the input. All good; I have found that the issue is present in a non-patched 11.3 as well. I will continue to debug it on the side and log it later as a new ticket (including the information mentioned), but it is confirmed not related to MDEV-4991. (NTS: data/000000) |