Details

    Description

      Original request:
      We have a 130GB database, replicated to a backup cluster, and the binlog files consume 400GB of storage for a single day, without any option to reduce that to something more reasonable!

      Implementation guidelines:

      • Add a variable to limit the maximum binlog space (max-binlog-total-space and/or binlog_space_limit)
      • Binlog size should be checked during server start, binlog rotation, FLUSH LOGS, when writing to the binary log, or when max-binlog-total-size changes value.
      • Add option --slave-connections-needed-for-purge with 1 as default. This is the minimum number of slaves that need to be connected for binary logs to be deleted.
        The number should normally be the minimum number of expected slaves. The binary logs will not be deleted until at least that many slaves are attached and no one is using the to-be-deleted binary log.
        For example, assuming one is supposed to have 3 slaves connected to the server, MariaDB will not delete binary logs based on size until all of them are connected at the same time and all of the slaves are requesting data from binary logs after the to-be-deleted one.
      • Add status variable 'Binlog_disk_use' that shows current binary log space usage.
        (A configuration sketch illustrating these options follows this list.)
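
      For illustration, a minimal configuration sketch based on the guidelines above; the option and status variable names used here are the ones proposed in this description and may differ in the final implementation:

        # my.cnf sketch -- names as proposed above, verify against the released server
        [mariadbd]
        log_bin                            = ON
        max_binlog_total_size              = 50G   # limit total binary log disk usage to roughly 50GB
        slave_connections_needed_for_purge = 3     # expect 3 replicas before size-based purge kicks in

      The current usage could then be monitored with the proposed status variable:

        SHOW GLOBAL STATUS LIKE 'Binlog_disk_use';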

          Activity

            elenst Elena Stepanova added a comment:

            ccounotte,
            Could you please clarify your side note about DROP TABLE with foreign keys:

            I did a DROP TABLE xxx on primary cluster, for a table being child of foreign keys and obviously it failed, but for some reason the statement is being replicated and makes the slave constantly exit !?

            In this context, what is the primary cluster and what is the slave? Do you mean that the statement was replicated within a Galera cluster, between nodes? Or is it about traditional replication? Do you have an example of a binary log?

            knielsen Kristian Nielsen added a comment (edited):

            Review done: https://lists.mariadb.org/hyperkitty/list/developers@lists.mariadb.org/thread/EXIBNZ6PY4FQNBIUWJXS7SHKBMJWFPQU/
            elenst Elena Stepanova added a comment (edited):

            In my opinion the feature as of bb-11.4-timestamp 39fa2e267cef7952e can be pushed into 11.4 and released with 11.4 RC.

            I have some concerns about hypothetical corner cases which testing could have missed, especially involving replication, but I don't think at this stage the feature will benefit crucially from further internal (only) testing.

            At the same time, it is important to get feedback from the community. One controversial detail is the change of behavior for old expire_logs_days / binlog_expire_logs_seconds variables.

            This was discussed during code review and acknowledged as "by design", which is why it wasn't raised during testing. In practice, by default and without extra care, these options become ignored after an upgrade from an old server to the new one on instances which don't serve as a primary to at least one replica.

            It can probably be assumed that the number of affected instances shouldn't be huge, as writing binary logs without using them in replication doesn't make much sense; if it's done for possible recovery, it is an expensive and very unreliable way to achieve that. Still, even if only a small fraction of installations is affected, the change is quite dangerous: if they have been relying on log rotation for years and it suddenly stops working, the problem may only be noticed once disk space is completely exhausted, at which point all sorts of other problems can follow.

            I don't see it as a reason to postpone the release though. If we or users come up with an idea for better default configuration, we can still change the defaults before GA.
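
            As an illustration of the concern above (a sketch only, using the status variable name from this task's description): an administrator of a server that writes binary logs but has no replicas could check after upgrade whether expiration is still taking effect, e.g.

              SHOW GLOBAL VARIABLES LIKE 'binlog_expire_logs_seconds';
              SHOW GLOBAL STATUS LIKE 'Binlog_disk_use';   -- proposed status variable for total binlog disk usage
              SHOW BINARY LOGS;                            -- old files steadily piling up would suggest purging is blocked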


            elenst Elena Stepanova added a comment:

            Unfortunately it turns out that the branch/commit is unusable for the release for reasons unrelated to this feature (as it also contains MDEV-32188, which turned out to be problematic).
            Also due to that, the last pushes into the branch didn't go through all buildbot tests on Windows and non-x86_64 platforms. So, while the feature approval above remains valid, releasing the feature requires either fixing the remaining issues or splitting the branch, and in either case at the very least additional tests in the buildbot, preferably on all platforms.


            monty Michael Widenius added a comment:

            Just a clarification to Elena's comments (I just talked with her).
            "One controversial detail is the change of behavior for old expire_logs_days / binlog_expire_logs_seconds variables."
            The issue is that --slave-connections-needed-for-purge=1 also affects the above two variables and fixes a bug where the binary logs could be removed
            even if an existing but not currently connected slave could still need them. That could be changed by making --slave-connections-needed-for-purge default to 0.
            However, I think the current default is better as it avoids unexpected loss of data.

            Regarding buildbot, both Elena and I forgot to check buildbot for the last few pushes (as we were concentrating on fixing the issues that Elena had found).
            I am now working on fixing the buildbot issues.
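
            A sketch of the trade-off described above: a server that writes binary logs but intentionally has no replicas could restore purging based purely on expiration time by lowering the threshold itself (option name as used in this discussion; check the final documentation):

              [mariadbd]
              binlog_expire_logs_seconds         = 604800   # keep roughly 7 days of binary logs
              slave_connections_needed_for_purge = 0        # purge even when no replicas are connected

            With the default of 1, the same server keeps its binary logs until at least one replica has connected, which is the safer behavior argued for above.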


            People

              monty Michael Widenius
              ccounotte COUNOTTE CEDRIC
              Votes: 0
              Watchers: 8

