Details

    Description

      Original request:
      We have a 130GB database, replicated to a backup cluster, and the binlog files consume 400GB of storage for a single day, without any option to reduce that to something more reasonable!

      Implementation guidelines:

      • Add a variable to limit the maximum binlog space (max-binlog-total-space and/or binlog_space_limit)
      • Binlog size should be checked during server start, binlog rotation, FLUSH LOGS, when writing to the binary log, or when max-binlog-total-size changes value.
      • Add option --slave-connections-needed-for-purge with 1 as default. This is the minimum number of slaves that need to be connected for binary logs to be deleted.
        The number should normally be the minimum number of expected slaves. The binary logs will not be deleted until at least that many slaves are attached and no one is using the to-be-deleted binary log.
        For example, assuming one is supposed to have 3 slaves connected to the server, MariaDB will not delete binary logs based on size until all of them are connected at the same time and all of the slaves are requesting data from binary logs after the to-be-deleted one.
      • Add status variable 'Binlog_disk_use' that shows current binary log space usage.
        (A configuration sketch illustrating these options follows this list.)
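
      For illustration, a minimal configuration sketch based on the guidelines above; the option and status variable names used here are the ones proposed in this description and may differ in the final implementation:

        # my.cnf sketch -- names as proposed above, verify against the released server
        [mariadbd]
        log_bin                            = ON
        max_binlog_total_size              = 50G   # limit total binary log disk usage to roughly 50GB
        slave_connections_needed_for_purge = 3     # expect 3 replicas before size-based purge kicks in

      The current usage could then be monitored with the proposed status variable:

        SHOW GLOBAL STATUS LIKE 'Binlog_disk_use';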

          Activity

            elenst Elena Stepanova added a comment:

            ccounotte,
            Could you please clarify your side note about DROP TABLE with foreign keys:

            I did a DROP TABLE xxx on primary cluster, for a table being child of foreign keys and obviously it failed, but for some reason the statement is being replicated and makes the slave constantly exit !?

            In this context, what is the primary cluster and what is the slave? Do you mean that the statement was replicated within a Galera cluster, between nodes? Or is it about traditional replication? Do you have an example of a binary log?

            knielsen Kristian Nielsen added a comment (edited):

            Review done: https://lists.mariadb.org/hyperkitty/list/developers@lists.mariadb.org/thread/EXIBNZ6PY4FQNBIUWJXS7SHKBMJWFPQU/
            elenst Elena Stepanova added a comment (edited):

            In my opinion the feature as of bb-11.4-timestamp 39fa2e267cef7952e can be pushed into 11.4 and released with 11.4 RC.

            I have some concerns about hypothetical corner cases which testing could have missed, especially involving replication, but I don't think at this stage the feature will benefit crucially from further internal (only) testing.

            At the same time, it is important to get feedback from the community. One controversial detail is the change of behavior for old expire_logs_days / binlog_expire_logs_seconds variables.

            This was discussed during code review and acknowledged as "by design", which is why it wasn't raised during testing. In practice, by default and without extra care, these options become ignored after an upgrade from an old server to the new one on instances which don't serve as a primary to at least one replica.

            It can probably be assumed that the number of affected instances shouldn't be huge, as writing binary logs without using them in replication doesn't make much sense; if it's done for possible recovery, it is an expensive and very unreliable way to achieve that. Still, even if only a small fraction of installations is affected, the change is quite dangerous: if they have been relying on log rotation for years and it suddenly stops working, the problem may only be noticed once disk space is completely exhausted, at which point all sorts of other problems can follow.

            I don't see it as a reason to postpone the release though. If we or users come up with an idea for better default configuration, we can still change the defaults before GA.
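
            As an illustration of the concern above (a sketch only, using the status variable name from this task's description): an administrator of a server that writes binary logs but has no replicas could check after upgrade whether expiration is still taking effect, e.g.

              SHOW GLOBAL VARIABLES LIKE 'binlog_expire_logs_seconds';
              SHOW GLOBAL STATUS LIKE 'Binlog_disk_use';   -- proposed status variable for total binlog disk usage
              SHOW BINARY LOGS;                            -- old files steadily piling up would suggest purging is blocked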


            elenst Elena Stepanova added a comment:

            Unfortunately it turns out that the branch/commit is unusable for the release for reasons unrelated to this feature (as it also contains MDEV-32188, which turned out to be problematic).
            Also due to that, the last pushes into the branch didn't go through all buildbot tests on Windows and non-x86_64 platforms. So, while the feature approval above remains valid, releasing the feature requires either fixing the remaining issues or splitting the branch, and in either case at the very least additional tests in the buildbot, preferably on all platforms.


            monty Michael Widenius added a comment:

            Just a clarification to Elena's comments (I just talked with her).
            "One controversial detail is the change of behavior for old expire_logs_days / binlog_expire_logs_seconds variables."
            The issue is that --slave-connections-needed-for-purge=1 also affects the above two variables and fixes a bug where the binary logs could be removed
            even if an existing but not currently connected slave could still need them. That could be changed by making --slave-connections-needed-for-purge default to 0.
            However, I think the current default is better as it avoids unexpected loss of data.

            Regarding buildbot, both Elena and I forgot to check buildbot for the last few pushes (as we were concentrating on fixing the issues that Elena had found).
            I am now working on fixing the buildbot issues.
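
            A sketch of the trade-off described above: a server that writes binary logs but intentionally has no replicas could restore purging based purely on expiration time by lowering the threshold itself (option name as used in this discussion; check the final documentation):

              [mariadbd]
              binlog_expire_logs_seconds         = 604800   # keep roughly 7 days of binary logs
              slave_connections_needed_for_purge = 0        # purge even when no replicas are connected

            With the default of 1, the same server keeps its binary logs until at least one replica has connected, which is the safer behavior argued for above.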


            People

              monty Michael Widenius
              ccounotte COUNOTTE CEDRIC
              Votes: 0
              Watchers: 8

