Uploaded image for project: 'MariaDB MaxScale'
  1. MariaDB MaxScale
  2. MXS-4761

Diff - Upgrade Risk Assessment Tool

    XMLWordPrintable

Details

    • New Feature
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • None
    • 24.08.1
    • diff
    • None
    • MXS-SPRINT-192, MXS-SPRINT-193, MXS-SPRINT-194, MXS-SPRINT-195, MXS-SPRINT-196, MXS-SPRINT-197, MXS-SPRINT-198, MXS-SPRINT-199, MXS-SPRINT-200, MXS-SPRINT-201, MXS-SPRINT-202, MXS-SPRINT-203, MXS-SPRINT-204, MXS-SPRINT-205, MXS-SPRINT-206, MXS-SPRINT-207, MXS-SPRINT-208

    Description

      The documentation for the feature that has been implemented as a MaxScale router can be found here: https://github.com/mariadb-corporation/MaxScale/blob/24.02/Documentation/Routers/Diff.md

      A significant observation during the implementation of the feature was that you cannot focus on individual queries, because, due to various reasons, there can be a significant differences in the response time between main (current master) and other (the other master) for any particular query. As a result of that, a configurable number (default 1000) of response time samples are collected for each canonical query (aka digest). Thereafter, if the response time of a particular query from other goes outside the sample-range, it is considered an anomaly.

      Original description:
      The purpose of this task is to allow a customer to verify if it safe
      for them to upgrade to another (newer, custom build etc) MariaDB
      versions and what the expected impact would be.

      The idea is for each connection send all queries to the active master
      and to the other-master. The results and times for each server should
      be compared and stored. MaxScale should create a daily/weekly report
      of the difference between the current and new master.

      Things in more detail:

      Startup to be made by user:
      ===========================

      • Make backup of the current server (master) to 'other-master'.
      • Start replication from current to 'other'

      MaxScale startup
      ================

      • Wait until slave 'catches up'
      • Stop replication from 'other' and start sending all queries to current
        and other.

      This can also be done by starting current' and 'other' at the same
      time, but that would require customer to stop current which is not
      always possible.

      MaxScale todo
      =============

      • For each client connection, ensure that there is associated
        connections to current and other server.
      • Send all queries to current and other (asynchronously).
      • Stream the answer from 'current' directly to the client without any delays.
      • Calculate checksum of each row event.
      • Store query, checksum and timing for all old events until 'other' has
        replied.
      • For all queries, sum up the execution time for current and other.
      • Compare the checksum from 'current' and other and if they don't match,
        store the query and and timing for the report
      • If query time differs more than 10% (user defined) then:
      • Create a digest of the query
      • If the digest is not yet found from the same day:
      • Store the query in the 'good query log' if it is faster on 'other'
        and in the 'bad query log' if it is slower.
      • In case it is much slower in other,
        execute an explain on 'other' and store it.
      • If enabled by the user, also do an explain on the current server.
      • If 'other' starts to lag a lot (because of it is slower), MaxScale
        can skip read-only queries to allow 'other' to catch up.

      MaxScale reports
      ================

      Reports can be stored in a text file in specific directory with the
      date as part of the file name.

      The report should include:

      • Total number of queries, total time for current, total time for other.
      • How many queries where faster on current
      • How many queries where faster on other
      • How many queries that provided different results and the digest
        for these queries.
        (In this particular case we should probably remember the exact query)
      • A list of queries (digest) that where significantly faster on other
      • A list of queries (digest) that where slower on other
      • The total query time per digest
      • A few 'explain' of slower queries.
      • Any stored data for the queries for which explain was provided.
        (Preferably similar information as we have in the slow query log).
      • Maybe MaxScale can get these information from the slow query log
        or from the server itself by having the server remember the slow
        query log information from the last query.
        (Not hard to add to the server)

      MaxScale could also configured to store problematic queries
      in a log with information about the problem.

      In effect, the report should be able to answer the questions:

      • Is it safe for me to upgrade to 'other' server
        (no different results, no crashes, no performance issues)
      • What is the expected gain/loss in performance if we would just switch
        to use the 'other' server.

      The report should also contain information that MariaDB developers can use to find out what is wrong in the 'other' server that would stop a smooth upgrade.

      Attachments

        Issue Links

          Activity

            People

              johan.wikman Johan Wikman
              monty Michael Widenius
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.