[MXS-4761] Upgrade Risk Assessment Tool Created: 2023-09-14  Updated: 2024-01-29

Status: In Progress
Project: MariaDB MaxScale
Component/s: None
Affects Version/s: None
Fix Version/s: 24.02

Type: New Feature Priority: Critical
Reporter: Michael Widenius Assignee: Johan Wikman
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Blocks
is blocked by MXS-4838 Ensure that /restart sessions/ does t... Closed
is blocked by MXS-4839 Provide functionality for suspending ... Closed
PartOf
includes MXS-4944 Logically identical runtime configura... Open
Relates
relates to MXS-4850 Enable generic call commands. Closed
relates to MXS-4837 Provide more convenient ALTER SERVICE... Closed
Sprint: MXS-SPRINT-192, MXS-SPRINT-193, MXS-SPRINT-194, MXS-SPRINT-195, MXS-SPRINT-196, MXS-SPRINT-197, MXS-SPRINT-198, MXS-SPRINT-199, MXS-SPRINT-200, MXS-SPRINT-201

 Description   

The purpose of this task is to allow a customer to verify if it safe
for them to upgrade to another (newer, custom build etc) MariaDB
versions and what the expected impact would be.

The idea is for each connection send all queries to the active master
and to the other-master. The results and times for each server should
be compared and stored. MaxScale should create a daily/weekly report
of the difference between the current and new master.

Things in more detail:

Startup to be made by user:
===========================

  • Make backup of the current server (master) to 'other-master'.
  • Start replication from current to 'other'

MaxScale startup
================

  • Wait until slave 'catches up'
  • Stop replication from 'other' and start sending all queries to current
    and other.

This can also be done by starting current' and 'other' at the same
time, but that would require customer to stop current which is not
always possible.

MaxScale todo
=============

  • For each client connection, ensure that there is associated
    connections to current and other server.
  • Send all queries to current and other (asynchronously).
  • Stream the answer from 'current' directly to the client without any delays.
  • Calculate checksum of each row event.
  • Store query, checksum and timing for all old events until 'other' has
    replied.
  • For all queries, sum up the execution time for current and other.
  • Compare the checksum from 'current' and other and if they don't match,
    store the query and and timing for the report
  • If query time differs more than 10% (user defined) then:
  • Create a digest of the query
  • If the digest is not yet found from the same day:
  • Store the query in the 'good query log' if it is faster on 'other'
    and in the 'bad query log' if it is slower.
  • In case it is much slower in other,
    execute an explain on 'other' and store it.
  • If enabled by the user, also do an explain on the current server.
  • If 'other' starts to lag a lot (because of it is slower), MaxScale
    can skip read-only queries to allow 'other' to catch up.

MaxScale reports
================

Reports can be stored in a text file in specific directory with the
date as part of the file name.

The report should include:

  • Total number of queries, total time for current, total time for other.
  • How many queries where faster on current
  • How many queries where faster on other
  • How many queries that provided different results and the digest
    for these queries.
    (In this particular case we should probably remember the exact query)
  • A list of queries (digest) that where significantly faster on other
  • A list of queries (digest) that where slower on other
  • The total query time per digest
  • A few 'explain' of slower queries.
  • Any stored data for the queries for which explain was provided.
    (Preferably similar information as we have in the slow query log).
  • Maybe MaxScale can get these information from the slow query log
    or from the server itself by having the server remember the slow
    query log information from the last query.
    (Not hard to add to the server)

MaxScale could also configured to store problematic queries
in a log with information about the problem.

In effect, the report should be able to answer the questions:

  • Is it safe for me to upgrade to 'other' server
    (no different results, no crashes, no performance issues)
  • What is the expected gain/loss in performance if we would just switch
    to use the 'other' server.

The report should also contain information that MariaDB developers can use to find out what is wrong in the 'other' server that would stop a smooth upgrade.


Generated at Thu Feb 08 04:30:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.