[MXS-4761] Diff - Upgrade Risk Assessment Tool - Jira

XML

Word

Printable

Details

Type: New Feature
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 24.08.1
Component/s: diff
Labels:
None

Sprint:
MXS-SPRINT-192, MXS-SPRINT-193, MXS-SPRINT-194, MXS-SPRINT-195, MXS-SPRINT-196, MXS-SPRINT-197, MXS-SPRINT-198, MXS-SPRINT-199, MXS-SPRINT-200, MXS-SPRINT-201, MXS-SPRINT-202, MXS-SPRINT-203, MXS-SPRINT-204, MXS-SPRINT-205, MXS-SPRINT-206, MXS-SPRINT-207, MXS-SPRINT-208

Description

The documentation for the feature that has been implemented as a MaxScale router can be found here: https://github.com/mariadb-corporation/MaxScale/blob/24.02/Documentation/Routers/Diff.md

A significant observation during the implementation of the feature was that you cannot focus on individual queries, because, due to various reasons, there can be a significant differences in the response time between main (current master) and other (the other master) for any particular query. As a result of that, a configurable number (default 1000) of response time samples are collected for each canonical query (aka digest). Thereafter, if the response time of a particular query from other goes outside the sample-range, it is considered an anomaly.

Original description:
The purpose of this task is to allow a customer to verify if it safe
for them to upgrade to another (newer, custom build etc) MariaDB
versions and what the expected impact would be.

The idea is for each connection send all queries to the active master
and to the other-master. The results and times for each server should
be compared and stored. MaxScale should create a daily/weekly report
of the difference between the current and new master.

Things in more detail:

Startup to be made by user:
===========================

Make backup of the current server (master) to 'other-master'.
Start replication from current to 'other'

MaxScale startup
================

Wait until slave 'catches up'
Stop replication from 'other' and start sending all queries to current
and other.

This can also be done by starting current' and 'other' at the same
time, but that would require customer to stop current which is not
always possible.

MaxScale todo
=============

For each client connection, ensure that there is associated
connections to current and other server.
Send all queries to current and other (asynchronously).
Stream the answer from 'current' directly to the client without any delays.
Calculate checksum of each row event.
Store query, checksum and timing for all old events until 'other' has
replied.
For all queries, sum up the execution time for current and other.
Compare the checksum from 'current' and other and if they don't match,
store the query and and timing for the report
If query time differs more than 10% (user defined) then:
Create a digest of the query
If the digest is not yet found from the same day:
Store the query in the 'good query log' if it is faster on 'other'
and in the 'bad query log' if it is slower.
In case it is much slower in other,
execute an explain on 'other' and store it.
If enabled by the user, also do an explain on the current server.
If 'other' starts to lag a lot (because of it is slower), MaxScale
can skip read-only queries to allow 'other' to catch up.

MaxScale reports
================

Reports can be stored in a text file in specific directory with the
date as part of the file name.

The report should include:

Total number of queries, total time for current, total time for other.
How many queries where faster on current
How many queries where faster on other
How many queries that provided different results and the digest
for these queries.
(In this particular case we should probably remember the exact query)
A list of queries (digest) that where significantly faster on other
A list of queries (digest) that where slower on other
The total query time per digest
A few 'explain' of slower queries.
Any stored data for the queries for which explain was provided.
(Preferably similar information as we have in the slow query log).
Maybe MaxScale can get these information from the slow query log
or from the server itself by having the server remember the slow
query log information from the last query.
(Not hard to add to the server)

MaxScale could also configured to store problematic queries
in a log with information about the problem.

In effect, the report should be able to answer the questions:

Is it safe for me to upgrade to 'other' server
(no different results, no crashes, no performance issues)
What is the expected gain/loss in performance if we would just switch
to use the 'other' server.

The report should also contain information that MariaDB developers can use to find out what is wrong in the 'other' server that would stop a smooth upgrade.

Attachments

Issue Links

includes

MXS-4944 Logically identical runtime configurations are not erased

Closed

is blocked by

MXS-4838 Ensure that /restart sessions/ does the right thing with respect to transaction boundaries etc.

Closed

MXS-4839 Provide functionality for suspending sessions.

Closed

relates to

MXS-4850 Enable generic call commands.

Closed

MXS-4837 Provide more convenient ALTER SERVICE as an alternative to LINK/UNLINK

Closed

Activity

People

Assignee:: Johan Wikman

Reporter:: Michael Widenius

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 2023-09-14 15:30

Updated:: 2024-10-14 10:44

Resolved:: 2024-05-17 07:38

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.