[MXS-4549] Replay queries with partially returned results Created: 2023-03-07  Updated: 2023-07-18  Resolved: 2023-06-05

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: None
Fix Version/s: 23.08.0

Type: New Feature Priority: Major
Reporter: Michael Malgeri (Inactive) Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

SkySQL2 at https://cloud.stage.mariadb.com/dashboard


Attachments: Text File MXS-4549-apr05-logs.txt     Text File MXS-4549-log-info.txt     Text File MXS-4549-mar17-logs.txt     Text File MXS-4549-no-log-info.txt     File addPrimaryKey.sql     File alterTest.py     File t3a.sql     File t3arun.sql     File t4.sql     File travel.sql     File xactionsTest.py    
Issue Links:
Relates
relates to MXS-4555 Dynamic filter capabilities do not work Closed
Sprint: MXS-SPRINT-183

 Description   

If a result has been partially sent to the client and the connection is either lost in the middle of it or it ends in a deadlock, the replay is not possible. In certain cases a replay would be possible if the failing result was re-executed and spliced into the existing one.

A simple byte offset into the latest resultset would be a simple but effective method of detecting which parts have been sent to the client and which parts need to match in order for the replay to be safe.


Original description:
SkySQL2 cluster is called customer777-xpand2
Domain Name is dbswf33330813.syss0001.stage1.skysql.net
Database is travel

1. Created database with travel.sql
2. Added primary key with addPrimaryKey.sql
3. In session 1 ran sql script t3a.sql to create a stored procedure
4. In session 1 ran sql script t3arun.sql to run t3a.sql stored procedure
5. In session 2 ran t4.sql to create a stored procedure and run it

The following error was displayed in session 1 after a few minutes

ERROR 1205 (40000) at line 1 in file: 't3arun.sql': [30720] MVCC serializable conflict: This transaction conflicted with a serializable transaction: `travel`.`flights` Primary key: (66773); try restarting transaction



 Comments   
Comment by Michael Malgeri (Inactive) [ 2023-03-08 ]

I will work on his today and report results

Comment by Michael Malgeri (Inactive) [ 2023-03-08 ]

The log files have been uploaded for the configuration BEFORE "log_info=true" was set.

Then log_info=true was set.

The tests have been run 3 times with no error.

Should the post re-configure logs be uploaded?

Should the tests be run a few more times?

Can I have the customer test?

Comment by Michael Malgeri (Inactive) [ 2023-03-08 ]

gdorman markus makela johan.wikman Sorry, the test I ran 3 times had a mistake.

After correcting for the mistake, the test gave the same error
ERROR 1205 (40000) at line 1 in file: 't3arun.sql': [30720] MVCC serializable conflict: This transaction conflicted with a serializable transaction: `travel`.`flights` Primary key: (69535); try restarting transaction

The log files will be collected and uploaded to this issue

Comment by Michael Malgeri (Inactive) [ 2023-03-08 ]

gdorman markus makela johan.wikman

j_kyle added logs which contain information AFTER the most recent tests were performed

He added a line saying at which point we added log_info = true

Comment by markus makela [ 2023-03-09 ]

The error Current query unexpectedly empty when trying to retry query on master definitely indicates a bug of some sorts in MaxScale.

Comment by markus makela [ 2023-03-09 ]

The logs don't seem to actually have log_info enabled. If the configuration file was modified, MaxScale needs to be restarted for it to take effect. If this is to be done at runtime, you must use maxctrl alter maxscale log_info=true.

Comment by Michael Malgeri (Inactive) [ 2023-03-09 ]

markus makela gdorman johan.wikman, j_kyle was able to configure Maxscale with log_info enabled. I reran the test and got the same error

ERROR 1205 (40000) at line 1 in file: 't3arun.sql': [30720] MVCC serializable conflict: This transaction conflicted with a serializable transaction: `travel`.`flights` Primary key: (69535); try restarting transaction

j_kyle will be uploading the log files

Comment by Michael Malgeri (Inactive) [ 2023-03-10 ]

markus makelagdormanjohan.wikman , j_kyle has uploaded the log files.

Please check if they provide the information you need to debug the error and what the next steps will be.

Comment by markus makela [ 2023-03-10 ]

OK, I was just reading the errors wrong and it is indeed a transaction rollback SQLSTATE:

Comment by markus makela [ 2023-03-10 ]

Problem Analysis

Seems like the reason why the replay doesn't get triggered and why the error warns about an unexpectedly empty query (this warning is still a bug) is because a partial result was already delivered to the client. Readwritesplit never retries a query that has been partially delivered to the client as there's no way to un-send the data.

Here's an excerpt from the logs that shows what's happening:

2023-03-09 18:04:52   info   : (1275) (Experimental-Service); > Autocommit: [enabled], trx is [not open], cmd: (0x03) COM_QUERY, plen: 32, type: QUERY_TYPE_WRITE, stmt: CALL run_sql_statements3a() 
2023-03-09 18:04:52   info   : (1275) [readwritesplit] (Experimental-Service); Route query to master: @@Xpand-Monitor:node-1 <
2023-03-09 18:04:52   info   : (1275) [readwritesplit] (Experimental-Service); Reply not yet complete. Waiting for 1 replies, got one from @@Xpand-Monitor:node-1
2023-03-09 18:09:52   info   : (1275) [readwritesplit] (Experimental-Service); Reply not yet complete. Waiting for 1 replies, got one from @@Xpand-Monitor:node-1
2023-03-09 18:09:52   info   : (1275) [readwritesplit] (Experimental-Service); Server triggered transaction rollback, replaying transaction: [30720] MVCC serializable conflict: This transaction conflicted with a serializable transaction: `travel`.`flights` Primary key: (69535); try restarting transaction
2023-03-09 18:09:52   warning: (1275) [readwritesplit] (Experimental-Service); Expected a WSREP error but got a transaction rollback error: 1205, [30720] MVCC serializable conflict: This transaction conflicted with a serializable transaction: `travel`.`flights` Primary key: (69535); try restarting transaction

Here's the relevant SQL from the stored procedure:

SELECT dep_delay from flights WHERE carrier = 'UA' and flight_id like '773%';
SELECT SLEEP(300);
UPDATE flights SET dep_delay = dep_delay + 1 WHERE carrier = 'UA' and flight_id like '%773%';

The logs show that there's exactly 300 seconds between the SELECT that returns a resultset and the UPDATE that ends up conflicting.

This is a limitaton of the current replay capability and there's no way to avoid a partial result that ends in a deadlock from being delivered to a client.

Possible New Feature

This could be prevented in one of two ways: either by buffering the result in MaxScale or by trying to splice two results into one.

The first approach requires more memory but it is a relatively safe and straightforward method of guaranteeing that the client gets a complete result or no result at all. Something like the existing transaction_replay_max_size parameter could be used to determine the maximum size of any single result that MaxScale would buffer before sending it to a client. This would guarantee that even if a partial result is returned, the client wouldn't see it if MaxScale ends up replaying it.

The second approach of splicing two results into one is more complex. The hard part is in tracking the point at which the results must be spliced together. A simple but relatively effective approach could be to use a raw byte offset as two identical resultsets would always end up being the same length and the packet boundaries would always match. A checksum on the already returned result would make it possible to detect results that have changed from the original one and the one that was re-executed by MaxScale.

Comment by Michael Malgeri (Inactive) [ 2023-03-10 ]

gdormanI've sent a briefing email to toddstoffel and will follow up with customer

Comment by Michael Malgeri (Inactive) [ 2023-03-12 ]

I will contact Jeff Smelser at Rechargeapps and get the answers asap. If it turns out we have to use Python to drive the client sequence, I should be able to handle it but will reach out if I need assistance

Comment by Johan Wikman [ 2023-03-13 ]

Better keep this as a bug that is closed as Not a Bug, with an addition to the documentation that transaction replay does not support stored procedures. Then create a feature request "Make Replay work with Stored procedures", referring to this bug.

Comment by Gregory Dorman (Inactive) [ 2023-03-13 ]

Fine.

Comment by Michael Malgeri (Inactive) [ 2023-03-13 ]

This is the response from Jeff Smelser of Reachargeapps:

_"We keep track of all alter in git. We use liquibase to apply changes, generally speaking. https://www.liquibase.org/

however, on busier systems and busy tables, we will just run the alter right on the command line using mysql cli.

We don't use stored procedures or anything like that."_

Are there any other questions to ask him?

Comment by Michael Malgeri (Inactive) [ 2023-03-14 ]

gdormanI will perform those tests and get back with results asap.

Comment by Michael Malgeri (Inactive) [ 2023-03-15 ]

gdorman markus makela johan.wikmanI ran the following test from a python client multiple times and did not receive any errors

Here is a description of the test:

  1. Started a jupyter notebook and ran the code in the attached file named xactionsTest.py
  2. Started another jupyter notebook and ran the code in the attached file named alterTest.py

Please review these files and advise if the test wasn't performed properly.
Note, in xactionsTest.py, the connection had to be re-acquired after the 5 minute sleep

Are we ready to respond to the client for a retest?

Comment by Michael Malgeri (Inactive) [ 2023-03-16 ]

gdorman markus makela johan.wikman it looks like the problem still exists.

I shortened the sleep to 100 seconds to stay within the 120 seconds set by Xpand's idle_trx_timeout_s variable.

The following error was received when the transaction work up and tried to update

Sleeping for 100 seconds...
Updating flights
*Error occurred: 1205 (40000): [30720] MVCC serializable conflict: This transaction conflicted with a serializable transaction: `travel`.`flights` Primary key: (69535); try restarting transaction
Transaction rolled back.*
MySQL connection closed.

The ALTER was performed shortly after the transaction went to sleep, but before it woke up and tried the update

Comment by markus makela [ 2023-03-16 ]

Please provide logs from MaxScale with log_info enabled.

Comment by Michael Malgeri (Inactive) [ 2023-03-17 ]

Will upload those asap

Comment by Michael Malgeri (Inactive) [ 2023-03-20 ]

gdormanmarkus makelajohan.wikmanLog files have been uploaded. Please advise on next steps

Comment by Michael Malgeri (Inactive) [ 2023-04-04 ]

Hi Todd,

I received the notice about MXS-4549.

Does this mean there will be a fix for the rplay feature in Maxscale, and
it's scheduled for release in August of 2023?

Is that confirmed?

Michael Malgeri
Principal Solutions Architect
MariaDB Corporation
310-704-6403

On Mon, Apr 3, 2023 at 11:01 PM Todd Stoffel (Jira) <jira@mariadb.org>

Comment by Michael Malgeri (Inactive) [ 2023-04-04 ]

Actually, the real question I have is, can I tell the prospect our current
release schedule for a fix is 8/2023?

Michael Malgeri
Principal Solutions Architect
MariaDB Corporation
310-704-6403

On Tue, Apr 4, 2023 at 1:50 PM Michael Malgeri <michael.malgeri@mariadb.com>

Comment by Johan Wikman [ 2023-04-05 ]

It means that the plan is that we will address this issue and the intent is to provide a fix in the 8/2023 release. However, as plans may be altered due to priorities being changed or external reasons not in our control, it is not yet a hard promise that the fix indeed will be in 8/2023.

Johan

Comment by Michael Malgeri (Inactive) [ 2023-04-05 ]

gdorman markus makela johan.wikman
I reran the tests, and this time on the configuration that supports readwritesplit

I did not receive any errors.

Does this mean replay is working?

Are there any other tests to run?

The log files have been uploaded to MXS-4549-apr05-logs.txt

Comment by Michael Malgeri (Inactive) [ 2023-04-06 ]

Also, here are the steps of the test

  1. 1. Created the travel database and loaded approx. 70K records of flight data with travel.sql
  2. 2. Added a primary key to the data addPrimaryKey.sql
  3. 3. Ran # python3 xactionsTest.py, which performed a select, slept for for 5 minutes, and performed another select
  4. 4. In a separate session, immediately started # python3 alterTest.py , which performed alters for for over 5 minutes, after the first select and during the time the xactionsTest.py was asleep
Comment by markus makela [ 2023-04-06 ]

If you got no errors, I'd say that it's working.

Comment by Michael Malgeri (Inactive) [ 2023-04-14 ]

gdorman markus makela johan.wikman

Jeff Smelser of Recharge performed his ALTER Table tests during a transaction.

He said,

"I did my test this morning, It looks like it behaved much better except I got a 3 errors from the write:

pymysql.err.OperationalError: (1927, 'Transaction checksum mismatch encountered when replaying transaction.')

That looks like maxscale trying to replay it but can't for some reason"

Please advise

Comment by Michael Malgeri (Inactive) [ 2023-04-24 ]

gdormanmarkus makelajohan.wikman

Do any of you have comments on this error that was reported by Rechargeapps?

pymysql.err.OperationalError: (1927, 'Transaction checksum mismatch encountered when replaying transaction.')

Do we understand what is causing it, or do we need to perform additional tests?

Comment by Michael Malgeri (Inactive) [ 2023-06-02 ]

gdorman markus makela johan.wikman , I've seen a few updates to this jira and I don't quite understand them, but that's besides the point.

The question I have is, our most recent position on this feature was that the application should not try to ALTER a table in the middle of a transaction, and not expect to receive an error message from MariaDB Xpand.

Is this still the case, or have we made changes in Maxscale to handle this type of behavior from the application?

Comment by Michael Malgeri (Inactive) [ 2023-06-28 ]

Team, gdorman markus makelatoddstoffel clieu

Just providing some feedback from the field regarding the Rechargeapps issue.

Apparently, TiDB does in their database what MaxScale will do in version 23.08.0 
We asked the customer (Rechargeapps) if they'd be willing to test again, but they believe that adding another layer between the application and the database, for this feature, will insect too much latency.

I don't know if having the "replay queries with partially returned results" improvement in MariaDB Enterprise, or Xpand in this case, will become "table stakes" for customers other than Rechargeapps, but just thought awareness should be raised.

Comment by Michael Malgeri (Inactive) [ 2023-06-29 ]

We're trying to get them to rerun their tests against MaxScale 23.08.0, but looks like that ship has sailed.

We may have better luck down the road if TiDB doesn't pan out.

Generated at Thu Feb 08 04:29:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.