[MXS-4549] Replay queries with partially returned results Created: 2023-03-07 Updated: 2023-07-18 Resolved: 2023-06-05 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | readwritesplit |
| Affects Version/s: | None |
| Fix Version/s: | 23.08.0 |
| Type: | New Feature | Priority: | Major |
| Reporter: | Michael Malgeri (Inactive) | Assignee: | markus makela |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
SkySQL2 at https://cloud.stage.mariadb.com/dashboard |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Sprint: | MXS-SPRINT-183 | ||||||||
| Description |
|
If a result has been partially sent to the client and the connection is either lost in the middle of it or it ends in a deadlock, the replay is not possible. In certain cases a replay would be possible if the failing result was re-executed and spliced into the existing one. A simple byte offset into the latest resultset would be a simple but effective method of detecting which parts have been sent to the client and which parts need to match in order for the replay to be safe. Original description: 1. Created database with travel.sql The following error was displayed in session 1 after a few minutes ERROR 1205 (40000) at line 1 in file: 't3arun.sql': [30720] MVCC serializable conflict: This transaction conflicted with a serializable transaction: `travel`.`flights` Primary key: (66773); try restarting transaction |
| Comments |
| Comment by Michael Malgeri (Inactive) [ 2023-03-08 ] | |||||||||
|
I will work on his today and report results | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-08 ] | |||||||||
|
The log files have been uploaded for the configuration BEFORE "log_info=true" was set. Then log_info=true was set. The tests have been run 3 times with no error. Should the post re-configure logs be uploaded? Should the tests be run a few more times? Can I have the customer test? | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-08 ] | |||||||||
|
gdorman markus makela johan.wikman Sorry, the test I ran 3 times had a mistake. After correcting for the mistake, the test gave the same error The log files will be collected and uploaded to this issue | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-08 ] | |||||||||
|
gdorman markus makela johan.wikman j_kyle added logs which contain information AFTER the most recent tests were performed He added a line saying at which point we added log_info = true | |||||||||
| Comment by markus makela [ 2023-03-09 ] | |||||||||
|
The error Current query unexpectedly empty when trying to retry query on master definitely indicates a bug of some sorts in MaxScale. | |||||||||
| Comment by markus makela [ 2023-03-09 ] | |||||||||
|
The logs don't seem to actually have log_info enabled. If the configuration file was modified, MaxScale needs to be restarted for it to take effect. If this is to be done at runtime, you must use maxctrl alter maxscale log_info=true. | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-09 ] | |||||||||
|
markus makela gdorman johan.wikman, j_kyle was able to configure Maxscale with log_info enabled. I reran the test and got the same error ERROR 1205 (40000) at line 1 in file: 't3arun.sql': [30720] MVCC serializable conflict: This transaction conflicted with a serializable transaction: `travel`.`flights` Primary key: (69535); try restarting transaction j_kyle will be uploading the log files | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-10 ] | |||||||||
|
markus makelagdormanjohan.wikman , j_kyle has uploaded the log files. Please check if they provide the information you need to debug the error and what the next steps will be. | |||||||||
| Comment by markus makela [ 2023-03-10 ] | |||||||||
|
OK, I was just reading the errors wrong and it is indeed a transaction rollback SQLSTATE: | |||||||||
| Comment by markus makela [ 2023-03-10 ] | |||||||||
Problem AnalysisSeems like the reason why the replay doesn't get triggered and why the error warns about an unexpectedly empty query (this warning is still a bug) is because a partial result was already delivered to the client. Readwritesplit never retries a query that has been partially delivered to the client as there's no way to un-send the data. Here's an excerpt from the logs that shows what's happening:
Here's the relevant SQL from the stored procedure:
The logs show that there's exactly 300 seconds between the SELECT that returns a resultset and the UPDATE that ends up conflicting. This is a limitaton of the current replay capability and there's no way to avoid a partial result that ends in a deadlock from being delivered to a client. Possible New FeatureThis could be prevented in one of two ways: either by buffering the result in MaxScale or by trying to splice two results into one. The first approach requires more memory but it is a relatively safe and straightforward method of guaranteeing that the client gets a complete result or no result at all. Something like the existing transaction_replay_max_size parameter could be used to determine the maximum size of any single result that MaxScale would buffer before sending it to a client. This would guarantee that even if a partial result is returned, the client wouldn't see it if MaxScale ends up replaying it. The second approach of splicing two results into one is more complex. The hard part is in tracking the point at which the results must be spliced together. A simple but relatively effective approach could be to use a raw byte offset as two identical resultsets would always end up being the same length and the packet boundaries would always match. A checksum on the already returned result would make it possible to detect results that have changed from the original one and the one that was re-executed by MaxScale. | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-10 ] | |||||||||
|
gdormanI've sent a briefing email to toddstoffel and will follow up with customer | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-12 ] | |||||||||
|
I will contact Jeff Smelser at Rechargeapps and get the answers asap. If it turns out we have to use Python to drive the client sequence, I should be able to handle it but will reach out if I need assistance | |||||||||
| Comment by Johan Wikman [ 2023-03-13 ] | |||||||||
|
Better keep this as a bug that is closed as Not a Bug, with an addition to the documentation that transaction replay does not support stored procedures. Then create a feature request "Make Replay work with Stored procedures", referring to this bug. | |||||||||
| Comment by Gregory Dorman (Inactive) [ 2023-03-13 ] | |||||||||
|
Fine. | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-13 ] | |||||||||
|
This is the response from Jeff Smelser of Reachargeapps: _"We keep track of all alter in git. We use liquibase to apply changes, generally speaking. https://www.liquibase.org/ however, on busier systems and busy tables, we will just run the alter right on the command line using mysql cli. We don't use stored procedures or anything like that."_ Are there any other questions to ask him? | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-14 ] | |||||||||
|
gdormanI will perform those tests and get back with results asap. | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-15 ] | |||||||||
|
gdorman markus makela johan.wikmanI ran the following test from a python client multiple times and did not receive any errors Here is a description of the test:
Please review these files and advise if the test wasn't performed properly. Are we ready to respond to the client for a retest? | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-16 ] | |||||||||
|
gdorman markus makela johan.wikman it looks like the problem still exists. I shortened the sleep to 100 seconds to stay within the 120 seconds set by Xpand's idle_trx_timeout_s variable. The following error was received when the transaction work up and tried to update Sleeping for 100 seconds... The ALTER was performed shortly after the transaction went to sleep, but before it woke up and tried the update | |||||||||
| Comment by markus makela [ 2023-03-16 ] | |||||||||
|
Please provide logs from MaxScale with log_info enabled. | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-17 ] | |||||||||
|
Will upload those asap | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-03-20 ] | |||||||||
|
gdormanmarkus makelajohan.wikmanLog files have been uploaded. Please advise on next steps | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-04-04 ] | |||||||||
|
Hi Todd, I received the notice about Does this mean there will be a fix for the rplay feature in Maxscale, and Is that confirmed? Michael Malgeri On Mon, Apr 3, 2023 at 11:01 PM Todd Stoffel (Jira) <jira@mariadb.org> | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-04-04 ] | |||||||||
|
Actually, the real question I have is, can I tell the prospect our current Michael Malgeri On Tue, Apr 4, 2023 at 1:50 PM Michael Malgeri <michael.malgeri@mariadb.com> | |||||||||
| Comment by Johan Wikman [ 2023-04-05 ] | |||||||||
|
It means that the plan is that we will address this issue and the intent is to provide a fix in the 8/2023 release. However, as plans may be altered due to priorities being changed or external reasons not in our control, it is not yet a hard promise that the fix indeed will be in 8/2023. Johan | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-04-05 ] | |||||||||
|
gdorman markus makela johan.wikman I did not receive any errors. Does this mean replay is working? Are there any other tests to run? The log files have been uploaded to | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-04-06 ] | |||||||||
|
Also, here are the steps of the test
| |||||||||
| Comment by markus makela [ 2023-04-06 ] | |||||||||
|
If you got no errors, I'd say that it's working. | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-04-14 ] | |||||||||
|
gdorman markus makela johan.wikman Jeff Smelser of Recharge performed his ALTER Table tests during a transaction. He said, "I did my test this morning, It looks like it behaved much better except I got a 3 errors from the write: pymysql.err.OperationalError: (1927, 'Transaction checksum mismatch encountered when replaying transaction.') That looks like maxscale trying to replay it but can't for some reason" Please advise | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-04-24 ] | |||||||||
|
gdormanmarkus makelajohan.wikman Do any of you have comments on this error that was reported by Rechargeapps? pymysql.err.OperationalError: (1927, 'Transaction checksum mismatch encountered when replaying transaction.') Do we understand what is causing it, or do we need to perform additional tests? | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-06-02 ] | |||||||||
|
gdorman markus makela johan.wikman , I've seen a few updates to this jira and I don't quite understand them, but that's besides the point. The question I have is, our most recent position on this feature was that the application should not try to ALTER a table in the middle of a transaction, and not expect to receive an error message from MariaDB Xpand. Is this still the case, or have we made changes in Maxscale to handle this type of behavior from the application? | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-06-28 ] | |||||||||
|
Team, gdorman markus makelatoddstoffel clieu Just providing some feedback from the field regarding the Rechargeapps issue. Apparently, TiDB does in their database what MaxScale will do in version 23.08.0 I don't know if having the "replay queries with partially returned results" improvement in MariaDB Enterprise, or Xpand in this case, will become "table stakes" for customers other than Rechargeapps, but just thought awareness should be raised. | |||||||||
| Comment by Michael Malgeri (Inactive) [ 2023-06-29 ] | |||||||||
|
We're trying to get them to rerun their tests against MaxScale 23.08.0, but looks like that ship has sailed. We may have better luck down the road if TiDB doesn't pan out. |