[MXS-3695] Causal Consistency with MaxScale's Read/Write Split Router issue Created: 2021-07-28 Updated: 2023-04-09 Resolved: 2021-08-02 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | readwritesplit |
| Affects Version/s: | 2.5.13 |
| Fix Version/s: | 2.5.15 |
| Type: | Bug | Priority: | Major |
| Reporter: | Domas | Assignee: | markus makela |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Ubuntu 20 |
||
| Issue Links: |
|
||||||||||||||||
| Description |
|
Problem statement

Stale selects occur with the MaxScale causal reads configuration in place (with the global setting in use). We see stale records whenever the number of client connections going through MaxScale is greater than 1. This report describes the techniques used for testing and the current MaxScale configuration.

Testing technique

1. We run a simple PHP script that does an INSERT and a SELECT using one connection / thread. The content of the script is shown below.
2. Running the script gives the following results, with no or a very low number of stale reads.
3. Once we generate additional traffic through MaxScale, the problem increases drastically, with a lot of missing IDs.
--threads=1 - problem occurs sometimes

MaxScale versions and config we use

Config:
MariaDB instances have this setting on: session_track_system_variables="autocommit,character_set_client,character_set_connection,character_set_results,last_gtid,time_zone"

Demonstration: https://www.youtube.com/watch?v=RMNQMgQBisw |
| Comments |
| Comment by markus makela [ 2021-07-28 ] |
|
I think this might be the same problem that has been described in It's possible that the global GTID synchronization for some reason isn't picking up the latest change. |
| Comment by markus makela [ 2021-07-30 ] |
|
I think I've managed to reproduce the problem and I also think I know why it happens. If two connections both execute a transaction, they both receive a GTID and the transactions appear in the correct order in the database. The problem is that the responses delivered to MaxScale aren't guaranteed to arrive in that order, which means the global latest GTID could end up being set to the lower of the two values. The reason this won't happen with causal_reads=local is the lack of parallelism in the traffic. |
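The race described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not MaxScale's code and not necessarily what the actual fix does: the class names and `gtid_sequence` helper are hypothetical, GTIDs are assumed to use the MariaDB `domain-server_id-sequence` form, and a single replication domain is assumed so that sequence numbers are directly comparable.

```python
def gtid_sequence(gtid: str) -> int:
    """Return the sequence number of a 'domain-server_id-sequence' GTID."""
    domain, server_id, sequence = gtid.split("-")
    return int(sequence)


class LastWriteWinsTracker:
    """Stores whichever GTID arrived most recently (can move backwards)."""

    def __init__(self) -> None:
        self.latest = None

    def on_response(self, gtid: str) -> None:
        self.latest = gtid


class MonotonicTracker:
    """Only advances: a GTID with a lower sequence number is ignored."""

    def __init__(self) -> None:
        self.latest = None

    def on_response(self, gtid: str) -> None:
        if self.latest is None or gtid_sequence(gtid) > gtid_sequence(self.latest):
            self.latest = gtid


# Two commits happen in order 100, 101, but the responses arrive reversed.
arrival_order = ["0-1-101", "0-1-100"]

naive, monotonic = LastWriteWinsTracker(), MonotonicTracker()
for gtid in arrival_order:
    naive.on_response(gtid)
    monotonic.on_response(gtid)

print(naive.latest)      # 0-1-100 (regressed: a read gated on this can miss 101)
print(monotonic.latest)  # 0-1-101
```

With a single connection the responses cannot overtake each other, which matches the observation that the problem needs parallel traffic to show up.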
| Comment by Marijus Planciunas [ 2021-08-03 ] |
|
Many thanks for fixing our issue, Markus. |
| Comment by markus makela [ 2021-08-03 ] |
|
I used the test to reproduce the problem and with the fix in place it no longer reports any errors. Unfortunately, I don't know when the next release will be. |
| Comment by Domas [ 2021-08-16 ] |
|
I have completed several tests and can confirm that the issue still exists with the latest 2.5.15 built directly from the repository.
root@maxscale-2:/usr/bin# ./maxscale -V
I have tested with causal_reads=global. |
| Comment by markus makela [ 2021-08-16 ] |
|
Hmm, that is strange. The GTID handling should work correctly now. For a sanity check, you could try disabling causal_reads=global and seeing if that makes the test fail a lot faster. I'll try to reproduce your results locally and see if I can figure out what's going on. |
| Comment by Marijus Planciunas [ 2021-08-16 ] |
|
Markus, disabling causal_reads or changing it to "local" won't solve the issue. In our particular case, we have a microservice infrastructure where modules communicate through RabbitMQ queues and therefore use different DB connections for the same data. As far as we understand, only the "global" setting ensures that all connections see data changes: https://mariadb.com/docs/reference/mxs/module-parameters/causal_reads/. |
| Comment by markus makela [ 2021-08-17 ] |
|
I tested again with this modified test script:
Would it be possible for you to test if you still see the problem with this modified script? Note that it uses two separate connections for reading, doesn't truncate the table at the start and uses an auto-increment field to allow multiple instances of it to be run in parallel. With 50 parallel executions of this script and sysbench with 100 threads, I can't reproduce this with the latest code. Reverting the fix makes the test fail pretty much immediately. |
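The modified script itself is not preserved in this export. As a rough sketch of the kind of check it describes (insert a row, then look for it on separate read connections), the core loop might look like the following. Everything here is hypothetical: `find_stale_reads` and `FakeCluster` are illustrative names, and the in-memory cluster merely stands in for real client connections going through MaxScale.

```python
from typing import Callable, List, Sequence, Tuple


def find_stale_reads(
    insert_row: Callable[[], int],
    readers: Sequence[Callable[[int], bool]],
    rounds: int,
) -> List[Tuple[int, int]]:
    """Insert `rounds` rows and check that each is visible to every reader.

    insert_row() inserts one row and returns its auto-increment id.
    Each reader is a callable (row_id -> bool) backed by its own connection.
    Returns (row_id, reader_index) pairs for rows a reader could not see.
    """
    stale = []
    for _ in range(rounds):
        row_id = insert_row()
        for index, row_exists in enumerate(readers):
            if not row_exists(row_id):
                stale.append((row_id, index))
    return stale


class FakeCluster:
    """In-memory stand-in: a primary plus a replica lagging by `lag` rows."""

    def __init__(self, lag: int) -> None:
        self.rows: List[int] = []
        self.lag = lag

    def insert_row(self) -> int:
        self.rows.append(len(self.rows) + 1)
        return self.rows[-1]

    def read_from_primary(self, row_id: int) -> bool:
        return row_id in self.rows

    def read_from_replica(self, row_id: int) -> bool:
        visible = self.rows[: max(0, len(self.rows) - self.lag)]
        return row_id in visible


lagging = FakeCluster(lag=1)
readers = [lagging.read_from_primary, lagging.read_from_replica]
print(find_stale_reads(lagging.insert_row, readers, rounds=3))
# [(1, 1), (2, 1), (3, 1)]: every row is missing on the lagging replica
```

In a real run the callables would wrap separate database connections, and because ids come from an auto-increment column, many instances can run in parallel without truncating the table.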
| Comment by Alexander Zierhut [ 2023-04-08 ] |
|
This issue is marked as fixed, but @Domas wrote that they were still able to reproduce this issue even after the fix. Is this actually confirmed as resolved, @markus makela? |
| Comment by markus makela [ 2023-04-09 ] |
|
alexander-zierhut yes, the original problem that was reproduced was fixed. So far we haven't heard of any causality violations and, as stated in the comments, the results they saw weren't reproducible. If you've observed any problems, please let us know and we can investigate the source of them. |