[MXS-3353] Tee filter loses statements if branch target is slower Created: 2020-12-24  Updated: 2022-07-21  Resolved: 2021-09-28

Status: Closed
Project: MariaDB MaxScale
Component/s: tee
Affects Version/s: 2.5.6, 2.5.10
Fix Version/s: 6.2.0

Type: Bug Priority: Major
Reporter: Oli Sennhauser Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

Linux, Debian 10, n.a.


Attachments: File mixed_test.php    
Issue Links:
Relates
relates to MXS-3363 Make tee filter also syncronously Closed
relates to MXS-3370 Asynchronicity of tee filter isn't do... Closed
Sprint: MXS-SPRINT-122, MXS-SPRINT-140

 Description   

When we are running our simple insert test or our a bit more complex mixed test (SIUD) at a very high pace (10 us delay between statements) maxscale is not sending all statements to the tee'd instance after a short time.
When we lower the pace to 10 ms delay this does not happen.
Unfortunately maxscale is not throwing any warning or error (in maxscale error log) to make us aware of this.
This means tee filter cannot be used reliably in production.
Possibly a statement cache or something like this is overflowing?
Would be good to have something like a "synchronous" mode to guarantee that statements have been received my both database backends with possibly some timeout (similar to semi-synchronous replication).
We want to use this for some kind of quality assurance between 2 different backends and/or for doing a push replication from the secure to the insecure network zone in a security sensitive environment.



 Comments   
Comment by Oli Sennhauser [ 2020-12-24 ]

Example:
Primary:
8023 rows in set
Tee'd:
6324 rows in set

Comment by markus makela [ 2020-12-27 ]

The tee filter will execute the statements on the branch target asynchronously. If I'm understanding this correctly, you are doing inserts and then immediately checking the results which is not currently supported by the filter.

This currently doesn't appear to be mentioned in the filter documentation which we should correct first. If you believe that this would be a useful feature, please submit a new feature request. I'll convert this into a bug about the filter documentation.

Comment by Oli Sennhauser [ 2021-01-04 ]

@Markus: I was waiting for a while after the inserts to give MaxScale a chance to catch up. But the queries did not arrive and the connection aborted in an unclean way.
And I did this test several times. It was reproducible.

Comment by markus makela [ 2021-01-05 ]

When you are seeing these lost statements, how soon after the write is the read being done? If you do the writes and then read the data several seconds later do you see all the rows or are still some rows lost? I'm trying to understand whether this is a case of the transactions not being committed on the tee'd backend or if the statements are never sent there.

Comment by Oli Sennhauser [ 2021-01-06 ]

Hi Markus

I repeated the test. Please find all the information attached. You can also provoke the situation when you set on the tee'd instance innodb_flush_log_at_trx_commit = 1 and sync_binlog = 1 while you have these values set to 2 and 0 on the primary instance. Even with the slow pace:

Test with slow pace (10000 us sleep, everything is fine):

shell> date ; ./mixed_test.php
Wed 6 Jan 09:17:34 CET 2021
IIII.........IIII^CCaught signal: 2
shell> date
Wed 6 Jan 09:19:29 CET 2021

Primary instance: mariadb-105

SQL> SELECT COUNT, CURRENT_TIMESTAMP() FROM test;
-----------------------------+

count current_timestamp()

-----------------------------+

6163 2021-01-06 09:20:49

-----------------------------+

End of General Log:

330 Query SELECT * FROM test WHERE id = 383966
330 Query INSERT INTO test (id, data, ts) values (383966, "Test data insert", CURRENT_TIMESTAMP(6))
210106 9:19:27 330 Quit

Tee'd instance: mysql-57

SQL> SELECT COUNT, CURRENT_TIMESTAMP() FROM test;
-----------------------------+

count current_timestamp()

-----------------------------+

6163 2021-01-06 09:21:02

-----------------------------+

End of General Log:

2021-01-06T08:19:26.942277Z 360 Query SELECT * FROM test WHERE id = 383966
2021-01-06T08:19:26.943038Z 360 Query INSERT INTO test (id, data, ts) values (383966, "Test data insert", CURRENT_TIMESTAMP(6))
2021-01-06T08:19:27.522731Z 360 Quit


Test with fast pace (10 us sleep, rows are missing):

shell> date ; ./mixed_test.php ; date
Wed 6 Jan 09:24:26 CET 2021
IIIIIIII...DUDI^CCaught signal: 2
Wed 6 Jan 09:27:30 CET 2021

Primary instance: mariadb-105

350 Query SELECT * FROM test WHERE id = 934
350 Query INSERT INTO test (id, data, ts) values (934, "Test data insert", CURRENT_TIMESTAMP(6))
210106 9:27:30 350 Quit

SQL> TRUNCATE TABLE test.test;
SQL> SELECT COUNT, CURRENT_TIMESTAMP() FROM test;
-----------------------------+

count current_timestamp()

-----------------------------+

0 2021-01-06 09:24:12
35817 2021-01-06 09:25:37
55553 2021-01-06 09:26:28
59872 2021-01-06 09:27:01
61970 2021-01-06 09:27:43
61970 2021-01-06 09:28:02
61970 2021-01-06 09:29:29
61970 2021-01-06 09:30:14

-----------------------------+

Secondary instance: mysql-57

2021-01-06T08:27:30.470174Z 383 Query DELETE FROM test WHERE id = 61171
2021-01-06T08:27:30.470592Z 383 Query SELECT * FROM test WHERE id = 24971
--> Quit is missing! And last statement was an INSERT NOT a DELETE!

SQL> TRUNCATE TABLE test.test;
SQL> SELECT COUNT, CURRENT_TIMESTAMP() FROM test;
-----------------------------+

count current_timestamp()

-----------------------------+

0 2021-01-06 09:24:07
35228 2021-01-06 09:25:36
50725 2021-01-06 09:26:30
59681 2021-01-06 09:27:00
61401 2021-01-06 09:27:42
61401 2021-01-06 09:28:00
61401 2021-01-06 09:29:28
61401 2021-01-06 09:30:13

-----------------------------+

570 rows missing. Now warning in the MaxScale error log.

The test table:

CREATE TABLE `test` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`data` varchar(128) DEFAULT NULL,
`ts` timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6),
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=100000 DEFAULT CHARSET=latin1

And the test script:

Comment by Oli Sennhauser [ 2021-01-06 ]

mixed_test.php

Comment by Oli Sennhauser [ 2021-01-06 ]

All the other details here: https://fromdual.com/traffic-mirroring-with-mariadb-maxscale

Comment by markus makela [ 2021-01-08 ]

I've managed to reproduce this with a local setup. This seems to happen much more easily when the branched target is slower than the main target. I've created MXS-3370 for the documentation problem and changed this back to a bug.

Comment by markus makela [ 2021-05-03 ]

The fix to this was inadequate and it caused a session reference leak. This leak made it appear as if it worked as expected.

Generated at Thu Feb 08 04:20:48 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.