[MDEV-16610] Torn reads from spider, with XA and isolation level REPEATABLE READ Created: 2018-06-28  Updated: 2023-12-18

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - Spider, XA
Affects Version/s: 10.3
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Mattias Jonsson Assignee: Yuchen Pei
Resolution: Unresolved Votes: 2
Labels: None
Environment:

Linux


Issue Links:
Relates
relates to MDEV-5004 Support parallel read transactions on... Open
relates to MDEV-25097 spider xa savepoint not supported/inc... Open
relates to MDEV-27717 Parallel execution on partitions in s... Open

 Description   

Writes through Spider to external data nodes are guaranteed to be consistent through XA; however, reads are not consistent and can be torn. This means Spider cannot be considered an ACID engine, even with SERIALIZABLE isolation level and XA. A simple transaction, such as atomically inserting or deleting two rows, can still be observed by a concurrent read as if only one of the rows exists.

On the spider head node:

create database spider_replication_test;
create table t2 (
  id int unsigned PRIMARY KEY,
  filler varchar(64) DEFAULT 'Default Filler non-sense!'
) ENGINE=SPIDER COMMENT='wrapper = "mysql", table = "t2", database = "spider_replication_test"'
PARTITION BY HASH (`id`)
(PARTITION `p0` COMMENT = 'srv "datanode1"' ENGINE = SPIDER,
 PARTITION `p1` COMMENT = 'srv "datanode2"' ENGINE = SPIDER,
 PARTITION `p2` COMMENT = 'srv "datanode3"' ENGINE = SPIDER,
 PARTITION `p3` COMMENT = 'srv "datanode4"' ENGINE = SPIDER);
 
 
DELIMITER //
 
DROP PROCEDURE IF EXISTS insert_and_delete//
CREATE PROCEDURE insert_and_delete(IN loops INT)
  BEGIN
    SET @x = 1;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    # To make it easier to find in the query.log:
    SELECT * FROM spider_replication_test.t2 WHERE filler NOT LIKE 'START OF insert_and_delete';
    REPEAT
        SET @x = @x + 1;
        start transaction;
        INSERT INTO spider_replication_test.t2 (id, filler) VALUES (0, 'p0 consistent view check'), (1, 'p1 consistent view check'), (2, 'p2 consistent view check'), (3, 'p3 consistent view check');
        commit;
        start transaction;
        DELETE FROM spider_replication_test.t2 WHERE id IN (0,1,2,3);
        commit;
    UNTIL @x > loops END REPEAT;
    # To make it easier to find in the query.log:
    SELECT * FROM spider_replication_test.t2 WHERE filler NOT LIKE 'END OF insert_and_delete';
  END
//
DROP PROCEDURE IF EXISTS check_select_consistency//
CREATE PROCEDURE check_select_consistency(IN loops INT, OUT outparam INT)
  BEGIN
    DECLARE var_sum INT;
    SET @x = 1;
    SET outparam = 0;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    # To make it easier to find in the query.log:
    SELECT * FROM spider_replication_test.t2 WHERE filler NOT LIKE 'START OF check_select_consistency';
    REPEAT
        SET @x = @x + 1;
        SELECT sum(id) INTO var_sum FROM spider_replication_test.t2 WHERE id in (0,1,2,3);
        IF (var_sum IS NOT NULL AND var_sum != 6 ) THEN
            SET outparam = outparam + 1;
        END IF;
 
    UNTIL @x > loops END REPEAT;
    # To make it easier to find in the query.log:
    SELECT * FROM spider_replication_test.t2 WHERE filler NOT LIKE 'END OF check_select_consistency';
  END
//
DELIMITER ;

Insert some minimal data to avoid 0/1 row optimisations:

INSERT INTO t2 (id, filler) VALUES (4, 'p0 dummy data'), (5, 'p1 dummy data'), (6, 'p2 dummy data'), (7, 'p3 dummy data'), (8, 'P0 dummy data 2'), (9, 'P1 dummy data 2'), (10, 'P2 dummy data 2'), (11, 'P3 dummy data 2');

On the data nodes:

use spider_replication_test;
create table t2 (
  id int unsigned PRIMARY KEY,
  filler varchar(64) DEFAULT 'Default Filler non-sense!'
) ENGINE=INNODB;

Spider head variables:

spider-head [spider_replication_test]> show variables WHERE Variable_name like '%spider%' AND value not like '-1';
+---------------------------------------+--------+
| Variable_name                         | Value  |
+---------------------------------------+--------+
| spider_bka_engine                     |        |
| spider_block_size                     | 16384  |
| spider_conn_recycle_mode              | 0      |
| spider_conn_recycle_strict            | 0      |
| spider_conn_wait_timeout              | 10     |
| spider_connect_error_interval         | 1      |
| spider_connect_mutex                  | OFF    |
| spider_connect_retry_count            | 1000   |
| spider_connect_retry_interval         | 1000   |
| spider_direct_dup_insert              | 1      |
| spider_dry_access                     | OFF    |
| spider_force_commit                   | 1      |
| spider_general_log                    | ON     |
| spider_index_hint_pushdown            | OFF    |
| spider_internal_unlock                | OFF    |
| spider_internal_xa                    | ON     |
| spider_internal_xa_id_type            | 0      |
| spider_internal_xa_snapshot           | 0      |
| spider_local_lock_table               | OFF    |
| spider_lock_exchange                  | OFF    |
| spider_log_result_error_with_sql      | 3      |
| spider_log_result_errors              | 3      |
| spider_max_connections                | 0      |
| spider_net_read_timeout               | 86400  |
| spider_net_write_timeout              | 86400  |
| spider_ping_interval_at_trx_start     | 3600   |
| spider_quick_mode                     | 3      |
| spider_remote_access_charset          |        |
| spider_remote_default_database        |        |
| spider_remote_time_zone               | SYSTEM |
| spider_same_server_link               | OFF    |
| spider_semi_table_lock                | 1      |
| spider_semi_trx                       | ON     |
| spider_support_xa                     | ON     |
| spider_sync_autocommit                | ON     |
| spider_sync_time_zone                 | ON     |
| spider_sync_trx_isolation             | ON     |
| spider_table_crd_thread_count         | 10     |
| spider_table_init_error_interval      | 1      |
| spider_table_sts_thread_count         | 10     |
| spider_udf_table_lock_mutex_count     | 20     |
| spider_udf_table_mon_mutex_count      | 20     |
| spider_use_all_conns_snapshot         | OFF    |
| spider_use_consistent_snapshot        | OFF    |
| spider_use_default_database           | ON     |
| spider_use_flash_logs                 | OFF    |
| spider_use_snapshot_with_flush_tables | 0      |
| spider_version                        | 3.3.13 |
| spider_xa_register_mode               | 1      |
+---------------------------------------+--------+
49 rows in set (0.00 sec)

Spider head session 1

spider-head [spider_replication_test]> call insert_and_delete(1000);
Empty set (0.01 sec)
 
Empty set (20.42 sec)
 
Query OK, 8000 rows affected (20.42 sec)

Spider head session 2, run concurrently with session 1, showing >15% torn reads:

spider-head [spider_replication_test]> call check_select_consistency(2000, @o); select @o;
Empty set (0.01 sec)
 
Empty set (15.89 sec)
 
Query OK, 2000 rows affected (15.89 sec)
 
+------+
| @o   |
+------+
|  385 |
+------+
1 row in set (0.00 sec)

From the query.log on a data node, showing concurrent reads and writes. See '5785670 Query select a.id,sum(b.`id`)...': the read executes while another transaction is in the 'in-doubt' XA state between prepare and commit:

                5785669 Connect spider_test@spider-head.example.com as anonymous on 
                5785669 Query   set session transaction isolation level read committed;set session autocommit = 1;xa start 0x3339613139,0x63656634376134,1
                5785669 Query   SET NAMES utf8mb4
                5785669 Init DB spider_replication_test
                5785669 Query   drop temporary table if exists spider_replication_test.tmp_spider_bka_0x7f79d40a9870;create temporary table spider_replication_test.tmp_spider_bka_0x7f79d40a9870(id bigint,c0 int(10) unsigned)engine=memory default charset=utf8mb4 collate utf8mb4_general_ci;insert into spider_replication_test.tmp_spider_bka_0x7f79d40a9870(id,c0)values(0,0),(1,1),(2,2),(3,3)
                5785669 Query   select a.id,b.`id` from spider_replication_test.tmp_spider_bka_0x7f79d40a9870 a,`spider_replication_test`.`t2` b where a.c0 <=> b.`id` and (b.`id` in( 0 , 1 , 2 , 3)) for update
                5785669 Query   delete from `spider_replication_test`.`t2` where `id` = 0 limit 1
                5785669 Query   drop temporary table if exists spider_replication_test.tmp_spider_bka_0x7f79d40a9870
                5785669 Query   xa end 0x3339613139,0x63656634376134,1
                5785669 Query   xa prepare 0x3339613139,0x63656634376134,1
                5785670 Connect spider_test@spider-head.example.com as anonymous on 
                5785670 Query   set session transaction isolation level read committed;set session autocommit = 1;xa start 0x3339633438,0x63656634376134,1
                5785670 Query   SET NAMES utf8mb4
                5785670 Init DB spider_replication_test
                5785670 Query   drop temporary table if exists spider_replication_test.tmp_spider_bka_0x7f79d40a9870;create temporary table spider_replication_test.tmp_spider_bka_0x7f79d40a9870(id bigint,c0 int(10) unsigned)engine=memory default charset=utf8mb4 collate utf8mb4_general_ci;insert into spider_replication_test.tmp_spider_bka_0x7f79d40a9870(id,c0)values(0,0),(1,1),(2,2),(3,3)
                5785670 Query   select a.id,sum(b.`id`),b.`id` from spider_replication_test.tmp_spider_bka_0x7f79d40a9870 a,`spider_replication_test`.`t2` b where a.c0 <=> b.`id` and (b.`id` in( 0 , 1 , 2 , 3))
                5785669 Query   xa commit 0x3339613139,0x63656634376134,1
                5785669 Quit    
                5785671 Connect spider_test@spider-head.example.com as anonymous on 
                5785671 Query   set session transaction isolation level read committed;set session autocommit = 1;xa start 0x3339613139,0x63656634376134,1
                5785671 Query   SET NAMES utf8mb4
                5785671 Init DB spider_replication_test
                5785671 Query   insert high_priority into `spider_replication_test`.`t2`(`id`,`filler`)values(0,'p0 consistent view check')
                5785670 Query   drop temporary table if exists spider_replication_test.tmp_spider_bka_0x7f79d40a9870
                5785670 Query   xa end 0x3339633438,0x63656634376134,1
                5785670 Query   xa prepare 0x3339633438,0x63656634376134,1
                5785670 Query   xa commit 0x3339633438,0x63656634376134,1
                5785671 Query   xa end 0x3339613139,0x63656634376134,1
                5785670 Quit    

I think this can be fixed by not allowing queries in Spider to cross an XA commit, i.e. XA COMMIT must wait for currently running queries to finish before it is sent to any node, and new queries must be blocked until XA COMMIT has completed on all nodes. But this would limit parallelism and performance.
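The blocking scheme described above can be sketched as a reader-writer barrier, where ordinary queries take the shared side and XA COMMIT takes the exclusive side. This is a hypothetical illustration of the idea, not Spider code; all names are invented.

```python
import threading

class XaCommitBarrier:
    """Sketch: queries hold the shared side; XA COMMIT takes the
    exclusive side, so no query can overlap a distributed commit."""

    def __init__(self):
        self._lock = threading.Condition()
        self._active_queries = 0
        self._committing = False

    def begin_query(self):
        with self._lock:
            # New queries are blocked while an XA commit is in flight.
            while self._committing:
                self._lock.wait()
            self._active_queries += 1

    def end_query(self):
        with self._lock:
            self._active_queries -= 1
            self._lock.notify_all()

    def begin_commit(self):
        with self._lock:
            self._committing = True
            # Wait for currently running queries to drain before the
            # XA COMMIT is sent to any data node.
            while self._active_queries > 0:
                self._lock.wait()

    def end_commit(self):
        with self._lock:
            self._committing = False
            self._lock.notify_all()
```

As the comment at the end of the description notes, this serializes reads against commits and therefore costs parallelism.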



 Comments   
Comment by Marko Mäkelä [ 2018-06-28 ]

We had a brainstorming session about this with Kentoku, mattiasjonsson, Elkin, markus makela, Eric_Herman and monty.
monty came up with the following:

  1. The Spider head node would generate global identifiers for transactions from one global sequence, at the start of each transaction.
  2. Writes are only sent to the affected nodes.
  3. Read transactions where a consistent distributed read view is requested are started on each data node, passing a set of global identifiers of currently running transactions (if any).
  4. The data nodes will map the global identifiers to local transaction identifiers. Some of the "currently running transactions" may have been recently committed, but we would consider them as "not committed" based on the set that we got at the read view creation statement.

My tentative proposal for syntax would look like the following, when the global identifiers for transactions are being passed as XA XIDs:

XA START 'xid' [ BEFORE 'xid' (, 'xid' )* ];

The first transaction identifier would be the start identifier of the transaction. Optionally, it is followed by a set of (necessarily smaller) transaction identifiers that the Spider head node knows to be still running at that point of time. Some of the transactions may have been committed at the data node when the statement is finally executed, but the data node would consider these transactions as not committed when creating the read view. In this way, we can create exactly the same read view on all data nodes at the same time.

If the processing of the above XA START transaction was delayed on some data node, it could happen that some of the transactions in the BEFORE list were not only committed, but also purged. In this case, the statement would fail (we would not be able to create the requested read view on this data node). Spider could either handle that error (by propagating the error to the client, or by trying to create a new distributed read view), or we could try to prevent it by adding a mechanism that would delay purging until an explicit acknowledgement is sent from the Spider head to the data node. This acknowledgement could be an optional part of the XA statement, something like

XA START 'xid' [ BEFORE 'xid' (, 'xid' )* ] [ PURGE 'xid' ];

This would guarantee that no subsequent XA START statement may specify a BEFORE identifier that is smaller than the PURGE identifier, because history would be preserved for any transaction that was started with an identifier that is newer than the one identified by the PURGE clause.

Comment by Marko Mäkelä [ 2018-06-29 ]

Elkin made a valid point that the Spider head could uniquely identify a logical point of time in terms of a global transaction identifier that is assigned at commit time from a global sequence, like NEXTVAL(id). When starting a read view, we would simply specify the time as LASTVAL(id). This would be directly compatible with the way GTIDs are assigned.

The challenge for InnoDB is: How to map the global identifier to local transaction identifiers and read views. Conceptually, it would map to the InnoDB transaction end identifier trx_t::no. That is the starting point of ReadView::prepare() (10.2) or trx_sys.snapshot_ids() (10.3). The other aspect is that the read view is copying the local identifiers of currently active (not yet committed) transactions.

The potential race condition would occur when a read view is created soon after Spider has submitted a transaction commit. I believe that a solution exists, but it would involve extra waiting: The read view creation would have to wait for the transaction corresponding to the LASTVAL(id) to be committed. The challenge is that the data node will not know the LASTVAL(id) or which transaction it maps to, until the COMMIT statement is actually processed by InnoDB. Furthermore, if the Spider head submitted a LASTVAL(id) that would never be sent to that particular data node in a COMMIT statement, then the read view creation would end in an infinite wait (which could only be terminated if the data node received a COMMIT with a larger NEXTVAL(id)).

So, if we used this approach, the Spider head should keep track of the latest NEXTVAL(id) that was sent to each data node. The read view creation could then send a different global id to different data nodes, to get the state as of the global time that corresponds to the maximum of the id sent to the data nodes (which would be the LASTVAL(id) of the global sequence when the read view creation was initiated by the Spider head).
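The per-node bookkeeping described above could be sketched as follows. This is a hypothetical helper (all names invented), only illustrating the idea that the head sends each data node the last commit id that node has actually seen, avoiding the infinite-wait problem.

```python
class SpiderHeadTracker:
    """Sketch: track the latest NEXTVAL(id) sent to each data node,
    so that read view creation can send a per-node AFTER 'xid'."""

    def __init__(self, nodes):
        # 0 = no commit has been sent to this node yet.
        self.last_commit_id = {node: 0 for node in nodes}

    def record_commit(self, node, global_id):
        # Called whenever a COMMIT carrying global_id is sent to node.
        self.last_commit_id[node] = max(self.last_commit_id[node],
                                        global_id)

    def after_xid_for(self, node):
        # A data node can only wait for a commit it will actually
        # receive, so send the last id known to have reached it.
        return self.last_commit_id[node]
```

The resulting distributed read view corresponds to the maximum of the per-node ids, which is the LASTVAL(id) of the global sequence when the read view creation was initiated by the Spider head.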

This is doable. It would cause some extra waiting inside the data nodes for the read view creation. Because the read view creation does not need to be acknowledged to the Spider head node, it should not cause too big extra delays. I think that we should consider this approach. The syntax could be something like this:

XA START 'xid' AFTER 'xid';
-- alternative form
START TRANSACTION WITH CONSISTENT SNAPSHOT 'xid';

This would wait for 'xid' to be committed and then create the read view as it is at that transaction commit.

There would be no need to assign a global transaction identifier at the start of the transaction, but there would still be a need to explicitly release purge, or otherwise the read view creation could fail, in case the data node is congested or the processing of the request for creating the read view is delayed. There could be an optional clause for this:

XA START 'xid' AFTER 'xid' [ PURGE 'xid' ];

Spider head would keep track of the OK packets for previously submitted read view creation statements, and specify the PURGE clause so that there are no pending read view creations which were submitted with an AFTER 'xid' earlier than the PURGE 'xid'.

When the Spider head is started, there would be no pending commits on any data nodes, and data nodes would not necessarily have any knowledge of the Spider head’s LASTVAL(id). In this case, the read view would be created with a plain

XA START 'xid'; -- this should be changed to immediately create a read view on all engines
-- alternatively
START TRANSACTION WITH CONSISTENT SNAPSHOT;

Comment by Daniele Sciascia [ 2018-06-29 ]

I suspect that in the test, check_select_consistency() is exposed to torn reads because Spider submits "individual" SELECTs to each of the involved data nodes. If that's the case, it means that those SELECTs release their read locks right after the statement is done on the data node.
One way to achieve serializability would be to wrap those individual SELECTs in an XA transaction, effectively extending the read locks until all of them are taken on all data nodes involved, and releasing them only after XA COMMIT. This would guarantee two-phase locking and thus serializability, and no torn reads.
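Following the format of the query log above, this suggestion would roughly correspond to Spider sending each data node something like the following. This is a hedged sketch of the proposed behaviour, not what Spider currently emits:

```sql
XA START 'xid1';
SELECT id, filler FROM spider_replication_test.t2
  WHERE id IN (0,1,2,3) LOCK IN SHARE MODE;  -- shared locks acquired
XA END 'xid1';
XA PREPARE 'xid1';
-- shared locks released only here, after every involved node prepared:
XA COMMIT 'xid1';
```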
Just my two cents.

Comment by Seppo Jaakola [ 2018-06-29 ]

Thanks to sciascid for analyzing this problem in depth: Spider could provide correctly scheduled reads by processing all SELECTs inside XA transactions and using SERIALIZABLE isolation, i.e. implementing distributed two-phase locking for reads and writes. In the test, Spider is configured to use XA and SERIALIZABLE isolation, but the assumption is that Spider does not use XA for plain reads. With XA, a SELECT would acquire shared locks and would not release them until all XA SELECTs, on all data nodes, have been successfully prepared and committed.

However, that would lead to an elevated rate of distributed deadlocks, and may not be a practical solution.

Comment by Marko Mäkelä [ 2018-06-29 ]

sciascid, perhaps my comments would better apply to MDEV-14090, which is about non-locking reads and read view creation. When it comes to that, it is only a part of the solution that the read view creation has to be submitted to every data node at the same time, instead of submitting it implicitly when the first operation of the transaction accesses a particular data node. (Related to that, XA START does not currently start a read view; it would be started when the first row is accessed inside InnoDB.)

There would still be a race condition between commit and read view creation, if commits are arriving at roughly the same time with the read view creation. My comments are about resolving that race condition, so that each data node will actually start the same read view.

In the SERIALIZABLE isolation level, reads would be locking, so read views should not matter. Essentially, once a transaction has acquired a transactional lock on an index record, there cannot be any pending changes to the record, and the latest version in the page is what should be viewed. So, essentially SERIALIZABLE is a combination of locking and READ UNCOMMITTED. It might be that my comment in MDEV-14589 is relevant for addressing the bug. If we implemented a stricter level of SERIALIZABLE, a read could fail with "snapshot too old" error, instead of returning too new data.

Normally, a conflict between commits and locking reads becomes a normal locking conflict, which may result in a lock wait or deadlock. If some transaction was committed after a SERIALIZABLE transaction was started, but before the SERIALIZABLE transaction attempted to access (and read-lock) a record written by the committed transaction, the SERIALIZABLE transaction would see the newly committed version of the data. (This could be the very problem that mattiasjonsson reported for the SERIALIZABLE isolation level.) If that earlier transaction was not yet committed, then we would merely have a locking conflict between the two transactions, resulting in a lock wait that might result in a timeout or deadlock.

Comment by Mattias Jonsson [ 2018-07-25 ]

I was wrong: the test is NOT running at the SERIALIZABLE isolation level, because

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

is missing the SESSION keyword, meaning it did not change the isolation level for the current session (it applies only to the next transaction).

When fixing that I cannot see any torn reads in SERIALIZABLE isolation level.
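For reference, the session-scoped form, which changes the isolation level for all subsequent transactions in the session, is:

```sql
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
```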

Torn reads can still be seen in REPEATABLE READ, but whether that is a bug to be fixed, or behaviour to simply document, is debatable.

Comment by Sergei Golubchik [ 2019-05-14 ]

Kentoku, how would MDEV-14090 help here?

Comment by Kentoku Shiba (Inactive) [ 2019-05-15 ]

serg, read consistency without locks in a sharded environment requires the following:

  1. Start the (XA) transaction at the same timestamp on each shard.
  2. (XA) commit at the same timestamp on each shard.
  3. Start each query at the same timestamp on each shard (only for READ COMMITTED).

I think MDEV-14090 is for the first one.

Comment by Sergei Golubchik [ 2019-05-15 ]

I don't think it is. "start with consistent snapshot" will simply ask all engines to start a transaction now. It will still be sequential, so race conditions are possible. And from the Spider point of view it will be one call, something like spider_hton->start_consistent_snapshot(), where Spider will need to iterate over all nodes and start a consistent snapshot on each, which, again, is not atomic and is prone to race conditions.

Generally, I think, the DTP XA standard only provides proper isolation at the SERIALIZABLE isolation level.

On the other hand, if you serialize the "starting a transaction on all nodes" code, for example with a mutex, then you'll have consistent reads, but at the cost of reduced concurrency.

Anyway, as far as I can see, MDEV-14090 will not be of any use to you here; it looks quite unrelated.

Comment by Kentoku Shiba (Inactive) [ 2019-05-15 ]

serg
What I would like to say is something like "start with consistent snapshot at '2019-05-16 01:21:00.000000000'", and likewise for commit: "commit at '2019-05-16 01:21:00.000000000'".
I assume that using the same time on all nodes makes it possible to access data with consistent views, without any locks.
Should I create a new Jira task instead of MDEV-14090?

Comment by Kentoku Shiba (Inactive) [ 2019-11-17 ]

Spider can be considered an ACID engine with the SERIALIZABLE isolation level and XA, because all SELECTs take shared locks and wait for commit/rollback if the rows being read are inserted/updated/deleted.

Comment by Marko Mäkelä [ 2019-11-21 ]

monty suggested a possible solution to this problem:

  • Spider head node(s) will send information about transaction commits to a monitor server, which is also assigning InnoDB trx_t::id to all Spider data nodes.
  • In a special mode of InnoDB, trx_t::id for write transactions will be assigned by the monitor server.
  • The spider head node and the monitor server will control read view creation and delay the purge, by extending the BEGIN or START TRANSACTION syntax with the InnoDB local transaction identifiers, something like this:

BEGIN 100, (103,104);

This tells InnoDB two things:

  • The purge view is not allowed to advance beyond transaction identifier 99 until another such extended BEGIN statement arrives.
  • No matter what transactions were already committed, InnoDB will pretend that transactions 103, 104 (which must have started earlier) were not committed.

It is an error to request a BEGIN with a smaller number for the purge limit, such as:

BEGIN 99, (100,103,104); -- error: previously we got BEGIN 100, …
BEGIN 100, (101,102,105); -- OK: the previous limit was BEGIN 100, …
BEGIN 101, (101,102,105); -- OK, returning the same read view (allowing purge view to advance)
BEGIN 100, (101,102,105); -- error: previously we got BEGIN 101, …

Note: Not all listed transaction identifiers are necessarily known by all Spider data nodes, because not all transactions will always modify tables on all data nodes.
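The validation rules in the examples above can be sketched as follows. This is a hypothetical helper mirroring the error/OK cases listed, not InnoDB code:

```python
class PurgeLimitValidator:
    """Sketch: a data node rejects an extended BEGIN whose purge
    limit is smaller than the largest limit received so far."""

    def __init__(self):
        self.purge_limit = 0  # largest purge limit received so far

    def begin(self, limit, uncommitted_ids):
        if limit < self.purge_limit:
            # e.g. BEGIN 99 after BEGIN 100, or BEGIN 100 after BEGIN 101
            raise ValueError("purge limit may not move backwards")
        self.purge_limit = limit
        # Transactions in uncommitted_ids are treated as not committed
        # when creating the read view, regardless of their actual state.
        return sorted(uncommitted_ids)
```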

Comment by Marko Mäkelä [ 2019-11-22 ]

I feel that some clarifications are necessary:

  • This extended functionality should be enabled only if a special parameter is set. Only then can we avoid a scenario where a data node is restarted and then suddenly starts purging history before the first extended BEGIN statement is issued.
  • All Spider data nodes that are serving a Spider head node would use the same 48-bit InnoDB transaction identifiers.
  • InnoDB would allocate those transaction identifiers from a new monitor server, which does not have any persistent state.
  • On startup, the monitor server would query all data nodes for the maximum transaction identifier, and initialize the sequence from that.
  • We seem to need XA 2-phase-commit syntax for distributed transaction commit between Spider data nodes.
  • But, XIDs will be lost after XA COMMIT, and we must be able to refer to committed not-yet-purged transactions.
  • Perhaps we could introduce an identity mapping, such as sscanf(xid, "xid" TRX_ID_FMT, &trx_id)?
  • If a Spider system is being replicated by Galera, all nodes should use the same transaction identifiers, assigned by the monitor server.

I have not thought out how to handle distributed transactions between multiple Spider clusters. Above, we would be internally using the XA mechanism for transactions that are not distributed from the point of view of a client that connects to a Spider head node.

Generated at Thu Feb 08 08:30:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.