[MDEV-16610] Torn reads from spider, with XA and isolation level REPEATABLE READ Created: 2018-06-28 Updated: 2023-12-18 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - Spider, XA |
| Affects Version/s: | 10.3 |
| Fix Version/s: | 10.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Mattias Jonsson | Assignee: | Yuchen Pei |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | None | ||
| Environment: | Linux |
| Issue Links: | |
| Description |
|
Writes through Spider to external data nodes are guaranteed to be consistent through XA; reads, however, are not consistent and can be torn. This means Spider cannot be considered an ACID engine, even with the SERIALIZABLE isolation level and XA: a simple transaction, such as atomically inserting or deleting two rows, can still be observed by a concurrent read as if only one row exists. On the Spider head node:
Insert some minimal data to avoid 0/1 row optimisations:
On the data nodes:
Spider head variables:
Spider head session 1
Spider head session 2, run concurrently with session 1, showing >15% torn reads!
From the query.log on a data node, showing concurrent reads and writes; see '5785670 Query select a.id,sum(b.`id`)...' showing reads in the 'in-doubt' XA state between prepare and commit:
I think this can be fixed by not allowing queries in Spider to cross XA commit: XA COMMIT would need to wait for currently running queries to finish before being sent to any node, and new queries would need to be blocked until XA COMMIT has completed on all nodes. But this would limit parallelism and performance. |
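The original test scripts were stripped from this export; the following is a minimal sketch of the kind of scenario the description refers to (table name and values are illustrative, not from the original test):

```sql
-- Session 1 on the Spider head: insert two rows atomically.
-- Internally, Spider distributes this as an XA transaction
-- across the data nodes holding the affected partitions.
BEGIN;
INSERT INTO t VALUES (1);  -- row stored on data node A
INSERT INTO t VALUES (2);  -- row stored on data node B
COMMIT;  -- Spider issues XA PREPARE on both nodes, then XA COMMIT on each

-- Session 2 on the Spider head, concurrently:
SELECT COUNT(*) FROM t;
-- Expected: 0 (before) or 2 (after); a torn read observes 1, if the SELECT
-- reaches node A after its XA COMMIT but node B while it is still in the
-- in-doubt state between prepare and commit.
```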
| Comments |
| Comment by Marko Mäkelä [ 2018-06-28 ] | |||||||
|
We had a brainstorming session about this with Kentoku, mattiasjonsson, Elkin, markus makela, Eric_Herman and monty.
My tentative proposal for the syntax would look like the following, when the global identifiers for transactions are being passed as XA XIDs:
The first transaction identifier would be the start identifier of the transaction. Optionally, it is followed by a set of (necessarily smaller) transaction identifiers that the Spider head node knows to be still running at that point in time. Some of those transactions may already have been committed at the data node by the time the statement is executed, but the data node would consider them as not committed when creating the read view. In this way, we can create exactly the same read view on all data nodes at the same time.

If the processing of the above XA START transaction was delayed on some data node, it could happen that some of the transactions in the BEFORE list were not only committed, but also purged. In this case, the statement would fail (we would not be able to create the requested read view on this data node). Spider could either handle that error (by propagating the error to the client, or by trying to create a new distributed read view), or we could try to prevent it by adding a mechanism that would delay purging until an explicit acknowledgement is sent from the Spider head to the data node. This acknowledgement could be an optional part of the XA statement, something like
This would guarantee that no subsequent XA START statement may specify a BEFORE identifier that is smaller than the PURGE identifier, because history would be preserved for any transaction that was started with an identifier that is newer than the one identified by the PURGE clause. | |||||||
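The syntax examples in this comment were stripped from the export. Based on the surrounding description, the proposal was presumably along these lines (the BEFORE and PURGE keywords are as discussed in the text; the exact form is a reconstruction, not confirmed syntax):

```sql
-- Start a distributed read view: 'xid100' is this transaction's own
-- identifier; the BEFORE list names (smaller) identifiers that the Spider
-- head knows to be still running, which every data node must treat as
-- uncommitted when creating the read view.
XA START 'xid100' BEFORE 'xid97', 'xid98';

-- Optional purge acknowledgement: history for any transaction newer than
-- 'xid90' must be preserved until a later statement raises the purge limit.
XA START 'xid100' BEFORE 'xid97', 'xid98' PURGE 'xid90';
```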
| Comment by Marko Mäkelä [ 2018-06-29 ] | |||||||
|
Elkin made a valid point that the Spider head could uniquely identify a logical point of time in terms of a global transaction identifier that is assigned at commit time from a global sequence, like NEXTVAL(id). When starting a read view, we would simply specify the time as LASTVAL(id). This would be directly compatible with the way GTID is assigned.

The challenge for InnoDB is: how to map the global identifier to local transaction identifiers and read views? Conceptually, it would map to the InnoDB transaction end identifier trx_t::no. That is the starting point of ReadView::prepare() (10.2) or trx_sys.snapshot_ids() (10.3). The other aspect is that the read view copies the local identifiers of currently active (not yet committed) transactions.

The potential race condition would occur when a read view is created soon after Spider has submitted a transaction commit. I believe that a solution exists, but it would involve extra waiting: the read view creation would have to wait for the transaction corresponding to the LASTVAL(id) to be committed. The challenge is that the data node will not know the LASTVAL(id), or which transaction it maps to, until the COMMIT statement is actually processed by InnoDB. Furthermore, if the Spider head submitted a LASTVAL(id) that would never be sent to that particular data node in a COMMIT statement, then the read view creation would end in an infinite wait (which could only be terminated if the data node received a COMMIT with a larger NEXTVAL(id)).

So, if we used this approach, the Spider head should keep track of the latest NEXTVAL(id) that was sent to each data node. The read view creation could then send a different global id to different data nodes, to get the state as of the global time that corresponds to the maximum of the ids sent to the data nodes (which would be the LASTVAL(id) of the global sequence when the read view creation was initiated by the Spider head). This is doable.
It would cause some extra waiting inside the data nodes for the read view creation. Because the read view creation does not need to be acknowledged to the Spider head node, it should not cause too big extra delays. I think that we should consider this approach. The syntax could be something like this:
This would wait for 'xid' to be committed and then create the read view as it is at that transaction commit. There would be no need to assign a global transaction identifier at the start of the transaction, but there would still be a need to explicitly release purge, or otherwise the read view creation could fail, in case the data node is congested or the processing of the request for creating the read view is delayed. There could be an optional clause for this:
Spider head would keep track of the OK packets for previously submitted read view creation statements, and specify the PURGE clause so that there are no pending read view creations which were submitted with an AFTER 'xid' earlier than the PURGE 'xid'. When the Spider head is started, there would be no pending commits on any data nodes, and data nodes would not necessarily have any knowledge of the Spider head’s LASTVAL(id). In this case, the read view would be created with a plain
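The statement sketches in this comment were also stripped from the export. From the description, the proposed forms would presumably look something like the following (the AFTER and PURGE clauses are as discussed above; the exact wording is a reconstruction):

```sql
-- Wait until the transaction committed as 'xid42' has been committed on
-- this data node, then create the read view exactly as of that commit.
START TRANSACTION WITH CONSISTENT SNAPSHOT AFTER 'xid42';

-- Optionally also release purge up to 'xid30': the Spider head guarantees
-- that no pending snapshot request with an AFTER earlier than 'xid30' is
-- still outstanding on this node.
START TRANSACTION WITH CONSISTENT SNAPSHOT AFTER 'xid42' PURGE 'xid30';

-- On a freshly started Spider head, with no pending commits on any data
-- node, the read view would be created with a plain:
START TRANSACTION WITH CONSISTENT SNAPSHOT;
```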
| |||||||
| Comment by Daniele Sciascia [ 2018-06-29 ] | |||||||
|
I suspect that in the test, check_select_consistency() is exposed to torn reads because Spider submits individual SELECTs to each of the involved data nodes. If that's the case, it means that those SELECTs release their read locks as soon as the statement completes on the data node. | |||||||
| Comment by Seppo Jaakola [ 2018-06-29 ] | |||||||
|
Thanks to sciascid for analyzing this problem in depth. Spider could provide correctly scheduled reads by processing all SELECTs inside XA transactions and using SERIALIZABLE isolation, i.e. by implementing distributed two-phase locking for reads and writes. In the test, Spider is configured to use XA and SERIALIZABLE isolation, but the assumption is that Spider does not use XA for plain reads. With XA, a SELECT would acquire shared locks and would not release them until all XA SELECTs on all data nodes have been successfully prepared and committed. However, that would lead to a higher rate of distributed deadlocks, and so may not be a practical solution. | |||||||
| Comment by Marko Mäkelä [ 2018-06-29 ] | |||||||
|
sciascid, perhaps my comments would better apply to MDEV-14090, which is about non-locking reads and read view creation. When it comes to that, it is only part of the solution that the read view creation has to be submitted to every data node at the same time, instead of being submitted implicitly when the first operation of the transaction accesses a particular data node. (Related to that, XA START does not currently start a read view; it would be started when the first row is accessed inside InnoDB.) There would still be a race condition between commit and read view creation, if commits arrive at roughly the same time as the read view creation. My comments are about resolving that race condition, so that each data node will actually start the same read view.

In the SERIALIZABLE isolation level, reads would be locking, so read views should not matter. Essentially, once a transaction has acquired a transactional lock on an index record, there cannot be any pending changes to the record, and the latest version in the page is what should be viewed. So, essentially, SERIALIZABLE is a combination of locking and READ UNCOMMITTED. It might be that my comment in

Normally, a conflict between commits and locking reads becomes a normal locking conflict, which may result in a lock wait or deadlock. If some transaction was committed after a SERIALIZABLE transaction was started, but before the SERIALIZABLE transaction attempted to access (and read-lock) a record written by the committed transaction, the SERIALIZABLE transaction would see the newly committed version of the data. (This could be the very problem that mattiasjonsson reported for the SERIALIZABLE isolation level.) If that earlier transaction was not yet committed, then we would merely have a locking conflict between the two transactions, resulting in a lock wait that might end in a timeout or deadlock. | |||||||
| Comment by Mattias Jonsson [ 2018-07-25 ] | |||||||
|
I was wrong, the test is NOT running at the SERIALIZABLE isolation level, because
is missing the SESSION keyword, meaning it did not change the isolation level for the current session. After fixing that, I cannot see any torn reads at the SERIALIZABLE isolation level. Torn reads can still be seen in REPEATABLE READ; whether that is a bug that should be fixed, or merely behaviour to document, is open for discussion. | |||||||
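The original test snippet was stripped from this export; for reference, this is the standard MariaDB distinction being described (the SERIALIZABLE level itself is from the comment above):

```sql
-- Without SESSION: applies only to the *next* transaction started in this
-- session, then the isolation level reverts to the session default.
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- With SESSION: applies to all subsequent transactions in this session.
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;

-- Verify the effective session-level setting:
SELECT @@session.tx_isolation;
```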
| Comment by Sergei Golubchik [ 2019-05-14 ] | |||||||
|
Kentoku, how would MDEV-14090 help here? | |||||||
| Comment by Kentoku Shiba (Inactive) [ 2019-05-15 ] | |||||||
|
serg, read consistency without locking in a sharded environment requires the following things.
I think MDEV-14090 is for the first one. | |||||||
| Comment by Sergei Golubchik [ 2019-05-15 ] | |||||||
|
I don't think it is. "start with consistent snapshot" will simply ask all engines to start a transaction now. It will still be sequential, so race conditions are possible. And from the Spider point of view it will be one call, something like spider_hton->start_consistent_snapshot(), where Spider will need to iterate over all nodes and start a consistent snapshot on each; this, again, is not atomic and is prone to race conditions. Generally, I think, the DTP XA standard only provides proper isolation at the SERIALIZABLE isolation level. On the other hand, if you serialize the "starting a transaction on all nodes" code, for example with a mutex, then you'll have consistent reads, but at the cost of reduced concurrency. Anyway, as far as I can see, MDEV-14090 will not be of any use to you here in any case; it looks quite unrelated. | |||||||
| Comment by Kentoku Shiba (Inactive) [ 2019-05-15 ] | |||||||
|
serg | |||||||
| Comment by Kentoku Shiba (Inactive) [ 2019-11-17 ] | |||||||
|
Spider can be considered an ACID engine with the SERIALIZABLE isolation level and XA, because all SELECTs take shared locks and wait for commit/rollback if the rows being read are inserted/updated/deleted. | |||||||
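This relies on documented InnoDB behaviour on the data nodes: under SERIALIZABLE (with autocommit disabled), plain SELECTs are implicitly converted to locking reads. A sketch of the effect on a data node (table name is illustrative):

```sql
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN;
-- Under SERIALIZABLE, InnoDB implicitly executes this as
-- SELECT ... LOCK IN SHARE MODE, so it blocks on any uncommitted
-- (e.g. XA-prepared) change to the matching rows instead of reading
-- around it via a read view:
SELECT * FROM t WHERE id BETWEEN 1 AND 2;
COMMIT;
```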
| Comment by Marko Mäkelä [ 2019-11-21 ] | |||||||
|
monty suggested a possible solution to this problem:
This tells InnoDB two things:
It is an error to request a BEGIN with a smaller number for the purge limit, such as:
Note: Not all listed transaction identifiers are necessarily known by all Spider data nodes, because not all transactions will always modify tables on all data nodes. | |||||||
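The concrete statements monty suggested were stripped from this export. A purely hypothetical reconstruction from the surrounding text (the `WITH SNAPSHOT BEFORE ... PURGE ...` form is invented here for illustration; only the semantics, a list of in-flight transaction identifiers plus a monotonically increasing purge limit, come from the comment):

```sql
-- Create a read view that treats transactions 95 and 97 as uncommitted,
-- and promise that no later statement will name anything older than 90,
-- so history up to 90 may be purged.
BEGIN WITH SNAPSHOT BEFORE 95, 97 PURGE 90;

-- Error: the purge limit may never move backwards.
BEGIN WITH SNAPSHOT BEFORE 99 PURGE 85;
```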
| Comment by Marko Mäkelä [ 2019-11-22 ] | |||||||
|
I feel that some clarifications are necessary:
I have not thought through how to handle distributed transactions between multiple Spider clusters. Above, we would be internally using the XA mechanism for transactions that are not distributed from the point of view of a client that connects to a Spider head node. |