[MDEV-28430] lf_alloc isn't safe on aarch64 (or ppc64le) Created: 2022-04-28 Updated: 2024-01-23 |
|
| Status: | In Review |
| Project: | MariaDB Server |
| Component/s: | Server |
| Affects Version/s: | 10.1.48, 10.9.0 |
| Fix Version/s: | 10.4, 10.5, 10.6, 10.11, 11.0, 11.1, 11.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Daniel Black | Assignee: | Sergei Golubchik |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | ARMv8, aarch64, hash, lock-free, powerpc, synchronization | ||
| Environment: | aarch64 mostly, but a small number of ppc64le test failures too; never on other architectures. |
| Attachments: |
| Issue Links: |
| Description |
|
Since 2020-08-24 the unit.lf test has frequently failed in buildbot on aarch64, and a few times on ppc64le. This began occurring after an earlier attempted fix. An example of a stalled test:
mbeck, svoj, if you have a moment/interest, can you please check the implementation again? |
| Comments |
| Comment by Daniel Black [ 2022-04-28 ] |
|
The lf_hash unit tests appear to be fixed by |
| Comment by Sergey Vojtovich [ 2022-05-01 ] |
|
danblack, did lf-t crash or just fail? A few lines from the log or a link to the failure would be useful. |
| Comment by Daniel Black [ 2022-05-02 ] |
|
Fairly common across all build environments.
|
| Comment by Sergey Vojtovich [ 2022-05-05 ] |
|
This part of the coredump (the lf_pinbox_real_free() frame) looks suspicious:
Note that all three addresses are equal: last is the address of the element processed in the previous iteration, cur the one in the current iteration, and list the one for the next iteration. In other words, a dead loop, i.e. element->next == element. It feels like the per-thread purgatory list is broken; on the other hand, I can't tell whether this coredump can be trusted (e.g. whether it comes from a heavily optimized binary). It is hard to dig further without access to a similar machine. |
| Comment by Daniel Black [ 2023-02-02 ] |
|
rseq (restartable sequences) might be a viable alternative implementation: https://lwn.net/Articles/883104/ |
| Comment by Marko Mäkelä [ 2023-04-28 ] |
|
vlad.lesin reproduced an assertion failure on AMD64 that suggests a race condition between lf_hash_search() and some other operation. The assertion failure:
In the record, the DB_TRX_ID is 0x6b684 (439940). In the trx_sys.rw_trx_hash there are 3 transactions, none with that identifier. One of them was found by the crashing thread:
This transaction is being committed in another thread:
This thread is waiting for the element->mutex that the crashing thread is holding. Note that the transaction ID is 439951, the same as element->id. It looks like transaction 439940 had been committed and the memory reused for a new transaction 439951 while lf_hash_search() was in progress. Is the element returned by lf_hash_search() supposed to be validated by the caller again? If yes, what is the purpose of element->id? I believe that the impact of this bug could be an ACID violation or weird locking anomalies. If we assume that the debug assertion did not exist, the returned transaction object would have been passed to lock_rec_convert_impl_to_expl_for_trx(), which only checks that the transaction has not been committed. If the unrelated transaction has not been committed, it would grant an exclusive lock to that transaction, unnecessarily blocking the lock-waiting transaction until the unrelated transaction is committed or rolled back. |
| Comment by Marko Mäkelä [ 2023-04-28 ] |
|
For an Oracle-internal 8.0.0 release of MySQL, the implementation was rewritten in C++11 std::atomic (using the implicit std::memory_order_seq_cst) in 2015. There have been some changes to the code since then, including replacing some more volatile variables with std::atomic, and adding a cmp_func parameter. I admit that I do not understand the purpose of the pins. I wonder if in this function there is a narrow window for race conditions:
res would only hold if r==0 in l_find():
Without understanding the logic, I wonder whether the lf_pin(pins, 2, …) should be executed in l_find() before comparing anything. Possibly, adding a sleep right after the l_find() call in l_search() would allow the anomaly to be reproduced more easily. |
| Comment by Vladislav Lesin [ 2023-05-03 ] |
|
> I admit that I do not understand the purpose of the pins.
As I understand it, the purpose is to prevent memory deallocation while the memory is in use, i.e. a node in the lf_hash linked list can be deleted, but its content, including the pointer to the next element, stays valid until the node is unpinned. marko, when the comparison happens, cursor->curr is already pinned to pin 1 by the lf_pin(pins, 1, cursor->curr) call, either at the beginning of the "retry" label or at the end of the "for (;;)" loop in l_find(). LF_SLIST::key points to a memory object allocated with the pinbox allocator, see the rw_trx_hash_t::init()->lf_hash_init()->lf_alloc_init()->lf_pinbox_init() call stack and the LF_PINBOX::free_func_arg variable: lf_alloc_new() in lf_hash_insert() allocates the memory for LF_SLIST+rw_trx_hash_element_t in one chunk (see also the lf_alloc_new() call from rw_trx_hash_t::insert()->lf_hash_insert()), and LF_SLIST::key points to the rw_trx_hash_element_t::id that follows that LF_SLIST object. In other words, the memory which LF_SLIST::key points to is pinned to pin 1, and it is safe to compare it with the local trx_id (at least if the pinbox allocator works as intended). |
| Comment by Vladislav Lesin [ 2023-05-03 ] |
|
One more addition: l_search() can return a deleted element (the element can be "deleted" after the !DELETED(link) condition was checked in l_find(), but, as the current and the next element of the list are pinned, their content stays valid). rw_trx_hash_t::find() checks element->trx, which is zeroed out by rw_trx_hash_t::erase() before the element is deleted from the hash. If the element was deleted with rw_trx_hash_t::erase(), element->trx would be zeroed out, and the "if ((trx= element->trx))" branch with the assertion would not be executed. This answers Marko's question "Is the element returned by lf_hash_search() supposed to be validated by the caller again?": it is validated by examining element->trx in rw_trx_hash_t::find().
> The transaction ID is 439951, just like the element->id. It looks like the transaction 439940 had been committed and the memory reused for a new transaction 439951 while lf_hash_search() was in progress.
If the transaction had been committed, rw_trx_hash_t::erase() would have been invoked for it, and element->trx would have been zeroed out. One more note: element->trx is protected by element->mutex, which is acquired from both rw_trx_hash_t::erase() and rw_trx_hash_t::find(). |
| Comment by Vladislav Lesin [ 2023-05-03 ] |
|
Take a look at rw_trx_hash_t::find():
It acquires element->mutex, then unpins the transaction pins. After that, the element can be deallocated and reused by some other thread. If we look at the rw_trx_hash_t::insert()->lf_hash_insert()->lf_alloc_new() calls, we will not find any element->mutex acquisition, as the mutex is not yet initialized before the element is allocated. My assumption is that rw_trx_hash_t::insert() can easily reuse the chunk unpinned in rw_trx_hash_t::find(). The scenario is the following:
1. Thread 1 has just executed lf_hash_search() in rw_trx_hash_t::find(), but has not acquired element->mutex yet.
2. Thread 2 removes the element from the hash table with a rw_trx_hash_t::erase() call.
3. Thread 1 acquires element->mutex and unpins pin 2 with the lf_hash_search_unpin(pins) call.
4. Some thread purges the memory of the element.
5. Thread 3 reuses the memory for the element, filling in element->id and element->trx.
6. Thread 1 crashes with the failed "DBUG_ASSERT(trx_id == trx->id)" assertion.
If so, then this bug is unrelated to the allocator issue described in this ticket. The fix is to invoke "lf_hash_search_unpin(pins);" after the "mutex_exit(&element->mutex);" call in rw_trx_hash_t::find(). |
| Comment by Vladislav Lesin [ 2023-05-03 ] |
|
The above scenario is indirectly confirmed by the following trick: if we put one my_sleep(1) before the mutex_enter(&element->mutex) call in rw_trx_hash_t::find(), and another my_sleep(1) after the lf_hash_search_unpin(pins) call in rw_trx_hash_t::find(), then the assertion failure is reproduced much faster with the initial test case that caused it. |
| Comment by Vladislav Lesin [ 2023-05-04 ] |
|
I filed a separate bug for it: |
| Comment by Nikita Malyavin [ 2023-06-26 ] |
|
Hi all, I'll note that during the earlier discussions on the latter, my finding was the following:
I assume that on x86 this has a lower chance of causing a problem, but it can be critical on ARM. |
| Comment by Daniel Black [ 2023-10-09 ] |
|
Still occurring regularly in buildbot. |
| Comment by Xiaotong Niu [ 2023-10-18 ] |
|
Hi Daniel, could you please provide information about the ARM platform where the error occurred? This may help us reproduce the error. Thank you. |
| Comment by Daniel Black [ 2023-10-18 ] |
|
Thanks xiaoniu, is this sufficient?
|
| Comment by Xiaotong Niu [ 2023-10-19 ] |
|
Thanks Daniel, we have reproduced it and are investigating the issue. |
| Comment by Xiaotong Niu [ 2023-10-19 ] |
|
During debugging we simulated time delays in the lf_alloc_new function, and a new error was detected. This error can be detected very quickly on ARM, while there is no such problem on x86. Posted here for discussion. Code to simulate the delays:
Then an error occurred; the detailed gdb information is in report_gdb.txt.
Other information:
In addition, lines #L85 and #L89 in lf-t.c set the next pointer of the node returned by alloc_new() to 0xffff00000000; related code link: |
| Comment by Xiaotong Niu [ 2023-10-27 ] |
|
A PR was submitted for this issue, link: |
| Comment by Marko Mäkelä [ 2023-10-27 ] |
|
xiaoniu, thank you, great work. I think that we should also consider refactoring that code further, converting it to C++11 std::atomic and replacing some excessive use of std::memory_order_seq_cst where possible. |