Note: run the test case with `--mem --repeat=N`. So far N=10 has always been enough for me, but this may vary between machines. `--mem` is important, at least on my machines; apparently the test is not concurrent enough when it is run on disk.
Could not reproduce on 10.2.
This part contains data from the early stages, before we had any test case at all, collected either initially or at Marko's request, along with Marko's notes on it. Feel free to remove it if it is no longer needed.
Not easily reproducible.
Some data collected upon Marko's request:
ok, delete-marked record, no PK. `p/x *rec@6+6+7+4+4+4+4`
I hope I got the byte count right (curiously, all fields appear to be fixed-length)
strange: `rec` is `0x7f`. it should be the MSB of `DB_TRX_ID` and rather small
there is nothing close to 0x4146 (big-endian format) in the record. it looks totally corrupted to me
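A 6-byte transaction ID stored most-significant byte first means 0x4146 shows up as `00 00 00 00 41 46`. The sketch below scans a copied record for any 6-byte big-endian window whose value is within a tolerance of a given ID. The helper names and the tolerance parameter are hypothetical, for illustration only; it assumes the bytes were already copied out of the debugger (for example with `dump binary memory`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode 6 bytes as an unsigned big-endian (MSB first) 48-bit value. */
static uint64_t decode_be48(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 6; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Hypothetical helper: return the offset of the first 6-byte window whose
   big-endian value is within tol of id, or -1 if there is none. */
static long find_near_be48(const unsigned char *buf, size_t len,
                           uint64_t id, uint64_t tol)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i + 6 <= len; i++) {
        uint64_t v = decode_be48(buf + i);
        uint64_t d = v > id ? v - id : id - v; /* absolute difference */
        if (d <= tol)
            return (long)i;
    }
    return -1;
}
```

Every byte offset is tried rather than only field boundaries, since in a corrupted record the field boundaries themselves cannot be trusted.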
what about the other thread? or were there 2 threads asserting?
also, `dump binary memory /tmp/page.bin $page $page+srv_page_size` where `$page` is `(char*)((unsigned long)rec & ~(srv_page_size-1))`, i.e. the start of the page that contains `rec`,
so that I can check the page dump. Also, I'd like to know the value of `rec` in that case.
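For a power-of-two page size, the page start is obtained by clearing the low bits of the address, while masking with `page_size - 1` keeps only the offset within the page. A minimal sketch of both, with hypothetical function names (InnoDB has its own helpers for this, `page_align()` and `page_offset()`, which this does not reproduce exactly):

```c
#include <assert.h>
#include <stdint.h>

/* Address of the start of the page containing ptr (low bits cleared). */
static uintptr_t page_start(uintptr_t ptr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return ptr & ~(page_size - 1);
}

/* Byte offset of ptr within its page (only the low bits kept). */
static uintptr_t page_off(uintptr_t ptr, uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return ptr & (page_size - 1);
}
```

With a 16 KiB page, `0x12345678` lies in the page starting at `0x12344000`, at offset `0x1678`.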
adaptive hash index corruption? just guessing.
it is total garbage: no "infimum" or "supremum" records, and large sections are filled with 0x8f (TRASH). Looks like a stray write into the InnoDB buffer pool from somewhere.
use ASAN with `ASAN_OPTIONS=abort_on_error=1,disable_coredump=0`
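The two checks used on the dump above (are the "infimum"/"supremum" strings present, and how much of the page is the 0x8f fill) can be sketched as follows. The function names are hypothetical; the buffer is assumed to hold one page read back from the `dump binary memory` output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the longest run of consecutive bytes equal to fill. */
static size_t longest_run(const unsigned char *buf, size_t len,
                          unsigned char fill)
{
    size_t best = 0, cur = 0;
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        cur = (buf[i] == fill) ? cur + 1 : 0;
        if (cur > best)
            best = cur;
    }
    return best;
}

/* Nonzero if the n-byte string needle occurs anywhere in buf. */
static int contains(const unsigned char *buf, size_t len,
                    const char *needle, size_t n)
{
    for (size_t i = 0; i + n <= len; i++)
        if (memcmp(buf + i, needle, n) == 0)
            return 1;
    return 0;
}
```

A page with no "infimum"/"supremum" and long fill runs is a strong hint that the bytes are not an index page at all.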
no, it could be a legitimate page; there is something at the end of it
otoh, the page contents look like some SYS_ table page, and this thread is trying to read a normal user table
there even is some InnoDB SQL code within the page dump
the dump just looks like random unrelated garbage put together
oh yes, I forgot to check that it matches the `rec`
yes, it does match it
The more I think about this, the more likely it seems that the memory was never part of the buffer pool at all, but heap memory. Too bad I did not ask you to dump `buf_pool->chunks`.
Yes, it must trivially be so: `rec_copy_prefix_to_dtuple()` is being executed on a buffer that `btr_pcur_store_position()` must have allocated. This could be related to the MDEV-14837 fix from yesterday. Or there could have been some corruption on the index page, causing `btr_pcur_store_position()` to copy garbage and somehow fail to notice the corruption while copying. Or there was some rogue write that corrupted the copied record prefix in the `malloc()` heap.
One more contributing factor could be adaptive hash index corruption, which is very unlikely, since the query did not use any index lookup.
Luckily, the transaction ID was small (0x4146); I think it starts from around 0x500 and is incremented by 2 for each transaction.
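If that recollection of the numbering is right (a start near 0x500 and a step of 2), the ID corresponds to only a few thousand transactions, which is why its bytes should have been easy to spot in the record:

$$\frac{\mathtt{0x4146} - \mathtt{0x500}}{2} = \frac{16710 - 1280}{2} = 7715$$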
sorry, I meant `buf_pool_ptr.chunks`. There can be multiple buffer pool instances; see `srv_buf_pool_instances`. So maybe `p *buf_pool_ptr@srv_buf_pool_instances` would do the trick.
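Deciding whether `rec` was buffer-pool memory or heap memory amounts to checking whether the pointer falls inside any chunk of any buffer pool instance. A sketch of that check, using hypothetical simplified structs (the real `buf_pool_t` and `buf_chunk_t` have many more fields and may name or size these differently):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for a buffer pool chunk and instance. */
struct chunk { const unsigned char *mem; size_t size; };
struct pool { const struct chunk *chunks; size_t n_chunks; };

/* Nonzero if p lies inside [mem, mem + size) of any chunk of any instance. */
static int in_buffer_pool(const struct pool *pools, size_t n_pools,
                          const void *p)
{
    uintptr_t a = (uintptr_t)p;
    assert(pools != NULL || n_pools == 0);
    for (size_t i = 0; i < n_pools; i++)
        for (size_t j = 0; j < pools[i].n_chunks; j++) {
            uintptr_t lo = (uintptr_t)pools[i].chunks[j].mem;
            if (a >= lo && a - lo < pools[i].chunks[j].size)
                return 1;
        }
    return 0;
}
```

In gdb the same comparison can be done by hand against the chunk addresses and sizes printed for each instance.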
That said, in this particular case it is now obvious that the pointer was in a `malloc()` heap, not inside the InnoDB buffer pool. The only question is how the record got corrupted in the heap. I think the most plausible explanation is that something called `malloc()` and then wrote outside the bounds of the allocated area, corrupting the record prefix in `pcur`. (The next most plausible one is that there really was page corruption, and we somehow magically managed to copy a corrupted record prefix to the `pcur` buffer.)
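The out-of-bounds hypothesis is what redzones are meant to catch. Below is a minimal guard-byte sketch of the idea; the allocator, guard size and pattern are all hypothetical. Unlike ASAN, which flags the bad write as it happens, this only detects the damage afterwards when the guard is checked.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define REDZONE 16      /* hypothetical guard size in bytes */
#define GUARD_BYTE 0xAB /* hypothetical guard pattern */

/* Allocate n usable bytes followed by REDZONE bytes of GUARD_BYTE. */
static unsigned char *guarded_alloc(size_t n)
{
    unsigned char *p = malloc(n + REDZONE);
    if (p != NULL)
        memset(p + n, GUARD_BYTE, REDZONE);
    return p;
}

/* Nonzero if any guard byte after the n usable bytes was overwritten. */
static int guard_damaged(const unsigned char *p, size_t n)
{
    assert(p != NULL);
    for (size_t i = 0; i < REDZONE; i++)
        if (p[n + i] != GUARD_BYTE)
            return 1;
    return 0;
}
```

The simulated overrun in the check below stays inside the block that `malloc()` actually returned, so it is well-defined.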