MariaDB Server / MDEV-32174

ROW_FORMAT=COMPRESSED table corruption due to ROLLBACK

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 10.11.3, 10.6.12
    • Fix Version/s: None

    Description

      After upgrading to Ubuntu 22.04, and alongside that to MariaDB 10.6, we started experiencing table corruption during our nightly restores. Because these restores are to staging servers, loss of data was acceptable, so we fixed the problem by simply removing the corrupted table. But this obviously won't fly should we ever have to do a production restore (which is very rare).
      The database portion of the restores is done using mydumper with the default of 4 threads.
      What makes the problem particularly nasty is that it cannot be consistently reproduced; we noticed it seemed to happen in roughly 1 out of every 7 restores (where each restore takes about 40-50 minutes).
      This made us think it was related to parallelism, so we tried running mydumper single-threaded, which did not solve the problem.
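      For context, the dump and restore are driven by commands roughly like the following (a sketch, not our exact invocation; the database name and paths are illustrative, and connection options are omitted):

        # Dump (mydumper defaults to 4 threads).
        mydumper --database mydb --outputdir /backups/mydb --threads 4

        # Restore onto the staging server; --threads 1 was also tried without success.
        myloader --directory /backups/mydb --overwrite-tables --threads 4
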
      We have also tried upgrading to various versions of mariadb/mydumper, most notably:

      MariaDB   mydumper
      10.6.12   0.15
      10.11.3   0.10
      10.11.3   0.15

      But with all of the above version combinations the problem still occurred.
      Eventually we found that we could no longer reproduce the problem while running MariaDB 10.5.21 with mydumper 0.10, but we are still unsure of the underlying cause.

      Provided files:
      The included table structure is just one table of our database, as we were able to reproduce the problem by restoring backups of only this table.
      Because it is quite time-consuming to reproduce and we would have to generate a large amount of dummy data (existing dumps all contain customer data), we have not included a database dump for now.
      But we still wanted to create this bug report just in case you might already see something strange based on the included table structure and error log.
      In case there is no immediately apparent problem and you still want to look further into this, we would of course be happy to provide a database dump.

      Update:
      After running on 10.6.15 I was again able to reproduce it. I generated a stack trace from the core dump (mariadbd_full_bt_all_threads.txt).

      Attachments

        1. configuration.txt
          78 kB
        2. restore-failure-error-log.txt
          20 kB
        3. table_structure.txt
          1 kB
        4. syslog-restore-10.6.15.txt
          24 kB
        5. mariadbd_full_bt_all_threads.txt
          149 kB
        6. MDEV-32174_ps.txt
          335 kB


          Activity

            bjhilbrands Bart-Jan Hilbrands created issue -
            bjhilbrands Bart-Jan Hilbrands made changes - Description edited (corrected the MariaDB versions in the table from 11.1 to 10.11.3)
            danblack Daniel Black added a comment - edited

            Hi bjhilbrands,

            To give us a little more detail:

            sudo apt-get install mariadb-server-core-dbg
            

            Then, using abrt-retrace or by starting a gdb session on the core dump, run thread apply all bt -frame-arguments all full.

            If you are able to test with 10.6.15, that would be much appreciated (repo config).

            Or even 10.11, though an earlier version will tell us more reliably where to fix it.
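
            A non-interactive way to capture that backtrace might look like this (a sketch; the binary and core paths are assumptions, adjust them to your setup):

            gdb -batch \
                -ex 'set pagination off' \
                -ex 'thread apply all bt -frame-arguments all full' \
                /usr/sbin/mariadbd /var/crash/core.mariadbd > mariadbd_full_bt_all_threads.txt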


            marko Marko Mäkelä added a comment -

            If the innodb_buffer_pool_size is much smaller than the database, this report could be a duplicate of MDEV-31767.
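
            A rough way to compare the two (connection options are assumptions):

            # Configured buffer pool size vs. total InnoDB data+index size, both in GiB.
            mariadb -u root -e "
              SELECT @@innodb_buffer_pool_size/1024/1024/1024 AS buffer_pool_gib;
              SELECT SUM(data_length+index_length)/1024/1024/1024 AS innodb_data_gib
                FROM information_schema.tables WHERE engine='InnoDB';"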

            bjhilbrands Bart-Jan Hilbrands added a comment -

            Hi, thank you for the quick feedback. I am testing it on 10.6.15 right now; I'll get back to you with my findings and a full stack trace if the problem indeed still occurs.
            bjhilbrands Bart-Jan Hilbrands made changes - Description edited (added the update about reproducing the problem on 10.6.15 and generating a stack trace from the core dump)
            bjhilbrands Bart-Jan Hilbrands made changes - Description edited (completed the reference to mariadbd_full_bt_all_threads.txt)
            bjhilbrands Bart-Jan Hilbrands made changes - Attachment mariadbd_full_bt_all_threads.txt [ 72008 ]
            bjhilbrands Bart-Jan Hilbrands made changes - Attachment syslog-restore-10.6.15.txt [ 72011 ]

            bjhilbrands Bart-Jan Hilbrands added a comment -

            I was able to reproduce it again with 10.6.15. One thing I noticed that was different is that MariaDB is still running properly (before, it was stuck in a restart loop). Now it is running fine with status "To roll back: 1 transactions, 5668 rows". The stack trace I generated from the core dump is attached as mariadbd_full_bt_all_threads.txt; I also added output from the syslog in case that is useful: syslog-restore-10.6.15.txt.
            danblack Daniel Black added a comment -

            I think the backtrace is still being processed with the 10.6.12 symbols rather than the 10.6.15 ones. Can you install mariadb-server-core-10.6-dbgsym, which should come from the MariaDB repo? After that, re-resolve the previous core dump. Thanks for the confirmation and quick response.
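
            Roughly (a sketch; the package name is as above, the gdb invocation repeats the earlier one, and the paths are assumptions):

            sudo apt-get install mariadb-server-core-10.6-dbgsym
            # Re-resolve the same core dump with the matching 10.6.15 symbols.
            gdb -batch -ex 'thread apply all bt -frame-arguments all full' \
                /usr/sbin/mariadbd /var/crash/core.mariadbd > mariadbd_full_bt_all_threads.txt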

            bjhilbrands Bart-Jan Hilbrands made changes - Attachment mariadbd_full_bt_all_threads.txt [ 72008 ]
            bjhilbrands Bart-Jan Hilbrands made changes - Attachment mariadbd_full_bt_all_threads.txt [ 72012 ]

            bjhilbrands Bart-Jan Hilbrands added a comment -

            I see, I have updated the attachment with the fixed version.
            bjhilbrands Bart-Jan Hilbrands made changes - Attachment mariadbd_full_bt_all_threads.txt [ 72012 ]
            bjhilbrands Bart-Jan Hilbrands made changes - Attachment mariadbd_full_bt_all_threads.txt [ 72013 ]
            alice Alice Sherepa made changes -
            danblack Daniel Black made changes - Fix Version/s 10.6 [ 24028 ]
            danblack Daniel Black made changes - Component/s Storage Engine - InnoDB [ 10129 ]
            danblack Daniel Black made changes - Assignee Marko Mäkelä [ marko ]
            danblack Daniel Black added a comment -

            Thanks bjhilbrands.

            page_zip_dir_find returning a null pointer

            Aborting thread

            #8  0x00007f0eb02b7476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
                    ret = <optimized out>
            #9  0x00007f0eb029d7f3 in __GI_abort () at ./stdlib/abort.c:79
                    save_stage = 1
                    act = {__sigaction_handler = {sa_handler = 0x7f0e983b9d00, sa_sigaction = 0x7f0e983b9d00}, sa_mask = {__val = {398, 139701063841440, 3432, 94106376343816, 139701062207568, 398, 94106364354817, 9, 139701063841440, 398, 398, 1, 94106376343816, 139701063824896, 139701062160471, 1103806595077}}, sa_flags = -1337395328, sa_restorer = 0x7f0eb048b600 <_IO_file_jumps>}
                    sigs = {__val = {32, 139700660311452, 139701063841056, 15840675240469249536, 9, 15840675240469249536, 139701063841888, 139700660311648, 94106376288112, 139701062769284, 4297, 398, 94106376343816, 398, 139700660311424, 139701062205293}}
            #10 0x00005596d86b1d3e in ut_dbg_assertion_failed (expr=expr@entry=0x5596d9213770 "slot_rec", file=file@entry=0x5596d9213230 "./storage/innobase/page/page0zip.cc", line=line@entry=4297) at ./storage/innobase/ut/ut0dbg.cc:60
            No locals.
            #11 0x00005596d8ddb888 in page_zip_dir_delete (block=block@entry=0x7f0e8c0f46a0, rec=<optimized out>, index=index@entry=0x7f0e5c0535f0, offsets=offsets@entry=0x7f0e983b97d0, free=0x0, mtr=mtr@entry=0x7f0e983b9d00) at ./storage/innobase/page/page0zip.cc:4297
                    page_zip = 0x7f0e8c0f46d8
                    page_free = <optimized out>
                    garbage = <optimized out>
                    slot_rec = 0x0
                    n_recs = <optimized out>
                    page_n_recs = <optimized out>
                    slot_free = <optimized out>
                    slot_len = <optimized out>
            #12 0x00005596d8dcb06c in page_cur_delete_rec (cursor=<optimized out>, offsets=0x7f0e983b97d0, mtr=0x7f0e983b9d00) at ./storage/innobase/page/page0cur.cc:2362
                    cur_dir_slot = <optimized out>
                    current_rec = 0x7f0e92306b62 ""
                    prev_rec = <optimized out>
                    next_rec = <optimized out>
                    cur_n_owned = 8
                    rec = <optimized out>
                    index = 0x7f0e5c0535f0
                    block = 0x7f0e8c0f46a0
                    cur_slot_no = 152
                    slot_rec = <optimized out>
                    data_size = <optimized out>
                    extra_size = <optimized out>
            #13 0x00005596d8e8807e in btr_cur_optimistic_delete (cursor=cursor@entry=0x7f0e983b9aa0, flags=flags@entry=0, mtr=mtr@entry=0x7f0e983b9d00) at ./storage/innobase/btr/btr0cur.cc:4619
                    page = <optimized out>
                    page_zip = <optimized out>
                    block = 0x7f0e8c0f46a0
                    rec = 0x7f0e92306b62 ""
                    heap = 0x0
                    offsets_ = {300, 3, 32774, 4, 8, 12, 0, 0, 11824, 23557, 32526, 0, 39664, 38971, 0, 0, 1, 0, 0, 0, 64144, 27638, 32526, 0, 11456, 35854, 32526, 0, 11824, 23557, 32526, 0, 1, 0, 0, 0, 64144, 27638, 32526, 0, 38992, 38971, 32526, 0, 64032, 27638, 32526, 0, 11824, 23557, 32526, 0, 64032, 27638, 32526, 0, 11824, 23557, 32526, 0, 39664, 38971, 32526, 0, 40608, 38971, 32526, 0, 42204, 55522, 21910, 0, 11456, 35854, 32526, 0, 4088, 0, 0, 0, 1, 0, 0, 0, 46830, 37311, 32526, 0, 39136, 38971, 32526, 0, 0, 0, 0, 0, 300, 15, 32780, 4, 10, 17, 21, 25, 176, 176, 185, 227, 231, 240, 244, 248, 252, 498, 0, 0, 40192, 38971, 32526, 0, 4089, 0, 0, 0, 11456, 35854, 32526, 0, 47973, 37311, 32526, 0, 40192, 38971, 32526, 0, 39168, 38971, 32526, 0, 36003, 55517, 21910, 0, 4, 0, 0, 0, 47973, 37311, 32526, 0, 39312, 38971, 32526, 0, 46156, 55516, 21910, 0, 41, 0, 0, 0, 47973, 37311, 32526, 0, 47340, 37311, 32526, 0, 49140, 37311, 32526, 0, 47340, 37311, 32526, 0, 2, 0, 0, 0, 16384, 0, 0, 0, 39392, 38971, 32526, 0, 33970, 37311, 32526, 0, 49140, 37311, 32526, 0...}
                    offsets = <optimized out>
                    err = DB_SUCCESS
            #14 0x00005596d8f1eef5 in row_undo_ins_remove_sec_low (mode=<optimized out>, index=<optimized out>, entry=<optimized out>, thr=<optimized out>) at ./storage/innobase/row/row0uins.cc:318
                    btr_cur = 0x7f0e983b9aa0
                    pcur = {btr_cur = {page_cur = {index = 0x7f0e5c0535f0, rec = 0x7f0e92304070 "", offsets = 0x0, block = 0x7f0e8c0f46a0}, purge_node = 0x0, thr = 0x0, flag = BTR_CUR_BINARY, tree_height = 3, up_match = 1, up_bytes = 0, low_match = 3, low_bytes = 0, n_fields = 0, n_bytes = 0, fold = 0, path_arr = 0x0, rtr_info = 0x0}, latch_mode = BTR_MODIFY_LEAF, old_rec = 0x0, old_n_core_fields = 0, old_n_fields = 0, rel_pos = 0, block_when_stored = {m_block = 0x0, m_page_id = {m_id = 0}}, modify_clock = 0, pos_state = BTR_PCUR_IS_POSITIONED, search_mode = PAGE_CUR_LE, trx_if_known = 0x0, old_rec_buf = 0x0, buf_size = 0}
                    err = DB_SUCCESS
                    mtr = {m_last = 0x7f0e8c0f46a0, m_last_offset = 48, m_log_mode = 0, m_modifications = 1, m_made_dirty = 0, m_inside_ibuf = 0, m_trim_pages = 0, m_memo = {<small_vector_base> = {BeginX = 0x7f0e983b9d20, Size = 2, Capacity = 16}, small = {{object = 0x7f0e5c053768, type = MTR_MEMO_S_LOCK}, {object = 0x7f0e8c0f46a0, type = MTR_MEMO_PAGE_X_MODIFY}, {object = 0x7f0e8c0f46a0, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f0e8c097900, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f0e8c0468c0, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f0e8c002080, type = MTR_MEMO_PAGE_X_MODIFY}, {object = 0x7f0e8c050c80, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f0e8c089080, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f0e5c04e4f8, type = MTR_MEMO_SPACE_X_LOCK}, {object = 0x7f0e8c08c780, type = MTR_MEMO_PAGE_SX_FIX}, {object = 0x7f0e8c050c80, type = 2956040257}, {object = 0x7f0e68000090, type = 0}, {object = 0x7, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x428c0e2cc0, type = 1152}, {object = 0x7f0e8c139800, type = 605081585}, {object = 0x57, type = 1200}}}, m_log = {m_heap = 0x0, m_list = {<ilist<mtr_buf_t::block_t, void>> = {sentinel_ = {next = 0x7f0e983b9e48, prev = 0x7f0e983b9e48}}, size_ = 1}, m_size = 19, m_first_block = {<ilist_node<void>> = {next = 0x7f0e983b9e28, prev = 0x7f0e983b9e28}, m_data = "<omitted customer data>"..., m_used = 19}}, m_user_space = 0x7f0e5c04e4f8, m_commit_lsn = 0, m_freed_space = 0x0, m_freed_pages = 0x0}
                    modify_leaf = true
            #15 0x00005596d8f1f4f4 in row_undo_ins_remove_sec (thr=0x7f0e68095bf8, entry=0x7f0e6bfc6430, index=0x7f0e5c0535f0) at ./storage/innobase/row/row0uins.cc:353
                    err = <optimized out>
                    n_tries = 0
                    retry = <optimized out>
                    err = <optimized out>
                    n_tries = <optimized out>
            #16 row_undo_ins_remove_sec_rec (thr=0x7f0e68095bf8, node=0x7f0e6bf6fa20) at ./storage/innobase/row/row0uins.cc:549
                    entry = 0x7f0e6bfc6430
                    err = DB_SUCCESS
                    index = 0x7f0e5c0535f0
                    heap = 0x7f0e6bfc63c8
                    func_exit = <optimized out>
                    err = <optimized out>
                    index = <optimized out>
                    heap = <optimized out>
                    entry = <optimized out>
            #17 row_undo_ins (node=node@entry=0x7f0e6bf6fa20, thr=thr@entry=0x7f0e68095bf8) at ./storage/innobase/row/row0uins.cc:601
                    err = <optimized out>
                    dict_locked = false
            #18 0x00005596d8e2ab7b in row_undo (thr=0x7f0e68095bf8, node=0x7f0e6bf6fa20) at ./storage/innobase/row/row0undo.cc:401
                    err = <optimized out>
                    err = <optimized out>
            #19 row_undo_step (thr=thr@entry=0x7f0e68095bf8) at ./storage/innobase/row/row0undo.cc:452
                    err = <optimized out>
                    node = 0x7f0e6bf6fa20
                    trx = 0x7f0ea62aa680
            #20 0x00005596d8de36f2 in que_thr_step (thr=0x7f0e68095bf8) at ./storage/innobase/que/que0que.cc:586
                    node = 0x7f0e6bf6fa20
                    old_thr = 0x7f0e68095bf8
                    trx = <optimized out>
                    type = <optimized out>
                    node = <optimized out>
                    old_thr = <optimized out>
                    trx = <optimized out>
                    type = <optimized out>
            #21 que_run_threads_low (thr=<optimized out>) at ./storage/innobase/que/que0que.cc:644
                    next_thr = <optimized out>
                    trx = <optimized out>
            #22 que_run_threads (thr=0x7f0e68095bf8) at ./storage/innobase/que/que0que.cc:664
                    trx = 0x7f0ea62aa680
                    loop = <optimized out>
            #23 0x00005596d8e4b414 in trx_t::rollback_low (this=this@entry=0x7f0ea62aa680, savept=savept@entry=0x7f0ea62aafc0) at ./storage/innobase/trx/trx0roll.cc:125
                    thr = 0x7f0e680e8ad8
                    heap = 0x7f0e680e8978
                    roll_node = 0x7f0e680e89e0
            #24 0x00005596d8e4a300 in trx_t::rollback (savept=0x7f0ea62aafc0, this=0x7f0ea62aa680) at ./storage/innobase/trx/trx0roll.cc:176
            No locals.
            #25 trx_t::rollback (savept=0x7f0ea62aafc0, this=0x7f0ea62aa680) at ./storage/innobase/trx/trx0roll.cc:163
            No locals.
            #26 trx_rollback_last_sql_stat_for_mysql (trx=trx@entry=0x7f0ea62aa680) at ./storage/innobase/trx/trx0roll.cc:307
                    err = <optimized out>
            #27 0x00005596d8d4d2e0 in innobase_rollback (hton=<optimized out>, thd=0x7f0e68000c68, rollback_trx=<optimized out>) at ./storage/innobase/handler/ha_innodb.cc:4705
                    trx = 0x7f0ea62aa680
                    error = <optimized out>
            #28 0x00005596d8a4eed0 in ha_rollback_trans (thd=thd@entry=0x7f0e68000c68, all=all@entry=false) at ./sql/handler.cc:2224
                    err = <optimized out>
                    ht = <optimized out>
                    error = 0
                    trans = <optimized out>
                    ha_info = 0x7f0e68003628
                    ha_info_next = <optimized out>
                    is_real_trans = <optimized out>
            #29 0x00005596d892f9d9 in trans_rollback_stmt (thd=thd@entry=0x7f0e68000c68) at ./sql/transaction.cc:535
            No locals.
            #30 0x00005596d8802e50 in mysql_execute_command (thd=0x7f0e68000c68, is_called_from_prepared_stmt=<optimized out>) at ./sql/sql_parse.cc:6077
                    multi_delete_error = <optimized out>
                    wsrep_error_label = <optimized out>
                    error = <optimized out>
                    res = 1
                    up_result = <optimized out>
                    lex = 0x7f0e68004d88
                    select_lex = <optimized out>
                    first_table = <optimized out>
                    all_tables = 0x7f0e68013130
                    unit = 0x7f0e68004e50
                    have_table_map_for_update = <optimized out>
                    rpl_filter = <optimized out>
                    ots = {ctx = 0x7f0e68004a58, traceable = false}
                    orig_binlog_format = <optimized out>
                    orig_current_stmt_binlog_format = <optimized out>
            #31 0x00005596d88085f7 in mysql_parse (thd=0x7f0e68000c68, rawbuf=<optimized out>, length=<optimized out>, parser_state=<optimized out>) at ./sql/sql_parse.cc:8053
                    found_semicolon = <optimized out>
                    error = <optimized out>
                    lex = 0x7f0e68004d88
                    err = <optimized out>
            #32 0x00005596d880addd in dispatch_command (command=COM_QUERY, thd=0x7f0e68000c68, packet=<optimized out>, packet_length=<optimized out>, blocking=<optimized out>) at ./sql/sql_parse.cc:1993
                    beginning_of_next_stmt = <optimized out>
                    length = <optimized out>
                    packet_end = 0x7f0e3b861df4 ""
                    parser_state = {m_lip = {lookahead_token = -1, lookahead_yylval = 0x0, m_thd = 0x7f0e68000c68, m_ptr = 0x7f0e3b861df5 "\b", m_tok_start = 0x7f0e3b861df5 "\b", m_tok_end = 0x7f0e3b861df5 "\b", m_end_of_query = 0x7f0e3b861df4 "", m_tok_start_prev = 0x7f0e3b861df4 "", m_buf = 0x7f0e3afb4e00 "INSERT INTO `steps` VALUES\n(11626625,40592,907,\"<omitted customer data>"..., m_buf_length = 9097204, m_echo = true, m_echo_saved = true, m_cpp_buf = 0x7f0e38000070 "INSERT INTO `steps` VALUES\n(11626625,40592,907,\"<omitted customer data>"..., m_cpp_ptr = 0x7f0e388ad064 "", m_cpp_tok_start = 0x7f0e388ad064 "", m_cpp_tok_start_prev = 0x7f0e388ad064 "", m_cpp_tok_end = 0x7f0e388ad064 "", m_body_utf8 = 0x0, m_body_utf8_ptr = 0x0, m_cpp_utf8_processed_ptr = 0x0, next_state = MY_LEX_END, found_semicolon = 0x0, ignore_space = false, stmt_prepare_mode = false, multi_statements = true, yylineno = 26245, m_digest = 0x0, in_comment = NO_COMMENT, in_comment_saved = NO_COMMENT, m_cpp_text_start = 0x7f0e388acf97 "<omitted customer data>"..., m_cpp_text_end = 0x7f0e388ad062 "\")", m_underscore_cs = 0x0}, m_yacc = {yacc_yyss = 0x0, yacc_yyvs = 0x0, m_set_signal_info = {m_item = {0x0 <repeats 12 times>}}, m_lock_type = TL_READ_DEFAULT, m_mdl_type = MDL_SHARED_READ}, m_digest_psi = 0x0}
                    net = 0x7f0e68000f20
                    error = false
                    do_end_of_statement = true
                    drop_more_results = false
                    __FUNCTION__ = "dispatch_command"
                    res = <optimized out>
            #33 0x00005596d880cbe8 in do_command (thd=0x7f0e68000c68, blocking=blocking@entry=true) at ./sql/sql_parse.cc:1409
                    return_value = <optimized out>
                    packet = 0x7f0e3919adb8 "\003INSERT INTO `steps` VALUES\n(11626625,40592,907,\"<omitted customer data>"...
                    packet_length = <optimized out>
                    net = 0x7f0e68000f20
                    command = COM_QUERY
                    __FUNCTION__ = "do_command"
            #34 0x00005596d891efef in do_handle_one_connection (connect=<optimized out>, put_in_cache=true) at ./sql/sql_connect.cc:1416
                    create_user = true
                    thr_create_utime = <optimized out>
                    thd = 0x7f0e68000c68
            #35 0x00005596d891f33d in handle_one_connection (arg=arg@entry=0x5596db163948) at ./sql/sql_connect.cc:1318
                    connect = 0x5596db163948
            #36 0x00005596d8ca4ce6 in pfs_spawn_thread (arg=0x5596db1180f8) at ./storage/perfschema/pfs.cc:2201
                    typed_arg = 0x5596db1180f8
                    user_arg = 0x5596db163948
                    user_start_routine = 0x5596d891f2e0 <handle_one_connection(void*)>
                    pfs = <optimized out>
                    klass = <optimized out>
            #37 0x00007f0eb0309b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
                    ret = <optimized out>
                    pd = <optimized out>
                    out = <optimized out>
                    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140720411833360, -1557864560374536914, 139700660323904, 1, 139701062244432, 140720411833712, 1476171432797185326, 1476118612295460142}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
                    not_first_call = <optimized out>
            #38 0x00007f0eb039ba00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
            No locals.
            

            marko Marko Mäkelä made changes -

            marko Marko Mäkelä added a comment -

            The posted stack trace is specific to ROW_FORMAT=COMPRESSED. There is an open bug MDEV-31441 with a test case that reproduces corruption on 10.4; it may be related to this failure.

            Starting with 10.6, MDEV-515 allows a bulk load mechanism that will use table-level locking and undo logging when inserting into an empty table or partition. This mechanism is enabled by SET unique_checks=0,foreign_key_checks=0.

            My understanding is that Myloader would by default attempt to use multiple threads for loading a single table. That will obviously not work with the table-level locking. There exists MDEV-16281 for improving the data loading.
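
            A minimal sketch of that mechanism (the bulk_demo table and the test database are made up; the actual SQL issued by myloader may differ):

            # With both options set to 0, the first INSERT into the empty table can take
            # the MDEV-515 bulk-insert path (table-level locking and undo logging).
            mariadb -u root test -e "
              CREATE TABLE bulk_demo (id INT PRIMARY KEY, v VARCHAR(32)) ROW_FORMAT=COMPRESSED;
              SET unique_checks=0, foreign_key_checks=0;
              INSERT INTO bulk_demo VALUES (1,'a'),(2,'b');
              DROP TABLE bulk_demo;"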

            marko Marko Mäkelä made changes -

            bjhilbrands Bart-Jan Hilbrands added a comment -

            As for the relation to MDEV-31441: if they are indeed the same problem, that would mean it was fixed in 10.5 (at least as far as we can tell; we have been running 10.5 for five months now and have not seen the problem reoccur).

            Regarding the table-level locking, we did run myloader with the

            --threads 1

            option before downgrading to 10.5. But while the problem seemed much less likely to occur, it did eventually still happen.


            marko Marko Mäkelä added a comment -

            bjhilbrands, sorry, actually I meant MDEV-31574 and not MDEV-31441.

            The famous quote of E. W. Dijkstra is applicable here: Program testing can be used to show the presence of bugs, but never to show their absence!

            Depending on when transactions are rolled back or the purge of committed history is executed, the physical layout of the index pages or the entire index tree could be very different for the same logical contents. Also in MDEV-31817 we are able to reproduce some corruption on a ROW_FORMAT=COMPRESSED table.


            marko Marko Mäkelä added a comment -

            It turns out that MDEV-31574 had already been fixed in 10.5 by MDEV-12353. But I am rather certain that there are other bugs that cause corruption of ROW_FORMAT=COMPRESSED tables. We’d only need some data for reproducing them.


            bjhilbrands Bart-Jan Hilbrands added a comment -

            Working on generating some data right now! It's going to take a while, mostly because I'll have to make sure it is still reproducible with the generated data (I don't see a reason why not), and reproducing it might take several hours.


            bjhilbrands Bart-Jan Hilbrands added a comment -

            I have some data ready, but the file is too big; https://mariadb.com/kb/en/reporting-bugs/ mentions uploading to a public FTP server, but we are reluctant to use that. Would it be okay to invite you to a private SharePoint folder that contains the archived dump via email?


            marko Marko Mäkelä added a comment -

            bjhilbrands, I sent you an email.


            bjhilbrands Bart-Jan Hilbrands added a comment -

            Shared the drive. Perhaps some more detailed information: it always seems to be a steps table index that gets corrupted, but that might also be a statistical thing, as the vast majority of the data are steps.


            marko Marko Mäkelä added a comment -

            Thank you. I will get to this next week, after fixing MDEV-31441.


            marko Marko Mäkelä added a comment -

            bjhilbrands, can you please reduce the dump so that the error would reproduce much faster? For example:

            1. remove all tables other than steps and any tables to which it might be connected via FOREIGN KEY constraints (unless myloader does set foreign_key_checks=0)
            2. remove any non-unique indexes on the table steps other than the one where it crashes; in GDB, execute frame 11 to get to the stack frame of page_zip_dir_delete() and then print index->name to see the name of the index (a sketch of this gdb step follows this list)
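
            A sketch of that gdb step (the binary and core paths are assumptions, and this relies on gdb selecting the crashed thread by default so that frame 11 matches the trace posted above):

            gdb -batch -ex 'frame 11' -ex 'print index->name' \
                /usr/sbin/mariadbd /var/crash/core.mariadbd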

            I have had a single-threaded load of the dump running against a debug-instrumented server for over 102 CPU minutes (and some 75 minutes of real time) now, and it still has not crashed. The data directory has grown to 4.3 GiB during this time. I think that the main cause of the slowdown could be the cmake -DWITH_INNODB_EXTRA_DEBUG:BOOL=ON option, a.k.a. page_zip_validate(), which I think needs to be enabled to get to the root cause as early as possible.

            I am attempting to load the dump as follows:

            myloader --directory /dev/shm/data-dump2 --socket /dev/shm/mariadb.sock -u root --threads=1
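
            For reference, a debug-instrumented build like this could be produced roughly as follows (a sketch; the source and build directory names and the Debug build type are assumptions, only WITH_INNODB_EXTRA_DEBUG comes from the comment above):

            cmake -S server -B build-extra-debug \
                  -DCMAKE_BUILD_TYPE=Debug \
                  -DWITH_INNODB_EXTRA_DEBUG:BOOL=ON
            cmake --build build-extra-debug -j"$(nproc)"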
            

            marko Marko Mäkelä made changes -
            Status Open [ 1 ] Needs Feedback [ 10501 ]

            Uploaded a new package in the same drive containing only the steps table. Reproducing it with one thread is indeed quite painful. In fact, I had about 10 successful restores in a row using --threads=1 while looking into this, which then resulted in us rolling out a fix to only run myloader single-threaded everywhere. It then ran fine for about a week, before the same problem occurred (but this time on one server only). Just as a sanity check I'll try to actually reproduce it locally using just 1 thread.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            It does make me wonder if there were other factors involved that caused the failure during the single threaded restore.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            If it proves too time-consuming to reproduce it single-threaded, it might also be most productive to wait for MDEV-16281 and see whether it is still reproducible.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            bjhilbrands, the stack trace of your crash is during the rollback of an INSERT statement. I suppose that the reason is a deadlock between concurrent INSERT transactions. These can occur when there are multiple indexes in a table. I am not familiar with mydumper or myloader, but it would make sense to create the indexes only after loading the data. This should not only prevent the deadlocks and rollbacks from occurring, but also speed up the operation, because only one index tree would have to be updated during the data load, which would allow more efficient use of the buffer pool. Can you check whether this makes the crash go away?
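
            A minimal sketch of that approach, assuming the secondary index definitions are first removed from the table's *-schema.sql file inside the mydumper directory (the database name, index name and column name below are placeholders):

            # Load the data while only the clustered index exists (schema file pre-edited as described above).
            myloader --directory /dev/shm/data-dump-steps-only --socket /dev/shm/mariadb.sock -u root --threads=1
            # Build the secondary index once, after the load has completed.
            mariadb -u root -S /dev/shm/mariadb.sock -D test -e "ALTER TABLE steps ADD INDEX ix_example (example_col);"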

            In MDEV-30882 I already noted that there are hard-to-fix problems with rollback in ROW_FORMAT=COMPRESSED tables. I was not able to fully fix the bug. A page overflow during a rollback is possible. I have a data load running without --threads=1, but my build option cmake -DWITH_INNODB_EXTRA_DEBUG:BOOL=ON is slowing it down extremely. Only 2 CPU threads are occupied.

            marko Marko Mäkelä added a comment -

            I have had the data load running for 6.5 CPU hours (more than 3 wall-clock hours), and there has been no crash so far. The data directory has grown to only 5.1 GiB during this time. The build option that I am using will cause very frequent calls to page_zip_decompress(), which could easily slow down things by an order of magnitude. In any case, this does not look like a feasible option for reproducing the bug.

            marko Marko Mäkelä added a comment -

            bjhilbrands, the data load had finally completed on my debug build, without any crash:

            10:24:18 /dev/shm/10.6/mysql-test$ myloader --directory /dev/shm/data-dump-steps-only --socket /dev/shm/mariadb.sock -u root
            15:50:30 /dev/shm/10.6/mysql-test$
            

            The data directory size grew to 5.8 GiB, of which 4.0 GiB is the ib_logfile0. The server was shut down successfully, after consuming 706 minutes and 33 seconds of CPU time:

            10.6 20b0ec9aae38faaa4d249f67315cc5394eaea22e

            2023-11-29 16:36:48 0 [Note] InnoDB: Shutdown completed; log sequence number 13600918514; transaction id 1814
            

            According to the LSN output, it had written some 13 GiB to the circular ib_logfile0.

            marko Marko Mäkelä added a comment -

            marko mentioned the potential ROLLBACK issues in this ticket during re-testing of MDEV-31817 (that bug no longer reproduced in the patched branch 10.6-MDEV-32371 at commit 963a88be2e89100260e09e46e90e2b3271444899), so I added ROLLBACK and START TRANSACTION to the test using a random selection, as follows:

            #!/bin/bash
            # Start server with --max_connections=10000 --log-bin
            # Set variables and ensure ramloc is a ramdisk or tmpfs (i.e. /dev/shm)
             
            user="root"
            socket="./socket.sock"
            db="test"
            client="./bin/mariadb"
            errorlog="./log/master.err"
            ramloc="/dev/shm"
            threads=2000   # Number of concurrent threads
            queries=100    # Number of t1/t2 INSERTs per thread/per test round
            rounds=999999  # Number of max test rounds
             
            # Setup
            ${client} -u ${user} -S ${socket} -D ${db} -e "
            DROP TABLE IF EXISTS t1;
            DROP TABLE IF EXISTS t2;
            CREATE TABLE t1 (c1 INT NOT NULL AUTO_INCREMENT, c2 INT NOT NULL, PRIMARY KEY (c1), UNIQUE KEY u1 (c1,c2)) ENGINE=InnoDB AUTO_INCREMENT=1 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4; 
            CREATE TABLE t2 (c1 DATETIME NOT NULL, c2 DOUBLE NOT NULL, t1_c1 INT NOT NULL, PRIMARY KEY (t1_c1,c1)) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;
            "
             
            insert_rows(){
              SQL=
              for ((i=0;i<${queries};i++)); do
                rand=$((RANDOM % 4 + 1))
                if [ ${rand} -eq 1 ]; then
                  SQL="${SQL}INSERT INTO t1 (c2) VALUES (0);INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()));"
                elif [ ${rand} -eq 2 ]; then
                  SQL="${SQL}START TRANSACTION;"
                else
                  SQL="${SQL}ROLLBACK;"
                fi
              done
              ${client} -u ${user} -S ${socket} -D ${db} -e "${SQL}"
              rm -f ${ramloc}/${prefix}_md_proc_${1}  # Thread done
            }
             
            abort(){ jobs -p | xargs -P100 kill >/dev/null 2>&1; rm -Rf ${ramloc}/${prefix}_md_proc_*; exit 1; }
            trap abort SIGINT
             
            count=0
            prefix="$(echo "${RANDOM}${RANDOM}${RANDOM}" | cut -b1-5)"
            rm -f ${ramloc}/${prefix}_md_proc_*
            for ((i=0;i<${rounds};i++)); do  # Note: the inner loop below reuses i, so in practice this outer loop runs until the script is aborted
              for ((i=0;i<${threads};i++)); do
                if [ ! -r ${ramloc}/${prefix}_md_proc_${i} ]; then  # Thread idle
                  touch ${ramloc}/${prefix}_md_proc_${i}  # Thread busy
                  insert_rows ${i} &
                  if [ $[ ${count} % 1000 ] -eq 0 ]; then
                    ${client} -u ${user} -S ${socket} -D ${db} -e "TRUNCATE TABLE t1;"
                  fi
                  count=$[ ${count} + 1 ]
                  if [ $[ ${count} % 100 ] -eq 0 ]; then  # Limit disk I/O, check once every new 100 threads
                    echo "Count: ${count}" | tee lastcount.log
                    TAIL="$(tail -n10 ${errorlog} | tr -d '\n')"
                    if [[ "${TAIL}" == *"ERROR"* ]]; then
                      echo '*** Error found:'
                      grep -i 'ERROR' log/master.err
                      abort
                    elif [[ "${TAIL}" == *"down complete"* ]]; then
                      echo '*** Server shutdown'
                      abort
                    elif ! ${client}-admin ping -u ${user} -S ${socket} > /dev/null 2>&1; then
                      echo '*** Server gone (killed/crashed)'
                      abort
                    fi
                  fi
                fi
              done
            done
            

            And within 111,800 iterations I observed:

            10.6-MDEV-32371 963a88be2e89100260e09e46e90e2b3271444899 (Optimized)

            Count: 111500
            Count: 111600
            Count: 111700
            Count: 111800
            ERROR 1034 (HY000) at line 1: Index for table 't1' is corrupt; try to repair it
            ERROR 1034 (HY000) at line 1: Index for table 't1' is corrupt; try to repair it
            ERROR 1034 (HY000) at line 1: Index for table 't1' is corrupt; try to repair it
            Count: 111900
            *** Error found:
            2023-11-30 14:17:25 117465 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117470 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117468 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117468 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            2023-11-30 14:17:25 117465 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            2023-11-30 14:17:25 117470 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            

            Reflected in the error log as:

            10.6.17 963a88be2e89100260e09e46e90e2b3271444899 (Optimized)

            2023-11-30 13:51:15 0 [Note] /test/MARKO_MD281123-mariadb-10.6.17-linux-x86_64-opt/bin/mariadbd: ready for connections.
            Version: '10.6.17-MariaDB-log'  socket: '/test/MARKO_MD281123-mariadb-10.6.17-linux-x86_64-opt/socket.sock'  port: 11471  MariaDB Server
            2023-11-30 13:51:33 2392 [Warning] Aborted connection 2392 to db: 'unconnected' user: 'unauthenticated' host: 'localhost' (This connection closed normally without authentication)
            2023-11-30 13:52:37 5935 [Note] InnoDB: Number of pools: 2
            2023-11-30 14:17:25 117465 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117470 [Note] InnoDB: btr_page_get_parent 2
            2023-11-30 14:17:25 117470 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117468 [Note] InnoDB: btr_page_get_parent 2
            2023-11-30 14:17:25 117468 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117468 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            2023-11-30 14:17:25 117465 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            2023-11-30 14:17:25 117470 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            

            The extra [Note] debug messages are the result of this commit.
            I will try to see if I can get an rr trace; however, MDEV-31817 has proven to be rr-elusive, which likely means this issue will be too.

            Roel Roel Van de Paar added a comment -
            Roel Roel Van de Paar made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]
            Roel Roel Van de Paar made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]

            Roel, thank you. It is now clear that the hard-to-fix ROLLBACK issue on ROW_FORMAT=COMPRESSED tables that I identified in MDEV-30882 is rather easy to hit.

            I believe that the reason why I was not able to reproduce the problem with bjhilbrands’s data dump was that the relative timing of the SQL threads was different due to the enabled UNIV_ZIP_DEBUG checks, and far fewer rollbacks (due to deadlock or lock wait timeout) took place.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -

            Thank you indeed! This specific restore takes about 30 minutes to complete for me locally, and it had to run five times before the problem re-occurred. So even if it was still reproducible with the additional debug checks as you mentioned, it probably would have taken multiple days of non-stop loading for the problem to occur, which I agree is not feasible.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            If it matters for the analysis, the script will see the occasional duplicate key error:

            Count: 1900
            Count: 2000
            ERROR 1062 (23000) at line 1: Duplicate entry '1610-2023-11-30 19:10:05' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '2862-2023-11-30 19:10:05' for key 'PRIMARY'
            Count: 2100
            

            Roel Roel Van de Paar added a comment -
            marko Marko Mäkelä made changes -
            Component/s Backup [ 13902 ]
            Summary Table corruption during database restore ROW_FORMAT=COMPRESSED table corruption due to ROLLBACK

            Roel, are you able to reproduce this?

            marko Marko Mäkelä added a comment -
            Roel Roel Van de Paar made changes -
            Attachment MDEV-32174_ps.txt [ 72878 ]

            marko Hi! Sorry, I have been busy with Replication and Spider. I set up a new rr run with:

            cmake -DCMAKE_C{,XX}_FLAGS='-Og -march=native -mtune=native -Wno-error=maybe-uninitialized -Wno-error=unused-result' -DCMAKE_BUILD_TYPE=Debug -DPLUGIN_PERFSCHEMA=NO -DWITH_DBUG_TRACE=OFF -DWITH_INNODB_EXTRA_DEBUG=ON && make -j128
            

            However, I immediately noticed that the processing (done count) seemed to be going much more slowly, though perhaps that is caused by rr slowness.
            It looks like threads are getting stuck. I uploaded MDEV-32174_ps.txt for your review. I will let it run, though at this rate it will take maybe a week or so to hit the same issue.
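
            For context, a recording like this one is typically started roughly as follows (the binary path and server options below are placeholders, not the exact ones used in this run):

            # Record the server under rr so a later crash can be replayed deterministically.
            rr record ./bin/mariadbd --no-defaults --datadir=./data --socket=./socket.sock --max_connections=10000 --log-bin
            # After a crash, replay the most recent trace (rr replay picks the latest trace by default).
            rr replay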

            Roel Roel Van de Paar added a comment -

            Confirmed that the slowness is due to rr. Without it, the count increases considerably more quickly. Also, the processlist, while at times still showing threads that are 15-20 seconds old, does continually clear up.

            Roel Roel Van de Paar added a comment -
            Roel Roel Van de Paar added a comment - - edited

            The rr instance ran into:

            10.6.17 12d05c8266567ea9a62ccabc3180a90129bf81e0

            Version: '10.6.17-MariaDB-debug-log'  socket: '/test/MDEV-32174_mariadb-10.6.17-linux-x86_64_2/socket.sock'  port: 12864  Source distribution
            2024-01-30 13:43:02 1571 [Note] InnoDB: Number of pools: 2
            2024-01-30 14:08:01 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
            240130 14:08:01 [ERROR] mysqld got signal 6 ;
            

            10.6.17 12d05c8266567ea9a62ccabc3180a90129bf81e0

            Core was generated by `/test/MDEV-32174_mariadb-10.6.17-linux-x86_64_2/bin/mariadbd --no-defaults --ma'.
            Program terminated with signal SIGABRT, Aborted.
            #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6)
                at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
            [Current thread is 1 (Thread 0x39ef7f1ee700 (LWP 389013))]
            (gdb) bt
            #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
            #1  0x000055ca95c706fe in my_write_core (sig=sig@entry=6) at /test/10.6_opt/mysys/stacktrace.c:424
            #2  0x000055ca954e8a55 in handle_fatal_signal (sig=6) at /test/10.6_opt/sql/signal_handler.cc:357
            #3  <signal handler called>
            #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
            #5  0x000046220f678859 in __GI_abort () at abort.c:79
            #6  0x000055ca95a69f84 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>) at /test/10.6_opt/storage/innobase/ut/ut0ut.cc:515
            #7  0x000055ca95a1256f in srv_monitor_task () at /test/10.6_opt/storage/innobase/include/ut0ut.h:393
            #8  0x000055ca95c16003 in tpool::thread_pool_generic::timer_generic::run (this=0x55ca97277540) at /test/10.6_opt/tpool/tpool_generic.cc:344
            #9  tpool::thread_pool_generic::timer_generic::execute (arg=0x55ca97277540) at /test/10.6_opt/tpool/tpool_generic.cc:364
            #10 0x000055ca95c16f06 in tpool::task::execute (this=0x55ca97277580) at /test/10.6_opt/tpool/task.cc:37
            #11 0x000055ca95c14a9f in tpool::thread_pool_generic::worker_main (this=0x55ca971bd820, thread_var=0x55ca971bdfe0) at /test/10.6_opt/tpool/tpool_generic.cc:583
            #12 0x000055ca95c15e52 in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
            #13 std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
            #14 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
            #15 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
            #16 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
            #17 0x000048d363f37de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
            #18 0x000008d07419b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #19 0x000046220f775133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            marko Do you want to check the rr trace for this?

            rr: /test/MDEV-32174_mariadb-10.6.17-linux-x86_64_2/rr/latest-trace
            

            Related: MDEV-24911


            Roel, I think that most hangs or deadlocks involve only 2 or 3 threads. The rr replay trace that you produced for my analysis contains 2,013 threads. Due to the way rr record works, this may very easily lead to unfair scheduling and starvation of threads. Such “fake hangs” are best avoided by reducing the number of concurrent threads or by reducing some configuration parameters, or even by patching the source code, such as reducing OS_AIO_N_PENDING_IOS_PER_THREAD from 256 to 1, which we did to uncover MDEV-32861 and to reduce the amount of false positive hang reports.

            I believe that this particular bug (a corruption on ROLLBACK) should be reproducible with 1 connection thread, or maybe at most 2, if the rollbacks are being provoked not by explicit ROLLBACK statements or by duplicate key errors, but by deadlocks or lock wait timeouts between conflicting transactions. I do not see any reason to use thousands of concurrent connections.
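
            A minimal sketch of such a two-connection workload, provoking the rollback of an INSERT via a lock wait timeout on a ROW_FORMAT=COMPRESSED table (the client path, socket and table definition are placeholders; whether this actually reproduces the corruption depends on timing):

            client="./bin/mariadb -u root -S ./socket.sock -D test"
            $client -e "CREATE TABLE IF NOT EXISTS t1 (c1 INT PRIMARY KEY, c2 INT, KEY k2 (c2)) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;"
            # Connection A inserts c1=1 and keeps its transaction open for a few seconds.
            $client -e "START TRANSACTION; INSERT INTO t1 VALUES (1,1); DO SLEEP(5); COMMIT;" &
            sleep 1
            # Connection B attempts the same key, waits on A's lock, times out after 1 second,
            # and has its INSERT rolled back, exercising the rollback path on a compressed table.
            $client -e "SET SESSION innodb_lock_wait_timeout=1; INSERT INTO t1 VALUES (1,2);"
            wait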

            I tried to find the latest threads that are contending on dict_sys.latch, but the trace is huge and the system is loaded with other processes. Already halfway through the trace (at when 2000000) the dict_sys.latch_ex_wait_start was at its final value 1706582093182145 (LANG=C TZ=UTC0 date --date @1706582093.182145 converts it to a readable timestamp). The interesting part of the trace should be somewhere after when 1534726 (timestamp 1706582052682666). In a few minutes I might find it. I only do this to prove that this is a false alarm.
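
            As a worked example of that conversion, applying it to the value quoted above gives:

            $ LANG=C TZ=UTC0 date --date @1706582093.182145
            Tue Jan 30 02:34:53 UTC 2024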

            marko Marko Mäkelä added a comment -

            The final wait for exclusive dict_sys.latch started in Thread 1014 at when 1680126, in the master callback task. This will block any subsequent waits for dict_sys.latch until the exclusive latch has been granted and released. Here is the stack trace:

            10.6 12d05c8266567ea9a62ccabc3180a90129bf81e0

            #13 0x000055ca95b3a4a2 in ssux_lock_impl<false>::wr_lock (this=0x55ca965a2340 <dict_sys+64>) at /test/10.6_opt/storage/innobase/dict/dict0dict.cc:966
            #14 dict_sys_t::lock_wait (this=this@entry=0x55ca965a2300 <dict_sys>) at /test/10.6_opt/storage/innobase/dict/dict0dict.cc:961
            #15 0x000055ca9581f65e in dict_sys_t::lock (this=this@entry=0x55ca965a2300 <dict_sys>) at /test/10.6_opt/storage/innobase/include/dict0dict.h:1524
            #16 0x000055ca95b3ee6b in dict_sys_t::evict_table_LRU (this=0x55ca965a2300 <dict_sys>, half=half@entry=false) at /test/10.6_opt/storage/innobase/dict/dict0dict.cc:1300
            #17 0x000055ca95a0f5c5 in srv_master_do_idle_tasks (counter_time=counter_time@entry=2321294823926) at /test/10.6_opt/storage/innobase/srv/srv0srv.cc:1465
            #18 0x000055ca95a11a4c in srv_master_callback () at /test/10.6_opt/storage/innobase/srv/srv0srv.cc:1516
            #19 0x000055ca95c16003 in tpool::thread_pool_generic::timer_generic::run (this=0x55ca972864f0) at /test/10.6_opt/tpool/tpool_generic.cc:344
            

            Everything that I have observed so far suggests that between this point and the end of the rr replay trace, one thread will be holding dict_sys.latch in shared mode. We will need to identify this thread and find out what is blocking it. If it turns out that rr record is neglecting this thread (not executing it at all), then this is the type of “fake hang” that I suspected and expected. To work around an occasional rr replay bug where reverse-continue may miss breakpoints or watchpoints, I restarted the trace slightly before the final wait started. At this point, the dict_sys.latch was still unoccupied:

            watch -l dict_sys.latch_ex_wait_start._M_i
            watch -l dict_sys.latch.readers._M_i
            display dict_sys.latch_ex_wait_start
            display dict_sys.latch
            display dict_sys.latch_ex
            run 1600000
            

            At when 1600089, dict_sys.latch was both acquired and released in shared mode by Thread 6, and again reacquired at when 1600174. At this rate it will be several minutes before when 1680126 (the start of the wait in Thread 1014) will be reached.

            marko Marko Mäkelä added a comment -

            The requests to acquire and release a shared dict_sys.latch in Thread 6 are by the purge coordinator task. Here is one stack trace:

            10.6 12d05c8266567ea9a62ccabc3180a90129bf81e0

            #3  0x000055ca95a219d1 in purge_sys_t::close_and_reopen (
                this=this@entry=0x55ca96e99e40 <purge_sys>, id=<optimized out>, 
                thd=thd@entry=0x55ca97392688, mdl=mdl@entry=0x5a8517a51ae8)
                at /test/10.6_opt/storage/innobase/trx/trx0purge.cc:1237
            #4  0x000055ca95a28811 in trx_purge_attach_undo_recs (
                n_purge_threads=n_purge_threads@entry=4, thd=thd@entry=0x55ca97392688)
                at /test/10.6_opt/storage/innobase/trx/trx0purge.cc:1349
            #5  0x000055ca95a28c1a in trx_purge (n_tasks=n_tasks@entry=4, 
                history_size=3911) at /test/10.6_opt/storage/innobase/trx/trx0purge.cc:1453
            #6  0x000055ca95a1232d in purge_coordinator_state::do_purge (
                this=0x55ca96e99020 <purge_state>)
                at /test/10.6_opt/storage/innobase/srv/srv0srv.cc:1638
            #7  purge_coordinator_callback ()
                at /test/10.6_opt/storage/innobase/srv/srv0srv.cc:1716
            

            However, I realized that the dict_sys.latch was released after when 1680126. The trace is executing extremely slowly. This is the latest that I got, after several hours of waiting:

            10.6 12d05c8266567ea9a62ccabc3180a90129bf81e0

            Thread 2015 hit Hardware watchpoint 2: -location dict_sys.latch.readers._M_i
             
            Old value = 2147483648
            New value = 0
            ssux_lock_impl<false>::wr_unlock (this=0x55ca965a2340 <dict_sys+64>) at /test/10.6_opt/storage/innobase/include/srw_lock.h:306
            306	    writer.wr_unlock();
            1: dict_sys.latch_ex_wait_start = {<std::__atomic_base<unsigned long long>> = {_M_i = 0}, <No data fields>}
            2: dict_sys.latch_ex = {m = {<std::__atomic_base<unsigned long>> = {_M_i = 0}, <No data fields>}}
            3: dict_sys.latch = {writer = {lock = {<std::__atomic_base<unsigned int>> = {_M_i = 2147483649}, <No data fields>}}, readers = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}}
            Current event: 2788949
            [Switching to Thread 350035.350041]
             
            Thread 6 hit Hardware watchpoint 2: -location dict_sys.latch.readers._M_i
             
            Old value = 0
            New value = 1
            0x000055ca95a20e42 in std::__atomic_base<unsigned int>::compare_exchange_weak (__m2=std::memory_order_relaxed, __m1=std::memory_order_acquire, __i2=1, __i1=<synthetic pointer>: <optimized out>, this=<optimized out>) at /usr/include/c++/9/bits/atomic_base.h:448
            448	      compare_exchange_weak(__int_type& __i1, __int_type __i2,
            1: dict_sys.latch_ex_wait_start = {<std::__atomic_base<unsigned long long>> = {_M_i = 0}, <No data fields>}
            2: dict_sys.latch_ex = {m = {<std::__atomic_base<unsigned long>> = {_M_i = 0}, <No data fields>}}
            3: dict_sys.latch = {writer = {lock = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}}, readers = {<std::__atomic_base<unsigned int>> = {_M_i = 1}, <No data fields>}}
            Current event: 2789015
             
            Thread 6 hit Hardware watchpoint 2: -location dict_sys.latch.readers._M_i
             
            Old value = 1
            New value = 0
            0x000055ca95a21123 in ssux_lock_impl<false>::rd_unlock (this=<optimized out>) at /usr/include/c++/9/bits/atomic_base.h:549
            549	      fetch_sub(__int_type __i,
            1: dict_sys.latch_ex_wait_start = {<std::__atomic_base<unsigned long long>> = {_M_i = 0}, <No data fields>}
            2: dict_sys.latch_ex = {m = {<std::__atomic_base<unsigned long>> = {_M_i = 0}, <No data fields>}}
            3: dict_sys.latch = {writer = {lock = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}}, readers = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}}
            Current event: 2789015
            

            That is, a shared dict_sys.latch was released, reacquired and released at this point, and acquired in shared mode after that. The final wait must have started somewhere between that and the final when count (slightly above 4 million).

            I have never seen rr replay traces executing this slowly when there is no other load on the system. I think that it must be due to the huge number of threads and the constant context switches between them. I will assume that this is a “fake hang”.

            Please try to reproduce the bug of this ticket with a more reasonable configuration.

            marko Marko Mäkelä added a comment -

            The last I observed before logging off the environment was a shared latch release at when 2790836. The final exclusive lock wait should have started somewhere after that.

            marko Marko Mäkelä added a comment -
            Roel Roel Van de Paar made changes -
            Assignee Marko Mäkelä [ marko ] Roel Van de Paar [ roel ]
            Roel Roel Van de Paar made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            Roel Roel Van de Paar made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Assignee Roel Van de Paar [ roel ] Marko Mäkelä [ marko ]
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] Needs Feedback [ 10501 ]

            bjhilbrands or Roel, have you been able to reproduce this with a more recent version of MariaDB Server? With the logical dump mentioned in MDEV-35779, I got MariaDB Server 10.6.18 to report corruption, but the current release seemed to be OK. I suspect that the failure is nondeterministic, so we’d better test it several times.

            marko Marko Mäkelä added a comment -

            No corruption issues thus far on recent & current 10.6 tests:

            CS 10.6.21 c05e7c4e0eff174a1f2b5e6efd53d80954f9e34b (Optimized)

            2025-01-28 15:19:35 0 [Note] /test/MD210125-mariadb-10.6.21-linux-x86_64-opt/bin/mariadbd: ready for connections.
            Version: '10.6.21-MariaDB-log'  socket: '/test/MD210125-mariadb-10.6.21-linux-x86_64-opt/socket.sock'  port: 12427  MariaDB Server
            2025-01-28 15:20:32 1511 [Note] InnoDB: Number of pools: 2
            2025-01-28 15:20:45 3015 [Note] InnoDB: Number of pools: 3
            2025-01-28 15:21:01 4524 [Note] InnoDB: Number of pools: 4
            2025-01-28 15:21:20 6027 [Note] InnoDB: Number of pools: 5
            2025-01-28 15:22:11 7533 [Note] InnoDB: Number of pools: 6
            2025-01-28 15:25:26 1804 [Note] Detected table cache mutex contention at instance 1: 55% waits. Additional table cache instance activated. Number of instances after activation: 2.
            

            I did, however, see a few duplicate key errors again:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            ERROR 1062 (23000) at line 1: Duplicate entry '908-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '869-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '969-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '924-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1062-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1227-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1195-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1166-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1290-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1275-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1416-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1461-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1753-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1861-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1853-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '878-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1325-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1897-2025-01-28 15:57:58' for key 'PRIMARY'
            

            The scarcity/infrequency of these errors (they are almost never seen, even when running 8k threads), combined with the fact that, as far as I can see, they should not happen at all, makes me believe there is a highly sporadic race-condition bug involving LAST_INSERT_ID. The query that produced these errors is:

            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            

            Roel Roel Van de Paar added a comment -
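
            For readers of this ticket: a minimal SQL sketch of the kind of t1/t2 workload described above. The table definitions below are assumptions reconstructed from the duplicate-entry messages and the quoted INSERT; the actual schema used by script1.sh is not attached here.

            {code:sql}
            -- Hypothetical reconstruction of the stress-test tables; column names are invented.
            CREATE TABLE t1 (
              id      BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
              payload INT NOT NULL
            ) ENGINE=InnoDB;

            CREATE TABLE t2 (
              ts    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
              flag  INT NOT NULL,
              t1_id BIGINT NOT NULL,
              PRIMARY KEY (t1_id, ts)  -- matches the 'id-timestamp' format of the duplicate-entry errors
            ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;  -- COMPRESSED is assumed, as in the tables this ticket is about

            -- Each concurrent client session repeatedly runs:
            INSERT INTO t1 (payload) VALUES (0);
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()));
            {code}

            Since LAST_INSERT_ID() is connection-scoped and AUTO_INCREMENT values are not reused, two sessions should never produce the same (t1_id, ts) pair, so a duplicate-key error from the second INSERT is indeed unexpected.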
            Roel Roel Van de Paar added a comment - - edited

            An overnight run did not produce any new occurrences. Please do have a look at the duplicate key issue mentioned in the last comment. It happens only very sporadically.


            The duplicate key errors might be related to the corner that I cut while fixing MDEV-30882. Does CHECK TABLE…EXTENDED (MDEV-24402) report any errors for these ROW_FORMAT=COMPRESSED tables?

            marko Marko Mäkelä added a comment -
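
            For reference, the check being asked about can be run as below; the table name is a placeholder for whichever ROW_FORMAT=COMPRESSED tables the workload touches.

            {code:sql}
            -- EXTENDED (MDEV-24402) makes CHECK TABLE perform additional InnoDB consistency checks;
            -- anything other than a single "status | OK" row would be worth attaching to this ticket.
            CHECK TABLE test.t2 EXTENDED;
            {code}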

            Had an interesting run this morning with this. This time I used:

            threads=8000   # Number of concurrent threads
            queries=100    # Number of t1/t2 INSERTs per thread/per test round
            

            And here is what I saw:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            /test/MDEV-32174_mariadb-10.6.21-linux-x86_64$ ./script1.sh 
            Count: 100
            Count: 200
            Count: 300
            Count: 400
            Count: 500
            ...normal count operation continues...
            Count: 2600
            Count: 2700
            Count: 2800
            Count: 2900
            Count: 3000
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
             
            ERROR 1062 (23000) at line 1: Duplicate entry '2198-2025-02-10 06:38:40' for key 'PRIMARY'
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
             
            ERROR 1062 (23000) at line 1: Duplicate entry '1931-2025-02-10 06:38:40' for key 'PRIMARY'
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
            ...additional similar duplicate entries...
            ERROR 1062 (23000) at line 1: Duplicate entry '2781-2025-02-10 06:38:40' for key 'PRIMARY'
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
             
            ERROR 1062 (23000) at line 1: Duplicate entry '3123-2025-02-10 06:38:40' for key 'PRIMARY'
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
             
            ERROR 1062 (23000) at line 1: Duplicate entry '3179-2025-02-10 06:38:40' for key 'PRIMARY'
            Count: 3100
            Count: 3200
            Count: 3300
            Count: 3400
            Count: 3500
            ...normal count operation continues...
            Count: 4600
            Count: 4700
            Count: 4800
            Count: 4900
            Count: 5000
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
             
            ERROR 1062 (23000) at line 1: Duplicate entry '858-2025-02-10 06:39:03' for key 'PRIMARY'
            Count: 5100
            Count: 5200
            Count: 5300
            Count: 5400
            Count: 5500
            Count: 5600
            Count: 5700
            Count: 5800
            Count: 5900
            Count: 6000
            ./script1.sh: line 23: 2193819 Killed                  ${client} -A --skip-ssl-verify-server-cert --skip-ssl --force --binary-mode -u ${user} -S ${socket} -D ${db} -e "${SQL}"
            ./script1.sh: line 23: 2193829 Killed                  ${client} -A --skip-ssl-verify-server-cert --skip-ssl --force --binary-mode -u ${user} -S ${socket} -D ${db} -e "${SQL}"
            ...many similar such messages...
            ./script1.sh: line 23: 2194480 Killed                  ${client} -A --skip-ssl-verify-server-cert --skip-ssl --force --binary-mode -u ${user} -S ${socket} -D ${db} -e "${SQL}"
            ./script1.sh: line 23: 2194481 Killed                  ${client} -A --skip-ssl-verify-server-cert --skip-ssl --force --binary-mode -u ${user} -S ${socket} -D ${db} -e "${SQL}"
            Count: 6100
            Count: 6200
            Count: 6300
            Count: 6400
            Count: 6500
            ...script continues...
            

            Firstly, it seems interesting that the single '858-2025-02-10 06:39:03' duplicate entry occurrence happened quite some time after the earlier batch of duplicate entries, again suggesting some sort of (highly sporadic) race condition bug, as we were discussing. Immediately after this (say around count 5500), I ran CHECK TABLE and the table was fine:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            10.6.21>CHECK TABLE t2 EXTENDED;
            +---------+-------+----------+----------+
            | Table   | Op    | Msg_type | Msg_text |
            +---------+-------+----------+----------+
            | test.t2 | check | status   | OK       |
            +---------+-------+----------+----------+
            1 row in set (0.037 sec)
            

            However, a little later the client kills happened, likely as a result of the automated server resource monitoring we use.
            Immediately after this, CHECK TABLE reported a warning:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            10.6.21>CHECK TABLE t2 EXTENDED;
            +---------+-------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table   | Op    | Msg_type | Msg_text                                                                                                                                                                                                                                           |
            +---------+-------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | test.t2 | check | Warning  | InnoDB: Clustered index record with stale history in table `test`.`t2`: COMPACT RECORD(info_bits=0, 5 fields): {[4]    (0x80000002),[5]   i (0x99B5D469D5),[6]    \ (0x000000015CBF),[7]      T(0xD000000C0D0554),[8]        (0x0000000000000000)} |
            | test.t2 | check | status   | OK                                                                                                                                                                                                                                                 |
            +---------+-------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            2 rows in set (0.655 sec)
            

            And checking the error log, we see it right in the midst of the killed connections:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            ...more aborted connections before...
            2025-02-10  6:39:32 5829 [Warning] Aborted connection 5829 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)
            2025-02-10  6:39:32 5832 [Warning] Aborted connection 5832 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)
            2025-02-10  6:39:33 4052 [Warning] InnoDB: Clustered index record with stale history in table `test`.`t2`: COMPACT RECORD(info_bits=0, 5 fields): {[4]    (0x80000002),[5]   i (0x99B5D469D5),[6]    \ (0x000000015CBF),[7]      T(0xD000000C0D0554),[8]        (0x0000000000000000)}
            2025-02-10  6:39:33 5848 [Warning] Aborted connection 5848 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)
            2025-02-10  6:39:33 5912 [Warning] Aborted connection 5912 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)
            2025-02-10  6:39:34 5924 [Warning] Aborted connection 5924 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)
            ...more aborted connections after...
            

            Could client kills (or flaky connections) be the cause not just of this warning but also of the earlier observed corruptions?

            Roel Roel Van de Paar added a comment -
            Roel Roel Van de Paar added a comment - - edited

            The "Clustered index record with stale history in table" warning is readily reproducible, even for different record locations:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            | test.t2 | check | Warning  | InnoDB: Clustered index record with stale history in table `test`.`t2`: COMPACT RECORD(info_bits=0, 5 fields): {[4]   l(0x8000006C),[5]   z (0x99B5D47AE8),[6]      (0x0000000C86B7),[7]   "   (0xD00000220716A0),[8]        (0x0000000000000000)} |
            ...
            | test.t2 | check | Warning  | InnoDB: Clustered index record with stale history in table `test`.`t2`: COMPACT RECORD(info_bits=0, 5 fields): {[4]    (0x80000016),[5]   { (0x99B5D47B0C),[6]     a(0x0000000C8661),[7]      8(0xFA00000CF30938),[8]        (0x0000000000000000)} |
            ...
            | test.t2 | check | Warning  | InnoDB: Clustered index record with stale history in table `test`.`t2`: COMPACT RECORD(info_bits=0, 5 fields): {[4]    (0x80000020),[5]   {G(0x99B5D47B47),[6]     h(0x0000000DDD68),[7]   $ ) (0xC5000024D6299C),[8]        (0x0000000000000000)} |
            

            It seems these are "repaired" without problems; the table shows OK afterwards. However, when many threads are running, these warnings are observed near-constantly.

            I also tried reproducing corruption by running a kill script which randomly kills connections, but no corruption was observed, which seems consistent with your findings.

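            The kill script itself is not attached; a minimal SQL-only sketch of terminating test clients by hand, assuming they connect as user 'root' to the 'test' database as in the aborted-connection log lines above, could look like this.

            {code:sql}
            -- List candidate victims among the test clients, excluding the current session:
            SELECT id, user, db, command, time
              FROM information_schema.PROCESSLIST
             WHERE db = 'test' AND user = 'root' AND id <> CONNECTION_ID()
             ORDER BY RAND();

            -- Then terminate one of the reported ids (the id below is only an example):
            KILL CONNECTION 5829;
            {code}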

            Hi, unfortunately I was still able to reproduce the problem; just a quick syslog entry:
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: 2025-02-26 14:51:35 353 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: 2025-02-26 14:51:35 353 [ERROR] mariadbd: Index for table 'transition_coverages' is corrupt; try to repair it
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: 2025-02-26 14:51:35 0x7cb3a8169640 InnoDB: Assertion failure in file ./storage/innobase/page/page0zip.cc line 4296
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: Failing assertion: slot_rec
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: We intentionally generate a memory trap.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: If you get repeated assertion failures or crashes, even
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: immediately after the mariadbd startup, there may be
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: corruption in the InnoDB tablespace. Please refer to
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: about forcing recovery.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: 250226 14:51:35 [ERROR] mysqld got signal 6 ;
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Sorry, we probably made a mistake, and this is a bug.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Your assistance in bug reporting will enable us to fix this for the next release.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: We will try our best to scrape up some info that will hopefully help
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: diagnose the problem, but since we have already crashed,
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: something is definitely wrong and this may fail.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Server version: 10.6.18-MariaDB-0ubuntu0.22.04.1 source revision: 887bb3f73555ff8a50138a580ca8308b9b5c069c
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: key_buffer_size=134217728
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: read_buffer_size=131072
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: max_used_connections=9
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: max_threads=153
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: thread_count=9
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: It is possible that mysqld could use up to
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 468009 K bytes of memory
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Hope that's ok; if not, decrease some variables in the equation.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Thread pointer: 0x7cb338000c68
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Attempting backtrace. You can use the following information to find out
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: where mysqld died. If you see no messages after this, something went
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: terribly wrong...

            bjhilbrands Bart-Jan Hilbrands added a comment -
            serg Sergei Golubchik made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]

            bjhilbrands Thank you for the update! I see you got an assertion as well here. How did you reproduce the corruption in this case, please?
            PS. You can use {noformat}...{noformat} tags instead in JIRA.

            Roel Roel Van de Paar added a comment -

            Hi, this was reproduced by running myloader with the default configuration (4 threads), so I suppose it is possible that the issue goes away when myloader is run with only 1 thread.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            bjhilbrands, MariaDB Server 10.6.18 should be affected by the race condition that was fixed in MDEV-34453 (MariaDB Server 10.6.20). Can you reproduce the corruption with 10.6.20 or 10.6.21 or a later release?

            marko Marko Mäkelä added a comment -

            I have set up a pipeline that restores database backups 24/7; I'll report back next week if the issue has not been reproduced by then.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            It is still looking promising so far! I also have a question about which other versions of MariaDB this might affect or have affected. We are upgrading to Ubuntu 24.04 this year, and the current MariaDB version in the Ubuntu package repository is 10.11.8. Do you know whether we might also run into this problem with that version, or whether we would have to install a later 10.11 release?

            bjhilbrands Bart-Jan Hilbrands added a comment -
            bjhilbrands Bart-Jan Hilbrands added a comment - - edited

            Alright, it has been a week of nonstop restores now and the problem has not reproduced yet with version 10.6.21. I think we can pretty safely say that the problem has been resolved, thanks for the effort!

            marko Marko Mäkelä made changes -

            Thank you for the testing. I think that we can declare this bug a duplicate of MDEV-34453.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Fix Version/s 11.8.0 [ 29960 ]
            Fix Version/s 11.7.1 [ 29913 ]
            Fix Version/s 11.6.2 [ 29908 ]
            Fix Version/s 11.4.4 [ 29907 ]
            Fix Version/s 11.2.6 [ 29906 ]
            Fix Version/s 10.11.10 [ 29904 ]
            Fix Version/s 10.6.20 [ 29903 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Duplicate [ 3 ]
            Status Open [ 1 ] Closed [ 6 ]

            People

              marko Marko Mäkelä
              bjhilbrands Bart-Jan Hilbrands
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved:
