MariaDB Server / MDEV-32174

ROW_FORMAT=COMPRESSED table corruption due to ROLLBACK

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 10.11.3, 10.6.12
    • Fix Version/s: None

    Description

      After upgrading to Ubuntu 22.04, and alongside that to MariaDB 10.6, we started experiencing table corruption during our nightly restores. Because these restores are to staging servers, loss of data was acceptable, so we fixed the problem by simply removing the corrupted table. But this obviously won't fly should we ever have to do a production restore (which is very rare).
      The database portion of the restores is done using mydumper with the default of 4 threads.
      What makes the problem particularly nasty is that it cannot be consistently reproduced; we noticed it seemed to happen in roughly 1 out of every 7 restores (where each restore takes about 40-50 minutes).
      This made us think it was related to parallelism, so we tried running mydumper single-threaded, which did not solve the problem.
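      For context, the dump and restore are driven by commands roughly like the following (a sketch, not our exact invocation; the database name and paths are illustrative, and connection options are omitted):

        # Dump (mydumper defaults to 4 threads).
        mydumper --database mydb --outputdir /backups/mydb --threads 4

        # Restore onto the staging server; --threads 1 was also tried without success.
        myloader --directory /backups/mydb --overwrite-tables --threads 4
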
      We have also tried upgrading to various versions of mariadb/mydumper, most notably:

      MariaDB   mydumper
      10.6.12   0.15
      10.11.3   0.10
      10.11.3   0.15

      But with all of the above version combinations the problem still occurred.
      Eventually we found that we could no longer reproduce the problem while running MariaDB 10.5.21 with mydumper 0.10, but we are still unsure of the underlying cause.

      Provided files:
      The included table structure is just one table of our database, as we were able to reproduce the problem by restoring backups of only this table.
      Because it is quite time-consuming to reproduce and we would have to generate a large amount of dummy data (existing dumps all contain customer data), we have not included a database dump for now.
      But we still wanted to create this bug report just in case you might already see something strange based on the included table structure and error log.
      In case there is no immediately apparent problem and you still want to look further into this, we would of course be happy to provide a database dump.

      Update:
      After running on 10.6.15 I was again able to reproduce it. I generated a stack trace from the core dump (mariadbd_full_bt_all_threads.txt).

      Attachments

        1. configuration.txt
          78 kB
        2. restore-failure-error-log.txt
          20 kB
        3. table_structure.txt
          1 kB
        4. syslog-restore-10.6.15.txt
          24 kB
        5. mariadbd_full_bt_all_threads.txt
          149 kB
        6. MDEV-32174_ps.txt
          335 kB


          Activity

            bjhilbrands Bart-Jan Hilbrands created issue -
            bjhilbrands Bart-Jan Hilbrands made changes - Description edited (corrected the MariaDB versions in the table from 11.1 to 10.11.3)
            danblack Daniel Black added a comment - edited

            Hi bjhilbrands,

            To give us a little more detail:

            sudo apt-get install mariadb-server-core-dbg
            

            Then, using abrt-retrace or by starting a gdb session on the core dump, run thread apply all bt -frame-arguments all full.

            If you are able to test with 10.6.15, that would be much appreciated (repo config).

            Or even 10.11, though an earlier version will tell us more reliably where to fix it.
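
            A non-interactive way to capture that backtrace might look like this (a sketch; the binary and core paths are assumptions, adjust them to your setup):

            gdb -batch \
                -ex 'set pagination off' \
                -ex 'thread apply all bt -frame-arguments all full' \
                /usr/sbin/mariadbd /var/crash/core.mariadbd > mariadbd_full_bt_all_threads.txt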


            marko Marko Mäkelä added a comment -

            If the innodb_buffer_pool_size is much smaller than the database, this report could be a duplicate of MDEV-31767.
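
            A rough way to compare the two (connection options are assumptions):

            # Configured buffer pool size vs. total InnoDB data+index size, both in GiB.
            mariadb -u root -e "
              SELECT @@innodb_buffer_pool_size/1024/1024/1024 AS buffer_pool_gib;
              SELECT SUM(data_length+index_length)/1024/1024/1024 AS innodb_data_gib
                FROM information_schema.tables WHERE engine='InnoDB';"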

            bjhilbrands Bart-Jan Hilbrands added a comment -

            Hi, thank you for the quick feedback. I am testing it on 10.6.15 right now; I'll get back to you with my findings and a full stack trace if the problem indeed still occurs.
            bjhilbrands Bart-Jan Hilbrands made changes - Description edited (added the update about reproducing the problem on 10.6.15 and generating a stack trace from the core dump)
            bjhilbrands Bart-Jan Hilbrands made changes - Description edited (completed the reference to mariadbd_full_bt_all_threads.txt)
            bjhilbrands Bart-Jan Hilbrands made changes - Attachment mariadbd_full_bt_all_threads.txt [ 72008 ]
            bjhilbrands Bart-Jan Hilbrands made changes - Attachment syslog-restore-10.6.15.txt [ 72011 ]

            bjhilbrands Bart-Jan Hilbrands added a comment -

            I was able to reproduce it again with 10.6.15. One thing I noticed that was different is that MariaDB is still running properly (before, it was stuck in a restart loop). Now it is running fine with status "To roll back: 1 transactions, 5668 rows". The stack trace I generated from the core dump is attached as mariadbd_full_bt_all_threads.txt; I also added output from the syslog in case that is useful: syslog-restore-10.6.15.txt.
            danblack Daniel Black added a comment -

            I think the backtrace is still being processed with the 10.6.12 symbols rather than the 10.6.15 ones. Can you install mariadb-server-core-10.6-dbgsym, which should come from the MariaDB repo? After that, re-resolve the previous core dump. Thanks for the confirmation and quick response.
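
            Roughly (a sketch; the package name is as above, the gdb invocation repeats the earlier one, and the paths are assumptions):

            sudo apt-get install mariadb-server-core-10.6-dbgsym
            # Re-resolve the same core dump with the matching 10.6.15 symbols.
            gdb -batch -ex 'thread apply all bt -frame-arguments all full' \
                /usr/sbin/mariadbd /var/crash/core.mariadbd > mariadbd_full_bt_all_threads.txt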

            bjhilbrands Bart-Jan Hilbrands made changes - Attachment mariadbd_full_bt_all_threads.txt [ 72008 ]
            bjhilbrands Bart-Jan Hilbrands made changes - Attachment mariadbd_full_bt_all_threads.txt [ 72012 ]

            bjhilbrands Bart-Jan Hilbrands added a comment -

            I see, I have updated the attachment with the fixed version.
            bjhilbrands Bart-Jan Hilbrands made changes - Attachment mariadbd_full_bt_all_threads.txt [ 72012 ]
            bjhilbrands Bart-Jan Hilbrands made changes - Attachment mariadbd_full_bt_all_threads.txt [ 72013 ]
            alice Alice Sherepa made changes -
            danblack Daniel Black made changes - Fix Version/s 10.6 [ 24028 ]
            danblack Daniel Black made changes - Component/s Storage Engine - InnoDB [ 10129 ]
            danblack Daniel Black made changes - Assignee Marko Mäkelä [ marko ]
            danblack Daniel Black added a comment -

            Thanks bjhilbrands.

            page_zip_dir_find returning a null pointer

            Aborting thread

            #8  0x00007f0eb02b7476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
                    ret = <optimized out>
            #9  0x00007f0eb029d7f3 in __GI_abort () at ./stdlib/abort.c:79
                    save_stage = 1
                    act = {__sigaction_handler = {sa_handler = 0x7f0e983b9d00, sa_sigaction = 0x7f0e983b9d00}, sa_mask = {__val = {398, 139701063841440, 3432, 94106376343816, 139701062207568, 398, 94106364354817, 9, 139701063841440, 398, 398, 1, 94106376343816, 139701063824896, 139701062160471, 1103806595077}}, sa_flags = -1337395328, sa_restorer = 0x7f0eb048b600 <_IO_file_jumps>}
                    sigs = {__val = {32, 139700660311452, 139701063841056, 15840675240469249536, 9, 15840675240469249536, 139701063841888, 139700660311648, 94106376288112, 139701062769284, 4297, 398, 94106376343816, 398, 139700660311424, 139701062205293}}
            #10 0x00005596d86b1d3e in ut_dbg_assertion_failed (expr=expr@entry=0x5596d9213770 "slot_rec", file=file@entry=0x5596d9213230 "./storage/innobase/page/page0zip.cc", line=line@entry=4297) at ./storage/innobase/ut/ut0dbg.cc:60
            No locals.
            #11 0x00005596d8ddb888 in page_zip_dir_delete (block=block@entry=0x7f0e8c0f46a0, rec=<optimized out>, index=index@entry=0x7f0e5c0535f0, offsets=offsets@entry=0x7f0e983b97d0, free=0x0, mtr=mtr@entry=0x7f0e983b9d00) at ./storage/innobase/page/page0zip.cc:4297
                    page_zip = 0x7f0e8c0f46d8
                    page_free = <optimized out>
                    garbage = <optimized out>
                    slot_rec = 0x0
                    n_recs = <optimized out>
                    page_n_recs = <optimized out>
                    slot_free = <optimized out>
                    slot_len = <optimized out>
            #12 0x00005596d8dcb06c in page_cur_delete_rec (cursor=<optimized out>, offsets=0x7f0e983b97d0, mtr=0x7f0e983b9d00) at ./storage/innobase/page/page0cur.cc:2362
                    cur_dir_slot = <optimized out>
                    current_rec = 0x7f0e92306b62 ""
                    prev_rec = <optimized out>
                    next_rec = <optimized out>
                    cur_n_owned = 8
                    rec = <optimized out>
                    index = 0x7f0e5c0535f0
                    block = 0x7f0e8c0f46a0
                    cur_slot_no = 152
                    slot_rec = <optimized out>
                    data_size = <optimized out>
                    extra_size = <optimized out>
            #13 0x00005596d8e8807e in btr_cur_optimistic_delete (cursor=cursor@entry=0x7f0e983b9aa0, flags=flags@entry=0, mtr=mtr@entry=0x7f0e983b9d00) at ./storage/innobase/btr/btr0cur.cc:4619
                    page = <optimized out>
                    page_zip = <optimized out>
                    block = 0x7f0e8c0f46a0
                    rec = 0x7f0e92306b62 ""
                    heap = 0x0
                    offsets_ = {300, 3, 32774, 4, 8, 12, 0, 0, 11824, 23557, 32526, 0, 39664, 38971, 0, 0, 1, 0, 0, 0, 64144, 27638, 32526, 0, 11456, 35854, 32526, 0, 11824, 23557, 32526, 0, 1, 0, 0, 0, 64144, 27638, 32526, 0, 38992, 38971, 32526, 0, 64032, 27638, 32526, 0, 11824, 23557, 32526, 0, 64032, 27638, 32526, 0, 11824, 23557, 32526, 0, 39664, 38971, 32526, 0, 40608, 38971, 32526, 0, 42204, 55522, 21910, 0, 11456, 35854, 32526, 0, 4088, 0, 0, 0, 1, 0, 0, 0, 46830, 37311, 32526, 0, 39136, 38971, 32526, 0, 0, 0, 0, 0, 300, 15, 32780, 4, 10, 17, 21, 25, 176, 176, 185, 227, 231, 240, 244, 248, 252, 498, 0, 0, 40192, 38971, 32526, 0, 4089, 0, 0, 0, 11456, 35854, 32526, 0, 47973, 37311, 32526, 0, 40192, 38971, 32526, 0, 39168, 38971, 32526, 0, 36003, 55517, 21910, 0, 4, 0, 0, 0, 47973, 37311, 32526, 0, 39312, 38971, 32526, 0, 46156, 55516, 21910, 0, 41, 0, 0, 0, 47973, 37311, 32526, 0, 47340, 37311, 32526, 0, 49140, 37311, 32526, 0, 47340, 37311, 32526, 0, 2, 0, 0, 0, 16384, 0, 0, 0, 39392, 38971, 32526, 0, 33970, 37311, 32526, 0, 49140, 37311, 32526, 0...}
                    offsets = <optimized out>
                    err = DB_SUCCESS
            #14 0x00005596d8f1eef5 in row_undo_ins_remove_sec_low (mode=<optimized out>, index=<optimized out>, entry=<optimized out>, thr=<optimized out>) at ./storage/innobase/row/row0uins.cc:318
                    btr_cur = 0x7f0e983b9aa0
                    pcur = {btr_cur = {page_cur = {index = 0x7f0e5c0535f0, rec = 0x7f0e92304070 "", offsets = 0x0, block = 0x7f0e8c0f46a0}, purge_node = 0x0, thr = 0x0, flag = BTR_CUR_BINARY, tree_height = 3, up_match = 1, up_bytes = 0, low_match = 3, low_bytes = 0, n_fields = 0, n_bytes = 0, fold = 0, path_arr = 0x0, rtr_info = 0x0}, latch_mode = BTR_MODIFY_LEAF, old_rec = 0x0, old_n_core_fields = 0, old_n_fields = 0, rel_pos = 0, block_when_stored = {m_block = 0x0, m_page_id = {m_id = 0}}, modify_clock = 0, pos_state = BTR_PCUR_IS_POSITIONED, search_mode = PAGE_CUR_LE, trx_if_known = 0x0, old_rec_buf = 0x0, buf_size = 0}
                    err = DB_SUCCESS
                    mtr = {m_last = 0x7f0e8c0f46a0, m_last_offset = 48, m_log_mode = 0, m_modifications = 1, m_made_dirty = 0, m_inside_ibuf = 0, m_trim_pages = 0, m_memo = {<small_vector_base> = {BeginX = 0x7f0e983b9d20, Size = 2, Capacity = 16}, small = {{object = 0x7f0e5c053768, type = MTR_MEMO_S_LOCK}, {object = 0x7f0e8c0f46a0, type = MTR_MEMO_PAGE_X_MODIFY}, {object = 0x7f0e8c0f46a0, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f0e8c097900, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f0e8c0468c0, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f0e8c002080, type = MTR_MEMO_PAGE_X_MODIFY}, {object = 0x7f0e8c050c80, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f0e8c089080, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x7f0e5c04e4f8, type = MTR_MEMO_SPACE_X_LOCK}, {object = 0x7f0e8c08c780, type = MTR_MEMO_PAGE_SX_FIX}, {object = 0x7f0e8c050c80, type = 2956040257}, {object = 0x7f0e68000090, type = 0}, {object = 0x7, type = MTR_MEMO_PAGE_X_FIX}, {object = 0x428c0e2cc0, type = 1152}, {object = 0x7f0e8c139800, type = 605081585}, {object = 0x57, type = 1200}}}, m_log = {m_heap = 0x0, m_list = {<ilist<mtr_buf_t::block_t, void>> = {sentinel_ = {next = 0x7f0e983b9e48, prev = 0x7f0e983b9e48}}, size_ = 1}, m_size = 19, m_first_block = {<ilist_node<void>> = {next = 0x7f0e983b9e28, prev = 0x7f0e983b9e28}, m_data = "<omitted customer data>"..., m_used = 19}}, m_user_space = 0x7f0e5c04e4f8, m_commit_lsn = 0, m_freed_space = 0x0, m_freed_pages = 0x0}
                    modify_leaf = true
            #15 0x00005596d8f1f4f4 in row_undo_ins_remove_sec (thr=0x7f0e68095bf8, entry=0x7f0e6bfc6430, index=0x7f0e5c0535f0) at ./storage/innobase/row/row0uins.cc:353
                    err = <optimized out>
                    n_tries = 0
                    retry = <optimized out>
                    err = <optimized out>
                    n_tries = <optimized out>
            #16 row_undo_ins_remove_sec_rec (thr=0x7f0e68095bf8, node=0x7f0e6bf6fa20) at ./storage/innobase/row/row0uins.cc:549
                    entry = 0x7f0e6bfc6430
                    err = DB_SUCCESS
                    index = 0x7f0e5c0535f0
                    heap = 0x7f0e6bfc63c8
                    func_exit = <optimized out>
                    err = <optimized out>
                    index = <optimized out>
                    heap = <optimized out>
                    entry = <optimized out>
            #17 row_undo_ins (node=node@entry=0x7f0e6bf6fa20, thr=thr@entry=0x7f0e68095bf8) at ./storage/innobase/row/row0uins.cc:601
                    err = <optimized out>
                    dict_locked = false
            #18 0x00005596d8e2ab7b in row_undo (thr=0x7f0e68095bf8, node=0x7f0e6bf6fa20) at ./storage/innobase/row/row0undo.cc:401
                    err = <optimized out>
                    err = <optimized out>
            #19 row_undo_step (thr=thr@entry=0x7f0e68095bf8) at ./storage/innobase/row/row0undo.cc:452
                    err = <optimized out>
                    node = 0x7f0e6bf6fa20
                    trx = 0x7f0ea62aa680
            #20 0x00005596d8de36f2 in que_thr_step (thr=0x7f0e68095bf8) at ./storage/innobase/que/que0que.cc:586
                    node = 0x7f0e6bf6fa20
                    old_thr = 0x7f0e68095bf8
                    trx = <optimized out>
                    type = <optimized out>
                    node = <optimized out>
                    old_thr = <optimized out>
                    trx = <optimized out>
                    type = <optimized out>
            #21 que_run_threads_low (thr=<optimized out>) at ./storage/innobase/que/que0que.cc:644
                    next_thr = <optimized out>
                    trx = <optimized out>
            #22 que_run_threads (thr=0x7f0e68095bf8) at ./storage/innobase/que/que0que.cc:664
                    trx = 0x7f0ea62aa680
                    loop = <optimized out>
            #23 0x00005596d8e4b414 in trx_t::rollback_low (this=this@entry=0x7f0ea62aa680, savept=savept@entry=0x7f0ea62aafc0) at ./storage/innobase/trx/trx0roll.cc:125
                    thr = 0x7f0e680e8ad8
                    heap = 0x7f0e680e8978
                    roll_node = 0x7f0e680e89e0
            #24 0x00005596d8e4a300 in trx_t::rollback (savept=0x7f0ea62aafc0, this=0x7f0ea62aa680) at ./storage/innobase/trx/trx0roll.cc:176
            No locals.
            #25 trx_t::rollback (savept=0x7f0ea62aafc0, this=0x7f0ea62aa680) at ./storage/innobase/trx/trx0roll.cc:163
            No locals.
            #26 trx_rollback_last_sql_stat_for_mysql (trx=trx@entry=0x7f0ea62aa680) at ./storage/innobase/trx/trx0roll.cc:307
                    err = <optimized out>
            #27 0x00005596d8d4d2e0 in innobase_rollback (hton=<optimized out>, thd=0x7f0e68000c68, rollback_trx=<optimized out>) at ./storage/innobase/handler/ha_innodb.cc:4705
                    trx = 0x7f0ea62aa680
                    error = <optimized out>
            #28 0x00005596d8a4eed0 in ha_rollback_trans (thd=thd@entry=0x7f0e68000c68, all=all@entry=false) at ./sql/handler.cc:2224
                    err = <optimized out>
                    ht = <optimized out>
                    error = 0
                    trans = <optimized out>
                    ha_info = 0x7f0e68003628
                    ha_info_next = <optimized out>
                    is_real_trans = <optimized out>
            #29 0x00005596d892f9d9 in trans_rollback_stmt (thd=thd@entry=0x7f0e68000c68) at ./sql/transaction.cc:535
            No locals.
            #30 0x00005596d8802e50 in mysql_execute_command (thd=0x7f0e68000c68, is_called_from_prepared_stmt=<optimized out>) at ./sql/sql_parse.cc:6077
                    multi_delete_error = <optimized out>
                    wsrep_error_label = <optimized out>
                    error = <optimized out>
                    res = 1
                    up_result = <optimized out>
                    lex = 0x7f0e68004d88
                    select_lex = <optimized out>
                    first_table = <optimized out>
                    all_tables = 0x7f0e68013130
                    unit = 0x7f0e68004e50
                    have_table_map_for_update = <optimized out>
                    rpl_filter = <optimized out>
                    ots = {ctx = 0x7f0e68004a58, traceable = false}
                    orig_binlog_format = <optimized out>
                    orig_current_stmt_binlog_format = <optimized out>
            #31 0x00005596d88085f7 in mysql_parse (thd=0x7f0e68000c68, rawbuf=<optimized out>, length=<optimized out>, parser_state=<optimized out>) at ./sql/sql_parse.cc:8053
                    found_semicolon = <optimized out>
                    error = <optimized out>
                    lex = 0x7f0e68004d88
                    err = <optimized out>
            #32 0x00005596d880addd in dispatch_command (command=COM_QUERY, thd=0x7f0e68000c68, packet=<optimized out>, packet_length=<optimized out>, blocking=<optimized out>) at ./sql/sql_parse.cc:1993
                    beginning_of_next_stmt = <optimized out>
                    length = <optimized out>
                    packet_end = 0x7f0e3b861df4 ""
                    parser_state = {m_lip = {lookahead_token = -1, lookahead_yylval = 0x0, m_thd = 0x7f0e68000c68, m_ptr = 0x7f0e3b861df5 "\b", m_tok_start = 0x7f0e3b861df5 "\b", m_tok_end = 0x7f0e3b861df5 "\b", m_end_of_query = 0x7f0e3b861df4 "", m_tok_start_prev = 0x7f0e3b861df4 "", m_buf = 0x7f0e3afb4e00 "INSERT INTO `steps` VALUES\n(11626625,40592,907,\"<omitted customer data>"..., m_buf_length = 9097204, m_echo = true, m_echo_saved = true, m_cpp_buf = 0x7f0e38000070 "INSERT INTO `steps` VALUES\n(11626625,40592,907,\"<omitted customer data>"..., m_cpp_ptr = 0x7f0e388ad064 "", m_cpp_tok_start = 0x7f0e388ad064 "", m_cpp_tok_start_prev = 0x7f0e388ad064 "", m_cpp_tok_end = 0x7f0e388ad064 "", m_body_utf8 = 0x0, m_body_utf8_ptr = 0x0, m_cpp_utf8_processed_ptr = 0x0, next_state = MY_LEX_END, found_semicolon = 0x0, ignore_space = false, stmt_prepare_mode = false, multi_statements = true, yylineno = 26245, m_digest = 0x0, in_comment = NO_COMMENT, in_comment_saved = NO_COMMENT, m_cpp_text_start = 0x7f0e388acf97 "<omitted customer data>"..., m_cpp_text_end = 0x7f0e388ad062 "\")", m_underscore_cs = 0x0}, m_yacc = {yacc_yyss = 0x0, yacc_yyvs = 0x0, m_set_signal_info = {m_item = {0x0 <repeats 12 times>}}, m_lock_type = TL_READ_DEFAULT, m_mdl_type = MDL_SHARED_READ}, m_digest_psi = 0x0}
                    net = 0x7f0e68000f20
                    error = false
                    do_end_of_statement = true
                    drop_more_results = false
                    __FUNCTION__ = "dispatch_command"
                    res = <optimized out>
            #33 0x00005596d880cbe8 in do_command (thd=0x7f0e68000c68, blocking=blocking@entry=true) at ./sql/sql_parse.cc:1409
                    return_value = <optimized out>
                    packet = 0x7f0e3919adb8 "\003INSERT INTO `steps` VALUES\n(11626625,40592,907,\"<omitted customer data>"...
                    packet_length = <optimized out>
                    net = 0x7f0e68000f20
                    command = COM_QUERY
                    __FUNCTION__ = "do_command"
            #34 0x00005596d891efef in do_handle_one_connection (connect=<optimized out>, put_in_cache=true) at ./sql/sql_connect.cc:1416
                    create_user = true
                    thr_create_utime = <optimized out>
                    thd = 0x7f0e68000c68
            #35 0x00005596d891f33d in handle_one_connection (arg=arg@entry=0x5596db163948) at ./sql/sql_connect.cc:1318
                    connect = 0x5596db163948
            #36 0x00005596d8ca4ce6 in pfs_spawn_thread (arg=0x5596db1180f8) at ./storage/perfschema/pfs.cc:2201
                    typed_arg = 0x5596db1180f8
                    user_arg = 0x5596db163948
                    user_start_routine = 0x5596d891f2e0 <handle_one_connection(void*)>
                    pfs = <optimized out>
                    klass = <optimized out>
            #37 0x00007f0eb0309b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
                    ret = <optimized out>
                    pd = <optimized out>
                    out = <optimized out>
                    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140720411833360, -1557864560374536914, 139700660323904, 1, 139701062244432, 140720411833712, 1476171432797185326, 1476118612295460142}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
                    not_first_call = <optimized out>
            #38 0x00007f0eb039ba00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
            No locals.
            

            marko Marko Mäkelä made changes -

            marko Marko Mäkelä added a comment -

            The posted stack trace is specific to ROW_FORMAT=COMPRESSED. There is an open bug MDEV-31441 with a test case that reproduces corruption on 10.4; it may be related to this failure.

            Starting with 10.6, MDEV-515 allows a bulk load mechanism that will use table-level locking and undo logging when inserting into an empty table or partition. This mechanism is enabled by SET unique_checks=0,foreign_key_checks=0.

            My understanding is that Myloader would by default attempt to use multiple threads for loading a single table. That will obviously not work with the table-level locking. There exists MDEV-16281 for improving the data loading.
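
            A minimal sketch of that mechanism (the bulk_demo table and the test database are made up; the actual SQL issued by myloader may differ):

            # With both options set to 0, the first INSERT into the empty table can take
            # the MDEV-515 bulk-insert path (table-level locking and undo logging).
            mariadb -u root test -e "
              CREATE TABLE bulk_demo (id INT PRIMARY KEY, v VARCHAR(32)) ROW_FORMAT=COMPRESSED;
              SET unique_checks=0, foreign_key_checks=0;
              INSERT INTO bulk_demo VALUES (1,'a'),(2,'b');
              DROP TABLE bulk_demo;"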

            marko Marko Mäkelä made changes -

            bjhilbrands Bart-Jan Hilbrands added a comment -

            As for the relation to MDEV-31441: if they are indeed the same problem, that would mean it was fixed in 10.5 (at least as far as we can tell; we have been running 10.5 for five months now and have not seen the problem reoccur).

            Regarding the table-level locking, we did run myloader with the

            --threads 1

            option before downgrading to 10.5. But while the problem seemed much less likely to occur, it did eventually still happen.


            marko Marko Mäkelä added a comment -

            bjhilbrands, sorry, actually I meant MDEV-31574 and not MDEV-31441.

            The famous quote of E. W. Dijkstra is applicable here: Program testing can be used to show the presence of bugs, but never to show their absence!

            Depending on when transactions are rolled back or the purge of committed history is executed, the physical layout of the index pages or the entire index tree could be very different for the same logical contents. Also in MDEV-31817 we are able to reproduce some corruption on a ROW_FORMAT=COMPRESSED table.


            marko Marko Mäkelä added a comment -

            It turns out that MDEV-31574 had already been fixed in 10.5 by MDEV-12353. But I am rather certain that there are other bugs that cause corruption of ROW_FORMAT=COMPRESSED tables. We’d only need some data for reproducing them.


            bjhilbrands Bart-Jan Hilbrands added a comment -

            Working on generating some data right now! It's going to take a while, mostly because I'll have to make sure it is still reproducible with the generated data (I don't see a reason why not), and reproducing it might take several hours.


            bjhilbrands Bart-Jan Hilbrands added a comment -

            I have some data ready, but the file is too big; https://mariadb.com/kb/en/reporting-bugs/ mentions uploading to a public FTP server, but we are reluctant to use that. Would it be okay to invite you to a private SharePoint folder that contains the archived dump via email?


            marko Marko Mäkelä added a comment -

            bjhilbrands, I sent you an email.


            bjhilbrands Bart-Jan Hilbrands added a comment -

            Shared the drive. Perhaps some more detailed information: it always seems to be a steps table index that gets corrupted, but that might also be a statistical thing, as the vast majority of the data are steps.


            marko Marko Mäkelä added a comment -

            Thank you. I will get to this next week, after fixing MDEV-31441.


            marko Marko Mäkelä added a comment -

            bjhilbrands, can you please reduce the dump so that the error would reproduce much faster? For example:

            1. remove all tables other than steps and any tables to which it might be connected via FOREIGN KEY constraints (unless myloader does set foreign_key_checks=0)
            2. remove any non-unique indexes on the table steps other than the one where it crashes; in GDB, execute frame 11 to get to the stack frame of page_zip_dir_delete() and then print index->name to see the name of the index (a sketch of this gdb step follows this list)
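
            A sketch of that gdb step (the binary and core paths are assumptions, and this relies on gdb selecting the crashed thread by default so that frame 11 matches the trace posted above):

            gdb -batch -ex 'frame 11' -ex 'print index->name' \
                /usr/sbin/mariadbd /var/crash/core.mariadbd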

            I have had a single-threaded load of the dump running against a debug-instrumented server for over 102 CPU minutes (and some 75 minutes of real time) now, and it still has not crashed. The data directory has grown to 4.3 GiB during this time. I think that the main cause of the slowdown could be the cmake -DWITH_INNODB_EXTRA_DEBUG:BOOL=ON option, a.k.a. page_zip_validate(), which I think needs to be enabled to get to the root cause as early as possible.

            I am attempting to load the dump as follows:

            myloader --directory /dev/shm/data-dump2 --socket /dev/shm/mariadb.sock -u root --threads=1
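
            For reference, a debug-instrumented build like this could be produced roughly as follows (a sketch; the source and build directory names and the Debug build type are assumptions, only WITH_INNODB_EXTRA_DEBUG comes from the comment above):

            cmake -S server -B build-extra-debug \
                  -DCMAKE_BUILD_TYPE=Debug \
                  -DWITH_INNODB_EXTRA_DEBUG:BOOL=ON
            cmake --build build-extra-debug -j"$(nproc)"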
            

            marko Marko Mäkelä made changes -
            Status Open [ 1 ] Needs Feedback [ 10501 ]

            Uploaded a new package in the same drive containing only the steps table. Reproducing it with one thread is indeed quite painful. In fact, I had about 10 successful restores in a row using --threads=1 while looking into this, which then resulted in us rolling out a fix to only run myloader single-threaded everywhere. It then ran fine for about a week, before the same problem occurred (but this time on one server only). Just as a sanity check I'll try to actually reproduce it locally using just 1 thread.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            It does make me wonder if there were other factors involved that caused the failure during the single threaded restore.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            If it proves too time-consuming to reproduce it single-threaded, it might also be most productive to wait for MDEV-16281 and see whether it is still reproducible.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            bjhilbrands, the stack trace of your crash is during the rollback of an INSERT statement. I suppose that the reason is a deadlock between concurrent INSERT transactions. These can occur when there are multiple indexes in a table. I am not familiar with mydumper or myloader, but it would make sense to create the indexes only after loading the data. This should not only prevent the deadlocks and rollbacks from occurring, but also speed up the operation, because only one index tree would have to be updated during the data load, which would allow more efficient use of the buffer pool. Can you check whether this makes the crash go away?
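
            A minimal sketch of that approach, assuming the secondary index definitions are first removed from the table's *-schema.sql file inside the mydumper directory (the database name, index name and column name below are placeholders):

            # Load the data while only the clustered index exists (schema file pre-edited as described above).
            myloader --directory /dev/shm/data-dump-steps-only --socket /dev/shm/mariadb.sock -u root --threads=1
            # Build the secondary index once, after the load has completed.
            mariadb -u root -S /dev/shm/mariadb.sock -D test -e "ALTER TABLE steps ADD INDEX ix_example (example_col);"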

            In MDEV-30882 I already noted that there are hard-to-fix problems with rollback in ROW_FORMAT=COMPRESSED tables. I was not able to fully fix the bug. A page overflow during a rollback is possible. I have a data load running without --threads=1, but my build option cmake -DWITH_INNODB_EXTRA_DEBUG:BOOL=ON is slowing it down extremely. Only 2 CPU threads are occupied.

            marko Marko Mäkelä added a comment -

            I have had the data load running for 6.5 CPU hours (more than 3 wall-clock hours), and there has been no crash so far. The data directory has grown to only 5.1 GiB during this time. The build option that I am using will cause very frequent calls to page_zip_decompress(), which could easily slow down things by an order of magnitude. In any case, this does not look like a feasible option for reproducing the bug.

            marko Marko Mäkelä added a comment -

            bjhilbrands, the data load had finally completed on my debug build, without any crash:

            10:24:18 /dev/shm/10.6/mysql-test$ myloader --directory /dev/shm/data-dump-steps-only --socket /dev/shm/mariadb.sock -u root
            15:50:30 /dev/shm/10.6/mysql-test$
            

            The data directory size grew to 5.8 GiB, of which 4.0 GiB is the ib_logfile0. The server was shut down successfully, after consuming 706 minutes and 33 seconds of CPU time:

            10.6 20b0ec9aae38faaa4d249f67315cc5394eaea22e

            2023-11-29 16:36:48 0 [Note] InnoDB: Shutdown completed; log sequence number 13600918514; transaction id 1814
            

            According to the LSN output, it had written some 13 GiB to the circular ib_logfile0.

            marko Marko Mäkelä added a comment -

            marko mentioned the potential ROLLBACK issues in this ticket during re-testing of MDEV-31817 (that bug no longer reproduced in the patched branch 10.6-MDEV-32371 at commit 963a88be2e89100260e09e46e90e2b3271444899), so I added ROLLBACK and START TRANSACTION to the test using a random selection, as follows:

            #!/bin/bash
            # Start server with --max_connections=10000 --log-bin
            # Set variables and ensure ramloc is a ramdisk or tmpfs (i.e. /dev/shm)
             
            user="root"
            socket="./socket.sock"
            db="test"
            client="./bin/mariadb"
            errorlog="./log/master.err"
            ramloc="/dev/shm"
            threads=2000   # Number of concurrent threads
            queries=100    # Number of t1/t2 INSERTs per thread/per test round
            rounds=999999  # Number of max test rounds
             
            # Setup
            ${client} -u ${user} -S ${socket} -D ${db} -e "
            DROP TABLE IF EXISTS t1;
            DROP TABLE IF EXISTS t2;
            CREATE TABLE t1 (c1 INT NOT NULL AUTO_INCREMENT, c2 INT NOT NULL, PRIMARY KEY (c1), UNIQUE KEY u1 (c1,c2)) ENGINE=InnoDB AUTO_INCREMENT=1 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4; 
            CREATE TABLE t2 (c1 DATETIME NOT NULL, c2 DOUBLE NOT NULL, t1_c1 INT NOT NULL, PRIMARY KEY (t1_c1,c1)) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;
            "
             
            insert_rows(){
              SQL=
              for ((i=0;i<${queries};i++)); do
                rand=$((RANDOM % 4 + 1))
                if [ ${rand} -eq 1 ]; then
                  SQL="${SQL}INSERT INTO t1 (c2) VALUES (0);INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()));"
                elif [ ${rand} -eq 2 ]; then
                  SQL="${SQL}START TRANSACTION;"
                else
                  SQL="${SQL}ROLLBACK;"
                fi
              done
              ${client} -u ${user} -S ${socket} -D ${db} -e "${SQL}"
              rm -f ${ramloc}/${prefix}_md_proc_${1}  # Thread done
            }
             
            abort(){ jobs -p | xargs -P100 kill >/dev/null 2>&1; rm -Rf ${ramloc}/${prefix}_md_proc_*; exit 1; }
            trap abort SIGINT
             
            count=0
            prefix="$(echo "${RANDOM}${RANDOM}${RANDOM}" | cut -b1-5)"
            rm -f ${ramloc}/${prefix}_md_proc_*
            for ((i=0;i<${rounds};i++)); do  # Note: the inner loop below reuses i, so in practice this outer loop runs until the script is aborted
              for ((i=0;i<${threads};i++)); do
                if [ ! -r ${ramloc}/${prefix}_md_proc_${i} ]; then  # Thread idle
                  touch ${ramloc}/${prefix}_md_proc_${i}  # Thread busy
                  insert_rows ${i} &
                  if [ $[ ${count} % 1000 ] -eq 0 ]; then
                    ${client} -u ${user} -S ${socket} -D ${db} -e "TRUNCATE TABLE t1;"
                  fi
                  count=$[ ${count} + 1 ]
                  if [ $[ ${count} % 100 ] -eq 0 ]; then  # Limit disk I/O, check once every new 100 threads
                    echo "Count: ${count}" | tee lastcount.log
                    TAIL="$(tail -n10 ${errorlog} | tr -d '\n')"
                    if [[ "${TAIL}" == *"ERROR"* ]]; then
                      echo '*** Error found:'
                      grep -i 'ERROR' log/master.err
                      abort
                    elif [[ "${TAIL}" == *"down complete"* ]]; then
                      echo '*** Server shutdown'
                      abort
                    elif ! ${client}-admin ping -u ${user} -S ${socket} > /dev/null 2>&1; then
                      echo '*** Server gone (killed/crashed)'
                      abort
                    fi
                  fi
                fi
              done
            done
            

            And within 111,800 iterations I observed:

            10.6-MDEV-32371 963a88be2e89100260e09e46e90e2b3271444899 (Optimized)

            Count: 111500
            Count: 111600
            Count: 111700
            Count: 111800
            ERROR 1034 (HY000) at line 1: Index for table 't1' is corrupt; try to repair it
            ERROR 1034 (HY000) at line 1: Index for table 't1' is corrupt; try to repair it
            ERROR 1034 (HY000) at line 1: Index for table 't1' is corrupt; try to repair it
            Count: 111900
            *** Error found:
            2023-11-30 14:17:25 117465 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117470 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117468 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117468 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            2023-11-30 14:17:25 117465 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            2023-11-30 14:17:25 117470 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            

            Reflected in the error log as:

            10.6.17 963a88be2e89100260e09e46e90e2b3271444899 (Optimized)

            2023-11-30 13:51:15 0 [Note] /test/MARKO_MD281123-mariadb-10.6.17-linux-x86_64-opt/bin/mariadbd: ready for connections.
            Version: '10.6.17-MariaDB-log'  socket: '/test/MARKO_MD281123-mariadb-10.6.17-linux-x86_64-opt/socket.sock'  port: 11471  MariaDB Server
            2023-11-30 13:51:33 2392 [Warning] Aborted connection 2392 to db: 'unconnected' user: 'unauthenticated' host: 'localhost' (This connection closed normally without authentication)
            2023-11-30 13:52:37 5935 [Note] InnoDB: Number of pools: 2
            2023-11-30 14:17:25 117465 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117470 [Note] InnoDB: btr_page_get_parent 2
            2023-11-30 14:17:25 117470 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117468 [Note] InnoDB: btr_page_get_parent 2
            2023-11-30 14:17:25 117468 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            2023-11-30 14:17:25 117468 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            2023-11-30 14:17:25 117465 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            2023-11-30 14:17:25 117470 [ERROR] mariadbd: Index for table 't1' is corrupt; try to repair it
            

            The extra [Note] debug messages are the result of this commit.
            I will try to see if I can get an rr trace; however, MDEV-31817 has proven to be rr-elusive, which likely means this issue will be too.

            Roel Roel Van de Paar added a comment -
            Roel Roel Van de Paar made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]
            Roel Roel Van de Paar made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]

            Roel, thank you. It is now clear that the hard-to-fix ROLLBACK issue on ROW_FORMAT=COMPRESSED tables that I identified in MDEV-30882 is rather easy to hit.

            I believe that the reason why I was not able to reproduce the problem with bjhilbrands’s data dump was that the relative timing of the SQL threads was different due to the enabled UNIV_ZIP_DEBUG checks, and far fewer rollbacks (due to deadlock or lock wait timeout) took place.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -

            Thank you indeed! This specific restore takes about 30 minutes to complete for me locally, and it had to run five times before the problem re-occurred. So even if it was still reproducible with the additional debug checks as you mentioned, it probably would have taken multiple days of non-stop loading for the problem to occur, which I agree is not feasible.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            If it matters for the analysis, the script will see the occasional duplicate key error:

            Count: 1900
            Count: 2000
            ERROR 1062 (23000) at line 1: Duplicate entry '1610-2023-11-30 19:10:05' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '2862-2023-11-30 19:10:05' for key 'PRIMARY'
            Count: 2100
            

            Roel Roel Van de Paar added a comment -
            marko Marko Mäkelä made changes -
            Component/s Backup [ 13902 ]
            Summary Table corruption during database restore ROW_FORMAT=COMPRESSED table corruption due to ROLLBACK

            Roel, are you able to reproduce this?

            marko Marko Mäkelä added a comment -
            Roel Roel Van de Paar made changes -
            Attachment MDEV-32174_ps.txt [ 72878 ]

            marko Hi! Sorry, I have been busy with Replication and Spider. I set up a new rr run with:

            cmake -DCMAKE_C{,XX}_FLAGS='-Og -march=native -mtune=native -Wno-error=maybe-uninitialized -Wno-error=unused-result' -DCMAKE_BUILD_TYPE=Debug -DPLUGIN_PERFSCHEMA=NO -DWITH_DBUG_TRACE=OFF -DWITH_INNODB_EXTRA_DEBUG=ON && make -j128
            

            However, I immediately noticed that the processing (done count) seemed to be going much more slowly, though perhaps that is caused by rr slowness.
            It looks like threads are getting stuck. I uploaded MDEV-32174_ps.txt for your review. I will let it run, though at this rate it will take maybe a week or so to hit the same issue.
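
            For context, a recording like this one is typically started roughly as follows (the binary path and server options below are placeholders, not the exact ones used in this run):

            # Record the server under rr so a later crash can be replayed deterministically.
            rr record ./bin/mariadbd --no-defaults --datadir=./data --socket=./socket.sock --max_connections=10000 --log-bin
            # After a crash, replay the most recent trace (rr replay picks the latest trace by default).
            rr replay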

            Roel Roel Van de Paar added a comment -

            Confirmed that the slowness is due to rr. Without it, the count increases considerably more quickly. Also, the processlist, while at times still showing threads that are 15-20 seconds old, does continually clear up.

            Roel Roel Van de Paar added a comment -
            Roel Roel Van de Paar added a comment - - edited

            The rr instance ran into:

            10.6.17 12d05c8266567ea9a62ccabc3180a90129bf81e0

            Version: '10.6.17-MariaDB-debug-log'  socket: '/test/MDEV-32174_mariadb-10.6.17-linux-x86_64_2/socket.sock'  port: 12864  Source distribution
            2024-01-30 13:43:02 1571 [Note] InnoDB: Number of pools: 2
            2024-01-30 14:08:01 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
            240130 14:08:01 [ERROR] mysqld got signal 6 ;
            

            10.6.17 12d05c8266567ea9a62ccabc3180a90129bf81e0

            Core was generated by `/test/MDEV-32174_mariadb-10.6.17-linux-x86_64_2/bin/mariadbd --no-defaults --ma'.
            Program terminated with signal SIGABRT, Aborted.
            #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6)
                at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
            [Current thread is 1 (Thread 0x39ef7f1ee700 (LWP 389013))]
            (gdb) bt
            #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
            #1  0x000055ca95c706fe in my_write_core (sig=sig@entry=6) at /test/10.6_opt/mysys/stacktrace.c:424
            #2  0x000055ca954e8a55 in handle_fatal_signal (sig=6) at /test/10.6_opt/sql/signal_handler.cc:357
            #3  <signal handler called>
            #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
            #5  0x000046220f678859 in __GI_abort () at abort.c:79
            #6  0x000055ca95a69f84 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>) at /test/10.6_opt/storage/innobase/ut/ut0ut.cc:515
            #7  0x000055ca95a1256f in srv_monitor_task () at /test/10.6_opt/storage/innobase/include/ut0ut.h:393
            #8  0x000055ca95c16003 in tpool::thread_pool_generic::timer_generic::run (this=0x55ca97277540) at /test/10.6_opt/tpool/tpool_generic.cc:344
            #9  tpool::thread_pool_generic::timer_generic::execute (arg=0x55ca97277540) at /test/10.6_opt/tpool/tpool_generic.cc:364
            #10 0x000055ca95c16f06 in tpool::task::execute (this=0x55ca97277580) at /test/10.6_opt/tpool/task.cc:37
            #11 0x000055ca95c14a9f in tpool::thread_pool_generic::worker_main (this=0x55ca971bd820, thread_var=0x55ca971bdfe0) at /test/10.6_opt/tpool/tpool_generic.cc:583
            #12 0x000055ca95c15e52 in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
            #13 std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
            #14 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
            #15 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
            #16 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
            #17 0x000048d363f37de4 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
            #18 0x000008d07419b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #19 0x000046220f775133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            marko Do you want to check the rr trace for this?

            rr: /test/MDEV-32174_mariadb-10.6.17-linux-x86_64_2/rr/latest-trace
            

            Related: MDEV-24911


            Roel, I think that most hangs or deadlocks involve only 2 or 3 threads. The rr replay trace that you produced for my analysis contains 2,013 threads. Due to the way rr record works, this may very easily lead to unfair scheduling and starvation of threads. Such “fake hangs” are best avoided by reducing the number of concurrent threads or by reducing some configuration parameters, or even by patching the source code, such as reducing OS_AIO_N_PENDING_IOS_PER_THREAD from 256 to 1, which we did to uncover MDEV-32861 and to reduce the amount of false positive hang reports.

            I believe that this particular bug (a corruption on ROLLBACK) should be reproducible with 1 connection thread, or maybe at most 2, if the rollbacks are being provoked not by explicit ROLLBACK statements or by duplicate key errors, but by deadlocks or lock wait timeouts between conflicting transactions. I do not see any reason to use thousands of concurrent connections.
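
            A minimal sketch of such a two-connection workload, provoking the rollback of an INSERT via a lock wait timeout on a ROW_FORMAT=COMPRESSED table (the client path, socket and table definition are placeholders; whether this actually reproduces the corruption depends on timing):

            client="./bin/mariadb -u root -S ./socket.sock -D test"
            $client -e "CREATE TABLE IF NOT EXISTS t1 (c1 INT PRIMARY KEY, c2 INT, KEY k2 (c2)) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;"
            # Connection A inserts c1=1 and keeps its transaction open for a few seconds.
            $client -e "START TRANSACTION; INSERT INTO t1 VALUES (1,1); DO SLEEP(5); COMMIT;" &
            sleep 1
            # Connection B attempts the same key, waits on A's lock, times out after 1 second,
            # and has its INSERT rolled back, exercising the rollback path on a compressed table.
            $client -e "SET SESSION innodb_lock_wait_timeout=1; INSERT INTO t1 VALUES (1,2);"
            wait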

            I tried to find the latest threads that are contending on dict_sys.latch, but the trace is huge and the system is loaded with other processes. Already halfway through the trace (at when 2000000) the dict_sys.latch_ex_wait_start was at its final value 1706582093182145 (LANG=C TZ=UTC0 date --date @1706582093.182145 converts it to a readable timestamp). The interesting part of the trace should be somewhere after when 1534726 (timestamp 1706582052682666). In a few minutes I might find it. I only do this to prove that this is a false alarm.
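
            As a worked example of that conversion, applying it to the value quoted above gives:

            $ LANG=C TZ=UTC0 date --date @1706582093.182145
            Tue Jan 30 02:34:53 UTC 2024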

            marko Marko Mäkelä added a comment -

            The final wait for exclusive dict_sys.latch started in Thread 1014 at when 1680126, in the master callback task. This will block any subsequent waits for dict_sys.latch until the exclusive latch has been granted and released. Here is the stack trace:

            10.6 12d05c8266567ea9a62ccabc3180a90129bf81e0

            #13 0x000055ca95b3a4a2 in ssux_lock_impl<false>::wr_lock (this=0x55ca965a2340 <dict_sys+64>) at /test/10.6_opt/storage/innobase/dict/dict0dict.cc:966
            #14 dict_sys_t::lock_wait (this=this@entry=0x55ca965a2300 <dict_sys>) at /test/10.6_opt/storage/innobase/dict/dict0dict.cc:961
            #15 0x000055ca9581f65e in dict_sys_t::lock (this=this@entry=0x55ca965a2300 <dict_sys>) at /test/10.6_opt/storage/innobase/include/dict0dict.h:1524
            #16 0x000055ca95b3ee6b in dict_sys_t::evict_table_LRU (this=0x55ca965a2300 <dict_sys>, half=half@entry=false) at /test/10.6_opt/storage/innobase/dict/dict0dict.cc:1300
            #17 0x000055ca95a0f5c5 in srv_master_do_idle_tasks (counter_time=counter_time@entry=2321294823926) at /test/10.6_opt/storage/innobase/srv/srv0srv.cc:1465
            #18 0x000055ca95a11a4c in srv_master_callback () at /test/10.6_opt/storage/innobase/srv/srv0srv.cc:1516
            #19 0x000055ca95c16003 in tpool::thread_pool_generic::timer_generic::run (this=0x55ca972864f0) at /test/10.6_opt/tpool/tpool_generic.cc:344
            

            Everything that I have observed so far suggests that between this point and the end of the rr replay trace, one thread will be holding dict_sys.latch in shared mode. We will need to identify this thread and find out what is blocking it. If it turns out that rr record is neglecting this thread (not executing it at all), then this is the type of “fake hang” that I suspected and expected. To work around an occasional rr replay bug where reverse-continue may miss breakpoints or watchpoints, I restarted the trace slightly before the final wait started. At this point, the dict_sys.latch was still unoccupied:

            watch -l dict_sys.latch_ex_wait_start._M_i
            watch -l dict_sys.latch.readers._M_i
            display dict_sys.latch_ex_wait_start
            display dict_sys.latch
            display dict_sys.latch_ex
            run 1600000
            

            At when 1600089, dict_sys.latch was both acquired and released in shared mode by Thread 6, and again reacquired at when 1600174. At this rate it will be several minutes before when 1680126 (the start of the wait in Thread 1014) will be reached.

            marko Marko Mäkelä added a comment -

            The requests to acquire and release a shared dict_sys.latch in Thread 6 are by the purge coordinator task. Here is one stack trace:

            10.6 12d05c8266567ea9a62ccabc3180a90129bf81e0

            #3  0x000055ca95a219d1 in purge_sys_t::close_and_reopen (
                this=this@entry=0x55ca96e99e40 <purge_sys>, id=<optimized out>, 
                thd=thd@entry=0x55ca97392688, mdl=mdl@entry=0x5a8517a51ae8)
                at /test/10.6_opt/storage/innobase/trx/trx0purge.cc:1237
            #4  0x000055ca95a28811 in trx_purge_attach_undo_recs (
                n_purge_threads=n_purge_threads@entry=4, thd=thd@entry=0x55ca97392688)
                at /test/10.6_opt/storage/innobase/trx/trx0purge.cc:1349
            #5  0x000055ca95a28c1a in trx_purge (n_tasks=n_tasks@entry=4, 
                history_size=3911) at /test/10.6_opt/storage/innobase/trx/trx0purge.cc:1453
            #6  0x000055ca95a1232d in purge_coordinator_state::do_purge (
                this=0x55ca96e99020 <purge_state>)
                at /test/10.6_opt/storage/innobase/srv/srv0srv.cc:1638
            #7  purge_coordinator_callback ()
                at /test/10.6_opt/storage/innobase/srv/srv0srv.cc:1716
            

            However, I realized that the dict_sys.latch was released after when 1680126. The trace is executing extremely slowly. This is the latest that I got, after several hours of waiting:

            10.6 12d05c8266567ea9a62ccabc3180a90129bf81e0

            Thread 2015 hit Hardware watchpoint 2: -location dict_sys.latch.readers._M_i
             
            Old value = 2147483648
            New value = 0
            ssux_lock_impl<false>::wr_unlock (this=0x55ca965a2340 <dict_sys+64>) at /test/10.6_opt/storage/innobase/include/srw_lock.h:306
            306	    writer.wr_unlock();
            1: dict_sys.latch_ex_wait_start = {<std::__atomic_base<unsigned long long>> = {_M_i = 0}, <No data fields>}
            2: dict_sys.latch_ex = {m = {<std::__atomic_base<unsigned long>> = {_M_i = 0}, <No data fields>}}
            3: dict_sys.latch = {writer = {lock = {<std::__atomic_base<unsigned int>> = {_M_i = 2147483649}, <No data fields>}}, readers = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}}
            Current event: 2788949
            [Switching to Thread 350035.350041]
             
            Thread 6 hit Hardware watchpoint 2: -location dict_sys.latch.readers._M_i
             
            Old value = 0
            New value = 1
            0x000055ca95a20e42 in std::__atomic_base<unsigned int>::compare_exchange_weak (__m2=std::memory_order_relaxed, __m1=std::memory_order_acquire, __i2=1, __i1=<synthetic pointer>: <optimized out>, this=<optimized out>) at /usr/include/c++/9/bits/atomic_base.h:448
            448	      compare_exchange_weak(__int_type& __i1, __int_type __i2,
            1: dict_sys.latch_ex_wait_start = {<std::__atomic_base<unsigned long long>> = {_M_i = 0}, <No data fields>}
            2: dict_sys.latch_ex = {m = {<std::__atomic_base<unsigned long>> = {_M_i = 0}, <No data fields>}}
            3: dict_sys.latch = {writer = {lock = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}}, readers = {<std::__atomic_base<unsigned int>> = {_M_i = 1}, <No data fields>}}
            Current event: 2789015
             
            Thread 6 hit Hardware watchpoint 2: -location dict_sys.latch.readers._M_i
             
            Old value = 1
            New value = 0
            0x000055ca95a21123 in ssux_lock_impl<false>::rd_unlock (this=<optimized out>) at /usr/include/c++/9/bits/atomic_base.h:549
            549	      fetch_sub(__int_type __i,
            1: dict_sys.latch_ex_wait_start = {<std::__atomic_base<unsigned long long>> = {_M_i = 0}, <No data fields>}
            2: dict_sys.latch_ex = {m = {<std::__atomic_base<unsigned long>> = {_M_i = 0}, <No data fields>}}
            3: dict_sys.latch = {writer = {lock = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}}, readers = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}}
            Current event: 2789015
            

            That is, a shared dict_sys.latch was released, reacquired and released at this point, and acquired in shared mode after that. The final wait must have started somewhere between that and the final when count (slightly above 4 million).

            I have never seen rr replay traces executing this slowly when there is no other load on the system. I think that it must be due to the huge number of threads and the constant context switches between them. I will assume that this is a “fake hang”.

            Please try to reproduce the bug of this ticket with a more reasonable configuration.

            marko Marko Mäkelä added a comment -

            The last I observed before logging off the environment was a shared latch release at when 2790836. The final exclusive lock wait should have started somewhere after that.

            marko Marko Mäkelä added a comment -
            Roel Roel Van de Paar made changes -
            Assignee Marko Mäkelä [ marko ] Roel Van de Paar [ roel ]
            Roel Roel Van de Paar made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            Roel Roel Van de Paar made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Assignee Roel Van de Paar [ roel ] Marko Mäkelä [ marko ]
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -
            Status Stalled [ 10000 ] Needs Feedback [ 10501 ]

            bjhilbrands or Roel, have you been able to reproduce this with a more recent version of MariaDB Server? With the logical dump mentioned in MDEV-35779, I got MariaDB Server 10.6.18 to report corruption, but the current release seemed to be OK. I suspect that the failure is nondeterministic, so we’d better test it several times.

            marko Marko Mäkelä added a comment -

            No corruption issues thus far on recent & current 10.6 tests:

            CS 10.6.21 c05e7c4e0eff174a1f2b5e6efd53d80954f9e34b (Optimized)

            2025-01-28 15:19:35 0 [Note] /test/MD210125-mariadb-10.6.21-linux-x86_64-opt/bin/mariadbd: ready for connections.
            Version: '10.6.21-MariaDB-log'  socket: '/test/MD210125-mariadb-10.6.21-linux-x86_64-opt/socket.sock'  port: 12427  MariaDB Server
            2025-01-28 15:20:32 1511 [Note] InnoDB: Number of pools: 2
            2025-01-28 15:20:45 3015 [Note] InnoDB: Number of pools: 3
            2025-01-28 15:21:01 4524 [Note] InnoDB: Number of pools: 4
            2025-01-28 15:21:20 6027 [Note] InnoDB: Number of pools: 5
            2025-01-28 15:22:11 7533 [Note] InnoDB: Number of pools: 6
            2025-01-28 15:25:26 1804 [Note] Detected table cache mutex contention at instance 1: 55% waits. Additional table cache instance activated. Number of instances after activation: 2.
            

            I did, however, see a few duplicate key errors again:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            ERROR 1062 (23000) at line 1: Duplicate entry '908-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '869-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '969-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '924-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1062-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1227-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1195-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1166-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1290-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1275-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1416-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1461-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1753-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1861-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1853-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '878-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1325-2025-01-28 15:57:58' for key 'PRIMARY'
            ERROR 1062 (23000) at line 1: Duplicate entry '1897-2025-01-28 15:57:58' for key 'PRIMARY'
            

            The scarcity/infrequency of these errors (they are almost never seen, even when running 8k threads), combined with the fact that, as far as I can see, they should not happen at all, makes me believe there is a highly sporadic race-condition bug involving LAST_INSERT_ID. The query that produced these errors is:

            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            

            Roel Roel Van de Paar added a comment -
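
            For readers of this ticket: a minimal SQL sketch of the kind of t1/t2 workload described above. The table definitions below are assumptions reconstructed from the duplicate-entry messages and the quoted INSERT; the actual schema used by script1.sh is not attached here.

            {code:sql}
            -- Hypothetical reconstruction of the stress-test tables; column names are invented.
            CREATE TABLE t1 (
              id      BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
              payload INT NOT NULL
            ) ENGINE=InnoDB;

            CREATE TABLE t2 (
              ts    TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
              flag  INT NOT NULL,
              t1_id BIGINT NOT NULL,
              PRIMARY KEY (t1_id, ts)  -- matches the 'id-timestamp' format of the duplicate-entry errors
            ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED;  -- COMPRESSED is assumed, as in the tables this ticket is about

            -- Each concurrent client session repeatedly runs:
            INSERT INTO t1 (payload) VALUES (0);
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()));
            {code}

            Since LAST_INSERT_ID() is connection-scoped and AUTO_INCREMENT values are not reused, two sessions should never produce the same (t1_id, ts) pair, so a duplicate-key error from the second INSERT is indeed unexpected.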
            Roel Roel Van de Paar added a comment - - edited

            An overnight run did not produce any new occurrences. Please do have a look at the duplicate key issue mentioned in the last comment. It happens only very sporadically.


            The duplicate key errors might be related to the corner that I cut while fixing MDEV-30882. Does CHECK TABLE…EXTENDED (MDEV-24402) report any errors for these ROW_FORMAT=COMPRESSED tables?

            marko Marko Mäkelä added a comment -
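
            For reference, the check being asked about can be run as below; the table name is a placeholder for whichever ROW_FORMAT=COMPRESSED tables the workload touches.

            {code:sql}
            -- EXTENDED (MDEV-24402) makes CHECK TABLE perform additional InnoDB consistency checks;
            -- anything other than a single "status | OK" row would be worth attaching to this ticket.
            CHECK TABLE test.t2 EXTENDED;
            {code}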

            Had an interesting run this morning with this. This time I used:

            threads=8000   # Number of concurrent threads
            queries=100    # Number of t1/t2 INSERTs per thread/per test round
            

            And here is what I saw:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            /test/MDEV-32174_mariadb-10.6.21-linux-x86_64$ ./script1.sh 
            Count: 100
            Count: 200
            Count: 300
            Count: 400
            Count: 500
            ...normal count operation continues...
            Count: 2600
            Count: 2700
            Count: 2800
            Count: 2900
            Count: 3000
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
             
            ERROR 1062 (23000) at line 1: Duplicate entry '2198-2025-02-10 06:38:40' for key 'PRIMARY'
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
             
            ERROR 1062 (23000) at line 1: Duplicate entry '1931-2025-02-10 06:38:40' for key 'PRIMARY'
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
            ...additional similar duplicate entries...
            ERROR 1062 (23000) at line 1: Duplicate entry '2781-2025-02-10 06:38:40' for key 'PRIMARY'
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
             
            ERROR 1062 (23000) at line 1: Duplicate entry '3123-2025-02-10 06:38:40' for key 'PRIMARY'
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
             
            ERROR 1062 (23000) at line 1: Duplicate entry '3179-2025-02-10 06:38:40' for key 'PRIMARY'
            Count: 3100
            Count: 3200
            Count: 3300
            Count: 3400
            Count: 3500
            ...normal count operation continues...
            Count: 4600
            Count: 4700
            Count: 4800
            Count: 4900
            Count: 5000
            --------------
            INSERT INTO t2 VALUES (CURRENT_TIMESTAMP, 0, (SELECT LAST_INSERT_ID()))
            --------------
             
            ERROR 1062 (23000) at line 1: Duplicate entry '858-2025-02-10 06:39:03' for key 'PRIMARY'
            Count: 5100
            Count: 5200
            Count: 5300
            Count: 5400
            Count: 5500
            Count: 5600
            Count: 5700
            Count: 5800
            Count: 5900
            Count: 6000
            ./script1.sh: line 23: 2193819 Killed                  ${client} -A --skip-ssl-verify-server-cert --skip-ssl --force --binary-mode -u ${user} -S ${socket} -D ${db} -e "${SQL}"
            ./script1.sh: line 23: 2193829 Killed                  ${client} -A --skip-ssl-verify-server-cert --skip-ssl --force --binary-mode -u ${user} -S ${socket} -D ${db} -e "${SQL}"
            ...many similar such messages...
            ./script1.sh: line 23: 2194480 Killed                  ${client} -A --skip-ssl-verify-server-cert --skip-ssl --force --binary-mode -u ${user} -S ${socket} -D ${db} -e "${SQL}"
            ./script1.sh: line 23: 2194481 Killed                  ${client} -A --skip-ssl-verify-server-cert --skip-ssl --force --binary-mode -u ${user} -S ${socket} -D ${db} -e "${SQL}"
            Count: 6100
            Count: 6200
            Count: 6300
            Count: 6400
            Count: 6500
            ...script continues...
            

            Firstly, it seems interesting that the single '858-2025-02-10 06:39:03' duplicate entry occurrence happened quite some time after the earlier batch of duplicate entries, again suggesting some sort of (highly sporadic) race condition bug, as we were discussing. Immediately after this (say around count 5500), I ran CHECK TABLE and the table was fine:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            10.6.21>CHECK TABLE t2 EXTENDED;
            +---------+-------+----------+----------+
            | Table   | Op    | Msg_type | Msg_text |
            +---------+-------+----------+----------+
            | test.t2 | check | status   | OK       |
            +---------+-------+----------+----------+
            1 row in set (0.037 sec)
            

            However, a little later the client kills happened, likely as a result of the automated server resource monitoring we use.
            Immediately after this, CHECK TABLE reported a warning:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            10.6.21>CHECK TABLE t2 EXTENDED;
            +---------+-------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table   | Op    | Msg_type | Msg_text                                                                                                                                                                                                                                           |
            +---------+-------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | test.t2 | check | Warning  | InnoDB: Clustered index record with stale history in table `test`.`t2`: COMPACT RECORD(info_bits=0, 5 fields): {[4]    (0x80000002),[5]   i (0x99B5D469D5),[6]    \ (0x000000015CBF),[7]      T(0xD000000C0D0554),[8]        (0x0000000000000000)} |
            | test.t2 | check | status   | OK                                                                                                                                                                                                                                                 |
            +---------+-------+----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            2 rows in set (0.655 sec)
            

            And checking the error log, we see it right in the midst of the killed connections:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            ...more aborted connections before...
            2025-02-10  6:39:32 5829 [Warning] Aborted connection 5829 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)
            2025-02-10  6:39:32 5832 [Warning] Aborted connection 5832 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)
            2025-02-10  6:39:33 4052 [Warning] InnoDB: Clustered index record with stale history in table `test`.`t2`: COMPACT RECORD(info_bits=0, 5 fields): {[4]    (0x80000002),[5]   i (0x99B5D469D5),[6]    \ (0x000000015CBF),[7]      T(0xD000000C0D0554),[8]        (0x0000000000000000)}
            2025-02-10  6:39:33 5848 [Warning] Aborted connection 5848 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)
            2025-02-10  6:39:33 5912 [Warning] Aborted connection 5912 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)
            2025-02-10  6:39:34 5924 [Warning] Aborted connection 5924 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)
            ...more aborted connections after...
            

            Could client kills (or flaky connections) be the cause not just of this warning but also of the earlier observed corruptions?

            Roel Roel Van de Paar added a comment -
            Roel Roel Van de Paar added a comment - - edited

            The "Clustered index record with stale history in table" warning is readily reproducible, even for different record locations:

            MDEV-32174 CS 10.6.21 831f5bc66f8d2147edd7991caf69e34058566b67 (Debug)

            | test.t2 | check | Warning  | InnoDB: Clustered index record with stale history in table `test`.`t2`: COMPACT RECORD(info_bits=0, 5 fields): {[4]   l(0x8000006C),[5]   z (0x99B5D47AE8),[6]      (0x0000000C86B7),[7]   "   (0xD00000220716A0),[8]        (0x0000000000000000)} |
            ...
            | test.t2 | check | Warning  | InnoDB: Clustered index record with stale history in table `test`.`t2`: COMPACT RECORD(info_bits=0, 5 fields): {[4]    (0x80000016),[5]   { (0x99B5D47B0C),[6]     a(0x0000000C8661),[7]      8(0xFA00000CF30938),[8]        (0x0000000000000000)} |
            ...
            | test.t2 | check | Warning  | InnoDB: Clustered index record with stale history in table `test`.`t2`: COMPACT RECORD(info_bits=0, 5 fields): {[4]    (0x80000020),[5]   {G(0x99B5D47B47),[6]     h(0x0000000DDD68),[7]   $ ) (0xC5000024D6299C),[8]        (0x0000000000000000)} |
            

            It seems these are "repaired" without problems; the table shows OK afterwards. However, when many threads are running, these warnings are observed near-constantly.

            I also tried reproducing corruption by running a kill script which randomly kills connections, but no corruption was observed, which seems consistent with your findings.

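            The kill script itself is not attached; a minimal SQL-only sketch of terminating test clients by hand, assuming they connect as user 'root' to the 'test' database as in the aborted-connection log lines above, could look like this.

            {code:sql}
            -- List candidate victims among the test clients, excluding the current session:
            SELECT id, user, db, command, time
              FROM information_schema.PROCESSLIST
             WHERE db = 'test' AND user = 'root' AND id <> CONNECTION_ID()
             ORDER BY RAND();

            -- Then terminate one of the reported ids (the id below is only an example):
            KILL CONNECTION 5829;
            {code}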

            Hi, unfortunately I was still able to reproduce the problem; just a quick syslog entry:
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: 2025-02-26 14:51:35 353 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: 2025-02-26 14:51:35 353 [ERROR] mariadbd: Index for table 'transition_coverages' is corrupt; try to repair it
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: 2025-02-26 14:51:35 0x7cb3a8169640 InnoDB: Assertion failure in file ./storage/innobase/page/page0zip.cc line 4296
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: Failing assertion: slot_rec
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: We intentionally generate a memory trap.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: If you get repeated assertion failures or crashes, even
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: immediately after the mariadbd startup, there may be
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: corruption in the InnoDB tablespace. Please refer to
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: InnoDB: about forcing recovery.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: 250226 14:51:35 [ERROR] mysqld got signal 6 ;
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Sorry, we probably made a mistake, and this is a bug.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Your assistance in bug reporting will enable us to fix this for the next release.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: We will try our best to scrape up some info that will hopefully help
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: diagnose the problem, but since we have already crashed,
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: something is definitely wrong and this may fail.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Server version: 10.6.18-MariaDB-0ubuntu0.22.04.1 source revision: 887bb3f73555ff8a50138a580ca8308b9b5c069c
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: key_buffer_size=134217728
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: read_buffer_size=131072
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: max_used_connections=9
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: max_threads=153
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: thread_count=9
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: It is possible that mysqld could use up to
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 468009 K bytes of memory
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Hope that's ok; if not, decrease some variables in the equation.
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Thread pointer: 0x7cb338000c68
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: Attempting backtrace. You can use the following information to find out
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: where mysqld died. If you see no messages after this, something went
            Feb 26 14:51:35 vmsandboxtest01 mariadbd[1132]: terribly wrong...

            bjhilbrands Bart-Jan Hilbrands added a comment -
            serg Sergei Golubchik made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]

            bjhilbrands Thank you for the update! I see you got an assertion as well here. How did you reproduce the corruption in this case, please?
            PS. You can use {noformat}...{noformat} tags instead in JIRA.

            Roel Roel Van de Paar added a comment -

            Hi, this was reproduced by running myloader with the default configuration (4 threads), so I suppose it is possible that the issue goes away when myloader is run with only 1 thread.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            bjhilbrands, MariaDB Server 10.6.18 should be affected by the race condition that was fixed in MDEV-34453 (MariaDB Server 10.6.20). Can you reproduce the corruption with 10.6.20 or 10.6.21 or a later release?

            marko Marko Mäkelä added a comment -

            I have set up a pipeline that restores database backups 24/7; I'll report back next week if the issue has not been reproduced by then.

            bjhilbrands Bart-Jan Hilbrands added a comment -

            It is still looking promising so far! I also have a question about which other versions of MariaDB this might affect or have affected. We are upgrading to Ubuntu 24.04 this year, and the current MariaDB version in the Ubuntu package repository is 10.11.8. Do you know whether we might also run into this problem with that version, or whether we would have to install a later 10.11 release?

            bjhilbrands Bart-Jan Hilbrands added a comment -
            bjhilbrands Bart-Jan Hilbrands added a comment - - edited

            Alright, it has been a week of nonstop restores now and the problem has not reproduced yet with version 10.6.21. I think we can pretty safely say that the problem has been resolved, thanks for the effort!

            marko Marko Mäkelä made changes -

            Thank you for the testing. I think that we can declare this bug a duplicate of MDEV-34453.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Fix Version/s 11.8.0 [ 29960 ]
            Fix Version/s 11.7.1 [ 29913 ]
            Fix Version/s 11.6.2 [ 29908 ]
            Fix Version/s 11.4.4 [ 29907 ]
            Fix Version/s 11.2.6 [ 29906 ]
            Fix Version/s 10.11.10 [ 29904 ]
            Fix Version/s 10.6.20 [ 29903 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Duplicate [ 3 ]
            Status Open [ 1 ] Closed [ 6 ]

            People

              marko Marko Mäkelä
              bjhilbrands Bart-Jan Hilbrands
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved:
