[MDEV-24096] Server crash, InnoDB fatal error, Assertion `first_free <= srv_page_size - 8' failed in trx_undo_page_report_modify Created: 2020-11-02  Updated: 2020-11-13  Resolved: 2020-11-06

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB, Virtual Columns
Affects Version/s: 10.4.16, 10.5.7
Fix Version/s: 10.4.17, 10.5.8

Type: Bug Priority: Blocker
Reporter: Alice Sherepa Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: corruption, regression, rr-profile

Attachments: File srv2.test    
Issue Links:
Duplicate
is duplicated by MDEV-24155 Database not starting Closed
is duplicated by MDEV-24162 InnoDB: Assertion failure in file /ho... Closed
Problem/Incident
is caused by MDEV-23672 Assertion `v.v_indexes.empty()' faile... Closed
Relates
relates to MDEV-371 Unique indexes for blobs Closed
relates to MDEV-24156 trx_undo_left() fails to prevent over... Closed

 Description   

See further information in the comments and attached test case

10.5 504d4c1ff6e0cecde9adf

mysqld: /10.5/storage/innobase/trx/trx0rec.cc:807: uint16_t trx_undo_page_report_modify(buf_block_t*, trx_t*, dict_index_t*, const rec_t*, const rec_offs*, const upd_t*, ulint, const dtuple_t*, mtr_t*): Assertion `first_free <= srv_page_size - 8' failed.
201102 14:59:50 [ERROR] mysqld got signal 6 ;
 
Server version: 10.5.7-MariaDB-debug-log
 
linux/raise.c:51(__GI_raise)[0x7fbf4e2fb7bb]
stdlib/abort.c:81(__GI_abort)[0x7fbf4e2e6535]
intl/loadmsgcat.c:1177(_nl_load_domain)[0x7fbf4e2e640f]
??:0(__assert_fail)[0x7fbf4e2f4102]
trx/trx0rec.cc:809(trx_undo_page_report_modify(buf_block_t*, trx_t*, dict_index_t*, unsigned char const*, unsigned short const*, upd_t const*, unsigned long, dtuple_t const*, mtr_t*))[0x562753e9dff8]
trx/trx0rec.cc:2018(trx_undo_report_row_operation(que_thr_t*, dict_index_t*, dtuple_t const*, upd_t const*, unsigned long, unsigned char const*, unsigned short const*, unsigned long*))[0x562753ea6c5f]
btr/btr0cur.cc:5362(btr_cur_del_mark_set_clust_rec(buf_block_t*, unsigned char*, dict_index_t*, unsigned short const*, que_thr_t*, dtuple_t const*, mtr_t*))[0x562753f722c2]
row/row0upd.cc:2683(row_upd_del_mark_clust_rec(upd_node_t*, dict_index_t*, unsigned short*, que_thr_t*, unsigned long, bool, mtr_t*))[0x562753def3a7]
row/row0upd.cc:2860(row_upd_clust_step(upd_node_t*, que_thr_t*))[0x562753df05ce]
row/row0upd.cc:2992(row_upd(upd_node_t*, que_thr_t*))[0x562753df12e2]
row/row0upd.cc:3136(row_upd_step(que_thr_t*))[0x562753df2246]
row/row0mysql.cc:1849(row_update_for_mysql(row_prebuilt_t*))[0x562753d3b164]
handler/ha_innodb.cc:8491(ha_innobase::delete_row(unsigned char const*))[0x5627539aebeb]
sql/handler.cc:7261(handler::ha_delete_row(unsigned char const*))[0x562752eece40]
sql/sql_delete.cc:280(TABLE::delete_row())[0x562753357580]
sql/sql_delete.cc:797(mysql_delete(THD*, TABLE_LIST*, Item*, SQL_I_List<st_order>*, unsigned long long, unsigned long long, select_result*))[0x56275334e856]
sql/sql_parse.cc:4829(mysql_execute_command(THD*))[0x5627526e3199]
sql/sql_parse.cc:8044(mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool))[0x5627526f8b66]
sql/sql_parse.cc:1875(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool))[0x5627526cf7c7]
sql/sql_parse.cc:1353(do_command(THD*))[0x5627526cc0d5]
sql/sql_connect.cc:1410(do_handle_one_connection(CONNECT*, bool))[0x562752af883c]
sql/sql_connect.cc:1314(handle_one_connection)[0x562752af8199]
perfschema/pfs.cc:2203(pfs_spawn_thread)[0x5627537b818c]
nptl/pthread_create.c:487(start_thread)[0x7fbf4edb6fa3]
x86_64/clone.S:97(clone)[0x7fbf4e3bd4cf]



 Comments   
Comment by Alice Sherepa [ 2020-11-03 ]

while trying to reproduce the first assertion I'm getting another one, probably related:
Assertion `page_align(ptr) == undo_block->frame' failed in trx_undo_page_set_next_prev_and_add

mysqld: /10.5/storage/innobase/trx/trx0rec.cc:81: uint16_t trx_undo_page_set_next_prev_and_add(buf_block_t*, byte*, mtr_t*): Assertion `page_align(ptr) == undo_block->frame' failed.
201103 12:07:17 [ERROR] mysqld got signal 6 ;
 
Server version: 10.5.7-MariaDB-debug-log
 
stdlib/abort.c:81(__GI_abort)[0x7f7b9b6eb859]
:0(__GI___assert_fail)[0x7f7b9b6fcf36]
trx/trx0rec.cc:83(trx_undo_page_set_next_prev_and_add(buf_block_t*, unsigned char*, mtr_t*))[0x56448963652a]
trx/trx0rec.cc:458(trx_undo_page_report_insert(buf_block_t*, trx_t*, dict_index_t*, dtuple_t const*, mtr_t*))[0x564489638685]
trx/trx0rec.cc:2016(trx_undo_report_row_operation(que_thr_t*, dict_index_t*, dtuple_t const*, upd_t const*, unsigned long, unsigned char const*, unsigned short const*, unsigned long*))[0x564489643156]
btr/btr0cur.cc:3288(btr_cur_ins_lock_and_undo(unsigned long, btr_cur_t*, dtuple_t*, que_thr_t*, mtr_t*, bool*))[0x564489703077]
btr/btr0cur.cc:3502(btr_cur_optimistic_insert(unsigned long, btr_cur_t*, unsigned short**, mem_block_info_t**, dtuple_t*, unsigned char**, big_rec_t**, unsigned long, que_thr_t*, mtr_t*))[0x56448970467a]
row/row0ins.cc:2724(row_ins_clust_index_entry_low(unsigned long, unsigned long, dict_index_t*, unsigned long, dtuple_t*, unsigned long, que_thr_t*))[0x56448947d842]
row/row0ins.cc:3196(row_ins_clust_index_entry(dict_index_t*, dtuple_t*, que_thr_t*, unsigned long))[0x5644894802da]
row/row0ins.cc:3321(row_ins_index_entry(dict_index_t*, dtuple_t*, que_thr_t*))[0x564489480ce4]
row/row0ins.cc:3490(row_ins_index_entry_step(ins_node_t*, que_thr_t*))[0x564489481e55]
row/row0ins.cc:3627(row_ins(ins_node_t*, que_thr_t*))[0x56448948287d]
row/row0ins.cc:3766(row_ins_step(que_thr_t*))[0x5644894838e2]
row/row0mysql.cc:1423(row_insert_for_mysql(unsigned char const*, row_prebuilt_t*, ins_mode_t))[0x5644894cd1bc]
handler/ha_innodb.cc:7592(ha_innobase::write_row(unsigned char const*))[0x56448912d9e4]
sql/handler.cc:7143(handler::ha_write_row(unsigned char const*))[0x56448862dd47]
sql/sql_insert.cc:1736(vers_insert_history_row(TABLE*))[0x564487d3a523]
sql/sql_update.cc:1093(mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, bool, unsigned long long*, unsigned long long*))[0x5644880fc1e6]
sql/sql_parse.cc:4430(mysql_execute_command(THD*))[0x564487dfc0d8]
sql/sql_parse.cc:8044(mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool))[0x564487e153d8]
sql/sql_parse.cc:1875(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool))[0x564487deb9ea]
sql/sql_parse.cc:1353(do_command(THD*))[0x564487de8313]
sql/sql_connect.cc:1410(do_handle_one_connection(CONNECT*, bool))[0x56448822887a]
sql/sql_connect.cc:1314(handle_one_connection)[0x5644882281de]
perfschema/pfs.cc:2203(pfs_spawn_thread)[0x564488f30bfb]
nptl/pthread_create.c:478(start_thread)[0x7f7b9bc14609]

Comment by Marko Mäkelä [ 2020-11-03 ]

Yes, those two could share the root cause. Unfortunately we do not have much in terms of ASAN instrumentation for the buffer pool, so I think that a normal debug build should work equally well.

Comment by Elena Stepanova [ 2020-11-06 ]

The failure appeared in 10.4 after this commit:

Author: Aleksey Midenkov <midenok@gmail.com>
Date:   Tue Oct 20 10:57:57 2020 +0300
 
    MDEV-23672 Assertion `v.v_indexes.empty()' failed in dict_table_t::instant_column

Non-debug build also crashes, but it happens after a post-test check (but before the shutdown), and it fails to produce a coredump. It looks like this:

10.4 add67826 RelWithDebInfo

Version: '10.4.17-MariaDB-log'  socket: '/data/bld/10.4-rel-nightly/mysql-test/var/tmp/mysqld.1.sock'  port: 16000  MariaDB Server
201106 11:04:12 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.4.17-MariaDB-log
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=1
max_threads=153
thread_count=7
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63534 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fcf40000c48
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fcf4dffaca8 thread_stack 0x49000

Non-debug ASAN build doesn't help much, but it produces an extra error line before crashing:

Version: '10.4.17-MariaDB-log'  socket: '/data/bld/10.4-rel-asan-nightly/mysql-test/var/tmp/mysqld.1.sock'  port: 16000  Source distribution
2020-11-06 11:28:38 1 [ERROR] [FATAL] InnoDB: Trying to read page number 538976288 in space 0, space name innodb_system, which is outside the tablespace bounds. Byte offset 0, len 16384
201106 11:28:38 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.4.17-MariaDB-log
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=1
max_threads=153
thread_count=7
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63534 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x62b000000208
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f78b2ce4b50 thread_stack 0x5fc00
/lib/x86_64-linux-gnu/libasan.so.5(+0x6cd30)[0x7f78c47a1d30]
mysys/stacktrace.c:175(my_print_stacktrace)[0x563eb94941a6]
sql/signal_handler.cc:206(handle_fatal_signal)[0x563eb83088a8]

It also has a variation of the assertion failure:

10.4 add67826 debug

mysqld: /data/src/10.4/storage/innobase/trx/trx0rec.cc:66: void trx_undof_page_add_undo_rec_log(const buf_block_t*, ulint, ulint, mtr_t*): Assertion `new_free < srv_page_size' failed.
201106 11:36:23 [ERROR] mysqld got signal 6 ;
 
#7  0x00007ff92e542f36 in __GI___assert_fail (assertion=0x55b29a7c4c94 "new_free < srv_page_size", file=0x55b29a7c4c00 "/data/src/10.4/storage/innobase/trx/trx0rec.cc", line=66, function=0x55b29a7c4c30 "void trx_undof_page_add_undo_rec_log(const buf_block_t*, ulint, ulint, mtr_t*)") at assert.c:101
#8  0x000055b29a037d34 in trx_undof_page_add_undo_rec_log (undo_block=0x7ff92382d0e0, old_free=16371, new_free=16919, mtr=0x7ff9236fc290) at /data/src/10.4/storage/innobase/trx/trx0rec.cc:66
#9  0x000055b29a03c3f7 in trx_undo_page_report_modify (undo_block=0x7ff92382d0e0, trx=0x7ff9281aa140, index=0x7ff8d82003e0, rec=0x7ff923b10578 "\200", offsets=0x7ff9236fcd00, update=0x7ff8d8081770, cmpl_info=0, row=0x0, mtr=0x7ff9236fc290) at /data/src/10.4/storage/innobase/trx/trx0rec.cc:1504
#10 0x000055b29a03eb0c in trx_undo_report_row_operation (thr=0x7ff8d8081d40, index=0x7ff8d82003e0, clust_entry=0x0, update=0x7ff8d8081770, cmpl_info=0, rec=0x7ff923b10578 "\200", offsets=0x7ff9236fcd00, roll_ptr=0x7ff9236fc838) at /data/src/10.4/storage/innobase/trx/trx0rec.cc:2136
#11 0x000055b29a0a56ac in btr_cur_upd_lock_and_undo (flags=2, cursor=0x7ff8d8048490, offsets=0x7ff9236fcd00, update=0x7ff8d8081770, cmpl_info=0, thr=0x7ff8d8081d40, mtr=0x7ff9236fcf60, roll_ptr=0x7ff9236fc838) at /data/src/10.4/storage/innobase/btr/btr0cur.cc:3920
#12 0x000055b29a0a92db in btr_cur_optimistic_update (flags=2, cursor=0x7ff8d8048490, offsets=0x7ff9236fcc10, heap=0x7ff9236fccb0, update=0x7ff8d8081770, cmpl_info=0, thr=0x7ff8d8081d40, trx_id=317, mtr=0x7ff9236fcf60) at /data/src/10.4/storage/innobase/btr/btr0cur.cc:4707
#13 0x000055b299fe841b in row_upd_clust_rec (flags=0, node=0x7ff8d8081648, index=0x7ff8d82003e0, offsets=0x7ff9236fcd00, offsets_heap=0x7ff9236fccb0, thr=0x7ff8d8081d40, mtr=0x7ff9236fcf60) at /data/src/10.4/storage/innobase/row/row0upd.cc:2857
#14 0x000055b299fe9701 in row_upd_clust_step (node=0x7ff8d8081648, thr=0x7ff8d8081d40) at /data/src/10.4/storage/innobase/row/row0upd.cc:3230
#15 0x000055b299fe9a84 in row_upd (node=0x7ff8d8081648, thr=0x7ff8d8081d40) at /data/src/10.4/storage/innobase/row/row0upd.cc:3296
#16 0x000055b299fea04c in row_upd_step (thr=0x7ff8d8081d40) at /data/src/10.4/storage/innobase/row/row0upd.cc:3440
#17 0x000055b299f88ac9 in row_update_for_mysql (prebuilt=0x7ff8d8080aa0) at /data/src/10.4/storage/innobase/row/row0mysql.cc:1848
#18 0x000055b299dddb21 in ha_innobase::update_row (this=0x7ff8d81f7818, old_row=0x7ff8d822af28 "\020", new_row=0x7ff8d8220dd8 "\020") at /data/src/10.4/storage/innobase/handler/ha_innodb.cc:8802
#19 0x000055b299b8f708 in handler::ha_update_row (this=0x7ff8d81f7818, old_data=0x7ff8d822af28 "\020", new_data=0x7ff8d8220dd8 "\020") at /data/src/10.4/sql/handler.cc:6792
#20 0x000055b299928bdd in mysql_update (thd=0x7ff8d8000d90, table_list=0x7ff8d80135b0, fields=..., values=..., conds=0x0, order_num=0, order=0x0, limit=18446744073709551442, ignore=false, found_return=0x7ff9236fde20, updated_return=0x7ff9236fdee0) at /data/src/10.4/sql/sql_update.cc:1073
#21 0x000055b2998145ba in mysql_execute_command (thd=0x7ff8d8000d90) at /data/src/10.4/sql/sql_parse.cc:4396
#22 0x000055b2998207cb in mysql_parse (thd=0x7ff8d8000d90, rawbuf=0x7ff8d8013448 "UPDATE table100_innodb_int_autoinc SET col_longtext_utf8_key = 'foo'", length=68, parser_state=0x7ff9236fe550, is_com_multi=false, is_next_command=false) at /data/src/10.4/sql/sql_parse.cc:7938
#23 0x000055b29980cad1 in dispatch_command (command=COM_QUERY, thd=0x7ff8d8000d90, packet=0x7ff8d80087a1 "UPDATE table100_innodb_int_autoinc SET col_longtext_utf8_key = 'foo'", packet_length=68, is_com_multi=false, is_next_command=false) at /data/src/10.4/sql/sql_parse.cc:1839
#24 0x000055b29980b339 in do_command (thd=0x7ff8d8000d90) at /data/src/10.4/sql/sql_parse.cc:1357
#25 0x000055b29999a0b9 in do_handle_one_connection (connect=0x55b29c95e220) at /data/src/10.4/sql/sql_connect.cc:1412
#26 0x000055b299999e02 in handle_one_connection (arg=0x55b29c95e220) at /data/src/10.4/sql/sql_connect.cc:1316
#27 0x000055b29a3b8d22 in pfs_spawn_thread (arg=0x55b29c950af0) at /data/src/10.4/storage/perfschema/pfs.cc:1869
#28 0x00007ff92edc3609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#29 0x00007ff92e62e293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The dirty test case is attached as srv2.test.
It is fast and not big, but contains a very ugly data set of 100 lines and a chain of DDL which most likely can be further simplified; but it seems to be very sensitive to both the data contents and the sequence of events.
Currently on my machine it fails reliably every time, with all described failures – first_free... assertion o 10.5, new_free... assertion on 10.4, crashes without coredumps on 10.4-10.5 non-debug. However, due to the fragility of the test case i am not sure whether it will fail the same way or other machines, it is to be tried.

Comment by Marko Mäkelä [ 2020-11-06 ]

MDEV-23672 affected the writing of undo log records into undo log pages. Apparently it can cause undo page corruption.

Comment by Marko Mäkelä [ 2020-11-06 ]

I will take over this. I can repeat the problem with srv2.test without rr record but not with it. I think that the timing of purge could play a role here.

Comment by Marko Mäkelä [ 2020-11-06 ]

MDEV-23672 broke trx_undo_log_v_idx(), which is writing information about indexed virtual columns (MDEV-5800) to an InnoDB undo log page. It should be noted that also unique BLOBs (MDEV-371) internally use indexed virtual columns.

The checks around trx_undo_left() are insufficient, and that function is inherently unsafe because it returns an unsigned value. If the size limit was ever exceeded, we would exceed it again and again.

Exceeding the limit will not only corrupt the undo log page (which was caught by the assertion later when reading that undo log page), but it can also corrupt any adjacent pages in the InnoDB buffer pool (up to 65536 bytes, I think). When using the default innodb_page_size=16k, that would amount to up to 3 adjacent ‘random’ pages in the buffer pool. With innodb_page_size=4k it would be up to 15 unrelated pages. The buffer pool is also being used for allocating record lock bitmaps and the adaptive hash index. Thus, the stray writes could cause arbitrary ‘phase of moon’ crashes.

Comment by Thirunarayanan Balathandayuthapani [ 2020-11-06 ]

Patch looks OK to me

Comment by Elena Stepanova [ 2020-11-06 ]

I've run the tests which I had for this issue – the first RQG test used to produce the rr profile, and the second RQG test which was the source of the provided MTR test, and the MTR test itself, each repeatedly.
The MTR test didn't fail either on the patched 10.4 nor on the patched 10.5, while without the patch it was failing on both (more or less deterministically on my machine).
The second RQG test didn't fail either on the patched 10.4 nor on the patched 10.5. Without the patch, it fails non-deterministically on 10.5 but doesn't fail on 10.4.
The first RQG test fails on the patched 10.4 and 10.5 non-deterministically with symptoms similar to MDEV-22564 and other versioning-related issues on debug builds or with diagnostics area assertion failures, and with MDEV-21987 on non-debug builds. It did it before the patch as well, intermittently with the problem at hand, which was one of the reasons why it took so long to narrow down the test case.

Thus, results of the MTR test and the second RQG test are somewhat reassuring, while the result of the first RQG test is inconclusive. I think it's acceptable for an emergency patch, but more testing is definitely needed.

Comment by Marko Mäkelä [ 2020-11-06 ]

Thank you. I filed MDEV-24156 for follow-up work.

Comment by Daniel Black [ 2020-11-11 ]

For all those watching patiently, there have been packages released https://mariadb.org/mariadb-10-5-8-10-4-17-10-3-27-and-10-2-36-now-available/

Generated at Thu Feb 08 09:27:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.