[MDEV-14925] trx->rsegs.m_noredo.undo is corrupted during ROLLBACK - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 10.2.9, 10.2.10, 10.2.12
Fix Version/s: N/A
Component/s: Storage Engine - InnoDB
Labels:
- innodb
Environment:
RHEL 6.5 x64 on physical server.
Last error on 10.2.12.

Sprint:
10.2.14

Description

Problem started after an upgrade 10.0.22 -> 10.2.9 and continues with 10.2.12

error log for 10.2.12
2018-01-10 0:15:13 139882323265280 [ERROR] [FATAL] InnoDB: Trying to read page number 376886 in space 4294967294, space name innodb_temporary, which is outside the tablespace bounds. Byte offset 0, len 16384
180110 0:15:13 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.2.12-MariaDB-log
key_buffer_size=33554432
read_buffer_size=2097152
max_used_connections=421
max_threads=401
thread_count=401
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1700897 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f36d0f4d878
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f38e4303cc8 thread_stack 0x30000
(my_addr_resolve failure: fork)
/usr/sbin/mysqld(my_print_stacktrace+0x2b) [0x7f4373fb2a2b]
/usr/sbin/mysqld(handle_fatal_signal+0x535) [0x7f4373a7f915]
/lib64/libpthread.so.0(+0x3e5260f7e0) [0x7f4372fa67e0]
/lib64/libc.so.6(gsignal+0x35) [0x7f43716005e5]
/lib64/libc.so.6(abort+0x175) [0x7f4371601dc5]
/usr/sbin/mysqld(+0x9addca) [0x7f4373d83dca]
/usr/sbin/mysqld(+0xa56b0d) [0x7f4373e2cb0d]
/usr/sbin/mysqld(+0xa5f3bc) [0x7f4373e353bc]
/usr/sbin/mysqld(+0xa12b3b) [0x7f4373de8b3b]
/usr/sbin/mysqld(+0xa141e6) [0x7f4373dea1e6]
/usr/sbin/mysqld(+0x9f1b2b) [0x7f4373dc7b2b]
/usr/sbin/mysqld(+0x98ae6c) [0x7f4373d60e6c]
/usr/sbin/mysqld(+0x947263) [0x7f4373d1d263]
/usr/sbin/mysqld(+0x8ea2a7) [0x7f4373cc02a7]
/usr/sbin/mysqld(+0x98bde5) [0x7f4373d61de5]
/usr/sbin/mysqld(+0x98c80b) [0x7f4373d6280b]
/usr/sbin/mysqld(+0x98cec0) [0x7f4373d62ec0]
/usr/sbin/mysqld(+0x857c44) [0x7f4373c2dc44]
/usr/sbin/mysqld(ha_rollback_trans(THD*, bool)+0xe2) [0x7f4373a825c2]
/usr/sbin/mysqld(trans_rollback(THD*)+0x5a) [0x7f43739d3c3a]
/usr/sbin/mysqld(mysql_execute_command(THD*)+0x127c) [0x7f43738fee9c]
/usr/sbin/mysqld(mysql_parse(THD, char, unsigned int, Parser_state*, bool, bool)+0x29a) [0x7f437390617a]
/usr/sbin/mysqld(dispatch_command(enum_server_command, THD, char, unsigned int, bool, bool)+0x1cde) [0x7f43739084ee]
/usr/sbin/mysqld(do_command(THD*)+0x14e) [0x7f4373908a9e]
/usr/sbin/mysqld(tp_callback(TP_connection*)+0x67) [0x7f43739ff917]
/usr/sbin/mysqld(+0x682b20) [0x7f4373a58b20]
/lib64/libpthread.so.0(+0x3e52607aa1) [0x7f4372f9eaa1]
/lib64/libc.so.6(clone+0x6d) [0x7f43716b6aad]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f36d05987f0): rollback
Connection ID (thread ID): 64393
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=on,mrr_cost_based=on,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on

similar error from 10.2.9
:
2017-11-19 17:30:46 139866819159808 [ERROR] [FATAL] InnoDB: Trying to read page number 1755915 in space 4294967294, space name innodb_temporary, which is outside the tablespace bounds. Byte offset 0, len 16384
171119 17:30:46 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.2.9-MariaDB-log
key_buffer_size=33554432
read_buffer_size=2097152
max_used_connections=263
max_threads=401
thread_count=125
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1700682 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f28bd04ba58
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f3548125cc8 thread_stack 0x30000
(my_addr_resolve failure: fork)
/usr/sbin/mysqld(my_print_stacktrace+0x2b) [0x7f35507b3d8b]
/usr/sbin/mysqld(handle_fatal_signal+0x4d5) [0x7f355027e425]
/lib64/libpthread.so.0(+0x3e5260f7e0) [0x7f354f7a67e0]
/lib64/libc.so.6(gsignal+0x35) [0x7f354de005e5]
/lib64/libc.so.6(abort+0x175) [0x7f354de01dc5]
/usr/sbin/mysqld(+0x9a7dfa) [0x7f355057ddfa]
/usr/sbin/mysqld(+0xa5771d) [0x7f355062d71d]
/usr/sbin/mysqld(+0xa5c96f) [0x7f355063296f]
/usr/sbin/mysqld(+0xa1231b) [0x7f35505e831b]
/usr/sbin/mysqld(+0xa139c6) [0x7f35505e99c6]
/usr/sbin/mysqld(+0x9ebb71) [0x7f35505c1b71]
/usr/sbin/mysqld(+0x984f3c) [0x7f355055af3c]
/usr/sbin/mysqld(+0x9425bb) [0x7f35505185bb]
/usr/sbin/mysqld(+0x8e5837) [0x7f35504bb837]
/usr/sbin/mysqld(+0x985f85) [0x7f355055bf85]
/usr/sbin/mysqld(+0x98627b) [0x7f355055c27b]
/usr/sbin/mysqld(+0x91276a) [0x7f35504e876a]
/usr/sbin/mysqld(+0x914931) [0x7f35504ea931]
/usr/sbin/mysqld(+0x8490b0) [0x7f355041f0b0]
/usr/sbin/mysqld(+0x8583f0) [0x7f355042e3f0]
/usr/sbin/mysqld(handler::ha_write_row(unsigned char*)+0x177) [0x7f3550287d77]
/usr/sbin/mysqld(write_record(THD, TABLE, st_copy_info*)+0x71) [0x7f35500e5f21]
/usr/sbin/mysqld(select_insert::send_data(List<Item>&)+0xd2) [0x7f35500e6882]
/usr/sbin/mysqld(+0x5684bf) [0x7f355013e4bf]
/usr/sbin/mysqld(+0x555381) [0x7f355012b381]
/usr/sbin/mysqld(sub_select(JOIN, st_join_table, bool)+0x17e) [0x7f35501336fe]
/usr/sbin/mysqld(JOIN::exec_inner()+0x9f9) [0x7f35501516c9]
/usr/sbin/mysqld(JOIN::exec()+0x3c) [0x7f35501518fc]
/usr/sbin/mysqld(mysql_select(THD, TABLE_LIST, unsigned int, List<Item>&, Item, unsigned int, st_order, st_order, Item, st_order, unsigned long long, select_result, st_select_lex_unit, st_select_lex)+0x12a) [0x7f355015080a]
/usr/sbin/mysqld(handle_select(THD, LEX, select_result*, unsigned long)+0x1d4) [0x7f3550151b14]
/usr/sbin/mysqld(mysql_execute_command(THD*)+0x78d6) [0x7f3550105076]
/usr/sbin/mysqld(Prepared_statement::execute(String*, bool)+0x4de) [0x7f355011677e]
/usr/sbin/mysqld(+0x543dbb) [0x7f3550119dbb]
/usr/sbin/mysqld(+0x544a32) [0x7f355011aa32]
/usr/sbin/mysqld(mysqld_stmt_execute(THD, char, unsigned int)+0x27) [0x7f355011ab77]
/usr/sbin/mysqld(dispatch_command(enum_server_command, THD, char, unsigned int, bool, bool)+0xefa) [0x7f355010739a]
/usr/sbin/mysqld(do_command(THD*)+0x14e) [0x7f35501087be]
/usr/sbin/mysqld(tp_callback(TP_connection*)+0x67) [0x7f35501fe967]
/usr/sbin/mysqld(+0x6816c0) [0x7f35502576c0]
/lib64/libpthread.so.0(+0x3e52607aa1) [0x7f354f79eaa1]
/lib64/libc.so.6(clone+0x6d) [0x7f354deb6aad]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f28c8014d08): insert ignore into asset_lite select * from asset_lite_temp order by asset_id

Connection ID (thread ID): 327798
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=on,mrr_cost_based=on,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on

Attachments

Issue Links

relates to

MDEV-15524 Do not write garbage or disable page checksums for temporary tables

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2018-03-08 14:45

The crash happens when executing a rollback statement.

In InnoDB, all tables and records are exclusively locked and changes written upfront. Rollback should replay the undo log backwards, decrementing the trx->undo_no all the way to 0. Finally, the transaction with no net changes to tables will be committed.

MariaDB 10.2 (and MySQL 5.7) introduced a separate tablespace for temporary tables and temporary undo logs. Until then, temporary tables were treated as persistent.

It turns out that the function trx_roll_pop_top_rec_of_trx() is actually prioritizing the undo logs of persistent tables. Only after all persistent changes have been rolled back, InnoDB would move to processing any temporary undo log records that are past the rollback savepoint. While I have refactored this code in MariaDB 10.2 and 10.3, this logic of prioritizing persisted undo logs in rollback remains, just like it is in MySQL 5.7.

The cause of the crash is a wrong page reference. In the core dump, I successfully looked up two undo pages: one in the persistent tablespace (system tablespace; innodb_undo_tablespaces was not used) and another in the temporary tablespace. The one in the temporary tablespace appears to be empty (containing only garbage after the start=end=0x110 byte offset). The one in the persistent tablespace is almost full with records, but there is a record at offset 0xb4c, not 0xb4b = 2891. Here is the wrong pointer:

(gdb) thread 1

(gdb) frame 13

(gdb) p *undo

$1 = {id = 0, type = 1, state = 1, del_marks = 0, trx_id = 19350037632, xid = {formatID = 1,

gtrid_length = 24, bqual_length = 0,

data = "MySQLXid\346\001\000\000\000\000\000\000Y%\343\v", '\000' <repeats 107 times>},

dict_operation = 0, table_id = 6197768787689408977, rseg = 0x7f43b90c6360, space = 4294967294,

hdr_page_no = 90544, hdr_offset = 86, last_page_no = 90544, size = 1, empty = 0,

top_page_no = 376886, top_offset = 2891, top_undo_no = 13, guess_block = 0x7f3fed2ac008,

withdraw_clock = 0, undo_list = {prev = 0x0, next = 0x0}}

…

The temporary tablespace does not have page number 376886, because it is much smaller:

(gdb) frame 7

(gdb) p *space

$3 = {name = 0x7f43b4a2b9e0 "innodb_temporary", id = 4294967294, max_lsn = 0, stop_ios = false,

stop_new_ops = false, is_being_truncated = false, purpose = FIL_TYPE_TEMPORARY, chain = {count = 1,

start = 0x7f43bcb24090, end = 0x7f43bcb24090, node = &fil_node_t::chain}, size = 98304,

size_in_header = 98304, free_len = 950, free_limit = 95872, recv_size = 0, flags = 0,

…

In the 3 undo logs of the transaction, the last_page_no should be equal to top_page_no:

(gdb) p trx->rsegs.m_redo.insert_undo.last_page_no

$4 = 4653060

(gdb) p trx->rsegs.m_redo.insert_undo.top_page_no

$5 = 4653060

(gdb) p trx->rsegs.m_redo.update_undo.last_page_no

$6 = 3964969

(gdb) p trx->rsegs.m_redo.update_undo.top_page_no

$7 = 3964969

That is not the case for the temporary undo log:

(gdb) p trx->rsegs.m_noredo.undo.top_page_no

$8 = 376886

(gdb) p trx->rsegs.m_noredo.undo.last_page_no

$9 = 90544

Page 90544 exists in the temporary tablespace, but it is empty apart from an undo log header, which contains a MySQLXid that does not match the trx->xid.

We also have trx->rsegs.m_noredo.undo.guess_block which matches much of the temporary undo log data structure in memory:

(gdb) p *trx->rsegs.m_noredo.undo

$10 = {id = 0, type = 1, state = 1, del_marks = 0, trx_id = 19350037632, xid = {

    formatID = 1, gtrid_length = 24, bqual_length = 0,

    data = "MySQLXid\346\001\000\000\000\000\000\000Y%\343\v", '\000' <repeats 107 times>}, dict_operation = 0, table_id = 6197768787689408977,

  rseg = 0x7f43b90c6360, space = 4294967294, hdr_page_no = 90544,

  hdr_offset = 86, last_page_no = 90544, size = 1, empty = 0,

  top_page_no = 376886, top_offset = 2891, top_undo_no = 13,

  guess_block = 0x7f3fed2ac008, withdraw_clock = 0, undo_list = {prev = 0x0,

    next = 0x0}}

The guess_block contains only inserts to a single temporary table, in ascending order of a BIGINT AUTO_INCREMENT PRIMARY KEY column, with no gaps in the AUTO_INCREMENT values or the undo_no. The first undo record is at byte offset 0x38 and the last one at 0x2f8d. Each record is 21 bytes. The last undo_no is 0x88610 or 558,608. If this was the very last record written by the transaction, this would mean that there were 558,608 rows modified by the transaction. In addition to at least this one temporary table, the transaction modified 25 persistent tables (the size of trx->mod_tables). This temporary table had the PRIMARY KEY and one UNIQUE INDEX on 3 columns.

The top_offset is pointing to a undo record in the middle of the guess_block. The guess_block should be the very last temporary undo log block that was written in the course of the transaction before rollback.

But if this block were valid for the transaction, the top_undo_no should be much bigger (558,608 instead of 13).

In trx->rsegs.m_redo.insert_undo.guess_block.frame the last undo_no is 0x88611, or 558,609 (one larger than the last written temporary undo log record).
In trx->rsegs.m_redo.update_undo.guess_block.frame the last undo_no is 0x88330 (much smaller than the first record in the last written temporary undo log page, 0x883cf).

(gdb) p *trx->rsegs.m_redo.insert_undo

$11 = {id = 1, type = 1, state = 1, del_marks = 0, trx_id = 19350037632, xid = {

    formatID = -1, gtrid_length = 0, bqual_length = 0,

    data = '\000' <repeats 127 times>}, dict_operation = 0,

  table_id = 228954336, rseg = 0x7f43b317e710, space = 0,

  hdr_page_no = 4653060, hdr_offset = 86, last_page_no = 4653060, size = 1,

  empty = 1, top_page_no = 4653060, top_offset = 272, top_undo_no = 0,

  guess_block = 0x7f39e4aa8240, withdraw_clock = 0, undo_list = {prev = 0x0,

    next = 0x0}}

(gdb) p *trx->rsegs.m_redo.update_undo

$12 = {id = 0, type = 2, state = 1, del_marks = 1, trx_id = 19350037632, xid = {

    formatID = 1, gtrid_length = 24, bqual_length = 0,

    data = "MySQLXid\346\001\000\000\000\000\000\000Y%\343\v", '\000' <repeats 107 times>}, dict_operation = 0, table_id = 4832885630348296191,

  rseg = 0x7f43b317e710, space = 0, hdr_page_no = 3964969, hdr_offset = 7083,

  last_page_no = 3964969, size = 1, empty = 1, top_page_no = 3964969,

  top_offset = 7269, top_undo_no = 323531, guess_block = 0x7f3d22574710,

  withdraw_clock = 0, undo_list = {prev = 0x0, next = 0x0}}

It remains a mystery how and why exactly the trx->rsegs.m_noredo.undo was corrupted. Currently my best guess is that a stray write from some other thread corrupted the memory while the explicit ROLLBACK from the client connection was being executed. There are not many assignments to top_page_no, and the FIL_PAGE_OFFSET in the undo pages looks sensible.

While painstakingly analyzing the core dump, I made a few observations that should eventually be addressed. But these should not explain the failure.

It would be more logical to execute rollback in strict descending order of records written (undo_no), as the sequence is common to both persistent and temporary undo log records of a transaction.
For temporary tablespaces, fsp_init_file_page_low() should zero out earlier contents. This would not only avoid confusion when analyzing core dumps, but also avoid leakage of data from encrypted (persistent) tablespaces to the temporary tablespace file.
For temporary tablespaces, page checksums should be enabled in order to catch any file system corruption. (I do not think that this failure can be explained by any file system corruption.)

Marko Mäkelä added a comment - 2018-03-08 14:45 The crash happens when executing a rollback statement. In InnoDB, all tables and records are exclusively locked and changes written upfront. Rollback should replay the undo log backwards, decrementing the trx->undo_no all the way to 0. Finally, the transaction with no net changes to tables will be committed. MariaDB 10.2 (and MySQL 5.7) introduced a separate tablespace for temporary tables and temporary undo logs. Until then, temporary tables were treated as persistent. It turns out that the function trx_roll_pop_top_rec_of_trx() is actually prioritizing the undo logs of persistent tables. Only after all persistent changes have been rolled back, InnoDB would move to processing any temporary undo log records that are past the rollback savepoint. While I have refactored this code in MariaDB 10.2 and 10.3, this logic of prioritizing persisted undo logs in rollback remains, just like it is in MySQL 5.7. The cause of the crash is a wrong page reference. In the core dump, I successfully looked up two undo pages: one in the persistent tablespace (system tablespace; innodb_undo_tablespaces was not used) and another in the temporary tablespace. The one in the temporary tablespace appears to be empty (containing only garbage after the start=end=0x110 byte offset). The one in the persistent tablespace is almost full with records, but there is a record at offset 0xb4c, not 0xb4b = 2891. Here is the wrong pointer: (gdb) thread 1 (gdb) frame 13 (gdb) p *undo $1 = {id = 0, type = 1, state = 1, del_marks = 0, trx_id = 19350037632, xid = {formatID = 1, gtrid_length = 24, bqual_length = 0, data = "MySQLXid\346\001\000\000\000\000\000\000Y%\343\v", '\000' <repeats 107 times>}, dict_operation = 0, table_id = 6197768787689408977, rseg = 0x7f43b90c6360, space = 4294967294, hdr_page_no = 90544, hdr_offset = 86, last_page_no = 90544, size = 1, empty = 0, top_page_no = 376886, top_offset = 2891, top_undo_no = 13, guess_block = 0x7f3fed2ac008, withdraw_clock = 0, undo_list = {prev = 0x0, next = 0x0}} … The temporary tablespace does not have page number 376886, because it is much smaller: (gdb) frame 7 (gdb) p *space $3 = {name = 0x7f43b4a2b9e0 "innodb_temporary", id = 4294967294, max_lsn = 0, stop_ios = false, stop_new_ops = false, is_being_truncated = false, purpose = FIL_TYPE_TEMPORARY, chain = {count = 1, start = 0x7f43bcb24090, end = 0x7f43bcb24090, node = &fil_node_t::chain}, size = 98304, size_in_header = 98304, free_len = 950, free_limit = 95872, recv_size = 0, flags = 0, … In the 3 undo logs of the transaction, the last_page_no should be equal to top_page_no : (gdb) p trx->rsegs.m_redo.insert_undo.last_page_no $4 = 4653060 (gdb) p trx->rsegs.m_redo.insert_undo.top_page_no $5 = 4653060 (gdb) p trx->rsegs.m_redo.update_undo.last_page_no $6 = 3964969 (gdb) p trx->rsegs.m_redo.update_undo.top_page_no $7 = 3964969 That is not the case for the temporary undo log: (gdb) p trx->rsegs.m_noredo.undo.top_page_no $8 = 376886 (gdb) p trx->rsegs.m_noredo.undo.last_page_no $9 = 90544 Page 90544 exists in the temporary tablespace, but it is empty apart from an undo log header, which contains a MySQLXid that does not match the trx->xid . We also have trx->rsegs.m_noredo.undo.guess_block which matches much of the temporary undo log data structure in memory: (gdb) p *trx->rsegs.m_noredo.undo $10 = {id = 0, type = 1, state = 1, del_marks = 0, trx_id = 19350037632, xid = { formatID = 1, gtrid_length = 24, bqual_length = 0, data = "MySQLXid\346\001\000\000\000\000\000\000Y%\343\v", '\000' <repeats 107 times>}, dict_operation = 0, table_id = 6197768787689408977, rseg = 0x7f43b90c6360, space = 4294967294, hdr_page_no = 90544, hdr_offset = 86, last_page_no = 90544, size = 1, empty = 0, top_page_no = 376886, top_offset = 2891, top_undo_no = 13, guess_block = 0x7f3fed2ac008, withdraw_clock = 0, undo_list = {prev = 0x0, next = 0x0}} The guess_block contains only inserts to a single temporary table, in ascending order of a BIGINT AUTO_INCREMENT PRIMARY KEY column, with no gaps in the AUTO_INCREMENT values or the undo_no . The first undo record is at byte offset 0x38 and the last one at 0x2f8d. Each record is 21 bytes. The last undo_no is 0x88610 or 558,608. If this was the very last record written by the transaction, this would mean that there were 558,608 rows modified by the transaction. In addition to at least this one temporary table, the transaction modified 25 persistent tables (the size of trx->mod_tables ). This temporary table had the PRIMARY KEY and one UNIQUE INDEX on 3 columns. The top_offset is pointing to a undo record in the middle of the guess_block . The guess_block should be the very last temporary undo log block that was written in the course of the transaction before rollback. But if this block were valid for the transaction, the top_undo_no should be much bigger (558,608 instead of 13). In trx->rsegs.m_redo.insert_undo.guess_block.frame the last undo_no is 0x88611, or 558,609 (one larger than the last written temporary undo log record). In trx->rsegs.m_redo.update_undo.guess_block.frame the last undo_no is 0x88330 (much smaller than the first record in the last written temporary undo log page, 0x883cf). (gdb) p *trx->rsegs.m_redo.insert_undo $11 = {id = 1, type = 1, state = 1, del_marks = 0, trx_id = 19350037632, xid = { formatID = -1, gtrid_length = 0, bqual_length = 0, data = '\000' <repeats 127 times>}, dict_operation = 0, table_id = 228954336, rseg = 0x7f43b317e710, space = 0, hdr_page_no = 4653060, hdr_offset = 86, last_page_no = 4653060, size = 1, empty = 1, top_page_no = 4653060, top_offset = 272, top_undo_no = 0, guess_block = 0x7f39e4aa8240, withdraw_clock = 0, undo_list = {prev = 0x0, next = 0x0}} (gdb) p *trx->rsegs.m_redo.update_undo $12 = {id = 0, type = 2, state = 1, del_marks = 1, trx_id = 19350037632, xid = { formatID = 1, gtrid_length = 24, bqual_length = 0, data = "MySQLXid\346\001\000\000\000\000\000\000Y%\343\v", '\000' <repeats 107 times>}, dict_operation = 0, table_id = 4832885630348296191, rseg = 0x7f43b317e710, space = 0, hdr_page_no = 3964969, hdr_offset = 7083, last_page_no = 3964969, size = 1, empty = 1, top_page_no = 3964969, top_offset = 7269, top_undo_no = 323531, guess_block = 0x7f3d22574710, withdraw_clock = 0, undo_list = {prev = 0x0, next = 0x0}} It remains a mystery how and why exactly the trx->rsegs.m_noredo.undo was corrupted. Currently my best guess is that a stray write from some other thread corrupted the memory while the explicit ROLLBACK from the client connection was being executed. There are not many assignments to top_page_no , and the FIL_PAGE_OFFSET in the undo pages looks sensible. While painstakingly analyzing the core dump, I made a few observations that should eventually be addressed. But these should not explain the failure. It would be more logical to execute rollback in strict descending order of records written ( undo_no ), as the sequence is common to both persistent and temporary undo log records of a transaction. For temporary tablespaces, fsp_init_file_page_low() should zero out earlier contents. This would not only avoid confusion when analyzing core dumps, but also avoid leakage of data from encrypted (persistent) tablespaces to the temporary tablespace file. For temporary tablespaces, page checksums should be enabled in order to catch any file system corruption. (I do not think that this failure can be explained by any file system corruption.)

Marko Mäkelä added a comment - 2018-03-08 15:00

Anyone who encounters or is able to repeat this, please reopen the bug and submit details.
The support customer who encountered the bug switched to ENGINE=Aria for temporary tables in order to prevent the problem from reoccurring.

Marko Mäkelä added a comment - 2018-03-08 15:00 Anyone who encounters or is able to repeat this, please reopen the bug and submit details. The support customer who encountered the bug switched to ENGINE=Aria for temporary tables in order to prevent the problem from reoccurring.

People

Assignee:: Marko Mäkelä

Reporter:: Juan

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 2018-01-11 17:08

Updated:: 2024-07-08 00:33

Resolved:: 2018-03-08 15:00

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server