[MDEV-11894] Galera crashes after a few minutes - unusable Created: 2017-01-24  Updated: 2020-08-25  Resolved: 2017-05-29

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.2.3
Fix Version/s: 10.2.5

Type: Bug Priority: Critical
Reporter: Ranjan Ghosh Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 2
Labels: None
Environment:

Ubuntu 16.04.1 LTS


Issue Links:
Duplicate
is duplicated by MDEV-11927 InnoDB change buffer is not being mer... Closed
Relates
relates to MDEV-11843 MariaDB Server Fail - "Has Gone Away"... Closed
relates to MDEV-11927 InnoDB change buffer is not being mer... Closed
Sprint: 10.0.30, 10.3.1-1

 Description   

I've restarted completely from scratch multiple times (e.g. wipe /var/lib/mysql, mysql_install_db, restore dump). I just can't get the second node to run. Below I quote first my mysql.cnf, then two variations of the errors I get.

[mysqld]
 
#Recommended by Drupal
max_allowed_packet=512M
 
#Try to prevent crashes
skip-log-bin
 
# Replicate faster :-)
wsrep_slave_threads = 32
 
#Query Cache for faster performance
query_cache_type = 1
query_cache_size = 256M
 
#mysql settings mandatory for galera
binlog_format=ROW
binlog_row_image=MINIMAL
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0
innodb_flush_log_at_trx_commit=0
 
max_connections = 300
 
#galera settings
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="uhu_cluster"
wsrep_cluster_address="gcomm://node1,node2"
wsrep_node_address=192.168.0.1
 
enforce_storage_engine=InnoDB
sql_mode=NO_AUTO_CREATE_USER

Jan 24 00:44:21 uhu1 mysqld[14620]: 2017-01-24  0:44:21 140172607566592 [ERROR] [FATAL] InnoDB: SYS_COLUMNS.TABLE_ID mismatch
Jan 24 00:44:21 uhu1 mysqld[14620]: 2017-01-24 00:44:21 0x7f7c7a7b3300  InnoDB: Assertion failure in thread 140172607566592 in file ut0ut.cc line 949
Jan 24 00:44:21 uhu1 mysqld[14620]: InnoDB: We intentionally generate a memory trap.
Jan 24 00:44:21 uhu1 mysqld[14620]: InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
Jan 24 00:44:21 uhu1 mysqld[14620]: InnoDB: If you get repeated assertion failures or crashes, even
Jan 24 00:44:21 uhu1 mysqld[14620]: InnoDB: immediately after the mysqld startup, there may be
Jan 24 00:44:21 uhu1 mysqld[14620]: InnoDB: corruption in the InnoDB tablespace. Please refer to
Jan 24 00:44:21 uhu1 mysqld[14620]: InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
Jan 24 00:44:21 uhu1 mysqld[14620]: InnoDB: about forcing recovery.
Jan 24 00:44:21 uhu1 mysqld[14620]: 170124  0:44:21 [ERROR] mysqld got signal 6 ;
Jan 24 00:44:21 uhu1 mysqld[14620]: This could be because you hit a bug. It is also possible that this binary
Jan 24 00:44:21 uhu1 mysqld[14620]: or one of the libraries it was linked against is corrupt, improperly built,
Jan 24 00:44:21 uhu1 mysqld[14620]: or misconfigured. This error can also be caused by malfunctioning hardware.
Jan 24 00:44:21 uhu1 mysqld[14620]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Jan 24 00:44:21 uhu1 mysqld[14620]: We will try our best to scrape up some info that will hopefully help
Jan 24 00:44:21 uhu1 mysqld[14620]: diagnose the problem, but since we have already crashed,
Jan 24 00:44:21 uhu1 mysqld[14620]: something is definitely wrong and this may fail.
Jan 24 00:44:21 uhu1 mysqld[14620]: Server version: 10.2.3-MariaDB-10.2.3+maria~xenial
Jan 24 00:44:21 uhu1 mysqld[14620]: key_buffer_size=134217728
Jan 24 00:44:21 uhu1 mysqld[14620]: read_buffer_size=2097152
Jan 24 00:44:21 uhu1 mysqld[14620]: max_used_connections=10
Jan 24 00:44:21 uhu1 mysqld[14620]: max_threads=302
Jan 24 00:44:21 uhu1 mysqld[14620]: thread_count=49
Jan 24 00:44:21 uhu1 mysqld[14620]: It is possible that mysqld could use up to
Jan 24 00:44:21 uhu1 mysqld[14620]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1992895 K  bytes of memory
Jan 24 00:44:21 uhu1 mysqld[14620]: Hope that's ok; if not, decrease some variables in the equation.
Jan 24 00:44:21 uhu1 mysqld[14620]: Thread pointer: 0x0x7f7c2b072008
Jan 24 00:44:21 uhu1 mysqld[14620]: Attempting backtrace. You can use the following information to find out
Jan 24 00:44:21 uhu1 mysqld[14620]: where mysqld died. If you see no messages after this, something went
Jan 24 00:44:21 uhu1 mysqld[14620]: terribly wrong...
Jan 24 00:44:21 uhu1 mysqld[14620]: stack_bottom = 0x7f7c7a7b28f8 thread_stack 0x48c00
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55eb89297dae]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(handle_fatal_signal+0x2dd)[0x55eb88cedb1d]
Jan 24 00:44:21 uhu1 mysqld[14620]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f7c9a6eb390]
Jan 24 00:44:21 uhu1 mysqld[14620]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f7c99cbb428]
Jan 24 00:44:21 uhu1 mysqld[14620]: /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f7c99cbd02a]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(+0x3fb37a)[0x55eb88aa837a]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(+0xa0f896)[0x55eb890bc896]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(+0xaaa4b6)[0x55eb891574b6]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(+0xaabfd7)[0x55eb89158fd7]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(+0xa9c680)[0x55eb89149680]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(+0x87e8d9)[0x55eb88f2b8d9]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(+0x891a47)[0x55eb88f3ea47]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_ZN7handler7ha_openEP5TABLEPKcij+0x33)[0x55eb88cf1e33]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_Z21open_table_from_shareP3THDP11TABLE_SHAREPKcjjjP5TABLEb+0x6c3)[0x55eb88c01883]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_Z10open_tableP3THDP10TABLE_LISTP18Open_table_context+0x9c3)[0x55eb88b10bc3]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_Z11open_tablesP3THDRK14DDL_options_stPP10TABLE_LISTPjjP19Prelocking_strategy+0xec6)[0x55eb88b139f6]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_Z20open_and_lock_tablesP3THDRK14DDL_options_stP10TABLE_LISTbjP19Prelocking_strategy+0x33)[0x55eb88b13e43]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_ZN22Sql_cmd_truncate_table16handler_truncateEP3THDP10TABLE_LISTb+0x80)[0x55eb88e3b800]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_ZN22Sql_cmd_truncate_table14truncate_tableEP3THDP10TABLE_LIST+0x1d8)[0x55eb88e3c108]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_ZN22Sql_cmd_truncate_table7executeEP3THD+0x50)[0x55eb88e3c1b0]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x12f7)[0x55eb88b57d77]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x28a)[0x55eb88b5faea]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(+0x4b32c7)[0x55eb88b602c7]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x11d1)[0x55eb88b61a91]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_Z10do_commandP3THD+0x17c)[0x55eb88b6302c]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x24a)[0x55eb88c260ea]
Jan 24 00:44:21 uhu1 mysqld[14620]: /usr/sbin/mysqld(handle_one_connection+0x3d)[0x55eb88c2627d]
Jan 24 00:44:21 uhu1 mysqld[14620]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f7c9a6e16ba]
Jan 24 00:44:21 uhu1 mysqld[14620]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f7c99d8c82d]
Jan 24 00:44:21 uhu1 mysqld[14620]: Trying to get some variables.
Jan 24 00:44:21 uhu1 mysqld[14620]: Some pointers may be invalid and cause the dump to abort.
Jan 24 00:44:21 uhu1 mysqld[14620]: Query (0x7f7c2b09d020): TRUNCATE simple_sitemap
Jan 24 00:44:21 uhu1 mysqld[14620]: Connection ID (thread ID): 358
Jan 24 00:44:21 uhu1 mysqld[14620]: Status: NOT_KILLED
Jan 24 00:44:21 uhu1 mysqld[14620]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on
Jan 24 00:44:21 uhu1 mysqld[14620]: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
Jan 24 00:44:21 uhu1 mysqld[14620]: information that should help you find out what is causing the crash.

Jan 24 00:51:58 uhu1 mysqld[28280]: 2017-01-24 00:51:58 0x7fb7df3fc700  InnoDB: Assertion failure in thread 140427701241600 in file trx0purge.cc line 168
Jan 24 00:51:58 uhu1 mysqld[28280]: InnoDB: Failing assertion: purge_sys->iter.trx_no <= purge_sys->rseg->last_trx_no
Jan 24 00:51:58 uhu1 mysqld[28280]: InnoDB: We intentionally generate a memory trap.
Jan 24 00:51:58 uhu1 mysqld[28280]: InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
Jan 24 00:51:58 uhu1 mysqld[28280]: InnoDB: If you get repeated assertion failures or crashes, even
Jan 24 00:51:58 uhu1 mysqld[28280]: InnoDB: immediately after the mysqld startup, there may be
Jan 24 00:51:58 uhu1 mysqld[28280]: InnoDB: corruption in the InnoDB tablespace. Please refer to
Jan 24 00:51:58 uhu1 mysqld[28280]: InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
Jan 24 00:51:58 uhu1 mysqld[28280]: InnoDB: about forcing recovery.
Jan 24 00:51:58 uhu1 mysqld[28280]: 170124  0:51:58 [ERROR] mysqld got signal 6 ;
Jan 24 00:51:58 uhu1 mysqld[28280]: This could be because you hit a bug. It is also possible that this binary
Jan 24 00:51:58 uhu1 mysqld[28280]: or one of the libraries it was linked against is corrupt, improperly built,
Jan 24 00:51:58 uhu1 mysqld[28280]: or misconfigured. This error can also be caused by malfunctioning hardware.
Jan 24 00:51:58 uhu1 mysqld[28280]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Jan 24 00:51:58 uhu1 mysqld[28280]: We will try our best to scrape up some info that will hopefully help
Jan 24 00:51:58 uhu1 mysqld[28280]: diagnose the problem, but since we have already crashed,
Jan 24 00:51:58 uhu1 mysqld[28280]: something is definitely wrong and this may fail.
Jan 24 00:51:58 uhu1 mysqld[28280]: Server version: 10.2.3-MariaDB-10.2.3+maria~xenial
Jan 24 00:51:58 uhu1 mysqld[28280]: key_buffer_size=134217728
Jan 24 00:51:58 uhu1 mysqld[28280]: read_buffer_size=2097152
Jan 24 00:51:58 uhu1 mysqld[28280]: max_used_connections=4
Jan 24 00:51:58 uhu1 mysqld[28280]: max_threads=302
Jan 24 00:51:58 uhu1 mysqld[28280]: thread_count=43
Jan 24 00:51:58 uhu1 mysqld[28280]: It is possible that mysqld could use up to
Jan 24 00:51:58 uhu1 mysqld[28280]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1992895 K  bytes of memory
Jan 24 00:51:58 uhu1 mysqld[28280]: Hope that's ok; if not, decrease some variables in the equation.
Jan 24 00:51:58 uhu1 mysqld[28280]: Thread pointer: 0x0x7fb7e8412008
Jan 24 00:51:58 uhu1 mysqld[28280]: Attempting backtrace. You can use the following information to find out
Jan 24 00:51:58 uhu1 mysqld[28280]: where mysqld died. If you see no messages after this, something went
Jan 24 00:51:58 uhu1 mysqld[28280]: terribly wrong...
Jan 24 00:51:58 uhu1 mysqld[28280]: stack_bottom = 0x7fb7df3fbb68 thread_stack 0x48c00
Jan 24 00:51:58 uhu1 mysqld[28280]: /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55e8162d0dae]
Jan 24 00:51:58 uhu1 mysqld[28280]: /usr/sbin/mysqld(handle_fatal_signal+0x2dd)[0x55e815d26b1d]
Jan 24 00:51:58 uhu1 mysqld[28280]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fb845168390]
Jan 24 00:51:58 uhu1 mysqld[28280]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7fb844738428]
Jan 24 00:51:58 uhu1 mysqld[28280]: /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7fb84473a02a]
Jan 24 00:51:58 uhu1 mysqld[28280]: /usr/sbin/mysqld(+0x3fb37a)[0x55e815ae137a]
Jan 24 00:51:58 uhu1 mysqld[28280]: /usr/sbin/mysqld(+0x9d0403)[0x55e8160b6403]
Jan 24 00:51:58 uhu1 mysqld[28280]: /usr/sbin/mysqld(+0x9d5630)[0x55e8160bb630]
Jan 24 00:51:58 uhu1 mysqld[28280]: /usr/sbin/mysqld(+0x9d618c)[0x55e8160bc18c]
Jan 24 00:51:58 uhu1 mysqld[28280]: /usr/sbin/mysqld(+0x9affa8)[0x55e816095fa8]
Jan 24 00:51:58 uhu1 mysqld[28280]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fb84515e6ba]
Jan 24 00:51:58 uhu1 mysqld[28280]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fb84480982d]
Jan 24 00:51:58 uhu1 mysqld[28280]: Trying to get some variables.
Jan 24 00:51:58 uhu1 mysqld[28280]: Some pointers may be invalid and cause the dump to abort.
Jan 24 00:51:58 uhu1 mysqld[28280]: Query (0x0):
Jan 24 00:51:58 uhu1 mysqld[28280]: Connection ID (thread ID): 4
Jan 24 00:51:58 uhu1 mysqld[28280]: Status: NOT_KILLED
Jan 24 00:51:58 uhu1 mysqld[28280]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on
Jan 24 00:51:58 uhu1 mysqld[28280]: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
Jan 24 00:51:58 uhu1 mysqld[28280]: information that should help you find out what is causing the crash.



 Comments   
Comment by Ranjan Ghosh [ 2017-02-16 ]

Ah, forgot about this bug. I could solve this. The problem was that one of our table files ("simple_sitemap") was broken. The error message even gives a hint (TRUNCATE simple_sitemap). As soon as there was any access on that table, the database was halted. The problem for me was that the error message ( SYS_COLUMNS.TABLE_ID mismatch) didnt say much and googling for this message didnt yield any results. Hopefully this bug will at least make sure that anybody else confronted with this bug can now google this error message and find out: The reason for this message is that possibly some of the table files are defect. I dont know how the table file became broken in the first place. And I wonder if MariaDB shouldn't behave more graceful in such a case. I would have expected a simple error when I try to query the table or sth. like this. Instead the whole database came down.

In case anyone is interested, I could provide the "toxic" frm/ibd file if this is of any help. I dont know if it contains any customer data, so I'd need a way to post this safely somewhere (mail would be ok). It's not large (100 KB approx.)

Comment by Jan Lindström (Inactive) [ 2017-02-20 ]

I could not yet repeat the assertion on SYS_COLUMNS.TABLE_ID. Firstly, I was able to get server up and running with innodb-force-recovery=2. However, with normal startup using debug-build it did crash in here:

/* Add debug code to track history list corruption reported
                on the MySQL mailing list on Nov 9, 2004. The fut0lst.cc
                file-based list was corrupt. The prev node pointer was
                FIL_NULL, even though the list length was over 8 million nodes!
                We assume that purge truncates the history list in large
                size pieces, and if we here reach the head of the list, the
                list cannot be longer than 2000 000 undo logs now. */
    
                if (trx_sys->rseg_history_len > 2000000) {
                        ib::warn() << "Purge reached the head of the history"
                                " list, but its length is still reported as "
                                << trx_sys->rseg_history_len << "! Make"
                                " a detailed bug report, and submit it to"
                                " http://bugs.mysql.com";
                        ut_ad(0);
#3  0x0000555556155bae in trx_purge_rseg_get_next_history_log (rseg=0x7fffe1fd8380, n_pages_handled=0x7fffd67fed90) at /home/jan/mysql/10.2/storage/innobase/trx/trx0purge.cc:1295
(gdb) p n_pages_handled
$1 = (ulint *) 0x7fffd67fed90
(gdb) p *n_pages_handled
$2 = 54
(gdb) p prev_log_addr
$3 = {page = 4294967295, boffset = 18446744073709551582}
(gdb) p *rseg
$4 = {id = 41, mutex = {m_impl = {m_lock_word = 0, m_event = 0x7fffe1fd4a80, m_policy = {<MutexDebug<TTASEventMutex<GenericPolicy> >> = {_vptr.MutexDebug = 0x555556cd5200 <vtable for GenericPolicy<TTASEventMutex<GenericPolicy> >+16>, m_magic_n = 979585, m_context = {<latch_t> = {_vptr.latch_t = 0x555556cd5070 <vtable for MutexDebug<TTASEventMutex<GenericPolicy> >::Context+16>, m_id = LATCH_ID_REDO_RSEG, m_rw_lock = false, m_temp_fsp = false}, m_mutex = 0x0, m_filename = 0x0, m_line = 18446744073709551615, m_thread_id = 18446744073709551615}}, m_count = {m_spins = 0, m_waits = 0, m_calls = 0, m_enabled = false}, m_id = LATCH_ID_REDO_RSEG}}, m_ptr = 0x0}, space = 0, page_no = 53, page_size = {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, max_size = 4294967294, curr_size = 35323, update_undo_list = {count = 0, start = 0x0, end = 0x0, node = &trx_undo_t::undo_list, init = 51966}, update_undo_cached = {count = 1, start = 0x7fffe1fd8500, end = 0x7fffe1fd8500, node = &trx_undo_t::undo_list, init = 51966}, insert_undo_list = {count = 0, start = 0x0, end = 0x0, node = &trx_undo_t::undo_list, init = 51966}, insert_undo_cached = {count = 3, start = 0x7fffe1fd8680, end = 0x7fffe1fd8980, node = &trx_undo_t::undo_list, init = 51966}, last_page_no = 4294967295, last_offset = 4297, last_trx_no = 435812729, last_del_marks = 1, trx_ref_count = 0, skip_allocation = false}
(gdb) p trx_sys->rseg_history_len
$5 = 2169495
(gdb) p *trx_sys
$6 = {mutex = {m_impl = {m_lock_word = 1, m_event = 0x7fffe2205300, m_policy = {<MutexDebug<TTASEventMutex<GenericPolicy> >> = {_vptr.MutexDebug = 0x555556cd5200 <vtable for GenericPolicy<TTASEventMutex<GenericPolicy> >+16>, m_magic_n = 979585, m_context = {<latch_t> = {_vptr.latch_t = 0x555556cd5070 <vtable for MutexDebug<TTASEventMutex<GenericPolicy> >::Context+16>, m_id = LATCH_ID_TRX_SYS, m_rw_lock = false, m_temp_fsp = false}, m_mutex = 0x7ffff4883400, m_filename = 0x5555566e15c8 "/home/jan/mysql/10.2/storage/innobase/trx/trx0purge.cc", m_line = 1279, m_thread_id = 140736792098560}}, m_count = {m_spins = 0, m_waits = 0, m_calls = 0, m_enabled = false}, m_id = LATCH_ID_TRX_SYS}}, m_ptr = 0x0}, mvcc = 0x7ffff4b18d30, max_trx_id = 641241858, serialisation_list = {count = 0, start = 0x0, end = 0x0, node = &trx_t::no_list, init = 51966}, rw_max_trx_id = 641241857, pad1 = '\000' <repeats 63 times>, rw_trx_list = {count = 0, start = 0x0, end = 0x0, node = &trx_t::trx_list, init = 51966}, pad2 = '\000' <repeats 63 times>, mysql_trx_list = {count = 0, start = 0x0, end = 0x0, node = &trx_t::mysql_trx_list, init = 51966}, rw_trx_ids = std::vector of length 0, capacity 1, pad3 = '\000' <repeats 63 times>, rseg_array = {0x7ffff4831f80, 0x7fffe1f9e700, 0x7fffe1f9e580, 0x7fffe1f9e880, 0x7fffe1f9ea00, 0x7fffe1f9eb80, 0x7fffe1f9ed00, 0x7fffe1f9ee80, 0x7fffe1f9f000, 0x7fffe1f9f180, 0x7fffe1f9f300, 0x7fffe1f9f480, 0x7fffe1f9f600, 0x7fffe1f9f780, 0x7fffe1f9f900, 0x7fffe1f9fa80, 0x7fffe1f9fc00, 0x7fffe1f9fd80, 0x7fffe1f9ff00, 0x7fffe1fa0080, 0x7fffe1fa0200, 0x7fffe1fa0380, 0x7fffe1fa0500, 0x7fffe1fa0680, 0x7fffe1fa0800, 0x7fffe1fa0980, 0x7fffe1fa0b00, 0x7fffe1fa0c80, 0x7fffe1fa0e00, 0x7fffe1fa0f80, 0x7fffe1fa1100, 0x7fffe1fa1280, 0x7fffe1fa1400, 0x7ffff4833180, 0x7ffff4833a80, 0x7ffff4834380, 0x7ffff4834b00, 0x7ffff4835400, 0x7ffff4835a00, 0x7fffe1fd7300, 0x7fffe1fd7c00, 0x7fffe1fd8380, 0x7fffe1fd8b00, 0x7fffe1fd9100, 0x7fffe1fd9700, 0x7fffe1fda000, 0x7fffe1fda900, 0x7fffe1fdaf00, 0x7fffe1fdb500, 0x7fffe1fdbc80, 0x7fffe1fdc580, 0x7fffe1fdcb80, 0x7fffe1fb0480, 0x7fffe1fb0c00, 0x7fffe1fb1500, 0x7fffe1fb1c80, 0x7fffe1fb2400, 0x7fffe1fb2b80, 0x7fffe1fb3480, 0x7fffe1fb3a80, 0x7fffe1fb4200, 0x7fffe1fb4c80, 0x7fffe1fb5700, 0x7fffe1fb5d00, 0x7fffe1f8f600, 0x7fffe1f8fd80, 0x7fffe1f90500, 0x7fffe1f90c80, 0x7fffe1f91280, 0x7fffe1f91a00, 0x7fffe1f92000, 0x7fffe1f92780, 0x7fffe1f92d80, 0x7fffe1f93500, 0x7fffe1f93b00, 0x7fffe1f94100, 0x7fffe1f94700, 0x7fffe1f94b80, 0x7fffe21db300, 0x7fffe21db780, 0x7fffe21dbd80, 0x7fffe21dc500, 0x7fffe21dcc80, 0x7fffe21dd100, 0x7fffe21dd700, 0x7fffe21dde80, 0x7fffe21de480, 0x7fffe21dea80, 0x7fffe21df080, 0x7fffe21df680, 0x7fffe21dfe00, 0x7fffe21e0400, 0x7fffe21e0b80, 0x7fffe21e1480, 0x7fffe21e1c00, 0x7fffe21e2380, 0x7fffe21e2980, 0x7fffe21e3100, 0x7fffe21e3700, 0x7fffe21e3d00, 0x7fffe21e4300, 0x7fffe21e4a80, 0x7fffe21e5080, 0x7fffe21e5680, 0x7fffe21e5c80, 0x7fffe21e6280, 0x7fffe21e6880, 0x7fffe21e6d00, 0x7fffe1f96480, 0x7fffe1f96a80, 0x7fffe1f97080, 0x7fffe1f97500, 0x7fffe1f97e00, 0x7fffe1f98580, 0x7fffe1f98b80, 0x7fffe1f99180, 0x7fffe1f99780, 0x7fffe1f99d80, 0x7fffe1f9a380, 0x7fffe1f9a980, 0x7fffe1f9b100, 0x7fffe1f9b880, 0x7fffe1f9be80, 0x7fffe1f9c600, 0x7fffe1f9cc00, 0x7fffe1f9d200, 0x7fffe1f9d980, 0x7fffe1f9df80}, rseg_history_len = 2169495, pending_purge_rseg_array = {0x0 <repeats 128 times>}, rw_trx_set = std::set with 4 elements = {[0] = {m_id = 641241068, m_trx = 0x7fffeb0008e8}, [1] = {m_id = 641241109, m_trx = 0x7fffeb000d58}, [2] = {m_id = 641241121, m_trx = 0x7fffeb0011c8}, [3] = {m_id = 641241250, m_trx = 0x7fffeb000478}}, n_prepared_trx = 0, n_prepared_recovered_trx = 0}

Comment by Jan Lindström (Inactive) [ 2017-02-20 ]

Stack:

Thread 20 "mysqld" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffd6ffb700 (LWP 28460)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
(gdb) 
file: "../sysdeps/unix/sysv/linux/raise.c", line number: 1
(gdb) 
file: "../sysdeps/unix/sysv/linux/raise.c", line number: 1
(gdb) where
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007ffff538f40a in __GI_abort () at abort.c:89
#2  0x000055555618e82a in ut_dbg_assertion_failed (expr=0x5555566e11bc "0", file=0x5555566e15c8 "/home/jan/mysql/10.2/storage/innobase/trx/trx0purge.cc", line=1295) at /home/jan/mysql/10.2/storage/innobase/ut/ut0dbg.cc:59
#3  0x0000555556155bae in trx_purge_rseg_get_next_history_log (rseg=0x7fffe21dc380, n_pages_handled=0x7fffd6ffad90) at /home/jan/mysql/10.2/storage/innobase/trx/trx0purge.cc:1295
#4  0x00005555561564bc in trx_purge_get_next_rec (n_pages_handled=0x7fffd6ffad90, heap=0x7ffff48dfa80) at /home/jan/mysql/10.2/storage/innobase/trx/trx0purge.cc:1518
#5  0x0000555556156772 in trx_purge_fetch_next_rec (roll_ptr=0x7fffd5464428, n_pages_handled=0x7fffd6ffad90, heap=0x7ffff48dfa80) at /home/jan/mysql/10.2/storage/innobase/trx/trx0purge.cc:1590
#6  0x0000555556156a9b in trx_purge_attach_undo_recs (n_purge_threads=4, purge_sys=0x7ffff4a04200, batch_size=300) at /home/jan/mysql/10.2/storage/innobase/trx/trx0purge.cc:1664
#7  0x000055555615706f in trx_purge (n_purge_threads=4, batch_size=300, truncate=false) at /home/jan/mysql/10.2/storage/innobase/trx/trx0purge.cc:1825
#8  0x0000555556125f79 in srv_do_purge (thd=0x7fffd5416070, n_threads=4, n_total_purged=0x7fffd6ffaea0) at /home/jan/mysql/10.2/storage/innobase/srv/srv0srv.cc:2820
#9  0x0000555556126533 in srv_purge_coordinator_thread (arg=0x0) at /home/jan/mysql/10.2/storage/innobase/srv/srv0srv.cc:2965
#10 0x00007ffff7bc4464 in start_thread (arg=0x7fffd6ffb700) at pthread_create.c:333
#11 0x00007ffff54439df in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105

Comment by Marko Mäkelä [ 2017-02-20 ]

In the dataset, there are 4 recovered transactions, all of them in committed state. One of them modified 14 rows, the others only 1 row (TRX_UNDO_UPD_EXIST_REC, UPD_NODE_NO_ORD_CHANGE). All of the 1-row transactions are only generating update_undo log. It seems that those 3 transactions were modifying similar data (maybe a single non-indexed string column).

The 14-row transaction (log ending in ibdata1 page 91215) is different. Here is the complete list of undo log records for that transaction, from the end to the start (with the byte offset from the start of the 16KiB page):

0x377f TRX_UNDO_UPD_DEL_REC, table_id=0x32033
0x36f8 TRX_UNDO_DEL_MARK_REC, table_id=0x32033
0x365a TRX_UNDO_UPD_DEL_REC, table_id=0x32033
0x35c3 TRX_UNDO_DEL_MARK_REC, table_id=0x32033
0x351b TRX_UNDO_UPD_DEL_REC, table_id=0x32033
0x3484 TRX_UNDO_DEL_MARK_REC, table_id=0x32033
0x33dc TRX_UNDO_UPD_DEL_REC, table_id=0x32033
0x3345 TRX_UNDO_DEL_MARK_REC, table_id=0x32033
0x32b5 TRX_UNDO_UPD_DEL_REC, table_id=0x32033
0x322c TRX_UNDO_DEL_MARK_REC, table_id=0x32033
0x3182 TRX_UNDO_UPD_DEL_REC, table_id=0x32033
0x30e9 TRX_UNDO_DEL_MARK_REC, table_id=0x32033
0x303f TRX_UNDO_UPD_DEL_REC, table_id=0x32033
0x2fa6 TRX_UNDO_DEL_MARK_REC, table_id=0x32033

Based on the undo log data, it looks like this transaction could be updating the persistent statistics for some indexes. Based on the undo log record types, this transaction seems to be alternating between delete-marking records and inserting on top of delete-marked records. I did not check the PRIMARY KEY values if it is indeed deleting and inserting the same records, but it would seem plausible if these records were generated by an InnoDB-internal transaction.

Page 8 (SYS_TABLES clustered index root page) points (among other things) to page 5773, which contains the entries for mysql.innodb_index_stats (table_id=0x32033) and mysql.innodb_table_stats (table_id=0x32034). So, we got a confirmation that the 14-record transaction was modifying mysql.innodb_index_stats.

Based on the large table_id of these two InnoDB system tables, it seems to me that this dataset was upgraded to MySQL 5.6 (or MariaDB 10.0 or 10.1 or 10.2) after a rather long history of running under MySQL/MariaDB 5.5 or earlier.

The function trx_purge_rseg_get_next_history_log() is invoked 54 times before the crash. Next, I will try to analyze each of those 54 invocations to see what kind of corruption we are dealing with.

Because the redo log was not included in the dataset, it is possible that even though we seem to be able to repeat the reported crash, something else in the dataset could be corrupted in a way that prevents drawing conclusions.

Comment by Ranjan Ghosh [ 2017-02-20 ]

> Based on the large table_id of these two InnoDB system tables, it seems to me that this dataset was upgraded to MySQL 5.6 (or MariaDB 10.0 or 10.1 or 10.2) after a rather long history of running under MySQL/MariaDB 5.5 or earlier.

That is completely correct, btw, if this is of any help. The dataset was quite old and has been upgraded multiple times between versions.

Comment by Marko Mäkelä [ 2017-02-20 ]

This could be a consequence of not having the redo log files, but a non-debug server would issue a large number of warnings:

2017-02-20 19:33:38 140736893187840 [Warning] InnoDB: Purge reached the head of the history list, but its length is still reported as 2169495! Make a detailed bug report…

and then finally hit another assertion failure on shutdown:

#2  0x0000000000a2599a in ut_dbg_assertion_failed (
    expr=0xd856f6 "UT_LIST_GET_LEN(rseg->update_undo_list) == 0", file=<optimized out>, 
    line=<optimized out>) at /mariadb/10.2/storage/innobase/ut/ut0dbg.cc:59
#3  0x0000000000a1175a in trx_rseg_mem_free (rseg=0x24ec2e0, rseg_array=0x2218e80)
    at /mariadb/10.2/storage/innobase/trx/trx0rseg.cc:134
#4  0x0000000000a15165 in trx_sys_close () at /mariadb/10.2/storage/innobase/trx/trx0sys.cc:1120
#5  0x00000000009f0053 in innodb_shutdown () at /mariadb/10.2/storage/innobase/srv/srv0start.cc:2878
#6  0x0000000000915dde in innobase_end (hton=<optimized out>, type=<optimized out>)
    at /mariadb/10.2/storage/innobase/handler/ha_innodb.cc:4552

Comment by Marko Mäkelä [ 2017-02-20 ]

I wonder if this is shares a common cause with MDEV-11927, which is showing a different form of transaction system corruption when using Galera.

Comment by Marko Mäkelä [ 2017-02-24 ]

MDEV-11927 (permanent corruption of secondary indexes in the InnoDB system tablespace) is caused by MDEV-11585 (introduced in 10.2.3).
I would like to know if this bug is a duplicate of MDEV-11927. Please retest with the configuration parameter

innodb_change_buffering=none

and report if it works with that.

Comment by Elena Stepanova [ 2017-04-23 ]

Note there is another bug report MDEV-11843 with the same assertion failure.

Comment by Jan Lindström (Inactive) [ 2017-05-29 ]

Could not repeat anymore with 40000s run on 3 node galera with default change buffering.

Generated at Thu Feb 08 07:53:29 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.