[MDEV-13056] The server keeps crashing Created: 2017-06-12  Updated: 2017-09-04  Resolved: 2017-09-04

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2.4, 10.2.6
Fix Version/s: 10.2.5

Type: Bug Priority: Major
Reporter: Manu Anttila Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: errorlog
Environment:

CentOS 6.8


Attachments: File fies1udb32.err    
Issue Links:
Duplicate
is duplicated by MDEV-11585 Hard-code the shared InnoDB temporary... Closed
is duplicated by MDEV-11927 InnoDB change buffer is not being mer... Closed

 Description   

Dump/load, CHECK TABLE, DROP TABLE, and CREATE TABLE ... AS ... all seem to crash the server.

[ERROR] [FATAL] InnoDB: SYS_COLUMNS.TABLE_ID mismatch



 Comments   
Comment by Marko Mäkelä [ 2017-06-13 ]

It looks like the error is reported by dict_load_columns():

		err_msg = dict_load_column_low(table, heap, NULL, NULL,
					       &name, rec, &nth_v_col);
 
		if (err_msg == dict_load_column_del) {
			n_skipped++;
			goto next_rec;
		} else if (err_msg) {
			ib::fatal() << err_msg;
		}

There is not enough information to say what exactly caused this corruption. The corruption resides in the InnoDB system tablespace, between the internal tables SYS_TABLES and SYS_COLUMNS.

Comment by Manu Anttila [ 2017-06-14 ]

What is the best approach to fix this?

Comment by Marko Mäkelä [ 2017-06-15 ]

manttila, if I had the data files, I would start mysqld inside gdb and then examine the data structures when the SIGABRT is caught. You probably do not want to share the files because they contain highly confidential data or are big, or both.

Also, what were the preceding events? Was the server ever killed in the past (did it crash by itself, or did something kill it externally)? What were the preceding server error log messages? If the server had crashed or it had been killed, what SQL were the client connections executing right before the crash or kill occurred?

Recently, I analyzed a case where the files had been copied while InnoDB was still running. Theoretically, this should be safe with an LVM snapshot, but in practice, after some painstaking analysis of the --debug=d,ib_log output of the debug server, I had to conclude that something in the file system snapshot was not working as intended. While individual data pages looked consistent, some pages clearly corresponded to different points in time and thus were inconsistent with each other. (In that case, the DB_ROLL_PTR in a clustered index leaf page was pointing to the middle of an undo log record; I guess the undo page had been freed and reused.) Something similar could have happened here as well, but we might never know if you did not back up the files before starting up the server. In the previous case that I mentioned, such backups were available, and it was possible to restart the server from exactly the same (badly copied, already corrupted) files over and over again.

Comment by Manu Anttila [ 2017-06-26 ]

I was loading a backup into one of the databases with a too-small max_allowed_packet.
After that I noticed the corruption.

I re-installed MariaDB and loaded a backup. Problem solved for now.

Comment by Elena Stepanova [ 2017-07-03 ]

manttila,
Do you mean a backup as in the SQL produced by mysqlbackup, or are you using other backup utilities?
Do you remember what errors you encountered while loading the backup with the too-small max_allowed_packet? Did the server/client crash, or was it the usual "server has gone away" abort without an actual crash? Do you happen to know which table was being updated at the time?
Which server version was the backup taken from?
Could you please attach your cnf file(s) and specify what value of max_allowed_packet was in effect at the time?

Comment by Elena Stepanova [ 2017-08-01 ]

Example of a crash report from the attached error log:

2017-06-08 15:21:03 140737157408512 [ERROR] [FATAL] InnoDB: SYS_COLUMNS.TABLE_ID mismatch
2017-06-08 15:21:03 0x7fffec462700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.2.6/storage/innobase/ut/ut0ut.cc line 842
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
170608 15:21:03 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.2.6-MariaDB-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=1
max_threads=153
thread_count=8
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467201 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7ffa980009a8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fffec461d70 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x7ffff7a8fa6b]
/usr/sbin/mysqld(handle_fatal_signal+0x4d5)[0x7ffff75270e5]
/lib64/libpthread.so.0(+0x3368c0f7e0)[0x7ffff6a537e0]
/lib64/libc.so.6(gsignal+0x35)[0x7ffff50ad5e5]
/lib64/libc.so.6(abort+0x175)[0x7ffff50aedc5]
/usr/sbin/mysqld(+0x4779d7)[0x7ffff72fb9d7]
/usr/sbin/mysqld(+0x986338)[0x7ffff780a338]
/usr/sbin/mysqld(+0xa220fe)[0x7ffff78a60fe]
/usr/sbin/mysqld(+0xa239bb)[0x7ffff78a79bb]
/usr/sbin/mysqld(+0xa11410)[0x7ffff7895410]
/usr/sbin/mysqld(+0x81d9fa)[0x7ffff76a19fa]
/usr/sbin/mysqld(_ZN7handler7ha_openEP5TABLEPKcij+0x3e)[0x7ffff752b1ce]
/usr/sbin/mysqld(_Z21open_table_from_shareP3THDP11TABLE_SHAREPKcjjjP5TABLEb+0xb2a)[0x7ffff744d38a]
/usr/sbin/mysqld(_Z10open_tableP3THDP10TABLE_LISTP18Open_table_context+0x732)[0x7ffff7368a02]
/usr/sbin/mysqld(_Z11open_tablesP3THDRK14DDL_options_stPP10TABLE_LISTPjjP19Prelocking_strategy+0xee7)[0x7ffff736b5a7]
/usr/sbin/mysqld(_Z29mysqld_show_create_get_fieldsP3THDP10TABLE_LISTP4ListI4ItemEP6String+0xdf)[0x7ffff741420f]
/usr/sbin/mysqld(_Z18mysqld_show_createP3THDP10TABLE_LIST+0xd7)[0x7ffff74147f7]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x3557)[0x7ffff73ac737]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x29a)[0x7ffff73b158a]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x1c1d)[0x7ffff73b380d]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x161)[0x7ffff73b3dd1]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x1af)[0x7ffff746f99f]
/usr/sbin/mysqld(handle_one_connection+0x44)[0x7ffff746fac4]
/lib64/libpthread.so.0(+0x3368c07aa1)[0x7ffff6a4baa1]
/lib64/libc.so.6(clone+0x6d)[0x7ffff5163aad]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7ffa9800fa30): SHOW CREATE TABLE `muikku`.`accesscheck`
 
Connection ID (thread ID): 12
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on

Comment by Elena Stepanova [ 2017-09-04 ]

It appears to be the same problem as initially reported in MDEV-11585, which is reported to be fixed (in the scope of MDEV-11927). Since the corruption itself apparently happened on 10.2.4, its result being visible on 10.2.6 does not contradict the fact that the corruption was fixed in 10.2.5.

Thus, until we have an indication that it's not so, I'm closing this bug as fixed along with MDEV-11585 / MDEV-11927.

Comment by Marko Mäkelä [ 2017-09-04 ]

elenst, MDEV-11585 introduced in 10.2.3 a bug that was fixed in 10.2.5 by MDEV-11927.
If manttila originally created the database in 10.2.3 or 10.2.4 and then upgraded to 10.2.5 or later, it is theoretically possible that this is somehow a duplicate of MDEV-11927.

The symptom of MDEV-11927 was that the secondary index on SYS_TABLES.ID was corrupted, because a buffered delete-mark or purge operation was not merged.
Here, the symptom is that a mismatch is reported on SYS_COLUMNS.TABLE_ID, apparently because we have a SYS_TABLES record with no matching records in SYS_COLUMNS. MariaDB or InnoDB should not allow the creation of a table with no columns.

I do not think that this can be a duplicate of MDEV-11927. When loading a table definition, we would use the clustered index record of SYS_TABLES. So, even if the index on SYS_TABLES.ID was corrupted, it should have nothing to do with this error.

Comment by Elena Stepanova [ 2017-09-04 ]

Sorry, my links are not quite correct there. I meant to say that the same problem was initially reported in MDEV-11894:

Jan 24 00:44:21 uhu1 mysqld[14620]: 2017-01-24  0:44:21 140172607566592 [ERROR] [FATAL] InnoDB: SYS_COLUMNS.TABLE_ID mismatch
Jan 24 00:44:21 uhu1 mysqld[14620]: 2017-01-24 00:44:21 0x7f7c7a7b3300  InnoDB: Assertion failure in thread 140172607566592 in file ut0ut.cc line 949

Later it was closed as fixed in scope of MDEV-11927, which, in turn, was said to be caused by MDEV-11585.

Comment by Marko Mäkelä [ 2017-09-04 ]

I checked the code, and there is no secondary index defined on SYS_COLUMNS.
MDEV-11894 reported two problems. One looks the same as this one and could have been introduced by MDEV-11585 and fixed by MDEV-11927.
But rgpublic mentioned TRUNCATE TABLE in MDEV-11927. Could it be that the MySQL 5.7 WL#6501 TRUNCATE TABLE is not entirely crash-safe and could cause the ID mismatch?

I see that the function row_truncate_update_table_id() is updating the table_id in multiple InnoDB data dictionary tables. I seem to remember that dict_load_table() is bypassing the undo log, essentially using the READ UNCOMMITTED isolation level. If that is the case, it would explain the mismatch. To reproduce the problem, we would need a crash in the middle of that function, between the update of the SYS_TABLES and SYS_COLUMNS records.

It seems to me that MDEV-11927 fixed a crash that could have triggered this bug, which in turn demonstrates that the ‘new, more crash-safe’ TRUNCATE TABLE is not actually crash-safe.

manttila, can you confirm whether any TRUNCATE TABLE was executed on InnoDB tables prior to this crash?

Comment by Marko Mäkelä [ 2017-09-04 ]

With the current 10.2, TRUNCATE TABLE appears to be crash-safe. I tested as follows:

--source include/have_innodb.inc
create table t(a int) engine=innodb;
insert into t values(42);
truncate table t;
select * from t;
drop table t;

I set a breakpoint on row_truncate_update_table_id and once it was reached (on the TRUNCATE statement), also on row_upd.
The first row_upd() call was for node->table->name = "SYS_TABLES". The second call was for "SYS_COLUMNS". At that point, I did

call log_write_up_to(log_sys->lsn, true)
run

to kill and restart the server. On the restart, I got a call to row_upd() from truncate_t::update_root_page_no() / row_truncate_update_sys_tables_during_fix_up() / truncate_t::fixup_tables_in_non_system_tablespace(), and then a call to row_truncate_update_table_id() from row_truncate_update_sys_tables_during_fix_up().

The only problem that I see in the TRUNCATE recovery is that it is not skipped when innodb_force_recovery>=3 is specified, which could cause a lock conflict with the previous TRUNCATE attempt that was interrupted by the server kill. The TRUNCATE recovery appears to be in the correct place.

So, after all, it is possible that elenst reached the correct conclusion, and that the SYS_COLUMNS.TABLE_ID mismatch corruption actually was fixed by MDEV-11927.

Comment by Marko Mäkelä [ 2017-09-04 ]

Back when I fixed MDEV-11927, I did not study the exact mechanism by which it would lead to the fatal error message that no matching SYS_COLUMNS.TABLE_ID is found.
But still, this looks very much like a duplicate of the problem that was reported in MDEV-11927. The corruption could have been introduced into the data by the 10.2.4 server, and an upgrade to a version with the bug fix (10.2.5 or later) would suffer from the already present corruption.

Generated at Thu Feb 08 08:02:33 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.