[MDEV-13056] The server keeps crashing Created: 2017-06-12 Updated: 2017-09-04 Resolved: 2017-09-04 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.2.4, 10.2.6 |
| Fix Version/s: | 10.2.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Manu Anttila | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | errorlog | ||
| Environment: |
CentOS 6.8 |
||
| Description |
|
dump/load, check table, drop table, create table ... as ... all seem to crash the server. [ERROR] [FATAL] InnoDB: SYS_COLUMNS.TABLE_ID mismatch |
| Comments |
| Comment by Marko Mäkelä [ 2017-06-13 ] |
|
It looks like the error is reported by dict_load_columns():
There is not enough information to say what exactly caused this corruption. The corruption is in the InnoDB system tablespace: the internal tables SYS_TABLES and SYS_COLUMNS are inconsistent with each other. |
| Comment by Manu Anttila [ 2017-06-14 ] |
|
What is the best approach to fix this? |
| Comment by Marko Mäkelä [ 2017-06-15 ] |
|
manttila, if I had the data files, I would start mysqld inside gdb and then examine the data structures when the SIGABRT is caught. You probably do not want to share the files because they contain highly confidential data or are big, or both.
Also, what were the preceding events? Was the server ever killed in the past (did it crash by itself, or did something kill it externally)? What were the preceding server error log messages? If the server had crashed or been killed, what SQL were the client connections executing right before the crash or kill occurred?
Recently, I analyzed a case where the files had been copied while InnoDB was still running. Theoretically, that should be safe with an LVM snapshot, but in practice, after some painstaking analysis of the --debug=d,ib_log output of the debug server, I had to conclude that something in the file system snapshot was not working as intended. While individual data pages looked consistent, some pages clearly corresponded to different points in time and thus were inconsistent with each other. (In that case, the DB_ROLL_PTR in a clustered index leaf page was pointing to the middle of an undo log record; I guess the undo page had been freed and reused.)
Something similar could have happened here as well, but we might never know if you did not back up the files before starting up the server. For the previous case that I mentioned, such backups were available, and it was possible to restart the server from exactly the same (badly copied, already corrupted) files over and over again. |
| Comment by Manu Anttila [ 2017-06-26 ] |
|
I was loading a backup into one of the databases with max_allowed_packet set too small. I re-installed MariaDB and loaded a backup. Problem solved for now. |
| Comment by Elena Stepanova [ 2017-07-03 ] |
|
manttila, |
| Comment by Elena Stepanova [ 2017-08-01 ] |
|
Example of a crash report from the attached error log:
|
| Comment by Elena Stepanova [ 2017-09-04 ] |
|
It appears to be the same problem as initially reported in Thus, until we have an indication that it's not so, I'm closing this bug as fixed along with |
| Comment by Marko Mäkelä [ 2017-09-04 ] |
|
elenst, The symptom of I do not think that this can be a duplicate of |
| Comment by Elena Stepanova [ 2017-09-04 ] |
|
Sorry, my links are not quite correct there. I meant to say that the same problem was initially reported in
Later it was closed as fixed in scope of |
| Comment by Marko Mäkelä [ 2017-09-04 ] |
|
I checked the code, and there is no secondary index defined on SYS_COLUMNS. I see that the function row_truncate_update_table_id() updates the table_id in multiple InnoDB data dictionary tables. I seem to remember that dict_load_table() bypasses the undo log, essentially using the READ UNCOMMITTED isolation level. If that is the case, it would explain the mismatch. To reproduce the problem, we would need a crash in the middle of that function, between the update of the SYS_TABLES and SYS_COLUMNS records.
It seems to me that manttila, can you confirm whether any TRUNCATE TABLE was executed on InnoDB tables prior to this crash? |
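The hypothesis in the comment above can be illustrated with a toy model. This is only a sketch under loud assumptions: two Python dicts stand in for SYS_TABLES and SYS_COLUMNS, and every function and class name below is hypothetical, not an InnoDB API. It shows how a kill between the two table_id updates, combined with a reader that ignores the undo log, would produce the reported mismatch.

```python
# Toy model only: these dicts stand in for InnoDB's SYS_TABLES and SYS_COLUMNS.
# None of the names below are real InnoDB functions.

class SimulatedCrash(Exception):
    """Stands in for the server being killed mid-operation."""


def truncate_update_table_id(sys_tables, sys_columns, name, new_id,
                             crash_between=False):
    """Give table `name` a new id in both dictionary tables, one after the other."""
    old_id = sys_tables[name]
    sys_tables[name] = new_id                      # step 1: SYS_TABLES record
    if crash_between:
        raise SimulatedCrash("killed between the two updates")
    sys_columns[new_id] = sys_columns.pop(old_id)  # step 2: SYS_COLUMNS records


def load_columns(sys_tables, sys_columns, name):
    """Read both tables as they are on disk, like a READ UNCOMMITTED reader."""
    table_id = sys_tables[name]
    if table_id not in sys_columns:
        raise RuntimeError("SYS_COLUMNS.TABLE_ID mismatch")
    return sys_columns[table_id]


sys_tables = {"t1": 10}
sys_columns = {10: ["a", "b"]}

try:
    truncate_update_table_id(sys_tables, sys_columns, "t1", 11,
                             crash_between=True)
except SimulatedCrash:
    pass  # the "server" restarts with whatever was written so far

try:
    load_columns(sys_tables, sys_columns, "t1")
    outcome = "loaded"
except RuntimeError as err:
    outcome = str(err)

print(outcome)  # prints "SYS_COLUMNS.TABLE_ID mismatch"
```

In this model the mismatch persists only if nothing completes or rolls back the half-finished update on restart, which is why the question of whether TRUNCATE recovery runs at startup matters for the real server.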
| Comment by Marko Mäkelä [ 2017-09-04 ] |
|
With the current 10.2, TRUNCATE TABLE appears to be crash-safe. I tested as follows:
I set a breakpoint on row_truncate_update_table_id and once it was reached (on the TRUNCATE statement), also on row_upd.
to kill and restart the server. On the restart, I got a call to row_upd() from truncate_t::update_root_page_no()/row_truncate_update_sys_tables_during_fix_up()/truncate_t::fixup_tables_in_non_system_tablespace(), and then a call to row_truncate_update_table_id() from row_truncate_update_sys_tables_during_fix_up().
The only problem that I see in the TRUNCATE recovery is that it is not being skipped if innodb_force_recovery>=3 is specified, and that could cause a lock conflict with the previous attempt of TRUNCATE that was interrupted by a server kill. The TRUNCATE recovery appears to be in the correct place.
So, after all, it is possible that elenst drew the correct conclusion, and that the corruption on SYS_COLUMNS.TABLE_ID mismatch actually was fixed by |
| Comment by Marko Mäkelä [ 2017-09-04 ] |
|
Back when I fixed |