[MDEV-11894] Galera crashes after a few minutes - unusable Created: 2017-01-24 Updated: 2020-08-25 Resolved: 2017-05-29 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.2.3 |
| Fix Version/s: | 10.2.5 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ranjan Ghosh | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | None |
| Environment: | Ubuntu 16.04.1 LTS |
| Issue Links: | |
| Sprint: | 10.0.30, 10.3.1-1 |
| Description |
|
I've restarted completely from scratch multiple times (e.g. wiping /var/lib/mysql, running mysql_install_db, restoring the dump). I just can't get the second node to run. Below I quote first my mysql.cnf, then two variations of the errors I get.
|
| Comments |
| Comment by Ranjan Ghosh [ 2017-02-16 ] | ||||||||||||||||||||||||||||
|
Ah, I forgot about this bug. I was able to solve it. The problem was that one of our table files ("simple_sitemap") was broken. The error message even gives a hint (TRUNCATE simple_sitemap). As soon as there was any access to that table, the database halted. The problem for me was that the error message (SYS_COLUMNS.TABLE_ID mismatch) didn't say much, and googling for it didn't yield any results. Hopefully this bug report will at least make sure that anybody else confronted with this error can now google the message and find out: the likely reason is that some of the table files are defective. I don't know how the table file became broken in the first place, and I wonder whether MariaDB shouldn't behave more gracefully in such a case. I would have expected a simple error when querying the table, or something like that; instead the whole database came down. In case anyone is interested, I could provide the "toxic" frm/ibd files if that is of any help. I don't know whether they contain any customer data, so I'd need a way to post them safely somewhere (mail would be OK). They're not large (approx. 100 KB). |
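For anyone else who lands here via the SYS_COLUMNS.TABLE_ID error, the workaround described above can be sketched in SQL. This is a hedged sketch, not an official recovery procedure: the table name is taken from this report, and TRUNCATE discards all rows, so it is only reasonable if the table's contents can be restored from a dump afterwards.

```sql
-- Sketch of the workaround described above (assumes the affected
-- table is known; "simple_sitemap" is the one from this report).

-- 1. Confirm which table triggers the problem:
CHECK TABLE simple_sitemap;

-- 2. Follow the hint from the error message. TRUNCATE recreates the
--    table, discarding the (possibly corrupted) contents:
TRUNCATE TABLE simple_sitemap;

-- 3. Restore the table's contents from a known-good dump, e.g.:
--    mysql dbname < simple_sitemap_dump.sql
```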
| Comment by Jan Lindström (Inactive) [ 2017-02-20 ] | ||||||||||||||||||||||||||||
|
I could not yet repeat the assertion on SYS_COLUMNS.TABLE_ID. First, I was able to get the server up and running with innodb-force-recovery=2. However, with a normal startup using a debug build it crashed here:
|
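For reference, the diagnostic startup mentioned above corresponds to a my.cnf fragment like the following. This is a sketch of a temporary setting, not a fix: innodb_force_recovery is documented as a diagnostic option and should be removed once the data has been dumped or repaired.

```ini
# Temporary diagnostic setting, not a permanent configuration.
# Level 2 additionally prevents the master thread and purge threads
# from running, which was enough to start the server on this dataset.
[mysqld]
innodb_force_recovery = 2
```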
| Comment by Jan Lindström (Inactive) [ 2017-02-20 ] | ||||||||||||||||||||||||||||
|
Stack:
|
| Comment by Marko Mäkelä [ 2017-02-20 ] | ||||||||||||||||||||||||||||
|
In the dataset, there are 4 recovered transactions, all of them in committed state. One of them modified 14 rows; the others modified only 1 row each (TRX_UNDO_UPD_EXIST_REC, UPD_NODE_NO_ORD_CHANGE). All of the 1-row transactions generate only update_undo log. It seems that those 3 transactions were modifying similar data (maybe a single non-indexed string column).

The 14-row transaction (log ending in ibdata1 page 91215) is different. Here is the complete list of undo log records for that transaction, from the end to the start (with the byte offset from the start of the 16KiB page):

0x377f TRX_UNDO_UPD_DEL_REC, table_id=0x32033

Based on the undo log data, it looks like this transaction could be updating the persistent statistics for some indexes. Based on the undo log record types, this transaction seems to alternate between delete-marking records and inserting on top of delete-marked records. I did not check the PRIMARY KEY values to verify that it is indeed deleting and inserting the same records, but it would seem plausible if these records were generated by an InnoDB-internal transaction.

Page 8 (the SYS_TABLES clustered index root page) points (among other things) to page 5773, which contains the entries for mysql.innodb_index_stats (table_id=0x32033) and mysql.innodb_table_stats (table_id=0x32034). So we got confirmation that the 14-record transaction was modifying mysql.innodb_index_stats. Based on the large table_id of these two InnoDB system tables, it seems to me that this dataset was upgraded to MySQL 5.6 (or MariaDB 10.0, 10.1 or 10.2) after a rather long history of running under MySQL/MariaDB 5.5 or earlier.

The function trx_purge_rseg_get_next_history_log() is invoked 54 times before the crash. Next, I will try to analyze each of those 54 invocations to see what kind of corruption we are dealing with.
Because the redo log was not included in the dataset, it is possible that even though we seem to be able to repeat the reported crash, something else in the dataset could be corrupted in a way that prevents drawing conclusions. |
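As a side note, the table_id values quoted above can be cross-checked on a running MariaDB 10.2 server through the INFORMATION_SCHEMA.INNODB_SYS_TABLES view. A small sketch (the hexadecimal ids 0x32033 and 0x32034 are 204851 and 204852 in decimal):

```sql
-- Cross-check the table_id values of the persistent-statistics tables
-- (0x32033 = 204851, 0x32034 = 204852).
SELECT TABLE_ID, NAME
FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES
WHERE NAME IN ('mysql/innodb_index_stats', 'mysql/innodb_table_stats');
```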
| Comment by Ranjan Ghosh [ 2017-02-20 ] | ||||||||||||||||||||||||||||
|
> Based on the large table_id of these two InnoDB system tables, it seems to me that this dataset was upgraded to MySQL 5.6 (or MariaDB 10.0, 10.1 or 10.2) after a rather long history of running under MySQL/MariaDB 5.5 or earlier.

That is completely correct, by the way, in case it helps. The dataset is quite old and has been upgraded multiple times between versions. |
| Comment by Marko Mäkelä [ 2017-02-20 ] | ||||||||||||||||||||||||||||
|
This could be a consequence of not having the redo log files, but a non-debug server would issue a large number of warnings:
and then finally hit another assertion failure on shutdown:
|
| Comment by Marko Mäkelä [ 2017-02-20 ] | ||||||||||||||||||||||||||||
|
I wonder if this shares a common cause with |
| Comment by Marko Mäkelä [ 2017-02-24 ] | ||||||||||||||||||||||||||||
|
and report if it works with that. |
| Comment by Elena Stepanova [ 2017-04-23 ] | ||||||||||||||||||||||||||||
|
Note there is another bug report |
| Comment by Jan Lindström (Inactive) [ 2017-05-29 ] | ||||||||||||||||||||||||||||
|
Could not repeat this anymore with a 40000-second run on a 3-node Galera cluster with default change buffering. |