[MDEV-18519] 0x7f0118195700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.3.7/storage/innobase/btr/btr0cur.cc line 4308 Created: 2019-02-08 Updated: 2022-06-08 Resolved: 2022-06-08 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.3.7, 10.3.12, 10.4 |
| Fix Version/s: | 10.6.9, 10.7.5, 10.8.4, 10.9.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Christian Kueppers | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | crash | ||
| Environment: |
1. environnement: Ubuntu 16.04.5, kernel: 4.4.0-141-generic x86_64, MariaDB version: 10.3.12 |
||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||
| Description |
|
Following error output is generated during execution of an "normal" update query and leads to a crash of mariadb master (master-master replication).
"Optimize table" does not help because the second crash occurred after an "optimize table run" on that database table. |
| Comments |
| Comment by Elena Stepanova [ 2019-02-09 ] | |||||||||||||||||||||||||||||||
|
Do you have an error log and a stack trace from 10.3.12? | |||||||||||||||||||||||||||||||
| Comment by Christian Kueppers [ 2019-02-11 ] | |||||||||||||||||||||||||||||||
|
No, i haven't got the error log from 10.3.12. Sadly i've lost them because of log rotation. | |||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2019-02-19 ] | |||||||||||||||||||||||||||||||
|
Enabling coredump will slow down server restart in case of a crash, as the server will take time to write the coredump. | |||||||||||||||||||||||||||||||
| Comment by Christian Kueppers [ 2019-03-21 ] | |||||||||||||||||||||||||||||||
|
We've enabled the coredump on all of our mariadb servers. Until now there was no further crash. I'll report back with further details if there is a new crash. | |||||||||||||||||||||||||||||||
| Comment by Christian Kueppers [ 2019-05-20 ] | |||||||||||||||||||||||||||||||
|
Until now there was no further occurrence. I would say that the ticket can be closed once. | |||||||||||||||||||||||||||||||
| Comment by Christian Kueppers [ 2019-08-15 ] | |||||||||||||||||||||||||||||||
|
I've noticed a new crash with same error message in version 10.3.17 on added 3. environnement. There is a 11G unpacked coredump available. I've uploaded detailed error log and coredump to ftp.askmonty.org/private (MDEV18519.tar.gz). We've manually dropped the table with corrupted index and import a backup. Please reopen this bug ticket. | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-08-16 ] | |||||||||||||||||||||||||||||||
|
onock, this ticket has not been closed. Furthermore, the core dump can only be analyzed if we have the exact same copy of the server executable and all shared libraries that were linked to it. Otherwise, the stack traces of the active threads will not be resolved correctly. Could you post the stack trace of the crashing thread, based on the core dump? You should be able to obtain it with the following:
This should identify the problematic table, and then perhaps, based on query logs, you could check if anything special was executed on the table. You might want to try to rebuild the table (ALTER TABLE t FORCE) to see if it cures the corruption. If the table is corrupted seriously enough, the server may crash during the rebuild attempt. | |||||||||||||||||||||||||||||||
| Comment by Christian Kueppers [ 2019-08-19 ] | |||||||||||||||||||||||||||||||
|
Thanks, here is the desired backtrace
We haven't got enabled the general query log. The query extracted from error log was
That don't look special for me. Database and table were found via error log entry and sql comment. Last time we took a backup from synced slave and import this. I'll try your command next time. | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-08-20 ] | |||||||||||||||||||||||||||||||
|
Vielen Dank, onock!
and then reissue the UPDATE statement that triggered the crash. If the table is corrupted, it should crash every time on this operation. I see that it happens when trying to update a clustered index record without involving page split or merge:
Once you get the crash inside gdb, then you have to issue the command up until you are in the stack frame of btr_cur_optimistic_update().
If the local variable block is not available to the debugger, try replacing block->frame with cursor.btr_cur.page_cur.block->frame. Likewise, if index is not available, try cursor->index instead. Please attach the /tmp/page.bin, or upload it to our FTP server if it contains confidential data. Also, I would like to see the output of SHOW CREATE TABLE dateien. Did you invoke any ALTER TABLE on this table while running on MariaDB 10.3? I’d like to know if this might be related to instant ADD COLUMN, which is Also, it could be useful to start a debug version of the server, to see if other assertions fail when attempting the UPDATE statement. | |||||||||||||||||||||||||||||||
| Comment by Christian Kueppers [ 2019-08-20 ] | |||||||||||||||||||||||||||||||
|
The described server environment is a production system with several hundreds of databases. The crashed table was reimportet from backup taken from slave, like i said before. From that moment until now this table is in use without any further crash. We do not currently have a test environment with the same behavior. Following ALTER TABLE commands were issued five days before:
I've placed the SHOW CREATE TABLE on ftp named "MDEV18519_2.tar.gz". | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-08-20 ] | |||||||||||||||||||||||||||||||
|
onock, thanks. Unfortunately it looks like this could indeed be caused by the instant ADD COLUMN operation. There could be something particular in the table definition that causes page_get_max_insert_size_after_reorganize() to produce a wrong result. As a possible workaround, I would suggest to add ,FORCE to such statements, to ensure that the tables will be rebuilt. If the problem is related to instant ADD COLUMN, then it should go away by causing the table to be rebuilt. We have an open issue | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-08-21 ] | |||||||||||||||||||||||||||||||
|
This looks pretty much like | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-08-21 ] | |||||||||||||||||||||||||||||||
|
onock, I checked ftp://ftp.mariadb.com/uploads but cannot see any file "MDEV18519_2.tar.gz". I would really need that SHOW CREATE TABLE output in order to proceed with the analysis. Another user in | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-08-21 ] | |||||||||||||||||||||||||||||||
|
I finally got the files. In the error log snippet, I see this mismatch by 100 bytes:
I will try to figure it out based on the SHOW CREATE TABLE output. onock, because I do not have the same mysqld and *.so files as you, I cannot view the stack traces from the core dump that you uploaded, but thankfully you posted a stack trace with some interesting pointer values. I loaded the core dump with a mismatching 10.3 server:
I will not post index->fields@index->n_fields here, because it would reveal some of the SHOW CREATE TABLE information. That information seems coherent with regard to index->trx_id_offset and index->n_uniq, so I am confident that I extracted the above data correctly from the core dump, despite using data type information from a different executable. The interesting thing here is that we have n_fields == n_core_fields, that is, the table is not in the "instant ALTER TABLE" format. This error might not be caused by instant ALTER after all. | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-08-21 ] | |||||||||||||||||||||||||||||||
|
It is also possible that the InnoDB data dictionary cache is wrong, and the table actually still was in the "instant ALTER" format. This would obviously cause such mismatch. I cannot check that, because the buffer pool was not included in the core dump, and because the .ibd file is no longer available. I have a vague memory that I fixed something like that in the past (InnoDB would "forget" about a past "instant ALTER"), but I cannot find that right now. | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-08-23 ] | |||||||||||||||||||||||||||||||
|
My analysis of the data uploaded to | |||||||||||||||||||||||||||||||
| Comment by Christian Kueppers [ 2019-08-26 ] | |||||||||||||||||||||||||||||||
|
Thanks for your work and provided information. I may have overlooked it, but can I try to help at the moment, or do I have to wait for a fix? | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-09-03 ] | |||||||||||||||||||||||||||||||
|
We have tried, but have been unable to reproduce this bug. Unfortunately I cannot fix it without a repeatable test case. | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-11-12 ] | |||||||||||||||||||||||||||||||
|
I believe that I may have found an explanation why mleich failed to repeat this corruption with recent 10.3 or 10.4. onock, you reported 10.3.17 as an affected version. Could it be that you had executed instant ADD COLUMN before upgrading to that version? In that case, the corruption could have been introduced in an earlier version and only noticed after the upgrade. | |||||||||||||||||||||||||||||||
| Comment by Christian Kueppers [ 2019-11-13 ] | |||||||||||||||||||||||||||||||
|
The dpkg log shows that the update from version 10.3.13 to 10.3.17 was done on 10.08.2019. ADD COLUMN queries were executed at the end of January and at the end of July 2019. So you can be right. Is there a way to proactively fix this (maybe via e.g. optimize table query) so that it doesn't happen again in the future? Because installing a newer version alone does not seem to help. | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-11-14 ] | |||||||||||||||||||||||||||||||
|
onock, you can have tables rebuilt by
If the corruption was caused by the bug that was fixed in You should be able to identify the format of tables with the following:
The 'old format' type code is 45 bf and the 'instant ADD' identifier is 00 12. But, beware:
| |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-11-14 ] | |||||||||||||||||||||||||||||||
|
I removed 10.3.17 from the affected versions, because according to onock the corruption could have been introduced before the upgrade. | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-11-14 ] | |||||||||||||||||||||||||||||||
|
elenst reasons that this bug is about the assertion failure, which is reporting corruption of the table, which in the reported case was likely introduced in an older 10.3 or 10.4 version of MariaDB Server before the This assertion could also fail due to something else causing the corruption. By this reasoning, I will not close this bug as a duplicate of Instead of reporting the corruption by an assertion failure, we should ideally return an error to the client and abort the operation. It might not be wise to initiate rollback of the current transaction, because that might end up corrupting the known-corrupted index even further. This is conceptually related to | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-06-07 ] | |||||||||||||||||||||||||||||||
|
This was missed in | |||||||||||||||||||||||||||||||
| Comment by Robert Kubik [ 2022-06-08 ] | |||||||||||||||||||||||||||||||
|
Hello,
Can I assume, that it's the same issue and just wait for the fix? | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-06-08 ] | |||||||||||||||||||||||||||||||
|
r.kubik, that assertion is in a different file. It is | |||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-06-08 ] | |||||||||||||||||||||||||||||||
|
In addition to the function btr_page_reorganize_low(), I fixed a few more functions so that they will return an error code instead of crashing the server.
|