[MDEV-24014] Assertion failure in btr0pcur.cc during SELECT operation Created: 2020-10-23 Updated: 2021-11-25

| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.3.21, 10.4.14 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Mik | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None |
| Environment: | Fedora 32 Server Edition, kernel version 5.6.19-300.fc32.x86_64 |
| Issue Links: | |
| Description |
|
On running the following query on a Nextcloud database, I receive a "MySQL server has gone away" error. Running the query manually results in a brief loss of connection to the database until systemd restarts the crashed service. Query:

Error in mariadb.log:
|
| Comments |
| Comment by Marko Mäkelä [ 2020-10-23 ] |
|
The assertion failure says that the sibling page links of the B-tree are corrupted. I found a similar ticket. You should try to find out the source of the corruption. Could it be |
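The sibling page links Marko mentions can be inspected directly in the data file. The following is a minimal diagnostic sketch, not an official tool: it assumes an uncompressed, unencrypted tablespace with the default 16 KiB `innodb_page_size`, and uses the well-known FIL header offsets from InnoDB's `fil0fil.h`.

```python
import struct

# InnoDB FIL page header offsets (fil0fil.h), all big-endian:
FIL_PAGE_OFFSET = 4    # this page's number, 4 bytes
FIL_PAGE_PREV = 8      # previous sibling on the same B-tree level
FIL_PAGE_NEXT = 12     # next sibling on the same B-tree level
FIL_NULL = 0xFFFFFFFF  # "no sibling"

def sibling_links(page: bytes):
    """Return (page_no, prev, next) for one raw InnoDB page."""
    page_no, = struct.unpack_from(">I", page, FIL_PAGE_OFFSET)
    prev_no, = struct.unpack_from(">I", page, FIL_PAGE_PREV)
    next_no, = struct.unpack_from(">I", page, FIL_PAGE_NEXT)
    return page_no, prev_no, next_no

def dump_links(path, page_size=16384):
    """Print the prev/next links of every page in an .ibd file, so a
    broken sibling chain can be spotted by eye."""
    with open(path, "rb") as f:
        while True:
            page = f.read(page_size)
            if len(page) < page_size:
                break
            n, prev_no, next_no = sibling_links(page)
            print(f"page {n}: prev={prev_no:#010x} next={next_no:#010x}")
```

For a healthy level, page A's `next` should name a page whose `prev` points back at A (or be `FIL_NULL` at the ends of the chain); any mismatch is the kind of corruption this assertion detects.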
| Comment by Mik [ 2020-10-23 ] |
|
Unfortunately this issue started roughly 7-8 months ago, but because only a single specific SELECT statement was crashing, I had only filed a bug with Nextcloud. I assumed they would have more intimate knowledge of the various queries Nextcloud issues that could cause corruption such as this. I no longer have log files from around the time the problem would have occurred. I assume the issue arose during a Nextcloud version upgrade, so I'll need to do some digging into the queries run during an upgrade operation around that time. The only catch is that the problem could have been created even earlier than that, as it is only triggered when viewing a specific administrative panel. Is it possible to manually repair this issue in the database? I've repaired smaller issues such as duplicate PKs manually in the past, but this seems a much bigger problem. Additionally, I am not nearly as familiar with InnoDB internals as I would need to be to investigate completely on my own. |
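On the manual-repair question: rather than patching pages in place, the usual approach is to rebuild the table from a logical dump so InnoDB recreates the B-tree (and thus the sibling links) from scratch. This is only a sketch with placeholder names, and it may not work here, since the dump itself issues a SELECT that can hit the same assertion:

```sh
# Placeholder names: adjust "nextcloud" and "oc_broken" to your schema.
# 1. Try to dump the table; if this hits the same assertion failure,
#    the data will have to come from a backup instead.
mysqldump --skip-lock-tables nextcloud oc_broken > oc_broken.sql

# 2. Drop the corrupted table and reload it so the index tree
#    is rebuilt from scratch.
mysql nextcloud -e 'DROP TABLE oc_broken'
mysql nextcloud < oc_broken.sql
```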
| Comment by Marián Černý [ 2021-11-16 ] |
|
Same problem here (it looks very similar).
also crashes the server with the same error. I didn't check whether it refers to the same source line (but I guess it does): mariadb-10.4.14/storage/innobase/btr/btr0pcur.cc line 494. Upgrading to a newer version didn't help: mariadb-10.5.13/storage/innobase/btr/btr0pcur.cc line 520.

There have been no power outages (nor crashes) on the server previously. The affected table is small, with only 22 rows; however, it has a TEXT column. Its size is 0.17 MB. There was no bit rot on the SSD: the underlying filesystem is ZFS with two disks in a mirror, and ZFS checksums all blocks.

The server does not have ECC RAM. However, I have seen a similar issue twice on different setups with master-master replication, where one server was crashing on an update from the other. Those servers had ECC RAM. So I suspect the data was corrupted by an internal error in MariaDB some time before the crash. |
| Comment by Marián Černý [ 2021-11-17 ] |
|
I have checked the older logs from the similar issue I mentioned ("with master-master replication where one server was crashing on an update from the other") and found out that it was a different issue. The other issue was:
Then there were a lot of signal 10 crashes without an assertion failure (i.e. without a reference to a source code line) while I was trying to recover. And then later I also got a crash in btr0pcur.cc, but on a different line:
So the issue from the original bug report happened to me in just one instance (but the SELECT always crashed until the data from the broken table was replaced from a backup). |
| Comment by Marko Mäkelä [ 2021-11-18 ] |
|
marian.cerny, the assertions that fail for you note that the flag that is clear for the original InnoDB format (the name ROW_FORMAT=REDUNDANT was introduced for it in MySQL 5.0.3) differs between two pages of the same index tree. I remember examining a case like that years ago; one of the pages was filled with NUL bytes.

I think that in this case, the corrupted page may be a leaf page of the clustered index. CHECKSUM TABLE should not access the internal pages of the B-tree, except for the initial path from the root page to the first leaf page. In other words, some data could be lost even if the corrupted page was manually fixed in the data file.

To my understanding, when the server is started from a file system snapshot copy of a running server’s data directory, crash recovery will typically be invoked. We do cover crash recovery in our internal testing, and DML operations are expected to be crash-safe. (DDL is only crash-safe starting with 10.6.) It is theoretically possible that the data was corrupted due to a bug in the crash recovery logic. It might also be a bug in the file system snapshot. I don’t think that we perform any internal stress testing using file system snapshots.

Without having something that repeats the problem (corrupts an initially sound database) there is not much we can do.

One last note: if you use innodb_force_recovery=6 or remove ib_logfile0 to ‘fix’ recovery issues, or you see any messages like log sequence number … is in the future, then you can expect everything to be thoroughly corrupted. |
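The flag Marko describes can also be checked per page from outside the server. A minimal sketch, under the same assumptions as any external page inspection (default 16 KiB pages, uncompressed and unencrypted tablespace); the offset is the PAGE_N_HEAP field from InnoDB's `page0page.h`, whose high bit is what `page_is_comp()` tests:

```python
import struct

# The FIL header is 38 bytes; the index page header follows it.
# PAGE_N_HEAP is a 2-byte big-endian field at offset 4 in that header.
FIL_PAGE_DATA = 38
PAGE_N_HEAP = FIL_PAGE_DATA + 4   # absolute byte offset 42

def page_is_comp(page: bytes) -> bool:
    """Python analogue of InnoDB's page_is_comp(): the 0x8000 bit of
    PAGE_N_HEAP is set for ROW_FORMAT=COMPACT/DYNAMIC pages and clear
    for the original ROW_FORMAT=REDUNDANT format."""
    n_heap, = struct.unpack_from(">H", page, PAGE_N_HEAP)
    return bool(n_heap & 0x8000)

def scan_comp_flags(path, page_size=16384):
    """List the compact flag for every page of an .ibd file, so a page
    that disagrees with its siblings (e.g. one overwritten with NUL
    bytes, which reads as 'redundant') stands out."""
    with open(path, "rb") as f:
        page_no = 0
        while True:
            page = f.read(page_size)
            if len(page) < page_size:
                break
            print(f"page {page_no}: comp={page_is_comp(page)}")
            page_no += 1
```

A page of all NUL bytes, as in the case Marko recalls, would report `comp=False` while its neighbours report `comp=True`, which is exactly the inconsistency the failing assertion `page_is_comp(next_page) == page_is_comp(page)` detects.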
| Comment by Marián Černý [ 2021-11-25 ] |
|
@Marko, thanks for looking into my comments. To sum it up, as I understand it now: the error "Assertion failure btr0pcur.cc line 52 (page_is_comp(next_page) == page_is_comp(page))" from my second comment might be a problem that resulted from crash recovery. There was no recovery from a filesystem snapshot in that case.

In my first comment I reported a similar error to the one in the initial bug report (in the Description by @Mik): "Assertion failure in btr0pcur.cc line 520 (page_is_comp(next_page) == page_is_comp(page))". This also happened initially on a server where there was no crash or crash recovery. Only once the problem had occurred was I trying to get the data back from a "backup" (a filesystem snapshot). Although there was crash recovery when I was restoring older data from a snapshot, I doubt the assertion was caused by the crash recovery: the assertion was there for the few hourly snapshots I tested and for all 6 daily snapshots. The snapshot that worked was the monthly snapshot (from 14 days ago).

I understand that this assertion is a sign of a problem that occurred earlier, and that from this state it's not easy (or even possible) to find out what happened. I do not have a way to reproduce either of the two problems. I guess both problems might be any of these three problems:
I doubt it is bit rot, because ZFS checksums were ok. |