[MDEV-25163] Rowid filter does not process storage engine error correctly. Created: 2021-03-16 Updated: 2023-11-28 Resolved: 2023-09-25 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5.8, 10.6.10, 10.6.12, 10.5.21, 10.5.22 |
| Fix Version/s: | 10.4.32, 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Ingi Gauti Ragnarsson | Assignee: | Vladislav Lesin |
| Resolution: | Fixed | Votes: | 3 |
| Labels: | None | ||
| Attachments: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
We had this crash on production database, but we cant see reason for this. We managed to run it again by setting innodb_force_recovery=1. The server is running fine, but not knowing what caused this makes me worry
|
| Comments |
| Comment by Marko Mäkelä [ 2021-03-16 ] | ||||||||||||||||||||||||||||||||
|
Can you provide a stack trace of all executing threads during the time of the crash? | ||||||||||||||||||||||||||||||||
| Comment by Ingi Gauti Ragnarsson [ 2021-03-23 ] | ||||||||||||||||||||||||||||||||
|
Hey Marko, it happened again this sunday and we managed to get a core dump. I can provide you will link to it, it´s about 200mb zipped file (5GB), but can I provide you with the link in some more private way then public forum, such as email? | ||||||||||||||||||||||||||||||||
| Comment by Ingi Gauti Ragnarsson [ 2021-04-14 ] | ||||||||||||||||||||||||||||||||
|
Hey Marko, we´ve had few more of those crashes. Have you been able to check out the log I emailed you? | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-04-26 ] | ||||||||||||||||||||||||||||||||
|
ingig, sorry, I had not notice your message until today. The core dump alone cannot be analyzed; I would need a copy of the executable and the shared libraries that generated it.
Note: We already have some stack traces attached to Do you use Galera replication? | ||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2021-04-26 ] | ||||||||||||||||||||||||||||||||
|
ingig to quickly get the shared libs for mysqld, you can copy the core file (which should have the word 'core' in the file name for the purposes explained here; please rename to 'core' if needbe), plus the mysqld binary itself into a directory somewhere (note that if your binary is called mariadbd instead, then copying the mysqld symlink (which points to the mariadbd binary) will make a copy of the mariadbd binary and name it mysqld automatically, which is fine too). Then, execute this script from the directory where you copied mysqld to: | ||||||||||||||||||||||||||||||||
| Comment by Nikola Blazincic [ 2021-06-17 ] | ||||||||||||||||||||||||||||||||
|
Hi, I will be taking over this issue since the original reporter is leaving the company. Please reopen this bug/ticket. Hope to hear from you soon, and thank you. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-06-17 ] | ||||||||||||||||||||||||||||||||
|
nblazincic, I am reopening this as requested. But, I will be unable to proceed until you post some information from a core dump. | ||||||||||||||||||||||||||||||||
| Comment by Nikola Blazincic [ 2021-06-17 ] | ||||||||||||||||||||||||||||||||
| Comment by Nikola Blazincic [ 2021-06-17 ] | ||||||||||||||||||||||||||||||||
|
https://drive.google.com/file/d/1qsuHLk__EcRlf3aHj3WKyRd8VJoXrj1D/view?usp=sharing | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-06-17 ] | ||||||||||||||||||||||||||||||||
|
nblazincic, can you post the stack traces from the core dump here? | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-06-18 ] | ||||||||||||||||||||||||||||||||
|
Even from a core dump, it might be very tricky to determine what could have happened in the past that would have caused the problem. (And due to I remembered that we had a somewhat similar mystery case in | ||||||||||||||||||||||||||||||||
| Comment by Nikola Blazincic [ 2021-06-18 ] | ||||||||||||||||||||||||||||||||
|
I hope that this is what you need. Output of command Also output of /var/log/messages may be of some use (i cleaned it up due to privacy reasons) I noticed this line: So this will be adjusted immediately | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-06-18 ] | ||||||||||||||||||||||||||||||||
|
nblazincic, in gdb_output.txt To get the function parameters, an additional package with debugging symbols needs to be installed, if one is available for the server package that you are using. Alternatively, the executable has to be compiled with appropriate flags. It might be best to compile a debug version of the server. With luck, some debug assertion could fail earlier and give better hints. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-06-18 ] | ||||||||||||||||||||||||||||||||
|
If we ignore the fact that the source code line numbers are missing in gdb_output.txt
And in fact, we do have the line number in mariadb_log.txt
It indeed is the check_trx_state() call that triggered the crash in | ||||||||||||||||||||||||||||||||
| Comment by Nikola Blazincic [ 2021-06-28 ] | ||||||||||||||||||||||||||||||||
|
Sorry for late reply, vacation. Will check if running on debug version is an option for us on this system, or we will try to update to latest version and see if this will be resolved. | ||||||||||||||||||||||||||||||||
| Comment by Nikola Blazincic [ 2021-07-29 ] | ||||||||||||||||||||||||||||||||
|
We did an update to 10.5.11 and got the same crash today. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-07 ] | ||||||||||||||||||||||||||||||||
|
We have been unable to reproduce this in our internal testing. An SQL test case or a fully resolved stack trace of the crash would be very much appreciated. Normal installation packages of the server do not include debugging information (which is not to be confused with a debug-instrumented build). The debugging information (some files added to /usr/lib/debug/.build-id/) is typically available in a package whose name includes the string -debuginfo or -dbgsym. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-07 ] | ||||||||||||||||||||||||||||||||
|
nblazincic, ingig, if a later version crashes, can you please post the server error log and the resolved stack traces related to the crash? | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-07 ] | ||||||||||||||||||||||||||||||||
|
ingig, why was innodb_force_recovery=1 needed in the crash recovery? | ||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2023-03-10 ] | ||||||||||||||||||||||||||||||||
|
In | ||||||||||||||||||||||||||||||||
| Comment by Niklas Wessén [ 2023-07-28 ] | ||||||||||||||||||||||||||||||||
|
We are having recurring crashes with MariaDB server 10.6.14 running on CentOS 7.9, getting the same error in mysql-error.log: 2023-07-21 17:33:04 0x7f988c6c0700 InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.6.14/storage/innobase/trx/trx0trx.cc line 1252 It restarts automatically after the crash. Out of 10 databases, all with the same setup, two behave this way, the other do not. I can provide error logs, core dumps and full stack traces. | ||||||||||||||||||||||||||||||||
| Comment by Ivan [ 2023-08-01 ] | ||||||||||||||||||||||||||||||||
|
We are having the same issue also with MariaDB server 10.6.14. The error happens several times per hour. We can also provide logs if needed. We switched from 10.4 because it was showing the same behavior hoping 10.6 would fix it but we had no luck. | ||||||||||||||||||||||||||||||||
| Comment by Niklas Wessén [ 2023-08-07 ] | ||||||||||||||||||||||||||||||||
|
@marko , I see from the list of jiras assigned to you that you have a lot on your plate. But if I can provide fully resolved stack traces of core dumps, there might at least be a chance that the problem might be easy to pin down? (if I've read earlier comments correctly, you never did receive fully resolved stack traces from the previous reporters). | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-08-07 ] | ||||||||||||||||||||||||||||||||
|
For the InnoDB: Failing assertion: UT_LIST_GET_LEN(lock.trx_locks) == 0 there is another open bug nikwes, would the 10.6 development snapshot https://ci.mariadb.org/37167/ happen to fix this assertion failure for you? | ||||||||||||||||||||||||||||||||
| Comment by Niklas Wessén [ 2023-08-07 ] | ||||||||||||||||||||||||||||||||
|
I hope so. But, I can not install that snapshot in production. And since I do not know what triggers the crash, other than that it happens intermittently on two of our production db servers, I, unfortunately, have no way of testing it. I'll wait for 10.6.15, and I will leave a new comment letting you know whether it solved the problem. | ||||||||||||||||||||||||||||||||
| Comment by Ivan [ 2023-08-09 ] | ||||||||||||||||||||||||||||||||
|
Hi @marko 2023-08-09 10:39:12 0x7fcfa13e0640 InnoDB: Assertion failure in file /home/buildbot/amd64-rhel-9-rpm-autobake/build/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/storage/innobase/trx/trx0trx.cc line 1252 Let me know if the full log would be useful to you or if there anything else I can do to help debug the issue. | ||||||||||||||||||||||||||||||||
| Comment by Niklas Wessén [ 2023-08-14 ] | ||||||||||||||||||||||||||||||||
|
I upgraded one of the two crashing to MariaDB server 10.11.4, but no avail. As I suspected, the bug is present in that version too. | ||||||||||||||||||||||||||||||||
| Comment by Niklas Wessén [ 2023-08-16 ] | ||||||||||||||||||||||||||||||||
|
Still getting the error with MariaDB 10.6.15. It ran without crashing for 33h after the upgrade, so it looked promising, but at last it once again crashed with the same Failing assertion error. Let me know if you want me to provide resolved bug traces, core dumps etc. | ||||||||||||||||||||||||||||||||
| Comment by Niklas Wessén [ 2023-08-16 ] | ||||||||||||||||||||||||||||||||
|
I can now confirm that the error persists in MariaDB 10.11.5 as well. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-08-17 ] | ||||||||||||||||||||||||||||||||
|
Rey and nikwes, thank you for testing that this bug still occurs in both 10.6.15 and 10.11.5 after | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-08-17 ] | ||||||||||||||||||||||||||||||||
|
In MDEV-31941 there is an example where a SELECT that is executed ‘inside’ LOCK TABLES can cause this assertion failure. | ||||||||||||||||||||||||||||||||
| Comment by Niklas Wessén [ 2023-08-21 ] | ||||||||||||||||||||||||||||||||
|
I have checked our code. We do not use LOCK TABLE(S) anywhere. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-04 ] | ||||||||||||||||||||||||||||||||
|
A support customer of ours got a crash on MariaDB 10.5.22 (after the fix of
| ||||||||||||||||||||||||||||||||
| Comment by Niklas Wessén [ 2023-09-04 ] | ||||||||||||||||||||||||||||||||
|
I guess there can be different culprits triggering the same assertion failure. We run a master-slave setup, but we only get the assertion failure on the master node. To test if the master's data was corrupted, I removed all data from the master and replaced it with the slave's data, whereafter the frequency of the crashes went from 25 times per week to once a week. So the crashes did not cease entirely though. | ||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-09-06 ] | ||||||||||||||||||||||||||||||||
|
nikwes, I believe that this bug heavily depends on timing. Any subtle changes of data layout could affect it. Examples: how fragmented are the indexes or undo logs, how much history of committed transactions remains to be purged. To reproduce this, maybe vlad.lesin should carefully review the logic of trx_pools, to try to ensure that the reuse of memory objects for InnoDB transactions will be more likely. | ||||||||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2023-09-19 ] | ||||||||||||||||||||||||||||||||
|
Looking at the patch... This part:
This is true... But then the code in rr_quick is [sometimes] redundant ...
Then, again:
So, the error is reported twice... but this is outside of the scope of this patch... | ||||||||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2023-09-19 ] | ||||||||||||||||||||||||||||||||
|
It's not "will be set" it is "has been set" at this point. get_next() call is dozen lines above. Please change the wording. Please also add
to make sure the error state has been set. Also this gives a clue how to add test coverage for the changed code in sql_join_cache.cc : https://gist.github.com/spetrunia/d2e78780184e09842f4299a87531d848 | ||||||||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2023-09-19 ] | ||||||||||||||||||||||||||||||||
|
Also please apply this coding style fix: https://gist.github.com/spetrunia/7edad4c40927c1d804aa13ec96aecd60 | ||||||||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2023-09-25 ] | ||||||||||||||||||||||||||||||||
|
More review input provided on Slack. Ok to push when addressed. | ||||||||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2023-11-22 ] | ||||||||||||||||||||||||||||||||
|
Summary: a query that's doing a locking read and using Rowid Filter could cause a server crash if it has encountered a Lock Wait Timeout or Deadlock opr a similar error when building the Rowid Filter. |