[MDEV-27031] InnoDB: Assertion failure in file D:\winx64-packages\build\src\storage\innobase\trx\trx0trx.cc line 1288 Created: 2021-11-12 Updated: 2023-07-26 Resolved: 2023-03-24 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6.1, 10.5.13, 10.6.5 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ruut | Assignee: | Vladislav Lesin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | rr-profile-analysed | ||
| Attachments: |
| Issue Links: |
| Description |
|
Several times a day the MariaDB Windows service crashes. Every time, the same error is shown in the log. Please advise which steps to take to fix it. I am using version: Server version: 10.6.1-MariaDB - mariadb.org binary distribution Error log:
|
| Comments |
| Comment by Ruut [ 2021-11-12 ] | ||
|
I ran mysqlcheck -A from the command line and all the tables are OK. | ||
| Comment by Marko Mäkelä [ 2021-11-12 ] | ||
|
10.6.1 is an alpha release. The first generally available release was 10.6.3. The latest release in the series is 10.6.5. That said, this looks similar to wlad, can you please provide assistance for obtaining stack traces for the crash? | ||
| Comment by Vladislav Vaintroub [ 2021-11-12 ] | ||
|
marko, the stack of the crashing thread is in the bug description, after "attempting backtrace", and looks ok. The actual mariadbd.dmp, or mysqld.dmp, is missing, so stack traces of all threads are not currently available. ruut, maybe you can find a file with a .dmp extension in your D:\MariaDB\data\ directory and attach it to the bug? | ||
| Comment by Ruut [ 2021-11-13 ] | ||
|
There is no mariadbd.dmp on my C: or D: drive (in fact, I searched for *.dmp without success). Can you help me make sure this file is created on future crashes? I upgraded to 10.6.5-MariaDB - mariadb.org binary distribution, but the same problems still occur. | ||
| Comment by Vladislav Vaintroub [ 2021-11-13 ] | ||
|
Hmm, the core should be enabled on Windows by default ( Anyway, you can add core-file into mysqld section of the my.ini file, so it looks like below.
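The snippet that followed this sentence did not survive in this export. A minimal sketch of what the described change would look like, assuming the section has no other settings (any lines already under [mysqld] would stay as they are):

```ini
[mysqld]
# Ask the server to write a dump file when it crashes
core-file
```

The option is read at startup, so the service has to be restarted before it takes effect.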
| ||
| Comment by Marko Mäkelä [ 2021-11-13 ] | ||
|
wlad, thank you. Because this may be a race condition, I would need the stack traces of all threads. And even then, another thread that might have broken things from the crashing thread’s point of view might already be executing something completely different. Hopefully the .dmp file will contain the values of some variables and function parameters. | ||
| Comment by Ruut [ 2021-11-13 ] | ||
|
I attached the dump and also the relevant part of the .err file where the dumped crash happened. The service is now crashing several times per hour instead of several times a day, so the problem is getting worse. | ||
| Comment by Vladislav Vaintroub [ 2021-11-13 ] | ||
|
marko, I attached 2 text files, in VS and windbg style. The windbg one has a little more info on the crashing thread: the exception context and some stack variables. Since it is a minidump, there is not much apart from what's on the stack, so the windbg output probably contains the maximum information you can get (note, however, that the stacks listed are unique; tell me if you need to have all of them listed). | ||
| Comment by Vladislav Vaintroub [ 2021-11-13 ] | ||
|
As far as I can tell, at the moment of the crash only one thread is active, and the others are waiting on a condition variable in lock_wait(). | ||
| Comment by Marko Mäkelä [ 2021-11-14 ] | ||
|
wlad, thank you, the file all_threads_stack.txt | ||
| Comment by Vladislav Vaintroub [ 2021-11-14 ] | ||
|
marko, you do not see memory on the heap because the kind of minidump we create excludes the heap. Of course, one can attach a debugger to a running process; the debugger will break when the process is about to crash, and then one can take a minidump, a full dump, or, in the case of windbg, a dump with some filtered memory (https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/-dump--create-dump-file-). I think the ".dump /mi" command could be particularly useful in this case, because it "adds secondary memory to the minidump. Secondary memory is any memory referenced by a pointer on the stack or backing store, plus a small region surrounding this address." However, I believe the most convenient thing for the user would be if you printed those locks with fprintf calls in InnoDB prior to the crash and provided the affected user with a better-instrumented executable. | ||
| Comment by Ruut [ 2021-11-14 ] | ||
|
Sorry guys, I really appreciate your help, but I need this database to be operational again Monday morning at 8:00 Central European Time, when the office opens. So my plan is to reinstall MariaDB 10.6.5 and restore the database backups, which I expect to be all fine. Maybe the problems were caused by the 10.6.1 beta, which was installed on a production server for almost half a year (my fault, I did not notice it was a beta version); I only upgraded to 10.6.5 yesterday. Could it be that 10.6.1 contains some bugs which resulted in database corruption that is not detected by mysqlcheck but still causes the database to crash even after upgrading to 10.6.5? | ||
| Comment by Vladislav Vaintroub [ 2021-11-14 ] | ||
|
There is no data corruption in the database; the crash is presumably some concurrency bug, which results in this assertion. Or the assertion itself is a bug, but marko would know better. | ||
| Comment by Ruut [ 2021-11-14 ] | ||
|
So if I install MariaDB 10.5.13 and restore the databases, would that somehow prevent this concurrency bug from happening again? | ||
| Comment by Vladislav Vaintroub [ 2021-11-14 ] | ||
|
quite possibly, yes. | ||
| Comment by Ruut [ 2021-11-14 ] | ||
|
concerning concurrency: We use a single mariadb user for access to 4 databases which will be used by:
The server has 16 CPU cores and 40 GB of physical memory. Would you advise defining a separate MariaDB user per application pool and per Python script, or does this not matter? | ||
| Comment by Vladislav Vaintroub [ 2021-11-14 ] | ||
|
ruut, one or many mariadb users, this does not matter for concurrency. Different users are used only for access control, and privileges, and that's it. | ||
| Comment by Marko Mäkelä [ 2021-11-15 ] | ||
|
The lock.trx_locks would be allocated from the heap, via trx_pools, and therefore it would be excluded from those dumps. ruut, are those 16 CPU cores in a single CPU package? We conduct most of our internal stress tests on GNU/Linux using debug-instrumented executables. Race condition bugs sometimes become reproducible only with subtle changes of timing. Compared to the Linux kernel, the scheduler in Microsoft Windows may have different timing patterns. | ||
| Comment by Ruut [ 2021-11-15 ] | ||
|
I downgraded to 10.5.13. The same assertion error occurs as before, so unfortunately this did not solve the problem. I attached the my.ini file, which might help you find the root cause of the problem. @marko, this server has 8 cores and 16 logical processors, see attached screenshot. | ||
| Comment by Marko Mäkelä [ 2021-11-15 ] | ||
|
wlad noticed the non-default setting innodb_rollback_on_timeout=ON in your configuration. It is possible that our test coverage of that setting is insufficient. ruut, would the crashes go away if you removed that? | ||
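For reference, a minimal sketch of the suggested change in my.ini, assuming the setting lives in the [mysqld] section (the attached my.ini is not reproduced in this thread). Deleting the line has the same effect, because OFF is the default:

```ini
[mysqld]
# Default value; equivalent to removing the line entirely.
# With OFF, a lock wait timeout rolls back only the last statement,
# not the whole transaction.
innodb_rollback_on_timeout=OFF
```

The variable is read at startup only, so the server has to be restarted after the change.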
| Comment by Marko Mäkelä [ 2021-11-15 ] | ||
|
We can repeat various assertion failures when enabling innodb_rollback_on_timeout=ON in a debug build. | ||
| Comment by Ruut [ 2021-11-23 ] | ||
|
@Marko Mäkelä, I removed the innodb_rollback_on_timeout setting and crashes now only happen once per week. I attached the error that now occurs once per week. To me it looks the same as the error we received before, but maybe you see subtle differences: 2021.11.22-tail-error-log.txt | ||
| Comment by Marko Mäkelä [ 2021-11-24 ] | ||
|
I think that for analyzing this bug, I will need an https://rr-project.org trace from mleich. Stack traces from a crash are not going to tell us how we got to that situation. | ||
| Comment by Ruut [ 2022-02-18 ] | ||
|
I enabled deadlock logging and found out that around 1000 deadlocks per day occurred. Fixing the deadlock issues also solved this problem. The MariaDB service has not crashed in the last 30 days. Maybe the next time this assertion error occurs, ask the reporter to enable deadlock logging. | ||
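The thread does not say how deadlock logging was enabled. A minimal sketch, assuming the reporter used the InnoDB option innodb_print_all_deadlocks, which writes information about every detected deadlock to the server error log:

```ini
[mysqld]
# Assumption: this is the setting the reporter enabled.
# Logs every InnoDB deadlock to the error log instead of keeping
# only the latest one in SHOW ENGINE INNODB STATUS.
innodb_print_all_deadlocks=ON
```

Counting the deadlock entries in the .err file per day would then give a figure comparable to the "around 1000 deadlocks per day" mentioned above.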
| Comment by Marko Mäkelä [ 2022-08-03 ] | ||
|
mleich, on 2022-06-23 you had set the label rr-profile but forgot to specify the location of the trace. I searched for it, but didn’t find a mention of it in any chat logs either. | ||
| Comment by Matthias Leich [ 2022-08-15 ] | ||
|
Sorry, setting "rr-profile" and forgetting the location was a mistake.
origin/10.4 9a897335eb4387980aed7698b832a893dbaa3d81 2022-07-26T16:45:10+03:00 | ||
| Comment by Marko Mäkelä [ 2022-09-07 ] | ||
|
mleich, that assertion expression is different from the bug description. There are many data structures that use the InnoDB home-brew doubly linked list. That said, in | ||
| Comment by Marko Mäkelä [ 2022-09-07 ] | ||
|
Copying and adapting from
That trace involves innodb_rollback_on_timeout=ON. | ||
| Comment by Marko Mäkelä [ 2022-09-20 ] | ||
|
In | ||
| Comment by Vladislav Lesin [ 2023-03-24 ] | ||
|
The scenario is the following: This is one more variant of |