[MDEV-25163] Rowid filter does not process storage engine error correctly. Created: 2021-03-16  Updated: 2023-11-28  Resolved: 2023-09-25

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.8, 10.6.10, 10.6.12, 10.5.21, 10.5.22
Fix Version/s: 10.4.32, 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3

Type: Bug Priority: Major
Reporter: Ingi Gauti Ragnarsson Assignee: Vladislav Lesin
Resolution: Fixed Votes: 3
Labels: None

Attachments: Text File gdb_output.txt     PNG File image-2021-03-16-13-47-55-914.png     Text File mariadb_log.txt     PNG File screenshot-1.png    
Issue Links:
Blocks
blocks MDEV-24035 Failing assertion: UT_LIST_GET_LEN(lo... Open
Duplicate
duplicates MDEV-24322 Assertion failure in file /home/build... Closed
is duplicated by MDEV-29478 Database Crashed with no reason Closed
is duplicated by MDEV-29536 InnoDB: Assertion failure Closed
is duplicated by MDEV-31780 InnoDB: Assertion failure in file D:\... Closed
Relates
relates to MDEV-25296 Failing assertion: UT_LIST_GET_LEN(tr... Closed
relates to MDEV-25594 Crash in deadlock checker under high ... Closed
relates to MDEV-27031 InnoDB: Assertion failure in file D:\... Closed
relates to MDEV-31941 Assertion failures upon using OQGraph... Confirmed
relates to MDEV-32201 join_init_read_record() wrongly issue... Confirmed
relates to MDEV-32448 MariaDB 10.6.11 - InnoDB: Assertion f... Closed

 Description   

We had this crash on production database, but we cant see reason for this.
It was not possible to restart the server again and got error message (see picture below).

We managed to run it again by setting innodb_force_recovery=1. The server is running fine, but not knowing what caused this makes me worry

Mar 15 15:59:54 mysql02-server mariadbd: 2021-03-15 15:59:54 3535477 [Warning] Sort aborted, host: web06, user: dbuser, thread: 3535477, 
query: SELECT * FROM `******` WHERE id = 123 ORDER BY id ASC, id ASC LIMIT 1 FOR UPDATE
Mar 15 15:59:54 mysql02-server mariadbd: 2021-03-15 15:59:54 0x7fc49e404700 
 InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.5.8/storage/innobase/trx/trx0trx.cc line 1336
Mar 15 15:59:54 mysql02-server mariadbd: InnoDB: Failing assertion: UT_LIST_GET_LEN(lock.trx_locks) == 0
Mar 15 15:59:54 mysql02-server mariadbd: InnoDB: We intentionally generate a memory trap.
Mar 15 15:59:54 mysql02-server mariadbd: InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
Mar 15 15:59:54 mysql02-server mariadbd: InnoDB: If you get repeated assertion failures or crashes, even
Mar 15 15:59:54 mysql02-server mariadbd: InnoDB: immediately after the mysqld startup, there may be
Mar 15 15:59:54 mysql02-server mariadbd: InnoDB: corruption in the InnoDB tablespace. Please refer to
Mar 15 15:59:54 mysql02-server mariadbd: InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
Mar 15 15:59:54 mysql02-server mariadbd: InnoDB: about forcing recovery.
Mar 15 15:59:54 mysql02-server mariadbd: 210315 15:59:54 [ERROR] mysqld got signal 6 ;
Mar 15 15:59:54 mysql02-server mariadbd: This could be because you hit a bug. It is also possible that this binary
Mar 15 15:59:54 mysql02-server mariadbd: or one of the libraries it was linked against is corrupt, improperly built,
Mar 15 15:59:54 mysql02-server mariadbd: or misconfigured. This error can also be caused by malfunctioning hardware.
Mar 15 15:59:54 mysql02-server mariadbd: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Mar 15 15:59:54 mysql02-server mariadbd: We will try our best to scrape up some info that will hopefully help
Mar 15 15:59:54 mysql02-server mariadbd: diagnose the problem, but since we have already crashed,
Mar 15 15:59:54 mysql02-server mariadbd: something is definitely wrong and this may fail.



 Comments   
Comment by Marko Mäkelä [ 2021-03-16 ]

Can you provide a stack trace of all executing threads during the time of the crash?

Comment by Ingi Gauti Ragnarsson [ 2021-03-23 ]

Hey Marko, it happened again this sunday and we managed to get a core dump. I can provide you will link to it, it´s about 200mb zipped file (5GB), but can I provide you with the link in some more private way then public forum, such as email?

Comment by Ingi Gauti Ragnarsson [ 2021-04-14 ]

Hey Marko, we´ve had few more of those crashes. Have you been able to check out the log I emailed you?

Comment by Marko Mäkelä [ 2021-04-26 ]

ingig, sorry, I had not notice your message until today. The core dump alone cannot be analyzed; I would need a copy of the executable and the shared libraries that generated it.
Maybe you could help a little more and provide the output of the following:

gdb -ex 'set height 0' -ex 'thread apply all backtrace' -ex 'quit' /usr/sbin/mariadbd core.28668

Note: We already have some stack traces attached to MDEV-24322, but no copy of the corresponding core dump.

Do you use Galera replication?

Comment by Roel Van de Paar [ 2021-04-26 ]

ingig to quickly get the shared libs for mysqld, you can copy the core file (which should have the word 'core' in the file name for the purposes explained here; please rename to 'core' if needbe), plus the mysqld binary itself into a directory somewhere (note that if your binary is called mariadbd instead, then copying the mysqld symlink (which points to the mariadbd binary) will make a copy of the mariadbd binary and name it mysqld automatically, which is fine too). Then, execute this script from the directory where you copied mysqld to:
https://github.com/mariadb-corporation/mariadb-qa/blob/master/ldd_files.sh
It will copy all shared libs into the same directory. Then tar (tar -zhcf mdev-25163.tar.gz) or zip the lot (except the core file for confidentiality, which you can remove from the directory before compressing, and sent us in some other way for example via a support ticket or email) and upload the compressed file to this ticket please. Note that all this (except the tar) has to be done from and on the machine where the core was generated, as the binary and libs can be, and often are, specific to that machine. Let me know if you run into any issues.

Comment by Nikola Blazincic [ 2021-06-17 ]

Hi, I will be taking over this issue since the original reporter is leaving the company. Please reopen this bug/ticket.
and provide you with details.
The latest crash was yesterday, generated requested files from the latest core file. A tar archive is created but it is over 10 MB so it can not be uploaded here. What would be the most appropriate way to send it to you?
Version is 10.5.8.
Galera replication is not used. [^mdev-25163.tar.gz]

Hope to hear from you soon, and thank you.

Comment by Marko Mäkelä [ 2021-06-17 ]

nblazincic, I am reopening this as requested. But, I will be unable to proceed until you post some information from a core dump.

Comment by Nikola Blazincic [ 2021-06-17 ]

Thank you. How to send files over 10 MB ?

Comment by Nikola Blazincic [ 2021-06-17 ]

https://drive.google.com/file/d/1qsuHLk__EcRlf3aHj3WKyRd8VJoXrj1D/view?usp=sharing

Comment by Marko Mäkelä [ 2021-06-17 ]

nblazincic, can you post the stack traces from the core dump here?

Comment by Marko Mäkelä [ 2021-06-18 ]

Even from a core dump, it might be very tricky to determine what could have happened in the past that would have caused the problem. (And due to MDEV-10814, core dumps will by default not include a copy of the buffer pool, which might be crucial for analyzing this.)

I remembered that we had a somewhat similar mystery case in MDEV-25594 that affected a support customer. Their problems went away after they upgraded to a custom build that was almost the same as 10.5.10 the release, plus some InnoDB cleanup that will be available in the 10.5.11 release. I do not know whether their problem was addressed by the InnoDB changes or some fix to error handling outside InnoDB (fixing something similar to MDEV-21987). With race conditions, even small changes to unrelated code can have a huge effect on the probability of hitting the issue.

Comment by Nikola Blazincic [ 2021-06-18 ]

I hope that this is what you need. Output of command
gdb -ex 'set height 0' -ex 'thread apply all backtrace' -ex 'quit' /usr/sbin/mariadbd core._some_number
is attached.
gdb_output.txt

Also output of /var/log/messages may be of some use (i cleaned it up due to privacy reasons)
mariadb_log.txt

I noticed this line:
Jun 16 13:04:35 server_name mariadbd: 2021-06-16 13:04:35 0 [Warning] Could not increase number of max_open_files to more than 16384 (request: 34047)

So this will be adjusted immediately

Comment by Marko Mäkelä [ 2021-06-18 ]

nblazincic, in gdb_output.txt there are no function call parameters available. For diagnosing this, it is not sufficient to only have the function names in the call stacks. Even if the parameter values were available, we may have to be somewhat lucky.

To get the function parameters, an additional package with debugging symbols needs to be installed, if one is available for the server package that you are using. Alternatively, the executable has to be compiled with appropriate flags. It might be best to compile a debug version of the server. With luck, some debug assertion could fail earlier and give better hints.

Comment by Marko Mäkelä [ 2021-06-18 ]

If we ignore the fact that the source code line numbers are missing in gdb_output.txt, the crash looks similar to MDEV-25594.

#6  0x00005564786bb577 in DeadlockChecker::check_and_resolve(ib_lock_t const*, trx_t*) ()
#7  0x00005564786bb6af in lock_rec_enqueue_waiting(ib_lock_t*, unsigned int, buf_block_t const*, unsigned long, dict_index_t*, que_thr_t*, lock_prdt*) ()
#8  0x00005564786bc2cf in lock_rec_lock(bool, unsigned int, buf_block_t const*, unsigned long, dict_index_t*, que_thr_t*) ()
#9  0x0000556478760520 in Row_sel_get_clust_rec_for_mysql::operator()(row_prebuilt_t*, dict_index_t*, unsigned char const*, que_thr_t*, unsigned char const**, unsigned short**, mem_block_info_t**, dtuple_t**, mtr_t*) ()

And in fact, we do have the line number in mariadb_log.txt:

Jun 16 13:02:03 server_name mariadbd: 2021-06-16 13:02:03 0x7f4ea854d700  InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.5.8/storage/innobase/lock/lock0lock.cc line 6617

It indeed is the check_trx_state() call that triggered the crash in MDEV-25594. But, the originally reported crash was in trx0trx.cc. Those crashes might be correlated.

Comment by Nikola Blazincic [ 2021-06-28 ]

Sorry for late reply, vacation. Will check if running on debug version is an option for us on this system, or we will try to update to latest version and see if this will be resolved.

Comment by Nikola Blazincic [ 2021-07-29 ]

We did an update to 10.5.11 and got the same crash today.
Do you have any suggestions (besides running debug version on the production system)?

Comment by Marko Mäkelä [ 2022-09-07 ]

MDEV-29478 looks like a duplicate of this.

We have been unable to reproduce this in our internal testing. An SQL test case or a fully resolved stack trace of the crash would be very much appreciated.

Normal installation packages of the server do not include debugging information (which is not to be confused with a debug-instrumented build). The debugging information (some files added to /usr/lib/debug/.build-id/) is typically available in a package whose name includes the string -debuginfo or -dbgsym.

Comment by Marko Mäkelä [ 2022-09-07 ]

nblazincic, ingig, if a later version crashes, can you please post the server error log and the resolved stack traces related to the crash?

Comment by Marko Mäkelä [ 2022-09-07 ]

ingig, why was innodb_force_recovery=1 needed in the crash recovery?

Comment by Roel Van de Paar [ 2023-03-10 ]

In MDEV-29142 we saw UT_LIST_GET_LEN(trx_list) == 0. It was fixed in 10.6.11. The stacks in last two comments are 10.6.12. Likely not related, but just mentioning in case it is (i.e. fix caused new issue or similar).

Comment by Niklas Wessén [ 2023-07-28 ]

We are having recurring crashes with MariaDB server 10.6.14 running on CentOS 7.9, getting the same error in mysql-error.log:

2023-07-21 17:33:04 0x7f988c6c0700 InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.6.14/storage/innobase/trx/trx0trx.cc line 1252
InnoDB: Failing assertion: UT_LIST_GET_LEN(lock.trx_locks) == 0

It restarts automatically after the crash. Out of 10 databases, all with the same setup, two behave this way, the other do not.

I can provide error logs, core dumps and full stack traces.

Comment by Ivan [ 2023-08-01 ]

We are having the same issue also with MariaDB server 10.6.14. The error happens several times per hour. We can also provide logs if needed. We switched from 10.4 because it was showing the same behavior hoping 10.6 would fix it but we had no luck.

Comment by Niklas Wessén [ 2023-08-07 ]

@marko , I see from the list of jiras assigned to you that you have a lot on your plate. But if I can provide fully resolved stack traces of core dumps, there might at least be a chance that the problem might be easy to pin down? (if I've read earlier comments correctly, you never did receive fully resolved stack traces from the previous reporters).

Comment by Marko Mäkelä [ 2023-08-07 ]

For the InnoDB: Failing assertion: UT_LIST_GET_LEN(lock.trx_locks) == 0 there is another open bug MDEV-31780, which I think may have been fixed by MDEV-31185.

nikwes, would the 10.6 development snapshot https://ci.mariadb.org/37167/ happen to fix this assertion failure for you?

Comment by Niklas Wessén [ 2023-08-07 ]

I hope so. But, I can not install that snapshot in production. And since I do not know what triggers the crash, other than that it happens intermittently on two of our production db servers, I, unfortunately, have no way of testing it. I'll wait for 10.6.15, and I will leave a new comment letting you know whether it solved the problem.

Comment by Ivan [ 2023-08-09 ]

Hi @marko
I updated to the latest release version MariaDB 10.6.15-MariaDB-log source revision 691e964d2357b579f64e08398f4b66f7d2c81566 last night and I can confirm the error still happens. It just happened a few hours ago in production again:

2023-08-09 10:39:12 0x7fcfa13e0640 InnoDB: Assertion failure in file /home/buildbot/amd64-rhel-9-rpm-autobake/build/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/storage/innobase/trx/trx0trx.cc line 1252
InnoDB: Failing assertion: UT_LIST_GET_LEN(lock.trx_locks) == 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mariadbd startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
230809 10:39:12 [ERROR] mysqld got signal 6 ;

Let me know if the full log would be useful to you or if there anything else I can do to help debug the issue.

Comment by Niklas Wessén [ 2023-08-14 ]

I upgraded one of the two crashing to MariaDB server 10.11.4, but no avail. As I suspected, the bug is present in that version too.

Comment by Niklas Wessén [ 2023-08-16 ]

Still getting the error with MariaDB 10.6.15. It ran without crashing for 33h after the upgrade, so it looked promising, but at last it once again crashed with the same Failing assertion error. Let me know if you want me to provide resolved bug traces, core dumps etc.

Comment by Niklas Wessén [ 2023-08-16 ]

I can now confirm that the error persists in MariaDB 10.11.5 as well.

Comment by Marko Mäkelä [ 2023-08-17 ]

Rey and nikwes, thank you for testing that this bug still occurs in both 10.6.15 and 10.11.5 after MDEV-31185 was fixed. We will have to spend more effort to try to reproduce this internally.

Comment by Marko Mäkelä [ 2023-08-17 ]

In MDEV-31941 there is an example where a SELECT that is executed ‘inside’ LOCK TABLES can cause this assertion failure.

Comment by Niklas Wessén [ 2023-08-21 ]

I have checked our code. We do not use LOCK TABLE(S) anywhere.

Comment by Marko Mäkelä [ 2023-09-04 ]

A support customer of ours got a crash on MariaDB 10.5.22 (after the fix of MDEV-31185) that could look like the following, based on some analysis of the core dump:

  1. The transaction object 0x7fdda16de490 was used for some transaction that had inserted some records.
  2. Another transaction started to wait for locks that were held by that earlier transaction, and invoked an "implicit-to-explicit" lock conversion.
  3. The first transaction was committed.
  4. The transaction object 0x7fdda16de490 was reused for a read-only, auto-commit, non-locking transaction that accesses a different table.
  5. The "implicit-to-explicit" lock conversion wrongly "donated" the old transaction's locks to the new one.
  6. The assertion fails when a commit is executed on the non-locking auto-commit read-only transaction object 0x7fdda16de490.
Comment by Niklas Wessén [ 2023-09-04 ]

I guess there can be different culprits triggering the same assertion failure. We run a master-slave setup, but we only get the assertion failure on the master node. To test if the master's data was corrupted, I removed all data from the master and replaced it with the slave's data, whereafter the frequency of the crashes went from 25 times per week to once a week. So the crashes did not cease entirely though.

Comment by Marko Mäkelä [ 2023-09-06 ]

nikwes, I believe that this bug heavily depends on timing. Any subtle changes of data layout could affect it. Examples: how fragmented are the indexes or undo logs, how much history of committed transactions remains to be purged.

To reproduce this, maybe vlad.lesin should carefully review the logic of trx_pools, to try to ensure that the reuse of memory objects for InnoDB transactions will be more likely.

Comment by Sergei Petrunia [ 2023-09-19 ]

Looking at the patch...

This part:

+        /* It's supposed that thd error state will be set(with
+        file->print_error(res, MYF(0)) call) inside of quick->get_next() call,
+        that's why there is no need to do this here. See
+        Mrr_simple_index_reader::get_next() code for example. */

This is true... But then the code in rr_quick is [sometimes] redundant ...
Debugging a query:

select * from t1 where a between 10 and 20 for update;

  Thread 32 "mysqld" hit Breakpoint 5, my_message_sql (error=1205, str=0x7fffcaffbf70 "Lock wait timeout exceeded; try restarting transaction", MyFlags=0)
(gdb) wher
  #0  my_message_sql (error=1205, str=0x7fffcaffbf70 "Lock wait timeout exceeded; try restarting transaction", MyFlags=0) at /home/psergey/dev-git2/10.4-l
  #1  0x00005555569af452 in my_error (nr=1205, MyFlags=0) at /home/psergey/dev-git2/10.4-look10/mysys/my_error.c:125
  #2  0x00005555560e9156 in handler::print_error (this=0x7fff70037a48, error=146, errflag=0) at /home/psergey/dev-git2/10.4-look10/sql/handler.cc:4093
  #3  0x0000555555f828e4 in Mrr_simple_index_reader::get_next (this=0x7fff70038030, range_info=0x7fffcaffd1a8) at /home/psergey/dev-git2/10.4-look10/sql/m
  #4  0x0000555555f85734 in DsMrr_impl::dsmrr_next (this=0x7fff70037ee0, range_info=0x7fffcaffd1a8) at /home/psergey/dev-git2/10.4-look10/sql/multi_range_
  #5  0x00005555564db79a in ha_innobase::multi_range_read_next (this=0x7fff70037a48, range_info=0x7fffcaffd1a8) at /home/psergey/dev-git2/10.4-look10/stor
  #6  0x000055555627e322 in QUICK_RANGE_SELECT::get_next (this=0x7fff70082f90) at /home/psergey/dev-git2/10.4-look10/sql/opt_range.cc:12345
  #7  0x00005555562920e0 in rr_quick (info=0x7fff70019250) at /home/psergey/dev-git2/10.4-look10/sql/records.cc:369
  #8  0x0000555555ce5eb9 in READ_RECORD::read_record (this=0x7fff70019250) at /home/psergey/dev-git2/10.4-look10/sql/records.h:70
  #9  0x0000555555e1b878 in sub_select (join=0x7fff70017288, join_tab=0x7fff70019188, end_of_records=false) at /home/psergey/dev-git2/10.4-look10/sql/sql_
  #10 0x0000555555e1aace in do_select (join=0x7fff70017288, procedure=0x0) at /home/psergey/dev-git2/10.4-look10/sql/sql_select.cc:20425
  #11 0x0000555555deee1f in JOIN::exec_inner (this=0x7fff70017288) at /home/psergey/dev-git2/10.4-look10/sql/sql_select.cc:4605
  #12 0x0000555555dedf36 in JOIN::exec (this=0x7fff70017288) at /home/psergey/dev-git2/10.4-look10/sql/sql_select.cc:4387

Then, again:

  Thread 32 "mysqld" hit Breakpoint 5, my_message_sql (error=1205, str=0x7fffcaffc040 "Lock wait timeout exceeded; try restarting transaction", MyFlags=0)
(gdb) wher
  #0  my_message_sql (error=1205, str=0x7fffcaffc040 "Lock wait timeout exceeded; try restarting transaction", MyFlags=0) at /home/psergey/dev-git2/10.4-l
  #1  0x00005555569af452 in my_error (nr=1205, MyFlags=0) at /home/psergey/dev-git2/10.4-look10/mysys/my_error.c:125
  #2  0x00005555560e9156 in handler::print_error (this=0x7fff70037a48, error=146, errflag=0) at /home/psergey/dev-git2/10.4-look10/sql/handler.cc:4093
  #3  0x000055555629209d in rr_handle_error (info=0x7fff70019250, error=146) at /home/psergey/dev-git2/10.4-look10/sql/records.cc:356
  #4  0x00005555562920ff in rr_quick (info=0x7fff70019250) at /home/psergey/dev-git2/10.4-look10/sql/records.cc:371
  #5  0x0000555555ce5eb9 in READ_RECORD::read_record (this=0x7fff70019250) at /home/psergey/dev-git2/10.4-look10/sql/records.h:70
  #6  0x0000555555e1b878 in sub_select (join=0x7fff70017288, join_tab=0x7fff70019188, end_of_records=false) at /home/psergey/dev-git2/10.4-look10/sql/sql_
  #7  0x0000555555e1aace in do_select (join=0x7fff70017288, procedure=0x0) at /home/psergey/dev-git2/10.4-look10/sql/sql_select.cc:20425
  #8  0x0000555555deee1f in JOIN::exec_inner (this=0x7fff70017288) at /home/psergey/dev-git2/10.4-look10/sql/sql_select.cc:4605
  #9  0x0000555555dedf36 in JOIN::exec (this=0x7fff70017288) at /home/psergey/dev-git2/10.4-look10/sql/sql_select.cc:4387

So, the error is reported twice... but this is outside of the scope of this patch...

Comment by Sergei Petrunia [ 2023-09-19 ]

+        /* It's supposed that thd error state will be set(with
+        file->print_error(res, MYF(0)) call) inside of quick->get_next() call,
+        that's why there is no need to do this here. See
+        Mrr_simple_index_reader::get_next() code for example. */

It's not "will be set" it is "has been set" at this point. get_next() call is dozen lines above. Please change the wording.

Please also add

  DBUG_ASSERT(thd->is_error());

to make sure the error state has been set.

Also this gives a clue how to add test coverage for the changed code in sql_join_cache.cc :

https://gist.github.com/spetrunia/d2e78780184e09842f4299a87531d848

Comment by Sergei Petrunia [ 2023-09-19 ]

Also please apply this coding style fix: https://gist.github.com/spetrunia/7edad4c40927c1d804aa13ec96aecd60

Comment by Sergei Petrunia [ 2023-09-25 ]

More review input provided on Slack. Ok to push when addressed.

Comment by Sergei Petrunia [ 2023-11-22 ]

Summary: a query that's doing a locking read and using Rowid Filter could cause a server crash if it has encountered a Lock Wait Timeout or Deadlock opr a similar error when building the Rowid Filter.

Generated at Thu Feb 08 09:35:37 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.