[MDEV-21100] [ERROR] [FATAL] InnoDB: Unknown error code 19: Required history data has been deleted Created: 2019-11-20  Updated: 2023-11-30  Resolved: 2023-04-11

Status: Closed
Project: MariaDB Server
Component/s: Galera, Storage Engine - InnoDB
Affects Version/s: 10.3.10
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Sebastian Halfar Assignee: Jan Lindström (Inactive)
Resolution: Won't Fix Votes: 1
Labels: None
Environment:

Ubuntu 18.04 with 3 nodes Galera-Cluster
Ubuntu 18.04 Proxy maxscale 2.2.21


Attachments: HTML File galera3     Text File galera_2-crash.txt     Text File mariadb-10.3.18-crash.txt     Text File mariadb-10.3.21-crash.txt     File mariadb-10.3.21.cnf     Text File mariadb-10.3.22-crash.txt     Text File mariadb-10.3.23-crash-2.txt     Text File mariadb-10.3.23-crash.txt     Text File mariadb-10.4.25-stack-trace.txt     Text File mariadb_10.3.23_crash_2020-07-20_01.txt     Text File mariadb_10.3.23_crash_2020-07-20_02.txt     Text File mariadb_10.3.23_crash_2020-07-31.txt     Text File mariadb_10.3.27_crash_2020-02-01.txt     Text File mariadb_10.3.27_crash_2020-12-10.txt     File maxscale2.2.21    
Issue Links:
Relates
relates to MDEV-24027 [ERROR] [FATAL] InnoDB: Unknown error... Closed
relates to MDEV-29204 InnoDB: Unknown error Required histor... Closed

 Description   

The cluster survived the first crash, but we would like to know what the stack trace tells you. I hope I have included everything you need for a hint.

We could upgrade, but first I would like to know whether this is a version issue or a code failure.

Thanks in advance.



 Comments   
Comment by Elena Stepanova [ 2019-11-25 ]

From the error log:

2019-11-Day 16:31:16 482717114 [ERROR] [FATAL] InnoDB: Unknown error code 19: Required history data has been deleted
1911Day 16:31:16 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.3.10-MariaDB-1:10.3.10+maria~xenial-log
key_buffer_size=134217728
read_buffer_size=2097152
max_used_connections=1665
max_threads=5002
thread_count=560
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 30975179 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7fab14a6ac88
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7faad862fdd8 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55bc7810f94e]
/usr/sbin/mysqld(handle_fatal_signal+0x347)[0x55bc77bbc4a7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fb15f019890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fb15e52de97]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fb15e52f801]
/usr/sbin/mysqld(+0xaebfa8)[0x55bc77f57fa8]
/usr/sbin/mysqld(+0xa5eeac)[0x55bc77ecaeac]
/usr/sbin/mysqld(+0xa7d2fc)[0x55bc77ee92fc]
/usr/sbin/mysqld(+0x9a31da)[0x55bc77e0f1da]
/usr/sbin/mysqld(_ZN7handler17ha_index_read_mapEPhPKhm16ha_rkey_function+0x138)[0x55bc77bcDay68]
/usr/sbin/mysqld(+0x5b79d8)[0x55bc77a239d8]
/usr/sbin/mysqld(_Z10sub_selectP4JOINP13st_join_tableb+0x164)[0x55bc77a15e14]
/usr/sbin/mysqld(_ZN4JOIN10exec_innerEv+0xb15)[0x55bc77a35a45]
/usr/sbin/mysqld(_ZN4JOIN4execEv+0x33)[0x55bc77a35c93]
/usr/sbin/mysqld(_Z12mysql_selectP3THDP10TABLE_LISTjR4ListI4ItemEPS4_jP8st_orderS9_S7_S9_yP13select_resultPDayst_select_lex_unitP13st_select_lex+0xef)[0x55bc77a35ddf]
/usr/sbin/mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0xf8)[0x55bc77a366d8]
/usr/sbin/mysqld(+0x48b95c)[0x55bc778f795c]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x6d2d)[0x55bc779e383d]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x21a)[0x55bc779e582a]
/usr/sbin/mysqld(+0x57a0f6)[0x55bc779e60f6]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0xe9a)[0x55bc779e744a]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x13c)[0x55bc779e8cfc]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x252)[0x55bc77aba022]
/usr/sbin/mysqld(handle_one_connection+0x3d)[0x55bc77aba19d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fb15f00e6db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fb15e61088f]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7fab14370b20): SELECT * FROM `example_me_rd` WHERE `user_id` = '123456' AND `message_id` = '123456'
Connection ID (thread ID): 482717114
Status: NOT_KILLED
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains

Comment by Marko Mäkelä [ 2019-11-26 ]

As far as I can tell, the error code DB_MISSING_HISTORY should be impossible. The condition purge_sys.view.changes_visible() should not hold for record versions that are visible in any active read view.

I do not remember seeing this error with anything else than Galera. The previous occurrence could have been about 2 years ago. Also back then, we did not get anything reproducible.

I do not know exactly how the Galera library interacts with InnoDB, so I cannot exclude the possibility that Galera is breaking some assumptions around the purge view. With Galera, data nodes can be cloned to each other. Could a snapshot state transfer have gone wrong in the past? How was this node initialized?

If there has been an in-place upgrade of existing data files to 10.3.10, were the files originally created with something older than 10.3? If yes, was a slow shutdown (innodb_fast_shutdown=0) performed prior to the upgrade?
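For reference, the slow shutdown asked about here is typically performed along these lines before an in-place upgrade (a sketch; the exact service-stop command depends on the platform):

```sql
-- Make InnoDB complete all purge and flushing work on shutdown,
-- so the data files carry no pending undo history into the upgrade.
SET GLOBAL innodb_fast_shutdown = 0;
-- Then stop the server, e.g. with SHUTDOWN or `systemctl stop mariadb`.
SHUTDOWN;
```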

Comment by Sebastian Halfar [ 2019-11-27 ]

As far as I know, the files were not created on 10.3; it should have been an older version. I don't think any snapshot state transfer has gone wrong, and we initialized the node with a mysql start, which joined the node to the cluster. I don't think we did a slow shutdown in the past, but I will ask my colleague about the slow shutdown and the MariaDB version he used when he started the DB.

Thanks for helping.

The cluster has been running since 01.10.2018 without issues.


Comment by Jan Lindström (Inactive) [ 2019-12-09 ]

Can you still repeat this? If yes can you provide some instructions how to repeat. Also please provide cluster configuration and full unedited error logs.

Comment by Sebastian Halfar [ 2019-12-10 ]

Hello Jan,

so far it hasn't happened again, and it never happened before. The logs are nearly unedited; I just removed the IP and timestamps.

If there has been an in-place upgrade of existing data files to 10.3.10, were the files originally created with something older than 10.3?

> yes

If yes, was a slow shutdown (innodb_fast_shutdown=0) performed prior to the upgrade?

> no

Comment by Ovidiu Stanila [ 2019-12-27 ]

Hello Jan,

We've recently hit this issue on MariaDB 10.3.21 (see the attached crash error log, mariadb-10.3.21-crash.txt). There wasn't much traffic on the server when we encountered the problem, and it quickly recovered without any issue after the restart.
We also hit this a couple of months ago on MariaDB 10.3.18 (mariadb-10.3.18-crash.txt). Same scenario: low traffic, nothing out of the ordinary happening on the server at that time. We thought this to be an oddity, as it never happened before and it didn't happen again, until a few days ago.

We usually keep MariaDB server up to date, upgrading to the latest version in a week or two after a new version is released. None of the upgrades were done with innodb_fast_shutdown=0 .

We're running on an up-to-date system with CentOS 7.7.1908 , kernel: 3.10.0-1062.9.1.el7.x86_64

Just let us know if you need any additional info regarding the latest crash. We don't have much to go on regarding the previous crash (10.3.18) besides the attached log file.
Besides the usual package updates, no change was made to the configuration: mariadb-10.3.21.cnf

Comment by Sebastian Halfar [ 2020-01-16 ]

Hello Jan,

this issue hit us again. We are still on MariaDB 10.3.10 on Ubuntu 18.04 (galera_2-crash.txt). There wasn't much traffic on the server when we encountered the problem, and it quickly recovered without any issue after the restart. Galera worked well, but we would like to know why this is happening and where it could come from.

Kind regards

Comment by Marko Mäkelä [ 2020-01-16 ]

Seb0, I hope that you are not executing any ALTER TABLE…ADD COLUMN without appending , FORCE, because MDEV-19916 was only fixed in MariaDB 10.3.19. I do not think that it could be related to the issue at hand, though.

I had one more thought: due to MDEV-16664 we reverted the default value to innodb_lock_schedule_algorithm=fcfs in 10.2.17, 10.3.9 and 10.4.0, because of bugs introduced by MDEV-11039. Since you are using 10.3.10, you should already be avoiding that bug.
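To make the FORCE workaround concrete, a rebuild can be requested like this (a sketch; the table and column names are illustrative, not from this ticket):

```sql
-- Appending ", FORCE" makes ALTER TABLE rebuild the table instead of
-- performing an instant ADD COLUMN (the operation affected by MDEV-19916).
ALTER TABLE t1 ADD COLUMN extra_note VARCHAR(64), FORCE;
```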

Comment by Sebastian Halfar [ 2020-01-17 ]

Marko Mäkelä, do you mean whether we ever used "ALTER TABLE…ADD COLUMN without , FORCE", or only at the moment of the crash? I will ask the backend team and come back to you.

Comment by Ovidiu Stanila [ 2020-05-19 ]

Same issue, same system, same configuration, same behaviour, on MariaDB 10.3.22
CentOS 7.8.2003, kernel 3.10.0-1062.9.1.el7.x86_64
Attached crash report. mariadb-10.3.22-crash.txt

Any idea what could cause these random crashes?

Comment by Ovidiu Stanila [ 2020-05-29 ]

Hit this issue again, same behaviour. mariadb-10.3.23-crash.txt
The only pattern I can see is that we seem to encounter this only once after each upgrade, at random time intervals after the actual upgrade, and only with this table:

  • for 10.3.21 after four days: upgraded on Dec 21, hit problem on Dec 25
  • for 10.3.22 after three months: upgraded on Feb 04, hit problem on May 16
  • for 10.3.23 after ten days: upgraded on May 19, hit problem on May 29

We never encountered this problem more than once after an upgrade; might this be related to something left over from the upgrade?

Comment by Ovidiu Stanila [ 2020-06-16 ]

Encountered this issue again (mariadb-10.3.23-crash-2.txt), so that's twice on MariaDB 10.3.23.
Nothing out of the ordinary happened on the system around that time, as far as I could notice.
Any ideas?

Comment by Ovidiu Stanila [ 2020-07-20 ]

Hit this again, twice in a row today, 2 hours apart: mariadb_10.3.23_crash_2020-07-20_02.txt mariadb_10.3.23_crash_2020-07-20_01.txt
This doesn't seem to want to go away.

Comment by Ovidiu Stanila [ 2020-07-31 ]

One more occurrence of this issue: mariadb_10.3.23_crash_2020-07-31.txt
Any progress on this? Any chance to get rid of it in future releases? We're hitting this more and more often.

Comment by Ovidiu Stanila [ 2020-12-10 ]

We're still experiencing this issue from time to time, most recently after upgrading to the latest version in the 10.3 branch (10.3.27): mariadb_10.3.27_crash_2020-12-10.txt
Any progress on this? It's been almost a year since we first encountered it.
We plan to upgrade our instances to MariaDB 10.4, but it would have been nice to have this fixed on 10.3 until we move over.

Comment by Ovidiu Stanila [ 2021-02-01 ]

Hello @Marko, we're still hitting this randomly from time to time. No idea what could cause it; we didn't find a specific pattern, except that it's always referencing the same table (there are many concurrent queries, read and write, on that one all the time). Here's the latest crash report: mariadb_10.3.27_crash_2020-02-01.txt

We plan on moving to 10.4 and are waiting for the next release to come out, but I fear we won't escape this even then. Any ideas on what could trigger this?

Comment by Marko Mäkelä [ 2021-06-09 ]

ovidiu.stanila, sorry, I missed your updates, and accidentally found this ticket while searching for similar issues before filing MDEV-25886.

I think that we would need a somewhat repeatable test case in order to fix this. There have been locking bugs in Galera. It is difficult to tell whether MDEV-23328 and some follow-up fixes could also have fixed this. In 10.6, after MDEV-20612, MDEV-24738, MDEV-24915 and MDEV-24966, I have some confidence that the Galera conflict resolution is correct. I am now reassigning this to a Galera developer, because I do not think that this is an InnoDB issue. Nothing like this has ever occurred during our internal InnoDB stress testing (except MDEV-25886, which is related to innodb_read_only=ON).

In MariaDB Server 10.4 and later, galera-4 is used instead of galera-3. Many things were cleaned up as part of that change, so it could make sense to try a newer major version.

Comment by Marko Mäkelä [ 2021-06-09 ]

Seb0, in my note about MDEV-19916 I meant that executing instant ADD COLUMN at any time in the past, and then using anything older than MariaDB 10.3.17 or 10.4.7, could corrupt the table.

Comment by Ovidiu Stanila [ 2021-06-10 ]

Hello @Marko, for us I don't think this is related to Galera; we don't use it on the server where we noticed the problem. At the time, though, I didn't find any other reference to this kind of error except in this ticket, and thought the two might be related somehow.
Reading through MDEV-25886, that fits much better with what we were experiencing, as we saw this problem only under highly concurrent write/read operations on a table that we use as a sort of queue, adding/updating/removing rows at a high rate to process app jobs.
We actually managed to work around the problem by removing some unnecessary SELECTs (queue checks) that were triggered very often, reducing the probability of running into the problem. Since then we haven't had a single crash.

Thanks for checking. We'll upgrade to 10.4.20 when it's available and make a note if we happen to hit this problem again, but so far that workflow workaround has managed to avoid it.

Comment by Ovidiu Stanila [ 2022-07-28 ]

Hello @Jan, it's been over a year since we upgraded to MariaDB 10.4.20+, and we're now on MariaDB 10.4.25, but the issue is still there. We randomly hit it every couple of months or so, with no clear pattern as to why it happens.
Today is the first time we've had this problem on another instance (db04) than the one we reported in this ticket until now (db10), which hit the issue a day before.
I've attached the stack trace for both encounters: mariadb-10.4.25-stack-trace.txt
Any ideas what could cause this? We aim to move to MariaDB 10.5 later this year, but I doubt this will go away.

Comment by Marko Mäkelä [ 2022-07-28 ]

ovidiu.stanila, if you used the default wsrep_sst_method=rsync, I believe that it failed to block all concurrent writes unless you happened to run the donor with innodb_use_native_aio=0. This was fixed in MDEV-25975 (10.4.25). I vaguely remember that our tester ramesh mentioned encountering some corruption even after that fix was applied, but I do not know if a bug was filed.

A fundamental downside of physical replication (in this context, anything else than wsrep_sst_method=mysqldump) is that it will copy also any form of corruption. Especially if you are making snapshots from any node in the cluster, say, A→B→C→A, you may end up propagating corruption to the entire cluster, including the originally healthy node A in my example.
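For reference, the SST method is selected per node; a minimal sketch of switching to the logical method mentioned above (it can also be set under [galera] in the server configuration file):

```sql
-- A logical SST re-executes the data as SQL statements on the joiner, so it
-- does not copy physical page corruption from the donor; the trade-off is
-- that it is much slower than rsync or mariabackup for large datasets.
SET GLOBAL wsrep_sst_method = 'mysqldump';
```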

Comment by Ovidiu Stanila [ 2022-07-29 ]

@Marko, that's the thing: we don't use Galera on these servers. It's the "main" master (the other one is just a hot copy used in emergencies) in a classic master-master setup. But this was the only bug report that had the same error as the one we were experiencing. I think for us it's mostly something related to the InnoDB engine or its indexing.

Comment by Marko Mäkelä [ 2022-07-29 ]

I would suspect that something bad happened to the data files in the past. It could be something that broke crash recovery (such as an error made while copying data files from a running server, or someone removing ib_logfile* or using innodb_force_recovery=6 to ‘fix’ a recovery error), or an upgrade to 10.3 from very old data files before MDEV-27800 was fixed in 10.3.35.

I do not think that we see this error in our internal stress testing. By design, InnoDB should never purge any history of committed transactions that may still be visible to old read views. MDEV-29153 is another report where an InnoDB error message indicates that the transaction metadata was corrupted, but it seems to involve the ‘nuclear option’ innodb_force_recovery=6.

ovidiu.stanila, if your scenario does not involve Galera (and its state transfers), then it would be better to file a separate ticket about it, especially if you have a reproducible test case that is based on SQL statements and not starting a server on some previously written data files.

Comment by Ovidiu Stanila [ 2022-07-29 ]

The problem started showing up after upgrading from 10.2.25 to 10.3.16, which went smoothly; we didn't have any issues. All updates were done incrementally from one version to the next (10.2 -> 10.3 -> 10.4), and the same for minor versions (10.2.23 -> 10.2.24 -> 10.2.25). We never used innodb_force_recovery on this instance, as it has been fairly stable throughout its history, except for this issue, which doesn't affect us that much because the recovery is fast.
The other odd thing is that the table on which we hit this most often (the jobs table that shows up in the crash reports) was re-created a few times (DROP/CREATE, with the auto_increment reset), as it just holds some temporary data, and we also run OPTIMIZE on it almost daily (defragmentation). So it's strange that we're getting errors about accessing InnoDB "history" data on a table with very dynamic contents that doesn't keep records stored for long.
I've tried to replicate this by running the same queries for a longer period on a similar instance, but failed to reproduce the crash, so I'm stuck.
I'll log this in a separate ticket.

Comment by Marko Mäkelä [ 2023-04-11 ]

A possible explanation of this bug could be MDEV-30671. We only saw evidence of it when using innodb_undo_log_truncate=ON. I think that similar corruption is possible also with the default setting innodb_undo_log_truncate=OFF, but with different symptoms and much lower probability.
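For reference, undo-log truncation is controlled by a dynamic global variable; a minimal sketch of keeping it disabled (OFF is the default):

```sql
-- Undo tablespace truncation is the code path implicated in MDEV-30671;
-- leaving it at the default OFF avoids that path, at the cost of undo
-- tablespaces never shrinking on disk.
SET GLOBAL innodb_undo_log_truncate = OFF;
```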

Comment by Marko Mäkelä [ 2023-11-23 ]

MDEV-32115 is a potential data corruption bug in wsrep_sst_method=rsync that appears to affect the 10.4 release series only.

Comment by Marko Mäkelä [ 2023-11-30 ]

Here are some more possible causes of this:

  • An upgrade of data files that were originally created with MySQL 5.1.47 or MariaDB 5.1.47 or earlier: MDEV-27800
  • A bug in the purge of history: MDEV-30671, MDEV-31355
Generated at Thu Feb 08 09:04:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.