[MDEV-27411] Index Corruption in mariadb 10.5.11 slave Created: 2022-01-03 Updated: 2022-03-10 Resolved: 2022-03-10 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.5.11 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Mamo | Assignee: | Marko Mäkelä |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | crash, replication | ||
| Environment: | Ubuntu 20.04.3 LTS |
| Description |
|
We have a simple MariaDB replication setup with one master and several slaves. Without any prior alert or warning, some indexes on our slaves (some of which do not exist on the master) randomly become corrupted. This happened after migrating from …; each time, we get a large number of entries in the error log. |
| Comments |
| Comment by Daniel Black [ 2022-01-03 ] | |||||||||||||||||
|
Please recheck the uploaded errorLog.tar: it contains 10 KB of NULL bytes. Can you also include the output of SHOW CREATE TABLE foo? | |||||||||||||||||
| Comment by Mamo [ 2022-01-04 ] | |||||||||||||||||
|
It happened to several tables. | |||||||||||||||||
| Comment by Heiki Laaniste [ 2022-01-06 ] | |||||||||||||||||
|
The end result, an "index corrupted" error, looks very familiar: | |||||||||||||||||
| Comment by Mamo [ 2022-01-06 ] | |||||||||||||||||
|
We experienced the same issue on a slave running MariaDB 10.6. | |||||||||||||||||
| Comment by Mamo [ 2022-01-12 ] | |||||||||||||||||
|
I have now hit the same problem on a slave running MariaDB 10.6. I have attached the mariadb.err file: mariadb.err-1.2 | |||||||||||||||||
| Comment by Heiki Laaniste [ 2022-01-20 ] | |||||||||||||||||
|
We have started migrating data from MariaDB 10.5.13 to 10.4.22 RDS instances. We did not experience such reboot loops and index corruption in 10.4. | |||||||||||||||||
| Comment by Mamo [ 2022-01-23 ] | |||||||||||||||||
|
mariadb.err | |||||||||||||||||
| Comment by Marko Mäkelä [ 2022-01-28 ] | |||||||||||||||||
|
mamoghandi, the file mariadb.err
Once the server crashes, you should get SIGABRT trapped in the debugger. At that point, you can type the following GDB commands to dump the thread stack traces:
I would also suggest that you disable the change buffer, if this is related to
heikilaaniste, can you please write your comment about 10.4 in | |||||||||||||||||
| Comment by Marko Mäkelä [ 2022-01-28 ] | |||||||||||||||||
|
mamoghandi, in mariadb.err-1.2
| |||||||||||||||||
| Comment by Mamo [ 2022-01-29 ] | |||||||||||||||||
|
@Marko We know that the corruption happens in index pages (and only on slaves). First of all, we do not know which index is corrupted: the error log sometimes says which one, but not always, so we have to drop all indexes of the table. The question is: why does this corruption occur randomly on our slaves, and how can we avoid it? | |||||||||||||||||
| Comment by Marko Mäkelä [ 2022-01-29 ] | |||||||||||||||||
|
mamoghandi, I cannot provide support advice, not only because I am only a developer but also because my employer, MariaDB Corporation, has a paid support offering. But I can give some hints. How are you provisioning the slaves? If you are using a physical backup (as opposed to initializing the server from an SQL dump and the binlog), then perhaps something is wrong with that physical backup process, or you are copying physical corruption from the source server. If you initialize a server from SQL and the binlog only, you should have a ‘clean’ starting point. InnoDB provided only rather weak page checksums until innodb_checksum_algorithm=full_crc32 was introduced in MariaDB 10.4 and made the default in MariaDB 10.5. Only files created with that setting (or strict_full_crc32) have strong checksums. By default, for older data files, any variant of the old checksum algorithms (innodb, none) will be accepted. It is possible (albeit unlikely) that an invalid page in an old-format data file would be accepted as valid during mariadb-backup --backup, in which case no attempt is made to re-read it. (The server could have been writing the page while backup was reading it. Your 10.6 crash could have been caused by that.) | |||||||||||||||||
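To make the checksum point above concrete, here is a toy Python model. It is an illustrative sketch, not InnoDB code: it assumes a page is a payload followed by a 4-byte checksum trailer, and it uses zlib.crc32, zlib.adler32 and a fixed marker value as stand-ins for a strong checksum, a weak one and "none". The function names and NO_CHECKSUM_MARKER are hypothetical. It simulates a torn read, where backup copies the first half of a page before the server rewrites it and the second half afterwards.

```python
import zlib

# Toy model only: zlib.crc32, zlib.adler32 and a fixed marker stand in for
# "strong", "weak" and "none" checksums. They are NOT InnoDB's on-disk formats.
NO_CHECKSUM_MARKER = 0xDEADBEEF  # hypothetical "checksum disabled" value


def make_page(payload: bytes, algo: str) -> bytes:
    """Return payload plus a 4-byte big-endian checksum trailer."""
    if algo == "strong":
        value = zlib.crc32(payload)
    elif algo == "weak":
        value = zlib.adler32(payload)
    else:  # "none": no real checksum, just a fixed marker
        value = NO_CHECKSUM_MARKER
    return payload + value.to_bytes(4, "big")


def split_page(page: bytes):
    """Separate the page body from its stored checksum."""
    return page[:-4], int.from_bytes(page[-4:], "big")


def strict_ok(page: bytes) -> bool:
    """New-format rule (hypothetical): only the strong checksum is accepted."""
    body, stored = split_page(page)
    return zlib.crc32(body) == stored


def lenient_ok(page: bytes) -> bool:
    """Old-format rule (hypothetical): accept if ANY known variant matches."""
    body, stored = split_page(page)
    return stored in (zlib.crc32(body), zlib.adler32(body), NO_CHECKSUM_MARKER)


def torn_read(before: bytes, after: bytes, cut: int = 32) -> bytes:
    """Backup reads the first `cut` bytes before the server rewrites the
    page and the remaining bytes afterwards."""
    return before[:cut] + after[cut:]


old_payload = bytes(60)
new_payload = b"NEW!" + bytes(56)

# Old-format page written without a real checksum: the tear goes unnoticed.
torn_none = torn_read(make_page(old_payload, "none"), make_page(new_payload, "none"))
print(lenient_ok(torn_none))  # True

# Strong checksum: the mixed page no longer matches its trailer.
torn_strong = torn_read(make_page(old_payload, "strong"), make_page(new_payload, "strong"))
print(strict_ok(torn_strong))  # False
```

The design point is that every extra accepted variant widens the set of byte patterns that pass validation; with the "none" variant, a torn page passes no matter what its body contains.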
| Comment by Mamo [ 2022-02-02 ] | |||||||||||||||||
|
This issue has already been reported here: | |||||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-02 ] | |||||||||||||||||
|
mamoghandi, yes, a hang could be caused by corruption, but there is not enough data in … to determine that. The "long semaphore wait" output has proven insufficient for analyzing hangs. The output of thread apply all backtrace from GDB is much more helpful. Here is an example of how corruption can cause a hang. Forward index scans in InnoDB work by acquiring a latch on the next page and then releasing the latch on the current page. Let us assume that the linked list of pages is corrupted so that it contains a cycle: 10→11→12→13→12. If two threads execute an index scan on such a corrupted index, one thread could be holding a latch on page 12 while waiting for a latch on page 13, and the other thread doing exactly the opposite. That would be a deadlock between the threads. | |||||||||||||||||
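The page-chain example above can be modeled in a few lines of Python. This is a hypothetical sketch, not InnoDB code: the next_page dictionary stands in for the leaf-page links, find_cycle follows them the way a forward scan would, and latch_coupling_waits lists the (held, requested) latch pairs that "latch next, then release current" produces. The model ignores latch modes and simply assumes the two threads' latches conflict.

```python
# Hypothetical model of a forward index scan over linked leaf pages.
# Page numbers and the next_page map are illustrative, not InnoDB structures.


def find_cycle(next_page, start):
    """Follow next-page links from `start`; return the first page that is
    revisited (the entry point of a cycle), or None if the chain ends."""
    seen = set()
    page = start
    while page is not None:
        if page in seen:
            return page
        seen.add(page)
        page = next_page.get(page)
    return None


def latch_coupling_waits(next_page, start, steps):
    """Return the (held, requested) latch pairs a forward scan produces:
    it requests the next page while still holding the current one."""
    waits = []
    page = start
    for _ in range(steps):
        nxt = next_page.get(page)
        if nxt is None:
            break
        waits.append((page, nxt))  # holds `page`, requests `nxt`
        page = nxt
    return waits


# Healthy chain 10→11→12→13 versus the corrupted chain 10→11→12→13→12.
healthy = {10: 11, 11: 12, 12: 13}
corrupt = {10: 11, 11: 12, 12: 13, 13: 12}

print(find_cycle(healthy, 10))  # None
print(find_cycle(corrupt, 10))  # 12

# On the corrupted chain, one scan can hold 12 and request 13 while another
# holds 13 and requests 12: a cycle in the wait-for graph, i.e. a deadlock
# if the latches conflict.
waits = set(latch_coupling_waits(corrupt, 10, 6))
print((12, 13) in waits and (13, 12) in waits)  # True
```

On the healthy chain no opposite pair such as (13, 12) ever appears, so two scans always request latches in the same ascending order and cannot wait on each other in a cycle.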
| Comment by Marko Mäkelä [ 2022-02-08 ] | |||||||||||||||||
|
mamoghandi, if you initialize a server from a logical dump with innodb_change_buffering=none, can you still reproduce the problem? |