[MDEV-27411] Index Corruption in mariadb 10.5.11 slave Created: 2022-01-03  Updated: 2022-03-10  Resolved: 2022-03-10

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.5.11
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Mamo Assignee: Marko Mäkelä
Resolution: Incomplete Votes: 1
Labels: crash, replication
Environment:

Ubuntu 20.04.3 LTS


Attachments: File mariadb-1.err     File mariadb.err     File mariadb.err-1.2     File mariadb.err.1     File mariadb.err.2     File mysqld.cnf     File slave.cnf    
Issue Links:
Relates
relates to MDEV-26893 innodb assertion on startup - rem0rec... Closed
relates to MDEV-27734 Set innodb_change_buffering=none by d... Closed

 Description   

We have a simple MariaDB replication setup with one master and a number of slaves. Without any prior alert or warning, some indexes on our slaves (some of which are not present on the master) randomly become corrupted. This started after a migration, and we receive a batch of error-log entries each time it happens.
It has happened several times; we upgraded one of our slaves to MariaDB 10.6, but it happened again.
MariaDB hangs when it wants to insert (or update) a row because, I think, it cannot update the index B-tree, and the main error is a SEMAPHORE timeout.
I have attached our error logs.
Sometimes the log says we hit a bug; sometimes it says an index is corrupted.
We have ample resources (more than 180 GB of RAM, SSD storage, and 56 CPU cores).
The command that we used for creating the indexes is:
ALTER TABLE foo ADD INDEX bar_idx(bar), ALGORITHM=INPLACE, LOCK=NONE;



 Comments   
Comment by Daniel Black [ 2022-01-03 ]

Please recheck the errorLog.tar you uploaded; it is 10 KB of NULL bytes. Can you also include the output of SHOW CREATE TABLE foo?

Comment by Mamo [ 2022-01-04 ]

It happened to a bunch of tables.
one of them is :
Create Table: CREATE TABLE `vehicles` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`capacity` int(11) DEFAULT NULL,
`color` varchar(20) CHARACTER SET utf8 DEFAULT NULL,
`congestion_zone_license` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`engine_type` varchar(50) CHARACTER SET utf8 DEFAULT NULL,
`inspection_date` datetime DEFAULT NULL,
`inspection_valid_till` datetime DEFAULT NULL,
`insurance_number` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
`insurance_valid_till` datetime DEFAULT NULL,
`is_company_owned` bit(1) DEFAULT NULL,
`model` varchar(100) CHARACTER SET utf8 DEFAULT NULL,
`odd_even` varchar(20) CHARACTER SET utf8 DEFAULT NULL,
`plate_number` varchar(20) CHARACTER SET utf8 DEFAULT NULL,
`production_year` int(11) DEFAULT NULL,
`serial_number` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`status` varchar(20) CHARACTER SET utf8 DEFAULT NULL,
`total_condition_point` double DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
`vehicle_id_card_url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`biker_id` int(11) DEFAULT NULL,
`biker_request_id` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
`snapp_vehicle_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `biker_idx` (`biker_id`),
KEY `biker_req_idx` (`biker_request_id`),
KEY `idx_vehicles_snapp_vehicle_id` (`snapp_vehicle_id`),
KEY `idx_vehicles_status` (`status`)
) ENGINE=InnoDB AUTO_INCREMENT=607425 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Comment by Heiki Laaniste [ 2022-01-06 ]

The end result, a "bug" or an "index corrupted" error, seems very familiar:
MDEV-27282
MDEV-26977
I have had these issues since MariaDB 10.5.8, starting in August 2021.

Comment by Mamo [ 2022-01-06 ]

We experienced the same issue on a slave running MariaDB 10.6.

Comment by Mamo [ 2022-01-12 ]

I have now hit the same problem on a slave running MariaDB 10.6. I have attached the error log as mariadb.err-1.2.

Comment by Heiki Laaniste [ 2022-01-20 ]

We've started to migrate data from MariaDB 10.5.13 to 10.4.22 RDS instances. We did not experience such reboot loops or index corruption on 10.4.

Comment by Mamo [ 2022-01-23 ]

mariadb.err: we faced the same issue, with a different ERROR message, on another slave with abundant resources.
We are thinking about switching to another DBMS!

Comment by Marko Mäkelä [ 2022-01-28 ]

mamoghandi, the file mariadb.err shows a hang somewhere in InnoDB. To diagnose hangs, I would need a stack trace of all threads during the hang. You can attach a debugger to the database server process before it crashes, and then get the stack traces. Something like this:

sudo gdb -p $(pgrep -x mariadbd)
handle SIGUSR1 nostop noprint
handle SIGUSR2 nostop noprint
handle SIGWAITING nostop noprint
handle SIGLWP nostop noprint
handle SIGPIPE nostop
handle SIGALRM nostop
handle SIGHUP nostop
handle SIGTERM nostop noprint
handle SIG32 nostop noprint
handle SIG33 nostop noprint
continue

Once the server crashes, you should get SIGABRT trapped in the debugger. At that point, you can type the following GDB commands to dump the thread stack traces:

set logging file gdb.txt
set logging on
set height 0
thread apply all backtrace

I would also suggest that you disable the change buffer, if this is related to MDEV-26977.

SET GLOBAL innodb_change_buffering=none;

heikilaaniste, can you please write your comment about 10.4 in MDEV-26977? The change buffering is different between 10.4 and 10.5 due to MDEV-19514. It is possible that a bug that affects 10.5 and later has been introduced. Without any way to repeat that bug, I am unable to do anything about it.
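To make the suggested setting survive server restarts, the equivalent can go in the server configuration file (a minimal sketch; the exact file and section depend on your setup, shown here in the style of the attached mysqld.cnf):

```ini
# Persist Marko's suggestion across restarts;
# SET GLOBAL only lasts until the next shutdown.
[mariadb]
innodb_change_buffering = none
```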

Comment by Marko Mäkelä [ 2022-01-28 ]

mamoghandi, in mariadb.err-1.2 the error should have nothing to do with the change buffer. You seem to have a corrupted index page that would be accessed for applying some change from the binlog.

2022-01-10 12:37:19 0x7f265d694700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.6.3/storage/innobase/rem/rem0rec.cc line 877

Comment by Mamo [ 2022-01-29 ]

Marko, we know that the corruption happens in index pages (and it happens only on slaves). First of all, we often do not know which index is corrupted (sometimes the log names it, but not always), so we have to drop all of the table's indexes. The question is: why does this corruption occur randomly on our slaves, and how can we avoid it?

Comment by Marko Mäkelä [ 2022-01-29 ]

mamoghandi, I cannot provide support advice, not only because I am only a developer, but because my employer MariaDB Corporation has a paid support offering.

But I can give some hints. How are you provisioning the slaves? If you are using physical backup (as opposed to initializing the server from SQL dump and binlog), then perhaps something is wrong with that physical backup process, or you are copying physical corruption from the source server. If you initialize a server from SQL and binlog only, then you should have a ‘clean’ starting point.

InnoDB only provided rather weak page checksums until innodb_checksum_algorithm=full_crc32 was introduced in MariaDB 10.4 and made default in MariaDB 10.5. Only files created with that setting (or strict_full_crc32) have strong checksums. By default, for older data files, any variant of the old checksum algorithms (innodb, none) will be accepted. It is possible (albeit unlikely) that an invalid page in an old-format data file would be accepted as valid during mariadb-backup --backup and no attempt to re-read is made. (The server could have been writing the page while it was being read by backup. Your 10.6 crash could have been caused by that.)
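The point about weak versus strong checksums can be illustrated with a minimal sketch. This is not InnoDB's actual algorithm: full_crc32 uses CRC-32C over the page contents, whereas this sketch uses zlib's ordinary CRC-32 as a stand-in, and the page layout here is invented for illustration only.

```python
import zlib

PAGE_SIZE = 16384  # default InnoDB page size

def write_checksum(page: bytearray) -> None:
    # Store a CRC over everything except the last 4 bytes
    # in the page's final 4 bytes. full_crc32 works similarly,
    # but with CRC-32C; zlib's CRC-32 is a stand-in here.
    crc = zlib.crc32(bytes(page[:-4]))
    page[-4:] = crc.to_bytes(4, "big")

def verify_checksum(page: bytes) -> bool:
    # A whole-page checksum: any flipped bit fails verification.
    return zlib.crc32(bytes(page[:-4])) == int.from_bytes(page[-4:], "big")

page = bytearray(PAGE_SIZE)
page[0:4] = b"data"
write_checksum(page)
assert verify_checksum(page)

# A single corrupted byte (e.g. a torn write copied verbatim
# by a physical backup) is now detected:
page[100] ^= 0xFF
assert not verify_checksum(page)
```

A weak or permissive scheme (such as accepting the old "none" magic value) would accept the corrupted page above unchanged, which is why corruption in old-format files can silently propagate through mariadb-backup.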

Comment by Mamo [ 2022-02-02 ]

This issue has already been reported here:
https://jira.mariadb.org/browse/MDEV-24378?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&showAll=true
It is definitely not fixed.
They report a semaphore wait exceeding 600 seconds, after which the server crashes. The root cause of the semaphore timeout is page corruption, as I experienced.

Comment by Marko Mäkelä [ 2022-02-02 ]

mamoghandi, yes, a hang could be caused by corruption, but there is not enough data in MDEV-24378 to conclude it for that particular case. There could be other reasons for hangs, such as a bug in MariaDB server, or in the operating system kernel, or some hardware issue that would cause memory corruption.

The "long semaphore wait" output has proven to be insufficient for analyzing any hangs. The output of thread apply all backtrace from GDB is much more helpful.

I can provide an example of a hang due to corruption. Forward index scans in InnoDB work by acquiring a latch on the next page, and then releasing the latch on the current page. Let us assume that we have a corruption (a cycle) in the linked list of pages: 10→11→12→13→12. If we have two threads executing an index scan on such a corrupted index, we could have one thread holding a latch on page 12 and waiting for a latch on page 13, and the other thread doing exactly the opposite. That would be a deadlock between the threads.
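The corrupted page chain described above (10→11→12→13→12) can be illustrated with a small sketch. This is not InnoDB code; the page numbers are hypothetical and the chain is just a dictionary, but it shows how such a cycle could be detected with Floyd's tortoise-and-hare algorithm rather than followed forever:

```python
def has_cycle(next_page: dict[int, int], start: int) -> bool:
    """Detect a cycle in a singly linked page chain using
    Floyd's tortoise-and-hare algorithm. Pages here are just
    hypothetical page numbers, not real InnoDB structures."""
    slow = fast = start
    while fast in next_page and next_page[fast] in next_page:
        slow = next_page[slow]             # advance one page
        fast = next_page[next_page[fast]]  # advance two pages
        if slow == fast:                   # met inside a cycle
            return True
    return False                           # reached end of chain

# Healthy chain: 10 -> 11 -> 12 -> 13 (end of index)
healthy = {10: 11, 11: 12, 12: 13}
# Corrupted chain from the example: 10 -> 11 -> 12 -> 13 -> 12
corrupt = {10: 11, 11: 12, 12: 13, 13: 12}

print(has_cycle(healthy, 10))  # False
print(has_cycle(corrupt, 10))  # True
```

An index scan that does not detect the cycle simply keeps re-latching pages 12 and 13, which is how two concurrent scans can end up waiting on each other's latches, as described above.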

Comment by Marko Mäkelä [ 2022-02-08 ]

mamoghandi, if you initialize a server with innodb_change_buffering=none from a logical dump, can you repeat any problems?

Generated at Thu Feb 08 09:52:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.