Just an update on my side. After the kernel update (which resolved the hanging issue), things seemed to get better for a while, and we got through two weeks of monitoring without any more of these issues. Now we've hit this issue again, four times in three days on two separate hosts. Interestingly, two of the incidents were copies, from approximately the same point in time, affecting the same table at about the same time. Neither host has any errors logged. The symptom showed up when mariabackup failed on the checksum check. Same 01 00 00 00 for the first four bytes of page 0.
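For anyone who wants to look for the same pattern on their own hosts, here is a minimal sketch of a scan, not the check mariabackup itself performs. It assumes file-per-table .ibd files under a data directory; the function names are made up for illustration. A match only means the first four bytes of the file equal 01 00 00 00. It does not validate the page checksum, and files read from a running server may not reflect what is in the buffer pool.

```python
import sys
from pathlib import Path

# Byte pattern reported in this thread for the first four bytes of page 0.
SUSPECT_PREFIX = bytes([0x01, 0x00, 0x00, 0x00])


def page0_prefix(ibd_path):
    """Return the first four bytes of the file, i.e. the start of page 0."""
    with open(ibd_path, "rb") as f:
        return f.read(4)


def find_suspect_files(datadir):
    """Return sorted .ibd paths under datadir whose page 0 starts with SUSPECT_PREFIX."""
    return sorted(
        str(p)
        for p in Path(datadir).rglob("*.ibd")
        if page0_prefix(p) == SUSPECT_PREFIX
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python scan_page0.py /var/lib/mysql
    for path in find_suspect_files(sys.argv[1]):
        print(path)
```

Anything this flags would still need confirming with a real checksum check before treating it as corrupt.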
I recently modified the workflow to restart mariadb with fast shutdown disabled to reduce mariabackup prepare time.
Prior to the service restart, we do a lot of bulk deletes. The victim table in this instance is one that receives a cascade delete.
xan.charbonnet, all of our hosts are AWS EC2; both x86_64 and arm64 have shown this issue.
Interesting observation: We have not seen this issue on our legacy CentOS 7 instances. Only on Debian 12 instances.
Interesting observation #2: One of the victim tables on one of the replicas is a backup table that has not had any reads or writes via query in over two years. So I am unsure why an otherwise unmodified table would also suffer page 0 corruption.