We're running a 5-node Galera Cluster based on MariaDB 10.1.12 (+maria-1~trusty). The only storage engine in use is, of course, InnoDB.
Two days ago it started crashing (one node at a time, in quick succession) with the error and stack trace you can find in the attachment. Every time we restored the cluster, we saw 3 out of 5 nodes crash within 5 to 10 minutes, the fourth stay up for 30 to 45 minutes, and the last one run for hours (but eventually crash with the same error). This is purely an empirical observation; we can't figure out why it followed this pattern.
Because a TRUNCATE had been executed a few minutes before the first crash, we focused our efforts on recovering what we thought was a corrupted ibd file or InnoDB index (http://dba.stackexchange.com/questions/29870/mysql-innodb-corruption-after-server-crash-during-concurrent-truncate-command). (Our experience with TRUNCATE DDL in Galera is not good: it always ends up in deadlocks, so we're trying to replace it with "DELETE FROM" or RENAME + TRUNCATE on an "offline" table.)
That led nowhere: on a new cluster, where we had imported data dumped from the old one with innodb_force_recovery = 1, the issue was still present.
After this, we noticed that some kind of bot was hitting a payment gateway page, causing the CMS (Prestashop) to execute tens of DELETE and SELECT statements per second on the same table:
SELECT * FROM `ps_ccpayments` WHERE `id_cart` = 0 LIMIT 1;
DELETE FROM `ps_ccpayments` WHERE `id_cart` = 0;
We saw no INSERT, and under no circumstances would that table contain a row with id_cart = 0, so we would expect those queries to always operate on an empty result set.
Changing the code so that it no longer executes them fixed the issue. Let me clarify that the nodes are not under high load and there are no resource constraints. MariaDB is configured so that it cannot fill up a node's RAM.
We're not able to reproduce the issue. If we manually launch those queries, they execute properly. I think the issue could be related to:
- the extremely high number of concurrent requests of that kind
- some kind of 0-day triggered by malicious input (I'm saying this because it happens on a page where we process credit card payments)
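The first hypothesis (sheer concurrency) could be checked against a disposable test cluster with a small load script. What follows is only a sketch under assumptions, not something that was run against the affected cluster: the function name `hammer`, the `connect` callable, and the default worker and iteration counts are all hypothetical. It simply fires the same SELECT + DELETE pair from many threads at once, using any DB-API 2.0 driver.

```python
import threading

# The two statements observed in production, unchanged.
SELECT_SQL = "SELECT * FROM `ps_ccpayments` WHERE `id_cart` = 0 LIMIT 1"
DELETE_SQL = "DELETE FROM `ps_ccpayments` WHERE `id_cart` = 0"


def hammer(connect, workers=50, iterations=1000):
    """Run the SELECT + DELETE pair concurrently from `workers` threads.

    `connect` is a hypothetical zero-argument callable that returns a
    DB-API 2.0 connection; the driver and credentials are up to the caller.
    Returns a tuple (pairs_completed, errors).
    """
    lock = threading.Lock()
    stats = {"pairs": 0, "errors": []}

    def record_error(exc):
        with lock:
            stats["errors"].append(repr(exc))

    def worker():
        try:
            conn = connect()
        except Exception as exc:  # e.g. node unreachable
            record_error(exc)
            return
        try:
            cur = conn.cursor()
            for _ in range(iterations):
                try:
                    cur.execute(SELECT_SQL)
                    cur.fetchall()
                    cur.execute(DELETE_SQL)
                    conn.commit()
                except Exception as exc:
                    # Record the first failure and stop this worker.
                    record_error(exc)
                    break
                with lock:
                    stats["pairs"] += 1
        finally:
            conn.close()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["pairs"], stats["errors"]
```

With a driver such as PyMySQL, `connect` could be something like `lambda: pymysql.connect(host=..., user=..., password=..., database=...)`, pointed at one node or spread across several. If the crash depends on malicious input rather than on concurrency, a script like this would not be expected to trigger it.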
If you need any additional details, please just ask; any help would be really appreciated. I'm attaching my.cnf as well.