Details
Description
We updated MariaDB Galera Version to 5.5.41 on 19th of January.
We start a daily mysqldump on 21:00 CET. This worked for ~ 1 year without any issues.
Last night (18th of FEB), the following issue happend at 21:00 (9 p.m.):
- mysqldump starts on Node 1 with the following parameters:
mysqldump -uroot -p --create-options {DATABASE_NAME}-f
- simultaneously, the following query did not complete on Node 1 and remains with the state "sql end" on the server:
INSERT INTO ` {DBNAME}`.`{TABLE_NAME}` (isin, date, count )\n VALUES ('XXXX12345', CURDATE() , 1 )\n ON DUPLICATE KEY UPDATE count=count+1
* full mysqldump on Node 1 completed with a duration of only a minute without any issues like every day, dumps are valid and complete
* new incoming connections on Node 1 try to access the same table, but are waiting in state "Waiting for table level lock"
INSERT INTO `{DBNAME}`.`
{TABLE_NAME}` (isin, date, count )\n VALUES ('YYYY67890', CURDATE() , 1 )\n ON DUPLICATE KEY UPDATE count=count+1
- INSERT/UPDATE statements on the same table on Node 2/3 do also not complete. Sessions are haning with similar statements in states "update" and "query end", first sessions appeared on 21:00
- Number of connections increases on all 3 nodes, since all nodes are waiting for first INSERT statement on Node 1 to complete, therefore replication stopped.
- Nodes 2 and 3 ran out of free connections and resources at ~ 21:10
- INSERT statement on Node 1 could not be killed, whole Cluster needed to be shut down and restarted
- Errorlogs do not show any issues until the high amount of connections leads to service unavailability
- Nodes did not write any Core Dumps
- This Incident is (hopefully) not reproducable, as already mentioned the cluster ran wihtout any issues about a year, the mysqldump is made every day at the same time.