Status: Closed
Ubuntu 14.04 LTS, amd64, VMware vSphere 6.0, VM v8, 2 vCPU, 6.1G RAM. The /var/db volume has 56G used of 200G total; the filesystem is ext4 mounted rw,relatime, and the deadline I/O scheduler is used for /var/db.
Background: In February 2016 we migrated from a MariaDB 5.5 active/passive replication cluster to a MariaDB 10.1 Galera active/active cluster with two DB nodes and one arbitrator node.
This setup was made in preparation for a new DC: the final setup, once the new DC is ready, will be two DB nodes, one in each of two DCs, and one arbitrator in a third DC. For now everything is in one DC, with the two DB nodes handling queries and the arbitrator node doing backups with innobackupex.
The setup was stable for a while; the first precisely recorded crash was on 2016-03-30.
Some crash times I have recorded:
2016-03-30 18:47: signal 11
2016-04-04 06:37: signal 11
2016-05-17 02:00: signal 11
2016-05-25: upgraded from 10.1.12 to 10.1.14; the issue seemed resolved until last night.
2016-06-28 19:41: signal 11
There are more crashes, equally random, that I have not recorded precisely. The crash happens randomly on either of the two DB nodes.
Each crash has left the node in an unclean state (for example, seqno: -1 in grastate.dat), so recovery has always meant removing the datadir and doing a full SST to the crashed node using xtrabackup-v2.
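For reference, a rough sketch of the per-crash recovery check described above (paths and the sample grastate.dat contents are illustrative, not taken from the actual nodes):

```shell
# Hypothetical sketch: decide between a normal restart and a full SST
# by inspecting the saved Galera state. A seqno of -1 in grastate.dat
# means the node shut down uncleanly and cannot rejoin incrementally.

# Sample grastate.dat as written by a crashed node (illustrative values):
cat > /tmp/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
EOF

seqno=$(awk '/^seqno:/ {print $2}' /tmp/grastate.dat)
if [ "$seqno" = "-1" ]; then
  # Unclean state: wipe the datadir and let the donor stream a full SST
  echo "unclean shutdown: full SST required"
else
  echo "clean state at seqno $seqno"
fi
```

In the unclean case the actual recovery on our nodes is then: stop mysqld, remove the datadir, and start mysqld so the donor performs the SST via xtrabackup-v2.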
The server is used by an authentication system, so there are many simple read queries for user data, but the bulk of the stored data is auth logging written with simple INSERT queries; this retained log data accounts for 54G of the 56G used on that volume.
I have attached one crash log from each DB node, from two separate crash times.
I have also attached my configuration, which is mostly contained in /etc/mysql/conf.d/replication.conf.
I monitor many metrics on the nodes (TPS, system load, memory use), but I can see no deviations in these graphs, except that when the mysqld process crashes, about 3G of RAM (out of 3.7G in use) is freed and TPS drops.
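The OS-level graphs show nothing, so one thing worth adding to the monitoring might be Galera's own counters. A sketch, with the status line hard-coded here for illustration (on a live node it would come from `mysql -N -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue_avg'"`; the 0.1 threshold is an assumed example value):

```shell
# Hypothetical check: flag a node whose average receive queue suggests
# the applier is falling behind (flow-control pressure).
# Sample status output, as the mysql client would print it:
status_line="wsrep_local_recv_queue_avg 0.25"

avg=$(echo "$status_line" | awk '{print $2}')
# Warn when the average exceeds the threshold (queue not draining)
awk -v v="$avg" 'BEGIN { exit !(v > 0.1) }' && echo "WARN: recv queue backing up"
```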