[MDEV-10301] Signal 11 crash at random times Created: 2016-06-29 Updated: 2017-12-13 Resolved: 2017-12-13 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Replication |
| Affects Version/s: | 10.1.12, 10.1.14 |
| Fix Version/s: | 10.1.30 |
| Type: | Bug | Priority: | Major |
| Reporter: | Stefan Midjich | Assignee: | Sachin Setiya (Inactive) |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | binlog, galera, replication | ||
| Environment: |
Ubuntu 14.04 LTS, amd64, VMware vSphere 6.0, VM v8, 2 vCPU, 6.1G RAM. /var/db volume is 56G used out of 200G total, FS is ext4 with rw,relatime mount flags. Deadline IO scheduler used for /var/db. |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
Background: We migrated from a MariaDB 5.5 active/passive replication cluster in February 2016 to a MariaDB 10.1 Galera active/active cluster with two DB nodes and one arbitrator node. This setup was made in preparation for a new DC. So the final setup when the new DC is ready will be two DB nodes in two DCs each, and one arbitrator in a third DC. For now it's all in one DC, with two DB nodes handling queries and one arbitrator doing backups with innobackupex.

The solution was stable for a while and the first precisely recorded crash came 2016-03-30. Some crash times I have recorded are:

2016-03-30 18:47: signal 11

There are more, equally random, that I have not recorded precisely. The crash happens randomly on either of the two DB nodes. Each crash has resulted in an unclean state, -1 in grastate for example, so the end result has always been a removal of the datadir and a full SST to the crashed node using xtrabackup-v2.

The server is used by an authentication system, so the load is many simple read queries for user data, but the bulk of the stored data is auth logging written with simple insert queries. This logging is what takes up 54G of the total 56G on that volume, because of data retention.

I have attached one crashlog from each DB node, from two separate crash times. I have also attached my configuration, which is mostly centered in the file /etc/mysql/conf.d/replication.conf.

I monitor many things like tps, system load and memory use on the nodes, but I can see no deviations in these graphs except that when the mysqld process crashes, around 3G of RAM (out of 3.7G used) is freed and tps goes down. |
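Editor's note: the "-1 in grastate" check mentioned above can be scripted. The following is a minimal sketch, assuming the Galera state file uses plain `key: value` lines with `#` comments. The sample content and the helper names `parse_grastate` and `has_unknown_position` are illustrative assumptions, not taken from the reporter's system. A seqno of -1 can also appear while a node is running normally, so the check is only meaningful on a stopped or crashed node.

```python
def parse_grastate(text: str) -> dict:
    """Parse 'key: value' lines of a Galera state file, skipping comments."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only; values are kept as strings.
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def has_unknown_position(state: dict) -> bool:
    """True when the saved sequence number is -1 (position not recorded)."""
    return state.get("seqno") == "-1"


# Hypothetical file content, for illustration only.
SAMPLE = """\
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
"""

if __name__ == "__main__":
    state = parse_grastate(SAMPLE)
    print(state["seqno"], has_unknown_position(state))
```

On a real node the text would be read from the state file in the datadir instead of `SAMPLE`.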
| Comments |
| Comment by Elena Stepanova [ 2016-06-29 ] |
|
It looks the same as or very similar to |
| Comment by Stefan Midjich [ 2016-07-17 ] |
|
It happened again on my system with the same traceback. The difference this time was that I could not perform an SST for some reason. It kept failing with this error: Binlog file './mydb-bin.000099' not found in binlog index, needed for recovery. Nothing I tried helped, for example upgrading percona-xtrabackup or manually transferring the binlogs and index. Eventually what worked was to disable binlogs completely. I believe xtrabackup has an issue with binlogs, shown here: https://github.com/percona/percona-xtrabackup/pull/201 I also can't help noticing that all of my crashes have shown _ZN13MYSQL_BIN_LOG13mark_xid_doneEmb+0xc7 in the traceback. Maybe disabling binary logs will help with the crashes too, but if that's the case it's not an acceptable long-term solution. I've also compiled a debug mysqld binary using the build scripts included in the Debian package, so I will see if I can get a better trace for the next crash. |
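Editor's note: the traceback symbol quoted above is a mangled C++ name. The sketch below decodes it by hand to show which function it refers to. It covers only a small subset of the Itanium C++ ABI mangling (nested names with length-prefixed components and a few builtin parameter types). The function name `demangle_simple` is an illustrative assumption, and a real tool such as `c++filt` should be preferred for anything beyond this case.

```python
# Single-letter codes for a few builtin types in the Itanium C++ ABI.
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "i": "int",
    "j": "unsigned int", "l": "long", "m": "unsigned long",
    "x": "long long", "y": "unsigned long long",
    "f": "float", "d": "double",
}


def demangle_simple(symbol: str) -> str:
    """Decode a '_ZN<len><name>...E<types>' symbol; raise on anything else."""
    name = symbol.split("+", 1)[0]  # drop a trailing "+0x..." offset
    if not name.startswith("_ZN"):
        raise ValueError("only _ZN nested names are handled")
    pos = 3
    parts = []
    # Each component is a decimal length followed by that many characters.
    while pos < len(name) and name[pos] != "E":
        start = pos
        while pos < len(name) and name[pos].isdigit():
            pos += 1
        if pos == start:
            raise ValueError("unsupported name component")
        length = int(name[start:pos])
        parts.append(name[pos:pos + length])
        pos += length
    pos += 1  # skip the closing "E"
    params = []
    for code in name[pos:]:
        if code not in BUILTIN_TYPES:
            raise ValueError("unsupported parameter type: " + code)
        params.append(BUILTIN_TYPES[code])
    if params == ["void"]:
        params = []  # "v" alone means an empty parameter list
    return "::".join(parts) + "(" + ", ".join(params) + ")"


if __name__ == "__main__":
    print(demangle_simple("_ZN13MYSQL_BIN_LOG13mark_xid_doneEmb+0xc7"))
```

For the symbol in the traceback this yields `MYSQL_BIN_LOG::mark_xid_done(unsigned long, bool)`, a method of the binary log class, which fits the suspicion that the crashes are tied to binary logging.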
| Comment by Maciej Radzikowski [ 2016-08-22 ] |
|
I was affected by the related issue https://jira.mariadb.org/browse/MDEV-10276, which happened during logrotate, but now I'm also experiencing random crashes like this one. Debian 8, 3-node Galera cluster, MariaDB 10.1.13 and 10.1.16. |
| Comment by Stefan Midjich [ 2016-08-30 ] |
|
It's now been 44 days since I disabled binary logs and not a single crash. It might have something to do with binary logging. My next step will be to use a debug binary and re-enable binary logs to see if I can produce more debug info. |
| Comment by Nirbhay Choubey (Inactive) [ 2016-08-30 ] |
Right. The issue is around binary log rotation and writing of binlog_checkpoint_log_event.
thanks! |
| Comment by Nirbhay Choubey (Inactive) [ 2016-09-01 ] |
|
10.1.17, released a couple of days back, will print some additional related details to the error log with --wsrep-debug=ON. |
| Comment by Stefan Midjich [ 2017-04-18 ] |
|
I'm sorry to say I never did try the debug build of MariaDB. We've simply moved on without binary logs, and since it's a production environment we have had neither the time nor the motivation to try anything else. We use xtrabackup for incremental backups, so there is no need for binary logs. The system has been stable since we disabled them. |
| Comment by Andrei Elkin [ 2017-12-13 ] |
|
|