Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 10.1.12, 10.1.14
    • Fix Version/s: 10.1.30
    • Component/s: Galera, Replication
    • Environment: Ubuntu 14.04 LTS, amd64, VMware vSphere 6.0, VM v8, 2 vCPU, 6.1G RAM. /var/db volume is 56G used out of 200G total, FS is ext4 with rw,relatime mount flags. Deadline IO scheduler used for /var/db.

    Description

      Background: In February 2016 we migrated from a MariaDB 5.5 active/passive replication cluster to a MariaDB 10.1 Galera active/active cluster with two DB nodes and one arbitrator node.

      This setup was made in preparation for a new DC. So the final setup when the new DC is ready will be two db nodes in two DCs each, and one arbitrator in a third DC. For now it's all in one DC with two DB nodes handling queries and one arbitrator doing backups with innobackupex.

      The solution was stable for a while, and the first precisely recorded crash came on 2016-03-30.

      Some crash times I have recorded:

      2016-03-30 18:47: signal 11
      2016-04-04 06:37: signal 11
      2016-05-17 02:00: signal 11
      2016-05-25: we upgraded from 10.1.12 to 10.1.14 and the issue seemed resolved until last night.
      2016-06-28 19:41: signal 11

      There are more, equally random, that I have not recorded precisely. The crash happens randomly on either of the two db nodes.
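
To put numbers on the spacing, the gaps between the four precisely recorded crashes can be computed from the timestamps above. This is only a rough sketch in Python; the names `CRASHES` and `crash_gaps` are illustrative, and the result says nothing about the cause.

```python
from datetime import datetime

# Crash timestamps exactly as recorded in this report.
CRASHES = [
    "2016-03-30 18:47",
    "2016-04-04 06:37",
    "2016-05-17 02:00",
    "2016-06-28 19:41",
]


def crash_gaps(stamps):
    """Return the timedelta between each consecutive pair of timestamps."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in stamps]
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    for gap in crash_gaps(CRASHES):
        print(gap)
```

The first gap is about 4.5 days and the next two are each just under 43 days, with the upgrade to 10.1.14 falling inside the last gap.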

      Each crash has left the node in an unclean state (for example, seqno -1 in grastate.dat), so the end result has always been removal of the datadir and a full SST to the crashed node using xtrabackup-v2.
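
A check for that unclean state can be sketched as below. It assumes the usual `key: value` layout of Galera's grastate.dat; the sample contents (including the all-zero uuid and the version number) and the helper names are made up for illustration and are not taken from the affected nodes.

```python
def parse_grastate(text):
    """Parse 'key: value' lines from grastate.dat text, skipping comments."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def has_unknown_position(state):
    """True when no valid seqno was saved (missing or negative)."""
    return int(state.get("seqno", "-1")) < 0


# Illustrative contents only; a real file lives in the node's datadir.
SAMPLE = """# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
"""

if __name__ == "__main__":
    print(has_unknown_position(parse_grastate(SAMPLE)))
```

A negative seqno only says that the last committed position was not saved on shutdown; it does not by itself show that the data files are damaged.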

      The server is used by an authentication system, so there are many simple read queries for user data, but the bulk of the stored data is auth logging written by simple insert queries. Because of data retention, this logging takes up 54G of the total 56G on that volume.

      I have attached one crashlog from each db node, two separate crash times.

      I have also attached my configuration which is mostly centered in the file /etc/mysql/conf.d/replication.conf.

      I monitor many things, such as tps, system load and memory use on the nodes, but I can see no deviations in these graphs, except that when the mysqld process crashes around 3G of RAM (out of 3.7G used) is freed and tps goes down.

      Attachments

        1. crashlog-20160330.txt
          4 kB
        2. crashlog-20160628.txt
          4 kB
        3. optimizations.cnf
          0.1 kB
        4. replication.cnf
          0.7 kB

          Activity

            It looks the same as or very similar to MDEV-9510 and Co.
            I'm keeping the whole group open so that nirbhay_c could look at them together, maybe it will help.

            elenst Elena Stepanova added the comment above.

            It happened again on my system with the same traceback. The difference this time was that I could not perform an SST for some reason. It kept failing with this error:

            Binlog file './mydb-bin.000099' not found in binlog index, needed for recovery.

            Nothing I tried helped, including upgrading percona-xtrabackup and manually transferring the binlogs and index.

            Eventually what worked was to disable binlogs completely. I believe xtrabackup has an issue with binlogs, shown here: https://github.com/percona/percona-xtrabackup/pull/201

            I also can't help but notice that all of my crashes have shown _ZN13MYSQL_BIN_LOG13mark_xid_doneEmb+0xc7 in the traceback. Maybe disabling binary logs will help with the crashes too, but if that's the case it's not an acceptable long-term solution.
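
For readers unfamiliar with mangled names, the frame above decodes to MYSQL_BIN_LOG::mark_xid_done(unsigned long, bool). The Python sketch below shows how that follows from the Itanium C++ ABI mangling scheme. It is a deliberately tiny decoder written for this one symbol shape (a nested name followed by builtin parameter types); `demangle_simple` is a hypothetical helper, and a real tool such as c++filt should be used for anything else.

```python
import re

# Small subset of Itanium C++ ABI builtin type codes; enough for this frame.
BUILTIN_TYPES = {
    "v": "void", "b": "bool", "c": "char", "i": "int", "j": "unsigned int",
    "l": "long", "m": "unsigned long", "x": "long long",
    "y": "unsigned long long",
}


def demangle_simple(frame):
    """Decode _ZN<len><name>...E<builtin params>, ignoring a trailing +0x offset."""
    symbol = frame.partition("+")[0]
    if not symbol.startswith("_ZN"):
        raise ValueError("only _ZN... nested names are handled in this sketch")
    pos = 3
    parts = []
    # Each name component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos] != "E":
        match = re.match(r"\d+", symbol[pos:])
        if match is None:
            raise ValueError("unsupported construct at offset %d" % pos)
        length = int(match.group())
        pos += len(match.group())
        parts.append(symbol[pos:pos + length])
        pos += length
    if pos >= len(symbol):
        raise ValueError("nested name is not terminated by 'E'")
    pos += 1  # skip the 'E' that closes the nested name
    params = []
    for code in symbol[pos:]:
        if code not in BUILTIN_TYPES:
            raise ValueError("unsupported parameter code %r" % code)
        params.append(BUILTIN_TYPES[code])
    if params == ["void"]:
        params = []  # a lone 'v' means an empty parameter list
    return "::".join(parts) + "(" + ", ".join(params) + ")"


if __name__ == "__main__":
    print(demangle_simple("_ZN13MYSQL_BIN_LOG13mark_xid_doneEmb+0xc7"))
```

So every recorded crash passed through a method of the binary log class, which fits the suspicion that binary logging is involved.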

            I've also compiled a debug mysqld binary using the build scripts included in the Debian package, so I will see if I can get a better trace for the next crash.

            stemid Stefan Midjich added the comment above.

            I was affected by the related issue https://jira.mariadb.org/browse/MDEV-10276, which happened during logrotate, but now I'm also experiencing random crashes like this. Debian 8, 3-node Galera cluster, MariaDB 10.1.13 and 10.1.16.

            mradzikowski Maciej Radzikowski added the comment above.

            It's now been 44 days since I disabled binary logs and not a single crash. It might have something to do with binary logging.

            My next step will be to use a debug binary and re-enable binary logs to see if I can produce more debug info.

            stemid Stefan Midjich added the comment above (edited).

            > It's now been 44 days since I disabled binary logs and not a single crash. It might have something to do with binary logging.

            Right. The issue is around binary log rotation and writing of binlog_checkpoint_log_event.

            > My next step will be to use a debug binary and re-enable binary logs to see if I can produce more debug info.

            Thanks!

            nirbhay_c Nirbhay Choubey (Inactive) added the comment above.

            10.1.17, released a couple of days back, will print some additional related details to the error log with --wsrep-debug=ON.

            nirbhay_c Nirbhay Choubey (Inactive) added the comment above.

            I'm sorry to say I never did try the debug build of MariaDB.

            We've simply moved on without binary logs, and since it's a production environment we have had neither the time nor the motivation to try anything else.

            We use xtrabackup for incremental backups so there is no need for binary logs. The system has been stable since we disabled them.

            stemid Stefan Midjich added the comment above.

            MDEV-9510 is a fixed parent.

            Elkin Andrei Elkin added the comment above.

            People

              Assignee: sachin.setiya.007 Sachin Setiya (Inactive)
              Reporter: stemid Stefan Midjich
              Votes: 2
              Watchers: 7

