  MariaDB Server / MDEV-6225

Idle replication slave keeps crashing.

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.5.37-galera
    • Fix Version/s: 5.5.39-galera, 5.5.39
    • None
    • Environment: Debian 7.5, kernel 3.2.54-2, 24 core Xeon E5645 @ 2.40 GHz, 48GB RAM
      Running mysqld_multi with a dozen instances; the crashing one has a 10G buffer pool.

    Description

      I'm using mysqld_multi to run many instances of mysql on the same machine. Each instance is in a 3-node Galera cluster, so there are three physical boxes, each running a dozen instances, which together make up a dozen clusters.

      One instance will always crash given enough time. Sometimes it's less than a day, sometimes it runs for a week, but it always eventually crashes. That instance is no longer in a cluster while I troubleshoot it, but Galera is still loaded. I've also tried not loading the provider, with the same results.

      It is just replicating and not serving any real queries/traffic. Unfortunately, I haven't been able to narrow it down to a specific table or query. Today was the first time another instance has ever crashed (it had been running for several months). What does stand out, though, is that it's always the host acting as a slave that crashes. The other instances of the cluster did not crash while it was clustered. Someone on #maria IRC mentioned deadlocks with replication and SHOW SLAVE STATUS, but I can't confirm anything.

      I was encouraged to submit a report and include my core dump traces. The 3rd dump labelled mysql-13 is the new crash I was referring to. The other two are from the same instance that has been repeatedly crashing. When it does crash, I rebuild the data fresh. I have the same data on a host that is running mariadb-server-5.5 (not galera) and has never crashed. It's actually what I use to rebuild/reseed this system from when it crashes.

      Attachments

        1. core-2014-05-12
          6 kB
        2. core-2014-05-21.tgz
          4 kB
        3. core-dumps
          12 kB
        4. crashes-2014-05-23-and-25.tgz
          3 kB
        5. mysqld.err.2014-05-12
          5 kB

        Activity

          jplindst Jan Lindström (Inactive) added a comment -

          Hi,

          Could you also attach the full, unedited error log (at least one) from the crashing server?

          R: Jan

          sophomeric Eric Webster added a comment -

          I did not save them unfortunately, only the core dumps. I can post one as soon as it happens again though.

          sophomeric Eric Webster added a comment -

          New core backtrace attached and associated error log.

          sophomeric Eric Webster added a comment -

          Another core backtrace and associated error log.

          sophomeric Eric Webster added a comment -

          Two more crashes from the weekend. I'm now running it on two machines (identical hardware, software, and data) and they are both crashing, but at different times. If it were a bad query or something, you'd think it would crash both.

          sophomeric Eric Webster added a comment -

          I'm still collecting core dumps from these slaves but I've stopped posting them since there hasn't been any reply. If they are still useful, let me know and I'll continue to post them.


          jplindst Jan Lindström (Inactive) added a comment -

          revno: 3506
          committer: Jan Lindström <jplindst@mariadb.org>
          branch nick: maria-5.5-galera
          timestamp: Mon 2014-06-30 14:02:54 +0300
          message:
          MDEV-6225: Idle replication slave keeps crashing.

          Analysis: Based on crashed the buffer pool instance identifier is
          not correct on block to be freed. Add LRU list mutex holding
          on functions calling free and add additional safety checks.

          jplindst Jan Lindström (Inactive) added a comment -

          5.5:

          revno: 4221
          committer: Jan Lindström <jplindst@mariadb.org>
          branch nick: 5.5
          timestamp: Mon 2014-06-30 14:06:28 +0300
          message:
          MDEV-6225: Idle replication slave keeps crashing.

          Analysis: Based on crashed the buffer pool instance identifier is
          not correct on block to be freed. Add LRU list mutex holding
          on functions calling free and add additional safety checks.
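
One way to picture the analysis in the commit messages above (a block carrying the wrong buffer pool instance identifier when it is freed, addressed by holding the LRU list mutex in the functions that call free and by adding safety checks) is the minimal C++ sketch below. It is an illustration under assumptions, not the server's code and not the actual patch: `Block`, `PoolInstance`, `add_block`, `free_block`, and `lru_length` are hypothetical names invented for this example, and the real data structures are considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical stand-in for a buffer block (not the server's real block type).
struct Block {
    std::size_t pool_id;  // identifier of the pool instance that owns the block
    bool        in_lru;   // true while the block is linked into an LRU list
};

// Hypothetical stand-in for one buffer pool instance.
class PoolInstance {
public:
    explicit PoolInstance(std::size_t id) : id_(id) {}

    // Link a block into this instance's LRU list and stamp it with our id.
    void add_block(Block* b) {
        assert(b != nullptr);
        std::lock_guard<std::mutex> guard(lru_mutex_);
        b->pool_id = id_;
        b->in_lru = true;
        lru_.push_back(b);
    }

    // Free a block. The LRU list mutex is held across the checks and the
    // unlink, so no other thread using this mutex can change the block's
    // LRU state in between. Returns false, leaving the block untouched,
    // if any safety check fails.
    bool free_block(Block* b) {
        std::lock_guard<std::mutex> guard(lru_mutex_);
        if (b == nullptr || !b->in_lru || b->pool_id != id_) {
            return false;  // wrong instance, or block is not on an LRU list
        }
        lru_.remove(b);
        b->in_lru = false;
        return true;
    }

    // Number of blocks currently on this instance's LRU list.
    std::size_t lru_length() {
        std::lock_guard<std::mutex> guard(lru_mutex_);
        return lru_.size();
    }

private:
    std::size_t       id_;
    std::mutex        lru_mutex_;
    std::list<Block*> lru_;
};
```

In this sketch, a block added to one instance is rejected by `free_block` on a different instance, accepted once by its owning instance, and rejected again on a second attempt to free it. Returning `false` instead of aborting is a design choice made here for illustration only.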

          People

            jplindst Jan Lindström (Inactive)
            sophomeric Eric Webster
            Votes: 0
            Watchers: 2