Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 10.1.25, 10.1.26
    • Fix Version/s: N/A
    • Component/s: wsrep
    • Labels: None
    • Environment: Ubuntu 16.04.2, MariaDB 10.1.25 and 10.1.26

    Description

      We have a 3-node Galera Cluster, and this weekend we tried rebooting each node one at a time for an EC2 instance upgrade. When the servers came back online, they each crashed with signal 11 while trying to rejoin the cluster.

      The crash occurs during WSREP recovery. If I set wsrep_on=OFF, MySQL starts up without crashing, but it crashes again when I dynamically set wsrep_on=ON.

      Nothing shows up in the other two nodes' logs while the joining node is starting up, before it crashes, and all ports are open between the Galera nodes. Each node is running MariaDB 10.1.25, but I upgraded one node to 10.1.26 to see whether the problem was fixed there, and it exhibited the same behavior. The only way I was able to get the nodes to rejoin the cluster was to force an SST sync. However, the data directory is 1.8TB, so that is far from ideal for each node restart.

      I've attached the wsrep_recovery log and the apport crash file, but the crash file doesn't contain the core dump for some reason. I've also uploaded the my.cnf and a mariadb.cnf config file containing the Galera Cluster related config options.

      Attachments

        1. _usr_sbin_mysqld.0.crash
          17 kB
        2. mariadb.cnf
          0.9 kB
        3. my.cnf
          2 kB
        4. wsrep_recovery.KbkcqG
          5 kB

          Activity

            btraywick Bryan Traywick added a comment

            I do have the dump available and it is InnoDB in the dump:

            --
            -- Table structure for table `servers`
            --

            DROP TABLE IF EXISTS `servers`;
            /*!40101 SET @saved_cs_client     = @@character_set_client */;
            /*!40101 SET character_set_client = utf8 */;
            CREATE TABLE `servers` (
              `Server_name` char(64) NOT NULL DEFAULT '',
              `Host` char(64) NOT NULL DEFAULT '',
              `Db` char(64) NOT NULL DEFAULT '',
              `Username` char(80) NOT NULL DEFAULT '',
              `Password` char(64) NOT NULL DEFAULT '',
              `Port` int(4) NOT NULL DEFAULT '0',
              `Socket` char(64) NOT NULL DEFAULT '',
              `Wrapper` char(64) NOT NULL DEFAULT '',
              `Owner` char(64) NOT NULL DEFAULT '',
              PRIMARY KEY (`Server_name`)
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='MySQL Foreign Servers table';

            The enforce_storage_variables variable is blank in the new MariaDB 10.1.25 cluster and doesn't appear to be present at all on the older 10.1.14 cluster (just gives me an empty set when I try to show it).
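
For readers who want to check whether a node is in the same state, the engine of mysql.servers can be read from information_schema. This is a minimal sketch and was not posted in the thread. The variable lookup uses a wildcard because the exact variable name is an assumption here; the name quoted in the comment is left as written.

```sql
-- Sketch (not from the thread): inspect the storage engine of mysql.servers.
SELECT TABLE_SCHEMA, TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mysql'
  AND TABLE_NAME = 'servers';

-- Wildcard match, because the exact variable name is an assumption.
-- An empty set means no variable with this prefix exists on the server.
SHOW GLOBAL VARIABLES LIKE 'enforce_storage%';
```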

            anikitin Andrii Nikitin (Inactive) added a comment

            Thank you for the confirmation. It should be safe to just convert it back to MyISAM to avoid the problem, but it may be a good idea to do that during downtime.
            btraywick Bryan Traywick added a comment (edited)

            Thank you so much, Andrii. I was able to recreate the issue in our staging cluster by converting mysql.servers to InnoDB on one of the nodes and restarting that node. I was then able to start up that node with wsrep_on=OFF, convert the table back to MyISAM, and then restart MySQL with wsrep_on=ON, and it was able to rejoin the cluster without an SST sync.

            We will be converting the table back to MyISAM in our production cluster tonight and will restart a node to ensure it doesn't need a full resync. I will report back with confirmation once that has gone successfully but this appears to be the fix we are looking for.

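
The workaround described in the comment above could look roughly like the following on the affected node. This is a sketch under assumptions, not the exact statements used in the thread: it assumes the node has already been started with wsrep_on=OFF in its configuration and that the session has privileges to alter system tables. Only the ALTER TABLE is the change itself; the other statements are checks.

```sql
-- Confirm the node is running with replication disabled before changing anything.
SHOW GLOBAL VARIABLES LIKE 'wsrep_on';

-- Convert the system table back to MyISAM, as suggested in the thread.
ALTER TABLE mysql.servers ENGINE = MyISAM;

-- Verify that the table definition now reports ENGINE=MyISAM.
SHOW CREATE TABLE mysql.servers;
```

Once the engine is confirmed, the node would be restarted with wsrep_on=ON, as the comment describes.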

            btraywick Bryan Traywick added a comment

            Thanks again, Andrii. We converted the table back to MyISAM and were able to restart MySQL with only an IST sync required. As a final test, I also tried restarting MySQL on one of the nodes in the older Galera Cluster running MariaDB 10.1.14, and we didn't run into this crash despite the mysql.servers table being InnoDB there as well. So it's likely a change introduced between 10.1.14 and 10.1.25. The older cluster is also running Ubuntu 14.04 and the 10.1.25 cluster is running 16.04, so it could be something to do with the systemd init scripts.
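
After a restart like the one described above, the node's cluster membership can be checked from SQL using the standard wsrep status variables. This is a hedged sketch and was not part of the thread. These variables show whether the node has rejoined and caught up; they do not show whether IST or SST was used, which is normally visible in the server error log.

```sql
-- Node state as reported by the wsrep provider; a caught-up node reports 'Synced'.
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';

-- Number of nodes currently in the cluster component this node belongs to.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';

-- Whether the node is ready to accept queries.
SHOW GLOBAL STATUS LIKE 'wsrep_ready';
```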

            janlindstrom Jan Lindström added a comment

            10.1 is EOL.

            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: btraywick Bryan Traywick
              Votes: 0
              Watchers: 5
