[MDEV-13906] Crash during WSREP recovery Created: 2017-09-25 Updated: 2023-04-12 Resolved: 2023-04-11
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | wsrep |
| Affects Version/s: | 10.1.25, 10.1.26 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Bryan Traywick | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Environment: | Ubuntu 16.04.2, MariaDB 10.1.25 and 10.1.26 |
| Description |
We have a three-node Galera Cluster, and this weekend we rebooted each node one at a time for an EC2 instance upgrade. When the servers came back online, each one crashed with signal 11 while trying to rejoin the cluster. The crash occurs during WSREP recovery. If I set wsrep_on=OFF, MySQL starts up without crashing, but it crashes again when I dynamically set wsrep_on=ON. Nothing shows up in the other two nodes' logs while the joining node is starting up before it crashes, and all ports are open between the Galera nodes.
Each node is running MariaDB 10.1.25, but I upgraded one node to 10.1.26 to see whether the problem was fixed there, and it exhibited the same behavior. The only way I was able to get the nodes to rejoin the cluster was to force an SST sync. However, the data directory is 1.8TB, so that is far from ideal for each node restart.
I've attached the wsrep_recovery log and the apport crash file, but it doesn't contain the core dump for some reason. I've also uploaded the my.cnf and a mariadb.cnf config file containing the Galera Cluster related config options.
| Comments |
| Comment by Andrii Nikitin (Inactive) [ 2017-09-26 ] |
Could you provide output from one of nodes:
| Comment by Andrii Nikitin (Inactive) [ 2017-09-26 ] |
I strongly believe that I could generate a similar crash by altering mysql.servers from MyISAM to InnoDB and then trying to rejoin one node.
While we must add protection against this behavior:
| Comment by Bryan Traywick [ 2017-09-26 ] |
Here is the output from those commands:
The mysql.servers table does indeed appear to be InnoDB instead of MyISAM. As far as I know we didn't perform any explicit conversion from MyISAM to InnoDB on this table. The table structure was created from a mysqldump from another Galera Cluster running MariaDB 10.1.14.
| Comment by Andrii Nikitin (Inactive) [ 2017-09-26 ] |
So do you have that dump available? Is the table InnoDB or MyISAM in it? Maybe you have the following option configured on some of the nodes?
show variables like 'enforce_storage_engine';
| Comment by Bryan Traywick [ 2017-09-26 ] |
I do have the dump available and it is InnoDB in the dump:
The enforce_storage_engine variable is blank in the new MariaDB 10.1.25 cluster and doesn't appear to be present at all on the older 10.1.14 cluster (I just get an empty set when I try to show it).
| Comment by Andrii Nikitin (Inactive) [ 2017-09-26 ] |
Thank you for the confirmation. It should be safe to just convert it back to MyISAM to avoid the problem, but it may be a good idea to do that during downtime.
| Comment by Bryan Traywick [ 2017-09-26 ] |
Thank you so much Andrii. I was able to recreate the issue in our staging cluster by converting mysql.servers to InnoDB on one of the nodes and restarting that node. I was then able to start up that node with wsrep_on=OFF, convert the table back to MyISAM, and then restart MySQL with wsrep_on=ON, and it was able to rejoin the cluster without an SST sync. We will be converting the table back to MyISAM in our production cluster tonight and will restart a node to ensure it doesn't need a full resync. I will report back with confirmation once that has gone successfully, but this appears to be the fix we are looking for.
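For readers following along, the workaround described in this comment can be sketched in SQL roughly as follows. This is a minimal sketch, not the exact statements run in this report: it assumes the affected node has already been started with wsrep_on=OFF, and the information_schema query is an illustrative way to check the table's engine rather than something quoted from the thread.

```sql
-- Assumption: this node was started with wsrep_on=OFF, so it is not
-- attempting WSREP recovery and the statements below stay local to it.

-- Confirm that no storage engine is being enforced on this node.
SHOW VARIABLES LIKE 'enforce_storage_engine';

-- Check which storage engine mysql.servers currently uses.
SELECT TABLE_SCHEMA, TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mysql'
  AND TABLE_NAME = 'servers';

-- Convert the table back to MyISAM, as suggested earlier in the thread.
ALTER TABLE mysql.servers ENGINE = MyISAM;
```

After the conversion, the node would be restarted with wsrep_on=ON so that it can attempt to rejoin the cluster, as described above.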
| Comment by Bryan Traywick [ 2017-09-27 ] |
Thanks again Andrii. We converted the table back to MyISAM and were able to restart MySQL with only an IST sync required. As a final test I also tried restarting MySQL on one of the nodes in the older Galera Cluster running MariaDB 10.1.14 and we didn't run into this crash despite the mysql.servers table being InnoDB there as well. So it's likely a change introduced between 10.1.14 and 10.1.25. The older cluster is also running Ubuntu 14.04 and the 10.1.25 cluster is running 16.04, so it could be something to do with the systemd init scripts.
| Comment by Jan Lindström [ 2023-04-11 ] |
10.1 is EOL.