[MDEV-10420] MariaDB fails to start: WSREP Failed to recover position
Created: 2016-07-22  Updated: 2020-03-12  Resolved: 2016-07-22

Status: Closed
Project: MariaDB Server
Component/s: Galera, Platform Debian, Replication, Scripts & Clients, wsrep
Affects Version/s: 10.1.16
Fix Version/s: 10.1.17

Type: Bug
Priority: Major
Reporter: Andrew Ensley
Assignee: Nirbhay Choubey (Inactive)
Resolution: Duplicate
Votes: 0
Labels: None
Environment: Ubuntu 16.04 amd64


Issue Links:
Blocks
is blocked by MDEV-21931 WSREP: Failed to start mysqld for wsr... Open
Duplicate
is duplicated by MDEV-10396 MariaDB does not restart after upgrad... Closed

 Description   

I believe I've found a bug in the systemd/systemctl scripts for MariaDB 10.1.16 on Ubuntu 16.04. None of the service or systemctl commands picked up wsrep_cluster_address from my config files properly, nor could they bootstrap a new cluster, no matter what I tried.

I opened a question on DBA Stack Exchange about it: http://dba.stackexchange.com/q/144691/23088

I'll cut right to the conclusion: Downgrading to 10.1.14 fixed the problem for me.

Here are the details:

I'm using 10.1.16-MariaDB-1~xenial from the official MariaDB apt repository for 10.1 [stable], via the University of Texas mirror: https://downloads.mariadb.org/mariadb/repositories/#mirror=ut-austin&distro=Ubuntu&distro_release=xenial--ubuntu_xenial&version=10.1

I had a perfectly functioning MariaDB Galera cluster setup on 3 Ubuntu 16.04 servers.

Then I upgraded them with apt full-upgrade. Now I have nothing.

The upgrade to 10.1.16 failed, and quickly brought down the whole cluster. I don't have the output, but dpkg failed on setting up mariadb-server and mariadb-server-10.1.

I have backups, so I purged all traces of MariaDB/MySQL/Galera from my servers (including removing /var/lib/mysql/, /etc/mysql/, and /var/log/mysql/) and started over. However, now, with a clean install on each server, none of the standard system startup scripts work. I suspect this is why the upgrade process through apt failed, too.

I've tried each of the following on my first node:

galera_new_cluster
service mysql bootstrap
service mysql bootstrap --wsrep-new-cluster
service mysql bootstrap --wsrep-cluster-address="gcomm://"
service mysql start
service mysql start --wsrep-new-cluster
service mysql start --wsrep-cluster-address="gcomm://"
systemctl start mariadb
systemctl start mariadb --wsrep-new-cluster
systemctl start mariadb --wsrep-cluster-address="gcomm://"

Every single one gives me the same output:

Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.

systemctl status mariadb.service:

● mariadb.service - MariaDB database server
   Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: failed (Result: exit-code) since Fri 2016-07-22 13:29:45 CDT; 42s ago
  Process: 10799 ExecStartPre=/bin/sh -c VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] &&   systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=1/FAILURE)
  Process: 10794 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
 Main PID: 16865 (code=exited, status=0/SUCCESS)
    
Jul 22 13:29:41 sql2 systemd[1]: Starting MariaDB database server...
Jul 22 13:29:45 sql2 mysqld[10799]: WSREP: Failed to recover position: '2016-07-22 13:29:41 140110745778432 [Note] /usr/sbin/mysqld (mysqld 10.1.16-MariaDB-1~xenial) starting as process 11080 ...'
Jul 22 13:29:45 sql2 systemd[1]: mariadb.service: Control process exited, code=exited status=1
Jul 22 13:29:45 sql2 systemd[1]: Failed to start MariaDB database server.
Jul 22 13:29:45 sql2 systemd[1]: mariadb.service: Unit entered failed state.
Jul 22 13:29:45 sql2 systemd[1]: mariadb.service: Failed with result 'exit-code'.
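The status output above shows that the failing step is the ExecStartPre line that runs galera_recovery, not mysqld itself. A minimal sketch of that control flow follows, assuming a hypothetical fake_recovery function in place of /usr/bin/galera_recovery and an echo in place of "systemctl set-environment"; the position string is a placeholder, not real recovery output.

```shell
#!/bin/sh
# Hypothetical stand-in for /usr/bin/galera_recovery.
# $1 is the exit status to simulate; the printed value is a placeholder.
fake_recovery() {
    if [ "$1" -eq 0 ]; then
        echo "--wsrep_start_position=placeholder"
        return 0
    fi
    echo "WSREP: Failed to recover position" >&2
    return 1
}

# Mirrors the shape of the ExecStartPre command from the unit file:
#   VAR=`galera_recovery`; [ $? -eq 0 ] && set-environment ... || exit 1
# "systemctl set-environment" is replaced by echo so nothing is changed.
pre_start() {
    VAR=$(fake_recovery "$1")
    if [ $? -eq 0 ]; then
        echo "_WSREP_START_POSITION=$VAR"
    else
        # A non-zero status here is what makes systemd abort the start.
        return 1
    fi
}

pre_start 0
pre_start 1 || echo "pre-start step failed, service would not start"
```

Because the pre-start step exits with status 1 whenever recovery reports failure, the extra arguments passed to service or systemctl never reach mysqld.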

The only way I can start my servers now is by manually executing:

sudo -u mysql mysqld --wsrep-cluster-address='gcomm://'

On the first node, and then:

sudo -u mysql mysqld --wsrep-cluster-address='gcomm://ip1,ip2,ip3'

On the other two nodes. That works, and I have a working cluster again. But now, systemd/systemctl have no idea the service is running. It seems like the systemd startup scripts can't use the wsrep-cluster-address setting in my configuration files at all. Specifying it on the service or systemctl command line does not work either.
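The manual workaround differs between nodes only in the cluster address. A small dry-run sketch, assuming a hypothetical helper named manual_start_cmd that prints the command rather than running it (ip1,ip2,ip3 are the placeholders from the report):

```shell
#!/bin/sh
# Prints (does not execute) the manual start command for one node.
# $1: "first" for the bootstrapping node, anything else for a joining node.
# $2: comma-separated peer list, used only for joining nodes.
manual_start_cmd() {
    if [ "$1" = "first" ]; then
        addr="gcomm://"
    else
        addr="gcomm://$2"
    fi
    printf "sudo -u mysql mysqld --wsrep-cluster-address='%s'\n" "$addr"
}

manual_start_cmd first
manual_start_cmd joiner ip1,ip2,ip3
```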

I've been able to temporarily alleviate my problems by downgrading to 10.1.14:

wget https://downloads.mariadb.com/files/MariaDB/mariadb-10.1.14/repo/ubuntu/pool/main/m/mariadb-10.1/mariadb-server-10.1_10.1.14+maria-1~xenial_amd64.deb
wget https://downloads.mariadb.com/files/MariaDB/mariadb-10.1.14/repo/ubuntu/pool/main/m/mariadb-10.1/libmariadbclient18_10.1.14+maria-1~xenial_amd64.deb
wget https://downloads.mariadb.com/files/MariaDB/mariadb-10.1.14/repo/ubuntu/pool/main/m/mariadb-10.1/libmysqlclient18_10.1.14+maria-1~xenial_amd64.deb
wget https://downloads.mariadb.com/files/MariaDB/mariadb-10.1.14/repo/ubuntu/pool/main/m/mariadb-10.1/mariadb-client-10.1_10.1.14+maria-1~xenial_amd64.deb
wget https://downloads.mariadb.com/files/MariaDB/mariadb-10.1.14/repo/ubuntu/pool/main/m/mariadb-10.1/mariadb-client-core-10.1_10.1.14+maria-1~xenial_amd64.deb
wget https://downloads.mariadb.com/files/MariaDB/mariadb-10.1.14/repo/ubuntu/pool/main/m/mariadb-10.1/mariadb-common_10.1.14+maria-1~xenial_all.deb
wget https://downloads.mariadb.com/files/MariaDB/mariadb-10.1.14/repo/ubuntu/pool/main/m/mariadb-10.1/mariadb-server-core-10.1_10.1.14+maria-1~xenial_amd64.deb
wget https://downloads.mariadb.com/files/MariaDB/mariadb-10.1.14/repo/ubuntu/pool/main/m/mariadb-10.1/mysql-common_10.1.14+maria-1~xenial_all.deb
wget https://downloads.mariadb.com/files/MariaDB/mariadb-10.1.14/repo/ubuntu/pool/main/g/galera-3/galera-3_25.3.15-xenial_amd64.deb
apt purge mariadb-server
apt autoremove --purge
apt clean
rm -rf /etc/mysql/
rm -rf /var/lib/mysql/
rm -rf /var/log/mysql/
rm -f /var/log/mysql.*
rm -f /etc/systemd/system/mysql.service
rm -f /etc/systemd/system/mysqld.service
rm -f /etc/rc0.d/K03mysql
rm -f /etc/rc1.d/K03mysql
rm -f /etc/rc2.d/S03mysql
rm -f /etc/rc3.d/S03mysql
rm -f /etc/rc4.d/S03mysql
rm -f /etc/rc5.d/S03mysql
rm -f /etc/rc6.d/K03mysql
rm -f /var/lib/systemd/deb-systemd-helper-enabled/mysql.service
rm -f /var/lib/systemd/deb-systemd-helper-enabled/mysqld.service
rm -f /etc/apparmor.d/abstractions/mysql
rm -f /etc/apparmor.d/cache/usr.sbin.mysqld
rm -rf /etc/systemd/system/mariadb.service.d
rm -f /etc/systemd/system/multi-user.target.wants/mariadb.service
rm -f /var/lib/systemd/deb-systemd-helper-enabled/mariadb.service.dsh-also
rm -f /var/lib/systemd/deb-systemd-helper-enabled/multi-user.target.wants/mariadb.service
rm -rf /var/lib/apt/lists/*
apt update
apt install iproute libaio1 libcgi-fast-perl libcgi-pm-perl libdbd-mysql-perl libdbi-perl libencode-locale-perl libfcgi-perl libhtml-parser-perl libhtml-tagset-perl libhtml-template-perl libhttp-date-perl libhttp-message-perl libio-html-perl libjemalloc1 liblwp-mediatypes-perl libtimedate-perl liburi-perl socat
dpkg -i mysql-common_10.1.14+maria-1~xenial_all.deb 
dpkg -i mariadb-common_10.1.14+maria-1~xenial_all.deb 
dpkg -i libmariadbclient18_10.1.14+maria-1~xenial_amd64.deb libmysqlclient18_10.1.14+maria-1~xenial_amd64.deb mariadb-client-10.1_10.1.14+maria-1~xenial_amd64.deb mariadb-client-core-10.1_10.1.14+maria-1~xenial_amd64.deb mariadb-server-10.1_10.1.14+maria-1~xenial_amd64.deb mariadb-server-core-10.1_10.1.14+maria-1~xenial_amd64.deb galera-3_25.3.15-xenial_amd64.deb

Now, I can start my first node with galera_new_cluster and all other nodes with service mysql start.

There must be a bug with the Ubuntu systemd/systemctl scripts in 10.1.16.



 Comments   
Comment by Nirbhay Choubey (Inactive) [ 2016-07-22 ]

MDEV-10396

Comment by Andrew Ensley [ 2016-07-22 ]

I don't believe this is a duplicate of MDEV-10396 because of a detail I forgot to include:

I only get the error message (WSREP Failed to recover position) when wsrep_on is set to ON.

I can leave everything else the same, disable that option, and start mariadb without issue via all the methods I mentioned above. Of course, I have no cluster that way either. But I don't think this is related to the log setting, since mariadb was starting just fine with my custom log settings.

Comment by Nirbhay Choubey (Inactive) [ 2016-07-25 ]

I only get the error message (WSREP Failed to recover position) when wsrep_on is set to ON.
..cut..
since mariadb was starting just fine with my custom log settings.

MDEV-10396 is about the same thing, having wsrep_on=ON and log-error=something.
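To check whether a given configuration file has the combination described here (wsrep_on enabled together with a log-error setting), a rough sketch follows. The function name has_wsrep_and_log_error is hypothetical, the check looks at a single file only, and it does not follow !include or !includedir directives, so treat it as a first-pass check rather than a definitive one.

```shell
#!/bin/sh
# Succeeds (status 0) if the my.cnf-style file in $1 both enables wsrep_on
# and sets log-error / log_error. Checks one file only; includes are ignored.
has_wsrep_and_log_error() {
    cnf=$1
    grep -Eiq '^[[:space:]]*wsrep[-_]on[[:space:]]*=[[:space:]]*(on|1)[[:space:]]*$' "$cnf" &&
    grep -Eiq '^[[:space:]]*log[-_]error([[:space:]]*=|[[:space:]]*$)' "$cnf"
}

# Example with a throwaway file; the log path is an illustrative assumption.
tmp_cnf=$(mktemp)
printf '[mysqld]\nwsrep_on=ON\nlog-error=/var/log/mysql/error.log\n' > "$tmp_cnf"
if has_wsrep_and_log_error "$tmp_cnf"; then
    echo "wsrep_on and log-error are both set"
fi
rm -f "$tmp_cnf"
```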

Comment by Andrew Ensley [ 2016-07-25 ]

Duh. I missed that because it wasn't stated explicitly, but why else would galera_recovery be running? Sorry.

Comment by Oleksandr Diamantopulo [ 2017-02-05 ]

Andrew Ensley, were you able to start your cluster?

Nirbhay Choubey,
This issue is not related to log settings.
I'm using Distrib 10.1.21-MariaDB on Ubuntu 14.04.5 LTS
and I can't start my cluster. I get all kinds of different errors when I specify this in my config: wsrep-cluster-address="gcomm://ip1,ip2,ip3".
I can only start it without setting IP addresses: wsrep_cluster_address="gcomm://"
And it doesn't matter if I disable/enable log-error.
