[MDEV-13376] Stopping mariadb.service on joiner node during SST does not stop state transfer - Jira

XML

Word

Printable

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Minor
Resolution: Duplicate
Affects Version/s: 10.2.7
Fix Version/s: N/A
Component/s: Galera, Galera SST, Scripts & Clients
Labels:
- galera
- sst
- systemd
Environment:
Debian jessie

Description

Example cluster with only two nodes: (normally would also have a third node)
mysql1 = IP 10.0.0.2 (JOINER)
mysql2 = IP 10.0.0.3 (DONOR)

wsrep_sst_method=mysqldump
(database has to be big enough for SST to take well over three minutes)

DONOR node started with "galera_new_cluster", JOINER node started with "systemctl start mariadb.service" after deleting grastate.dat to make sure node needs SST.
After SST has started, run "systemctl stop mariadb.service" on JOINER node and wait three minutes:

root@mysql1:/# systemctl stop mariadb.service

root@mysql1:/#

No errors, thus you would assume the node has been stopped successfully, right?

In reality, what happens is SST keeps on running (confirmed by "ps aux | grep mysql") until fully finished, which could take a looong time depending on database size, after which JOINER node completes its shutdown and DONOR finally returns to SYNCED state.

Now since the shutdown was not actually completed and no error was returned by systemctl, the user could try running "systemctl start mariadb.service" and end up with the new mysqld process failing to start. (I didn't try this but I assume it will fail gracefully like any other time)
Knowing that mysqld has not actually stopped, one can of course kill it manually to interrupt the SST.

When I run "systemctl stop", I would expect the node to be stopped (and thus SST to be interrupted), so the donor could return back to SYNCED state ASAP. Since interrupting an SST, the joining node's database consistency doesn't really matter as it will get a new SST anyway when started next time. If I was stopping a node in the middle of an IST, the current "non-interrupting" way could be preferred to possibly avoid SST on next startup in case a busy cluster. (didn't test this but I assume currently also IST wouldn't be interrupted)

Attached logs from both nodes showing what happened.
Added line "-- waiting for SST to finish, 12 minutes in this case" for clarity. Now imagine finding the cluster at that point in logs without knowing what has happened. You see one node tried to make an SST and apparently shutting down, donor node having formed new quorum alone and "doing nothing" (since it's still in DONOR/DESYNCED state although not that obvious with the new quorum). Without checking what processes are running, it's not that clear that the SST is still running.

TL;DR; I would suggest the systemctl command should at least return failure notice similar to when startup fails:

root@mysql1:/# systemctl start mariadb.service

Job for mariadb.service failed. See 'systemctl status mariadb.service' and 'journalctl -xn' for details.

Thus the user would immediately know to check status, which correctly shows something has failed:

root@mysql1:/# systemctl status mariadb.service -l

● mariadb.service - MariaDB database server

   Loaded: loaded (/lib/systemd/system/mariadb.service; disabled)

  Drop-In: /etc/systemd/system/mariadb.service.d

           └─migrated-from-my.cnf-settings.conf

   Active: failed (Result: timeout) since Sun 2017-07-23 14:24:58 EEST; 8s ago

  Process: 2650 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)

  Process: 2647 ExecStartPost=/etc/mysql/debian-start (code=exited, status=0/SUCCESS)

  Process: 2250 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= ||   VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ]   && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=0/SUCCESS)

  Process: 2244 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)

  Process: 2241 ExecStartPre=/usr/bin/install -m 755 -o mysql -g root -d /var/run/mysqld (code=exited, status=0/SUCCESS)

 Main PID: 2597

   Status: "Taking your SQL requests now..."

Jul 23 14:21:10 mysql1 mysqld[13061]: 2017-07-23 14:21:10 140440152819584 [Note] /usr/sbin/mysqld (mysqld 10.2.7-MariaDB-10.2.7+maria~jessie-log) starting as process 13061 ...

Jul 23 14:21:12 mysql1 systemd[1]: Started MariaDB database server.

Jul 23 14:21:57 mysql1 systemd[1]: Stopping MariaDB database server...

Jul 23 14:23:27 mysql1 systemd[1]: mariadb.service stop-sigterm timed out. Skipping SIGKILL.

Jul 23 14:24:58 mysql1 systemd[1]: mariadb.service stop-final-sigterm timed out. Skipping SIGKILL. Entering failed mode.

Jul 23 14:24:58 mysql1 systemd[1]: Stopped MariaDB database server.

Jul 23 14:24:58 mysql1 systemd[1]: Unit mariadb.service entered failed state.

It would also be nice to get a message in logs stating what happens to the SST (or IST), e.g. "waiting until SST has finished" or "SST interrupted, node in inconsistent state, new SST required".

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

donor.log
5 kB
2017-07-23 16:31
joiner.log
18 kB
2017-07-23 16:31

Issue Links

relates to

MDEV-13580 SIGTERM must terminate sst

Open

Activity

People

Assignee:: Andrii Nikitin (Inactive)

Reporter:: Juha Pyy

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 2017-07-23 17:12

Updated:: 2017-08-22 12:52

Resolved:: 2017-08-22 12:52

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.