[MDEV-25887] "Got notification message from PID xxxx, but reception only permitted for main PID yyyy" in systemd during SST Created: 2021-06-09  Updated: 2023-11-28  Resolved: 2023-04-06

Status: Closed
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.3
Fix Version/s: 10.11.3, 10.4.29, 10.5.20, 10.6.13, 10.8.8, 10.9.6, 10.10.4

Type: Bug Priority: Critical
Reporter: Richard Stracke Assignee: Sergei Golubchik
Resolution: Fixed Votes: 7
Labels: systemd
Environment:

Centos 7.9 with systemd 219


Issue Links:
Relates
relates to MDEV-15607 mysqld crashed few after node is bein... Closed
relates to MDEV-27613 Fixing debian to only run the full my... Stalled

 Description   

During SST, sometimes MariaDB service systemd failed with

"Got notification message from PID xxxx, but reception only permitted for main PID yyyy"

Found an interesting page, which face the same issue:

https://community.centminmod.com/threads/mariadb-failed-to-restart-at-first-install.18780/

Root Cause Analysis here is:

1.The issue may happen when reaching system limits in term of available PIDs, causing PID recycling to be used.

2. systemd is not always able to find the cgroup the systemd notification was issued from, typically when services spawn many short-living processes.

3. In such case, systemd collects the PID of the process that sent the notification, browses the list of processes it monitors and delivers the notification to the service that once had a process with that same PID attached to it, which may be wrong in case PIDs were recycled.
Usually, this has no impact, because usual systemd notification is just I'm alive.
In case a service defined in systemd with the Type=Notify and NotifyAccess=All settings shuts down quickly before its STOPPING=1 notification was processed by systemd, then systemd may deliver the notification to another service and make that service shut down instead.

So maybe
NotifyAccess=all
is a solution, but I can't estimate a possible drawback.



 Comments   
Comment by George L [ 2022-06-07 ]

Seems this is sometimes happening still on CentOS 7.9 with Mariadb official YUM repo even with 10.3.35. Maybe this assessment is applicable https://serverfault.com/a/835756/34814

To the background, the SD_NOTIFY was implemented recently (to allow systemd to follow the service status), but the notification is send wrongly also from the per-user processes, which causes this message from systemd.

The fix is already prepared and testing build is available in the bug comments. I will fix that soon as mentioned in the comments. Sorry for the noise.

and journalctl output for mariadb service after an attempted restart

Logs begin at Tue 2022-06-07 18:03:06 UTC, end at Tue 2022-06-07 20:26:33 UTC. --
Jun 07 18:14:06 hostname systemd[1]: Starting MariaDB 10.3.35 database server...
Jun 07 18:14:06 hostname mysqld[14583]: 2022-06-07 18:14:06 0 [Note] /usr/sbin/mysqld (mysqld 10.3.35-MariaDB) starting as process 14583 ...
Jun 07 18:14:07 hostname systemd[1]: Started MariaDB 10.3.35 database server.
Jun 07 18:14:19 hostname systemd[1]: Stopping MariaDB 10.3.35 database server...
Jun 07 18:14:20 hostname systemd[1]: Stopped MariaDB 10.3.35 database server.
Jun 07 18:14:20 hostname systemd[1]: Starting MariaDB 10.3.35 database server...
Jun 07 18:14:21 hostname mysqld[15497]: 2022-06-07 18:14:21 0 [Note] /usr/sbin/mysqld (mysqld 10.3.35-MariaDB) starting as process 15497 ...
Jun 07 18:14:22 hostname systemd[1]: Started MariaDB 10.3.35 database server.
Jun 07 18:14:22 hostname systemd[1]: Stopping MariaDB 10.3.35 database server...
Jun 07 18:14:24 hostname systemd[1]: Stopped MariaDB 10.3.35 database server.
Jun 07 18:14:34 hostname systemd[1]: Starting MariaDB 10.3.35 database server...
Jun 07 18:14:34 hostname mysqld[15809]: 2022-06-07 18:14:34 0 [Note] /usr/sbin/mysqld (mysqld 10.3.35-MariaDB) starting as process 15809 ...
Jun 07 18:14:35 hostname systemd[1]: Started MariaDB 10.3.35 database server.
Jun 07 18:14:41 hostname systemd[1]: Stopping MariaDB 10.3.35 database server...
Jun 07 18:14:43 hostname systemd[1]: Stopped MariaDB 10.3.35 database server.
Jun 07 18:14:43 hostname systemd[1]: Starting MariaDB 10.3.35 database server...
Jun 07 18:14:44 hostname mysqld[16354]: 2022-06-07 18:14:44 0 [Note] /usr/sbin/mysqld (mysqld 10.3.35-MariaDB) starting as process 16354 ...
Jun 07 18:14:44 hostname systemd[1]: Started MariaDB 10.3.35 database server.
Jun 07 18:21:25 hostname systemd[1]: Starting MariaDB 10.3.35 database server...
Jun 07 18:21:25 hostname mysqld[56541]: 2022-06-07 18:21:25 0 [Note] /usr/sbin/mysqld (mysqld 10.3.35-MariaDB) starting as process 56541 ...
Jun 07 18:21:59 hostname systemd[1]: mariadb.service: main process exited, code=exited, status=1/FAILURE
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:21:59 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:22:00 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:22:00 hostname systemd[1]: Got notification message from PID 16354, but reception only permitted for main PID which is currently not known
Jun 07 18:22:00 hostname systemd[1]: Failed to start MariaDB 10.3.35 database server.
Jun 07 18:22:00 hostname systemd[1]: Unit mariadb.service entered failed state.
Jun 07 18:22:00 hostname systemd[1]: mariadb.service failed.
Jun 07 18:22:04 hostname systemd[1]: Starting MariaDB 10.3.35 database server...
Jun 07 18:22:04 hostname mysqld[56935]: 2022-06-07 18:22:04 0 [Note] /usr/sbin/mysqld (mysqld 10.3.35-MariaDB) starting as process 56935 ...
Jun 07 18:22:06 hostname systemd[1]: mariadb.service: main process exited, code=exited, status=1/FAILURE
Jun 07 18:22:06 hostname systemd[1]: Failed to start MariaDB 10.3.35 database server.
Jun 07 18:22:06 hostname systemd[1]: Unit mariadb.service entered failed state.
Jun 07 18:22:06 hostname systemd[1]: mariadb.service failed.
Jun 07 18:25:44 hostname systemd[1]: Starting MariaDB 10.3.35 database server...
Jun 07 18:25:44 hostname mysqld[58444]: 2022-06-07 18:25:44 0 [Note] /usr/sbin/mysqld (mysqld 10.3.35-MariaDB) starting as process 58444 ...
Jun 07 18:25:45 hostname systemd[1]: mariadb.service: main process exited, code=exited, status=1/FAILURE
Jun 07 18:25:45 hostname systemd[1]: Failed to start MariaDB 10.3.35 database server.
Jun 07 18:25:45 hostname systemd[1]: Unit mariadb.service entered failed state.
Jun 07 18:25:45 hostname systemd[1]: mariadb.service failed.
Jun 07 18:30:31 hostname systemd[1]: Starting MariaDB 10.3.35 database server...
Jun 07 18:30:32 hostname mysqld[58794]: 2022-06-07 18:30:32 0 [Note] /usr/sbin/mysqld (mysqld 10.3.35-MariaDB) starting as process 58794 ...
Jun 07 18:30:33 hostname systemd[1]: mariadb.service: main process exited, code=exited, status=1/FAILURE
Jun 07 18:30:33 hostname systemd[1]: Failed to start MariaDB 10.3.35 database server.
Jun 07 18:30:33 hostname systemd[1]: Unit mariadb.service entered failed state.
Jun 07 18:30:33 hostname systemd[1]: mariadb.service failed.

Comment by Daniel Black [ 2022-07-05 ]

Richard, you right that its a notify issue.

Its a fairly serious one as any SST longer than the default system timeout of 90 seconds is going to fail.

Comment by Xan Charbonnet [ 2023-04-07 ]

Correct me if I'm wrong, but it looks like this fix cures the symptom of the erroneous message, but does not solve the real problem.

The message is intended to tell systemd that an SST is in progress and that it shouldn't consider Maria as failing to start. The message is routed to the wrong place, causing the warning. Removing the message completely does fix the warning, but there's still the serious problem of a Maria that is receiving an SST being shut down after 90 seconds by systemd for failing to start.

I've had to override my mariadb.service and give it hours to start up before timing out. This isn't a great solution because unless there's an SST going on, it should start up very quickly.

Comment by Sergei Golubchik [ 2023-05-24 ]

the fix solves the case of giving systemd spurious PIDs that systemd should not have been getting at all, in any circumstances.

The fact that systemd thinks something is taking too long — that's a different issue, if you think it's a bug, please, create a bug report for this and we'll look into that.

Generated at Thu Feb 08 09:41:09 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.