[MDEV-15748] Unable to stop mariadb.service or mysqld run with wsrep Created: 2018-04-02  Updated: 2021-04-19  Resolved: 2021-01-25

Status: Closed
Project: MariaDB Server
Component/s: Galera, Server, wsrep
Affects Version/s: 10.2.30
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Zdravelina Sokolovska (Inactive) Assignee: Seppo Jaakola
Resolution: Incomplete Votes: 2
Labels: need_feedback

Attachments: Text File logs_15748.txt    
Issue Links:
PartOf
is part of MDEV-15749 WSREP [ERROR] gcs/src/gcs_core.cpp:... Closed

 Description   

Unable to stop mariadb.service or a mysqld process running with wsrep.

With either systemctl stop mariadb.service or service mysql stop, mysqld is not stopped, and the mysqld process has to be killed manually in order to recover the MariaDB cluster.

How to repeat:
The problem is observed in a MariaDB Galera cluster with 3 master-master nodes after recovery.

On all nodes the following error was received, even though mysqld is still found in the process list:

# mysql -u root -p1
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111)

# ps aux | grep mysql
mysql     6114  0.1  2.2 629180 45772 ?        Ssl  Mar30   5:00 /usr/sbin/mysqld --wsrep_start_position=dff6e041-1005-11e8-85c9-965f304f37bc:131618

 

# systemctl status mariadb.service
● mariadb.service - MariaDB 10.3.5 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: failed (Result: timeout) since Fri 2018-03-30 20:20:37 EEST; 2 days ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 6303 ExecStart=/usr/sbin/mysqld $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION (code=exited, status=1/FAILURE)
  Process: 6212 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= || VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=0/SUCCESS)
  Process: 6210 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
 Main PID: 6303 (code=exited, status=1/FAILURE)
   Status: "MariaDB server is down"
   CGroup: /system.slice/mariadb.service
           └─6114 /usr/sbin/mysqld --wsrep_start_position=dff6e041-1005-11e8-85c9-965f304f37bc:131618

Mar 30 20:17:33 t4w6.xentio.lan systemd[1]: Starting MariaDB 10.3.5 database server...
Mar 30 20:17:36 t4w6.xentio.lan sh[6212]: WSREP: Recovered position dff6e041-1005-11e8-85c9-965f304f37bc:131618
Mar 30 20:17:36 t4w6.xentio.lan mysqld[6303]: 2018-03-30 20:17:36 0 [Note] /usr/sbin/mysqld (mysqld 10.3.5-MariaDB) starting as process 6303 ...
Mar 30 20:17:37 t4w6.xentio.lan systemd[1]: mariadb.service: main process exited, code=exited, status=1/FAILURE
Mar 30 20:19:07 t4w6.xentio.lan systemd[1]: mariadb.service stop-sigterm timed out. Skipping SIGKILL.
Mar 30 20:20:37 t4w6.xentio.lan systemd[1]: mariadb.service stop-final-sigterm timed out. Skipping SIGKILL. Entering failed mode.
Mar 30 20:20:37 t4w6.xentio.lan systemd[1]: Failed to start MariaDB 10.3.5 database server.
Mar 30 20:20:37 t4w6.xentio.lan systemd[1]: Unit mariadb.service entered failed state.
Mar 30 20:20:37 t4w6.xentio.lan systemd[1]: mariadb.service failed.
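
The status output above shows the key symptom: the main PID (6303) has exited, yet the unit's control group still lists an older mysqld (6114). The following is a minimal Python sketch, assuming the `systemctl status` text layout shown in this report, that extracts both values and reports leftover PIDs. The function name `residual_pids` is hypothetical and not part of any MariaDB or systemd tooling.

```python
import re

def residual_pids(status_text):
    """Return PIDs listed under CGroup that are not the unit's Main PID."""
    main = re.search(r"Main PID:\s*(\d+)", status_text)
    main_pid = int(main.group(1)) if main else None
    pids = []
    in_cgroup = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("CGroup:"):
            in_cgroup = True
            continue
        if in_cgroup:
            # Process entries look like "└─6114 /usr/sbin/mysqld ..." (or "├─...").
            m = re.match(r"[└├]─(\d+)\s", stripped)
            if not m:
                break  # first non-process line ends the CGroup listing
            pids.append(int(m.group(1)))
    return [p for p in pids if p != main_pid]

# Sample taken from the status output in this report.
sample = """\
Main PID: 6303 (code=exited, status=1/FAILURE)
Status: "MariaDB server is down"
CGroup: /system.slice/mariadb.service
└─6114 /usr/sbin/mysqld --wsrep_start_position=dff6e041-1005-11e8-85c9-965f304f37bc:131618
"""
print(residual_pids(sample))
```

A non-empty result here means a process outlived the unit's main PID, which matches the "Skipping SIGKILL" lines in the journal.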

 
Then try to stop mariadb.service or mysqld: no error is returned, but mysqld remains in the process list.

# systemctl stop mariadb.service
#

# service mysql stop
Stopping mysql (via systemctl): [ OK ]
#

mysqld is not stopped:

# ps aux | grep mysql
mysql 6114 0.1 2.2 629180 45772 ? Ssl Mar30 5:00 /usr/sbin/mysqld --wsrep_start_position=dff6e041-1005-11e8-85c9-965f304f37bc:131618

 
 
 

[root@t4w3 ~]# ps aux | grep mysql
mysql 966 0.0 2.2 629072 44992 ? Ssl Mar28 3:00 /usr/sbin/mysqld --wsrep_start_position=dff6e041-1005-11e8-85c9-965f304f37bc:131588

[root@t4w5 ~]# ps aux | grep mysql
mysql 968 0.0 2.2 629072 46028 ? Ssl Mar28 2:56 /usr/sbin/mysqld --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1

[root@t4w6 ~]# ps aux | grep mysql
mysql 982 0.1 2.4 629044 48412 ? Ssl Mar28 4:38 /usr/sbin/mysqld --wsrep_start_position=dff6e041-1005-11e8-85c9-965f304f37bc:131618
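
The three command lines above carry each node's recovered position as `uuid:seqno`. Note that t4w5 reports the all-zero UUID with seqno -1, which is Galera's undefined position. As a rough sketch, assuming the command-line format shown above, the positions can be extracted and compared in Python. The helper names `parse_start_position` and `most_advanced` are hypothetical.

```python
import re

UNDEFINED_UUID = "00000000-0000-0000-0000-000000000000"

def parse_start_position(cmdline):
    """Extract (uuid, seqno) from a mysqld command line, or None if absent."""
    m = re.search(r"--wsrep_start_position=([0-9a-f-]{36}):(-?\d+)", cmdline)
    if not m:
        return None
    return m.group(1), int(m.group(2))

def most_advanced(nodes):
    """Return the node name with the highest defined seqno, or None.

    Seqnos are only comparable within one cluster UUID, so mixed UUIDs
    raise ValueError instead of being compared.
    """
    best = None
    seen_uuid = None
    for name, cmdline in nodes.items():
        pos = parse_start_position(cmdline)
        if pos is None or pos[0] == UNDEFINED_UUID or pos[1] < 0:
            continue  # no position, or undefined position
        if seen_uuid is None:
            seen_uuid = pos[0]
        elif pos[0] != seen_uuid:
            raise ValueError("nodes report different cluster UUIDs")
        if best is None or pos[1] > best[1]:
            best = (name, pos[1])
    return best[0] if best else None

# Command lines copied from the ps output in this report.
nodes = {
    "t4w3": "/usr/sbin/mysqld --wsrep_start_position=dff6e041-1005-11e8-85c9-965f304f37bc:131588",
    "t4w5": "/usr/sbin/mysqld --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1",
    "t4w6": "/usr/sbin/mysqld --wsrep_start_position=dff6e041-1005-11e8-85c9-965f304f37bc:131618",
}
print(most_advanced(nodes))
```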

 



 Comments   
Comment by Mario Karuza (Inactive) [ 2018-07-13 ]

winstone, can you attach the logs?

Comment by Zdravelina Sokolovska (Inactive) [ 2018-07-16 ]

mkaruza, logs attached.

We actually get the WSREP error "Failed to open backend connection: -98 (Address already in use)" as a consequence of mariadb.service remaining in failed mode after skipping SIGKILL.

From the joiner's error log:
 
[ERROR] WSREP: bind: Address already in use
[ERROR] WSREP: failed to open gcomm backend connection: 98: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_blocking=1', asio error 'bind: Address already in use': 98 (Address already in use)
[ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -98 (Address already in use)
[ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'cluster1' at 'gcomm://192.168.104.193,192.168.104.195,192.168.104.196': -98 (Address already in use)
[ERROR] WSREP: gcs connect failed: Address already in use
[ERROR] WSREP: wsrep::connect(gcomm://192.168.104.193,192.168.104.195,192.168.104.196) failed: 7
[ERROR] Aborting
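
Error 98 in the log above is EADDRINUSE on Linux: something is already bound to the Galera listen address `0.0.0.0:4567`. As an illustration only, a small Python sketch can test for the same condition before a node is restarted. The function name `port_is_free` is hypothetical, and the check is a plain TCP bind attempt, not how Galera itself probes the port.

```python
import errno
import socket

def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind host:port, False on EADDRINUSE."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return True
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return False
        raise  # e.g. permission errors are not hidden
    finally:
        s.close()

if __name__ == "__main__":
    # 4567 is the port from the 'tcp://0.0.0.0:4567' listen address in the log.
    print(port_is_free(4567))
```

If this reports the port as busy while the service is in failed state, a leftover mysqld from an earlier start is the likely holder, as described in this report.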

From systemctl status mariadb.service:
 
t4w6.xentio.lan systemd[1]: mariadb.service: main process exited, code=exited, status=1/FAILURE
t4w6.xentio.lan systemd[1]: mariadb.service stop-sigterm timed out. Skipping SIGKILL.
t4w6.xentio.lan systemd[1]: mariadb.service stop-final-sigterm timed out. Skipping SIGKILL. Entering failed mode.

This might be related to Daniel Black's analysis of the service start failure: services shouldn't start if there are residual processes left over (in the SendSIGKILL=no case).

Comment by Mario Karuza (Inactive) [ 2018-07-18 ]

winstone, do you have pc.wait_prim enabled? Can you paste the parameters that you pass to Galera?

Comment by Mario Karuza (Inactive) [ 2018-07-18 ]

The problem is a duplicate of MDEV-15749.

Node 1 could not start because it can't bind to the address/port. Nodes 2 and 3 seem to have started successfully.

There could be a problem if all 3 nodes die and one of them later fails to rejoin the previously saved group. This will block the signal that kills the mysqld daemon.
In this case the pc.wait_prim parameter applies: the node waits for the time defined in pc.wait_prim_timeout and then shuts itself down. But I assume this is not what is happening here.

Comment by Zdravelina Sokolovska (Inactive) [ 2018-07-18 ]

mkaruza, the wsrep provider options pc.wait_prim, pc.weight and pc.wait_prim_timeout are set to their default values, i.e. they have not been changed:
pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1;
Related to MDEV-15749.
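
For reference, the option string quoted above uses `key = value;` pairs, and the timeout is an ISO 8601 duration (PT30S is 30 seconds). A minimal Python sketch for reading such a string follows. Both helper names are hypothetical, and the duration parser only handles the simple hours/minutes/seconds form.

```python
import re

def parse_provider_options(text):
    """Parse 'key = value; key = value;' into a dict of strings."""
    options = {}
    for item in text.split(";"):
        if "=" not in item:
            continue  # skip empty trailing segments
        key, value = item.split("=", 1)
        options[key.strip()] = value.strip()
    return options

def duration_seconds(value):
    """Convert a simple ISO 8601 duration such as PT30S or PT1M30S to seconds."""
    m = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?", value)
    if not m or not any(m.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = m.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)

# Option string copied from the comment above.
opts = parse_provider_options(
    "pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1;"
)
print(opts["pc.wait_prim"], duration_seconds(opts["pc.wait_prim_timeout"]))
```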

Comment by Mario Karuza (Inactive) [ 2018-07-19 ]

As mentioned in the previous comment, the issue on Node 1 is a duplicate: the abort is due to the error 'Address already in use'.
Looking at the logs for the other nodes, there doesn't seem to be any problem.

Comment by Zdravelina Sokolovska (Inactive) [ 2018-07-19 ]

Actually, the abort due to the error 'Address already in use' occurred in this case as a consequence of the current problem, i.e. not being able to stop mysqld running with wsrep.
The two issues therefore have to be investigated and resolved separately.

Comment by Mario Karuza (Inactive) [ 2018-07-20 ]

winstone, then please provide concrete logs that lead to this problem. The attached traces don't show anything useful for analyzing it.

Comment by Zdravelina Sokolovska (Inactive) [ 2018-07-20 ]

mkaruza, those are all the logs, including the error logs issued by WSREP and InnoDB, obtained by enabling error logging in the server cnf.
Is there any other way to get more detailed logs?

Comment by Jan Lindström (Inactive) [ 2019-12-12 ]

Is this really still repeatable?

Comment by Jan Lindström (Inactive) [ 2020-12-17 ]

Firstly, this looks like a bug, not a feature request. However, to analyze it we would really need some way to reproduce the case, and that could be problematic because it is also not known how to cause those long BF waits. Here is an idea of how testing could be done:

  • Create 3-node Galera cluster
  • Start sysbench against all nodes to make it multi-master
  • Monitor : sudo systemctl status mariadb
  • Stop one node while sysbench is running: sudo systemctl stop mariadb
  • Start it again
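
The stop-and-check step of the procedure above could be automated roughly as follows. This is only a sketch under assumptions: the unit name `mariadb` comes from the commands above, the process name `mysqld` from the ps output in this report, and the function names `run` and `stop_leaves_process` are hypothetical. It does not start sysbench or build the cluster.

```python
import subprocess

def run(cmd):
    """Run a command and return (exit_code, stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout

def stop_leaves_process(runner=run, unit="mariadb", process="mysqld"):
    """Stop the unit, then report whether the process is still running.

    Returns True when the symptom from this report is present: the stop
    command has returned, yet pgrep still finds the process.
    """
    runner(["systemctl", "stop", unit])
    code, out = runner(["pgrep", "-x", process])
    # pgrep exits 0 when at least one process matched.
    return code == 0 and out.strip() != ""
```

The command runner is passed in as a parameter so the check can be exercised without root privileges or a real cluster. On a test node it would be called as `stop_leaves_process()` while sysbench is running.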