[MDEV-10442] "Address already in use" on restart Created: 2016-07-26  Updated: 2018-04-02  Resolved: 2018-01-24

Status: Closed
Project: MariaDB Server
Component/s: Scripts & Clients
Affects Version/s: 10.0.26-galera
Fix Version/s: 10.2.3

Type: Bug Priority: Trivial
Reporter: Andrzej Assignee: Unassigned
Resolution: Fixed Votes: 1
Labels: None
Environment:

Ubuntu 14.04


Issue Links:
Relates
relates to MDEV-14394 Nodes can't join with socat Address ... Closed

 Description   

When MariaDB starts or - more frequently - restarts and needs a state transfer, there is a recurring issue when port 4567 is already taken by a stray socat instance.

I've solved this by adding "killall -9 socat" to the start) section of the init script. It works for me, but it's of course not an universal solution. My init script hack is overwritten by MariaDB upgrades so I'd like to have this issue solved upstream.

What do you think would be the correct solution? I'm thinking pgrep -U mysql socat, but that may still not be robust enough.



 Comments   
Comment by Claudio Nanni [ 2017-11-14 ]

Hi,

Experiencing the same problem, nodes can't join on Ubuntu Xenial.
mysqld stops with:

socat[21170] E bind(6, {AF=2 0.0.0.0:4444}, 16): Address already in use

and socat, xbstream, wsrep_sst_xtrabackup-v2 keep running, parenthood inherited by 'init'.

As soon as you kill all the sst processes systemd restarts mysqld and the problems is back, mysqld stops, sst processes keep running.

10.1.28 seems not to contain this fix.

In sql/wsrep_utils.cc:

}
 
    err_ = posix_spawnattr_setflags (&attr, POSIX_SPAWN_SETSIGDEF  |
                                            POSIX_SPAWN_SETSIGMASK |
            /* start a new process group */ POSIX_SPAWN_SETPGROUP  |
                                            POSIX_SPAWN_USEVFORK);
    if (err_)
    {

Comment by Zdravelina Sokolovska (Inactive) [ 2018-01-22 ]

the same Error was observed on when joining new node to the cluster with srver 10.2.12
on CentOS 7.4

2018-01-22 15:25:21 140167331440384 [Note] WSREP: STATE EXCHANGE: got state msg: b1f7499a-ff77-11e7-85b9-ee24c1b0321e from 1 (t4w5)
2018-01-22 15:25:21 140167331440384 [Note] WSREP: Quorum results:
version = 4,
component = PRIMARY,
conf_id = 253,
members = 1/2 (joined/total),
act_id = 31,
last_appl. = -1,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = ca2bb6d4-ff5d-11e7-8871-7f48103f96c0
2018-01-22 15:25:21 140167331440384 [Note] WSREP: Flow-control interval: [23, 23]
2018-01-22 15:25:21 140167331440384 [Note] WSREP: Trying to continue unpaused monitor
2018-01-22 15:25:21 140167331440384 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 31)
2018-01-22 15:25:21 140167538136832 [Note] WSREP: State transfer required:
Group state: ca2bb6d4-ff5d-11e7-8871-7f48103f96c0:31
Local state: 00000000-0000-0000-0000-000000000000:-1
2018-01-22 15:25:21 140167538136832 [Note] WSREP: New cluster view: global state: ca2bb6d4-ff5d-11e7-8871-7f48103f96c0:31, view# 254: Primary, number of nodes: 2, my index: 0, protocol version 3
2018-01-22 15:25:21 140167538136832 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-01-22 15:25:21 140167323047680 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '192.168.104.195' --datadir '/var/lib/mysql/' --parent '14715' '' '
WSREP_SST: [INFO] Streaming with xbstream (20180122 15:25:21.348)
WSREP_SST: [INFO] Using socat as streamer (20180122 15:25:21.351)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20180122 15:25:21.357)
2018-01-22 15:25:21 140167538136832 [Note] WSREP: Prepared SST request: mariabackup|192.168.104.195:4444/xtrabackup_sst//1
2018-01-22 15:25:21 140167538136832 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-01-22 15:25:21 140167538136832 [Note] WSREP: REPL Protocols: 7 (3, 2)
2018-01-22 15:25:21 140167538136832 [Note] WSREP: Assign initial position for certification: 31, protocol version: 3
2018-01-22 15:25:21 140167580100352 [Note] WSREP: Service thread queue flushed.
2018-01-22 15:25:21 140167538136832 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (ca2bb6d4-ff5d-11e7-8871-7f48103f96c0): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2018-01-22 15:25:21 140167331440384 [Note] WSREP: Member 0.0 (t4w5) requested state transfer from 'any'. Selected 1.0 (t4w5)(SYNCED) as donor.
2018-01-22 15:25:21 140167331440384 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 31)
2018-01-22 15:25:21 140167538136832 [Note] WSREP: Requesting state transfer: success, donor: 1
2018-01-22 15:25:21 140167538136832 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> ca2bb6d4-ff5d-11e7-8871-7f48103f96c0:31
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | mbstream -x; RC=( ${PIPESTATUS[@]} ) (20180122 15:25:21.412)
2018/01/22 15:25:21 socat[15000] E bind(6,

{AF=2 0.0.0.0:4444}

, 16): Address already in use
WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 1 0 (20180122 15:25:21.422)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180122 15:25:21.425)
2018-01-22 15:25:21 140167323047680 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address '192.168.104.195' --datadir '/var/lib/mysql/' --parent '14715' '' : 32 (Broken pipe)
2018-01-22 15:25:21 140167323047680 [ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
2018-01-22 15:25:21 140167672174720 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2018-01-22 15:25:21 140167672174720 [ERROR] Aborting

2018-01-22 15:25:21 140167331440384 [Warning] WSREP: 1.0 (t4w5): State transfer to 0.0 (t4w5) failed: -32 (Broken pipe)
2018-01-22 15:25:21 140167331440384 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():736: Will never receive state. Need to abort.

Note2 :
add port 4444/tcp as permanent to firewall or iptables rules
avoids the Errors socat[15000] E bind(6,

{AF=2 0.0.0.0:4444}

, 16): Address already in use
WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 1 0 (20180122 15:25:21.422)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180122 15:25:21.425)
2018-01-22 15:25:21 140167323047680 [ERROR] WSREP: Process completed with error:

Comment by Daniel Black [ 2018-01-22 ]

winstone can you provide the log from where socat was first launched successfully?

If this is repeatable and you are using systemd, can you try setting 'killmode=control-group' (https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=) by following https://mariadb.com/kb/en/library/systemd/ . Please tell us if this corrects the problem.

Comment by Zdravelina Sokolovska (Inactive) [ 2018-01-23 ]

actually the issue

socat[28262] E bind(6, {AF=2 0.0.0.0:4444}, 16): Address already in use 

was observed with both service or systemctl utilities ; miss to mention that it's was used wsrep_sst_method mariabackup ; and also that mysqld is restating endless after the received failure ; --A
However as mentioned yesterday add port 4444/tcp as permanent to firewall or iptables rules resolves the problem --B
and 1 more note – if Node is joining for the first time to the cluster start mysql with option will prevent from socat[28262] E bind(6,

{AF=2 0.0.0.0:4444}

, 16): Address already in use --C

A.

service mysql start
Starting mysql (via systemctl):  Job for mariadb.service failed because the cont                                                                             rol process exited with error code. See "systemctl status mariadb.service" and "                                                                             journalctl -xe" for details.
                                                           [FAILED]

systemctl start mysql
Job for mariadb.service failed because a fatal signal was delivered to the contr                                                                             ol process. See "systemctl status mariadb.service" and "journalctl -xe" for deta                                                                             ils.

B. enable port 4444

[root@t4w1 ~]# firewall-cmd --zone=public --add-port=4444/tcp --permanent
success
[root@t4w1 ~]# firewall-cmd --reload
success

C. start mysql service with option pouting out address of cluster active Node

[root@t4w1 ~]# service mysql start --wsrep_cluster_address=gcomm://192.168.104.1                                                                          
Starting mysql (via systemctl):                            [  OK  ]

then node is joining and updating db successfully

[root@t4w1 ~]# service mysql restart
Restarting mysql (via systemctl):                          [  OK  ]
[root@t4w1 ~]# service mysql stop
Stopping mysql (via systemctl):                            [  OK  ]
[root@t4w1 ~]# service mysql start
Starting mysql (via systemctl):                            [  OK  ]
[root@t4w1 ~]# mysql -u root

Comment by Zdravelina Sokolovska (Inactive) [ 2018-04-02 ]

problem Address already in use" on restart is observed again on 10.3.5 and the problem derives from
mysqld --wsrep process was not stopped upon restart command

  1. systemctl restart mariadb.service
    or
  2. service mysql restart

2018-04-02 17:24:05 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2018-04-02 17:24:05 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2018-04-02 17:24:05 0 [Note] WSREP: wsrep_load(): Galera 25.3.23(r3789) by Codership Oy <info@codership.com> loaded successfully.
2018-04-02 17:24:05 0 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2018-04-02 17:24:05 0 [Note] WSREP: Found saved state: dff6e041-1005-11e8-85c9-965f304f37bc:-1, safe_to_bootstrap: 1
2018-04-02 17:24:05 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.104.196; base_port = 4567; cert.log_con           flicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.ina           ctive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.susp           ect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.me           m_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gc           s.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q           _hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout            = PT3S; pc.bootstrap = YES; pc
2018-04-02 17:24:05 0 [Note] WSREP: GCache history reset: dff6e041-1005-11e8-85c9-965f304f37bc:0 -> dff6e041-1005-11e8-85c9-965f304f37bc:131618
2018-04-02 17:24:05 0 [Note] WSREP: Assign initial position for certification: 131618, protocol version: -1
2018-04-02 17:24:05 0 [Note] WSREP: wsrep_sst_grab()
2018-04-02 17:24:05 0 [Note] WSREP: Start replication
2018-04-02 17:24:05 0 [Note] WSREP: Setting initial position to dff6e041-1005-11e8-85c9-965f304f37bc:131618
2018-04-02 17:24:05 0 [Note] WSREP: protonet asio version 0
2018-04-02 17:24:05 0 [Note] WSREP: Using CRC-32C for message checksums.
2018-04-02 17:24:05 0 [Note] WSREP: backend: asio
2018-04-02 17:24:05 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2018-04-02 17:24:05 0 [Note] WSREP: restore pc from disk successfully
2018-04-02 17:24:05 0 [Note] WSREP: GMCast version 0
2018-04-02 17:24:05 0 [Note] WSREP: (de19f79a, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2018-04-02 17:24:05 0 [Note] WSREP: (de19f79a, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2018-04-02 17:24:05 0 [Note] WSREP: EVS version 0
2018-04-02 17:24:05 0 [Note] WSREP: gcomm: connecting to group 'cluster1', peer '192.168.104.193:,192.168.104.195:,192.168.104.196:'
2018-04-02 17:24:05 0 [ERROR] WSREP: bind: Address already in use
2018-04-02 17:24:05 0 [ERROR] WSREP: failed to open gcomm backend connection: 98: error while trying to listen 'tcp://0.0.0.0:4567?socket.non_bloc           king=1', asio error 'bind: Address already in use': 98 (Address already in use)
         at gcomm/src/asio_tcp.cpp:listen():830
2018-04-02 17:24:05 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -98 (Address already in use)
2018-04-02 17:24:05 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'cluster1' at 'gcomm://192.168.104.193,192.168.104.19           5,192.168.104.196': -98 (Address already in use)
2018-04-02 17:24:05 0 [ERROR] WSREP: gcs connect failed: Address already in use
2018-04-02 17:24:05 0 [ERROR] WSREP: wsrep::connect(gcomm://192.168.104.193,192.168.104.195,192.168.104.196) failed: 7
2018-04-02 17:24:05 0 [ERROR] Aborting
 
 

Comment by Daniel Black [ 2018-04-02 ]

A good view of the current and previous mysqld processes will be required.

Switching between init systems is bound to cause troubles as they both (should) prevent a duplicate instance in different ways.

Related is a systemd bug I filed that can cause it on that side of things: https://github.com/systemd/systemd/issues/8630. SendSIGKILL=yes would be a workaround (https://mariadb.com/kb/en/mariadb/systemd/).

Maybe for sysv init scripts - pkill --parent

{mysqldpid}

(before kiling mysqld otherwise the parent pid will go to 1)

Generated at Thu Feb 08 07:42:15 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.