[MDEV-5479] mysqld daemon should check if any process uses the socket file before removing
Created: 2013-12-20  Updated: 2023-04-27

Status: Open
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.34
Fix Version/s: 10.4

Type: Bug
Priority: Minor
Reporter: Honza Horak
Assignee: Daniel Black
Resolution: Unresolved
Votes: 0
Labels: contribution, foundation, patch, upstream
Environment: Linux

 Description   

Already reported upstream at http://bugs.mysql.com/71194

When some MySQL daemon A is running and we try to start another instance B, while these conditions are met:

  • network ports are different for A and B
  • unix socket location is the same for A and B

Then the new daemon B removes the Unix socket file that daemon A still needs.

How to repeat:
Steps to reproduce:
$ /usr/libexec/mysqld --port 13306 --datadir /var/lib/mysql/
$ fuser /var/lib/mysql/mysql.sock
$ /usr/libexec/mysqld --port 13307 --datadir /var/lib/mysql2/
$ fuser /var/lib/mysql/mysql.sock

Actual results:
/var/lib/mysql/mysql.sock: 5683
/var/lib/mysql/mysql.sock: 5717
which means the second daemon (PID 5717) now holds the socket path, so the first daemon (PID 5683) is no longer able to accept connections on the Unix socket

Expected results:
/var/lib/mysql/mysql.sock: 5683
/var/lib/mysql/mysql.sock: 5683
the second daemon shouldn't start at all

Suggested fix:
Either check whether some process is attached to the socket or (a portable solution) keep a lock file for the socket file that contains the PID of the process using the socket.
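To make the lock-file variant concrete, here is a minimal sketch, assuming a sidecar file named `<socket path>.lock` and `flock()` advisory locking. The helper name `acquire_socket_lock` and the `.lock` suffix are assumptions made for this illustration; nothing like this exists in the server as far as this report shows.

```cpp
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <string>

// Hypothetical helper, not part of MariaDB: take an exclusive advisory lock
// on "<socket_path>.lock" and record our PID in it. Returns the open lock fd
// (to be kept open for the daemon's lifetime), or -1 if the lock is already
// held or an error occurred.
int acquire_socket_lock(const std::string &socket_path)
{
  const std::string lock_path = socket_path + ".lock";
  int fd = open(lock_path.c_str(), O_RDWR | O_CREAT | O_CLOEXEC, 0600);
  if (fd < 0)
    return -1;

  // Non-blocking attempt: fails with EWOULDBLOCK while another open file
  // description holds the lock. The lock goes away when the holder's
  // descriptor is closed, which includes process exit.
  if (flock(fd, LOCK_EX | LOCK_NB) != 0)
  {
    close(fd);
    return -1;
  }

  // We own the lock: replace any stale PID with ours (informational only,
  // the lock itself is what protects the socket).
  char buf[32];
  int len = std::snprintf(buf, sizeof(buf), "%ld\n", (long) getpid());
  if (ftruncate(fd, 0) != 0 || write(fd, buf, len) != len)
  {
    close(fd);
    return -1;
  }
  return fd;
}
```

A daemon would call this before touching the socket file and refuse to start on -1. The PID in the file is only a hint for the administrator. Note that this only protects against other processes that follow the same convention.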



 Comments   
Comment by Daniel Black [ 2017-12-30 ]

I tested by removing the unlink of the Unix socket before the bind, and indeed the second instance fails to start, reporting the error:

$  strace -s 99 -e trace=network  sql/mysqld  --skip-networking  --datadir=/tmp/datadir2 --socket /tmp/s.sock  --lc-messages-dir=`pwd`/sql/share --verbose
2017-12-30 15:44:20 139830008162496 [Note] sql/mysqld (mysqld 10.2.12-MariaDB) starting as process 22939 ...
2017-12-30 15:44:20 139830008162496 [Warning] Changed limits: max_open_files: 1024  max_connections: 151  table_cache: 431
socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Uses event mutexes
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Compressed tables use zlib 1.2.8
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Using Linux native AIO
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Number of pools: 1
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Using SSE2 crc32 instructions
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Completed initialization of buffer pool
2017-12-30 15:44:20 139829451699968 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Highest supported file format is Barracuda.
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: 128 out of 128 rollback segments are active.
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Creating shared tablespace for temporary tables
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2017-12-30 15:44:20 139830008162496 [Note] InnoDB: 5.7.20 started; log sequence number 1619282
2017-12-30 15:44:20 139829077591808 [Note] InnoDB: Loading buffer pool(s) from /tmp/datadir2/ib_buffer_pool
2017-12-30 15:44:20 139829077591808 [Note] InnoDB: Buffer pool(s) load completed at 171230 15:44:20
2017-12-30 15:44:20 139830008162496 [Note] Plugin 'FEEDBACK' is disabled.
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 20
setsockopt(20, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(20, {sa_family=AF_UNIX, sun_path="/tmp/s.sock"}, 110) = -1 EADDRINUSE (Address already in use)
2017-12-30 15:44:20 139830008162496 [ERROR] Can't start server : Bind on unix socket: Address already in use
2017-12-30 15:44:20 139830008162496 [ERROR] Do you already have another mysqld server running on socket: /tmp/s.sock ?
2017-12-30 15:44:20 139830008162496 [ERROR] Aborting

So removing the unlink is the minimal fix.

A more comprehensive check could test whether the existing socket is responsive and remove it only in the non-responsive case, like https://github.com/grooverdan/mariadb-server/commit/f4191b0628531b3e0ebe1d2ce53eb8312433fde6 (this breaks the way some of mtr works).

This deliberately ignores your suggested fix, mainly because I see it as too susceptible to race conditions and too dependent on the behaviour of the other instance (which might be a different server version). The existing implementation and my variant are still susceptible to race conditions (in which the second server will abort); however, as bind has no truncate option, there isn't a race-free implementation.

Do you think this is on the right track?

Comment by Sergey Vojtovich [ 2020-02-20 ]

hhorak, we have supported bright and shiny abstract sockets since 10.4. Is that a viable alternative?

Comment by Honza Horak [ 2022-03-08 ]

Sorry for the long delay. I think abstract sockets help with proper removal when the daemon ends. It's not clear to me whether they help with not removing another daemon's socket; at least based on the source, the second daemon can still remove (unlink) a socket that belongs to a different daemon. I think abstract sockets would work well together with Daniel's commit above, which does not seem to have been applied.

So, while this issue does not seem to be a very high priority for us, it doesn't seem to be solved by abstract sockets.

Comment by Honza Horak [ 2022-03-08 ]

To share more about priority: we used to use Software Collections (a packaging concept that allowed installing several versions of the mariadb/mysql database servers in a single OS instance). With a different concept (modules) that only allows one version to be installed at a time (picked from the available variants), it is no longer that common for a user to run two daemons with a single Unix socket by accident. Thus, not having this issue fixed does not seem to cause much trouble in practice.

Comment by Daniel Black [ 2022-03-09 ]

Abstract sockets (10.4+) aren't unlinked. But even after all these years they are not in common use, so the same problem remains.
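For reference, a minimal Linux-only sketch of why there is nothing to unlink with an abstract socket: the address lives in the kernel's abstract namespace (leading NUL byte in `sun_path`), a second bind on the same name fails with EADDRINUSE, and the name is released when the socket is closed. `bind_abstract` is a hypothetical helper for this illustration, not the server's code.

```cpp
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#include <cerrno>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: bind a stream socket in the Linux abstract namespace.
// `name` is the part after the leading NUL byte. Returns the bound fd, or
// -1 with errno set (EADDRINUSE if another socket holds the name).
int bind_abstract(const std::string &name)
{
  sockaddr_un addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sun_family = AF_UNIX;
  if (name.size() + 1 > sizeof(addr.sun_path))
  {
    errno = ENAMETOOLONG;
    return -1;
  }
  // sun_path[0] stays '\0': that marks the address as abstract, so no
  // filesystem entry is created.
  std::memcpy(addr.sun_path + 1, name.data(), name.size());
  socklen_t len = static_cast<socklen_t>(
      offsetof(sockaddr_un, sun_path) + 1 + name.size());

  int fd = socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0)
    return -1;
  if (bind(fd, reinterpret_cast<sockaddr *>(&addr), len) != 0)
  {
    int err = errno;   // preserve the bind error across close()
    close(fd);
    errno = err;
    return -1;
  }
  return fd;
}
```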

I tried, and it seems that you can take an advisory lock on a Unix socket. I think those locks disappear when the process goes away, so something like the following, with the lock acquired before an unlink.

diff --git a/sql/mysqld.cc b/sql/mysqld.cc
index b6748170942..3893d973043 100644
--- a/sql/mysqld.cc
+++ b/sql/mysqld.cc
@@ -2582,21 +2582,40 @@ static void network_init(void)
     else
 #endif
     {
-      (void) unlink(mysqld_unix_port);
       port_len= sizeof(UNIXaddr);
     }
     arg= 1;
     (void) mysql_socket_setsockopt(unix_sock,SOL_SOCKET,SO_REUSEADDR,
                                    (char*)&arg, sizeof(arg));
     umask(0);
+
+    int tries= 0;  /* declared before the label so it is not reset by goto */
+rebind:
+
     if (mysql_socket_bind(unix_sock,
                           reinterpret_cast<struct sockaddr *>(&UNIXaddr),
                           port_len) < 0)
     {
-      sql_perror("Can't start server : Bind on unix socket"); /* purecov: tested */
-      sql_print_error("Do you already have another server running on socket: %s ?",mysqld_unix_port);
-      unireg_abort(1);                                 /* purecov: tested */
+      switch (errno)
+      {
+      case 0:
+       break;
+      case EADDRINUSE:
+        if (tries == 0)
+       {
+         // TODO open, acquire lock, then unbind?
+          (void) unlink(mysqld_unix_port);
+         tries++;
+         goto rebind;
+       }
+       /* fall through */
+      default:
+        sql_perror("Can't start server : Bind on unix socket"); /* purecov: tested */
+        sql_print_error("Do you already have another server running on socket: %s ?",mysqld_unix_port);
+        unireg_abort(1);                                       /* purecov: tested */
+      }
     }
+    my_lock(unix_sock.fd, F_WRLCK, 0, 1, MYF(0));
     umask(((~my_umask) & 0666));
 #if defined(S_IFSOCK) && defined(SECURE_SOCKETS)
     (void) chmod(mysqld_unix_port,S_IFSOCK);   /* Fix solaris 2.6 bug */
