[CONC-322] mysql_real_connect - EAGAIN treated like EINPROGRESS Created: 2018-04-10  Updated: 2018-05-12  Resolved: 2018-04-23

Status: Closed
Project: MariaDB Connector/C
Component/s: None
Affects Version/s: 3.0.4
Fix Version/s: 3.0.4

Type: Bug Priority: Major
Reporter: Daniel Black Assignee: Georg Richter
Resolution: Fixed Votes: 0
Labels: None
Environment:

Ubuntu artful, ppc64le, p9


Attachments: File p.p    

 Description   

Version from mariadb-server-10.3 branch built last week

dpkg -l | grep mariadb-dev
ii  libmariadb-dev                                 10.3.6+maria~artful                          ppc64el      MariaDB database development files

Stracing sysbench liked against libmariadb-dev:

$  egrep -v '(send|recv)'  sysbench-oltp_read_only-64-mysqld-n8-1_split12.strace  | grep -C 10 -i failed      
56416 fcntl(11, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
56416 connect(11, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n8-1.sock"}, 110) = -1 EAGAIN (Resource temporarily unavailable)
56416 poll([{fd=11, events=POLLOUT}], 1, -1) = 1 ([{fd=11, revents=POLLOUT|POLLHUP}])
56416 getsockopt(11, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
56416 fcntl(11, F_SETFL, O_RDONLY)      = 0
56416 setsockopt(11, SOL_IP, IP_TOS, [8], 4) = -1 EOPNOTSUPP (Operation not supported)
56416 setsockopt(11, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
56416 close(11)                         = 0
56416 write(1, "FATAL: unable to connect to MySQL server on socket '//var/run/mysqld//mysqld-n8-1.sock', aborting...\n", 101) = 101
56416 write(1, "FATAL: error 2013: Lost connection to MySQL server at 'handshake: reading inital communication packet', system error: 22\n", 121) = 121
56416 write(1, "FATAL: `thread_init' function failed: /usr/local/share/sysbench/oltp_common.lua:352: connection creation failed\n", 112) = 112

The EAGAIN was returned by the connect syscall.

In https://github.com/MariaDB/mariadb-connector-c/blob/master/plugins/pvio/pvio_socket.c#L614 its handled like EINPROGRESS where eventually via poll (pvio_socket_wait_io_or_timeout) a file the connection will be ready.

Looking at linux source (net/unix/af_unix.c) this happens when the recv queue is full. When this state the socket isn't in a connected state and polling it won't make it connect.

So options are:

  • connect code should EAGAIN be handled like EINTR and another attempt is made and the server clears its backlog
  • attempt again while within timeout period
  • an immediate error return


 Comments   
Comment by Daniel Black [ 2018-04-11 ]

EAGAIN, which is the same as EWOULDBLOCK in linux, and as you see the connect(2) description isn't helpful/

Tried patch as attached. For the case where a connect returns EAGAIN:

176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
176706 openat(AT_FDCWD, "./oltp_common/init.lua", O_RDONLY <unfinished ...>
176750 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
176732 connect(83, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...>
176732 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
176750 <... poll resumed> )             = 1 ([{fd=82, revents=POLLOUT|POLLHUP}])
176732 poll([{fd=83, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...>
176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
176732 <... poll resumed> )             = 1 ([{fd=83, revents=POLLOUT|POLLHUP}])
176750 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...>
176732 connect(83, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
176750 <... poll resumed> )             = 1 ([{fd=82, revents=POLLOUT|POLLHUP}])
176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
176732 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
176750 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
176732 poll([{fd=83, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...>
176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0) = 1 ([{fd=82, revents=POLLOUT|POLLHUP}])
176732 <... poll resumed> )             = 1 ([{fd=83, revents=POLLOUT|POLLHUP}])
176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
176732 connect(83, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
176750 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
176732 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0) = 1 ([{fd=82, revents=POLLOUT|POLLHUP}])
176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
176732 poll([{fd=83, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...>
176750 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
176732 <... poll resumed> )             = 1 ([{fd=83, revents=POLLOUT|POLLHUP}])

So it seems either the epoll is uncorrelated to the success of a connect or another process is grabbing the resources (i did have multiple sysbench processes attempting that socket). It took ~7000 attempts at connection before successful so something more intelligent is needed.

Unfortunately I didn't record timings in this strace.

Looking similar: https://lists.mysql.com/internals/35077

If you can think of something that might be useful/test let me know.

Comment by Daniel Black [ 2018-04-17 ]

Thanks for the commit.

I can now see why the poll didn't have an effect, the destination socket is the cause of the EAGAIN, polling on an unconnected socket is unaware of this backlog.

Could also defer the setting of the socket to non-blocking after the connect which would allow the kernel to wait (https://github.com/torvalds/linux/blob/master/net/unix/af_unix.c#L1266..L1268) and possibly not error saving time in this loop.

Comment by Georg Richter [ 2018-04-23 ]

Fixed in 3.0.4

Generated at Thu Feb 08 03:04:29 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.