Uploaded image for project: 'MariaDB Connector/C'
  1. MariaDB Connector/C
  2. CONC-322

mysql_real_connect - EAGAIN treated like EINPROGRESS

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 3.0.4
    • 3.0.4
    • None
    • None
    • Ubuntu artful, ppc64le, p9

    Description

      Version from mariadb-server-10.3 branch built last week

      dpkg -l | grep mariadb-dev
      ii  libmariadb-dev                                 10.3.6+maria~artful                          ppc64el      MariaDB database development files
      

      Stracing sysbench liked against libmariadb-dev:

      $  egrep -v '(send|recv)'  sysbench-oltp_read_only-64-mysqld-n8-1_split12.strace  | grep -C 10 -i failed      
      56416 fcntl(11, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
      56416 connect(11, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n8-1.sock"}, 110) = -1 EAGAIN (Resource temporarily unavailable)
      56416 poll([{fd=11, events=POLLOUT}], 1, -1) = 1 ([{fd=11, revents=POLLOUT|POLLHUP}])
      56416 getsockopt(11, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
      56416 fcntl(11, F_SETFL, O_RDONLY)      = 0
      56416 setsockopt(11, SOL_IP, IP_TOS, [8], 4) = -1 EOPNOTSUPP (Operation not supported)
      56416 setsockopt(11, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
      56416 close(11)                         = 0
      56416 write(1, "FATAL: unable to connect to MySQL server on socket '//var/run/mysqld//mysqld-n8-1.sock', aborting...\n", 101) = 101
      56416 write(1, "FATAL: error 2013: Lost connection to MySQL server at 'handshake: reading inital communication packet', system error: 22\n", 121) = 121
      56416 write(1, "FATAL: `thread_init' function failed: /usr/local/share/sysbench/oltp_common.lua:352: connection creation failed\n", 112) = 112
      

      The EAGAIN was returned by the connect syscall.

      In https://github.com/MariaDB/mariadb-connector-c/blob/master/plugins/pvio/pvio_socket.c#L614 its handled like EINPROGRESS where eventually via poll (pvio_socket_wait_io_or_timeout) a file the connection will be ready.

      Looking at linux source (net/unix/af_unix.c) this happens when the recv queue is full. When this state the socket isn't in a connected state and polling it won't make it connect.

      So options are:

      • connect code should EAGAIN be handled like EINTR and another attempt is made and the server clears its backlog
      • attempt again while within timeout period
      • an immediate error return

      Attachments

        1. p.p
          1 kB
          Daniel Black

        Activity

          danblack Daniel Black added a comment -

          EAGAIN, which is the same as EWOULDBLOCK in linux, and as you see the connect(2) description isn't helpful/

          Tried patch as attached. For the case where a connect returns EAGAIN:

          176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
          176706 openat(AT_FDCWD, "./oltp_common/init.lua", O_RDONLY <unfinished ...>
          176750 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
          176732 connect(83, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
          176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...>
          176732 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
          176750 <... poll resumed> )             = 1 ([{fd=82, revents=POLLOUT|POLLHUP}])
          176732 poll([{fd=83, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...>
          176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
          176732 <... poll resumed> )             = 1 ([{fd=83, revents=POLLOUT|POLLHUP}])
          176750 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
          176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...>
          176732 connect(83, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
          176750 <... poll resumed> )             = 1 ([{fd=82, revents=POLLOUT|POLLHUP}])
          176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
          176732 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
          176750 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
          176732 poll([{fd=83, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...>
          176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0) = 1 ([{fd=82, revents=POLLOUT|POLLHUP}])
          176732 <... poll resumed> )             = 1 ([{fd=83, revents=POLLOUT|POLLHUP}])
          176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
          176732 connect(83, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
          176750 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
          176732 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
          176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0) = 1 ([{fd=82, revents=POLLOUT|POLLHUP}])
          176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...>
          176732 poll([{fd=83, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...>
          176750 <... connect resumed> )          = -1 EAGAIN (Resource temporarily unavailable)
          176732 <... poll resumed> )             = 1 ([{fd=83, revents=POLLOUT|POLLHUP}])
          

          So it seems either the epoll is uncorrelated to the success of a connect or another process is grabbing the resources (i did have multiple sysbench processes attempting that socket). It took ~7000 attempts at connection before successful so something more intelligent is needed.

          Unfortunately I didn't record timings in this strace.

          Looking similar: https://lists.mysql.com/internals/35077

          If you can think of something that might be useful/test let me know.

          danblack Daniel Black added a comment - EAGAIN, which is the same as EWOULDBLOCK in linux, and as you see the connect(2) description isn't helpful/ Tried patch as attached. For the case where a connect returns EAGAIN: 176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...> 176706 openat(AT_FDCWD, "./oltp_common/init.lua", O_RDONLY <unfinished ...> 176750 <... connect resumed> ) = -1 EAGAIN (Resource temporarily unavailable) 176732 connect(83, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...> 176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...> 176732 <... connect resumed> ) = -1 EAGAIN (Resource temporarily unavailable) 176750 <... poll resumed> ) = 1 ([{fd=82, revents=POLLOUT|POLLHUP}]) 176732 poll([{fd=83, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...> 176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...> 176732 <... poll resumed> ) = 1 ([{fd=83, revents=POLLOUT|POLLHUP}]) 176750 <... connect resumed> ) = -1 EAGAIN (Resource temporarily unavailable) 176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...> 176732 connect(83, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...> 176750 <... poll resumed> ) = 1 ([{fd=82, revents=POLLOUT|POLLHUP}]) 176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...> 176732 <... connect resumed> ) = -1 EAGAIN (Resource temporarily unavailable) 176750 <... connect resumed> ) = -1 EAGAIN (Resource temporarily unavailable) 176732 poll([{fd=83, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...> 176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0) = 1 ([{fd=82, revents=POLLOUT|POLLHUP}]) 176732 <... poll resumed> ) = 1 ([{fd=83, revents=POLLOUT|POLLHUP}]) 176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...> 176732 connect(83, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...> 176750 <... connect resumed> ) = -1 EAGAIN (Resource temporarily unavailable) 176732 <... connect resumed> ) = -1 EAGAIN (Resource temporarily unavailable) 176750 poll([{fd=82, events=POLLIN|POLLPRI|POLLOUT}], 1, 0) = 1 ([{fd=82, revents=POLLOUT|POLLHUP}]) 176750 connect(82, {sa_family=AF_UNIX, sun_path="//var/run/mysqld//mysqld-n0-0.sock"}, 110 <unfinished ...> 176732 poll([{fd=83, events=POLLIN|POLLPRI|POLLOUT}], 1, 0 <unfinished ...> 176750 <... connect resumed> ) = -1 EAGAIN (Resource temporarily unavailable) 176732 <... poll resumed> ) = 1 ([{fd=83, revents=POLLOUT|POLLHUP}]) So it seems either the epoll is uncorrelated to the success of a connect or another process is grabbing the resources (i did have multiple sysbench processes attempting that socket). It took ~7000 attempts at connection before successful so something more intelligent is needed. Unfortunately I didn't record timings in this strace. Looking similar: https://lists.mysql.com/internals/35077 If you can think of something that might be useful/test let me know.
          danblack Daniel Black added a comment -

          Thanks for the commit.

          I can now see why the poll didn't have an effect, the destination socket is the cause of the EAGAIN, polling on an unconnected socket is unaware of this backlog.

          Could also defer the setting of the socket to non-blocking after the connect which would allow the kernel to wait (https://github.com/torvalds/linux/blob/master/net/unix/af_unix.c#L1266..L1268) and possibly not error saving time in this loop.

          danblack Daniel Black added a comment - Thanks for the commit. I can now see why the poll didn't have an effect, the destination socket is the cause of the EAGAIN, polling on an unconnected socket is unaware of this backlog. Could also defer the setting of the socket to non-blocking after the connect which would allow the kernel to wait ( https://github.com/torvalds/linux/blob/master/net/unix/af_unix.c#L1266..L1268 ) and possibly not error saving time in this loop.
          georg Georg Richter added a comment -

          Fixed in 3.0.4

          georg Georg Richter added a comment - Fixed in 3.0.4

          People

            georg Georg Richter
            danblack Daniel Black
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.