[MDEV-32515] The test spider/bugfix.mdev_30370 fails with "98: Address already in use" Created: 2023-10-19  Updated: 2023-12-07  Resolved: 2023-10-20

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - Spider
Affects Version/s: 10.10
Fix Version/s: 10.10.7, 10.11.6, 11.0.4, 11.1.3

Type: Bug Priority: Critical
Reporter: Yuchen Pei Assignee: Yuchen Pei
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-22979 "mysqld --bootstrap" / mysql_install_... Closed

 Description   

The failure only occurs on builders at buildbot.mariadb.net; the test passes on builders at buildbot.mariadb.org (see e.g. https://buildbot.mariadb.org/#grid?branch=bb-10.10-mdev-32507).

For example:

https://buildbot.mariadb.net/buildbot/builders/kvm-deb-bullseye-amd64/builds/3143
https://buildbot.mariadb.net/buildbot/builders/kvm-deb-bullseye-amd64/builds/3143/steps/mtr/logs/stdio

  spider/bugfix.mdev_30370                 w4 [ fail ]
          Test ended at 2023-10-18 03:56:23
 
  CURRENT_TEST: spider/bugfix.mdev_30370
  2023-10-18  3:56:09 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32186)
  2023-10-18  3:56:09 0 [Warning] Changed limits: max_open_files: 1024  max_connections: 151 (was 151)  table_cache: 421 (was 2000)
  2023-10-18  3:56:09 0 [Note] Starting MariaDB 10.10.7-MariaDB-1:10.10.7+maria~deb11 source revision f63845524aacdd06431399c857c93aa52559b76c as process 5437
  2023-10-18  3:56:09 0 [Note] mariadbd: Aria engine: starting recovery
  recovered pages: 0% 11% 22% 36% 50% 60% 70% 80% 90% 100% (0.0 seconds); tables to flush: 3 2 1 0
   (0.0 seconds);
  #  [... 39 lines elided]
  2023-10-18  3:56:17 0 [Note] Retrying bind on TCP/IP port 3306
  2023-10-18  3:56:23 0 [ERROR] Can't start server: Bind on TCP/IP port. Got error: 98: Address already in use
  2023-10-18  3:56:23 0 [ERROR] Do you already have another server running on port: 3306 ?
  2023-10-18  3:56:23 0 [ERROR] Aborting
  mysqltest: At line 9: exec of '/usr/sbin/mariadbd --defaults-group-suffix=.1 --defaults-file=/dev/shm/var/4/my.cnf  --datadir=/dev/shm/var/4/mysqld.1.1/data/ --wsrep-recover --plugin-dir=/usr/lib/mysql/plugin/ --plugin-load-add=ha_spider' failed, error: 256, status: 1, errno: 32
  Output from before failure:
  # Kill the server
 
 
 
  The result from queries just before the failure was:
  #
  # MDEV-30370 mariadbd hangs when running with --wsrep-recover and --plugin-load-add=ha_spider.so
  #
  # Kill the server
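
The "98: Address already in use" in the log above is the Linux errno EADDRINUSE, which bind() returns when another socket already holds the requested address and port. As a minimal sketch (plain Python, not part of mtr or the server; the function name try_double_bind is made up for illustration), the same error can be provoked by binding two sockets to one port:

```python
import errno
import socket


def try_double_bind(host="127.0.0.1"):
    """Bind two TCP sockets to the same port; return the second bind's errno, or None."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same address and port as `first`
        except OSError as exc:
            return exc.errno
        return None
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    err = try_double_bind()
    # Print the numeric errno and its symbolic name, if the bind failed.
    print(err, errno.errorcode.get(err))
```

In the failing runs the role of the first socket is presumably played by another server already listening on port 3306 on the builder.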



 Comments   
Comment by Yuchen Pei [ 2023-10-19 ]

As background: this is a peculiar test, because it starts the server
with a flag (--wsrep-recover) that "aborts" the server start, so the
server exits instead of staying up.

mtr treats the death of a server it manages as a failure, so we cannot
put the flag in an .opt file or pass it as a restart_parameter to
--source include/restart_mysqld.inc.

Prior to MDEV-22979, the test used $MYSQLD_BOOTSTRAP_CMD. For some
reason that stopped working after the fix for the spider init bugs;
in any case, we should use $MYSQLD_CMD instead.

Comment by Yuchen Pei [ 2023-10-19 ]

Here's an initial attempt to fix this issue:

upstream/bb-10.10-mdev-32515 upstream/bb-10.10-all-builders 5bd85cb229f187c7c24c69659ff2caedb99f6366
MDEV-32515 Use the mtr cnf in the spider/bugfix.mdev_30370 mysqld invocation
 
This makes sure the $MYSQLD_CMD invocation uses the same port, and
does not use the default port which may already be in use.

Let's wait and see how it works in the CI.

Update: that did not work, as we get the same failures:
https://buildbot.mariadb.net/buildbot/grid?category=main&branch=bb-10.10-all-builders
https://buildbot.mariadb.org/#grid?branch=bb-10.10-mdev-32515

Comment by Yuchen Pei [ 2023-10-19 ]

It is strange that this test fails at --exec $MYSQLD_CMD with "Do you
already have another server running on socket" [1][2] or "Do you
already have another server running on port: 3306 ?" [3][4]:

let $MYSQLD_DATADIR= `select @@datadir`;
let $PLUGIN_DIR=`select @@plugin_dir`;
--source include/kill_mysqld.inc
--exec $MYSQLD_CMD --datadir=$MYSQLD_DATADIR --wsrep-recover --plugin-dir=$PLUGIN_DIR --plugin-load-add=ha_spider
--source include/start_mysqld.inc
--disable_query_log
--source ../../include/clean_up_spider.inc

but this test (spider/bugfix.mdev_22979) passes:

let $MYSQLD_DATADIR= `select @@datadir`;
let $PLUGIN_DIR=`select @@plugin_dir`;
--source include/kill_mysqld.inc
--write_file $MYSQLTEST_VARDIR/tmp/mdev_22979.sql
drop table if exists foo.bar;
EOF
--exec $MYSQLD_CMD --datadir=$MYSQLD_DATADIR --bootstrap --plugin-dir=$PLUGIN_DIR --plugin-load-add=ha_spider < $MYSQLTEST_VARDIR/tmp/mdev_22979.sql
--source include/start_mysqld.inc
--disable_query_log
--source ../../include/clean_up_spider.inc

The only difference I can see is the --bootstrap flag. The failure
only happens on certain old buildbot CI builders [5].

[1] https://buildbot.mariadb.net/buildbot/builders/kvm-zyp-opensuse150-amd64/builds/10857/steps/mtr/logs/stdio
[2] https://buildbot.mariadb.net/buildbot/builders/kvm-zyp-opensuse150-amd64/builds/10857
[3] https://buildbot.mariadb.net/buildbot/builders/kvm-deb-bullseye-aarch64/builds/1585
[4] https://buildbot.mariadb.net/buildbot/builders/kvm-deb-bullseye-aarch64/builds/1585/steps/mtr/logs/stdio
[5] https://buildbot.mariadb.net/buildbot/grid?category=main&branch=bb-10.10-all-builders

In any case, I created a commit to add --bootstrap. Let's see whether
that helps.

upstream/bb-10.10-all-builders c93d8c32c97170d63e968da047927ecf0a3b2001
MDEV-32515 [experiment] Add --bootstrap to the $MYSQLD_CMD invocation in mdev_30370
 
After all, spider/bugfix.mdev_22979 passes, and the only difference
that may matter is the --bootstrap flag.

Comment by Yuchen Pei [ 2023-10-19 ]

This simple fix seems to work; see [1] for commit c9e5d725bb8c,
which is identical except for the commit message.

bb-10.10-mdev-32515 85262c138dbdd1e39046571cb87645621fa7baf2
MDEV-32515 Use $MYSQLD_LAST_CMD in spider/bugfix.mdev_30370
 
$MYSQLD_CMD uses .1 as the defaults-group-suffix, which could cause
the use of the default port (3306) or socket, which will fail in
environments where these defaults are already in use by another server.
 
Adding an extra --defaults-group-suffix=.1.1 does not help, because
the first flag wins.
 
So we use $MYSQLD_LAST_CMD instead, which uses the correct suffix.
 
The extra innodb buffer pool warning is irrelevant to the goal of the
test (running --wsrep-recover with --plugin-load-add=ha_spider should
not cause a hang).

[1] https://buildbot.mariadb.net/buildbot/grid?category=main&branch=bb-10.10-all-builders
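
To make the reasoning in the commit message concrete, here is a toy model in Python of how the group suffix decides which port the server ends up with. It is only a sketch under stated assumptions: the config text and the port 16000 are hypothetical, only the [mysqld] group and the suffixed group are consulted (the real server reads more groups and has its own parser), and the "first flag wins" rule is taken from the commit message rather than from the server source.

```python
import configparser

DEFAULT_PORT = 3306  # the server's built-in default port, as seen in the log

# Hypothetical mtr-style config; the port value is made up for illustration.
EXAMPLE_CNF = """
[mysqld]

[mysqld.1.1]
port = 16000
"""


def effective_suffix(argv):
    """Return the value of the first --defaults-group-suffix flag (first flag wins)."""
    prefix = "--defaults-group-suffix="
    for arg in argv:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return ""


def resolve_port(cnf_text, argv):
    """Toy lookup: read [mysqld], then [mysqld<suffix>]; later groups override."""
    parser = configparser.ConfigParser()
    parser.read_string(cnf_text)
    port = DEFAULT_PORT
    for group in ("mysqld", "mysqld" + effective_suffix(argv)):
        if parser.has_section(group) and parser.has_option(group, "port"):
            port = parser.getint(group, "port")
    return port


if __name__ == "__main__":
    # Suffix .1 matches no group that sets a port, so the default is used.
    print(resolve_port(EXAMPLE_CNF, ["--defaults-group-suffix=.1"]))
    # Suffix .1.1 picks up the port from [mysqld.1.1].
    print(resolve_port(EXAMPLE_CNF, ["--defaults-group-suffix=.1.1"]))
    # Appending a second suffix does not help: the first one is used.
    print(resolve_port(EXAMPLE_CNF, ["--defaults-group-suffix=.1",
                                     "--defaults-group-suffix=.1.1"]))
```

Under these assumptions, only a command line whose first suffix is .1.1 avoids port 3306, which matches the choice of $MYSQLD_LAST_CMD over $MYSQLD_CMD.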

Comment by Yuchen Pei [ 2023-10-20 ]

danblack said "c9e5d725bb8c0d8eb28caf6bc766e946fc0cf8d7 is fine. Reviewed by me complete." Thanks for the review.

So I am going to push 85262c138dbdd1e39046571cb87645621fa7baf2, which is the same as c9e5d725bb8c0d8eb28caf6bc766e946fc0cf8d7 except for a more elaborate commit message.

Comment by Daniel Black [ 2023-10-20 ]

Yep. It's good.

Comment by Daniel Black [ 2023-10-20 ]

yep. good fix.

Comment by Yuchen Pei [ 2023-10-20 ]

Pushed 057fd528766eba150b9d7a0de8f95a4094f0e460 to 10.10
