2022-12-24 13:19:02 0 [Note] WSREP: before Wsrep_server_state::instance().initialized()
2022-12-24 13:19:02 0 [Note] WSREP: Server initialized
2022-12-24 13:19:02 0 [Note] WSREP: Server status change initializing -> initialized
2022-12-24 13:19:02 0 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status initialized
2022-12-24 13:19:02 2 [Note] WSREP: Bootstrapping a new cluster, setting initial position to 00000000-0000-0000-0000-000000000000:-1
2022-12-24 13:19:02 5 [Note] WSREP: Cluster table is empty, not recovering transactions
2022-12-24 13:19:02 2 [Note] WSREP: Server status change initialized -> joined
2022-12-24 13:19:02 2 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status joined
As you see, server status has changed to "joined" before Wsrep_server_state::instance().initialized() returned, so the script is started with a new status, on "joined" it tries to connect to the server to update some table, but the server isn't initialized yet, it's not listening for connections. So the server waits for the script to finish, script waits for mariadb client to connect, and the client cannot connect, because the server isn't listening.
Sergei Golubchik
added a comment - I find it the easiest to reproduce locally with rr :
. /mtr galera.galera_var_notify_cmd --rr
this hits it every time for me, and without --rr the test mostly passes.
With the following change:
diff --git a/sql/wsrep_mysqld.cc b/sql/wsrep_mysqld.cc
--- a/sql/wsrep_mysqld.cc
+++ b/sql/wsrep_mysqld.cc
@@ -847,7 +847,9 @@ void wsrep_init_globals()
wsrep_init_schema();
if (WSREP_ON)
{
+ WSREP_INFO("before Wsrep_server_state::instance().initialized()");
Wsrep_server_state::instance().initialized();
+ WSREP_INFO("after Wsrep_server_state::instance().initialized()");
}
}
diff --git a/sql/wsrep_utils.cc b/sql/wsrep_utils.cc
index 854c11c1eea..56477d662ed 100644
--- a/sql/wsrep_utils.cc
+++ b/sql/wsrep_utils.cc
@@ -377,6 +377,7 @@ process::wait ()
if (pid_)
{
int status;
+ WSREP_INFO("Waiting for process %s", str_);
if (-1 == waitpid(pid_, &status, 0))
{
err_= errno; assert (err_);
one can easily see why it hangs.
This is when the test passed:
2022-12-24 13:32:49 0 [Note] WSREP: before Wsrep_server_state::instance().initialized()
2022-12-24 13:32:49 0 [Note] WSREP: Server initialized
2022-12-24 13:32:49 0 [Note] WSREP: Server status change initializing -> initialized
2022-12-24 13:32:49 0 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status initialized
2022-12-24 13:32:49 2 [Note] WSREP: Bootstrapping a new cluster, setting initial position to 00000000-0000-0000-0000-000000000000:-1
2022-12-24 13:32:49 0 [Note] WSREP: after Wsrep_server_state::instance().initialized()
Here:
the server status is changed to "initialized"
notification script runs (does nothing for --status initialized )
Wsrep_server_state::instance().initialized() returns
server initialization continues
And this is from the timed out test:
2022-12-24 13:19:02 0 [Note] WSREP: before Wsrep_server_state::instance().initialized()
2022-12-24 13:19:02 0 [Note] WSREP: Server initialized
2022-12-24 13:19:02 0 [Note] WSREP: Server status change initializing -> initialized
2022-12-24 13:19:02 0 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status initialized
2022-12-24 13:19:02 2 [Note] WSREP: Bootstrapping a new cluster, setting initial position to 00000000-0000-0000-0000-000000000000:-1
2022-12-24 13:19:02 5 [Note] WSREP: Cluster table is empty, not recovering transactions
2022-12-24 13:19:02 2 [Note] WSREP: Server status change initialized -> joined
2022-12-24 13:19:02 2 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status joined
As you see, server status has changed to "joined" before Wsrep_server_state::instance().initialized() returned, so the script is started with a new status, on "joined" it tries to connect to the server to update some table, but the server isn't initialized yet, it's not listening for connections. So the server waits for the script to finish, script waits for mariadb client to connect, and the client cannot connect, because the server isn't listening.
julien.fritsch I do not think this is critical bug as it effects at the moment only tests, and requires wsrep_notify_cmd configuration variable to be used, by default it is not defined. Regardless, I will work on this issue and propose a fix.
Jan Lindström
added a comment - julien.fritsch I do not think this is critical bug as it effects at the moment only tests, and requires wsrep_notify_cmd configuration variable to be used, by default it is not defined. Regardless, I will work on this issue and propose a fix.
I find it the easiest to reproduce locally with rr:
this hits it every time for me, and without --rr the test mostly passes.
With the following change:
diff --git a/sql/wsrep_mysqld.cc b/sql/wsrep_mysqld.cc
--- a/sql/wsrep_mysqld.cc
+++ b/sql/wsrep_mysqld.cc
@@ -847,7 +847,9 @@ void wsrep_init_globals()
wsrep_init_schema();
if (WSREP_ON)
{
+ WSREP_INFO("before Wsrep_server_state::instance().initialized()");
Wsrep_server_state::instance().initialized();
+ WSREP_INFO("after Wsrep_server_state::instance().initialized()");
}
}
diff --git a/sql/wsrep_utils.cc b/sql/wsrep_utils.cc
index 854c11c1eea..56477d662ed 100644
--- a/sql/wsrep_utils.cc
+++ b/sql/wsrep_utils.cc
@@ -377,6 +377,7 @@ process::wait ()
if (pid_)
{
int status;
+ WSREP_INFO("Waiting for process %s", str_);
if (-1 == waitpid(pid_, &status, 0))
{
one can easily see why it hangs.
This is when the test passed:
2022-12-24 13:32:49 0 [Note] WSREP: before Wsrep_server_state::instance().initialized()
2022-12-24 13:32:49 0 [Note] WSREP: Server initialized
2022-12-24 13:32:49 0 [Note] WSREP: Server status change initializing -> initialized
2022-12-24 13:32:49 0 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status initialized
2022-12-24 13:32:49 2 [Note] WSREP: Bootstrapping a new cluster, setting initial position to 00000000-0000-0000-0000-000000000000:-1
2022-12-24 13:32:49 0 [Note] WSREP: after Wsrep_server_state::instance().initialized()
Here:
And this is from the timed out test:
2022-12-24 13:19:02 0 [Note] WSREP: before Wsrep_server_state::instance().initialized()
2022-12-24 13:19:02 0 [Note] WSREP: Server initialized
2022-12-24 13:19:02 0 [Note] WSREP: Server status change initializing -> initialized
2022-12-24 13:19:02 0 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status initialized
2022-12-24 13:19:02 2 [Note] WSREP: Bootstrapping a new cluster, setting initial position to 00000000-0000-0000-0000-000000000000:-1
2022-12-24 13:19:02 5 [Note] WSREP: Cluster table is empty, not recovering transactions
2022-12-24 13:19:02 2 [Note] WSREP: Server status change initialized -> joined
2022-12-24 13:19:02 2 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status joined
As you see, server status has changed to "joined" before Wsrep_server_state::instance().initialized() returned, so the script is started with a new status, on "joined" it tries to connect to the server to update some table, but the server isn't initialized yet, it's not listening for connections. So the server waits for the script to finish, script waits for mariadb client to connect, and the client cannot connect, because the server isn't listening.