I find it the easiest to reproduce locally with rr:
./mtr galera.galera_var_notify_cmd --rr
|
this hits it every time for me, and without --rr the test mostly passes.
With the following change:
diff --git a/sql/wsrep_mysqld.cc b/sql/wsrep_mysqld.cc
|
--- a/sql/wsrep_mysqld.cc
|
+++ b/sql/wsrep_mysqld.cc
|
@@ -847,7 +847,9 @@ void wsrep_init_globals()
|
wsrep_init_schema();
|
if (WSREP_ON)
|
{
|
+ WSREP_INFO("before Wsrep_server_state::instance().initialized()");
|
Wsrep_server_state::instance().initialized();
|
+ WSREP_INFO("after Wsrep_server_state::instance().initialized()");
|
}
|
}
|
|
diff --git a/sql/wsrep_utils.cc b/sql/wsrep_utils.cc
|
index 854c11c1eea..56477d662ed 100644
|
--- a/sql/wsrep_utils.cc
|
+++ b/sql/wsrep_utils.cc
|
@@ -377,6 +377,7 @@ process::wait ()
|
if (pid_)
|
{
|
int status;
|
+ WSREP_INFO("Waiting for process %s", str_);
|
if (-1 == waitpid(pid_, &status, 0))
|
{
|
err_= errno; assert (err_);
|
one can easily see why it hangs.
This is when the test passed:
2022-12-24 13:32:49 0 [Note] WSREP: before Wsrep_server_state::instance().initialized()
|
2022-12-24 13:32:49 0 [Note] WSREP: Server initialized
|
2022-12-24 13:32:49 0 [Note] WSREP: Server status change initializing -> initialized
|
2022-12-24 13:32:49 0 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status initialized
|
2022-12-24 13:32:49 2 [Note] WSREP: Bootstrapping a new cluster, setting initial position to 00000000-0000-0000-0000-000000000000:-1
|
2022-12-24 13:32:49 0 [Note] WSREP: after Wsrep_server_state::instance().initialized()
|
Here:
- the server status is changed to "initialized"
- notification script runs (does nothing for --status initialized)
- Wsrep_server_state::instance().initialized() returns
- server initialization continues
And this is from the timed out test:
2022-12-24 13:19:02 0 [Note] WSREP: before Wsrep_server_state::instance().initialized()
|
2022-12-24 13:19:02 0 [Note] WSREP: Server initialized
|
2022-12-24 13:19:02 0 [Note] WSREP: Server status change initializing -> initialized
|
2022-12-24 13:19:02 0 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status initialized
|
2022-12-24 13:19:02 2 [Note] WSREP: Bootstrapping a new cluster, setting initial position to 00000000-0000-0000-0000-000000000000:-1
|
2022-12-24 13:19:02 5 [Note] WSREP: Cluster table is empty, not recovering transactions
|
2022-12-24 13:19:02 2 [Note] WSREP: Server status change initialized -> joined
|
2022-12-24 13:19:02 2 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status joined
|
As you see, server status has changed to "joined" before Wsrep_server_state::instance().initialized() returned, so the script is started with a new status, on "joined" it tries to connect to the server to update some table, but the server isn't initialized yet, it's not listening for connections. So the server waits for the script to finish, script waits for mariadb client to connect, and the client cannot connect, because the server isn't listening.
I find it the easiest to reproduce locally with rr:
this hits it every time for me, and without --rr the test mostly passes.
With the following change:
diff --git a/sql/wsrep_mysqld.cc b/sql/wsrep_mysqld.cc
--- a/sql/wsrep_mysqld.cc
+++ b/sql/wsrep_mysqld.cc
@@ -847,7 +847,9 @@ void wsrep_init_globals()
wsrep_init_schema();
if (WSREP_ON)
{
+ WSREP_INFO("before Wsrep_server_state::instance().initialized()");
Wsrep_server_state::instance().initialized();
+ WSREP_INFO("after Wsrep_server_state::instance().initialized()");
}
}
diff --git a/sql/wsrep_utils.cc b/sql/wsrep_utils.cc
index 854c11c1eea..56477d662ed 100644
--- a/sql/wsrep_utils.cc
+++ b/sql/wsrep_utils.cc
@@ -377,6 +377,7 @@ process::wait ()
if (pid_)
{
int status;
+ WSREP_INFO("Waiting for process %s", str_);
if (-1 == waitpid(pid_, &status, 0))
{
one can easily see why it hangs.
This is when the test passed:
2022-12-24 13:32:49 0 [Note] WSREP: before Wsrep_server_state::instance().initialized()
2022-12-24 13:32:49 0 [Note] WSREP: Server initialized
2022-12-24 13:32:49 0 [Note] WSREP: Server status change initializing -> initialized
2022-12-24 13:32:49 0 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status initialized
2022-12-24 13:32:49 2 [Note] WSREP: Bootstrapping a new cluster, setting initial position to 00000000-0000-0000-0000-000000000000:-1
2022-12-24 13:32:49 0 [Note] WSREP: after Wsrep_server_state::instance().initialized()
Here:
And this is from the timed out test:
2022-12-24 13:19:02 0 [Note] WSREP: before Wsrep_server_state::instance().initialized()
2022-12-24 13:19:02 0 [Note] WSREP: Server initialized
2022-12-24 13:19:02 0 [Note] WSREP: Server status change initializing -> initialized
2022-12-24 13:19:02 0 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status initialized
2022-12-24 13:19:02 2 [Note] WSREP: Bootstrapping a new cluster, setting initial position to 00000000-0000-0000-0000-000000000000:-1
2022-12-24 13:19:02 5 [Note] WSREP: Cluster table is empty, not recovering transactions
2022-12-24 13:19:02 2 [Note] WSREP: Server status change initialized -> joined
2022-12-24 13:19:02 2 [Note] WSREP: Waiting for process /home/serg/Abk/maria/mysql-test/std_data/wsrep_notify.sh --status joined
As you see, server status has changed to "joined" before Wsrep_server_state::instance().initialized() returned, so the script is started with a new status, on "joined" it tries to connect to the server to update some table, but the server isn't initialized yet, it's not listening for connections. So the server waits for the script to finish, script waits for mariadb client to connect, and the client cannot connect, because the server isn't listening.