[MDEV-18665] Server hangs on shutdown after setting wsrep_cluster_address at runtime Created: 2019-02-20  Updated: 2022-09-19  Resolved: 2021-12-23

Status: Closed
Project: MariaDB Server
Component/s: wsrep
Affects Version/s: 10.1, 10.2, 10.3
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Jan Lindström (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-22342 Galera nodes hang during shutdown, un... Closed

 Description   
  • start the server normally, with all defaults, without wsrep
  • run

    set global wsrep_provider='/usr/lib/libgalera_smm.so'; set global wsrep_cluster_address='gcomm://'
    

    (adjust path to the library if necessary).

  • shut down the server

= The server hangs, seemingly forever. The last lines in the log are

10.1 431da59f

2019-02-20 17:19:42 139659940031232 [Note] /data/bld/10.1/bin/mysqld: Normal shutdown
2019-02-20 17:19:42 139659940031232 [Note] WSREP: Stop replication
2019-02-20 17:19:42 139659940031232 [Note] WSREP: Closing send monitor...
2019-02-20 17:19:42 139659940031232 [Note] WSREP: Closed send monitor.
2019-02-20 17:19:42 139659940031232 [Note] WSREP: gcomm: terminating thread
2019-02-20 17:19:42 139659940031232 [Note] WSREP: gcomm: joining thread
2019-02-20 17:19:42 139659940031232 [Note] WSREP: gcomm: closing backend
2019-02-20 17:19:42 139659940031232 [Note] WSREP: view((empty))
2019-02-20 17:19:42 139659940031232 [Note] WSREP: gcomm: closed
2019-02-20 17:19:42 139658977601280 [Note] WSREP: Received self-leave message.
2019-02-20 17:19:42 139658977601280 [Note] WSREP: Flow-control interval: [0, 0]
2019-02-20 17:19:42 139658977601280 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2019-02-20 17:19:42 139658977601280 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 0)
2019-02-20 17:19:42 139658977601280 [Note] WSREP: RECV thread exiting 0: Success
2019-02-20 17:19:42 139659940031232 [Note] WSREP: recv_thread() joined.
2019-02-20 17:19:42 139659940031232 [Note] WSREP: Closing replication queue.
2019-02-20 17:19:42 139659940031232 [Note] WSREP: Closing slave action queue.

Stack trace from threads of the hanging process which seem to be up to something:

Thread 6 (Thread 0x7f051f2adb00 (LWP 26855)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055d937962ec7 in safe_cond_wait (cond=0x55d93839c9a0 <COND_slave_background>, mp=0x55d93839fb00 <LOCK_slave_background>, file=0x55d9379f4500 "/data/src/10.1/include/mysql/psi/mysql_thread.h", line=1165) at /data/src/10.1/mysys/thr_mutex.c:493
#2  0x000055d93700da02 in inline_mysql_cond_wait (that=0x55d93839c9a0 <COND_slave_background>, mutex=0x55d93839fb00 <LOCK_slave_background>, src_file=0x55d9379f4c5d "/data/src/10.1/sql/slave.cc", src_line=336) at /data/src/10.1/include/mysql/psi/mysql_thread.h:1165
#3  0x000055d93700e6ea in handle_slave_background (arg=0x0) at /data/src/10.1/sql/slave.cc:336
#4  0x00007f051ef07494 in start_thread (arg=0x7f051f2adb00) at pthread_create.c:333
#5  0x00007f051d2c093f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 5 (Thread 0x7f04e7357700 (LWP 26861)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f04ef5485ee in wait (this=0x7f04e7356d10, cond=...) at galerautils/src/gu_lock.hpp:56
#2  galera::ServiceThd::thd_func (arg=0x7f04f051a350) at galera/src/galera_service_thd.cpp:30
#3  0x00007f051ef07494 in start_thread (arg=0x7f04e7357700) at pthread_create.c:333
#4  0x00007f051d2c093f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 4 (Thread 0x7f051f219b00 (LWP 26864)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055d937962ec7 in safe_cond_wait (cond=0x55d9383c7580 <COND_wsrep_rollback>, mp=0x55d9383c74c0 <LOCK_wsrep_rollback>, file=0x55d937a61800 "/data/src/10.1/include/mysql/psi/mysql_thread.h", line=1165) at /data/src/10.1/mysys/thr_mutex.c:493
#2  0x000055d93727d6ac in inline_mysql_cond_wait (that=0x55d9383c7580 <COND_wsrep_rollback>, mutex=0x55d9383c74c0 <LOCK_wsrep_rollback>, src_file=0x55d937a61900 "/data/src/10.1/sql/wsrep_thd.cc", src_line=471) at /data/src/10.1/include/mysql/psi/mysql_thread.h:1165
#3  0x000055d93727f2f2 in wsrep_rollback_process (thd=0x7f04e2816070) at /data/src/10.1/sql/wsrep_thd.cc:471
#4  0x000055d93726cfa2 in start_wsrep_THD (arg=0x55d93727f1fa <wsrep_rollback_process(THD*)>) at /data/src/10.1/sql/wsrep_mysqld.cc:2064
#5  0x00007f051ef07494 in start_thread (arg=0x7f051f219b00) at pthread_create.c:333
#6  0x00007f051d2c093f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 3 (Thread 0x7f051f1cfb00 (LWP 26865)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x000055d9379631d1 in safe_cond_timedwait (cond=0x55d9383c7480 <COND_wsrep_sst_init>, mp=0x55d9383c73c0 <LOCK_wsrep_sst_init>, abstime=0x7f051f1cd850, file=0x55d937a5ea80 "/data/src/10.1/include/mysql/psi/mysql_thread.h", line=1202) at /data/src/10.1/mysys/thr_mutex.c:547
#2  0x000055d937270a1b in inline_mysql_cond_timedwait (that=0x55d9383c7480 <COND_wsrep_sst_init>, mutex=0x55d9383c73c0 <LOCK_wsrep_sst_init>, abstime=0x7f051f1cd850, src_file=0x55d937a5ebd0 "/data/src/10.1/sql/wsrep_sst.cc", src_line=1411) at /data/src/10.1/include/mysql/psi/mysql_thread.h:1202
#3  0x000055d937275912 in wsrep_SE_init_wait () at /data/src/10.1/sql/wsrep_sst.cc:1411
#4  0x000055d9372668b2 in wsrep_view_handler_cb (app_ctx=0x6e6f00, recv_ctx=0x7f04e2416070, view=0x7f04e249f300, state=0x0, state_len=0, sst_req=0x7f051f1ce170, sst_req_len=0x7f051f1ce178) at /data/src/10.1/sql/wsrep_mysqld.cc:404
#5  0x00007f04ef57d580 in galera::ReplicatorSMM::process_conf_change (this=0x7f04f0519e00, recv_ctx=0x7f04e2416070, view_info=..., repl_proto=7, next_state=galera::Replicator::S_JOINED, seqno_l=<optimized out>) at galera/src/replicator_smm.cpp:1414
#6  0x00007f04ef554aab in galera::GcsActionSource::dispatch (this=this@entry=0x7f04f051a488, recv_ctx=recv_ctx@entry=0x7f04e2416070, act=..., exit_loop=@0x7f051f1ce65c: false) at galera/src/gcs_action_source.cpp:139
#7  0x00007f04ef556586 in galera::GcsActionSource::process (this=0x7f04f051a488, recv_ctx=0x7f04e2416070, exit_loop=@0x7f051f1ce65c: false) at galera/src/gcs_action_source.cpp:181
#8  0x00007f04ef57cbc3 in galera::ReplicatorSMM::async_recv (this=0x7f04f0519e00, recv_ctx=0x7f04e2416070) at galera/src/replicator_smm.cpp:371
#9  0x00007f04ef59121b in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at galera/src/wsrep_provider.cpp:244
#10 0x000055d93727ec56 in wsrep_replication_process (thd=0x7f04e2416070) at /data/src/10.1/sql/wsrep_thd.cc:360
#11 0x000055d93726cfa2 in start_wsrep_THD (arg=0x55d93727eb8f <wsrep_replication_process(THD*)>) at /data/src/10.1/sql/wsrep_mysqld.cc:2064
#12 0x00007f051ef07494 in start_thread (arg=0x7f051f1cfb00) at pthread_create.c:333
#13 0x00007f051d2c093f in clone () from /lib/x86_64-linux-gnu/libc.so.6

Thread 2 (Thread 0x7f051d1d6b00 (LWP 26871)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000055d937962ec7 in safe_cond_wait (cond=0x55d9383a0200 <COND_thread_count>, mp=0x55d93839efc0 <LOCK_thread_count>, file=0x55d937a5ce00 "/data/src/10.1/include/mysql/psi/mysql_thread.h", line=1165) at /data/src/10.1/mysys/thr_mutex.c:493
#2  0x000055d937266024 in inline_mysql_cond_wait (that=0x55d9383a0200 <COND_thread_count>, mutex=0x55d93839efc0 <LOCK_thread_count>, src_file=0x55d937a5d210 "/data/src/10.1/sql/wsrep_mysqld.cc", src_line=2369) at /data/src/10.1/include/mysql/psi/mysql_thread.h:1165
#3  0x000055d93726dfc6 in wsrep_wait_appliers_close (thd=0x0) at /data/src/10.1/sql/wsrep_mysqld.cc:2369
#4  0x000055d9372686dd in wsrep_stop_replication (thd=0x0) at /data/src/10.1/sql/wsrep_mysqld.cc:901
#5  0x000055d936fe809d in kill_server (sig_ptr=0x0) at /data/src/10.1/sql/mysqld.cc:1924
#6  0x000055d936fe8108 in kill_server_thread (arg=0x7f051f2f7220) at /data/src/10.1/sql/mysqld.cc:1961
#7  0x00007f051ef07494 in start_thread (arg=0x7f051d1d6b00) at pthread_create.c:333
#8  0x00007f051d2c093f in clone () from /lib/x86_64-linux-gnu/libc.so.6

10.1-10.3 are affected. I didn't check 10.4.



 Comments   
Comment by Jan Lindström (Inactive) [ 2021-12-23 ]

Cluster address can't anymore set dynamically because security.

Comment by Mehdin Masinovic [ 2022-09-19 ]

I have experienced a similar problem in our three-node Galera Cluster (MariaDB 10.6.8, Galera 4-26.4.11, Centos 7.8), with a simple node restart (unrelated to the wsrep_cluster_address):

```
2022-09-14 12:38:28 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
2022-09-14 12:38:28 0 [Note] WSREP: Shutdown replication
2022-09-14 12:38:28 0 [Note] WSREP: Server status change synced -> disconnecting
2022-09-14 12:38:28 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-09-14 12:38:28 0 [Note] WSREP: Closing send monitor...
2022-09-14 12:38:28 0 [Note] WSREP: Closed send monitor.
2022-09-14 12:38:28 0 [Note] WSREP: gcomm: terminating thread
2022-09-14 12:38:28 0 [Note] WSREP: gcomm: joining thread
2022-09-14 12:38:28 0 [Note] WSREP: gcomm: closing backend
2022-09-14 12:38:28 0 [Note] WSREP: view(view_id(NON_PRIM,dc8d7a84-a820,10) memb

{ df21fb70-b102,0 }

joined {
} left {
} partitioned

{ dc8d7a84-a820,0 e6965b60-830b,0 }

)
2022-09-14 12:38:28 0 [Note] WSREP: PC protocol downgrade 1 -> 0
2022-09-14 12:38:28 0 [Note] WSREP: view((empty))
2022-09-14 12:38:28 0 [Note] WSREP: gcomm: closed
2022-09-14 12:38:28 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2022-09-14 12:38:28 0 [Note] WSREP: Flow-control interval: [16, 16]
2022-09-14 12:38:28 0 [Note] WSREP: Received NON-PRIMARY.
2022-09-14 12:38:28 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 6125)
2022-09-14 12:38:28 0 [Note] WSREP: New SELF-LEAVE.
2022-09-14 12:38:28 0 [Note] WSREP: Flow-control interval: [0, 0]
2022-09-14 12:38:28 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2022-09-14 12:38:28 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 6125)
2022-09-14 12:38:28 0 [Note] WSREP: RECV thread exiting 0: Success
2022-09-14 12:38:28 0 [Note] WSREP: recv_thread() joined.
2022-09-14 12:38:28 0 [Note] WSREP: Closing replication queue.
2022-09-14 12:38:28 0 [Note] WSREP: Closing slave action queue.
2022-09-14 13:04:45 0 [Note] WSREP: Loading provider /usr/lib64/galera-4/libgalera_smm.so initial position: 6957351b-1fa8-11ed-97ea-4369705cab87:6115
2022-09-14 13:04:45 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera-4/libgalera_smm.so'
```

The journald log of the timestamp:

```
Sep 14 12:35:04 hadb1fra2 postfix/cleanup[28402]: B388E39D4: message-id=<20220914103504.B388E39D4@hadb1fra2.localdomain>
Sep 14 12:35:04 hadb1fra2 postfix/qmgr[1648]: B388E39D4: from=<root@hadb1fra2.localdomain>, size=896, nrcpt=1 (queue active)
Sep 14 12:35:04 hadb1fra2 postfix/local[28406]: B388E39D4: to=<root@hadb1fra2.localdomain>, orig_to=<root>, relay=local, delay=3.3, delays=3.2/0.02/0/0.15, dsn=2.0.0, status=sent (delivere
Sep 14 12:35:04 hadb1fra2 postfix/qmgr[1648]: B388E39D4: removed
Sep 14 12:38:28 hadb1fra2 polkitd[912]: Registered Authentication Agent for unix-process:28597:224782834 (system bus name :1.26426 [/usr/bin/pkttyagent --notify-fd 5 --fallback], object pa
Sep 14 12:38:28 hadb1fra2 systemd[1]: Stopping MariaDB 10.6.8 database server...
Sep 14 12:40:01 hadb1fra2 systemd[1]: Started Session 13154 of user root.
Sep 14 12:40:01 hadb1fra2 CROND[28698]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Sep 14 12:40:01 hadb1fra2 systemd[1]: Started Session 13155 of user root.
```

We had to pkill the PID, after which a restart would work again.

Does anyone have a suggestion on what could be causing this issue?

Generated at Thu Feb 08 08:45:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.