[MDEV-10762] Joiner and donor hung after failed SST Created: 2016-09-07  Updated: 2019-05-21  Resolved: 2019-05-21

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: 10.1.16
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Maciej Radzikowski Assignee: Jan Lindström (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Debian 8



 Description   

We have a 3-node Galera cluster. After one of the nodes went down with a segfault, it was automatically restarted by systemd. Let's say it was NODE1.

After the restart, NODE1 requested an SST from NODE2. However, in the error log from NODE2 we can see that rsync tried to connect to NODE3 and got "Connection refused". After that, NODE2 repeatedly logged "long semaphore wait" warnings.

Result: NODE1 hung in the JOINER state and NODE2 hung in the DONOR state. NODE3 kept working and reported cluster size = 3.

We had to manually kill the mariadb and galera processes on NODE1 and NODE2 and then start the databases again. They started without trouble.

I don't know why rsync on NODE2 tried to connect to NODE3, or why it got "Connection refused"; NODE3 was in the cluster the whole time. But the bigger problem is that NODE1 and NODE2 hung until they were manually killed.

Log from NODE1 (JOINER):

2016-09-07 12:18:54 140564997596928 [Note] WSREP: Member 0.0 (NODE1) requested state transfer from '*any*'. Selected 1.0 (NODE2)(SYNCED) as donor.
2016-09-07 12:18:54 140564997596928 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 16699814)
2016-09-07 12:18:54 140565373377280 [Note] WSREP: Requesting state transfer: success, donor: 1

Nothing appears after that until the next database start.

Log from NODE2 (DONOR):

2016-09-07 12:18:54 139680871864064 [Note] WSREP: Flushing tables for SST...
2016-09-07 12:18:54 139680871864064 [Note] WSREP: Provider paused at 97f799e6-b828-11e4-8c4b-5a16ba3e9c9d:16699814 (1556340)
2016-09-07 12:18:54 139680871864064 [Note] WSREP: Tables flushed.
WSREP_SST: [INFO] Preparing binlog files for transfer: (20160907 12:18:54.852)
mariadb-bin.000019
rsync: failed to connect to XXX.XXX.XXX.XXX (XXX.XXX.XXX.XXX): Connection refused (111) ///// XXX.XXX.XXX.XXX is NODE3 IP address
rsync error: error in socket IO (code 10) at clientserver.c(128) [sender=3.1.1]
WSREP_SST: [ERROR] rsync returned code 10: (20160907 12:18:54.862)
InnoDB: Warning: a long semaphore wait:
--Thread 139684182416128 has waited at trx0sys.ic line 103 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f0a967fe560 '&block->lock'
a writer (thread id 139684182416128) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/include/trx0sys.ic line 103
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 139684182416128 has waited at trx0sys.ic line 103 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f0a967fe560 '&block->lock'
a writer (thread id 139684182416128) has reserved it in mode  wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/include/trx0sys.ic line 103
Holder thread 0 file not yet reserved line 0

Many more semaphore warnings follow.



 Comments   
Comment by Andrii Nikitin (Inactive) [ 2017-07-18 ]

So there are two questions here:
1. Why did rsync show the error "Connection refused (111)"?
2. Will the cluster survive when such an error occurs?

Regarding the first question: I can think of two possible reasons: a transport error (a firewall on either side, or a proxy blocking access for some reason), or rsync somehow exiting immediately after startup. In any case, I don't think it is feasible to confirm the exact reason with the amount of information provided.
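To help tell those two causes apart, a quick probe like the sketch below, run from the donor, checks whether the joiner's rsync SST port is reachable at all before blaming rsync itself. This is not part of the SST scripts; the host and port in the example invocation are placeholders (4444 is the conventional rsync SST port, but your configuration may differ).

```shell
#!/bin/sh
# Hypothetical connectivity probe: succeeds if a TCP connection to the
# given host:port can be opened within 3 seconds, otherwise reports failure.
check_sst_port() {
  host=$1; port=$2
  # bash's /dev/tcp pseudo-device attempts a TCP connect on open
  if timeout 3 bash -c ">/dev/tcp/$host/$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "failed: $host:$port (firewall, or rsync not listening yet?)"
  fi
}

# Placeholder joiner address and default SST port
check_sst_port 192.0.2.10 4444
```

If this reports "failed" while the joiner believes it is waiting for the transfer, a firewall or proxy is the more likely culprit; if it is reachable yet rsync still fails, the rsync process on the joiner side probably exited early.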

Regarding the second question: below is the script I used to put some write load on the cluster and make sure that another node generates an identical 'rsync: failed to connect' error during join.

https://github.com/AndriiNikitin/mariadb-environs-galera/blob/master/_bugscript/MDEV-10762.sh

Everything remains operational, which shows that the cluster survives such a situation, at least in this basic scenario.

You can run the script directly (it will also download and unpack 10.1.16 and start a local cluster on ports 3306-3310), or in Docker using the script https://github.com/AndriiNikitin/mariadb-environs-galera/blob/master/_bugscript/show_bug_in_docker.sh like this:

ENVIRON=10.1.16 bash -x -v show_bug_in_docker.sh MDEV-10762

(Running with the Docker build script has the advantage that it caches previous successful steps and continues from them on subsequent attempts.)

Below are extracts from the output I observed, which confirm that the error is generated on the donor node and the cluster remains operational.

The four-node cluster initialized properly:

m0 :wsrep_cluster_size 4
m1 :wsrep_cluster_size 4
m2 :wsrep_cluster_size 4
m3 :wsrep_cluster_size 4

A fifth node attempts to join with the 'rsync_buggy' SST script, which will generate the error:

m4*/galera_join.sh $(hostname -i) $(hostname -i) wsrep_sst_method=rsync_buggy || true

The cluster is under write load:

+ cluster1/sql.sh 'select count(*) from test.t'
m0 :579
m1 :580
m2 :580
m3 :580

The new node didn't come up:

error: 'Can't connect to local MySQL server through socket '/mariadb-environs/m4-10.1.16/dt/my.sock' (111 "Connection refused")'

The donor error log shows the expected error:

m0-10.1.16/dt/error.log-2017-07-18 11:19:18 140673076426496 [Note] WSREP: Tables flushed.
m0-10.1.16/dt/error.log-rsync: failed to connect to 172.17.0.2 (172.17.0.2): Connection refused (111)
m0-10.1.16/dt/error.log-rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.9]
m0-10.1.16/dt/error.log:WSREP_SST: [ERROR] rsync returned code 10: (20170718 11:19:18.736)
m0-10.1.16/dt/error.log:2017-07-18 11:19:18 140673076426496 [ERROR] WSREP: Failed to read from: wsrep_sst_rsync_buggy --role 'donor' --address '172.17.0.2:44441/rsync_sst' --socket '/mariadb-environs/m0-10.1.16/dt/my.sock' --datadir '/mariadb-environs/m0-10.1.16/dt/'  --defaults-file '/mariadb-environs/m0-10.1.16/my.cnf'    '' --gtid '9c6de81e-6baa-11e7-bcf4-0ec0db3e6ba1:681' --gtid-domain-id '0'

The cluster is still under load:

+ cluster1/sql.sh 'select count(*) from test.t'
m0 :1685
m1 :1685
m2 :1685
m3 :1685
...
+ cluster1/sql.sh 'select count(*) from test.t'
m0 :2598
m1 :2599
m2 :2599
m3 :2599

The processlist shows no issues:

+ cluster1/sql.sh show processlist
m0 :1 system user NULL Sleep 0 committing 2601 NULL 0.000 2 system user NULL Sleep 201 wsrep aborter idle NULL 0.000 659 root localhost test Query 0 init show processlist 0.000
m1 :1 system user NULL Sleep 185 wsrep aborter idle NULL 0.000 2 system user NULL Sleep 0 committing 2601 NULL 0.000 659 root localhost test Query 0 init show processlist 0.000
m2 :1 system user NULL Sleep 165 wsrep aborter idle NULL 0.000 2 system user NULL Sleep 0 committing 2601 NULL 0.000 659 root localhost test Query 0 init show processlist 0.000
m3 :1 system user NULL Sleep 145 wsrep aborter idle NULL 0.000 2 system user NULL Sleep 0 committed 2600 NULL 0.000 659 root localhost test Query 0 init show processlist 0.000

Thus I have no option at the moment other than closing this as 'Cannot Reproduce'.
If somebody sees a similar issue in the future, please capture the following information to increase the chance of reproducing the issue with a test case or finding the root cause by other means.
1. Full error log from each node
2. Output of SQL commands from each node:
show global variables;
show processlist;
show engine innodb status\G
show global status;
do sleep(10);
show processlist;
show engine innodb status\G
show global status;

3. Output of shell commands from each node:
top -cbn1
df -h
mount

4. (preferably) Several gdb stack traces from each problem node; run the following command several times, a few seconds apart:
gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld)

5. (preferably) Any syslog messages within a timeframe that covers the problem period.
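The capture steps above can be bundled into a single collection script so they are easy to run on every node while the problem is happening. The sketch below is a hypothetical helper, not part of MariaDB: it assumes a default `mysql` client login path and that `gdb`/`pidof` are installed, and it skips any step whose tooling or server is unavailable.

```shell
#!/bin/sh
# Hypothetical diagnostics collector for hangs like this one: writes the
# requested outputs into one timestamped directory on the current node.
OUT="mdev10762-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"

# OS-level state (item 3)
df -h     > "$OUT/df.txt"    2>&1
mount     > "$OUT/mount.txt" 2>&1
top -cbn1 > "$OUT/top.txt"   2>&1 || true

# SQL snapshots, twice with a 10 s gap (item 2), only if a server answers
if command -v mysqladmin >/dev/null 2>&1 && mysqladmin ping >/dev/null 2>&1; then
  for i in 1 2; do
    mysql -e 'show global variables; show processlist;
              show engine innodb status\G show global status;' \
      > "$OUT/sql-$i.txt" 2>&1 || true
    [ "$i" = 1 ] && sleep 10
  done
fi

# Stack traces of the running server (item 4), if gdb is available
if command -v gdb >/dev/null 2>&1 && command -v pidof >/dev/null 2>&1 \
   && pidof mysqld >/dev/null 2>&1; then
  gdb -ex "set pagination 0" -ex "thread apply all bt" --batch \
      -p "$(pidof mysqld)" > "$OUT/gdb-bt.txt" 2>&1 || true
fi

echo "diagnostics written to $OUT"
```

Run it a few times on each problem node and attach the resulting directories, together with the full error logs (item 1) and any relevant syslog messages (item 5).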

Generated at Thu Feb 08 07:44:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.