Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 10.1.16
None
Environment: Debian 8
Description
We have a 3-node Galera cluster. One node crashed with a segfault and was automatically restarted by systemd. Let's say it was NODE1.
After the restart, NODE1 requested an SST from NODE2. However, the error log on NODE2 shows that rsync tried to connect to NODE3 and got "Connection refused". After that, the log repeated "long semaphore wait" warnings.
Result: NODE1 hung in the JOINER state and NODE2 hung in the DONOR state. NODE3 kept working normally.
NODE3 reported a cluster size of 3.
We had to manually kill the mariadb and galera processes on NODE1 and NODE2 and then start the databases. They started without trouble.
I don't know why rsync on NODE2 tried to connect to NODE3 or why the connection was refused. NODE3 was in the cluster the whole time. The bigger problem is that NODE1 and NODE2 stayed hung until they were killed manually.
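The hung state described above could in principle be spotted by comparing Galera status variables across nodes. The following is a minimal sketch, not something taken from this report: it assumes the values of `wsrep_local_state_comment` have already been collected per node (for example with `SHOW GLOBAL STATUS LIKE 'wsrep_%'`), and that a node which does not answer is represented by an empty dictionary. The helper name `find_unsynced_nodes` and the sample values are hypothetical.

```python
def find_unsynced_nodes(status_by_node):
    """Return {node: state} for every node that does not report 'Synced'.

    status_by_node maps a node name to a dict of wsrep status variables.
    A node that did not respond is expected to map to an empty dict.
    """
    unsynced = {}
    for node, status in status_by_node.items():
        # Missing variable means we could not read the state at all.
        state = status.get("wsrep_local_state_comment", "unknown")
        if state != "Synced":
            unsynced[node] = state
    return unsynced


# Hypothetical snapshot loosely modelled on the situation in this report:
# NODE1 gives no answer, NODE2 is still a donor, NODE3 is healthy.
sample_status = {
    "NODE1": {},
    "NODE2": {"wsrep_local_state_comment": "Donor/Desynced"},
    "NODE3": {"wsrep_local_state_comment": "Synced", "wsrep_cluster_size": "3"},
}

print(find_unsynced_nodes(sample_status))
```

A check like this only reports the state; it does not explain why the SST failed, and a truly hung server may not answer status queries at all.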
Log from NODE1 (JOINER):
2016-09-07 12:18:54 140564997596928 [Note] WSREP: Member 0.0 (NODE1) requested state transfer from '*any*'. Selected 1.0 (NODE2)(SYNCED) as donor.
2016-09-07 12:18:54 140564997596928 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 16699814)
2016-09-07 12:18:54 140565373377280 [Note] WSREP: Requesting state transfer: success, donor: 1
Nothing after that until the next database start.
Log from NODE2 (DONOR):
2016-09-07 12:18:54 139680871864064 [Note] WSREP: Flushing tables for SST...
2016-09-07 12:18:54 139680871864064 [Note] WSREP: Provider paused at 97f799e6-b828-11e4-8c4b-5a16ba3e9c9d:16699814 (1556340)
2016-09-07 12:18:54 139680871864064 [Note] WSREP: Tables flushed.
WSREP_SST: [INFO] Preparing binlog files for transfer: (20160907 12:18:54.852)
mariadb-bin.000019
rsync: failed to connect to XXX.XXX.XXX.XXX (XXX.XXX.XXX.XXX): Connection refused (111) ///// XXX.XXX.XXX.XXX is NODE3 IP address
rsync error: error in socket IO (code 10) at clientserver.c(128) [sender=3.1.1]
WSREP_SST: [ERROR] rsync returned code 10: (20160907 12:18:54.862)
InnoDB: Warning: a long semaphore wait:
--Thread 139684182416128 has waited at trx0sys.ic line 103 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f0a967fe560 '&block->lock'
a writer (thread id 139684182416128) has reserved it in mode wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/include/trx0sys.ic line 103
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 139684182416128 has waited at trx0sys.ic line 103 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x7f0a967fe560 '&block->lock'
a writer (thread id 139684182416128) has reserved it in mode wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/include/trx0sys.ic line 103
Holder thread 0 file not yet reserved line 0
And many more semaphore warnings after that.
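To make a donor log like the one above easier to triage, the key events can be pulled out with a few regular expressions. This is only a sketch under assumptions: the patterns are written against the exact line shapes quoted in this report, the function name `summarize_donor_log` is hypothetical, and other MariaDB or rsync versions may word these messages differently.

```python
import re

# Patterns derived from the log lines quoted in this report.
RSYNC_REFUSED = re.compile(r"rsync: failed to connect to (\S+) .*Connection refused")
SST_ERROR = re.compile(r"WSREP_SST: \[ERROR\] rsync returned code (\d+)")
SEMAPHORE_WAIT = re.compile(r"has waited at (\S+) line (\d+) for ([\d.]+) seconds")


def summarize_donor_log(lines):
    """Collect refused rsync targets, SST exit codes and semaphore wait stats."""
    summary = {
        "refused_hosts": [],
        "sst_exit_codes": [],
        "semaphore_waits": 0,
        "max_wait_seconds": 0.0,
    }
    for line in lines:
        match = RSYNC_REFUSED.search(line)
        if match:
            summary["refused_hosts"].append(match.group(1))
            continue
        match = SST_ERROR.search(line)
        if match:
            summary["sst_exit_codes"].append(int(match.group(1)))
            continue
        match = SEMAPHORE_WAIT.search(line)
        if match:
            summary["semaphore_waits"] += 1
            waited = float(match.group(3))
            summary["max_wait_seconds"] = max(summary["max_wait_seconds"], waited)
    return summary


# A few lines copied from the NODE2 log in this report.
SAMPLE_LINES = [
    "rsync: failed to connect to XXX.XXX.XXX.XXX (XXX.XXX.XXX.XXX): Connection refused (111)",
    "rsync error: error in socket IO (code 10) at clientserver.c(128) [sender=3.1.1]",
    "WSREP_SST: [ERROR] rsync returned code 10: (20160907 12:18:54.862)",
    "InnoDB: Warning: a long semaphore wait:",
    "--Thread 139684182416128 has waited at trx0sys.ic line 103 for 241.00 seconds the semaphore:",
    "--Thread 139684182416128 has waited at trx0sys.ic line 103 for 241.00 seconds the semaphore:",
]

print(summarize_donor_log(SAMPLE_LINES))
```

A summary like this shows at a glance that the SST script failed with rsync exit code 10 before the semaphore waits began, which matches the order of events in the log.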