[MDEV-10762] Joiner and donor hung after failed SST Created: 2016-09-07 Updated: 2019-05-21 Resolved: 2019-05-21 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Galera SST |
| Affects Version/s: | 10.1.16 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Maciej Radzikowski | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: | Debian 8 |
| Description |
|
We have a 3-node Galera cluster. After one of the nodes went down with a segfault, it was automatically restarted by systemd; let's say it was NODE1. After restart, NODE1 requested an SST from NODE2. But in the error log on NODE2 we can see that rsync tried to connect to NODE3 and got "connection refused". After that it repeatedly logged "long semaphore wait". The result: NODE1 hung in JOINER state and NODE2 hung in DONOR state, while NODE3 kept working fine. We had to manually kill the mariadb and galera processes on NODE1 and NODE2 and then start the databases; they started without trouble. I don't know why rsync on NODE2 tried to connect to NODE3, or why it got "connection refused"; NODE3 was in the cluster the whole time. But the bigger problem is that NODE1 and NODE2 stayed hung until the manual kill. Log from NODE1 (JOINER):
Nothing appears after that until the next database start. Log from NODE2 (DONOR):
And many more semaphore warnings after that. |
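As background on diagnosing such hangs (not part of the original report): each node's SST role is visible in the `wsrep_local_state_comment` status variable, and a node that sits in "Joiner" or "Donor/Desynced" for a long time matches the hang described above. A minimal sketch that classifies the values; the helper name and labels are ours, and the `mysql` invocation in the comment is illustrative:

```shell
# Classify a node's wsrep_local_state_comment value. A node stuck for a
# long time in "receiving-sst" or "serving-sst" is a hang candidate.
classify_wsrep_state() {
    case "$1" in
        Synced)                  echo "ok" ;;
        Joined)                  echo "catching-up" ;;
        Donor|"Donor/Desynced")  echo "serving-sst" ;;
        Joiner)                  echo "receiving-sst" ;;
        *)                       echo "unknown" ;;
    esac
}

# In practice the value would come from the server, e.g. (host is hypothetical):
#   mysql -h NODE1 -Nse "SHOW STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}'
classify_wsrep_state "Joiner"   # prints: receiving-sst
```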
| Comments |
| Comment by Andrii Nikitin (Inactive) [ 2017-07-18 ] |
|
So two questions are here: why the rsync connection was refused, and why NODE1 and NODE2 then hung instead of recovering.

Regarding the first question, I can think of two possible reasons: a transport error (a firewall on either side, or a proxy blocking access for some reason), or rsync somehow exiting immediately after startup. In any case I don't think it is feasible to confirm the exact reason with the amount of information provided.

Regarding the second question, below is the script I used to put some write load on the cluster and make sure that another node generates an identical 'rsync: failed to connect' error during join: https://github.com/AndriiNikitin/mariadb-environs-galera/blob/master/_bugscript/MDEV-10762.sh

Everything remains operational, which shows that the cluster survives such a situation, at least in this basic scenario. You can run the script directly (it will also download and unpack 10.1.16 and start a local cluster on ports 3306 - 3310), or in docker using the script https://github.com/AndriiNikitin/mariadb-environs-galera/blob/master/_bugscript/show_bug_in_docker.sh like below:
(Running with the docker build script has the advantage that it caches previously successful steps and continues from them on subsequent attempts.) Below are extracts from the output I observe, which confirm that the error is generated on the donor node and the cluster remains operational. The four-node cluster initialized properly:
The fifth node attempts to join with the 'rsync_buggy' SST script, which will generate the error:
The cluster is under write load:
The new node didn't come up:
Donor error log shows desired error:
Cluster is still under load:
Processlist shows no issue:
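The 'rsync_buggy' SST script itself is not attached to the ticket. Purely as an illustration of the approach, such a wrapper could rewrite the joiner address before delegating to the stock rsync SST script; everything here (helper name, port choice, the hand-off path in the comment) is an assumption, not the actual script:

```shell
# Hypothetical sketch of an 'rsync_buggy'-style SST wrapper: rewrite the
# --address argument mysqld passes to the SST script so the donor's rsync
# connects to a port nothing listens on, reproducing "failed to connect".
break_sst_address() {
    case "$1" in
        --address=*) echo "--address=127.0.0.1:1" ;;   # closed port -> refused
        *)           echo "$1" ;;
    esac
}

# A real wrapper would then hand each rewritten argument to the stock
# script, e.g.:  exec /usr/bin/wsrep_sst_rsync ...rewritten args...
break_sst_address "--address=10.0.0.5:4444"   # prints: --address=127.0.0.1:1
```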
Thus at the moment I have no option other than closing this as 'Cannot reproduce'.
3. Output of shell commands from each node:
4. (preferable) Several outputs of gdb stack traces from each problem node:
5. (preferable) Any syslog messages within a greater timeframe which covers the problem period. |
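For the gdb stack traces requested above, a common way to capture a backtrace of every thread in a hung mysqld is gdb's batch mode (`--batch -p PID -ex 'thread apply all bt'`). The small helper that assembles the command line is ours, added only so the invocation is easy to check:

```shell
# Build the gdb batch invocation that dumps a backtrace of every thread of
# the given pid; run it several times, a few seconds apart, on each hung
# node. gdb's --batch, -p and -ex options are standard.
stack_trace_cmd() {
    echo "gdb --batch -p $1 -ex 'thread apply all bt'"
}

stack_trace_cmd 12345
# Typical use (pid lookup and output file are illustrative):
#   eval "$(stack_trace_cmd "$(pidof mysqld)")" > stacks.$(date +%s).txt 2>&1
```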