[MDEV-13478] Full SST sync fails because of the error in the cleaning part Created: 2017-08-09 Updated: 2020-08-25 Resolved: 2017-12-19 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Configuration |
| Affects Version/s: | 10.1.20, 5.5.51-galera, 10.2.7 |
| Fix Version/s: | 10.1.30, 10.2.12, 5.5.59-galera, 10.0.34-galera |
| Type: | Bug | Priority: | Major |
| Reporter: | Kamil | Assignee: | Sachin Setiya (Inactive) |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | galera, wsrep |
| Environment: |
[root@eap-db01 ~]# rpm -qa | grep -i maria |
||
| Sprint: | 10.1.30 |
| Description |
|
Hi MariaDB Folks,
We had the same problem with MariaDB version 5.5, and we still have it after upgrading to 10.2.7.
Problem description: when a node needs a full state snapshot transfer (SST), the first sync attempt always fails in our case. It does not fail at the very beginning of the SST phase, but only after the data transfer has completed. We wait 45 minutes for 180 GB to be transferred, and in the end it always fails. The error message appears in the logs while the data is being streamed:
The key thing in the snippet above is a second attempt to remove a file that had already been removed earlier:
two lines before:
It is a random file, not always the same one. As a result, the whole cleanup phase fails with the error:
The code responsible for this part:
The second SST attempt succeeds because all files and directories have already been removed. I wasn't able to find the root cause of why find tries to remove the same files again. We have quite a lot of databases and tables in the MySQL data directory, but I don't think that could cause this kind of issue.
To make the first SST attempt work, we need to change line 891 to:
|
| Comments |
| Comment by Andrii Nikitin (Inactive) [ 2017-08-11 ] |
|
As I understand it, with the default value of $cpat that command will not attempt to remove .frm files:
Do you have a custom configuration for cpat in your configuration files? (If yes, please provide it for reference.) |
| Comment by Kamil [ 2017-08-14 ] |
|
Hi Andrii, We don't have a custom configuration for that. The find invocation I mentioned above does not remove files matched by the cpat variable; instead, it removes everything that is not matched by the regex:
"-prune -o" We don't have any parameters such as innodb-data-home-dir, innodb-log-group-home-dir or innodb-undo-directory defined:
Thanks, |
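For reference, the point about "-prune -o" can be sketched as follows. This is a minimal illustration, assuming the cleanup command has the general shape `find DIR -mindepth 1 -regex PATTERN -prune -o <remove action>`. The directory layout, the `keep_pattern` value and the `would_remove` helper are hypothetical (they are not the script's real defaults), and `-print` stands in for the remove action so that nothing is deleted:

```shell
# Hypothetical data directory; every name below is made up for illustration.
datadir=$(mktemp -d)
mkdir -p "$datadir/keep_dir" "$datadir/db1"
touch "$datadir/keep_cache" "$datadir/keep_dir/inner.txt" \
      "$datadir/ibdata1" "$datadir/db1/t1.frm"

# Stand-in for $cpat (an assumption, not the real default value).
keep_pattern='.*/keep_[^/]*$'

# Paths matching the regex are pruned: they are kept and, if they are
# directories, not descended into. The action after -o runs for every
# path that does NOT match the regex.
would_remove() {
    (cd "$datadir" && find . -mindepth 1 -regex "$keep_pattern" -prune -o -print | LC_ALL=C sort)
}

selected=$(would_remove)
printf '%s\n' "$selected"
rm -rf "$datadir"
```

In this sketch `keep_cache` and everything under `keep_dir` are left alone, while both `./db1` and `./db1/t1.frm` are selected, that is, a directory as well as the file inside it.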
| Comment by Andrii Nikitin (Inactive) [ 2017-08-16 ] |
|
Thank you for the explanation.
|
| Comment by Kamil [ 2017-08-19 ] |
|
Hi Andrii, I'm sorry for the delayed response. For some reason find is trying to remove the same file twice; it's not clear to me why this happens. Below is the output of the requested checks:
BR, Kamil |
| Comment by Andrii Nikitin (Inactive) [ 2017-08-21 ] |
|
Just for reference: |
| Comment by Andrii Nikitin (Inactive) [ 2017-08-21 ] |
|
It doesn't look easy to reproduce the problem, and I really don't want to suggest a patch without knowing exactly when the problem occurs (every patch can introduce a regression, and there may be some environment that stops working properly). So I have to ask more questions that may help in finding the best resolution here: 1. Are you sure that the fix with xargs you mentioned will completely solve the problem? |
| Comment by Kamil [ 2017-08-21 ] |
|
Hi Andrii, 1. Since we replaced exec with xargs we haven't seen this problem again. Before the change, whenever a node left the cluster and an SST was required to fully sync it, it never worked for us. This points us to problems in the cleanup phase. |
| Comment by Kamil [ 2017-08-21 ] |
|
Now I've noticed that find actually tried to remove more than one file twice:
|
| Comment by Andrii Nikitin (Inactive) [ 2017-09-21 ] |
|
Working on Could you check whether the line `shopt -s nullglob` is present somewhere in .bashrc or similar places, and provide the output of the command below from the problem environment:
This is what may demonstrate the 'No such file or directory' error from find with explicit nullglob:
|
| Comment by Kamil [ 2017-09-22 ] |
|
Hi Andrii, That's interesting! I've checked all my DB nodes but haven't found nullglob anywhere.
Full output:
Also nothing in the scripts:
|
| Comment by Andrii Nikitin (Inactive) [ 2017-09-22 ] |
|
Thank you for the feedback. While it didn't confirm my assumptions, I did eventually manage to prove that the command used in SST may show 'No such file or directory' when there are many files.
It is worth noting that your proposed solution still prints error messages in the case above; the overall operation just still succeeds:
So, unless I or the reviewers come up with a better solution, I will try to get your suggestion pushed into the code. |
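One plausible reading of why an xargs-based variant "still succeeds" is plain shell behaviour: without `pipefail`, the exit status of a pipeline is that of its last command (`xargs`), not of `find`. The sketch below illustrates only that behaviour. The paths are hypothetical, a deliberately missing start point is used to force an error from find, and the pipeline is an assumed shape rather than the reporter's exact line:

```shell
# Hypothetical layout: one real directory plus one path that does not exist,
# so that find is guaranteed to report an error.
work=$(mktemp -d)
mkdir -p "$work/data/db1"
touch "$work/data/db1/t1.frm"

# find on its own: the missing start point makes it exit non-zero.
find_status=0
find "$work/data" "$work/missing" -mindepth 1 -print0 >/dev/null 2>&1 || find_status=$?

# The same find feeding xargs: the pipeline reports the status of xargs (the
# last command), so the error from find no longer fails the step.
pipeline_status=0
find "$work/data" "$work/missing" -mindepth 1 -print0 2>/dev/null | xargs -0 rm -rf || pipeline_status=$?

# Count what is left under the data directory after the cleanup.
remaining=$(find "$work/data" -mindepth 1 | wc -l | tr -d ' ')
echo "find alone: $find_status, pipeline: $pipeline_status, entries left: $remaining"
rm -rf "$work"
```

Under these assumptions the files are removed and the step reports success even though find itself failed, which matches the observation that errors are still printed while the overall operation succeeds.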
| Comment by Andrii Nikitin (Inactive) [ 2017-09-23 ] |
|
Could you also provide the output from the problem machine(s):
|
| Comment by Kamil [ 2017-09-27 ] |
|
Hi Andrii, I'm sorry, I must have missed your query. Please see the output:
|
| Comment by Valerii Kravchuk [ 2017-10-25 ] |
|
It seems 10.1.20 on SLES 12 is also affected by this bug. |
| Comment by Sergei Golubchik [ 2017-12-18 ] |
|
No, -print0 has nothing to do with the problem in question; there are no spaces or other strange characters there. |
| Comment by Sachin Setiya (Inactive) [ 2017-12-19 ] |
|
I think the position of -prune is slightly wrong; it should be
(before -regex), so that it is applied in both cases. |
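To illustrate what moving -prune changes, here is a small sketch that compares the two orderings, with `-print` standing in for the remove action. The layout and `keep_pattern` are hypothetical, and the second expression is only one reading of "before -regex", not necessarily the committed patch:

```shell
# Hypothetical data directory with two databases and one file to keep.
datadir=$(mktemp -d)
mkdir -p "$datadir/db1" "$datadir/db2"
touch "$datadir/galera.cache" "$datadir/db1/t1.frm" "$datadir/db2/t2.frm"
keep_pattern='.*galera\.cache$'   # stand-in for $cpat (assumed value)

# -prune after -regex: only matching paths are pruned, so find still descends
# into db1 and db2 and selects their contents as well as the directories.
prune_after=$(cd "$datadir" && find . -mindepth 1 -regex "$keep_pattern" -prune -o -print | LC_ALL=C sort)

# -prune first: every entry directly under the data directory is pruned, so
# find never descends; only top-level non-matching entries are selected.
prune_first=$(cd "$datadir" && find . -mindepth 1 -prune -regex "$keep_pattern" -o -print | LC_ALL=C sort)

printf 'prune after -regex:\n%s\n' "$prune_after"
printf 'prune before -regex:\n%s\n' "$prune_first"
rm -rf "$datadir"
```

Because a recursive remove of a top-level directory already deletes everything beneath it, selecting only the top-level entries is sufficient, and find has no reason to visit paths that have already been deleted.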
| Comment by Sachin Setiya (Inactive) [ 2017-12-19 ] |
|
So the reason for this is that find currently works like this |
| Comment by Sachin Setiya (Inactive) [ 2017-12-19 ] |
|
http://lists.askmonty.org/pipermail/commits/2017-December/011744.html |
| Comment by Sachin Setiya (Inactive) [ 2017-12-19 ] |
|
http://lists.askmonty.org/pipermail/commits/2017-December/011745.html |