[MDEV-22159] Galera SST tests fail on Debian sid and focal: all SST method tests fail i.e. mariabackup, mysqldump and rsync Created: 2020-04-06 Updated: 2020-05-05 Resolved: 2020-04-22 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera SST, Tests |
| Affects Version/s: | 10.3, 10.4, 10.5 |
| Fix Version/s: | 10.2.32, 10.3.23, 10.4.13, 10.5.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jan Lindström (Inactive) | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Attachments: |
|
| Description |
|
| Comments |
| Comment by Jan Lindström (Inactive) [ 2020-04-06 ] | |||||||||||||||||||||||||
|
It seems that node_2 does not shut down
| |||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-04-07 ] | |||||||||||||||||||||||||
|
Local testing with
| |||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-04-07 ] | |||||||||||||||||||||||||
|
Note that on the test machines all SST method tests fail, i.e. mariabackup, mysqldump and rsync, and based on the error logs, in all cases the second node does not shut down. | |||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-04-07 ] | |||||||||||||||||||||||||
|
Locally, I can't repeat problems on mysqldump or rsync:
And there are no running mysqld processes after mtr has finished. | |||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2020-04-15 ] | |||||||||||||||||||||||||
|
I'm not sure what you were trying to reproduce by running MTR tests; they pass in the buildbot as well, as you can see from the links that you pasted:
Consequently, if you are quoting MTR logs for something that didn't shut down, that is irrelevant: MTR will kill the servers at the end anyway. The problems are unrelated to MTR and start at the next stage, which is also clearly visible from the buildbot output, as every command is written there.

The test attempts to start 3 nodes using mysqld_safe, and none of them starts. The server logs for these attempts are node1, node2, node3 (which also doesn't need to be guessed, it can be seen in the buildbot output). They are attached, but all of them are empty, from which one can conclude that the failure happens before the server is able to write anything to the log.

Further debugging is primitive, as mysqld_safe is just a shell script. You create the configs the same way the test creates them, run mysqld_safe the same way the test does, observe the same silent exit, debug. If you did that, you would find out that mysqld_safe fails during the WSREP recovery stage, before it actually starts the server (which is why the server error logs remain empty). And this WSREP logic disables all the output in the process, which is why you don't see anything meaningful in the output either.

Actual problem

The WSREP recovery problem boils down to this:
That's what the logic does with some wr_logfile or alike. Please investigate further and fix as needed, as Focal goes GA next week, so our upcoming releases will have to build on it. | |||||||||||||||||||||||||
| Comment by Daniel Black [ 2020-04-15 ] | |||||||||||||||||||||||||
|
Recommend raising this to the kernel developers on https://www.spinics.net/lists/linux-fsdevel/ or a mailing list specific to the file-system that fails. | |||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-04-20 ] | |||||||||||||||||||||||||
|
This affects 10.3 and 10.4 as well. I suppose that this could affect 10.1 and 10.2 too, but our Debian Sid builder only covers 10.3 and later. Here is the very first 10.3 failure (which appears to be the very first 10.3 build of that builder). | |||||||||||||||||||||||||
| Comment by Otto Kekäläinen [ 2020-04-21 ] | |||||||||||||||||||||||||
|
I updated the title to ensure everybody on this issue is debugging the same thing. So the problem is these three tests: Buildbot tries to start 3 Galera nodes, but none of them start, and eventually the buildbot run stops on a connection error:
Exactly the same happens for all three:
Logs and their summaries:
| |||||||||||||||||||||||||
| Comment by Otto Kekäläinen [ 2020-04-21 ] | |||||||||||||||||||||||||
|
jplindst Could you find out when this bug started, i.e. what is the commit range of suspected changes that caused it? Or run the same thing manually and check out what node1.err contains on failure? | |||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-04-21 ] | |||||||||||||||||||||||||
|
otto, for 10.3, I posted the very first failure on the Debian Sid builder. It seems to me that the culprit commit could be very old. It could be helpful if elenst provided some more details of how her finding that looks like a change in the Linux kernel is related to this. | |||||||||||||||||||||||||
| Comment by Otto Kekäläinen [ 2020-04-21 ] | |||||||||||||||||||||||||
|
> Here is the very first 10.3 failure (which appears to be the very first 10.3 build of that builder).
So by *which appears to be the very first 10.3 build of that builder* you mean that somebody added a new builder and it was broken from the onset, and there was no previous commit where it worked? So we are actually not debugging a buildbot failure, but figuring out how to correctly implement a new builder? I find that hard to believe. There must have been a point in time when the builder worked, before the source was updated to the current MariaDB version or before the virtual image contents were apt-upgraded to the latest Linux distro release. It should be possible to find out which git commit range or which OS upgrade introduced the failure. If nothing else, take the eoan builder currently passing on the latest 10.5 git head (http://buildbot.askmonty.org/buildbot/builders/kvm-deb-eoan-amd64/builds/990/steps/galera-sst-mariabackup) and start upgrading system software in suitable chunks until you get the failure; then you will know which part of the OS upgrade broke it. | |||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2020-04-21 ] | |||||||||||||||||||||||||
I really don't know what else I can provide. mysqld_safe attempts to write into a file which exists in /tmp with certain permissions which make it impossible to write into, it fails, mysqld_safe aborts. I provided an example of permissions and failure to write, identical to what mysqld_safe does. I don't know how it can be any clearer. It cannot be fixed by the power of thought, somebody needs to actually run it, debug a tiny part of not-so-big mysqld_safe shell script and find a way to do it better, assuming that it was an intentional change in the system and not a system bug. If it is a bug, then it may be eventually fixed by just waiting for an upgrade with a patch, but if that's our approach, whoever takes it should make sure the system developers are aware of it. | |||||||||||||||||||||||||
| Comment by Otto Kekäläinen [ 2020-04-21 ] | |||||||||||||||||||||||||
|
I confirm that the steps outlined by elenst fail on Ubuntu Focal, but which log did you find them in, or which lines of code in https://github.com/MariaDB/server/blob/10.5/scripts/mysqld_safe.sh do you simulate?
There are no journal events, no AppArmor denies etc visible. Filesystem ext4. Also happens without any sudo/sh:
| |||||||||||||||||||||||||
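When neither the journal nor AppArmor shows anything, one kernel setting that may be worth inspecting on an affected machine is the `fs.protected_regular` sysctl, which on recent Linux kernels restricts `O_CREAT` opens of regular files owned by another user in world-writable sticky directories such as /tmp. Whether this setting is the cause of the failure is not established in this thread; that is an assumption. The helper name `show_protected_regular` below is hypothetical, and the sketch only reads the value without changing anything.

```shell
# Hypothetical diagnostic helper, not part of mysqld_safe or the buildbot scripts.
# Prints the current fs.protected_regular value if the kernel exposes it.
# The optional argument overrides the path, which is useful for testing.
show_protected_regular() {
  f=${1:-/proc/sys/fs/protected_regular}
  if [ -r "$f" ]; then
    # 0 = restriction disabled, 1 or 2 = restriction enabled
    echo "fs.protected_regular=$(cat "$f")"
  else
    echo "fs.protected_regular: not available"
  fi
}
```

Calling `show_protected_regular` with no argument on both a passing builder (for example eoan) and a failing one (focal or sid) would show whether the two differ in this setting.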
| Comment by Elena Stepanova [ 2020-04-21 ] | |||||||||||||||||||||||||
|
otto For the logic, it's the part starting from line 229 by your link, or search for wr_logfile.
A few lines later it changes ownership and permissions like that
A few more lines later it attempts to run the server (WSREP recovery, apparently) like that
Nothing is written to the file due to the permissions problem, it greps the file, doesn't find in there what it's looking for, and bails out. | |||||||||||||||||||||||||
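The sequence described in this comment (create a temporary log, restrict it, redirect the recovery run into it, then search it) can be sketched roughly as follows. This is a minimal illustration and not the actual mysqld_safe code: the function name `recover_position_sketch`, the marker text `Recovered position`, and the use of an arbitrary stand-in command instead of the real server invocation are all assumptions. The sketch deliberately leaves out the ownership change, so the file stays owned by the caller and the redirect does not depend on who owns a file in /tmp.

```shell
# Sketch only; names and marker text are hypothetical, not taken from mysqld_safe.
# "$@" is a stand-in for the command that would print the recovery position.
recover_position_sketch() {
  wr_logfile=$(mktemp /tmp/wsrep_recovery.XXXXXX) || return 1

  # Restrict permissions only; no chown, so the caller remains the owner.
  chmod 600 "$wr_logfile"

  # Run the stand-in command with its output redirected into the temporary log.
  # If the redirect itself is denied, the command fails and we bail out here.
  if ! "$@" > "$wr_logfile" 2>&1; then
    rm -f "$wr_logfile"
    return 1
  fi

  # Search the log for the marker line; an empty log yields no match.
  line=$(grep 'Recovered position' "$wr_logfile") || line=""
  rm -f "$wr_logfile"

  if [ -z "$line" ]; then
    echo "recovery marker not found" >&2
    return 1
  fi

  # Print only the part after the marker.
  echo "${line##*Recovered position: }"
}
```

For instance, `recover_position_sketch echo "WSREP: Recovered position: abc:42"` exercises the success path, while a command that prints nothing relevant takes the "marker not found" path, which mirrors the silent bail-out described above.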
| Comment by Jan Lindström (Inactive) [ 2020-04-21 ] | |||||||||||||||||||||||||
|
Is this somehow related to https://jira.mariadb.org/browse/MDEV-21140? That was "fixed", but I am not sure whether the fix was confirmed. | |||||||||||||||||||||||||
| Comment by Otto Kekäläinen [ 2020-04-21 ] | |||||||||||||||||||||||||
|
jplindst Why do you ask instead of checking it yourself?
Can you jplindst please proceed and fix this everywhere in Galera? What Alexander did in | |||||||||||||||||||||||||
| Comment by Otto Kekäläinen [ 2020-04-21 ] | |||||||||||||||||||||||||
|
Tentative fix in https://github.com/MariaDB/server/pull/1510. I am doing this only for 10.5 and only for this one chown scenario, in order to not get stalled on fixing the bugs that are actually assigned to me and which I intend to have closed by the end of the month. Buildbot greenness is holy for me. I don't understand how anybody can develop without having CI that confirms all the time that no testable regressions exist... I at least can't. I leave it to Seppo or Jan to fix this properly. | |||||||||||||||||||||||||
| Comment by Otto Kekäläinen [ 2020-04-22 ] | |||||||||||||||||||||||||
|
Alternative PR targeting the 10.2 branch, as requested by jplindst: https://github.com/MariaDB/server/pull/1511 | |||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-04-22 ] | |||||||||||||||||||||||||
|
I merged otto’s 10.2 change to 10.3, and it does work there. I see that jplindst separately pushed a change to 10.5 without waiting for a merge. |