Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 10.3 (EOL), 10.4 (EOL), 10.5
- None
Description
Attachments
- mysqld.1.err (241 kB)
- mysqld.2.err (287 kB)
- mysqld-local.1.err (28 kB)
- mysqld-local.2.err (13 kB)
- screenshot-1.png (34 kB)
- stdout-local.log (2 kB)
- xtrabackup_galera_info (0.0 kB)
Activity
Local testing with:
jan@jan-HP-ZBook-15u-G5:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu Focal Fossa (development branch)
Release:        20.04
Codename:       focal
- branch: 10.5 CS
- commit: dcc2eaebbdf658af35a92249a565d80c19967cbe
- how to repeat: perl mysql-test-run.pl --big-test --suite=galera --do-test=galera_sst_mariabackup*
- Error logs attached
Note that on the test machines all SST method tests fail, i.e. mariabackup, mysqldump and rsync, and based on the error logs, in all cases the second node does not shut down.
Locally, I can't reproduce the problems with mysqldump or rsync:
galera.galera_sst_mysqldump 'debug,innodb'          [ pass ]  134393
galera.galera_sst_mysqldump_with_key 'debug,innodb' [ pass ]   28990
galera.galera_sst_rsync 'debug,innodb'              [ pass ]   44524
galera.galera_sst_rsync2 'debug,innodb'             [ pass ]   44538
galera.galera_sst_rsync_data_dir 'debug,innodb'     [ pass ]   44172
And there are no running mysqld processes after mtr has finished.
I'm not sure what you were trying to reproduce by running MTR tests; they pass in the buildbot as well, as you can see from the links that you pasted:
==============================================================================
TEST                                              RESULT   TIME (ms) or COMMENT
------------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
galera.galera_sst_rsync 'innodb,release'          [ pass ]  30769
galera.galera_sst_rsync2 'innodb,release'         [ pass ]  30858
galera.galera_sst_rsync_data_dir 'innodb,release' [ pass ]  30768
------------------------------------------------------------------------------
The servers were restarted 2 times
Spent 92.395 of 138 seconds executing testcases

Completed: All 3 tests were successful.
Consequently, if you are quoting MTR logs as evidence of something that didn't shut down, it is irrelevant; MTR kills the servers at the end anyway.
The problems are unrelated to MTR and start at the next stage, which is also clearly visible from the buildbot output, as every command is written there. The test attempts to start 3 nodes using mysqld_safe, and none of them starts. The server logs for these attempts are node1, node2 and node3 (which doesn't need to be guessed either; it can be seen in the buildbot output). They are attached, but all of them are empty, from which one can conclude that the failure happens before the server is able to write anything to the log.
Further debugging is primitive, as mysqld_safe is just a shell script. Create the configs the same way the test creates them, run mysqld_safe the same way the test does, observe the same silent exit, and debug.
If you do that, you will find that mysqld_safe fails during the WSREP recovery stage, before it actually starts the server (which is why the server error logs remain empty). And this WSREP logic suppresses all output in the process, which is why you don't see anything meaningful in the output either.
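For example, a minimal sketch of such a debugging session, assuming the config path and options from the buildbot output quoted further down (the mysqld_safe path here is the Debian package location and is an assumption):
# Hypothetical debugging sketch: trace the script to see where it exits silently.
sudo sh -x /usr/bin/mysqld_safe --defaults-extra-file=/home/buildbot/node1.cnf \
    --user=mysql --wsrep-new-cluster
# The -x trace prints each command as it runs, so the last lines before the
# silent exit point at the failing block (the WSREP recovery logic discussed below).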
Actual problem
The WSREP recovery problem boils down to this:
sudo rm -f /tmp/xx
sudo touch /tmp/xx
sudo chmod 600 /tmp/xx
sudo chown <existing user> /tmp/xx
sudo sh -c "echo 1 > /tmp/xx"
That's what the logic does with its wr_logfile or the like.
On older systems, it works all right. On Focal and Sid (apparently, on the newest kernels), the last line, the attempt to write into the file under sudo, fails with permission denied. At this point mysqld_safe bails out, and nothing is started.
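If it helps whoever debugs this, the denial can also be observed at the syscall level; a sketch assuming strace is installed (the exact output shape may differ):
# The ">" redirection opens the target with O_CREAT, which is what gets rejected:
sudo strace -f -e trace=openat sh -c "echo 1 > /tmp/xx" 2>&1 | grep /tmp/xx
# Expected to show something like:
# openat(AT_FDCWD, "/tmp/xx", O_WRONLY|O_CREAT|O_TRUNC, 0666) = -1 EACCES (Permission denied)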
Please investigate further and fix as needed: Focal goes GA next week, so our upcoming releases will have to build on it.
I recommend raising this with the kernel developers on https://www.spinics.net/lists/linux-fsdevel/ or on a mailing list specific to the filesystem that fails.
This affects 10.3 and 10.4 as well. I suppose that this could affect 10.1 and 10.2 too, but our Debian Sid builder only covers 10.3 and later. Here is the very first 10.3 failure (which appears to be the very first 10.3 build of that builder).
Note that Debian Sid is in the experimental grid view (link to 10.4).
I updated the title to ensure everybody on this issue is debugging the same thing. The problem is the three tests listed below.
Buildbot tries to start 3 Galera nodes, but none of them starts, and eventually the buildbot run stops on a connection error:
sudo mysqld_safe --defaults-extra-file=/home/buildbot/node1.cnf --user=mysql --wsrep-new-cluster &
...
+ mysql -uroot -prootpass --port=8301 --protocol=tcp -e 'show status like '\''wsrep_cluster_size'\'''
ERROR 2002 (HY000): Can't connect to MySQL server on 'localhost' (115)
Exactly the same happens for all three:
- galera-sst-mariabackup: galera SST with mariabackup failed
- galera-sst-mysqldump: galera SST with mysqldump failed
- galera-sst-rsync: galera SST with rsync failed
Logs and their summaries:
- stdio: ends as mentioned above
- kernel: nothing special to my eye
- syslog:
# boot
Apr 20 17:13:36 debian-sid-amd64 systemd[1]: Started Session 1 of user buildbot.
Apr 20 17:13:36 debian-sid-amd64 systemd[922]: Reached target Main User Target.
Apr 20 17:13:36 debian-sid-amd64 systemd[922]: Startup finished in 598ms.
...
# mysqld runs upgrade and shutdown at the same time
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: mysql.user OK
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: performance_schema
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: Phase 4/7: Running 'mysql_fix_privilege_tables'
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 328: MySQL server has gone away
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 329: MySQL server has gone away
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 330: MySQL server has gone away
...
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: FATAL ERROR: Upgrade failed
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[980]: Checking for insecure root accounts.
Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] InnoDB: Shutdown completed; log sequence number 49104; transaction id 42
Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] /usr/sbin/mysqld: Shutdown complete
Apr 20 17:13:38 debian-sid-amd64 systemd[1]: mariadb.service: Succeeded.
Apr 20 17:18:44 debian-sid-amd64 systemd[1]: session-5.scope: Succeeded.
Apr 20 17:18:44 debian-sid-amd64 systemd[1]: Started Session 7 of user buildbot.
- node1: empty
- node3: empty
- node2: empty
- mysqld.1.err: irrelevant mtr logs, we need to see node1.err
- mysqld.2.err: irrelevant mtr logs, we need to see node2.err
- mysqld.3.err: irrelevant mtr logs, we need to see node3.err
jplindst, could you find out when this bug started, i.e. what is the commit range of suspected changes that caused it?
Or run the same thing manually and check what node1.err contains on failure?
otto, for 10.3, I posted the very first failure on the Debian Sid builder. It seems to me that the culprit commit could be very old. It could be helpful if elenst provided some more details on how her finding, which looks like a change in the Linux kernel, is related to this.
> Here is the very first 10.3 failure (which appears to be the very first 10.3 build of that builder).
So by *which appears to be the very first 10.3 build of that builder* you mean that somebody added a new builder and it was broken from the outset, and there was no previous commit where it worked? So we are actually not debugging a buildbot failure, but figuring out how to correctly implement a new builder? I find that hard to believe. There must have been a point in time when the builder worked, before the source was updated to the current MariaDB version, or before the virtual image contents were apt-upgraded to the latest Linux distro release. It should be possible to find out what git commit range introduced the failure, or what OS upgrade introduced the error.
If nothing else, take the eoan builder currently passing on the latest 10.5 git head (http://buildbot.askmonty.org/buildbot/builders/kvm-deb-eoan-amd64/builds/990/steps/galera-sst-mariabackup) and start upgrading system software in suitable chunks until you get the failure; then you will know which part of the OS upgrade broke it. A rough sketch of that bisection follows.
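Something like this, on an apt-based image (the package grouping below is arbitrary and purely illustrative):
# Illustrative bisection sketch: upgrade one chunk of packages at a time.
sudo apt-get update
sudo apt-get install --only-upgrade systemd udev          # chunk 1: base system defaults
# Re-run the galera-sst-mariabackup step; if it still passes, take the next chunk:
sudo apt-get install --only-upgrade linux-image-generic   # chunk 2: kernel
# ...and so on, until the failure appears.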
> It could be helpful if Elena Stepanova provided some more details of how her finding that looks like a change in the Linux kernel is related to this.
I really don't know what else I can provide. mysqld_safe attempts to write into a file which exists in /tmp with permissions that make it impossible to write into; the write fails, and mysqld_safe aborts. I provided an example of the permissions and the failure to write, identical to what mysqld_safe does. I don't know how it can be any clearer. It cannot be fixed by the power of thought; somebody needs to actually run it, debug a tiny part of the not-so-big mysqld_safe shell script, and find a way to do it better, assuming that it was an intentional change in the system and not a system bug. If it is a bug, then it may eventually be fixed by just waiting for an upgrade with a patch, but if that's our approach, whoever takes it should make sure the system developers are aware of the problem.
I confirm that the steps outlined by elenst fail on Ubuntu Focal, but which log did you find them in, or which lines of code in https://github.com/MariaDB/server/blob/10.5/scripts/mysqld_safe.sh are you simulating?
vagrant@ubuntu-focal:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04 LTS
Release:        20.04
Codename:       focal
vagrant@ubuntu-focal:~$ sudo rm -f /tmp/xx
vagrant@ubuntu-focal:~$ sudo touch /tmp/xx
vagrant@ubuntu-focal:~$ sudo chmod 600 /tmp/xx
vagrant@ubuntu-focal:~$ sudo chown $USER /tmp/xx
vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"
sh: 1: cannot create /tmp/xx: Permission denied
There are no journal events, no AppArmor denials, etc. visible. The filesystem is ext4.
It also happens without any sudo/sh:
root@ubuntu-focal:/tmp# ll xx
-rw------- 1 vagrant root 2 Apr 21 12:28 xx
root@ubuntu-focal:/tmp# echo 1 > /tmp/xx
bash: /tmp/xx: Permission denied

# Goes away if root owns the file
root@ubuntu-focal:/tmp# chown root xx
root@ubuntu-focal:/tmp# echo 1 > /tmp/xx
root@ubuntu-focal:/tmp#
otto, there is no log; wsrep is uncharacteristically quiet in there. Luckily it's shell, so it doesn't require much to debug. Setting -x is already a good step forward.
For the logic, it's the part starting from line 229 at your link, or search for wr_logfile.
mysqld_safe is run under sudo with --user=mysql, as usual. First it creates a temporary file like this:
local wr_logfile=$(mktemp /tmp/wsrep_recovery.XXXXXX)
A few lines later it changes ownership and permissions like this:
[ "$euid" = "0" ] && chown $user $wr_logfile
|
chmod 600 $wr_logfile
|
A few more lines later it attempts to run the server (WSREP recovery, apparently) like this:
eval "$mysqld_cmd --wsrep_recover $wr_options 2> $wr_logfile"
|
Nothing is written to the file due to the permissions problem; the script greps the file, doesn't find what it's looking for, and bails out.
It is supposed to print an error message upon abort, like "WSREP: Failed to recover position", but that is not seen anywhere in the logs, and I didn't dig into the reasons for it. Maybe logging isn't properly initialized at this point, or there is some other reason; hopefully whoever debugs this will find out.
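To make the ordering constraint concrete, here is a minimal sketch of one conceivable rearrangement, an illustration only and not a claim about what the eventual fix does: the redirection is opened by the shell running as root, so as long as root still owns the file at that moment, the write succeeds, and the chown can happen afterwards (an already-open, inherited descriptor is not affected by the ownership change).
# Hypothetical reordering sketch (variable names as in mysqld_safe.sh):
wr_logfile=$(mktemp /tmp/wsrep_recovery.XXXXXX)   # created owned by root
chmod 600 "$wr_logfile"
# Root opens and writes a file that root owns; nothing to deny here:
eval "$mysqld_cmd --wsrep_recover $wr_options 2> \"$wr_logfile\""
# Hand the log over to the server user only after it has been written:
[ "$euid" = "0" ] && chown "$user" "$wr_logfile"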
Is this somehow related to https://jira.mariadb.org/browse/MDEV-21140, which was "fixed"? I am not sure the fix was ever confirmed.
jplindst, why do you ask instead of checking it yourself?
vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular
fs.protected_regular = 2

vagrant@ubuntu-focal:~$ sudo rm -f /tmp/xx
vagrant@ubuntu-focal:~$ sudo touch /tmp/xx
vagrant@ubuntu-focal:~$ sudo chmod 600 /tmp/xx
vagrant@ubuntu-focal:~$ sudo chown $USER /tmp/xx
vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"
sh: 1: cannot create /tmp/xx: Permission denied

vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular=0
fs.protected_regular = 0
vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"

vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular=1
fs.protected_regular = 1
vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"
sh: 1: cannot create /tmp/xx: Permission denied
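For test hosts where the script cannot be patched right away, the relaxed setting could also be made persistent; a workaround sketch only (the drop-in file name is illustrative), not a substitute for fixing mysqld_safe:
# Workaround sketch for CI/test images only: keep fs.protected_regular=0 across reboots.
echo 'fs.protected_regular = 0' | sudo tee /etc/sysctl.d/99-mariadb-wsrep-tmp.conf
sudo sysctl --system    # reload settings from all sysctl configuration files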
jplindst, can you please proceed and fix this everywhere in Galera? What Alexander did in MDEV-21140 was most likely not the only place where this is needed.
Tentative fix in https://github.com/MariaDB/server/pull/1510. I am doing this only for 10.5 and only for this one chown scenario in order not to get stalled with fixing the bugs I actually have assigned to my name and which I intend to have closed by the end of the month. Buildbot greenness is holy to me. I don't understand how anybody can develop without CI that confirms all the time that no testable regressions exist... I at least can't.
I leave it to Seppo or Jan to fix this properly.
Alternative PR targeting the 10.2 branch, as requested by jplindst: https://github.com/MariaDB/server/pull/1511
I merged otto’s 10.2 change to 10.3, and it does work there. I see that jplindst separately pushed a change to 10.5 without waiting for a merge.
It seems that node_2 does not shut down:
2020-04-06 0:39:19 0 [Note] /usr/sbin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
2020-04-06 0:39:19 0 [Note] WSREP: Shutdown replication
2020-04-06 0:39:19 0 [Note] WSREP: Disconnect provider
2020-04-06 0:39:19 0 [Note] [Debug] WSREP: server debian-sid-amd64 state change: synced -> disconnecting
2020-04-06 0:39:19 0 [Note] WSREP: Server status change synced -> disconnecting
2020-04-06 0:39:19 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-04-06 0:39:19 0 [Note] WSREP: Closing send monitor...
2020-04-06 0:39:19 0 [Note] WSREP: Closed send monitor.
2020-04-06 0:39:19 0 [Note] WSREP: gcomm: terminating thread
2020-04-06 0:39:19 0 [Note] WSREP: gcomm: joining thread
2020-04-06 0:39:19 0 [Note] WSREP: gcomm: closing backend
2020-04-06 0:39:21 0 [Note] WSREP: (83340fb3-a9d2, 'tcp://0.0.0.0:16005') reconnecting to 72936f9e-9ff1 (tcp://127.0.0.1:16002), attempt 0
2020-04-06 0:39:24 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:25 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:25 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:26 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:26 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:27 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:27 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:28 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:28 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:29 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1