Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 10.3 (EOL), 10.4 (EOL), 10.5
- None
Description
Attachments
- mysqld.1.err (241 kB)
- mysqld.2.err (287 kB)
- mysqld-local.1.err (28 kB)
- mysqld-local.2.err (13 kB)
- screenshot-1.png (34 kB)
- stdout-local.log (2 kB)
- xtrabackup_galera_info (0.0 kB)
Activity
Local testing with:
jan@jan-HP-ZBook-15u-G5:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu Focal Fossa (development branch)
Release:        20.04
Codename:       focal
- branch: 10.5 CS
- commit: dcc2eaebbdf658af35a92249a565d80c19967cbe
- how to repeat: perl mysql-test-run.pl --big-test --suite=galera --do-test=galera_sst_mariabackup*
- Error logs attached
Note that on the test machines all SST method tests fail, i.e. mariabackup, mysqldump and rsync, and based on the error logs, in all cases the second node does not shut down.
Locally, I can't reproduce the problems with mysqldump or rsync:
galera.galera_sst_mysqldump 'debug,innodb'          [ pass ]  134393
galera.galera_sst_mysqldump_with_key 'debug,innodb' [ pass ]   28990
galera.galera_sst_rsync 'debug,innodb'              [ pass ]   44524
galera.galera_sst_rsync2 'debug,innodb'             [ pass ]   44538
galera.galera_sst_rsync_data_dir 'debug,innodb'     [ pass ]   44172
And there are no running mysqld processes after mtr has finished.
I'm not sure what you were trying to reproduce by running MTR tests; they pass in the buildbot as well, as you can see from the links that you pasted:
==============================================================================
TEST                                              RESULT   TIME (ms) or COMMENT
------------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
galera.galera_sst_rsync 'innodb,release'          [ pass ]  30769
galera.galera_sst_rsync2 'innodb,release'         [ pass ]  30858
galera.galera_sst_rsync_data_dir 'innodb,release' [ pass ]  30768
------------------------------------------------------------------------------
The servers were restarted 2 times
Spent 92.395 of 138 seconds executing testcases

Completed: All 3 tests were successful.
Consequently, if you are quoting MTR logs as evidence of something that didn't shut down, it is irrelevant; MTR kills the servers at the end anyway.
The problems are unrelated to MTR and start at the next stage, which is also clearly visible from the buildbot output, as every command is written there. The test attempts to start 3 nodes using mysqld_safe, and none of them starts. The server logs for these attempts are node1, node2 and node3 (which doesn't need to be guessed either; it can be seen in the buildbot output). They are attached, but all of them are empty, from which one can conclude that the failure happens before the server is able to write anything to the log.
Further debugging is primitive, as mysqld_safe is just a shell script. Create the configs the same way the test creates them, run mysqld_safe the same way the test does, observe the same silent exit, and debug.
If you do that, you will find that mysqld_safe fails during the WSREP recovery stage, before it actually starts the server (which is why the server error logs remain empty). And this WSREP logic suppresses all output in the process, which is why you don't see anything meaningful in the output either.
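For example, a minimal sketch of such a debugging session, assuming the config path and options from the buildbot output quoted further down (the mysqld_safe path here is the Debian package location and is an assumption):
# Hypothetical debugging sketch: trace the script to see where it exits silently.
sudo sh -x /usr/bin/mysqld_safe --defaults-extra-file=/home/buildbot/node1.cnf \
    --user=mysql --wsrep-new-cluster
# The -x trace prints each command as it runs, so the last lines before the
# silent exit point at the failing block (the WSREP recovery logic discussed below).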
Actual problem
The WSREP recovery problem boils down to this:
sudo rm -f /tmp/xx
sudo touch /tmp/xx
sudo chmod 600 /tmp/xx
sudo chown <existing user> /tmp/xx
sudo sh -c "echo 1 > /tmp/xx"
That's what the logic does with its wr_logfile or the like.
On older systems, it works all right. On Focal and Sid (apparently, on the newest kernels), the last line, the attempt to write into the file under sudo, fails with permission denied. At this point mysqld_safe bails out, and nothing is started.
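If it helps whoever debugs this, the denial can also be observed at the syscall level; a sketch assuming strace is installed (the exact output shape may differ):
# The ">" redirection opens the target with O_CREAT, which is what gets rejected:
sudo strace -f -e trace=openat sh -c "echo 1 > /tmp/xx" 2>&1 | grep /tmp/xx
# Expected to show something like:
# openat(AT_FDCWD, "/tmp/xx", O_WRONLY|O_CREAT|O_TRUNC, 0666) = -1 EACCES (Permission denied)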
Please investigate further and fix as needed: Focal goes GA next week, so our upcoming releases will have to build on it.
I recommend raising this with the kernel developers on https://www.spinics.net/lists/linux-fsdevel/ or on a mailing list specific to the filesystem that fails.
This affects 10.3 and 10.4 as well. I suppose that this could affect 10.1 and 10.2 too, but our Debian Sid builder only covers 10.3 and later. Here is the very first 10.3 failure (which appears to be the very first 10.3 build of that builder).
Note that Debian Sid is in the experimental grid view (link to 10.4).
I updated the title to ensure everybody on this issue is debugging the same thing. The problem is the three tests listed below.
Buildbot tries to start 3 Galera nodes, but none of them starts, and eventually the buildbot run stops on a connection error:
sudo mysqld_safe --defaults-extra-file=/home/buildbot/node1.cnf --user=mysql --wsrep-new-cluster &
...
+ mysql -uroot -prootpass --port=8301 --protocol=tcp -e 'show status like '\''wsrep_cluster_size'\'''
ERROR 2002 (HY000): Can't connect to MySQL server on 'localhost' (115)
Exactly the same happens for all three:
- galera-sst-mariabackup: galera SST with mariabackup failed
- galera-sst-mysqldump: galera SST with mysqldump failed
- galera-sst-rsync: galera SST with rsync failed
Logs and their summaries:
- stdio: ends as mentioned above
- kernel: nothing special to my eye
- syslog:
# boot
Apr 20 17:13:36 debian-sid-amd64 systemd[1]: Started Session 1 of user buildbot.
Apr 20 17:13:36 debian-sid-amd64 systemd[922]: Reached target Main User Target.
Apr 20 17:13:36 debian-sid-amd64 systemd[922]: Startup finished in 598ms.
...
# mysqld runs upgrade and shutdown at the same time
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: mysql.user OK
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: performance_schema
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: Phase 4/7: Running 'mysql_fix_privilege_tables'
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 328: MySQL server has gone away
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 329: MySQL server has gone away
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 330: MySQL server has gone away
...
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: FATAL ERROR: Upgrade failed
Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[980]: Checking for insecure root accounts.
Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] InnoDB: Shutdown completed; log sequence number 49104; transaction id 42
Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] /usr/sbin/mysqld: Shutdown complete
Apr 20 17:13:38 debian-sid-amd64 systemd[1]: mariadb.service: Succeeded.
Apr 20 17:18:44 debian-sid-amd64 systemd[1]: session-5.scope: Succeeded.
Apr 20 17:18:44 debian-sid-amd64 systemd[1]: Started Session 7 of user buildbot.
- node1: empty
- node3: empty
- node2: empty
- mysqld.1.err: irrelevant mtr logs, we need to see node1.err
- mysqld.2.err: irrelevant mtr logs, we need to see node2.err
- mysqld.3.err: irrelevant mtr logs, we need to see node3.err
jplindst, could you find out when this bug started, i.e. what is the commit range of suspected changes that caused it?
Or run the same thing manually and check what node1.err contains on failure?
otto, for 10.3, I posted the very first failure on the Debian Sid builder. It seems to me that the culprit commit could be very old. It could be helpful if elenst provided some more details on how her finding, which looks like a change in the Linux kernel, is related to this.
> Here is the very first 10.3 failure (which appears to be the very first 10.3 build of that builder).
So by *which appears to be the very first 10.3 build of that builder* you mean that somebody added a new builder and it was broken from the outset, and there was no previous commit where it worked? So we are actually not debugging a buildbot failure, but figuring out how to correctly implement a new builder? I find that hard to believe. There must have been a point in time when the builder worked, before the source was updated to the current MariaDB version, or before the virtual image contents were apt-upgraded to the latest Linux distro release. It should be possible to find out what git commit range introduced the failure, or what OS upgrade introduced the error.
If nothing else, take the eoan builder currently passing on the latest 10.5 git head (http://buildbot.askmonty.org/buildbot/builders/kvm-deb-eoan-amd64/builds/990/steps/galera-sst-mariabackup) and start upgrading system software in suitable chunks until you get the failure; then you will know which part of the OS upgrade broke it. A rough sketch of that bisection follows.
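Something like this, on an apt-based image (the package grouping below is arbitrary and purely illustrative):
# Illustrative bisection sketch: upgrade one chunk of packages at a time.
sudo apt-get update
sudo apt-get install --only-upgrade systemd udev          # chunk 1: base system defaults
# Re-run the galera-sst-mariabackup step; if it still passes, take the next chunk:
sudo apt-get install --only-upgrade linux-image-generic   # chunk 2: kernel
# ...and so on, until the failure appears.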
> It could be helpful if Elena Stepanova provided some more details of how her finding that looks like a change in the Linux kernel is related to this.
I really don't know what else I can provide. mysqld_safe attempts to write into a file which exists in /tmp with permissions that make it impossible to write into; the write fails, and mysqld_safe aborts. I provided an example of the permissions and the failure to write, identical to what mysqld_safe does. I don't know how it can be any clearer. It cannot be fixed by the power of thought; somebody needs to actually run it, debug a tiny part of the not-so-big mysqld_safe shell script, and find a way to do it better, assuming that it was an intentional change in the system and not a system bug. If it is a bug, then it may eventually be fixed by just waiting for an upgrade with a patch, but if that's our approach, whoever takes it should make sure the system developers are aware of the problem.
I confirm that the steps outlined by elenst fail on Ubuntu Focal, but which log did you find them in, or which lines of code in https://github.com/MariaDB/server/blob/10.5/scripts/mysqld_safe.sh are you simulating?
vagrant@ubuntu-focal:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04 LTS
Release:        20.04
Codename:       focal
vagrant@ubuntu-focal:~$ sudo rm -f /tmp/xx
vagrant@ubuntu-focal:~$ sudo touch /tmp/xx
vagrant@ubuntu-focal:~$ sudo chmod 600 /tmp/xx
vagrant@ubuntu-focal:~$ sudo chown $USER /tmp/xx
vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"
sh: 1: cannot create /tmp/xx: Permission denied
There are no journal events, no AppArmor denials, etc. visible. The filesystem is ext4.
It also happens without any sudo/sh:
root@ubuntu-focal:/tmp# ll xx
-rw------- 1 vagrant root 2 Apr 21 12:28 xx
root@ubuntu-focal:/tmp# echo 1 > /tmp/xx
bash: /tmp/xx: Permission denied

# Goes away if root owns the file
root@ubuntu-focal:/tmp# chown root xx
root@ubuntu-focal:/tmp# echo 1 > /tmp/xx
root@ubuntu-focal:/tmp#
otto, there is no log; wsrep is uncharacteristically quiet in there. Luckily it's shell, so it doesn't require much to debug. Setting -x is already a good step forward.
For the logic, it's the part starting from line 229 at your link, or search for wr_logfile.
mysqld_safe is run under sudo with --user=mysql, as usual. First it creates a temporary file like this:
local wr_logfile=$(mktemp /tmp/wsrep_recovery.XXXXXX)
A few lines later it changes ownership and permissions like this:
[ "$euid" = "0" ] && chown $user $wr_logfile
|
chmod 600 $wr_logfile
|
A few more lines later it attempts to run the server (WSREP recovery, apparently) like this:
eval "$mysqld_cmd --wsrep_recover $wr_options 2> $wr_logfile"
|
Nothing is written to the file due to the permissions problem; the script greps the file, doesn't find what it's looking for, and bails out.
It is supposed to print an error message upon abort, like "WSREP: Failed to recover position", but that is not seen anywhere in the logs, and I didn't dig into the reasons for it. Maybe logging isn't properly initialized at this point, or there is some other reason; hopefully whoever debugs this will find out.
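To make the ordering constraint concrete, here is a minimal sketch of one conceivable rearrangement, an illustration only and not a claim about what the eventual fix does: the redirection is opened by the shell running as root, so as long as root still owns the file at that moment, the write succeeds, and the chown can happen afterwards (an already-open, inherited descriptor is not affected by the ownership change).
# Hypothetical reordering sketch (variable names as in mysqld_safe.sh):
wr_logfile=$(mktemp /tmp/wsrep_recovery.XXXXXX)   # created owned by root
chmod 600 "$wr_logfile"
# Root opens and writes a file that root owns; nothing to deny here:
eval "$mysqld_cmd --wsrep_recover $wr_options 2> \"$wr_logfile\""
# Hand the log over to the server user only after it has been written:
[ "$euid" = "0" ] && chown "$user" "$wr_logfile"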
Is this somehow related to https://jira.mariadb.org/browse/MDEV-21140, which was "fixed"? I am not sure the fix was ever confirmed.
jplindst, why do you ask instead of checking it yourself?
vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular
fs.protected_regular = 2

vagrant@ubuntu-focal:~$ sudo rm -f /tmp/xx
vagrant@ubuntu-focal:~$ sudo touch /tmp/xx
vagrant@ubuntu-focal:~$ sudo chmod 600 /tmp/xx
vagrant@ubuntu-focal:~$ sudo chown $USER /tmp/xx
vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"
sh: 1: cannot create /tmp/xx: Permission denied

vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular=0
fs.protected_regular = 0
vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"

vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular=1
fs.protected_regular = 1
vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"
sh: 1: cannot create /tmp/xx: Permission denied
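For test hosts where the script cannot be patched right away, the relaxed setting could also be made persistent; a workaround sketch only (the drop-in file name is illustrative), not a substitute for fixing mysqld_safe:
# Workaround sketch for CI/test images only: keep fs.protected_regular=0 across reboots.
echo 'fs.protected_regular = 0' | sudo tee /etc/sysctl.d/99-mariadb-wsrep-tmp.conf
sudo sysctl --system    # reload settings from all sysctl configuration files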
jplindst, can you please proceed and fix this everywhere in Galera? What Alexander did in MDEV-21140 was most likely not the only place where this is needed.
Tentative fix in https://github.com/MariaDB/server/pull/1510. I am doing this only for 10.5 and only for this one chown scenario in order not to get stalled with fixing the bugs I actually have assigned to my name and which I intend to have closed by the end of the month. Buildbot greenness is holy to me. I don't understand how anybody can develop without CI that confirms all the time that no testable regressions exist... I at least can't.
I leave it to Seppo or Jan to fix this properly.
Alternative PR targeting the 10.2 branch, as requested by jplindst: https://github.com/MariaDB/server/pull/1511
I merged otto’s 10.2 change to 10.3, and it does work there. I see that jplindst separately pushed a change to 10.5 without waiting for a merge.
It seems that node_2 does not shut down:
2020-04-06 0:39:19 0 [Note] /usr/sbin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
2020-04-06 0:39:19 0 [Note] WSREP: Shutdown replication
2020-04-06 0:39:19 0 [Note] WSREP: Disconnect provider
2020-04-06 0:39:19 0 [Note] [Debug] WSREP: server debian-sid-amd64 state change: synced -> disconnecting
2020-04-06 0:39:19 0 [Note] WSREP: Server status change synced -> disconnecting
2020-04-06 0:39:19 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-04-06 0:39:19 0 [Note] WSREP: Closing send monitor...
2020-04-06 0:39:19 0 [Note] WSREP: Closed send monitor.
2020-04-06 0:39:19 0 [Note] WSREP: gcomm: terminating thread
2020-04-06 0:39:19 0 [Note] WSREP: gcomm: joining thread
2020-04-06 0:39:19 0 [Note] WSREP: gcomm: closing backend
2020-04-06 0:39:21 0 [Note] WSREP: (83340fb3-a9d2, 'tcp://0.0.0.0:16005') reconnecting to 72936f9e-9ff1 (tcp://127.0.0.1:16002), attempt 0
2020-04-06 0:39:24 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:25 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:25 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:26 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:26 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:27 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:27 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:28 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:28 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
2020-04-06 0:39:29 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1