Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-22159

Galera SST tests fail on Debian sid and focal: all SST method tests fail i.e. mariabackup, mysqldump and rsync

Details

    Attachments

      1. mysqld.1.err
        241 kB
      2. mysqld.2.err
        287 kB
      3. mysqld-local.1.err
        28 kB
      4. mysqld-local.2.err
        13 kB
      5. screenshot-1.png
        screenshot-1.png
        34 kB
      6. stdout-local.log
        2 kB
      7. xtrabackup_galera_info
        0.0 kB

      Activity

        It seems that node_2 does not shutdown

        2020-04-06  0:39:19 0 [Note] /usr/sbin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
        2020-04-06  0:39:19 0 [Note] WSREP: Shutdown replication
        2020-04-06  0:39:19 0 [Note] WSREP: Disconnect provider
        2020-04-06  0:39:19 0 [Note] [Debug] WSREP: server debian-sid-amd64 state change: synced -> disconnecting
        2020-04-06  0:39:19 0 [Note] WSREP: Server status change synced -> disconnecting
        2020-04-06  0:39:19 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
        2020-04-06  0:39:19 0 [Note] WSREP: Closing send monitor...
        2020-04-06  0:39:19 0 [Note] WSREP: Closed send monitor.
        2020-04-06  0:39:19 0 [Note] WSREP: gcomm: terminating thread
        2020-04-06  0:39:19 0 [Note] WSREP: gcomm: joining thread
        2020-04-06  0:39:19 0 [Note] WSREP: gcomm: closing backend
        2020-04-06  0:39:21 0 [Note] WSREP: (83340fb3-a9d2, 'tcp://0.0.0.0:16005') reconnecting to 72936f9e-9ff1 (tcp://127.0.0.1:16002), attempt 0
        2020-04-06  0:39:24 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        2020-04-06  0:39:25 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        2020-04-06  0:39:25 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        2020-04-06  0:39:26 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        2020-04-06  0:39:26 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        2020-04-06  0:39:27 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        2020-04-06  0:39:27 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        2020-04-06  0:39:28 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        2020-04-06  0:39:28 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        2020-04-06  0:39:29 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        

        jplindst Jan Lindström (Inactive) added a comment - It seems that node_2 does not shutdown 2020-04-06 0:39:19 0 [Note] /usr/sbin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown 2020-04-06 0:39:19 0 [Note] WSREP: Shutdown replication 2020-04-06 0:39:19 0 [Note] WSREP: Disconnect provider 2020-04-06 0:39:19 0 [Note] [Debug] WSREP: server debian-sid-amd64 state change: synced -> disconnecting 2020-04-06 0:39:19 0 [Note] WSREP: Server status change synced -> disconnecting 2020-04-06 0:39:19 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. 2020-04-06 0:39:19 0 [Note] WSREP: Closing send monitor... 2020-04-06 0:39:19 0 [Note] WSREP: Closed send monitor. 2020-04-06 0:39:19 0 [Note] WSREP: gcomm: terminating thread 2020-04-06 0:39:19 0 [Note] WSREP: gcomm: joining thread 2020-04-06 0:39:19 0 [Note] WSREP: gcomm: closing backend 2020-04-06 0:39:21 0 [Note] WSREP: (83340fb3-a9d2, 'tcp://0.0.0.0:16005') reconnecting to 72936f9e-9ff1 (tcp://127.0.0.1:16002), attempt 0 2020-04-06 0:39:24 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1 2020-04-06 0:39:25 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1 2020-04-06 0:39:25 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1 2020-04-06 0:39:26 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1 2020-04-06 0:39:26 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1 2020-04-06 0:39:27 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1 2020-04-06 0:39:27 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1 2020-04-06 0:39:28 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1 2020-04-06 0:39:28 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1 2020-04-06 0:39:29 0 [Note] WSREP: evs::proto(83340fb3-a9d2, LEAVING, view_id(REG,72936f9e-9ff1,8)) suspecting node: 72936f9e-9ff1
        jplindst Jan Lindström (Inactive) added a comment - - edited

        Local testing with

        jan@jan-HP-ZBook-15u-G5:~$ lsb_release -a
        No LSB modules are available.
        Distributor ID:	Ubuntu
        Description:	Ubuntu Focal Fossa (development branch)
        Release:	20.04
        Codename:	focal
        

        • branch: 10.5 CS
        • commit: dcc2eaebbdf658af35a92249a565d80c19967cbe
        • how to repeat: perl mysql-test-run.pl --big-test --suite=galera --do-test=galera_sst_mariabackup*
        • Error logs attached
        jplindst Jan Lindström (Inactive) added a comment - - edited Local testing with jan@jan-HP-ZBook-15u-G5:~$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu Focal Fossa (development branch) Release: 20.04 Codename: focal branch: 10.5 CS commit: dcc2eaebbdf658af35a92249a565d80c19967cbe how to repeat: perl mysql-test-run.pl --big-test --suite=galera --do-test=galera_sst_mariabackup* Error logs attached

        Note that in test machines all SST method tests fail i.e. mariabackup, mysqldump and rsync and based on error logs all cases second node does not shutdown.

        jplindst Jan Lindström (Inactive) added a comment - Note that in test machines all SST method tests fail i.e. mariabackup, mysqldump and rsync and based on error logs all cases second node does not shutdown.

        Locally, I can't repeat problems on mysqldump or rsync:

        galera.galera_sst_mysqldump 'debug,innodb' [ pass ]  134393
        galera.galera_sst_mysqldump_with_key 'debug,innodb' [ pass ]  28990
        galera.galera_sst_rsync 'debug,innodb'   [ pass ]  44524
        galera.galera_sst_rsync2 'debug,innodb'  [ pass ]  44538
        galera.galera_sst_rsync_data_dir 'debug,innodb' [ pass ]  44172
        

        And there is no running mysqld processes after mtr has finished.

        jplindst Jan Lindström (Inactive) added a comment - Locally, I can't repeat problems on mysqldump or rsync: galera.galera_sst_mysqldump 'debug,innodb' [ pass ] 134393 galera.galera_sst_mysqldump_with_key 'debug,innodb' [ pass ] 28990 galera.galera_sst_rsync 'debug,innodb' [ pass ] 44524 galera.galera_sst_rsync2 'debug,innodb' [ pass ] 44538 galera.galera_sst_rsync_data_dir 'debug,innodb' [ pass ] 44172 And there is no running mysqld processes after mtr has finished.
        elenst Elena Stepanova added a comment - - edited

        I'm not sure what you were trying to reproduce by running MTR tests, they pass in the buildbot as well, you can see it by the links that you pasted:

        ==============================================================================
         
        TEST                                      RESULT   TIME (ms) or COMMENT
        --------------------------------------------------------------------------
         
        worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
        galera.galera_sst_rsync 'innodb,release' [ pass ]  30769
        galera.galera_sst_rsync2 'innodb,release' [ pass ]  30858
        galera.galera_sst_rsync_data_dir 'innodb,release' [ pass ]  30768
        --------------------------------------------------------------------------
        The servers were restarted 2 times
        Spent 92.395 of 138 seconds executing testcases
         
        Completed: All 3 tests were successful.
        

        Consequently, if you are quoting MTR logs for something that didn't shut down, it is irrelevant, MTR will kill the servers at the end anyway.

        The problems are unrelated to MTR and start at the next stage, which is also clearly visible from the buildbot output, as every command is written there. The test attempts to start 3 nodes using mysqld_safe, and none of them starts. The server logs for these attempts are node1, node2, node3 (which also doesn't need to be guessed, it can be seen in the buildbot output), they are attached, but all of them are empty, from which one can conclude that the failure happens before the server is able to write anything in the log.

        Further debugging is primitive, as mysqld_safe is just a shell script. You create the configs the same way the test creates them, run mysqld_safe the same way the test does, observe the same silent exit, debug.

        If you did that, you would find out that mysqld_safe fails during WSREP recovery stage, before it actually starts the server (which is why server error logs remain empty). And this WSREP logic disables all the output in the process, which is why you don't see anything meaningful in the output either.

        Actual problem

        The WSEP recovery problem boils down to this:

        sudo rm -f /tmp/xx
        sudo touch /tmp/xx
        sudo chmod 600 /tmp/xx
        sudo chown <existing user> /tmp/xx
        sudo sh -c "echo 1 > /tmp/xx"
        

        That's what the logic does with some wr_logfile or alike.
        On existing systems, it works all right. On Focal and Sid (apparently, on the newest kernels), the last line, the attempt to write into the file under sudo, fails with permission denied. At this point mysqld_safe bails out, nothing is started.

        Please investigate further and fix as needed, as Focal goes GA next week, so our upcoming releases will have to build on it.

        elenst Elena Stepanova added a comment - - edited I'm not sure what you were trying to reproduce by running MTR tests, they pass in the buildbot as well, you can see it by the links that you pasted: ==============================================================================   TEST RESULT TIME (ms) or COMMENT --------------------------------------------------------------------------   worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019 galera.galera_sst_rsync 'innodb,release' [ pass ] 30769 galera.galera_sst_rsync2 'innodb,release' [ pass ] 30858 galera.galera_sst_rsync_data_dir 'innodb,release' [ pass ] 30768 -------------------------------------------------------------------------- The servers were restarted 2 times Spent 92.395 of 138 seconds executing testcases   Completed: All 3 tests were successful. Consequently, if you are quoting MTR logs for something that didn't shut down, it is irrelevant, MTR will kill the servers at the end anyway. The problems are unrelated to MTR and start at the next stage, which is also clearly visible from the buildbot output, as every command is written there. The test attempts to start 3 nodes using mysqld_safe , and none of them starts. The server logs for these attempts are node1 , node2 , node3 (which also doesn't need to be guessed, it can be seen in the buildbot output), they are attached, but all of them are empty, from which one can conclude that the failure happens before the server is able to write anything in the log. Further debugging is primitive, as mysqld_safe is just a shell script. You create the configs the same way the test creates them, run mysqld_safe the same way the test does, observe the same silent exit, debug. If you did that, you would find out that mysqld_safe fails during WSREP recovery stage, before it actually starts the server (which is why server error logs remain empty). And this WSREP logic disables all the output in the process, which is why you don't see anything meaningful in the output either. Actual problem The WSEP recovery problem boils down to this: sudo rm -f /tmp/xx sudo touch /tmp/xx sudo chmod 600 /tmp/xx sudo chown <existing user> /tmp/xx sudo sh -c "echo 1 > /tmp/xx" That's what the logic does with some wr_logfile or alike. On existing systems, it works all right. On Focal and Sid (apparently, on the newest kernels), the last line, the attempt to write into the file under sudo , fails with permission denied. At this point mysqld_safe bails out, nothing is started. Please investigate further and fix as needed, as Focal goes GA next week, so our upcoming releases will have to build on it.
        danblack Daniel Black added a comment -

        Recommend raising this to the kernel developers on https://www.spinics.net/lists/linux-fsdevel/ or a mailing list specific to the file-system that fails.

        danblack Daniel Black added a comment - Recommend raising this to the kernel developers on https://www.spinics.net/lists/linux-fsdevel/ or a mailing list specific to the file-system that fails.
        marko Marko Mäkelä added a comment - - edited

        This affects 10.3 and 10.4 as well. I suppose that this could affect 10.1 and 10.2 too, but our Debian Sid builder only covers 10.3 and later. Here is the very first 10.3 failure (which appears to be the very first 10.3 build of that builder).
        Note that Debian Sid is in the experimental grid view (link to 10.4).

        marko Marko Mäkelä added a comment - - edited This affects 10.3 and 10.4 as well. I suppose that this could affect 10.1 and 10.2 too, but our Debian Sid builder only covers 10.3 and later. Here is the very first 10.3 failure (which appears to be the very first 10.3 build of that builder). Note that Debian Sid is in the experimental grid view ( link to 10.4 ).

        I updated the title to ensure everybody on this issue are debugging the same thing. So the problem is these three tests:

        Buildbot tries to start 3 Galera nodes, but none of them start and eventually buildbot run stops on connection error:

        sudo mysqld_safe --defaults-extra-file=/home/buildbot/node1.cnf --user=mysql --wsrep-new-cluster &
        ...
        + mysql -uroot -prootpass --port=8301 --protocol=tcp -e 'show status like '\''wsrep_cluster_size'\'''
        ERROR 2002 (HY000): Can't connect to MySQL server on 'localhost' (115)
        

        Exactly the same happens for all tree:

        • galera-sst-mariabackup galera SST with mariabackup failed
        • galera-sst-mysqldump galera SST with mysqldump failed
        • galera-sst-rsync galera SST with rsync failed

        Logs and their summaries:

        • stdio: ends as mentioned above
        • kernel: nothing special to my eye
        • syslog:

        # boot
        Apr 20 17:13:36 debian-sid-amd64 systemd[1]: Started Session 1 of user buildbot.
        Apr 20 17:13:36 debian-sid-amd64 systemd[922]: Reached target Main User Target.
        Apr 20 17:13:36 debian-sid-amd64 systemd[922]: Startup finished in 598ms.
        ...
        # mysqld runs upgrade and shutdown at the same time
        Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: mysql.user                                         OK
        Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: performance_schema
        Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: Phase 4/7: Running 'mysql_fix_privilege_tables'
        Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 328: MySQL server has gone away
        Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 329: MySQL server has gone away
        Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 330: MySQL server has gone away
        ...
        Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: FATAL ERROR: Upgrade failed
        Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[980]: Checking for insecure root accounts.
        Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] InnoDB: Shutdown completed; log sequence number 49104; transaction id 42
        Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
        Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] /usr/sbin/mysqld: Shutdown complete
        Apr 20 17:13:38 debian-sid-amd64 systemd[1]: mariadb.service: Succeeded.
        Apr 20 17:18:44 debian-sid-amd64 systemd[1]: session-5.scope: Succeeded.
        Apr 20 17:18:44 debian-sid-amd64 systemd[1]: Started Session 7 of user buildbot.
        

        • node1: empty
        • node3: empty
        • node2: empty
        • mysqld.1.err: irrelevant mtr logs, we need to see node1.err
        • mysqld.2.err: irrelevant mtr logs, we need to see node2.err
        • mysqld.3.err: irrelevant mtr logs, we need to see node3.err
        otto Otto Kekäläinen added a comment - I updated the title to ensure everybody on this issue are debugging the same thing. So the problem is these three tests: Buildbot tries to start 3 Galera nodes, but none of them start and eventually buildbot run stops on connection error: sudo mysqld_safe --defaults-extra-file=/home/buildbot/node1.cnf --user=mysql --wsrep- new -cluster & ... + mysql -uroot -prootpass --port= 8301 --protocol=tcp -e 'show status like ' \ '' wsrep_cluster_size '\'' ' ERROR 2002 (HY000): Can 't connect to MySQL server on ' localhost' ( 115 ) Exactly the same happens for all tree: galera-sst-mariabackup galera SST with mariabackup failed galera-sst-mysqldump galera SST with mysqldump failed galera-sst-rsync galera SST with rsync failed Logs and their summaries: stdio: ends as mentioned above kernel: nothing special to my eye syslog: # boot Apr 20 17:13:36 debian-sid-amd64 systemd[1]: Started Session 1 of user buildbot. Apr 20 17:13:36 debian-sid-amd64 systemd[922]: Reached target Main User Target. Apr 20 17:13:36 debian-sid-amd64 systemd[922]: Startup finished in 598ms. ... # mysqld runs upgrade and shutdown at the same time Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: mysql.user OK Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: performance_schema Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: Phase 4/7: Running 'mysql_fix_privilege_tables' Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 328: MySQL server has gone away Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 329: MySQL server has gone away Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: ERROR 2006 (HY000) at line 330: MySQL server has gone away ... Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[642]: FATAL ERROR: Upgrade failed Apr 20 17:13:37 debian-sid-amd64 /etc/mysql/debian-start[980]: Checking for insecure root accounts. Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] InnoDB: Shutdown completed; log sequence number 49104; transaction id 42 Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1" Apr 20 17:13:38 debian-sid-amd64 mysqld[617]: 2020-04-20 17:13:38 0 [Note] /usr/sbin/mysqld: Shutdown complete Apr 20 17:13:38 debian-sid-amd64 systemd[1]: mariadb.service: Succeeded. Apr 20 17:18:44 debian-sid-amd64 systemd[1]: session-5.scope: Succeeded. Apr 20 17:18:44 debian-sid-amd64 systemd[1]: Started Session 7 of user buildbot. node1: empty node3: empty node2: empty mysqld.1.err: irrelevant mtr logs, we need to see node1.err mysqld.2.err: irrelevant mtr logs, we need to see node2.err mysqld.3.err: irrelevant mtr logs, we need to see node3.err

        jplindst Could you find out when this bug started, i.e. what is the commit range of suspected changes that caused it?

        Or run the same thing manually and check out what node1.err contains on failure?

        otto Otto Kekäläinen added a comment - jplindst Could you find out when this bug started, i.e. what is the commit range of suspected changes that caused it? Or run the same thing manually and check out what node1.err contains on failure?

        otto, for 10.3, I posted the very first failure on the Debian Sid builder. It seems to me that the culprit commit could be very old. It could be helpful if elenst provided some more details of how her finding that looks like a change in the Linux kernel is related to this.

        marko Marko Mäkelä added a comment - otto , for 10.3, I posted the very first failure on the Debian Sid builder. It seems to me that the culprit commit could be very old. It could be helpful if elenst provided some more details of how her finding that looks like a change in the Linux kernel is related to this.

        > Here is the very first 10.3 failure (which appears to be the very first 10.3 build of that builder).

        So you mean by *which appears to be the very first 10.3 build of that builder* that somebody added a new builder and it was broken from the onset, and there was no previous commit where it worked? So we are actually not debugging a buildbot failure, but figuring out how to correctly implement a new builder? I find that hard to believe. There must have been a point in time when the builder worked before the source was updated to current MariaDB version or before the virtual image contents was apt upgraded into latest Linux distro release. It should be possible to find out that git commit range introduced the failure or what OS upgrade introduced the error.

        If not else, take the eoan builder currently passing on latest 10.5 git head (http://buildbot.askmonty.org/buildbot/builders/kvm-deb-eoan-amd64/builds/990/steps/galera-sst-mariabackup) and start upgrading system software in suitable chuncks until you get the failure, then you will know what OS upgrade part broke it.

        otto Otto Kekäläinen added a comment - > Here is the very first 10.3 failure (which appears to be the very first 10.3 build of that builder). So you mean by * which appears to be the very first 10.3 build of that builder * that somebody added a new builder and it was broken from the onset, and there was no previous commit where it worked? So we are actually not debugging a buildbot failure, but figuring out how to correctly implement a new builder? I find that hard to believe. There must have been a point in time when the builder worked before the source was updated to current MariaDB version or before the virtual image contents was apt upgraded into latest Linux distro release. It should be possible to find out that git commit range introduced the failure or what OS upgrade introduced the error. If not else, take the eoan builder currently passing on latest 10.5 git head ( http://buildbot.askmonty.org/buildbot/builders/kvm-deb-eoan-amd64/builds/990/steps/galera-sst-mariabackup ) and start upgrading system software in suitable chuncks until you get the failure, then you will know what OS upgrade part broke it.
        elenst Elena Stepanova added a comment - - edited

        It could be helpful if Elena Stepanova provided some more details of how her finding that looks like a change in the Linux kernel is related to this.

        I really don't know what else I can provide. mysqld_safe attempts to write into a file which exists in /tmp with certain permissions which make it impossible to write into, it fails, mysqld_safe aborts. I provided an example of permissions and failure to write, identical to what mysqld_safe does. I don't know how it can be any clearer. It cannot be fixed by the power of thought, somebody needs to actually run it, debug a tiny part of not-so-big mysqld_safe shell script and find a way to do it better, assuming that it was an intentional change in the system and not a system bug. If it is a bug, then it may be eventually fixed by just waiting for an upgrade with a patch, but if that's our approach, whoever takes it should make sure the system developers are aware of it.

        elenst Elena Stepanova added a comment - - edited It could be helpful if Elena Stepanova provided some more details of how her finding that looks like a change in the Linux kernel is related to this. I really don't know what else I can provide. mysqld_safe attempts to write into a file which exists in /tmp with certain permissions which make it impossible to write into, it fails, mysqld_safe aborts. I provided an example of permissions and failure to write, identical to what mysqld_safe does. I don't know how it can be any clearer. It cannot be fixed by the power of thought, somebody needs to actually run it, debug a tiny part of not-so-big mysqld_safe shell script and find a way to do it better, assuming that it was an intentional change in the system and not a system bug. If it is a bug, then it may be eventually fixed by just waiting for an upgrade with a patch, but if that's our approach, whoever takes it should make sure the system developers are aware of it.

        I confirm the steps outlined by elenst fail on Ubuntu Focal, but what log did you find them from, or which lines of code in https://github.com/MariaDB/server/blob/10.5/scripts/mysqld_safe.sh do you simulate?

        vagrant@ubuntu-focal:~$ lsb_release -a
        No LSB modules are available.
        Distributor ID:	Ubuntu
        Description:	Ubuntu 20.04 LTS
        Release:	20.04
        Codename:	focal
        vagrant@ubuntu-focal:~$ sudo rm -f /tmp/xx
        vagrant@ubuntu-focal:~$ sudo touch /tmp/xx
        vagrant@ubuntu-focal:~$ sudo chmod 600 /tmp/xx
        vagrant@ubuntu-focal:~$ sudo chown $USER /tmp/xx
        vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"
        sh: 1: cannot create /tmp/xx: Permission denied
        

        There are no journal events, no AppArmor denies etc visible. Filesystem ext4.

        Also happens without any sudo/sh:

        root@ubuntu-focal:/tmp# ll xx
        -rw------- 1 vagrant root 2 Apr 21 12:28 xx
        root@ubuntu-focal:/tmp# echo 1 > /tmp/xx
        bash: /tmp/xx: Permission denied
         
        # Goes away if root owns the file
        root@ubuntu-focal:/tmp# chown root xx
        root@ubuntu-focal:/tmp# echo 1 > /tmp/xx
        root@ubuntu-focal:/tmp# 
        

        otto Otto Kekäläinen added a comment - I confirm the steps outlined by elenst fail on Ubuntu Focal, but what log did you find them from, or which lines of code in https://github.com/MariaDB/server/blob/10.5/scripts/mysqld_safe.sh do you simulate? vagrant@ubuntu-focal:~$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 20.04 LTS Release: 20.04 Codename: focal vagrant@ubuntu-focal:~$ sudo rm -f /tmp/xx vagrant@ubuntu-focal:~$ sudo touch /tmp/xx vagrant@ubuntu-focal:~$ sudo chmod 600 /tmp/xx vagrant@ubuntu-focal:~$ sudo chown $USER /tmp/xx vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx" sh: 1: cannot create /tmp/xx: Permission denied There are no journal events, no AppArmor denies etc visible. Filesystem ext4. Also happens without any sudo/sh: root@ubuntu-focal:/tmp# ll xx -rw------- 1 vagrant root 2 Apr 21 12:28 xx root@ubuntu-focal:/tmp# echo 1 > /tmp/xx bash: /tmp/xx: Permission denied   # Goes away if root owns the file root@ubuntu-focal:/tmp# chown root xx root@ubuntu-focal:/tmp# echo 1 > /tmp/xx root@ubuntu-focal:/tmp#
        elenst Elena Stepanova added a comment - - edited

        otto
        There is no log, wsrep is uncharacteristically quiet in there. Luckily it's shell, doesn't require much to debug. Setting -x is already a good step forward.

        For the logic, it's the part starting from line 229 by your link, or search for wr_logfile.
        mysqld_safe is run under sudo with --user=mysql, as usual. FIrst it creates a temporary file like that

         local wr_logfile=$(mktemp /tmp/wsrep_recovery.XXXXXX)
        

        A few lines later it changes ownership and permissions like that

           [ "$euid" = "0" ] && chown $user $wr_logfile
            chmod 600 $wr_logfile
        

        A few more lines later it attempts to run server (wsrep recovery apparently like that

        eval "$mysqld_cmd --wsrep_recover $wr_options 2> $wr_logfile"
        

        Nothing is written to the file due to the permissions problem, it greps the file, doesn't find in there what it's looking for, and bails out.
        It is supposed to print an error message upon abort, like "WSREP: Failed to recover position", it is not seen anywhere in the logs, I didn't dig into the reasons of it. Maybe logging isn't properly initialized at this point, or there are some other reasons, hopefully whoever debugs it will find out.

        elenst Elena Stepanova added a comment - - edited otto There is no log, wsrep is uncharacteristically quiet in there. Luckily it's shell, doesn't require much to debug. Setting -x is already a good step forward. For the logic, it's the part starting from line 229 by your link, or search for wr_logfile . mysqld_safe is run under sudo with --user=mysql , as usual. FIrst it creates a temporary file like that local wr_logfile=$(mktemp /tmp/wsrep_recovery.XXXXXX) A few lines later it changes ownership and permissions like that [ "$euid" = "0" ] && chown $user $wr_logfile chmod 600 $wr_logfile A few more lines later it attempts to run server (wsrep recovery apparently like that eval "$mysqld_cmd --wsrep_recover $wr_options 2> $wr_logfile" Nothing is written to the file due to the permissions problem, it greps the file, doesn't find in there what it's looking for, and bails out. It is supposed to print an error message upon abort, like "WSREP: Failed to recover position" , it is not seen anywhere in the logs, I didn't dig into the reasons of it. Maybe logging isn't properly initialized at this point, or there are some other reasons, hopefully whoever debugs it will find out.

        Is this somehow related to https://jira.mariadb.org/browse/MDEV-21140 that was "fixed" but not sure if fix was confirmed.

        jplindst Jan Lindström (Inactive) added a comment - Is this somehow related to https://jira.mariadb.org/browse/MDEV-21140 that was "fixed" but not sure if fix was confirmed.

        jplindst Why do you ask instead of checking it yourself?

        vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular 
        fs.protected_regular = 2
         
        vagrant@ubuntu-focal:~$ sudo rm -f /tmp/xx
        vagrant@ubuntu-focal:~$ sudo touch /tmp/xx
        vagrant@ubuntu-focal:~$ sudo chmod 600 /tmp/xx
        vagrant@ubuntu-focal:~$ sudo chown $USER /tmp/xx
        vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"
        sh: 1: cannot create /tmp/xx: Permission denied
         
        vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular=0
        fs.protected_regular = 0
        vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"
         
        vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular=1
        fs.protected_regular = 1
        vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"
        sh: 1: cannot create /tmp/xx: Permission denied
        

        Can you jplindst please proceed and fix this everywhere in Galera? What Alexander did in MDEV-21140 was most likely not the only place where this is needed.

        otto Otto Kekäläinen added a comment - jplindst Why do you ask instead of checking it yourself? vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular fs.protected_regular = 2   vagrant@ubuntu-focal:~$ sudo rm -f /tmp/xx vagrant@ubuntu-focal:~$ sudo touch /tmp/xx vagrant@ubuntu-focal:~$ sudo chmod 600 /tmp/xx vagrant@ubuntu-focal:~$ sudo chown $USER /tmp/xx vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx" sh: 1: cannot create /tmp/xx: Permission denied   vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular=0 fs.protected_regular = 0 vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx"   vagrant@ubuntu-focal:~$ sudo sysctl fs.protected_regular=1 fs.protected_regular = 1 vagrant@ubuntu-focal:~$ sudo sh -c "echo 1 > /tmp/xx" sh: 1: cannot create /tmp/xx: Permission denied Can you jplindst please proceed and fix this everywhere in Galera? What Alexander did in MDEV-21140 was most likely not the only place where this is needed.

        Tentative fix in https://github.com/MariaDB/server/pull/1510. I am doing this only for 10.5 and only for this one chown scenario in order to not get stalled with fixing the bugs I actually have assigned on my name and which I indend to have closed by the end of the month. Buildbot greenness is holy for me. I don't understand how anybody can develop without having CI that confirms all the time that no testable regressions exist... I at least can't.

        I leave it to Seppo or Jan to fix this properly.

        otto Otto Kekäläinen added a comment - Tentative fix in https://github.com/MariaDB/server/pull/1510 . I am doing this only for 10.5 and only for this one chown scenario in order to not get stalled with fixing the bugs I actually have assigned on my name and which I indend to have closed by the end of the month. Buildbot greenness is holy for me. I don't understand how anybody can develop without having CI that confirms all the time that no testable regressions exist... I at least can't. I leave it to Seppo or Jan to fix this properly.

        Alternative PR targetting 10.2 branch as requested by jplindst: https://github.com/MariaDB/server/pull/1511

        otto Otto Kekäläinen added a comment - Alternative PR targetting 10.2 branch as requested by jplindst : https://github.com/MariaDB/server/pull/1511

        I merged otto’s 10.2 change to 10.3, and it does work there. I see that jplindst separately pushed a change to 10.5 without waiting for a merge.

        marko Marko Mäkelä added a comment - I merged otto ’s 10.2 change to 10.3, and it does work there. I see that jplindst separately pushed a change to 10.5 without waiting for a merge.

        People

          jplindst Jan Lindström (Inactive)
          jplindst Jan Lindström (Inactive)
          Votes:
          1 Vote for this issue
          Watchers:
          7 Start watching this issue

          Dates

            Created:
            Updated:
            Resolved:

            Git Integration

              Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.