MariaDB Server / MDEV-25880

rsync may be mistakenly killed when overlapping SST

Details

    Description

      This bug was originally seen in the galera_nbo_sst_slave mtr test on 10.6, but it is relevant for all versions. It can cause intermittent failures of rsync-based SST when a server restarts very quickly: the new SST process (for example, one started by a new server instance) overlaps the old SST process left over from the previous, already terminated server. Because of this overlap, the new rsync may be killed instead of the old rsync, or the pid file belonging to the new rsync may be removed, which then leads to problems.
      For example:

      2021-06-09  3:28:56 0 [Warning] WSREP: 0.0 (panda): State transfer to 1.0 (panda) failed: -11 (Resource temporarily unavailable)
      2021-06-09  3:28:56 0 [ERROR] WSREP: /home/panda/galera-es-4.x/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1205: Will never receive state. Need to abort.
      2021-06-09  3:28:56 0 [Note] WSREP: gcomm: terminating thread
      2021-06-09  3:28:56 0 [Note] WSREP: gcomm: joining thread
      2021-06-09  3:28:56 0 [Note] WSREP: gcomm: closing backend
      2021-06-09  3:28:56 2 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 24 bytes
      2021-06-09  3:28:56 2 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 1/2 locked buffers
      2021-06-09  3:28:57 0 [Note] WSREP: PC protocol downgrade 1 -> 0
      2021-06-09  3:28:57 0 [Note] WSREP: view((empty))
      2021-06-09  3:28:57 0 [Note] WSREP: gcomm: closed
      2021-06-09  3:28:57 0 [Note] WSREP: /home/panda/maria-10.6/build/sql/mariadbd: Terminated.
      2021-06-09  3:28:58 0 [Warning] WSREP: option --wsrep-causal-reads is deprecated
      2021-06-09  3:28:58 0 [Note] /home/panda/maria-10.6/build/sql/mariadbd (mysqld 10.6.1-1-MariaDB-debug-log) starting as process 410627 ...
      .................
      .................
      2021-06-09  3:28:58 0 [Note] WSREP: save pc into disk
      WSREP_SST: [ERROR] Parent mysqld process (PID: 410497) terminated unexpectedly. (20210609 03:28:58.800)
      /home/panda/maria-10.6/build/scripts/wsrep_sst_rsync: line 681: kill: (-410497) - No such process
      WSREP_SST: [INFO] Joiner cleanup: rsync PID=0, stunnel PID=410592 (20210609 03:28:58.803)
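
      One way to picture a guard against this kind of overlap is an ownership check before cleanup. The following is a minimal shell sketch under stated assumptions, not the actual fix for this issue and not code from wsrep_sst_rsync: the function names pid_file_is_ours and cleanup_own_rsync are hypothetical, and the sketch assumes each script instance remembers the PID it launched itself.

```shell
#!/bin/sh
# Hypothetical helper (not from wsrep_sst_rsync): succeed only if the pid
# file still records the PID that this script instance started.
pid_file_is_ours() {
    pid_file=$1
    our_pid=$2
    [ -r "$pid_file" ] || return 1
    # Keep only the digits of the first line of the pid file.
    recorded=$(head -n 1 "$pid_file" | tr -cd '0-9')
    [ -n "$recorded" ] && [ "$recorded" = "$our_pid" ]
}

# Hypothetical cleanup step for one SST instance.
cleanup_own_rsync() {
    pid_file=$1
    our_pid=$2
    # Signal only the PID this instance launched, never a PID read back
    # from a file that a newer transfer may have rewritten.
    kill "$our_pid" 2>/dev/null || true
    # Remove the pid file only if it still belongs to this instance.
    if pid_file_is_ours "$pid_file" "$our_pid"; then
        rm -f "$pid_file"
    fi
}
```

      With this shape, a late cleanup from the old SST leaves both the new process and its pid file alone. The check and the removal are still two separate steps, so a small race window remains; the sketch only narrows it.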
      

          Activity

            sysprg Julius Goryavsky added a comment - This commit fixes a bug that was originally discovered in the galera_nbo_sst_slave mtr test on the 10.6 branch. However, it is relevant for all versions and can cause intermittent failures of rsync-based SST on very fast server restarts, when a new SST process (for example, after starting a new server instance) overlaps the old SST process started by the previous, already terminated server. Because of this overlap, the new rsync may be killed instead of the old rsync, or the pid file belonging to the new rsync may be removed, which then leads to problems.
            https://github.com/MariaDB/server/commit/dfb6931fe214cacbfbf889d1b6a6273221987697
            https://buildbot.askmonty.org/buildbot/grid?category=main&branch=bb-10.2-MDEV-25880-galera

            jplindst Jan Lindström (Inactive) added a comment - ok to push

            jplindst Jan Lindström (Inactive) added a comment - ok to push when bb has finished testing.

            sysprg Julius Goryavsky added a comment (edited) - Fixed by https://github.com/MariaDB/server/commit/2edb8e12e10179b970007b3e1d5c465b9d0e110e and https://github.com/MariaDB/server/commit/18d5be5b54b1a05e6107a1c5828d9eed9cf18636

            People

              sysprg Julius Goryavsky