MariaDB Server / MDEV-25880

rsync may be mistakenly killed when overlapping SST

Details

    Description

      This bug was originally seen in the galera_nbo_sst_slave mtr test for 10.6, but it is relevant for all versions and can lead to intermittent crashes of SST via rsync on very fast server restarts, when a new SST process (for example, one started by a new server instance) overlaps the old SST process left over from the previous, already terminated server. This overlap can result in the new rsync being killed instead of the old rsync, or in the pid file of the new rsync being removed, either of which then leads to problems.
      For example:

      2021-06-09  3:28:56 0 [Warning] WSREP: 0.0 (panda): State transfer to 1.0 (panda) failed: -11 (Resource temporarily unavailable)
      2021-06-09  3:28:56 0 [ERROR] WSREP: /home/panda/galera-es-4.x/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1205: Will never receive state. Need to abort.
      2021-06-09  3:28:56 0 [Note] WSREP: gcomm: terminating thread
      2021-06-09  3:28:56 0 [Note] WSREP: gcomm: joining thread
      2021-06-09  3:28:56 0 [Note] WSREP: gcomm: closing backend
      2021-06-09  3:28:56 2 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 24 bytes
      2021-06-09  3:28:56 2 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 1/2 locked buffers
      2021-06-09  3:28:57 0 [Note] WSREP: PC protocol downgrade 1 -> 0
      2021-06-09  3:28:57 0 [Note] WSREP: view((empty))
      2021-06-09  3:28:57 0 [Note] WSREP: gcomm: closed
      2021-06-09  3:28:57 0 [Note] WSREP: /home/panda/maria-10.6/build/sql/mariadbd: Terminated.
      2021-06-09  3:28:58 0 [Warning] WSREP: option --wsrep-causal-reads is deprecated
      2021-06-09  3:28:58 0 [Note] /home/panda/maria-10.6/build/sql/mariadbd (mysqld 10.6.1-1-MariaDB-debug-log) starting as process 410627 ...
      .................
      .................
      2021-06-09  3:28:58 0 [Note] WSREP: save pc into disk
      WSREP_SST: [ERROR] Parent mysqld process (PID: 410497) terminated unexpectedly. (20210609 03:28:58.800)
      /home/panda/maria-10.6/build/scripts/wsrep_sst_rsync: line 681: kill: (-410497) - No such process
      WSREP_SST: [INFO] Joiner cleanup: rsync PID=0, stunnel PID=410592 (20210609 03:28:58.803)
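
      The overlap described above comes down to a cleanup step acting on a pid file that a newer SST run may have rewritten in the meantime. A minimal sketch of an ownership check follows, assuming the cleanup code remembers the PID of the rsync it started itself. The names read_pid and cleanup_own_rsync and the injectable kill parameter are hypothetical and used only for illustration; this is not the change made in the commits linked from this issue.

```python
import os
import signal
from pathlib import Path
from typing import Callable, Optional


def read_pid(pid_file: Path) -> Optional[int]:
    """Return the PID stored in pid_file, or None if it is missing or malformed."""
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return None
    return pid if pid > 0 else None


def cleanup_own_rsync(
    pid_file: Path,
    own_pid: int,
    kill: Callable[[int, int], None] = os.kill,  # injectable for testing (assumption)
) -> bool:
    """Kill rsync and remove its pid file only if the file still names own_pid.

    Returns True when cleanup was performed, False when the pid file is
    missing or belongs to another (for example, newer) SST run.
    """
    if read_pid(pid_file) != own_pid:
        # The file was rewritten by an overlapping SST run, or is gone:
        # leave both the file and the process it names alone.
        return False
    try:
        kill(own_pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # our rsync has already exited
    pid_file.unlink(missing_ok=True)  # requires Python 3.8+
    return True
```

      A check like this only narrows the window; the pid file can still change between the read and the kill, so a real script would need additional safeguards.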
      

          Activity

            sysprg Julius Goryavsky created issue -
            sysprg Julius Goryavsky made changes -
            Field | Original Value | New Value
            Summary | rsync may be mistakenly killed when overlaying SST | rsync may be mistakenly killed when overlapping SST
            sysprg Julius Goryavsky made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            sysprg Julius Goryavsky made changes -
            Assignee Julius Goryavsky [ sysprg ]
            sysprg Julius Goryavsky made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]

            sysprg Julius Goryavsky added a comment -
            This commit fixes a bug that was originally discovered during the galera_nbo_sst_slave mtr test for the 10.6 branch. However, it is relevant for all versions and can lead to intermittent crashes of SST via rsync on very fast server restarts, when a new SST process (for example, one started by a new server instance) overlaps the old SST process started by the previous, already terminated server. This overlap can result in the new rsync being killed instead of the old rsync, or in the pid file of the new rsync being removed, either of which then leads to problems.
            https://github.com/MariaDB/server/commit/dfb6931fe214cacbfbf889d1b6a6273221987697
            https://buildbot.askmonty.org/buildbot/grid?category=main&branch=bb-10.2-MDEV-25880-galera
            sysprg Julius Goryavsky made changes -
            Assignee Julius Goryavsky [ sysprg ] Jan Lindström [ jplindst ]
            Status In Progress [ 3 ] In Review [ 10002 ]

            jplindst Jan Lindström (Inactive) added a comment - ok to push
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Julius Goryavsky [ sysprg ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            sysprg Julius Goryavsky made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            sysprg Julius Goryavsky made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            sysprg Julius Goryavsky made changes -
            Assignee Julius Goryavsky [ sysprg ] Jan Lindström [ jplindst ]
            Status In Progress [ 3 ] In Review [ 10002 ]

            jplindst Jan Lindström (Inactive) added a comment - ok to push when bb has finished testing.
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Julius Goryavsky [ sysprg ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            sysprg Julius Goryavsky added a comment (edited) - Fixed by https://github.com/MariaDB/server/commit/2edb8e12e10179b970007b3e1d5c465b9d0e110e and https://github.com/MariaDB/server/commit/18d5be5b54b1a05e6107a1c5828d9eed9cf18636
            sysprg Julius Goryavsky made changes -
            Fix Version/s 10.6.2 [ 25800 ]
            Fix Version/s 10.2.39 [ 25731 ]
            Fix Version/s 10.3.30 [ 25732 ]
            Fix Version/s 10.4.20 [ 25733 ]
            Fix Version/s 10.5.11 [ 25734 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            marko Marko Mäkelä made changes -
            Fix Version/s 10.2.40 [ 26027 ]
            Fix Version/s 10.3.31 [ 26028 ]
            Fix Version/s 10.4.21 [ 26030 ]
            Fix Version/s 10.5.12 [ 26025 ]
            Fix Version/s 10.6.3 [ 25904 ]
            Fix Version/s 10.2.39 [ 25731 ]
            Fix Version/s 10.3.30 [ 25732 ]
            Fix Version/s 10.4.20 [ 25733 ]
            Fix Version/s 10.5.11 [ 25734 ]
            Fix Version/s 10.6.2 [ 25800 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 122547 ] MariaDB v4 [ 159378 ]

            People

              sysprg Julius Goryavsky
              sysprg Julius Goryavsky