MariaDB Server: MDEV-18407

Galera: Rolling upgrade: 10.3 nodes are stopped with signal 6 on attempt to join upgraded 10.4 node

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 10.4.2, 10.3.13
    • Fix Version/s: 10.4.3, 10.3.14
    • Component/s: Galera, Galera SST
    • Environment: CentOS Linux release 7.6.1810 (Core)

    Description

      This issue was discovered while testing a rolling upgrade according to "MariaDB 10.4 Cluster Rolling Upgrade - Naive Approach" by Seppo Jaakola: https://docs.google.com/document/d/1z4XTpLpzStWMFaNnrSmiESaIVeCoKhu9Hbb1SrDPf0w

      Steps:

      0. Build MariaDB Server 10.3 with Galera 3 and MariaDB Server 10.4 with Galera 4.

      0.1. Galera 3.

      git clone https://github.com/MariaDB/galera.git galera3
      cd galera3
      git checkout mariadb-3.x
      git submodule init
      git submodule update
      ./scripts/build.sh -d --dl 2
      sudo cp libgalera_smm.so /usr/lib/libgalera_smm_3.so
      

      0.2. Server 10.3.

      git clone https://github.com/mariadb/server 10.3
      cd 10.3
      git checkout 10.3
      git pull
      git clean -d -f -x
      cmake . -DCMAKE_BUILD_TYPE=Debug
      make -j16
      

      0.3. Galera 4.
      Repeat the steps from 0.1, but check out the mariadb-4.x branch, then install the library under a different name:
      sudo cp libgalera_smm.so /usr/lib/libgalera_smm_4.so

      0.4. Server 10.4.
      Repeat the steps from 0.2, but clone into a 10.4 directory and check out the bb-10.4-galera4 branch.
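      Steps 0.1 to 0.4 repeat the same two command sequences with different branches, so they can be wrapped in two shell functions. This is a minimal sketch for bash or a POSIX sh. The function names build_galera and build_server and their argument order are hypothetical, not part of the original procedure. Each function body runs in a subshell so that its cd does not affect the caller. The && chain stops at the first failing command.

```shell
# Hypothetical helpers wrapping the build commands from steps 0.1 and 0.2.
# Each body runs in a subshell ( ... ) so `cd` does not leak to the caller,
# and the && chain stops at the first failing command.

# $1: Galera branch, $2: checkout directory, $3: install path for the library
build_galera() (
    git clone https://github.com/MariaDB/galera.git "$2" &&
    cd "$2" &&
    git checkout "$1" &&
    git submodule init &&
    git submodule update &&
    ./scripts/build.sh -d --dl 2 &&
    sudo cp libgalera_smm.so "$3"
)

# $1: server branch, $2: checkout directory
build_server() (
    git clone https://github.com/mariadb/server "$2" &&
    cd "$2" &&
    git checkout "$1" &&
    git pull &&
    git clean -d -f -x &&
    cmake . -DCMAKE_BUILD_TYPE=Debug &&
    make -j16
)
```

      For example, the four builds are `build_galera mariadb-3.x galera3 /usr/lib/libgalera_smm_3.so`, `build_server 10.3 10.3`, `build_galera mariadb-4.x galera4 /usr/lib/libgalera_smm_4.so` and `build_server bb-10.4-galera4 10.4`. The branch names are taken from the steps above.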

      1. Start 3 MariaDB 10.3 nodes with mtr:
      1.0. export WSREP_PROVIDER=/usr/lib/libgalera_smm_3.so
      1.1. cd mysql-test
      1.2. ./mtr --suite=galera_3nodes --start-and-exit

      Actual result:
      3 servers are running:

      Started [mysqld.1 - pid: 8432, winpid: 8432] [mysqld.2 - pid: 8433, winpid: 8433] [mysqld.3 - pid: 8434, winpid: 8434]
      worker[1] Using config for test galera_3nodes.galera_certification_ccc
      worker[1] Port and socket path for server(s):
      worker[1] mysqld.1  16000  /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.1.sock
      worker[1] mysqld.2  16001  /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.2.sock
      worker[1] mysqld.3  16002  /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock
       
      $ ps -ef | grep mysqld
      stepan     8432      1  1 19:46 pts/2    00:00:01 /home/stepan/mariadb/10.3/sql/mysqld --defaults-group-suffix=.1 --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/my.cnf --log-output=file --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --core-file --loose-debug-sync-timeout=300
      stepan     8433      1  1 19:46 pts/2    00:00:01 /home/stepan/mariadb/10.3/sql/mysqld --defaults-group-suffix=.2 --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/my.cnf --log-output=file --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --core-file --loose-debug-sync-timeout=300
      stepan     8434      1  1 19:46 pts/2    00:00:01 /home/stepan/mariadb/10.3/sql/mysqld --defaults-group-suffix=.3 --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/my.cnf --log-output=file --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --core-file --loose-debug-sync-timeout=300
      

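      The following ports are in use. Ports 16000-16002 are the client ports listed above; 16003, 16006 and 16009 are the Galera group communication ports used in wsrep_cluster_address below: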

      $ sudo netstat -tulpn | grep mysqld
      tcp        0      0 0.0.0.0:16009           0.0.0.0:*               LISTEN      198749/mysqld
      tcp        0      0 127.0.0.1:16000         0.0.0.0:*               LISTEN      198747/mysqld
      tcp        0      0 127.0.0.1:16001         0.0.0.0:*               LISTEN      198748/mysqld
      tcp        0      0 127.0.0.1:16002         0.0.0.0:*               LISTEN      198749/mysqld
      tcp        0      0 0.0.0.0:16003           0.0.0.0:*               LISTEN      198747/mysqld
      tcp        0      0 0.0.0.0:16006           0.0.0.0:*               LISTEN      198748/mysqld
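      Before continuing, it can help to confirm that the three nodes actually formed one cluster: each should report wsrep_cluster_size = 3. This is a minimal sketch. It assumes the 10.3 client path from this report and passwordless root access as set up by mtr. The names status_value, cluster_size and check_cluster are hypothetical.

```shell
MYSQL=/home/stepan/mariadb/10.3/client/mysql   # client path used in this report

# Read "name<TAB>value" lines (mysql -N -B output) on stdin; print the first value.
status_value() {
    awk -F'\t' 'NR == 1 { print $2 }'
}

# Print wsrep_cluster_size as seen by the node whose client port is $1.
cluster_size() {
    "$MYSQL" -uroot -h0 -P"$1" -N -B \
        -e "SHOW STATUS LIKE 'wsrep_cluster_size'" | status_value
}

# Query every node; on a healthy three-node cluster each line should end in 3.
check_cluster() {
    local port
    for port in 16000 16001 16002; do
        echo "port $port: wsrep_cluster_size=$(cluster_size "$port")"
    done
}
```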
      

      2. Copy the [mysqld.3] group from var/my.cnf into a separate configuration file, mysqld.3.cnf, and make the following edits:

      2.1. Edit:

      wsrep_cluster_address='gcomm://127.0.0.1:16003,127.0.0.1:16006,127.0.0.1:16009'
      wsrep_provider=<path to galera 4 library>
      basedir=<10.4 source tree>
      character-sets-dir=<10.4 source tree>/sql/share/charsets
      lc-messages-dir=<10.4 source tree>/sql/share/
      

      2.2. Also add:

      binlog-format=row
      wsrep_sst_method=rsync
      innodb-autoinc-lock-mode=2
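      Steps 2 and 2.2 can be partly scripted. The edits in 2.1 depend on local paths and are left manual. This is a minimal sketch with hypothetical helper names, assuming the group header in var/my.cnf is written exactly as [mysqld.3]. One point is worth checking against the attached mysqld.3.cnf. When mysqld is given only --defaults-file, as in steps 4.2 and 5.1, it reads groups such as [mysqld] and ignores [mysqld.3] unless --defaults-group-suffix=.3 is also passed. The copied header may therefore need renaming.

```shell
# Print one option group from a my.cnf-style file: the "[name]" header line
# and every following line up to (not including) the next "[...]" header.
# $1: source file, $2: group name, e.g. mysqld.3
extract_group() {
    awk -v hdr="[$2]" '
        /^\[/ { in_group = ($0 == hdr) }
        in_group { print }
    ' "$1"
}

# Append the settings listed in step 2.2 to the option file $1.
append_step_2_2_settings() {
    cat >> "$1" <<'EOF'
binlog-format=row
wsrep_sst_method=rsync
innodb-autoinc-lock-mode=2
EOF
}
```

      For example, run these from mysql-test. First run `extract_group var/my.cnf mysqld.3 > var/mysqld.3.cnf`. Then apply the 2.1 edits by hand, and finally run `append_step_2_2_settings var/mysqld.3.cnf`.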
      

      3. Load some data through each node:

      3.1. Run the client:
      /home/stepan/mariadb/10.3/client/mysql -u root -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.1.sock

      3.2. Create a table:

      use test;
      create table t (i int primary key auto_increment, j int);
      

      3.3. Load data for some time:

      watch "/home/stepan/mariadb/10.3/client/mysql -uroot -h0 -P16000 -e'insert into test.t(j) values(1)' "
      watch "/home/stepan/mariadb/10.3/client/mysql -uroot -h0 -P16001 -e'insert into test.t(j) values(1)' "
      watch "/home/stepan/mariadb/10.3/client/mysql -uroot -h0 -P16002 -e'insert into test.t(j) values(1)' "
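      The three watch loops run until interrupted. A bounded alternative is easier to repeat: it inserts a fixed number of rows through each node and then compares row counts. This is a sketch that assumes the same client path and passwordless root access. The names load_rows and row_counts are hypothetical. Galera applies replicated write sets on the other nodes after certification, so counts read immediately after the load may briefly differ before they converge.

```shell
MYSQL=/home/stepan/mariadb/10.3/client/mysql   # client path used in this report
PORTS="16000 16001 16002"                       # client ports of nodes 1-3

# Insert rows round-robin: $1 rows through each node (3 x $1 rows in total).
load_rows() {
    local i=0 port
    while [ "$i" -lt "$1" ]; do
        for port in $PORTS; do
            "$MYSQL" -uroot -h0 -P"$port" \
                -e 'insert into test.t(j) values(1)' || return 1
        done
        i=$((i + 1))
    done
}

# Print "port row_count" for test.t on each node.
row_counts() {
    local port
    for port in $PORTS; do
        printf '%s %s\n' "$port" \
            "$("$MYSQL" -uroot -h0 -P"$port" -N -B -e 'select count(*) from test.t')"
    done
}
```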
      

      3.4. Stop loading data.

      4. Upgrade node 3.

      4.1. Stop the server:
      /home/stepan/mariadb/10.3/client/mysqladmin -u root shutdown -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock

      4.2. Start the 10.4 binaries on the 10.3 data, with wsrep disabled:
      /home/stepan/mariadb/10.4/sql/mysqld --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf --wsrep_provider=none

      4.3. Run mysql_upgrade:
      /home/stepan/mariadb/10.4/client/mysql_upgrade --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf -uroot -h0 -P16002

      Actual result:

      Phase 1/7: Checking and upgrading mysql database
      Processing databases
      mysql
      mysql.column_stats                                 OK
      mysql.columns_priv                                 OK
      mysql.db                                           OK
      mysql.event                                        OK
      mysql.func                                         OK
      mysql.gtid_slave_pos                               OK
      mysql.help_category                                OK
      mysql.help_keyword                                 OK
      mysql.help_relation                                OK
      mysql.help_topic                                   OK
      mysql.host                                         OK
      mysql.index_stats                                  OK
      mysql.innodb_index_stats                           OK
      mysql.innodb_table_stats                           OK
      mysql.plugin                                       OK
      mysql.proc                                         OK
      mysql.procs_priv                                   OK
      mysql.proxies_priv                                 OK
      mysql.roles_mapping                                OK
      mysql.servers                                      OK
      mysql.table_stats                                  OK
      mysql.tables_priv                                  OK
      mysql.time_zone                                    OK
      mysql.time_zone_leap_second                        OK
      mysql.time_zone_name                               OK
      mysql.time_zone_transition                         OK
      mysql.time_zone_transition_type                    OK
      mysql.transaction_registry                         OK
      mysql.user                                         OK
      Phase 2/7: Installing used storage engines... Skipped
      Phase 3/7: Fixing views
      Phase 4/7: Running 'mysql_fix_privilege_tables'
      Phase 5/7: Fixing table and database names
      Phase 6/7: Checking and upgrading tables
      Processing databases
      information_schema
      mtr
      mtr.global_suppressions                            OK
      mtr.test_suppressions                              OK
      performance_schema
      test
      test.t                                             OK
      Phase 7/7: Running 'FLUSH PRIVILEGES'
      OK
      

      4.4. Stop the server:
      /home/stepan/mariadb/10.3/client/mysqladmin -u root shutdown -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock

      4.5. export PATH=$PATH:/home/stepan/mariadb/10.4/scripts
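      Steps 4.1 to 4.4 can run unattended if the temporary 10.4 server is started in the background and the script waits until it answers before running mysql_upgrade. This sketch makes several assumptions: the paths are as in this report, `mysqladmin ping` serves as the readiness check, and the helper names retry and upgrade_node3 are hypothetical. Beyond the retry limit, it does not handle a server that fails to start. Step 4.5 still follows.

```shell
B3=/home/stepan/mariadb/10.3                     # 10.3 tree (data, mtr, clients)
B4=/home/stepan/mariadb/10.4                     # 10.4 tree (new binaries)
CNF=$B3/mysql-test/var/mysqld.3.cnf
SOCK=$B3/mysql-test/var/tmp/mysqld.3.sock

# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
retry() {
    local tries=$1 delay=$2 n=1
    shift 2
    until "$@"; do
        [ "$n" -ge "$tries" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

upgrade_node3() {
    # 4.1: stop the 10.3 server of node 3
    "$B3/client/mysqladmin" -uroot -S "$SOCK" shutdown || return 1
    # 4.2: start 10.4 on the 10.3 data with wsrep disabled, in the background
    "$B4/sql/mysqld" --defaults-file="$CNF" --wsrep_provider=none &
    # wait (up to about 60 s) until the new server accepts connections
    retry 60 1 "$B3/client/mysqladmin" -uroot -S "$SOCK" --silent ping || return 1
    # 4.3: upgrade the system tables
    "$B4/client/mysql_upgrade" --defaults-file="$CNF" -uroot -h0 -P16002 || return 1
    # 4.4: stop the server again
    "$B3/client/mysqladmin" -uroot -S "$SOCK" shutdown
}
```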

      5. Check the upgraded node 3 outside the cluster.

      5.1. Start the server:
      /home/stepan/mariadb/10.4/sql/mysqld --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf

      5.2. Start the client:
      /home/stepan/mariadb/10.3/client/mysql -u root -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock

      Actual result:
      Output: "Server version: 10.4.2-MariaDB-debug-log Source distribution"

      5.3. Stop the server:
      /home/stepan/mariadb/10.3/client/mysqladmin -u root shutdown -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock
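      The interactive check in 5.2 can be reduced to one non-interactive query plus a version comparison. This is a sketch with hypothetical helper names. version_is only compares the major.minor prefix of the string returned by SELECT VERSION().

```shell
SOCK3=/home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock

# Print the server version reported by node 3 (10.3 client, as in step 5.2).
node3_version() {
    /home/stepan/mariadb/10.3/client/mysql -uroot -S "$SOCK3" -N -B \
        -e 'SELECT VERSION()'
}

# Succeed if version string $1 starts with the major.minor prefix $2,
# e.g. version_is 10.4.2-MariaDB-debug-log 10.4
version_is() {
    case "$1" in
        "$2".*) return 0 ;;
        *) return 1 ;;
    esac
}
```

      For example: `version_is "$(node3_version)" 10.4 && echo "node 3 runs 10.4"`.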

      6. Attempt to rejoin node 3 to the cluster.

      6.1. Add to mysqld.3.cnf:

      wsrep-on=1
      

      6.2. Start the server:
      /home/stepan/mariadb/10.4/sql/mysqld --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf

      Actual results:

      1. Nodes 1 and 2 stop with "[ERROR] mysqld got signal 6":

      mysqld.1.err:

      2019-01-28 12:22:26 0 [Note] WSREP: Flow-control interval: [28, 28]
      2019-01-28 12:22:26 0 [Note] WSREP: Trying to continue unpaused monitor
      2019-01-28 12:22:26 2 [Note] WSREP: New cluster view: global state: e2c6689d-2189-11e9-b139-1bd4f4035078:2241, view# 8: Primary, number of nodes: 3, my index: 1, protocol version 3
      2019-01-28 12:22:26 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2019-01-28 12:22:26 2 [Note] WSREP: REPL Protocols: 9 (4, 2)
      2019-01-28 12:22:26 2 [Note] WSREP: Assign initial position for certification: 2241, protocol version: 4
      2019-01-28 12:22:26 0 [Note] WSREP: Service thread queue flushed.
      mysqld: gcs/src/gcs_group.cpp:1186: int gcs_group_find_donor(const gcs_group_t*, int, int, const char*, int, const gu_uuid_t*, gcs_seqno_t): Assertion `ist_seqno != GCS_SEQNO_ILL' failed.
      190128 12:22:27 [ERROR] mysqld got signal 6 ;
      

      2. The upgraded node 3 keeps running.

      Expected:
      The upgraded node 3 successfully joins the cluster, and the other nodes continue to operate.

      Other logs are also attached.
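      To tell whether a run hit this bug, search each node's error log for the assertion and the signal 6 line quoted above. This is a minimal sketch, and find_crash is a hypothetical name. The log location is assumed: with this mtr setup, the error logs are expected under mysql-test/var/log/ as mysqld.1.err, mysqld.2.err and so on.

```shell
# Print "file:line:text" (or "line:text" for a single file) for every line
# matching the failure from this report; exit status 0 means it was found.
find_crash() {
    grep -n -E "Assertion .ist_seqno != GCS_SEQNO_ILL. failed|got signal 6" "$@"
}
```

      For example, from mysql-test: `find_crash var/log/mysqld.1.err var/log/mysqld.2.err`.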

      Attachments

        1. my.cnf
          8 kB
        2. mysqld.1.err
          46 kB
        3. mysqld.2.err
          41 kB
        4. mysqld.3.cnf
          1 kB
        5. mysqld.3.err
          107 kB
        6. mysqld.3.log
          170 kB


          Activity

            Stepan Patryshev (Inactive) added a comment: I've checked that this issue also reproduces while data is being loaded, so it apparently does not depend on step 3, "Load some data for each node", which can be skipped.

            Stepan Patryshev (Inactive) added a comment: Verified on Server 10.4.3 (commit c568e25379600db8af4bd39df4761ba0fbc1a14e) with the galera4 library (commit 9cdbeb86c330b808571b14270e6428accb899c58). This issue has been fixed, but I've filed a new bug, MDEV-18580 "Galera: Rolling upgrade: Upgraded node 3 is stopped with signal 6 after node 2 shutdown".

            People

              Assignee: Seppo Jaakola
              Reporter: Stepan Patryshev (Inactive)
              Votes: 0
              Watchers: 3
