[MDEV-18699] Galera: Rolling upgrade: Upgraded node is stopped on commit if wsrep_trx_fragment_size > 0 Created: 2019-02-22  Updated: 2019-07-09  Resolved: 2019-05-03

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.3
Fix Version/s: 10.4.5

Type: Bug
Priority: Critical
Reporter: Stepan Patryshev (Inactive)
Assignee: Seppo Jaakola
Resolution: Fixed
Votes: 0
Labels: galera, galera_4
Environment: CentOS Linux release 7.6.1810 (Core)
Attachments: my.cnf, mysqld.1.err, mysqld.2.err, mysqld.3.cnf, mysqld.3.err
Issue Links:
Relates
relates to MDEV-18271 Galera 4: test manually rolling upgra... Closed
relates to MDEV-18552 Galera: Rolling upgrade: Assertion `v... Closed

 Description   


Galera rolling upgrade: a node that has been upgraded to 10.4 and joined to a cluster whose other nodes are not yet upgraded is aborted with signal 6 on commit if wsrep_trx_fragment_size > 0.

This issue was discovered while testing a rolling upgrade according to "MariaDB 10.4 Cluster Rolling Upgrade - Naive Approach" by Seppo Jaakola: https://docs.google.com/document/d/1z4XTpLpzStWMFaNnrSmiESaIVeCoKhu9Hbb1SrDPf0w

10.4.3-MariaDB-debug built from sources: commit f0b65102b23f006f596eef35e6e5f4f8b6d8146d
galera4 lib: Galera 26.4.0, commit 9cdbeb86c330b808571b14270e6428accb899c58

Steps:

1. Start 3 MariaDB 10.3 nodes with mtr:
1.0. export WSREP_PROVIDER=/usr/lib/libgalera_smm_3.so
1.1. cd mysql-test
1.2. "./mtr --suite=galera_3nodes --start-and-exit"

2. Copy the [mysqld.3] group from var/my.cnf (attached as my.cnf) into a separate configuration file, mysqld.3.cnf (attached as mysqld.3.cnf), and make the following edits:

2.1. Edit:

wsrep_cluster_address='gcomm://127.0.0.1:16003,127.0.0.1:16006,127.0.0.1:16009'
wsrep_provider=<path to galera 4 library>
basedir=<10.4 source tree>
character-sets-dir=<10.4 source tree>/sql/share/charsets
lc-messages-dir=<10.4 source tree>/sql/share/

2.2. Also add:

binlog-format=row
wsrep_sst_method=rsync
innodb-autoinc-lock-mode=2
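
Step 2.2 can also be done without an editor. The following is a minimal sketch, not part of the original report: the function name add_step_2_2_opts is hypothetical, and it assumes the configuration file contains only the [mysqld.3] group, so that appended lines land in that group. It matches option names literally, so an option already written with underscores instead of dashes (or the reverse) would not be detected.

```shell
# Hypothetical helper: append the step 2.2 options to a config file,
# skipping any option that is already set there under the same spelling.
add_step_2_2_opts() {
    cnf=$1
    for opt in binlog-format=row wsrep_sst_method=rsync innodb-autoinc-lock-mode=2; do
        key=${opt%%=*}
        # Append only when no uncommented line already sets this key.
        grep -q "^[[:space:]]*$key[[:space:]]*=" "$cnf" || printf '%s\n' "$opt" >> "$cnf"
    done
}
```

Usage: add_step_2_2_opts mysqld.3.cnf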

3.1. Load some data.
3.2. Stop data loading.

4. Upgrade node 3.

4.1. Stop the server:
/home/stepan/mariadb/10.3/client/mysqladmin -u root shutdown -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock

4.2. Make sure that wsrep-on is off by commenting it out in the configuration file:
sudo vi /home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf
#wsrep-on=1
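
The edit in step 4.2 (and its reversal in step 6.1) can be made non-interactively. This is a sketch under assumptions, not what was run in the original report: the function names wsrep_off and wsrep_on are hypothetical, and the option is assumed to appear on a line of its own as wsrep-on=... or wsrep_on=.... sed -i.bak leaves a backup copy next to the file.

```shell
# Hypothetical helpers for steps 4.2 and 6.1: comment out or restore
# the wsrep-on line in a config file without opening an editor.
wsrep_off() {
    # Prefix any active wsrep-on / wsrep_on line with '#'.
    sed -i.bak 's/^[[:space:]]*\(wsrep[-_]on[[:space:]]*=.*\)$/#\1/' "$1"
}

wsrep_on() {
    # Strip the leading '#' from a commented-out wsrep-on / wsrep_on line.
    sed -i.bak 's/^#[[:space:]]*\(wsrep[-_]on[[:space:]]*=.*\)$/\1/' "$1"
}
```

Usage: wsrep_off /home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf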

4.3. Run 10.4 binaries with 10.3 data:
/home/stepan/mariadb/10.4/sql/mysqld --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf --wsrep_provider=none

4.4. Run mysql_upgrade:
/home/stepan/mariadb/10.4/client/mysql_upgrade --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf -uroot -h0 -P16002

4.5. Stop the Server:
/home/stepan/mariadb/10.3/client/mysqladmin -u root shutdown -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock

4.6. export PATH=$PATH:/home/stepan/mariadb/10.4/scripts

5. Check upgraded node 3 without the cluster.

5.1. Start the server:
/home/stepan/mariadb/10.4/sql/mysqld --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf

5.2. Start the client:
/home/stepan/mariadb/10.3/client/mysql -u root -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock

Actual result:
Server version: 10.4.3-MariaDB-debug-log Source distribution

5.3. Stop the Server:
/home/stepan/mariadb/10.3/client/mysqladmin -u root shutdown -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock

6. Join node 3 back to the cluster.

6.1. Add to /home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf:

wsrep-on=1

6.2. Start the server:
/home/stepan/mariadb/10.4/sql/mysqld --defaults-file=/home/stepan/mariadb/10.3/mysql-test/var/mysqld.3.cnf

7. Check how streaming replication behaves on a partially upgraded cluster.

7.1. Run clients for all three nodes:

/home/stepan/mariadb/10.3/client/mysql -u root -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.1.sock
/home/stepan/mariadb/10.3/client/mysql -u root -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.2.sock
/home/stepan/mariadb/10.3/client/mysql -u root -S /home/stepan/mariadb/10.3/mysql-test/var/tmp/mysqld.3.sock

7.2. Check with the default wsrep_trx_fragment_size (0, streaming replication disabled).

7.2.1. On node 3:

START TRANSACTION;
update t set j = 28700 where i = 287;
update t set j = 28900 where i = 289;

Actual result:
The rows updated on node 3 are not yet updated on nodes 1 and 2.

7.2.2. On node 3:

commit;

Actual result:
The rows updated on node 3 are updated on nodes 1 and 2 only after the commit.
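
Comparing a row across the three nodes, as done in the checks of step 7, can be wrapped in a small helper. This is a sketch only: the function name show_row_on_all_nodes is hypothetical, the client and socket paths default to the ones used in this report, and the schema name test is an assumption taken from the client prompt.

```shell
# Hypothetical helper: print column j of row i as seen by each of the
# three nodes. MYSQL and SOCK_DIR default to the paths used in this report.
MYSQL=${MYSQL:-/home/stepan/mariadb/10.3/client/mysql}
SOCK_DIR=${SOCK_DIR:-/home/stepan/mariadb/10.3/mysql-test/var/tmp}

show_row_on_all_nodes() {
    for n in 1 2 3; do
        printf 'node %s: ' "$n"
        # -N suppresses the column header, -B gives plain batch output.
        # The schema name "test" is an assumption.
        "$MYSQL" -u root -S "$SOCK_DIR/mysqld.$n.sock" -N -B \
            -e "SELECT j FROM test.t WHERE i = $1"
    done
}
```

Usage: show_row_on_all_nodes 287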

7.3. Check with wsrep_trx_fragment_size > 0.

7.3.1. Set wsrep_trx_fragment_size > 0 on node 3:

SET SESSION wsrep_trx_fragment_size = 1;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [test]> SHOW VARIABLES LIKE 'wsrep_trx%';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| wsrep_trx_fragment_size | 1     |
| wsrep_trx_fragment_unit | bytes |
+-------------------------+-------+

7.3.2. On node 3:

START TRANSACTION;
update t set j = 28300 where i = 283;

Actual result:
The row updated on node 3 is already updated on nodes 1 and 2, before the commit.

7.3.3. On node 3:

commit;

Actual result:

Node 3 has stopped:
Client:

ERROR 2013 (HY000): Lost connection to MySQL server during query

mysqld.3.err:

190222 20:57:14 [ERROR] mysqld got signal 6 ;

Expected result:
The upgraded node 3 is NOT stopped on commit when it is joined to a cluster with not yet upgraded nodes and wsrep_trx_fragment_size > 0.

Other log and config files are also attached.
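
To check an error log for the abort shown above, a simple pattern match is enough. This is a minimal sketch: the function name log_has_fatal_signal is hypothetical, and the pattern is taken from the mysqld.3.err line quoted in this report.

```shell
# Hypothetical helper: print the lines of an error log that record a
# fatal signal. Exits 0 if at least one such line exists, non-zero otherwise.
log_has_fatal_signal() {
    grep -E 'mysqld got signal [0-9]+' "$1"
}
```

Usage: log_has_fatal_signal mysqld.3.err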



 Comments   
Comment by Stepan Patryshev (Inactive) [ 2019-02-25 ]

Here is a similar scenario:

1. Upgrade the 2nd and 3rd nodes and join them to the cluster with the 1st node, which is still running 10.3.

2. On node 3:
2.1. SET SESSION wsrep_trx_fragment_size = 1;
2.2. START TRANSACTION;
2.3. update t set j = 28900 where i = 289;
2.4. update t set j = 28300 where i = 283;

Actual result:
Node 3 stops right after the second update, even without a commit.

See also the similar MDEV-18552.

Comment by Stepan Patryshev (Inactive) [ 2019-07-09 ]

Confirmed fixed. Verified two scenarios on:

MariaDB Server 10.4: branch 10.4, commit 9d6b601e797dd8333340dadaefae09ebafc787db.
Galera Lib4: branch mariadb-4.x, commit ba337dd0ac281a5e9f29c652a890bd7ad2ac464e.

MariaDB Server 10.3: branch 10.3, commit 099007c3c92d1405625777fa86d2fba3da1d339c.
Galera Lib3: branch mariadb-3.x, commit 227e96e457acb60037450bc1e81c45594782e906.
