Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 10.0.24-galera
Environment:
CentOS Linux release 7.2.1511 (Core)
Linux 100-103-10-310-db 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Description
We have a 3-node Galera cluster running 10.0.24. The application team asked us to add a new column to a table that is in active use. The table had 100k rows, and we issued the ALTER statement directly on the primary node. The ALTER ran for 2-3 minutes and failed with a duplicate key error, and the remaining two nodes died after writing the following errors to their error logs.
161005 19:06:57 [ERROR] Slave SQL: Column 34 of table 'mps.pages' cannot be converted from type 'tinyblob' to type 'varchar(255)', Internal MariaDB error code: 1677
161005 19:06:57 [Warning] WSREP: RBR event 2 Update_rows_v1 apply warning: 3, 32400488
161005 19:06:58 [ERROR] Slave SQL: Column 34 of table 'mps.pages' cannot be converted from type 'tinyblob' to type 'varchar(255)', Internal MariaDB error code: 1677
161005 19:06:58 [Warning] WSREP: RBR event 2 Write_rows_v1 apply warning: 3, 32400489
161005 19:06:58 [Warning] WSREP: Failed to apply app buffer: seqno: 32400488, status: 1
at galera/src/trx_handle.cpp:apply():351
Retrying 2th time
After 4 retries:
161005 19:06:58 [ERROR] WSREP: Failed to apply trx 32400492 4 times
161005 19:06:58 [ERROR] WSREP: Node consistency compromized, aborting...
161005 19:06:58 [Note] WSREP: Closing send monitor...
161005 19:06:58 [Note] WSREP: Closed send monitor.
161005 19:06:58 [Note] WSREP: /usr/sbin/mysqld: Terminated.
161005 19:06:58 [Note] WSREP: /usr/sbin/mysqld: Terminated.
161005 19:06:58 [Note] WSREP: /usr/sbin/mysqld: Terminated.
161005 19:07:02 mysqld_safe Number of processes running now: 0
161005 19:07:02 mysqld_safe WSREP: not restarting wsrep node automatically
161005 19:07:02 mysqld_safe mysqld from pid file /mysql/prod_ng_misc3/mysqld.pid ended
Attached are the my.cnf and the error log from the third node, which died first. The same error message was recorded on the second node, which also died. As a result, the first node, which the application uses, became non-primary and stopped accepting writes.
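
For context on why the surviving node went non-primary, here is a minimal sketch of Galera's weighted-quorum rule. It assumes the default weight of 1 per node and that the two aborted nodes did not leave the group gracefully; neither detail is confirmed by the logs in this report. The function name `has_quorum` is hypothetical and used only for illustration.

```python
def has_quorum(last_primary_weights, left_gracefully_weights, current_weights):
    """Return True if the current component may stay primary.

    Sketch of the weighted-quorum rule: the current component needs
    strictly more than half of the weight of the last primary component,
    after subtracting nodes known to have left gracefully.
    """
    expected = sum(last_primary_weights) - sum(left_gracefully_weights)
    return sum(current_weights) > expected / 2


# Assumption: all three nodes have the default weight of 1.
three_nodes = [1, 1, 1]

# Both other nodes vanish at once without a graceful leave:
# one node out of three is not a majority.
both_lost_at_once = has_quorum(three_nodes, [], [1])

# Nodes fail one after another without a graceful leave:
# two of three keep quorum, then one of two does not.
after_first_loss = has_quorum(three_nodes, [], [1, 1])
after_second_loss = has_quorum([1, 1], [], [1])

# For comparison: if both nodes had left gracefully, the survivor
# would have remained primary.
graceful_case = has_quorum(three_nodes, [1, 1], [1])
```

Under these assumptions the single remaining node fails the quorum check in both the simultaneous and the sequential case, which matches the non-primary state described above.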