Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 5.5.43-galera, 10.0.19-galera
Description
Consider the following sequence of events happening with Galera Cluster if wsrep_OSU_method is set to TOI:
- You have a 3 node cluster: node1, node2, and node3.
- node1 is almost out of disk space.
- You execute DDL on node1, such as: ALTER TABLE tab DROP COLUMN col;
- node1 executes the DDL statement, and tells node2 and node3 to execute it in Total Order Isolation.
- The ALTER TABLE statement fails on node1 because it runs out of disk space, but it succeeds on node2 and node3.
node1 will see an error like this:
2015-06-16 04:45:55 7f7b000c7700 InnoDB: Error: Write to file (merge) failed at offset 68157440.
InnoDB: 1048576 bytes should have been written, only 1036288 were written.
InnoDB: Operating system error number 28.
InnoDB: Check that your OS and file system support files of this size.
InnoDB: Check also that the disk is not full or a disk quota exceeded.
InnoDB: Error number 28 means 'No space left on device'.
InnoDB: Some operating system error numbers are described at
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
150616 4:45:55 [ERROR] Slave SQL: Error 'Got error 64 'Temp file write failure' from InnoDB' on query. Default database: 'db1'. Query: 'ALTER TABLE tab DROP COLUMN col', Internal MariaDB error code: 1296
150616 4:45:55 [Warning] WSREP: RBR event 1 Query apply warning: 1, 19743667
150616 4:45:55 [Warning] WSREP: Ignoring error for TO isolated action: source: 9f6bdb3d-0bc1-11e5-a9f2-ca15da9a1a8b version: 3 local: 0 state: APPLYING flags: 65 conn_id: 24689371 trx_id: -1 seqnos (l: 3471879, g: 19743667, s: 19743666, d: 19743666, ts: 912410740656233)
- Since node1 now has a different table definition than node2 and node3, you will eventually have consistency errors.
node2 and node3 might see errors like this:
150616 5:15:11 [ERROR] Slave SQL: Column 11 of table 'db1.tab' cannot be converted from type 'int' to type 'date', Internal MariaDB error code: 1677
150616 5:15:11 [Warning] WSREP: RBR event 2 Write_rows_v1 apply warning: 3, 19743684
150616 5:15:11 [ERROR] WSREP: Failed to apply trx: source: 75edc58a-0bb2-11e5-a1fe-cb59d7f111b4 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 23347068 trx_id: 59826665 seqnos (l: 3742229, g: 19743684, s: 19743683, d: 19743667, ts: 768200703670749)
150616 5:15:11 [ERROR] WSREP: Failed to apply trx 19743684 4 times
150616 5:15:11 [ERROR] WSREP: Node consistency compromized, aborting...
And node1 will see node2 and node3 leave the cluster, causing a loss of quorum and total cluster failure:
150616 5:15:12 [Note] WSREP: forgetting 07459bc1 (tcp://$node2_ip:4567)
150616 5:15:12 [Note] WSREP: (75edc58a, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.72:4567' pointing to uuid 75edc58a is blacklisted, skipping
150616 5:15:12 [Note] WSREP: forgetting 9f6bdb3d (tcp://$node3_ip:4567)
150616 5:15:12 [Note] WSREP: Node 75edc58a state prim
150616 5:15:12 [Note] WSREP: view(view_id(PRIM,75edc58a,10) memb {
75edc58a,0
} joined {
} left {
} partitioned {
07459bc1,0
9f6bdb3d,0
})
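The "Ignoring error for TO isolated action" warning in node1's log appears about half an hour before the cluster breaks, so it could serve as an early alert. Below is a minimal sketch of a log scanner for it. The function name find_ignored_toi_errors is hypothetical, and the pattern is derived only from the single warning line quoted in this report, so other Galera versions may format the message differently.

```python
import re

# Pattern based solely on the warning line quoted in this report (assumption:
# other server versions may word or order the fields differently).
TOI_IGNORED = re.compile(
    r"\[Warning\] WSREP: Ignoring error for TO isolated action: "
    r"source: (?P<source>[0-9a-f-]+).*?"
    r"seqnos \(l: -?\d+, g: (?P<gseqno>-?\d+)"
)


def find_ignored_toi_errors(lines):
    """Return (source_uuid, global_seqno) for each ignored TOI failure found."""
    hits = []
    for line in lines:
        match = TOI_IGNORED.search(line)
        if match:
            hits.append((match.group("source"), int(match.group("gseqno"))))
    return hits


if __name__ == "__main__":
    sample = [
        "150616 4:45:55 [Warning] WSREP: RBR event 1 Query apply warning: 1, 19743667",
        "150616 4:45:55 [Warning] WSREP: Ignoring error for TO isolated action: "
        "source: 9f6bdb3d-0bc1-11e5-a9f2-ca15da9a1a8b version: 3 local: 0 "
        "state: APPLYING flags: 65 conn_id: 24689371 trx_id: -1 "
        "seqnos (l: 3471879, g: 19743667, s: 19743666, d: 19743666, "
        "ts: 912410740656233)",
    ]
    for source, gseqno in find_ignored_toi_errors(sample):
        print(f"TOI action {gseqno} from {source} failed locally and was ignored")
```

Any hit means the local node skipped a DDL statement that the rest of the cluster may have applied, so the node should be checked before further writes touch the affected table.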
Should it be possible for this to happen?
Can we fix this by making a node crash when DDL fails while wsrep_OSU_method is set to TOI? In most cases, losing one node is better than total cluster failure.