[MDEV-8323] Failed DDL execution can cause a full Galera Cluster crash

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: 5.5.43-galera, 10.0.19-galera
Fix Version/s: N/A
Component/s: Galera, wsrep
Labels:
- galera

Description

Consider the following sequence of events in a Galera Cluster with wsrep_OSU_method set to TOI:

You have a 3-node cluster: node1, node2, and node3.
node1 is almost out of disk space.
You execute a DDL statement on node1, such as: ALTER TABLE tab DROP COLUMN col;
node1 executes the DDL statement and has node2 and node3 execute it under Total Order Isolation (TOI).
The ALTER TABLE statement fails on node1 because the node runs out of disk space, but it succeeds on node2 and node3.

node1 will see an error like this:

2015-06-16 04:45:55 7f7b000c7700 InnoDB: Error: Write to file (merge) failed at offset 68157440.

InnoDB: 1048576 bytes should have been written, only 1036288 were written.

InnoDB: Operating system error number 28.

InnoDB: Check that your OS and file system support files of this size.

InnoDB: Check also that the disk is not full or a disk quota exceeded.

InnoDB: Error number 28 means 'No space left on device'.

InnoDB: Some operating system error numbers are described at

InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html

150616 4:45:55 [ERROR] Slave SQL: Error 'Got error 64 'Temp file write failure' from InnoDB' on query. Default database: 'db1'. Query: 'ALTER TABLE tab DROP COLUMN col', Internal MariaDB error code: 1296

150616 4:45:55 [Warning] WSREP: RBR event 1 Query apply warning: 1, 19743667

150616 4:45:55 [Warning] WSREP: Ignoring error for TO isolated action: source: 9f6bdb3d-0bc1-11e5-a9f2-ca15da9a1a8b version: 3 local: 0 state: APPLYING flags: 65 conn_id: 24689371 trx_id: -1 seqnos (l: 3471879, g: 19743667, s: 19743666, d: 19743666, ts: 912410740656233)

Since node1 now has a different table definition than node2 and node3, you will eventually have consistency errors.
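The warning "Ignoring error for TO isolated action" in the node1 log above is the only sign that the DDL failed locally, so it may be worth watching for. The following is a minimal sketch, assuming the error log is available as lines of text in the format shown above; the function name find_ignored_toi_errors is hypothetical and not part of MariaDB or Galera.

```python
import re

# Pattern derived from the node1 log excerpt in this report. The capture group
# extracts the global sequence number (the "g:" field) of the TOI action.
IGNORED_TOI = re.compile(
    r"\[Warning\] WSREP: Ignoring error for TO isolated action:.*?\bg: (\d+)"
)


def find_ignored_toi_errors(log_lines):
    """Return global seqnos of TOI actions whose local failure was ignored.

    Hypothetical helper for illustration only.
    """
    seqnos = []
    for line in log_lines:
        match = IGNORED_TOI.search(line)
        if match:
            seqnos.append(int(match.group(1)))
    return seqnos


# Sample input copied from the log excerpt in this report.
sample = [
    "150616 4:45:55 [Warning] WSREP: RBR event 1 Query apply warning: 1, 19743667",
    "150616 4:45:55 [Warning] WSREP: Ignoring error for TO isolated action: "
    "source: 9f6bdb3d-0bc1-11e5-a9f2-ca15da9a1a8b version: 3 local: 0 "
    "state: APPLYING flags: 65 conn_id: 24689371 trx_id: -1 "
    "seqnos (l: 3471879, g: 19743667, s: 19743666, d: 19743666, "
    "ts: 912410740656233)",
]
ignored = find_ignored_toi_errors(sample)
```

A non-empty result suggests that the node's schema may have diverged from the rest of the cluster at the listed sequence numbers.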

node2 and node3 might see errors like this:

150616 5:15:11 [ERROR] Slave SQL: Column 11 of table 'db1.tab' cannot be converted from type 'int' to type 'date', Internal MariaDB error code: 1677

150616 5:15:11 [Warning] WSREP: RBR event 2 Write_rows_v1 apply warning: 3, 19743684

150616 5:15:11 [ERROR] WSREP: Failed to apply trx: source: 75edc58a-0bb2-11e5-a1fe-cb59d7f111b4 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 23347068 trx_id: 59826665 seqnos (l: 3742229, g: 19743684, s: 19743683, d: 19743667, ts: 768200703670749)

150616 5:15:11 [ERROR] WSREP: Failed to apply trx 19743684 4 times

150616 5:15:11 [ERROR] WSREP: Node consistency compromized, aborting...

And node1 will see node2 and node3 leave the cluster, causing a loss of quorum and total cluster failure:

150616 5:15:12 [Note] WSREP: forgetting 07459bc1 (tcp://$node2_ip:4567)

150616 5:15:12 [Note] WSREP: (75edc58a, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.72:4567' pointing to uuid 75edc58a is blacklisted, skipping

150616 5:15:12 [Note] WSREP: forgetting 9f6bdb3d (tcp://$node3_ip:4567)

150616 5:15:12 [Note] WSREP: Node 75edc58a state prim

150616 5:15:12 [Note] WSREP: view(view_id(PRIM,75edc58a,10) memb {

75edc58a,0

} joined {

} left {

} partitioned {

07459bc1,0

9f6bdb3d,0

})

Should it be possible for this to happen?

Can we fix this by making a node crash when a DDL statement fails on it while wsrep_OSU_method is set to TOI? Losing one node is probably better than total cluster failure most of the time.
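To illustrate the trade-off, here is a toy model of the scenario, not Galera's actual implementation. The policy names "ignore" and "abort", the function simulate, and the three-column table are all assumptions made for the sketch, and a column-count mismatch stands in for the type-conversion error seen in the logs.

```python
def simulate(policy):
    """Toy model of the reported scenario; returns the nodes still running.

    policy is hypothetical: "ignore" mimics the reported behavior,
    "abort" mimics the behavior proposed in this report.
    """
    # Each node holds its own copy of the table definition.
    schemas = {
        name: ["id", "col", "created"] for name in ("node1", "node2", "node3")
    }
    alive = set(schemas)

    # Step 1: the TOI DDL "DROP COLUMN col" runs on every node.
    # It fails on node1 only (disk full).
    for name in sorted(alive):
        ddl_failed = name == "node1"
        if not ddl_failed:
            schemas[name].remove("col")
        elif policy == "abort":
            alive.discard(name)
        # With "ignore", node1 keeps running with the old definition.

    # Step 2: node1, if still running, replicates a row built from its own
    # (stale) table definition. Nodes that cannot apply it abort.
    if "node1" in alive:
        row_columns = len(schemas["node1"])
        for name in sorted(alive - {"node1"}):
            if len(schemas[name]) != row_columns:
                alive.discard(name)
    return sorted(alive)


survivors_ignore = simulate("ignore")
survivors_abort = simulate("abort")
```

In this model, "ignore" leaves only node1 running, while "abort" sacrifices node1 and keeps node2 and node3, which is the outcome the proposal aims for.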

People

Assignee: Nirbhay Choubey (Inactive)

Reporter: Geoff Montee (Inactive)

Votes: 1

Watchers: 5

Dates

Created: 2015-06-17 00:28

Updated: 2015-07-14 21:37

Resolved: 2015-07-08 00:23
