  1. MariaDB Server
  2. MDEV-8323

Failed DDL execution can cause a full Galera Cluster crash

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 5.5.43-galera, 10.0.19-galera
    • Fix Version/s: N/A
    • Component/s: Galera, wsrep

    Description

      Consider the following sequence of events in a Galera Cluster when wsrep_OSU_method is set to TOI:

      • You have a 3-node cluster: node1, node2, and node3.
      • node1 is almost out of disk space.
      • You execute DDL on node1, such as: ALTER TABLE tab DROP COLUMN col;
      • node1 executes the DDL statement, and tells node2 and node3 to execute it in Total Order Isolation.
      • The ALTER TABLE statement fails on node1 because it runs out of disk space, but succeeds on node2 and node3.
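
      The sequence above can be illustrated with a small toy model. This is only a sketch and not Galera code: the Node class, the disk_full flag, and the column names other than col are hypothetical, chosen just to show how one failed TOI execution leaves the table definitions out of step.

```python
# Toy model of the sequence above; illustrative only, not Galera code.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    disk_full: bool = False  # hypothetical flag standing in for errno 28
    columns: list = field(default_factory=lambda: ["id", "col", "created"])

    def apply_drop_column(self, column: str) -> bool:
        """Apply ALTER TABLE ... DROP COLUMN locally; return True on success."""
        if self.disk_full:
            # The table rebuild cannot write its temporary file,
            # so the local table definition stays unchanged.
            return False
        self.columns.remove(column)
        return True


def replicate_ddl(nodes, column):
    # Under TOI every node executes the statement at the same point in the
    # replication order, but each execution succeeds or fails on its own.
    return {n.name: n.apply_drop_column(column) for n in nodes}


nodes = [Node("node1", disk_full=True), Node("node2"), Node("node3")]
results = replicate_ddl(nodes, "col")
schemas = {n.name: tuple(n.columns) for n in nodes}
diverged = len(set(schemas.values())) > 1

print(results)
print(schemas)
print("schemas diverged:", diverged)
```

      In this model node1 keeps the col column while node2 and node3 drop it, which is the mismatch that later row events run into.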

      node1 will see an error like this:

      2015-06-16 04:45:55 7f7b000c7700 InnoDB: Error: Write to file (merge) failed at offset 68157440.
      InnoDB: 1048576 bytes should have been written, only 1036288 were written.
      InnoDB: Operating system error number 28.
      InnoDB: Check that your OS and file system support files of this size.
      InnoDB: Check also that the disk is not full or a disk quota exceeded.
      InnoDB: Error number 28 means 'No space left on device'.
      InnoDB: Some operating system error numbers are described at
      InnoDB: http://dev.mysql.com/doc/refman/5.6/en/operating-system-error-codes.html
      150616 4:45:55 [ERROR] Slave SQL: Error 'Got error 64 'Temp file write failure' from InnoDB' on query. Default database: 'db1'. Query: 'ALTER TABLE tab DROP COLUMN col', Internal MariaDB error code: 1296
      150616 4:45:55 [Warning] WSREP: RBR event 1 Query apply warning: 1, 19743667
      150616 4:45:55 [Warning] WSREP: Ignoring error for TO isolated action: source: 9f6bdb3d-0bc1-11e5-a9f2-ca15da9a1a8b version: 3 local: 0 state: APPLYING flags: 65 conn_id: 24689371 trx_id: -1 seqnos (l: 3471879, g: 19743667, s: 19743666, d: 19743666, ts: 912410740656233)
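
      Because node1 only logs a warning here, the divergence is easy to miss. As a rough sketch, a script along these lines could scan an error log for that warning. The pattern is derived only from the excerpt above, the function name is hypothetical, and reading g: as the global sequence number is an assumption.

```python
import re

# Matches the warning node1 logged when it ignored the failed DDL.
# Only fields visible in the log excerpt above are captured.
IGNORED_TOI = re.compile(
    r"\[Warning\] WSREP: Ignoring error for TO isolated action: "
    r"source: (?P<source>[0-9a-f-]+) .*?"
    r"seqnos \(l: (?P<local>-?\d+), g: (?P<global>-?\d+),"
)


def find_ignored_toi_errors(lines):
    """Return (source_uuid, g_seqno) for each ignored TOI error found."""
    hits = []
    for line in lines:
        m = IGNORED_TOI.search(line)
        if m:
            hits.append((m.group("source"), int(m.group("global"))))
    return hits


sample = [
    "150616 4:45:55 [Warning] WSREP: RBR event 1 Query apply warning: 1, 19743667",
    "150616 4:45:55 [Warning] WSREP: Ignoring error for TO isolated action: "
    "source: 9f6bdb3d-0bc1-11e5-a9f2-ca15da9a1a8b version: 3 local: 0 "
    "state: APPLYING flags: 65 conn_id: 24689371 trx_id: -1 "
    "seqnos (l: 3471879, g: 19743667, s: 19743666, d: 19743666, ts: 912410740656233)",
]

hits = find_ignored_toi_errors(sample)
print(hits)
```

      Any hit means a node kept running after a DDL statement failed locally, so its table definitions should be compared with the rest of the cluster.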

      • Because node1 now has a different table definition from node2 and node3, replication between them will eventually hit consistency errors.

      node2 and node3 might see errors like this:

      150616 5:15:11 [ERROR] Slave SQL: Column 11 of table 'db1.tab' cannot be converted from type 'int' to type 'date', Internal MariaDB error code: 1677
      150616 5:15:11 [Warning] WSREP: RBR event 2 Write_rows_v1 apply warning: 3, 19743684
      150616 5:15:11 [ERROR] WSREP: Failed to apply trx: source: 75edc58a-0bb2-11e5-a1fe-cb59d7f111b4 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 23347068 trx_id: 59826665 seqnos (l: 3742229, g: 19743684, s: 19743683, d: 19743667, ts: 768200703670749)
      150616 5:15:11 [ERROR] WSREP: Failed to apply trx 19743684 4 times
      150616 5:15:11 [ERROR] WSREP: Node consistency compromized, aborting...

      node1 will then see node2 and node3 leave the cluster, causing a loss of quorum and total cluster failure:

      150616 5:15:12 [Note] WSREP: forgetting 07459bc1 (tcp://$node2_ip:4567)
      150616 5:15:12 [Note] WSREP: (75edc58a, 'tcp://0.0.0.0:4567') address 'tcp://10.0.0.72:4567' pointing to uuid 75edc58a is blacklisted, skipping
      150616 5:15:12 [Note] WSREP: forgetting 9f6bdb3d (tcp://$node3_ip:4567)
      150616 5:15:12 [Note] WSREP: Node 75edc58a state prim
      150616 5:15:12 [Note] WSREP: view(view_id(PRIM,75edc58a,10) memb {
      75edc58a,0
      } joined {
      } left {
      } partitioned {
      07459bc1,0
      9f6bdb3d,0
      })
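
      To read a view like the one above programmatically, a parser along the following lines may help. It is a sketch based only on the layout of this excerpt; the function name is hypothetical, and other Galera versions may print views differently.

```python
import re


def parse_view(text):
    """Split a logged view(...) block into its four node lists."""
    sections = {"memb": [], "joined": [], "left": [], "partitioned": []}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.search(r"(memb|joined|left|partitioned) \{$", line)
        if header:
            # A line ending in "<name> {" opens the next section.
            current = header.group(1)
        elif current and re.fullmatch(r"[0-9a-f]+,\d+", line):
            # Entries look like "<short uuid>,<number>"; keep the uuid.
            sections[current].append(line.split(",")[0])
    return sections


log = """150616 5:15:12 [Note] WSREP: view(view_id(PRIM,75edc58a,10) memb {
75edc58a,0
} joined {
} left {
} partitioned {
07459bc1,0
9f6bdb3d,0
})"""

view = parse_view(log)
print(view)
```

      For this excerpt the parser reports 75edc58a as the only member and lists 07459bc1 and 9f6bdb3d as partitioned.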

      Should it be possible for this to happen?

      Can we fix this by making a node crash when a DDL statement fails on it while wsrep_OSU_method is set to TOI? Losing one node is usually better than total cluster failure.
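
      The trade-off can be made concrete with a small comparison. This is a sketch under simplifying assumptions: the function names are hypothetical, and the majority test is a plain "more than half of the nodes" rule, not Galera's actual quorum calculation.

```python
def has_majority(alive, cluster_size):
    """True if the surviving nodes are more than half of the cluster."""
    return len(alive) > cluster_size / 2


def survivors_if_failed_node_aborts(ddl_ok):
    # Proposed behaviour: a node whose TOI DDL failed shuts itself down,
    # so only the nodes that applied the DDL keep running.
    return [name for name, ok in ddl_ok.items() if ok]


def survivors_if_error_ignored(ddl_ok):
    # Reported behaviour: the failed node keeps running, and the nodes that
    # applied the DDL later abort on a row event they cannot apply.
    return [name for name, ok in ddl_ok.items() if not ok]


# Outcome of the ALTER TABLE on each node, as described in this report.
ddl_ok = {"node1": False, "node2": True, "node3": True}

proposed = survivors_if_failed_node_aborts(ddl_ok)
reported = survivors_if_error_ignored(ddl_ok)

print("proposed:", proposed, has_majority(proposed, len(ddl_ok)))
print("reported:", reported, has_majority(reported, len(ddl_ok)))
```

      Under these assumptions the proposed behaviour leaves two of three nodes running, while the reported behaviour leaves only node1.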

          People

            nirbhay_c Nirbhay Choubey (Inactive)
            GeoffMontee Geoff Montee (Inactive)
            Votes: 1
            Watchers: 5
