To provide a comprehensive encryption solution, we must encrypt the Galera GCache (http://www.severalnines.com/blog/understanding-gcache-galera).
"GCache" above actually stands for Galera in general, but since GCache is the most visible and contested part, we’ll leave it as GCache.
- Filesystem encryption - fully transparent and most performant, but not controllable, e.g. not possible to rotate keys and choose algorithms at will.
- Userspace encryption:
  - Whole file encryption inside GCache - safest, but mmap() is no longer directly usable. Certification data (e.g. keys) will have to be cached in RAM, and replication events will have to be either passed up to the application (MariaDB) in encrypted form or decrypted into RAM first. The latter may be costly for a huge writeset. On the other hand, if streaming replication is used, the writesets will be reasonably sized.
  - Separate encryption of replication events and metadata. Replication events can be encrypted on the application (MariaDB) side before they are passed to the wsrep provider, so Galera would not care about them and only metadata would need to be encrypted on the Galera side. This would reveal the internal structure of the GCache file and does not really buy us anything in terms of simplicity or performance. The exception is when we choose not to encrypt metadata at all: then no changes need to be made to GCache, but that may be considered quite an unsafe compromise.
I would not go into more elaborate schemes because they offer only more complexity and less safe compromises.
Of these, I chose approach 2.1 (whole file encryption inside GCache) as the basis and tried to come up with an API and a protocol that combine simplicity, flexibility and control by the user (MariaDB).
Due to the asynchronous nature of sending the writesets, the key cannot be rotated synchronously on all nodes - writesets encrypted with different keys will mix. That means that
- the writesets will have to be replicated in plain text, protected by SSL/TLS.
- the writesets will be encrypted after replication, which means that there is no need to sync encryption keys between the nodes and encryption is a purely local operation. E.g. writesets encrypted with one key on the donor may be decrypted, sent by IST in plaintext to the joiner and stored there in GCache using another key. This is suboptimal, but it is the simplest and most flexible option. Another possibility may be to introduce a key rotation event (for IST only), but that would seriously complicate the control flow (two page files written at once) and recovery - see next paragraph.
While maintaining the directory of used keys at runtime is not a big deal, recovery after a crash/restart with only the last known key may be problematic. With this in mind, when encryption is enabled the ring buffer cache will be disabled and only the page-based store will be used, with each new key starting a new page file in which the previous key is stored, encrypted, as the first message. This should make it possible to recover the cache by traversing the page files in reverse order. It also implies that key rotation must be strictly sequential.
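The reverse traversal can be illustrated with a minimal sketch. Everything here is hypothetical: the `Page` record, `start_page()` and `recover_keys()` are names invented for the illustration, and `toy_xor()` is a SHA-256 based keystream used only as a stand-in for a real stream-mode cipher. It is not secure and is not what Galera would use.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


def toy_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stand-in for a stream cipher (NOT secure): XOR with a SHA-256 keystream."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + nonce + i.to_bytes(8, "little")).digest()
        out.extend(a ^ b for a, b in zip(data[i:i + 32], block))
    return bytes(out)


@dataclass
class Page:
    """Hypothetical model of a page file header."""
    nonce: bytes             # unencrypted page nonce
    wrapped_prev_key: bytes  # first message: previous key encrypted with this page's key
                             # (empty if the page did not start a new key)


def start_page(new_key: bytes, prev_key: Optional[bytes], nonce: bytes) -> Page:
    """Open a page under new_key, storing the previous key as its first message."""
    wrapped = toy_xor(new_key, nonce, prev_key) if prev_key is not None else b""
    return Page(nonce, wrapped)


def recover_keys(pages: List[Page], last_key: bytes) -> List[bytes]:
    """Walk the pages newest to oldest, unwrapping one key per rotation."""
    keys = [b""] * len(pages)
    key = last_key
    for i in range(len(pages) - 1, -1, -1):
        keys[i] = key
        if pages[i].wrapped_prev_key:
            key = toy_xor(key, pages[i].nonce, pages[i].wrapped_prev_key)
    return keys
```

Because each page only stores its immediate predecessor's key, a gap in the chain would make all older pages unreadable, which is why rotation has to be strictly sequential.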
The writesets shall be received, encrypted and stored to the GCache page file fragment by fragment. When they are pulled from the cache, they are decrypted whole into RAM and processed as usual. They are freed from RAM when they are marked released. There are a number of potential optimizations, such as:
- Sufficiently small writesets may be kept in plaintext in RAM from the moment of reception until release, to avoid extra decryption.
- The key set (needed for subsequent certifications) may be cached in RAM separately, so that the rest of the writeset can be freed right after commit.
- For local writesets on the master, only the key set needs to be cached.
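As a rough illustration of the first optimization, the sketch below keeps writesets under a size threshold in plaintext in RAM and decrypts larger ones only on demand. The `WritesetCache` class, its method names, the `small_limit` parameter and the seqno-based IV are all assumptions made for this example, and `toy_xor()` is an insecure stand-in for a real cipher.

```python
import hashlib
from typing import Dict


def toy_xor(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Toy stand-in for a stream cipher (NOT secure): XOR with a SHA-256 keystream."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + iv + i.to_bytes(8, "little")).digest()
        out.extend(a ^ b for a, b in zip(data[i:i + 32], block))
    return bytes(out)


class WritesetCache:
    """Hypothetical cache: encrypted copy always stored, plaintext kept only if small."""

    def __init__(self, key: bytes, small_limit: int) -> None:
        self._key = key
        self._small_limit = small_limit
        self._page: Dict[int, bytes] = {}  # seqno -> ciphertext (stands in for the page file)
        self._ram: Dict[int, bytes] = {}   # seqno -> plaintext currently held in RAM
        self.decryptions = 0               # counter, only to make the effect visible

    @staticmethod
    def _iv(seqno: int) -> bytes:
        # Assumption for the sketch: derive a per-writeset IV from the seqno.
        return seqno.to_bytes(16, "little")

    def store(self, seqno: int, plaintext: bytes) -> None:
        self._page[seqno] = toy_xor(self._key, self._iv(seqno), plaintext)
        if len(plaintext) <= self._small_limit:
            self._ram[seqno] = plaintext  # small: keep plaintext until release

    def get(self, seqno: int) -> bytes:
        if seqno not in self._ram:
            self._ram[seqno] = toy_xor(self._key, self._iv(seqno), self._page[seqno])
            self.decryptions += 1
        return self._ram[seqno]

    def release(self, seqno: int) -> None:
        self._ram.pop(seqno, None)  # the plaintext is dropped, the ciphertext stays
```

The threshold trades RAM for CPU: a larger `small_limit` avoids more decryptions but keeps more plaintext resident.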
Each GCache buffer will be encrypted end-to-end and written to the page file, so that there will be no unencrypted information whatsoever except for a 32-byte page nonce, which shall be a sufficiently random string that the probability of its repetition is negligible. The IV for a given buffer will be a hash of the page nonce and, say, the buffer offset. Since buffer offsets are kept in a map, at runtime the IV can readily be computed before decryption. During recovery there may be a small overhead.
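A minimal sketch of such an IV derivation follows. The text only says "a hash", so SHA-256, the 16-byte IV length and the little-endian 64-bit encoding of the offset are assumptions made for the illustration, as are the function names.

```python
import hashlib
import os
import struct

PAGE_NONCE_SIZE = 32  # bytes, as stated in the text
IV_SIZE = 16          # assumption: IV length of a 128-bit block cipher


def new_page_nonce() -> bytes:
    """Generate the random, unencrypted nonce stored at the start of a page."""
    return os.urandom(PAGE_NONCE_SIZE)


def buffer_iv(page_nonce: bytes, buffer_offset: int) -> bytes:
    """Derive a per-buffer IV from the page nonce and the buffer's offset in the page."""
    if len(page_nonce) != PAGE_NONCE_SIZE:
        raise ValueError("page nonce must be 32 bytes")
    offset_bytes = struct.pack("<Q", buffer_offset)  # assumption: 64-bit little-endian
    return hashlib.sha256(page_nonce + offset_bytes).digest()[:IV_SIZE]
```

Since two buffers in the same page never share an offset, and two pages are very unlikely to share a nonce, the derived IVs should not repeat under the same key.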
On the master side, we may initially disallow spilling a prepared writeset to disk. Later, once streaming replication is used, that should cease to be an issue.
The API is deliberately stream-oriented to allow for encryption of arbitrarily sized buffers and streams for more functional generality.
Stream orientation and resulting block size agnosticism means that only streaming encryption modes like CFB, OFB and CTR are generally possible. This limitation can be lifted at the expense of adding more parameters to the encryption callback and requiring additional memcopies. This is probably not worth it.
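To show what block size agnosticism means in practice, here is a sketch of a stateful stream context whose output does not depend on how the input is split into fragments. The `ToyCtrStream` class and its `update()` method are hypothetical, and the SHA-256 counter keystream is only an insecure stand-in for a real CTR-mode cipher; the actual callback signature is not defined here.

```python
import hashlib


class ToyCtrStream:
    """Toy counter-mode stream (NOT secure): the keystream depends only on byte position."""

    BLOCK = 32  # size of one keystream block (SHA-256 digest length)

    def __init__(self, key: bytes, iv: bytes) -> None:
        self._key = key
        self._iv = iv
        self._pos = 0  # number of bytes processed so far

    def _keystream_block(self, index: int) -> bytes:
        return hashlib.sha256(self._key + self._iv + index.to_bytes(8, "little")).digest()

    def update(self, fragment: bytes) -> bytes:
        """Encrypt or decrypt one fragment of any length, continuing the stream."""
        out = bytearray(len(fragment))
        for i, byte in enumerate(fragment):
            pos = self._pos + i
            # Recomputed per byte for simplicity; a real implementation would cache it.
            block = self._keystream_block(pos // self.BLOCK)
            out[i] = byte ^ block[pos % self.BLOCK]
        self._pos += len(fragment)
        return bytes(out)
```

Feeding the same data in one call or in several unevenly sized calls yields identical output, and applying a fresh context with the same key and IV to the ciphertext restores the plaintext. A block mode such as CBC would instead need padding or buffering at fragment boundaries.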