== Data mapping from HBase to SQL ==
Hbase table consists of rows, which are identified by row key. Each row has an
arbitrary (potentially, very large) number of columns. (Looking at HBase
documentation, one can conclude that HBase is specifically targeting two use
cases:
- a column family has few columns
- a column family has so many columns that "get column by name" operation is
specifically optimized for (TODO: exact link!)
One can see two ways to map that to SQL tables:
=== Per-row mapping ===
Let each row in HBase table be mapped into a row from SQL point of view:
SELECT * FROM hbase_table;
row-id column1 column2
------ ------- -------
row1 data data
row2 data data
row3 data data
This is the most straightforward mapping. However, accessing some of the hbase
data will be difficult. For example, in hbase, it is natural to have this query
given a row-id, get a list of columns in the row (maybe, with their values)
in this mapping, result of this query will be one row, and there is no
convinient way to return arbitrarily-long list of columns on one row.
Table DDL could look like this:
CREATE TABLE hbase_tbl_rows (
row_id BINARY(MAX_HBASE_ROWID_LEN),
columns BLOB,
PRIMARY KEY (row_id)
) ENGINE=hbase_row;
- TODO: Does Hbase have MAX_HBASE_ROWID_LEN limit? What is it?
=== Per-cell mapping ===
HBase shell has 'scan' command, here's an example of its output:
hbase(main):007:0> scan 'testtable'
ROW COLUMN+CELL
myrow-1 column=colfam1:q1, timestamp=1297345476469, value=value-1
myrow-2 column=colfam1:q2, timestamp=1297345495663, value=value-2
myrow-2 column=colfam1:q3, timestamp=1297345508999, value=value-3
Here, one HBase row produces multiple rows in the output. Each output row
represents one (row_id, column) combination, so rows with multiple columns
(and multiple revisions of column data) are easily represented.
CREATE TABLE hbase_tbl_cells (
row_id binary(MAX_HBASE_ROWID_LEN),
column_name binary(MAX_HBASE_NAME_LEN),
timestamp TIMESTAMP,
value BLOB,
PRIMARY KEY (row_id, column_name)
) ENGINE=hbase_cell;
- TODO: Does Hbase have MAX_HBASE_ROWID_LEN limit? What is it?
- NOTE: It is nice to have SQL table DDLs independent of the content of
the backend hbase table. This saves us from the need to synchronize
table DDLs between hbase and mysql (NDB cluster had to do this and they
have ended up implementing a rather complex system for this).
== Consistency, transactions, etc ==
- HBase has single-record transactions. Does this mean that HBase storage
engine will have MyISAM-like characteristics? e.g. if we fail in the
middle of a multi-row UPDATE, there is no way to go back.
- How should writes be made? Should UPDATEs/INSERTs use checkAndPut operation
so that they don't overwrite data that's already there?
- Are the writes important at all? (e.g. if we've had the first version with
provide read-only access, would that be useful?)
== Data mapping from HBase to SQL ==
Hbase table consists of rows, which are identified by row key. Each row has an
arbitrary (potentially, very large) number of columns. (Looking at HBase
documentation, one can conclude that HBase is specifically targeting two use
cases:
specifically optimized for (TODO: exact link!)
One can see two ways to map that to SQL tables:
=== Per-row mapping ===
Let each row in HBase table be mapped into a row from SQL point of view:
SELECT * FROM hbase_table;
row-id column1 column2
------ ------- -------
row1 data data
row2 data data
row3 data data
This is the most straightforward mapping. However, accessing some of the hbase
data will be difficult. For example, in hbase, it is natural to have this query
given a row-id, get a list of columns in the row (maybe, with their values)
in this mapping, result of this query will be one row, and there is no
convinient way to return arbitrarily-long list of columns on one row.
Table DDL could look like this:
create table hbase_tbl_rows (
row_id binary(MAX_HBASE_ROWID_LEN),
columns blob,
key (row_content)
) engine=hbase_row;
=== Per-cell mapping ===
HBase shell has 'scan' command, here's an example of its output:
hbase(main):007:0> scan 'testtable'
ROW COLUMN+CELL
myrow-1 column=colfam1:q1, timestamp=1297345476469, value=value-1
myrow-2 column=colfam1:q2, timestamp=1297345495663, value=value-2
myrow-2 column=colfam1:q3, timestamp=1297345508999, value=value-3
Here, one HBase row produces multiple rows in the output. Each output row
represents one (row_id, column) combination, so rows with multiple columns
(and multiple revisions of column data) are easily represented.
create table hbase_tbl_cells (
row_id binary(MAX_HBASE_ROWID_LEN),
column_name binary(N),
timestamp TIMESTAMP,
value binary(N)
) engine=hbase_row;
saves us from need to synchronize table DDLs between hbase and mysql (which
NDB cluster had to do and they ended up implementing a rather complex
subsystem for it)
== Consistency and transactionality ==
will have MyISAM-like characteristics? e.g. if we fail in the middle of
an UPDATE, there is no way to go back?