[MDEV-431] Cassandra storage engine Created: 2012-08-04  Updated: 2020-07-14  Resolved: 2013-01-25

Status: Closed
Project: MariaDB Server
Component/s: None
Fix Version/s: 10.0.1

Type: Task Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
is blocked by MDEV-4012 10.0-base merge Closed
PartOf
includes MDEV-377 Name support for dynamic columns Closed
includes MDEV-506 Cassandra dynamic columns access Closed
includes MDEV-530 Cassandra SE: Locking is incorrect Closed
includes MDEV-534 Cassandra SE: small features Closed
includes MDEV-535 Cassandra SE: Internal error: 'TimedO... Closed
includes MDEV-3931 Cassandra SE packaging Closed
Relates
relates to MDEV-122 Data mapping between HBase and SQL Closed
relates to MDEV-476 Cassandra: Server crashes in calculat... Closed
relates to MDEV-477 Cassandra: Assertion `!table || (!tab... Closed
relates to MDEV-480 Cassandra: TRUNCATE TABLE on a Cassan... Closed
relates to MDEV-494 Cassandra: terminate called after thr... Closed
relates to MDEV-497 Cassandra: Table elimination is not w... Closed
relates to MDEV-498 Cassandra: Inserting a timestamp does... Closed
relates to MDEV-501 Cassandra SE fails to compile on Ubuntu Closed
relates to MDEV-560 Cassandra: DynCols: Server crashes in... Closed
relates to MDEV-561 Cassandra: DynCols: debugger aborting... Closed
relates to MDEV-565 Cassandra: DynCols: Server crashes in... Closed
relates to MDEV-3996 Cassandra: Error message for ER_CONNE... Closed
relates to MDEV-3997 Querying a Cassandra table on a serve... Closed
relates to MDEV-3998 Cassandra SE: Cryptic error message o... Closed
relates to MDEV-4000 Mapping between Cassandra blob (Bytes... Closed
relates to MDEV-4001 Cassandra: server crashes in ha_cassa... Closed
relates to MDEV-4003 Cassandra: Error 1032 (Can't find rec... Closed
relates to MDEV-4005 Server crashes on creating a Cassandr... Closed
relates to MDEV-4014 Cassandra: TException or Internal (un... Closed
relates to MDEV-3792 review the handler part of the cassan... Closed
relates to MDEV-23024 Remove Cassandra Storage Engine Closed

 Description   

Implement a storage engine for Cassandra instead of HBase (see MDEV-122).

See http://kb.askmonty.org/en/cassandra-storage-engine/ for user-level description



 Comments   
Comment by Sergei Petrunia [ 2012-08-15 ]

Ok, column validators do not seem to be mandatory. cassandra-cli allows inserting any (rowid, column_name, column_value) regardless of what validators are present.
This option seems to be missing from CQL: there you can only use column names that were defined.
SELECTs in CQL ignore (=do not produce) rows that do not have the required columns.

Comment by Sergei Petrunia [ 2012-08-17 ]

Mapping between CQL type names and ColumnDef::validation_class values:

blob, "org.apache.cassandra.db.marshal.BytesType"
ascii, "org.apache.cassandra.db.marshal.AsciiType"
text, "org.apache.cassandra.db.marshal.UTF8Type"
varint, "org.apache.cassandra.db.marshal.IntegerType"
int, "org.apache.cassandra.db.marshal.Int32Type"
bigint, "org.apache.cassandra.db.marshal.LongType"
uuid, "org.apache.cassandra.db.marshal.UUIDType"
timestamp, "org.apache.cassandra.db.marshal.DateType"
boolean, "org.apache.cassandra.db.marshal.BooleanType"
float, "org.apache.cassandra.db.marshal.FloatType"
double, "org.apache.cassandra.db.marshal.DoubleType"
decimal, "org.apache.cassandra.db.marshal.DecimalType"

Comment by Sergei Petrunia [ 2012-08-17 ]

...,
counter, "org.apache.cassandra.db.marshal.CounterColumnType"
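
For reference, the mapping from the two comments above can be written as a lookup table. This is only a sketch in Python: the names MARSHAL_PREFIX, CQL_TO_VALIDATOR, validation_class and cql_type_for are made up for illustration and are not part of ha_cassandra or of any Cassandra client library.

```python
# CQL type name -> validator class name, as listed in the comments above.
# All identifiers in this block are hypothetical, chosen for illustration.
MARSHAL_PREFIX = "org.apache.cassandra.db.marshal."

CQL_TO_VALIDATOR = {
    "blob": "BytesType",
    "ascii": "AsciiType",
    "text": "UTF8Type",
    "varint": "IntegerType",
    "int": "Int32Type",
    "bigint": "LongType",
    "uuid": "UUIDType",
    "timestamp": "DateType",
    "boolean": "BooleanType",
    "float": "FloatType",
    "double": "DoubleType",
    "decimal": "DecimalType",
    "counter": "CounterColumnType",
}


def validation_class(cql_type: str) -> str:
    """Return the fully qualified ColumnDef::validation_class for a CQL type."""
    return MARSHAL_PREFIX + CQL_TO_VALIDATOR[cql_type.lower()]


def cql_type_for(validation_class_name: str) -> str:
    """Reverse lookup: fully qualified validator class -> CQL type name."""
    short_name = validation_class_name[len(MARSHAL_PREFIX):]
    for cql_type, validator in CQL_TO_VALIDATOR.items():
        if validator == short_name:
            return cql_type
    raise KeyError(validation_class_name)
```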

Comment by Jonathan Ellis [ 2012-08-20 ]

Cassandra schema definition has evolved a bit. The old way to define a "wide row" CF (i.e., representing a partition of data, clustered on the comparator) was to define comparator and default_validation_class, and leave column names implicit in your application code. That is, the Cassandra "column name" would really be the value of an unnamed column that would be part of the primary key in the partition.

Since columnfamily definition defaulted to no column names, this was the default behavior, but mixing this with "static" column definitions is very bad practice. (But Cassandra will not ignore validators that are correctly declared; you are mistaken on that point.)

We cleaned this up for Cassandra 1.1 with CQL3, as outlined here: http://www.datastax.com/dev/blog/schema-in-cassandra-1-1

Old-style schema will be supported indefinitely for backwards compatibility, but cql schema is far more straightforward to use correctly.
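
To illustrate the "wide row" layout described in this comment: each cell of the stored row can be read as one logical row whose second key part is the Cassandra column name. The sketch below is plain Python with no Cassandra client involved; the function name and the sample data are invented for illustration, and sorting by name merely stands in for clustering on the comparator.

```python
from typing import Dict, List, Tuple


def wide_row_to_logical_rows(partition_key: str,
                             columns: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Turn one wide row into (partition key, column name, value) tuples.

    The Cassandra column name acts as the value of an otherwise unnamed
    key column. Sorting by name stands in for the comparator ordering.
    """
    return [(partition_key, name, value)
            for name, value in sorted(columns.items())]


# Hypothetical sample: one partition per sensor, one cell per reading time.
rows = wide_row_to_logical_rows(
    "sensor-1",
    {"2012-08-20T10:05": "21.7", "2012-08-20T10:00": "21.5"},
)
```

Here the two cells of the "sensor-1" row become two logical rows, ordered by the reading time that was stored as the column name.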

Comment by Sergei Petrunia [ 2012-08-22 ]

Yes, I've repeated my experiment and see that the defined validators are indeed enforced.

Thanks a lot for pointing out what the CQL's composite PRIMARY KEYs are! I was overwhelmed by all the new things I was learning about Cassandra, and dismissed composite PKs as "ok, these are probably just like non-composite ones, except that they are tuples". Apparently, I was wrong, they play a more important role. I'll need to think more before I understand what this means for this project, though. At least, we shouldn't ignore them.

Comment by Sergei Petrunia [ 2012-08-22 ]

Testing todo task

Comment by Sergei Petrunia [ 2012-09-12 ]
  • There is now read-only support for the counter datatype.
  • Started to do benchmarks on Amazon. First results for data load operations:
    = ha_cassandra fails to utilize the available network bandwidth
    = ha_cassandra occupies about 50% of one CPU, and seems to be the bottleneck.

Possible directions for speedup:

  • Use async API and multiple connections to Cassandra
  • Optimize ha_cassandra code to be less CPU-intensive.

Comment by Sergei Petrunia [ 2012-09-12 ]

Tried profiling ha_cassandra on my home setup and on EC2. Results from EC2 (% numbers are cumulative time):
mysqld - 99.9%

  • start_thread 66.41%
  • mysql_load 66.34%
  • read_sep_field 65.77%
  • write_record 55.81%
  • ha_cassandra::write_row 54.75%
  • (the next big one is "No map [/home/ubuntu/5.5-cassandra/sql/mysqld]" with 11.23%)
  • then an assortment of libc/libgcc locations, a lot of them pointing to std::string members.

This means: at least 54% of the time is spent in ha_cassandra::write_row(). Some of the remaining time should probably be blamed on ha_cassandra as well, because no other part of the server uses std::string.
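
As a rough cross-check of the figures above, the cumulative percentages can be compared with each other. The sketch below assumes that "cumulative" means a frame's own time plus the time of everything it calls; the function share_of is made up for illustration.

```python
# Cumulative-time percentages copied from the EC2 profile above.
cumulative = {
    "start_thread": 66.41,
    "mysql_load": 66.34,
    "read_sep_field": 65.77,
    "write_record": 55.81,
    "ha_cassandra::write_row": 54.75,
}


def share_of(parent: str, child: str) -> float:
    """Percentage of the parent's cumulative time that falls under the child."""
    return 100.0 * cumulative[child] / cumulative[parent]


# Part of the LOAD DATA time (mysql_load) that is spent under write_row().
write_row_share = share_of("mysql_load", "ha_cassandra::write_row")
```

Under that assumption, write_row() accounts for roughly 82.5% of the time spent inside mysql_load.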

Comment by Sergei Petrunia [ 2012-09-14 ]

Did some more benchmarks, results summarized here: https://lists.launchpad.net/maria-developers/msg04889.html. It seems that CPU usage of the SQL node is not actually a problem: use a release build and a better CPU. The lack of ability to use multiple connections IS a problem.

Comment by Sergei Petrunia [ 2012-09-18 ]

Started to think about how we could use multiple Thrift API connections. The Thrift library doesn't support asynchronous clients. We could use a thread-per-connection model, but think of all the effort (both development and run-time) we'll need to sync the threads.
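
A minimal sketch of what the thread-per-connection model could look like, in Python rather than in the C++ of ha_cassandra. StubConnection and load_with_connections are hypothetical names, and the stub only counts rows; it does not use the Thrift API. The shared queue is the synchronization point mentioned above.

```python
import queue
import threading
from typing import List


class StubConnection:
    """Hypothetical stand-in for one Cassandra client connection."""

    def __init__(self) -> None:
        self.rows_written = 0

    def batch_insert(self, rows: List[tuple]) -> None:
        # A real connection would send the batch over the network here.
        self.rows_written += len(rows)


def load_with_connections(batches: List[List[tuple]], n_connections: int) -> int:
    """Spread row batches over n worker threads, one connection per thread."""
    work: "queue.Queue[List[tuple]]" = queue.Queue()
    for batch in batches:
        work.put(batch)
    connections = [StubConnection() for _ in range(n_connections)]

    def worker(conn: StubConnection) -> None:
        # Each thread owns its connection; only the queue is shared.
        while True:
            try:
                batch = work.get_nowait()
            except queue.Empty:
                return
            conn.batch_insert(batch)

    threads = [threading.Thread(target=worker, args=(c,)) for c in connections]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(c.rows_written for c in connections)


# Hypothetical load: 10 batches of 10 rows each, over 4 connections.
total = load_with_connections([[(i, j) for j in range(10)] for i in range(10)], 4)
```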

There is a patch for Thrift that allows using an async client: https://issues.apache.org/jira/browse/THRIFT-579. It's been maintained across a few years, which probably means it's not just something that "barely compiles". I'm going to try using it.

Comment by Sergei Petrunia [ 2012-09-18 ]

It compiles and works (btw, I figured out how to get the Cassandra.thrift output to compile with fewer edits in the generated code).

Problems:

  • this thing relies heavily on the Boost library, which is not debugger-friendly or newbie-friendly.
  • thrift-trunk/lib/cpp/src/async/TAsioAsync.cpp: TAsioClient::handleConnect() has " Todo: call user-provided errback" for error handling.
  • I don't know what I should call in the "on_connect" callback to have the code return from io_service.run().

Comment by Jonathan Ellis [ 2012-09-18 ]

You might want to use the Cassandra native protocol introduced in 1.2 (trunk): https://github.com/apache/cassandra/blob/trunk/doc/native_protocol.spec

Comment by Sergei Petrunia [ 2012-09-19 ]

Thanks for the note. I guess the protocol has been pushed fairly recently? IIRC when I checked for it, there was only a Jira entry.
I don't think I'll be able to implement the protocol before my nearest delivery; we'll consider implementing it for the milestone after that.

Comment by Sergei Petrunia [ 2012-09-25 ]

Added support for 'varint' type.

Comment by Jonathan Ellis [ 2012-09-30 ]

Saw that you announced a preview release. Congrats!

Wanted to make sure you guys were aware of a couple changes we're making:

  1. Cassandra is moving away from "dynamic columns" per se. Although supporting that is nice for legacy purposes, in CQL3 (opt-in in C* 1.1, default in C* 1.2, although fallback to CQL2 is still available) columns must be defined before use.
  2. Cassandra is adding collections (maps, lists, and sets) to CQL3 in 1.2. (This is available in our recent beta1 release.) Not sure how you'd want to expose that, tbh... I don't think dynamic columns is necessarily a good fit for Maps for instance since you can have key conflicts.

Comment by Oleksandr Byelkin [ 2012-10-01 ]

JFYI: our dynamic columns collect not only Cassandra dynamic columns, but all columns that are not mentioned in the MariaDB table definition.
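
A small sketch of that rule in Python: every column of a Cassandra row that is not declared on the MariaDB side goes into the dynamic-columns set. The function split_row and the sample data are invented for illustration and do not reflect the actual ha_cassandra code.

```python
from typing import Dict, Set, Tuple


def split_row(cassandra_row: Dict[str, str],
              declared_columns: Set[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split a row into (declared columns, everything else).

    The second dict models what would be packed into the dynamic-columns
    field of the MariaDB table.
    """
    declared = {k: v for k, v in cassandra_row.items() if k in declared_columns}
    dynamic = {k: v for k, v in cassandra_row.items() if k not in declared_columns}
    return declared, dynamic


# Hypothetical row: only "name" and "email" are declared in the table.
declared, dynamic = split_row(
    {"name": "alice", "email": "a@example.com", "shoe_size": "38"},
    {"name", "email"},
)
```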

Comment by Sergei Golubchik [ 2013-01-25 ]

pushed in 10.0-base
