MDEV-476 Cassandra: Server crashes in calculate_key_len on DELETE with ORDER BY (Closed)
MDEV-477 Cassandra: Assertion `!table || (!table->read_set || bitmap_is_set(table->read_set, field_index))' failed on DELETE with ORDER BY (Closed)
MDEV-480 Cassandra: TRUNCATE TABLE on a Cassandra table does not remove rows (Closed)
MDEV-494 Cassandra: terminate called after throwing an instance of 'apache::thrift::transport::TTransportException' or a phantom row after big INSERT or LOAD (Closed)
MDEV-497 Cassandra: Table elimination is not working (Closed)
MDEV-498 Cassandra: Inserting a timestamp does not work on a 32-bit system
Sergei Petrunia added a comment -
Ok, column validators do not seem to be mandatory: cassandra-cli allows inserting any (rowid, column_name, column_value) triple regardless of which validators are present.
This capability seems to be missing from CQL, where you can only use column names that were defined.
SELECTs in CQL ignore (i.e. do not produce) rows that do not have the required columns.
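For illustration, here is a minimal C++ sketch (the host, keyspace, column family, and column names are all hypothetical) of what cassandra-cli does under the hood: the low-level Thrift insert() call takes an arbitrary (row key, column name, column value) triple, so in an old-style column family a column name never has to be declared up front.
{code:cpp}
// Sketch only: keyspace "ks1", column family "cf1" and the column name are
// made up for this example. Uses the C++ client generated from
// Cassandra.thrift (Thrift 0.x era, hence boost::shared_ptr).
#include <boost/shared_ptr.hpp>
#include <protocol/TBinaryProtocol.h>
#include <transport/TSocket.h>
#include <transport/TBufferTransports.h>
#include "Cassandra.h"

using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;
using namespace org::apache::cassandra;

int main() {
  boost::shared_ptr<TSocket> socket(new TSocket("localhost", 9160));
  boost::shared_ptr<TTransport> transport(new TFramedTransport(socket));
  boost::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));
  CassandraClient client(protocol);

  transport->open();
  client.set_keyspace("ks1");

  ColumnParent parent;
  parent.column_family = "cf1";

  // No schema entry is needed for this name: in an old-style ("wide row")
  // CF the column name is just data, checked only by the comparator, while
  // the value is checked by default_validation_class.
  Column col;
  col.name = "column_nobody_declared";
  col.value = "some value";
  col.__isset.value = true;
  col.timestamp = 1;  // a real client would use current time in microseconds
  col.__isset.timestamp = true;

  client.insert("row1", parent, col, ConsistencyLevel::ONE);
  transport->close();
  return 0;
}
{code}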
Jonathan Ellis added a comment -
Cassandra schema definition has evolved a bit. The old way to define a "wide row" CF (i.e., one representing a partition of data, clustered on the comparator) was to define the comparator and default_validation_class and leave column names implicit in your application code. That is, the Cassandra "column name" would really be the value of an unnamed column that would be part of the primary key in the partition.
Since column family definitions defaulted to no column names, this was the default behavior, but mixing it with "static" column definitions is very bad practice. (But Cassandra will not ignore validators that are correctly declared; you are mistaken on that point.)
We cleaned this up for Cassandra 1.1 with CQL3, as outlined here: http://www.datastax.com/dev/blog/schema-in-cassandra-1-1
Old-style schema will be supported indefinitely for backwards compatibility, but CQL schema is far more straightforward to use correctly.
Sergei Petrunia added a comment -
Yes, I've repeated my experiment and can confirm that the defined validators are indeed enforced.
Thanks a lot for pointing out what CQL's composite PRIMARY KEYs are! I was overwhelmed by all the new things I was learning about Cassandra and dismissed composite PKs as "probably just like non-composite ones, except that they are tuples". Apparently I was wrong: they play a more important role. I'll need to think more before I understand what this means for this project, though. At the very least, we shouldn't ignore them.
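To make the distinction concrete, a hedged sketch of a CQL3 composite-key table along the lines of the linked blog post: the first PRIMARY KEY component becomes the partition key (the old row key), and the remaining components become clustering columns (the old comparator-ordered column-name components). The table and column names are hypothetical; it reuses the Thrift client from the sketch above and assumes the set_cql_version()/execute_cql_query() calls of the 1.1 API.
{code:cpp}
// Sketch only: the "events" table and its columns are invented for this
// example. Assumes a connected CassandraClient as set up earlier.
#include <string>
#include "Cassandra.h"

using namespace org::apache::cassandra;

void create_composite_pk_table(CassandraClient& client) {
  client.set_cql_version("3.0.0");  // CQL3 is opt-in in Cassandra 1.1

  // user_id -> partition key (old-style row key);
  // event_time -> clustering column (old-style column-name component).
  // Each (user_id, event_time) pair is one row in CQL3, but one column
  // of the old wide row.
  std::string ddl =
      "CREATE TABLE events ("
      "  user_id    varchar,"
      "  event_time timestamp,"
      "  payload    varchar,"
      "  PRIMARY KEY (user_id, event_time)"
      ")";
  CqlResult result;
  client.execute_cql_query(result, ddl, Compression::NONE);
}
{code}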
Sergei Petrunia added a comment -
There is now read-only support for the counter datatype.
Started to do benchmarks on Amazon. First results for data load operations:
- ha_cassandra fails to utilize the available network bandwidth;
- ha_cassandra occupies about 50% of one CPU, and seems to be the bottleneck.
Possible directions for speedup:
- use the async API and multiple connections to Cassandra;
- optimize the ha_cassandra code to be less CPU-intensive.
Sergei Petrunia added a comment -
Tried profiling ha_cassandra on a home setup and on EC2. Results from EC2 (% numbers are cumulative time):

mysqld                             99.9%
  start_thread                     66.41%
    mysql_load                     66.34%
      read_sep_field               65.77%
        write_record               55.81%
          ha_cassandra::write_row  54.75%

The next big one is "No map [/home/ubuntu/5.5-cassandra/sql/mysqld]" with 11.23%, followed by an assortment of libc/libgcc locations, many of them pointing to std::string members.
This means at least 54% of the time is spent in ha_cassandra::write_row(). Some of the other time should probably be blamed on ha_cassandra as well, because no other part of the server uses std::string.
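The std::string suspicion can be illustrated with a hypothetical pattern (not taken from the actual ha_cassandra code): constructing fresh std::string objects per field per row costs one heap allocation each, while reusing a long-lived buffer via assign() keeps the existing capacity and allocates nothing in the steady state.
{code:cpp}
// Hypothetical illustration of per-row std::string churn; this is not the
// actual ha_cassandra code.
#include <string>

struct NamedValue {
  std::string name;
  std::string value;
};

// Per-row temporaries: every call heap-allocates two strings (and may copy
// the struct on return, pre-C++11).
NamedValue make_slow(const char* name, const char* val, size_t val_len) {
  NamedValue nv;
  nv.name = std::string(name);
  nv.value = std::string(val, val_len);
  return nv;
}

// Buffer reuse: assign() into a long-lived object keeps the existing
// capacity, so steady-state row conversion does no allocation at all.
void fill_fast(NamedValue& reusable, const char* name, const char* val,
               size_t val_len) {
  reusable.name.assign(name);
  reusable.value.assign(val, val_len);
}
{code}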
Sergei Petrunia added a comment -
Did some more benchmarks; the results are summarized here: https://lists.launchpad.net/maria-developers/msg04889.html. It seems CPU usage on the SQL node is not actually a problem: get a release build and a better CPU. The lack of support for multiple connections IS a problem.
Sergei Petrunia added a comment -
Started to think about how we could use multiple Thrift API connections. The Thrift library doesn't support asynchronous clients. We could use a thread-per-connection model (sketched below), but think of all the effort, both development and run-time, we'd need to synchronize the threads.
There is a patch for Thrift that allows using an async client: https://issues.apache.org/jira/browse/THRIFT-579. It has been maintained across a few years, which probably means it's not just something that "barely compiles". I'm going to try using it.
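A minimal, purely hypothetical sketch of the plumbing the thread-per-connection model mentioned above would need per connection (a locked queue plus a condition variable) before any Cassandra-specific code is even written:
{code:cpp}
// Hypothetical illustration of thread-per-connection synchronization cost;
// nothing like this exists in ha_cassandra. Each connection thread would
// consume work from a queue like this one.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

struct WorkQueue {
  std::queue<std::string> items;  // e.g. serialized mutations to send
  std::mutex m;
  std::condition_variable cv;
  bool shutdown = false;

  void push(std::string row) {
    {
      std::lock_guard<std::mutex> lk(m);
      items.push(std::move(row));
    }
    cv.notify_one();
  }

  // Returns false only on shutdown with an empty queue.
  bool pop(std::string& row) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [this] { return !items.empty() || shutdown; });
    if (items.empty()) return false;
    row = std::move(items.front());
    items.pop();
    return true;
  }
};
{code}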
Sergei Petrunia added a comment -
It compiles and works (by the way, I figured out how to get the Cassandra.thrift output to compile with fewer edits to the generated code).
Problems:
- this thing relies heavily on the boost library, which is neither debugger-friendly nor newbie-friendly;
- thrift-trunk/lib/cpp/src/async/TAsioAsync.cpp: TAsioClient::handleConnect() has a "Todo: call user-provided errback" where error handling should be;
- I don't know what I should call in the "on_connect" callback to make the code return from io_service.run() (see the sketch below).
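Independent of the THRIFT-579 code, a small self-contained boost::asio sketch of the underlying mechanism: io_service::run() returns when it runs out of work or when io_service::stop() is called, so one plausible answer is for the on_connect/errback callback to call stop() on failure.
{code:cpp}
// Self-contained boost::asio sketch; the posted handler is a stand-in for
// TAsioClient's on_connect/errback callback.
#include <boost/asio.hpp>
#include <iostream>

int main() {
  boost::asio::io_service io;

  // Keeps run() from returning while we wait for (simulated) I/O.
  boost::asio::io_service::work work(io);

  // Stand-in for the error callback: stopping the io_service makes the
  // blocked run() call below return.
  io.post([&io]() {
    std::cerr << "connect failed, unblocking run()\n";
    io.stop();
  });

  io.run();  // returns once stop() has been called
  return 0;
}
{code}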
Jonathan Ellis added a comment -
You might want to use the Cassandra native protocol introduced in 1.2 (trunk): https://github.com/apache/cassandra/blob/trunk/doc/native_protocol.spec
Sergei Petrunia added a comment -
Thanks for the note. I guess the protocol has been pushed fairly recently? IIRC, when I checked for it, there was only a Jira entry.
I don't think I'll be able to implement the protocol before my nearest delivery; we'll consider implementing it for the milestone after that.
Jonathan Ellis added a comment -
Saw that you announced a preview release. Congrats!
Wanted to make sure you guys were aware of a couple of changes we're making:
- Cassandra is moving away from "dynamic columns" per se. Although supporting them is nice for legacy purposes, in CQL3 (opt-in in C* 1.1, default in C* 1.2, although fallback to CQL2 is still available) columns must be defined before use.
- Cassandra is adding collections (maps, lists, and sets) to CQL3 in 1.2. (This is available in our recent beta1 release.) Not sure how you'd want to expose that, tbh... I don't think dynamic columns are necessarily a good fit for maps, for instance, since you can have key conflicts.
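For reference, a hedged sketch of the collection syntax in question; the table and column names are hypothetical, and it assumes the execute_cql3_query() Thrift call added in Cassandra 1.2. A map's keys are unique, so two writers using the same key silently overwrite each other, which is the key-conflict problem that makes a plain dynamic-columns mapping awkward.
{code:cpp}
// Sketch only: "users"/"attrs" are invented names, and the
// execute_cql3_query() signature is the one from the Cassandra 1.2 API.
#include <string>
#include "Cassandra.h"

using namespace org::apache::cassandra;

void demo_map_column(CassandraClient& client) {
  CqlResult result;

  client.execute_cql3_query(result,
      "CREATE TABLE users (id varchar PRIMARY KEY, attrs map<text, text>)",
      Compression::NONE, ConsistencyLevel::ONE);

  // Writing the same map key twice keeps only the last value -- the
  // "key conflict" mentioned above.
  client.execute_cql3_query(result,
      "UPDATE users SET attrs['color'] = 'red' WHERE id = 'u1'",
      Compression::NONE, ConsistencyLevel::ONE);
}
{code}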
Oleksandr Byelkin added a comment -
JFYI: our dynamic columns collect not only Cassandra's dynamic columns but also any columns not mentioned in the MariaDB table definition.