Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 10.0.1
    • Component/s: None
    • Labels: None

    Description

      Implement HBase storage for Cassandra instead.

      See http://kb.askmonty.org/en/cassandra-storage-engine/ for user-level description

    Activity

            psergei Sergei Petrunia created issue -
            serg Sergei Golubchik made changes -
            Field Original Value New Value
            psergei Sergei Petrunia made changes -
            Assignee Sergei Petrunia [ psergey ]
            psergei Sergei Petrunia made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            Ok, column validators do not seem to be mandatory: cassandra-cli allows inserting any (rowid, column_name, column_value) regardless of which validators are present.
            This capability seems to be missing from CQL, where you can only use column names that were defined.
            SELECTs in CQL ignore (i.e., do not produce) rows that do not have the required columns.

            psergei Sergei Petrunia added a comment -

            Mapping between CQL type names and ColumnDef::validation_class values:

            blob, "org.apache.cassandra.db.marshal.BytesType"
            ascii, "org.apache.cassandra.db.marshal.AsciiType"
            text, "org.apache.cassandra.db.marshal.UTF8Type"
            varint, "org.apache.cassandra.db.marshal.IntegerType"
            int, "org.apache.cassandra.db.marshal.Int32Type"
            bigint, "org.apache.cassandra.db.marshal.LongType"
            uuid, "org.apache.cassandra.db.marshal.UUIDType"
            timestamp, "org.apache.cassandra.db.marshal.DateType"
            boolean, "org.apache.cassandra.db.marshal.BooleanType"
            float, "org.apache.cassandra.db.marshal.FloatType"
            double, "org.apache.cassandra.db.marshal.DoubleType"
            decimal, "org.apache.cassandra.db.marshal.DecimalType"

            psergei Sergei Petrunia added a comment -
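The mapping above can be captured as a simple lookup table. A minimal sketch (the validator class names are copied verbatim from the comment; the helper function is hypothetical):

```python
# CQL type name -> ColumnDef::validation_class, per the mapping listed above.
CQL_TO_VALIDATOR = {
    "blob":      "org.apache.cassandra.db.marshal.BytesType",
    "ascii":     "org.apache.cassandra.db.marshal.AsciiType",
    "text":      "org.apache.cassandra.db.marshal.UTF8Type",
    "varint":    "org.apache.cassandra.db.marshal.IntegerType",
    "int":       "org.apache.cassandra.db.marshal.Int32Type",
    "bigint":    "org.apache.cassandra.db.marshal.LongType",
    "uuid":      "org.apache.cassandra.db.marshal.UUIDType",
    "timestamp": "org.apache.cassandra.db.marshal.DateType",
    "boolean":   "org.apache.cassandra.db.marshal.BooleanType",
    "float":     "org.apache.cassandra.db.marshal.FloatType",
    "double":    "org.apache.cassandra.db.marshal.DoubleType",
    "decimal":   "org.apache.cassandra.db.marshal.DecimalType",
}

def validator_for(cql_type: str) -> str:
    """Return the validation_class for a CQL type name; raises KeyError if unknown."""
    return CQL_TO_VALIDATOR[cql_type]
```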

            ...,
            counter, "org.apache.cassandra.db.marshal.CounterColumnType"

            psergei Sergei Petrunia added a comment -
            psergei Sergei Petrunia made changes -
            Description (original value): Implement HBase storage for Cassandra instead.
            Description (new value): Implement HBase storage for Cassandra instead.
            See http://kb.askmonty.org/en/cassandra-storage-engine/ for user-level description

            Cassandra schema definition has evolved a bit. The old way to define a "wide row" CF (i.e., representing a partition of data, clustered on the comparator) was to define comparator and default_validation_class, and leave column names implicit in your application code. That is, the Cassandra "column name" would really be the value of an unnamed column that would be part of the primary key in the partition.

            Since columnfamily definition defaulted to no column names, this was the default behavior, but mixing this with "static" column definitions is very bad practice. (But Cassandra will not ignore validators that are correctly declared, you are mistaken on that point.)

            We cleaned this up for Cassandra 1.1 with CQL3, as outlined here: http://www.datastax.com/dev/blog/schema-in-cassandra-1-1

            Old-style schema will be supported indefinitely for backwards compatibility, but CQL schema is far more straightforward to use correctly.

            jbellis Jonathan Ellis added a comment -
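The "wide row" point above can be illustrated with a toy data model. A sketch with made-up names: the same physical data viewed the old way (implicit, comparator-ordered column names per row key) and the CQL3 way, where the former "column name" becomes a clustering column of a composite PRIMARY KEY:

```python
# Old-style wide row: one row key, column names carry data (here, dates).
wide_row = {
    "rowkey1": {"2012-08-01": "10.5", "2012-08-02": "11.0"},
}

# CQL3 view: PRIMARY KEY (rowkey, event_date). Each (row key, clustering
# value) pair becomes its own CQL row; the old column value becomes a column.
cql3_rows = [
    (rowkey, clustering, value)
    for rowkey, cols in wide_row.items()
    for clustering, value in sorted(cols.items())
]
```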

            Yes, I've repeated my experiment and see that the defined validators are indeed enforced.

            Thanks a lot for pointing out what CQL's composite PRIMARY KEYs are! I was overwhelmed by all the new things I was learning about Cassandra, and dismissed composite PKs as "ok, these are probably just like non-composite ones, except that they are tuples". Apparently I was wrong; they play a more important role. I'll need to think more before I understand what this means for this project, though. At the least, we shouldn't ignore them.

            psergei Sergei Petrunia added a comment -

            Testing todo task

            psergei Sergei Petrunia added a comment -
            psergei Sergei Petrunia made changes -
            elenst Elena Stepanova made changes -
            psergei Sergei Petrunia made changes -
            • There is now read-only support for the counter datatype.
            • Started benchmarks on Amazon. First results for data-load operations:
              = ha_cassandra fails to utilize the available network bandwidth
              = ha_cassandra occupies about 50% of one CPU, and seems to be the bottleneck.

            Possible directions for speedup:

            • Use an async API and multiple connections to Cassandra
            • Optimize the ha_cassandra code to be less CPU-intensive.
            psergei Sergei Petrunia added a comment -
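The "multiple connections" direction above can be sketched as round-robin batching across parallel writers. A toy illustration (the `send_batch` function is a stand-in for a real per-connection Thrift batch write; all names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def send_batch(conn_id, rows):
    # Stand-in for a blocking batch write over one Cassandra connection.
    return (conn_id, len(rows))

def parallel_load(rows, n_connections=4):
    # Round-robin the rows across connections, then flush each slice
    # in parallel, one worker thread per connection.
    slices = [rows[i::n_connections] for i in range(n_connections)]
    with ThreadPoolExecutor(max_workers=n_connections) as pool:
        return list(pool.map(send_batch, range(n_connections), slices))

results = parallel_load(list(range(10)), n_connections=4)
```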

            Tried profiling ha_cassandra on a home setup and on EC2. Results from EC2 (% numbers are cumulative time):
            mysqld - 99.9%

            • start_thread 66.41%
            • mysql_load 66.34%
            • read_sep_field 65.77%
            • write_record 55.81%
            • ha_cassandra::write_row 54.75%
            • (the next big one is "No map [/home/ubuntu/5.5-cassandra/sql/mysqld]") with 11.23%
            • then an assortment of libc/libgcc locations, many of them pointing to std::string members.

            This means at least 54% of the time is spent in ha_cassandra::write_row(). Some of the other time should probably be blamed on ha_cassandra as well, because no other part of the server uses std::string.

            psergei Sergei Petrunia added a comment -
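A quick Amdahl's-law estimate shows what the profile above implies: even if write_row's 54.75% cumulative share were optimized away entirely, overall load throughput could at most roughly double. A sketch of the arithmetic:

```python
def amdahl(optimized_fraction, speedup_factor):
    # Overall speedup when only `optimized_fraction` of the work is
    # accelerated by `speedup_factor`.
    return 1.0 / ((1.0 - optimized_fraction)
                  + optimized_fraction / speedup_factor)

p = 0.5475  # ha_cassandra::write_row's cumulative share in the EC2 profile

# Limiting case: write_row made infinitely fast.
bound = 1.0 / (1.0 - p)   # upper bound on total speedup, about 2.2x
```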
            psergei Sergei Petrunia made changes -

            Did some more benchmarks; results are summarized here: https://lists.launchpad.net/maria-developers/msg04889.html. It seems CPU usage of the SQL node is not actually a problem: a release build and a better CPU take care of it. The lack of ability to use multiple connections IS a problem.

            psergei Sergei Petrunia added a comment -

            Started to think about how we could use multiple Thrift API connections. The Thrift library doesn't support asynchronous clients. We could use a thread-per-connection model, but think of all the effort (both development and run-time) we'll need to synchronize the threads.

            There is a patch for Thrift that allows using an async client: https://issues.apache.org/jira/browse/THRIFT-579. It's been maintained across a few years, which probably means it's not just something that "barely compiles". I'm going to try using it.

            psergei Sergei Petrunia added a comment -
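The thread-per-connection model weighed above can be sketched with a shared work queue: one worker thread per (simulated) connection, fed rows and shut down by sentinels. The queue, sentinel, and join machinery is exactly the synchronization effort the comment worries about. All names here are hypothetical; the append stands in for a blocking Thrift insert:

```python
import threading, queue

def connection_worker(conn_id, jobs, results):
    # One thread per Cassandra connection, draining a shared job queue.
    while True:
        row = jobs.get()
        if row is None:                    # sentinel: close this connection
            break
        results.append((conn_id, row))     # stand-in for client.insert(row)

jobs, results = queue.Queue(), []
threads = [threading.Thread(target=connection_worker, args=(i, jobs, results))
           for i in range(2)]
for t in threads:
    t.start()
for row in range(6):                       # enqueue the rows to write
    jobs.put(row)
for _ in threads:                          # one sentinel per worker
    jobs.put(None)
for t in threads:
    t.join()
```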

            It compiles and works (btw, figured out how to get the Cassandra.thrift output to compile with fewer edits to the generated code).

            Problems:

            • this thing relies heavily on the boost library, which is not debugger-friendly or newbie-friendly.
            • thrift-trunk/lib/cpp/src/async/TAsioAsync.cpp: TAsioClient::handleConnect() has "Todo: call user-provided errback" for error handling.
            • I don't know what I should call in the "on_connect" callback to have the code return from io_service.run().
            psergei Sergei Petrunia added a comment -
            serg Sergei Golubchik made changes -
            Fix Version/s 10.0.1 [ 11400 ]
            Fix Version/s 10.0.0 [ 10000 ]

            You might want to use the Cassandra native protocol introduced in 1.2 (trunk): https://github.com/apache/cassandra/blob/trunk/doc/native_protocol.spec

            jbellis Jonathan Ellis added a comment -

            Thanks for the note. I guess the protocol has been pushed fairly recently? IIRC, when I checked for it, there was only a Jira entry.
            I think I won't be able to implement the protocol before my nearest delivery; we'll consider implementing it for the milestone after that.

            psergei Sergei Petrunia added a comment -

            Added support for 'varint' type.

            psergei Sergei Petrunia added a comment -
            elenst Elena Stepanova made changes -

            Saw that you announced a preview release. Congrats!

            Wanted to make sure you guys were aware of a couple changes we're making:

            1. Cassandra is moving away from "dynamic columns" per se. Although supporting that is nice for legacy purposes, in CQL3 (opt-in in C* 1.1, default in C* 1.2, although fallback to CQL2 is still available) columns must be defined before use.
            2. Cassandra is adding collections (maps, lists, and sets) to CQL3 in 1.2. (This is available in our recent beta1 release.) Not sure how you'd want to expose that, tbh... I don't think dynamic columns is necessarily a good fit for Maps for instance since you can have key conflicts.
            jbellis Jonathan Ellis added a comment -
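The key-conflict concern about exposing CQL3 maps through dynamic columns can be shown with a toy example. A sketch with hypothetical names: naively flattening a map collection into the flat dynamic-column namespace collides with any statically defined column of the same name:

```python
def flatten_into_dynamic(static_cols, cql_map):
    # Naively merge a CQL3 map collection into a flat dynamic-column
    # namespace; keys shared with static columns are exactly the conflict
    # case raised above (here the map value silently wins).
    conflicts = set(static_cols) & set(cql_map)
    merged = {**static_cols, **cql_map}
    return merged, conflicts

static_cols = {"name": "alice", "city": "Oslo"}
cql_map     = {"city": "Bergen", "zip": "5003"}   # "city" collides
merged, conflicts = flatten_into_dynamic(static_cols, cql_map)
```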

            JFYI: our dynamic columns collect not only Cassandra dynamic columns, but also all columns not mentioned in the MariaDB table definition.

            sanja Oleksandr Byelkin added a comment -
            serg Sergei Golubchik made changes -
            elenst Elena Stepanova made changes -
            serg Sergei Golubchik made changes -

            pushed in 10.0-base

            serg Sergei Golubchik added a comment -
            serg Sergei Golubchik made changes -
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Workflow defaullt [ 13668 ] MariaDB v2 [ 44224 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            Workflow MariaDB v2 [ 44224 ] MariaDB v3 [ 63513 ]
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 63513 ] MariaDB v4 [ 131958 ]

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:
