Cassandra - Overlapping Data Ranges

#1
I have the following 'Tasks' table in Cassandra.

- Task_ID UUID - Partition Key
- Starts_On TIMESTAMP - Clustering Column
- Ends_On TIMESTAMP - Clustering Column

I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass two timestamps (T1 and T2) as parameters to the query, I want to get all the tasks that fall within that range (that is, overlapping records).

What is the best way to do this in Cassandra? I cannot simply use two range conditions on Starts_On and Ends_On here, because to add a range query on Ends_On, I would have to have an equality check on Starts_On.
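For reference, here is the schema as described, sketched in CQL (reconstructed from the column list above):

CREATE TABLE tasks (
    task_id UUID,
    starts_on TIMESTAMP,
    ends_on TIMESTAMP,
    PRIMARY KEY ((task_id), starts_on, ends_on)
);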
Reply

#2
In CQL you can only range query on one clustering column at a time, so you'll probably need to do some client-side filtering in your application. You could range query on starts_on, and as rows are returned, check ends_on in your application and discard the rows that you don't want.
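A minimal sketch of that approach with the DataStax Python driver. Note the assumptions: the keyspace name and the partition key `p` are hypothetical (a range on `starts_on` needs the partition pinned down, as the next answer discusses), and `t1`/`t2` are the query window.

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')  # keyspace is an assumption

t1 = datetime(2015, 8, 24)  # start of the window (illustrative)
t2 = datetime(2015, 8, 25)  # end of the window (illustrative)

# A task overlaps [t1, t2] iff starts_on <= t2 AND ends_on >= t1.
# Cassandra applies the starts_on bound server-side (one clustering
# column); the ends_on check happens client-side below.
rows = session.execute(
    "SELECT task_id, starts_on, ends_on FROM tasks "
    "WHERE p = %s AND starts_on <= %s",  # 'p' is a hypothetical partition key
    (1, t2))

overlapping = [row for row in rows if row.ends_on >= t1]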
Reply

#3
A while ago I wrote an application that faced a similar problem, querying events that had both start and end times. For our scenario, I was able to partition on a userid (as queries were for events of a specific user) and cluster on event time and event type. The table structure looked something like this:

CREATE TABLE userEvents (
    userid UUID,
    eventTime TIMEUUID,
    eventType TEXT,
    eventDesc TEXT,
    PRIMARY KEY ((userid), eventTime, eventType)
);

With this structure, I can query by `userid` and `eventtime`:

SELECT userid, dateof(eventtime), eventtype, eventdesc FROM userevents
WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68
AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');

userid | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 | End | event1
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 | Begin | lunch
dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 | End | lunch

(3 rows)

That query will give me all event rows for a particular user for today.

**NOTES:**

- If you need to query by whether an event is starting or ending (I did not), you will want to order `eventType` ahead of `eventTime` in the primary key (see the sketch after these notes).
- You will store each event twice (once for the beginning, and once for the end). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
- In your case, you will want to find a good key to partition on, as `Task_ID` will be too unique (high cardinality). This is a *must* in Cassandra, as you cannot range query on a partition key (only a clustering key).
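For the variant described in the first note above, the primary key would be reordered so `eventType` can be pinned with an equality check before a range on `eventTime`. A sketch (the table name is illustrative):

CREATE TABLE userEventsByType (
    userid UUID,
    eventTime TIMEUUID,
    eventType TEXT,
    eventDesc TEXT,
    PRIMARY KEY ((userid), eventType, eventTime)
);

With that ordering you could run something like `WHERE userid=? AND eventtype='Begin' AND eventtime >= mintimeuuid(...)`, since CQL requires equality on the earlier clustering columns before a range on a later one.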
Reply

#4
Here's another idea (somewhat unconventional). You could create a user-defined function (UDF) to implement the second range filter (in Cassandra 2.2 and newer).

Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):

CREATE TABLE tasks (
    p int,
    task_id timeuuid,
    start int,
    end int,
    end_range int static,
    PRIMARY KEY (p, start)
);

Now we create a user-defined function that checks returned rows based on the end time and returns the task_id of matching rows, like this:

CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int)
CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS
'if (end <= end_range) return task_id; else return null;';

Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user-defined function, so to work around that, we pass a static column (end_range) as our constant.

So first we have to set the end_range we want:

UPDATE tasks SET end_range=15 WHERE p=1;

And let's say we have this data:

SELECT * FROM tasks;

p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
1 | 1 | 15 | 5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
1 | 2 | 15 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
1 | 4 | 15 | 22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
1 | 8 | 15 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb

Now let's get the task_ids that have start >= 2 and end <= 15:

SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks
WHERE p=1 AND start >= 2;

start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
2 | 7 | 3233a040-4a88-11e5-a180-433e07a8bafb
4 | 22 | null
8 | 15 | 37ec7840-4a88-11e5-a180-433e07a8bafb

So that gives you the matching task_ids, and you have to ignore the null rows client-side (I haven't figured out a way to drop rows using UDFs); see the sketch below. You'll note that the filter of start >= 2 dropped one row before it was ever passed to the UDF.
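For example, a sketch of that client-side null filtering with the DataStax Python driver (the keyspace `test` comes from the query output above; the driver usage is otherwise an assumption):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('test')

rows = session.execute(
    "SELECT start, end, my_end_range(task_id, end, end_range) AS task_id "
    "FROM tasks WHERE p = %s AND start >= %s",
    (1, 2))

# Keep only the rows where the UDF returned a matching task_id.
matching = [row.task_id for row in rows if row.task_id is not None]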

Anyway, it's obviously not a perfect method, but it might be something you can work with. :)
Reply

#5
There doesn't seem to be a completely satisfactory way to do this in Cassandra but the following method seems to work well:

I cluster the table on the `Starts_On` timestamp in descending order. (`Ends_On` is just a regular column.) Then I constrain the query with `Starts_On<?` where the parameter is the end of the period of interest - i.e. filter out events that start after our period of interest has finished.

I then iterate through the results until the row's `Ends_On` is earlier than the start of the period of interest, and throw away the rest of the result rows. (Note that this assumes events don't overlap - there are no subsequent results with a later `Ends_On`.)

Throwing away the rest of the result rows might seem wasteful, but **here's the crucial bit**: You can set the paging size sufficiently small that the number of rows to throw away is relatively small, even if the total number of rows is very large.

Ideally you want the paging size just a little bigger than the total number of relevant rows you expect to receive back. If the paging size is too small, the driver ends up retrieving multiple pages, which could hurt performance. If it is too large, you end up throwing away a lot of rows, and again this could hurt performance by transferring more data than necessary. In practice you can probably find a good compromise.
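As a concrete illustration, a sketch with the DataStax Python driver, where `fetch_size` sets the page size (the partition key `p`, the keyspace, and the window values are assumptions):

from datetime import datetime
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('my_keyspace')  # keyspace is an assumption

t1 = datetime(2015, 8, 24)  # start of the period of interest (illustrative)
t2 = datetime(2015, 8, 25)  # end of the period of interest (illustrative)

# Pages are fetched lazily as we iterate, so breaking early means the
# driver never pulls the pages we would otherwise throw away.
query = SimpleStatement(
    "SELECT task_id, starts_on, ends_on FROM tasks "
    "WHERE p = %s AND starts_on < %s",  # 'p' is a hypothetical partition key
    fetch_size=100)  # a little larger than the expected number of matches

overlapping = []
for row in session.execute(query, (1, t2)):
    if row.ends_on < t1:
        break  # rows arrive in Starts_On-descending order; nothing later matches
    overlapping.append(row)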
Reply


