I am working on a Big Data solution for sensor data and predictive analytics.
I am new to Big Data and have read about the lambda architecture.
I am thinking about using the Cassandra database together with Hadoop.
Cassandra is a highly available, partition-tolerant database, and Hadoop's HDFS is a file system for large analytics jobs.

If I receive data from an Internet of Things device, should the data be saved first in Hadoop and then in Cassandra?
The lambda architecture has Hadoop in the batch layer, receiving the data and sending it on to a NoSQL database in the serving layer.

Why should the data go into Hadoop first?
And what kind of data is stored in Cassandra if Hadoop contains the raw data?

The stream layer is out of scope at the moment.
I just want to understand the usage of Cassandra and Hadoop together.

The data in Hadoop is for large analytics, and Cassandra should hold the results of my Hadoop jobs.

Does that mean I can store my raw data in both? Can I store my raw data in Cassandra as well as in Hadoop if my application needs more than just the large analytics jobs?

Example

INSERT INTO temperature (weatherstation_id, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03 07:02:00', '73F');

Say this is my insert and I have thousands of them every minute.
If I want to run some large jobs, do I use Hadoop?

But I also need every single data row for my application, without analytics. Does Cassandra store that too?

The trade-off is between latency and throughput. Hadoop provides high throughput, but its latency is quite high, so in the lambda architecture Hadoop is used for batch processing. However, you may need to pass precomputed (or summarized) data on to another layer, such as a visualization layer. This precomputed data is typically stored in Cassandra or HBase for low-latency access.
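To make the batch/serving split concrete, here is a minimal sketch in plain Python. It is only an illustration: the sample rows, the `build_hourly_view` function, and the `average_for` lookup are hypothetical names, and the in-memory list and dict merely stand in for HDFS and for a Cassandra or HBase table.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw readings, standing in for the data kept in the batch layer.
RAW_READINGS = [
    ("1234ABCD", "2013-04-03 07:02:00", 73.0),
    ("1234ABCD", "2013-04-03 07:40:00", 75.0),
    ("1234ABCD", "2013-04-03 08:05:00", 78.0),
    ("5678EFGH", "2013-04-03 07:15:00", 60.0),
]


def build_hourly_view(readings):
    """Batch step: scan every raw row once and average per station and hour."""
    sums = defaultdict(lambda: [0.0, 0])  # (station, hour) -> [total, count]
    for station_id, event_time, temp_f in readings:
        parsed = datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S")
        hour = parsed.strftime("%Y-%m-%d %H")
        bucket = sums[(station_id, hour)]
        bucket[0] += temp_f
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# The precomputed view is what would be written to the low-latency store.
hourly_view = build_hourly_view(RAW_READINGS)


def average_for(station_id, hour):
    """Serving step: a single key lookup, no scan over the raw data."""
    return hourly_view.get((station_id, hour))
```

The expensive scan happens once per batch run, while each read from the visualization layer is a single key lookup, which is the throughput versus latency split described above.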

As you receive data from an IoT device, you need to save it as quickly as you can. That's exactly what Cassandra is great for.
Then you need to process this data, and since the amount of data is large, in a realistic case you do not want on-the-fly processing but batch processing (nightly, for example) instead.
This is where Hadoop comes in.
So you extract the data from Cassandra, put it into Hadoop's file system (HDFS), and then do some processing (via Hive or Spark).
You could also consider a direct Cassandra-Spark streaming job, but I'd suggest copying the data out of Cassandra first, as this lets you use the copy as a sandbox (for debugging jobs, testing new algorithms, etc.) without any impact on the Cassandra cluster's performance.
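The copy-then-process flow can be sketched roughly as follows. This is a toy stand-in, not real Cassandra or Hadoop code: `operational_rows` plays the role of the Cassandra table, a CSV file in a temporary directory plays the role of HDFS, and `export_day` and `max_temperature_per_station` are hypothetical helper names for the nightly export and the batch job.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the live Cassandra table.
operational_rows = [
    ("1234ABCD", "2013-04-03 07:02:00", 73.0),
    ("1234ABCD", "2013-04-03 15:30:00", 81.0),
    ("5678EFGH", "2013-04-03 09:00:00", 60.0),
    ("1234ABCD", "2013-04-04 07:02:00", 70.0),
]


def export_day(rows, day, directory):
    """Nightly export: copy one day's raw rows into a snapshot file."""
    path = os.path.join(directory, f"temperature_{day}.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for row in rows:
            if row[1].startswith(day):
                writer.writerow(row)
    return path


def max_temperature_per_station(path):
    """Batch job: reads only the exported copy, never the live store."""
    result = {}
    with open(path, newline="") as handle:
        for station_id, _event_time, temp_f in csv.reader(handle):
            value = float(temp_f)
            result[station_id] = max(value, result.get(station_id, value))
    return result


with tempfile.TemporaryDirectory() as snapshot_dir:
    snapshot = export_day(operational_rows, "2013-04-03", snapshot_dir)
    daily_max = max_temperature_per_station(snapshot)
```

Because the job only touches the snapshot, you can rerun it or try a different algorithm as often as you like without sending any extra reads to the store that is busy taking writes.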

You can read about Cassandra and big data [here]([To see links please register here]).
Disclaimer: I am the author of this post.
