I want to move data from a SQL Server DB to HBase/Cassandra etc. How do I decide which big data database to use?

#1
I need to develop a plan to move data from a SQL Server DB to one of the big data databases. Some of the questions I have thought of are:

1. How big is the data?
2. What is the expected growth rate for this data?
3. What kind of queries will be run frequently? e.g. look-up, range scan, full scan, etc.
4. How frequently will the data be moved from source to destination?

Can anyone help add to this questionnaire?

#2
A couple more pointers:

1. The type of NoSQL DB that suits your requirements, i.e. key-value, document, column-family or graph database.
2. The CAP theorem, to decide which matters most among Consistency, Availability and Partition tolerance.

#3
Firstly, `How big is the data` doesn't matter! It can barely be used to decide which NoSQL DB to use, as most NoSQL DBs are built for easy scalability and storage. So what matters is the ***queries you fire*** rather than how much data there is. (Unless, of course, you intend to store and access very small amounts of data, since that would be relatively expensive in many NoSQL DBs.) **Your first question must be: why consider NoSQL? Can't an RDBMS handle it?**

Expected growth rate is a worthwhile parameter, but again not a decisive one, since most NoSQL DBs support storing large amounts of data (without scalability issues).


***The most important one in your list is `What kind of queries will be run?`***

This matters most since an **RDBMS stores data as `tuples`**, and it's easier to select tuples and output them when the amount of data is small. It's faster at executing `*` queries (as its storage is row-wise). But coming to **NoSQL, many DBs are columnar, i.e. a column-oriented DBMS**.

**Row-oriented system**: As data is inserted into the table, it is assigned an internal ID, the rowid, which the system uses internally to refer to the data. The records get sequential rowids that are independent of any user-assigned key such as `empid`.

**Column-oriented system**: A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.
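To make the two layouts concrete, here is a minimal sketch in Python. The table, its column names (`empid`, `lastname`, `salary`) and both helper functions are made up for illustration; real engines use their own on-disk formats, so treat this as a picture of the idea rather than how any particular database does it.

```python
# Hypothetical three-record table; names and values are illustrative only.
COLUMNS = ("empid", "lastname", "salary")
TABLE = [
    (10, "Smith", 60000),
    (12, "Jones", 80000),
    (11, "Johnson", 94000),
]


def serialize_row_oriented(table):
    """Keep each record together, tagged with a sequential internal rowid."""
    return [(rowid, *record) for rowid, record in enumerate(table, start=1)]


def serialize_column_oriented(table, columns):
    """Keep all values of one column together, then the next column, and so on."""
    return {name: [record[i] for record in table] for i, name in enumerate(columns)}


row_layout = serialize_row_oriented(TABLE)
col_layout = serialize_column_oriented(TABLE, COLUMNS)

print(row_layout)  # one tuple per record, rowid first
print(col_layout)  # one list per column
```

Note that the rowids come out as 1, 2, 3 regardless of the `empid` values, which is the independence mentioned above.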

*Comparisons between **row-oriented** and **column-oriented** databases are typically concerned with the efficiency of hard-disk access for a given workload, as seek time is incredibly long compared to the other bottlenecks in computers.*
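A rough way to see why the workload decides the winner is a toy cost model. Everything below is an assumption made for illustration: the `Cost` type, the two functions, and the simplification that a contiguous read costs one seek. It ignores caching, compression and indexes, so the numbers are only meant to show the direction of the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    seeks: int        # non-contiguous jumps on disk (toy assumption)
    values_read: int  # cell values pulled from disk


def point_lookup(layout, total_rows, total_cols):
    """Fetch every column of a single record (a `SELECT *` for one row)."""
    if layout == "row":
        # The record is contiguous: one jump, then read all its columns.
        return Cost(seeks=1, values_read=total_cols)
    # Column layout: the record is scattered, one jump per column.
    return Cost(seeks=total_cols, values_read=total_cols)


def single_column_aggregate(layout, total_rows, total_cols):
    """Aggregate one column over every record (e.g. a SUM)."""
    if layout == "row":
        # Sequential scan, but every column of every record is read.
        return Cost(seeks=1, values_read=total_rows * total_cols)
    # Column layout: only the one column's values are read.
    return Cost(seeks=1, values_read=total_rows)


for layout in ("row", "column"):
    print(layout, point_lookup(layout, 1000, 20), single_column_aggregate(layout, 1000, 20))
```

Under these assumptions the row layout needs fewer seeks for the single-record lookup, while the column layout reads far fewer values for the aggregate.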


`How frequently will the data be moved/accessed?` is again a good question, as accesses are costly and a few NoSQL DBs are very slow the first time a query is run (e.g. Hive).

**Other parameters you may consider are**:

1. ***Are updates to rows (data in the table) required?*** (Hive has problems with updates; you usually have to delete and insert again.)

2. ***Why are you using the database?*** *(Search, deriving relationships, analytics, etc.)* What type of operations do you want to perform on the data?
Will it require relationship searches, as in the case of Facebook's DB (Presto)?
Will it require aggregations?
Will it be used to relate various columns to derive insights (i.e. for analytics)?

3. Last but very important: **Do you want to store the data on HDFS (Hadoop Distributed File System) as files, in your DB's own storage format, or somewhere else?** This is important since your processing depends on how your data is stored: whether it can be accessed directly or needs a query call, which may be time-consuming, etc.