
Hive – The next generation data warehouse

Posted by Ankit Jain | Jul 05, 2012
Organizations need to efficiently process, analyze, and report on the massive amounts of data collected in the Hadoop ecosystem. This post explains Hive, one of the most commonly used data processing and analysis tools in the Hadoop ecosystem.

Motivation for Hive:
Data is everywhere, yet the data warehouse tools available to process it are often not scalable, and they are expensive and proprietary. Map-reduce can process data at this scale, but writing a map-reduce job requires programming expertise, and it is hard to write code that can be reused for other business cases. On the other hand, people are much more comfortable writing SQL-like queries. To fill this gap, a team at Facebook developed a next generation data warehouse that takes a SQL-like query as input and implicitly converts it into map-reduce jobs.

What is Hive?

Hive is a data warehouse infrastructure built on top of Hadoop and heavily used for data summarization, analysis, and ad-hoc querying. It was created by Facebook and then contributed back to the Hadoop ecosystem as a Hadoop subproject. Hive is not designed to handle online transactions and does not return results in real time, because each Hive query submits map-reduce jobs to Hadoop, which then operate on files stored in HDFS. Hive has the following important features:
  1. Hive supports indexing to accelerate queries.
  2. Hive supports different storage types such as plain text, RCFiles, HBase, and others.
  3. Hive stores metadata in an RDBMS, which significantly reduces the time needed for semantic checks during query execution.
  4. Hive can operate on compressed data stored in the Hadoop ecosystem.
  5. Hive ships with built-in user defined functions (UDFs) for manipulating dates and strings, along with other data-mining tools. If none of them serves our need, we can write our own UDFs.
  6. Hive supports SQL-like queries (Hive QL), which are implicitly converted into map-reduce jobs.
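To make the last point concrete, here is a minimal Python sketch of the kind of work a query such as SELECT page, COUNT(*) FROM page_views GROUP BY page gets turned into. This is not Hive's actual compiler or execution engine. The table name page_views, the sample rows, and all function names are made up for illustration, and everything runs in a single process instead of on a Hadoop cluster.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(user_id, page); the data is invented.
ROWS = [
    ("u1", "home"),
    ("u2", "home"),
    ("u1", "search"),
    ("u3", "home"),
    ("u2", "search"),
    ("u3", "profile"),
]


def map_phase(rows):
    """Emit one (page, 1) pair per input row, as a mapper for GROUP BY page would."""
    for _user_id, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Collect the emitted values under their key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, which corresponds to COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Run the three phases in order and return the page -> count mapping."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    for page, views in sorted(count_by_page(ROWS).items()):
        print(page, views)
```

With Hive, we would write only the one-line query, and the equivalent of the map, shuffle, and reduce steps above would be generated for us.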
Hive/HBase Integration:
HBase is generally used to store real time data but is rarely used for analysis. However, we can integrate Hive with HBase and query HBase tables from Hive to analyze the data stored in HBase.
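As a rough sketch of what this integration looks like, a Hive table is mapped onto an existing HBase table through Hive's HBase storage handler. The Python helper below only builds the Hive QL statement as a string and does not connect to Hive. The helper name, the table names, the column names, and the column family cf are all hypothetical. The generated statement would still have to be run from a Hive client on a cluster where the Hive HBase handler is available.

```python
def hbase_backed_table_ddl(hive_table, hbase_table, columns, mapping):
    """Build a Hive QL statement that exposes an existing HBase table to Hive.

    columns: list of (hive_column_name, hive_type) pairs.
    mapping: comma-separated HBase columns, one per Hive column, where
             ':key' stands for the HBase row key.
    """
    if len(mapping.split(",")) != len(columns):
        raise ValueError("mapping needs exactly one entry per Hive column")
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_list})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


if __name__ == "__main__":
    # Hypothetical example: HBase table 'events' with one column family 'cf'.
    print(
        hbase_backed_table_ddl(
            hive_table="hive_events",
            hbase_table="events",
            columns=[("event_id", "string"), ("payload", "string")],
            mapping=":key,cf:payload",
        )
    )
```

Once such a table exists, ordinary Hive QL queries against hive_events would read from the underlying HBase table.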

Advantages of integrating Hive with HBase:
  1. We can run SQL-like queries on HBase data instead of using the HBase client. This lets us run Hive queries against data that is being continuously updated, which is not possible with Hive alone.
  2. We can also move data from a Hive table into an HBase table.
Drawbacks of integrating Hive with HBase:
  1. Integrating Hive with HBase increases the latency of Hive queries.
List of companies which use Hive:
Facebook: They use Hive for data mining, internal log analysis, and reporting.
eHarmony: They use Hadoop to store internal logs and dimension data sources and use it as a source for reporting/analytics and machine learning. Currently they have a cluster of 640 machines.
VideoEgg: They use Hive as the core database for their data warehouse, where they track and analyze all the usage data of the ads across their network.
Scribd: They use Hive for machine learning, data mining, ad-hoc querying, and both internal and user-facing analytics.

Many more companies also use Hive to make their business more agile.
