Sunday, November 13, 2016
Hive installation
- Download the Apache Hive release and unpack it (the paths below assume apache-hive-2.1.0-bin under /home/yao/mysoft)
- Add Hive to the system path by opening /etc/profile or ~/.bashrc and adding the following two lines
- export HIVE_HOME=/home/yao/mysoft/apache-hive-2.1.0-bin
- export PATH=$PATH:$HIVE_HOME/bin:$HIVE_HOME/conf
- Enable the settings by executing this command
- source /etc/profile  (or source ~/.bashrc if that is the file you edited)
- Create the configuration files
- cd conf
- cp hive-default.xml.template hive-site.xml
- cp hive-env.sh.template hive-env.sh
- cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties
- cp hive-log4j2.properties.template hive-log4j2.properties
- Modify the configuration file hive-site.xml
- replace <value>${system:java.io.tmpdir}/${system:user.name}</value> with <value>$HIVE_HOME/iotmp</value>
- replace <value>${system:java.io.tmpdir}/${hive.session.id}_resources</value> with <value>$HIVE_HOME/iotmp</value>
- replace <value>${system:java.io.tmpdir}/${system:user.name}</value> with <value>$HIVE_HOME/iotmp</value>
- you may need to create the iotmp directory first; note that hive-site.xml does not expand shell variables such as $HIVE_HOME, so it is safer to write the full path (e.g. /home/yao/mysoft/apache-hive-2.1.0-bin/iotmp); see the sketch after this list
- Modify hive-env.sh
- add these two lines
- export HADOOP_HOME=/home/yao/mysoft/hadoop-2.7.3
- export HIVE_CONF_DIR=/home/yao/mysoft/apache-hive-2.1.0-bin/conf
- Make sure Hadoop is running (a quick check is sketched after this list)
- Run Hive
- $HIVE_HOME/bin/hiveserver2
- Run beeline
- $HIVE_HOME/bin/beeline -u jdbc:hive2://
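A minimal sketch of the hive-site.xml step above, assuming GNU sed and the HIVE_HOME used in this post (adjust the absolute path to your install); it creates the iotmp directory and points the tmpdir-based values at it:

mkdir -p /home/yao/mysoft/apache-hive-2.1.0-bin/iotmp
cd /home/yao/mysoft/apache-hive-2.1.0-bin/conf
# hive-site.xml does not expand shell variables such as $HIVE_HOME, so the
# replacement value is written as an absolute path
sed -i 's|${system:java.io.tmpdir}/${system:user.name}|/home/yao/mysoft/apache-hive-2.1.0-bin/iotmp|g' hive-site.xml
sed -i 's|${system:java.io.tmpdir}/${hive.session.id}_resources|/home/yao/mysoft/apache-hive-2.1.0-bin/iotmp|g' hive-site.xml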
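As a quick end-to-end check once Hadoop is up, something along these lines should work; the schematool line is only needed if HiveServer2 or beeline complains that the metastore schema has not been initialized (embedded Derby), and SHOW DATABASES is just a trivial query to confirm the connection:

jps                                                   # the HDFS daemons (NameNode, DataNode, ...) should be listed
$HIVE_HOME/bin/schematool -dbType derby -initSchema   # one-time, only if the metastore schema is missing
$HIVE_HOME/bin/beeline -u jdbc:hive2:// -e "SHOW DATABASES;"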
Thursday, October 17, 2013
Hive, Pig and HBase
Hive is best suited for data warehouse applications, where real-time responsiveness to queries and record-level inserts, updates, and deletes are not required.
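For illustration, a typical warehouse-style query of the kind Hive targets, run in batch through beeline; the web_logs table and its columns are invented for the example:

$HIVE_HOME/bin/beeline -u jdbc:hive2:// -e "SELECT page, COUNT(*) AS views FROM web_logs GROUP BY page ORDER BY views DESC LIMIT 10;"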
Pig is described as a data flow language, rather than a query language. In Pig, you write a series of declarative statements that define relations from other relations, where each new relation performs some new data transformation. Pig looks at these declarations and then builds up a sequence of MapReduce jobs to perform the transformations until the final results are computed the way that you want. This step-by-step “flow” of data can be more intuitive than a complex set of queries. For this reason, Pig is often used as part of ETL (Extract, Transform, and Load) processes used to ingest external data into a Hadoop cluster and transform it into a more desirable form.
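The step-by-step flow is easier to see in a small sketch; the input file, field names, and output path below are invented, and the script is run with the pig client in local mode:

cat > page_counts.pig <<'EOF'
-- each statement defines a new relation from an earlier one
raw     = LOAD 'web_logs.tsv' AS (user:chararray, page:chararray, bytes:long);
by_page = GROUP raw BY page;
counts  = FOREACH by_page GENERATE group AS page, COUNT(raw) AS views;
STORE counts INTO 'page_counts';
EOF
pig -x local page_counts.pig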
HBase is a distributed and scalable data store that supports row-level updates, rapid queries, and row-level transactions (but not multirow transactions).
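To make the row-level access pattern concrete, a small HBase shell sketch; the table, column family, and row key are invented for the example:

hbase shell <<'EOF'
create 'users', 'info'                        # table with one column family
put 'users', 'row1', 'info:name', 'alice'     # row-level insert/update
get 'users', 'row1'                           # single-row read by key
EOF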