Apache Hadoop, Pig, Hive, and Derby Installation on CentOS Linux
$ cd ~
$ wget http://mirrors.ibiblio.org/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
$ tar -xvzpf hadoop-2.8.5.tar.gz
$ sudo mkdir -p /opt/hadoop/2.8.5
$ sudo mv hadoop-2.8.5/* /opt/hadoop/2.8.5/
$ sudo ln -s /opt/hadoop/2.8.5/ /opt/hadoop/current
$ rm -rf hadoop-2.8.5/
$ rm -rf hadoop-2.8.5.tar.gz
$ sudo vi /etc/profile.d/hadoop.sh
#### HADOOP 2.8.5 #######################
export HADOOP_HOME=/opt/hadoop/current
export PATH=${HADOOP_HOME}/bin:$PATH
#### HADOOP 2.8.5 #######################
$ source /etc/profile.d/hadoop.sh
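As a quick sanity check, confirm the variables are set and the hadoop command resolves (this assumes JAVA_HOME is already set on the system; output abbreviated):
$ echo $HADOOP_HOME
/opt/hadoop/current
$ hadoop version
Hadoop 2.8.5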
Create Users and Groups
$ sudo su -
# groupadd hadoop
# useradd -g hadoop yarn
# useradd -g hadoop hdfs
# useradd -g hadoop mapred
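To confirm the accounts landed in the hadoop group, check one with id (the numeric uid/gid values on your system will differ):
# id hdfs
uid=1002(hdfs) gid=1001(hadoop) groups=1001(hadoop)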
Make Data and Log Directories
# mkdir -p /var/data/hadoop/hdfs/nn
# mkdir -p /var/data/hadoop/hdfs/snn
# mkdir -p /var/data/hadoop/hdfs/dn
# chown -R hdfs:hadoop /var/data/hadoop/hdfs
Create the log directory and set the owner and group as follows:
# cd /opt/hadoop/current
# mkdir logs
# chmod g+w logs
# chown -R yarn:hadoop ./
Configure core-site.xml
# vi /opt/hadoop/current/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.http.staticuser.user</name>
<value>hdfs</value>
</property>
</configuration>
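Once the environment script is sourced, hdfs getconf can confirm that Hadoop picks up the value (an optional quick check):
$ hdfs getconf -confKey fs.defaultFS
hdfs://localhost:9000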
Configure hdfs-site.xml
# vi /opt/hadoop/current/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/var/data/hadoop/hdfs/nn</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:/var/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.namenode.checkpoint.edits.dir</name>
<value>file:/var/data/hadoop/hdfs/snn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/var/data/hadoop/hdfs/dn</value>
</property>
</configuration>
Configure mapred-site.xml
# cd /opt/hadoop/current/etc/hadoop/
# cp mapred-site.xml.template mapred-site.xml
# vi /opt/hadoop/current/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>/mr-history/tmp</value>
</property>
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>/mr-history/done</value>
</property>
</configuration>
Configure yarn-site.xml
# vi /opt/hadoop/current/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Modify Java Heap Sizes
# vi /opt/hadoop/current/etc/hadoop/hadoop-env.sh
export HADOOP_HEAPSIZE=500
export HADOOP_NAMENODE_INIT_HEAPSIZE="500"
# vi /opt/hadoop/current/etc/hadoop/mapred-env.sh
Comment out the default value and set a smaller one:
# export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=1000
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=250
# vi /opt/hadoop/current/etc/hadoop/yarn-env.sh
JAVA_HEAP_MAX=-Xmx1000m
YARN_HEAPSIZE=1000
# vi /opt/hadoop/current/etc/hadoop/hadoop-env.sh
Add the following to the end:
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib"
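To see whether the native libraries actually load, Hadoop ships a checknative tool; if the native build is missing, Hadoop falls back to pure Java and reports false for each library:
$ hadoop checknative -a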
Format HDFS
# su - hdfs
$ cd /opt/hadoop/current/bin
$ ./hdfs namenode -format
If the command worked, you should see the following near the end of a long list of messages:
INFO common.Storage: Storage directory /var/data/hadoop/hdfs/nn has been successfully formatted.
Start the HDFS Services
As user hdfs
$ cd /opt/hadoop/current/sbin
$ ./hadoop-daemon.sh start namenode
$ ./hadoop-daemon.sh start secondarynamenode
$ ./hadoop-daemon.sh start datanode
If a daemon started successfully, the response above points to its log file. (Note that the actual log file name ends in “.log”, not “.out”.) Issue a jps command to verify that all of the services are running. The actual PID values will differ from those shown in this listing:
$ jps
2849 DataNode
2791 SecondaryNameNode
2701 NameNode
2943 Jps
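If a service is missing from the jps output, check its log under /opt/hadoop/current/logs; the file name embeds the user and hostname, so a wildcard is convenient:
$ tail -n 50 /opt/hadoop/current/logs/hadoop-hdfs-namenode-*.log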
All HDFS services can be stopped using the hadoop-daemon.sh script.
For example, to stop the DataNode service, enter the following:
$ ./hadoop-daemon.sh stop datanode
The same can be done for the NameNode and SecondaryNameNode.
Create the /mr-history directories for the JobHistoryServer in HDFS.
(This also serves as a good test that HDFS is working.)
$ hdfs dfs -mkdir -p /mr-history/tmp
$ hdfs dfs -mkdir -p /mr-history/done
$ hdfs dfs -chown -R yarn:hadoop /mr-history
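A quick listing should now show both directories (tmp and done) owned by yarn:hadoop:
$ hdfs dfs -ls /mr-history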
$ exit
Start YARN Services
As user “yarn”:
# su - yarn
$ cd /opt/hadoop/current/sbin
$ ./yarn-daemon.sh start resourcemanager
$ ./yarn-daemon.sh start nodemanager
$ ./mr-jobhistory-daemon.sh start historyserver
$ jps
3156 ResourceManager
3637 Jps
3403 NodeManager
3563 JobHistoryServer
$ exit
Similar to HDFS, the services can be stopped by issuing a stop argument to the daemon script:
$ ./yarn-daemon.sh stop nodemanager
Verify the Running Services Using the Web Interface
// To see the HDFS web interface, point a browser to:
http://<hadoop_server_ip>:50070
// To see the ResourceManager (YARN) web interface, point a browser to:
http://<hadoop_server_ip>:8088
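On a headless server, curl can stand in for a browser; a 200 status (or a 3xx redirect) means the daemon is answering (the ports assume the defaults above):
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:50070/
$ curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8088/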
Run a Sample MapReduce Example
# su - hdfs
$ export YARN_EXAMPLES=/opt/hadoop/current/share/hadoop/mapreduce/
// To test your installation, run the sample "pi" application
$ yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.8.5.jar pi 8 100000
If this test worked, the Hadoop installation should be working correctly.
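As a further check, the same examples jar includes wordcount; the sketch below is run as user hdfs and uses /tmp paths chosen only for illustration:
$ hdfs dfs -mkdir -p /tmp/wc-in
$ hdfs dfs -put /opt/hadoop/current/etc/hadoop/*.xml /tmp/wc-in
$ yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.8.5.jar wordcount /tmp/wc-in /tmp/wc-out
$ hdfs dfs -cat /tmp/wc-out/part-r-00000 | head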
Install Apache Pig
$ cd ~
$ wget http://mirrors.ibiblio.org/apache/pig/pig-0.17.0/pig-0.17.0.tar.gz
$ tar -xvzpf pig-0.17.0.tar.gz
$ sudo mkdir -p /opt/pig/0.17.0
$ sudo mv pig-0.17.0/* /opt/pig/0.17.0
$ sudo ln -s /opt/pig/0.17.0/ /opt/pig/current
$ rm -rf pig-0.17.0/
$ rm -rf pig-0.17.0.tar.gz
$ sudo vi /etc/profile.d/pig.sh
#### PIG 0.17.0 #######################
export PIG_HOME=/opt/pig/current
export PATH=${PIG_HOME}/bin:$PATH
export PIG_CLASSPATH=/opt/hadoop/current/etc/hadoop
#### PIG 0.17.0 #######################
$ source /etc/profile.d/pig.sh
Create a Pig user and change ownership (do as root)
# useradd -g hadoop pig
# chown -R pig:hadoop /opt/pig/
Pig is now ready for use (users must log in again or run “source /etc/profile.d/pig.sh”):
$ pig --version
Apache Pig version 0.17.0 (r1797386)
compiled Jun 02 2017, 15:41:58
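For a quick smoke test that does not touch HDFS, Pig can run in local mode; /etc/passwd is used here only because it is a handy colon-delimited file:
$ pig -x local
grunt> A = LOAD '/etc/passwd' USING PigStorage(':') AS (user:chararray);
grunt> B = FOREACH A GENERATE user;
grunt> DUMP B;
grunt> quit;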
Install Apache Hive
$ wget http://mirrors.ibiblio.org/apache/hive/hive-2.3.5/apache-hive-2.3.5-bin.tar.gz
$ tar -xvzpf apache-hive-2.3.5-bin.tar.gz
$ sudo mkdir -p /opt/hive/2.3.5
$ sudo mv apache-hive-2.3.5-bin/* /opt/hive/2.3.5
$ sudo ln -s /opt/hive/2.3.5/ /opt/hive/current
$ rm -rf apache-hive-2.3.5-bin/
$ rm -rf apache-hive-2.3.5-bin.tar.gz
$ sudo vi /etc/profile.d/hive.sh
#### HIVE 2.3.5 #######################
export HIVE_HOME=/opt/hive/current
export PATH=${HIVE_HOME}/bin:$PATH
#### HIVE 2.3.5 #######################
$ source /etc/profile.d/hive.sh
$ hive --version
Hive 2.3.5
Make the needed directories in HDFS:
# su - hdfs
$ hdfs dfs -mkdir -p /user/hive/warehouse
$ hdfs dfs -chmod g+w /user/hive/warehouse
$ hdfs dfs -ls /user/hive
$ sudo vi /opt/hive/current/conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.ClientDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
</configuration>
Move aside the extra log4j-slf4j binding (a copy is already included in the Hadoop installation):
$ mv /opt/hive/current/lib/log4j-slf4j-impl-2.6.2.jar /opt/hive/current/lib/log4j-slf4j-impl-2.6.2.jar.extra
Create a Hive user and change ownership (do as root)
# useradd -g hadoop hive
# chown -R hive:hadoop /opt/hive/
Install Apache Derby
Hive needs a “metastore” database for its metadata. The default is Apache Derby.
$ wget https://archive.apache.org/dist/db/derby/db-derby-10.13.1.1/db-derby-10.13.1.1-bin.tar.gz
$ tar -xvzpf db-derby-10.13.1.1-bin.tar.gz
$ sudo mkdir -p /opt/derby/10.13.1.1
$ sudo mv db-derby-10.13.1.1-bin/* /opt/derby/10.13.1.1
$ sudo ln -s /opt/derby/10.13.1.1/ /opt/derby/current
$ rm -rf db-derby-10.13.1.1-bin/
$ rm -rf db-derby-10.13.1.1-bin.tar.gz
$ sudo vi /etc/profile.d/derby.sh
#### DERBY 10.13.1.1 #######################
export DERBY_HOME=/opt/derby/current
export PATH=${DERBY_HOME}/bin:$PATH
export DERBY_OPTS="-Dderby.system.home=$DERBY_HOME/data"
#### DERBY 10.13.1.1 #######################
$ source /etc/profile.d/derby.sh
Change ownership of the Derby installation to the hive user (as root):
# chown -R hive:hadoop /opt/derby/
Copy the Derby client libraries to $HIVE_HOME/lib:
# su - hive
$ cp $DERBY_HOME/lib/derbyclient.jar $HIVE_HOME/lib
$ cp $DERBY_HOME/lib/derbytools.jar $HIVE_HOME/lib
Start Derby (nohup will leave a nohup.out log file in the directory where you run the command):
$ nohup startNetworkServer -h 0.0.0.0 &
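Derby's NetworkServerControl script (on the PATH via derby.sh) can confirm the server is listening on the default port 1527, which matches the JDBC URL in hive-site.xml above:
$ NetworkServerControl ping -h localhost -p 1527
// To stop Derby later:
$ stopNetworkServer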
Initialize the Hive metastore schema:
$ schematool -initSchema -dbType derby
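To confirm the schema was created, schematool can report the metastore version:
$ schematool -info -dbType derby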
Start/Test Hive
Make sure all of the Hadoop services are running. As user hdfs:
# su - hdfs
Enter “hive” at the prompt. The output should be as follows; the “which: no hbase” warning can be ignored:
$ hive
hive> show tables;
OK
Time taken: 9.731 seconds
hive> quit;
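As a slightly fuller smoke test, create, list, and drop a scratch table; each statement should return OK (the table name here is arbitrary):
$ hive
hive> CREATE TABLE test_install (id INT, msg STRING);
hive> SHOW TABLES;
hive> DROP TABLE test_install;
hive> quit;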