Set Up and Configure a Multi-Node Hadoop Installation
This describes how to set up and configure a multi-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).
A single-node cluster means only one DataNode is running, with the NameNode, DataNode, ResourceManager, and NodeManager all set up on a single machine. It handles a sequential workflow easily and efficiently in a small environment, as opposed to large environments that contain terabytes of data distributed across hundreds of machines.
A multi-node cluster means more than one DataNode is running, with each DataNode on a different machine. Multi-node clusters are what organizations actually use to analyze Big Data: when you deal with petabytes of data, it needs to be distributed across thousands of machines to be processed.
But for now I will show you how to set up and configure a multi-node cluster.
- GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
- Windows is also a supported platform, but the following steps are for Linux only.
- Java™ must be installed. Recommended Java versions are described at Hadoop Java Versions.
- ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons, if the optional start and stop scripts are to be used. Additionally, it is recommended that pdsh also be installed for better ssh resource management.
If your cluster doesn’t have the requisite software you will need to install it.
The steps below use example IPs for each node. Adjust each example according to your configuration.
Let’s assume the cluster looks like below:
- master: 192.168.1.101
- node1: 192.168.1.102
- node2: 192.168.1.103
Now set up the FQDNs in
/etc/hosts on every node by adding lines that look like below:
192.168.1.101 master
192.168.1.102 node1
192.168.1.103 node2
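To confirm the entries work, a quick check like the following can be run on each node (a sketch; the hostnames are the example names assumed above):

```shell
# Verify that each cluster hostname from /etc/hosts resolves.
# "master", "node1", "node2" are the example names used in this tutorial.
for host in master node1 node2; do
  getent hosts "$host" || echo "WARNING: $host does not resolve"
done
```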
Prepare to Start the Hadoop Cluster
To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors. For this example we will use Hadoop version 3.2.1, which is currently stable. You can download it with the command below; do this on the master node only:
$ wget https://www-eu.apache.org/dist/hadoop/common/stable/hadoop-3.2.1.tar.gz -O /tmp/hadoop-3.2.1.tar.gz
After you successfully download the distribution file, you will find it in your /tmp folder. Unpack the downloaded Hadoop distribution:
$ tar -xvf /tmp/hadoop-3.2.1.tar.gz -C /tmp
Now you can move the distribution to any folder you want, but I prefer
/usr/local/hadoop. In this tutorial I will use that folder.
$ mv /tmp/hadoop-3.2.1 /usr/local/hadoop
In the distribution, edit the file
/usr/local/hadoop/etc/hadoop/hadoop-env.sh to define some parameters, most importantly JAVA_HOME.
If you don’t know your Java binary path, you can run this command to find your Java installation path:
$ dirname $(readlink -f $(which java))|sed 's^/bin^^'
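For example, if the command above prints /usr/lib/jvm/java-8-openjdk-amd64 (an assumed path here; use whatever it prints on your machine), hadoop-env.sh would define JAVA_HOME like this:

```shell
# In /usr/local/hadoop/etc/hadoop/hadoop-env.sh:
# JAVA_HOME must point at your Java installation; this path is an assumption.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```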
To make sure that Java and Hadoop have been properly installed on your system and can be accessed through the terminal, execute the java -version and hadoop version commands. Try the following command from /usr/local/hadoop to check your Hadoop installation:
$ bin/hadoop version
This will display the version information of the hadoop script. By default, Hadoop is configured to run in non-distributed mode, as a single Java process.
Distribute Authentication Key-pairs for the Hadoop User
The master node will use an SSH connection to connect to the other nodes with key-pair authentication. This will allow the master node to actively manage the cluster. Advanced users can use any user they want except root, for security reasons, but in this tutorial let me use the user hadoop. If the user does not exist, please create a new one:
$ adduser hadoop
$ su - hadoop
You can skip the step above if the user hadoop already exists.
- Log in to master as the
hadoop user, and generate an SSH key:
$ ssh-keygen -t rsa -b 2048 -C "master"
When generating this key, leave the passphrase blank so your Hadoop user can communicate unprompted.
- Copy the public key to every node in the cluster:
$ ssh-copy-id master
$ ssh-copy-id node1
$ ssh-copy-id node2
This is an easy way to create the .ssh directory on every node.
- Replace the authorized_keys on all the cluster nodes with the commands below:
$ scp ~/.ssh/* node1:/home/hadoop/.ssh/
$ scp ~/.ssh/* node2:/home/hadoop/.ssh/
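With more nodes, the same distribution can be scripted. A minimal sketch (the NODES list is an assumption; it only prints the commands so you can review them before running, e.g. by piping to `sh`):

```shell
# Print the scp commands that would sync ~/.ssh to each worker node.
# Replace NODES with your real hostnames.
NODES="node1 node2"
for node in $NODES; do
  echo "scp ~/.ssh/* $node:/home/hadoop/.ssh/"
done
```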
Configure the Master Node
Configuration will be performed on master and replicated to the other nodes, so make sure the user hadoop has permission on the folder. If the permissions are set up for the root user, change them to the hadoop user by running the command below on every cluster node:
$ chown -R hadoop:hadoop /usr/local/hadoop
Edit the
/usr/local/hadoop/etc/hadoop/core-site.xml file to set the NameNode location to master on port
9000. core-site.xml informs the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings for the Hadoop core, such as I/O settings that are common to HDFS and MapReduce.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
</configuration>
Set the paths for HDFS by editing
/usr/local/hadoop/etc/hadoop/hdfs-site.xml to resemble the following configuration. hdfs-site.xml contains the configuration settings of the HDFS daemons (i.e. the NameNode, DataNodes, and Secondary NameNode). It also includes the replication factor and block size of HDFS:
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/usr/local/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/usr/local/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
</configuration>
The last property,
dfs.replication, indicates how many times data is replicated in the cluster. You can set it to 3 to have all the data duplicated on each of the three nodes. Don’t enter a value higher than the actual number of worker nodes.
The workers file is used by the startup scripts to start the required daemons on all nodes. Edit
/usr/local/hadoop/etc/hadoop/workers to include all of the nodes:
master
node1
node2
Duplicate Config Files on Each Node
Copy the Hadoop binaries to the worker nodes, or, for the easy case, copy the whole directory instead of just the binaries:
$ scp -r /usr/local/hadoop node1:/usr/local/hadoop
$ scp -r /usr/local/hadoop node2:/usr/local/hadoop
HDFS needs to be formatted like any classical file system. On master, run the following commands to format it and start the HDFS daemons:
$ /usr/local/hadoop/bin/hdfs namenode -format
$ /usr/local/hadoop/sbin/start-dfs.sh
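Once start-dfs.sh finishes, each HDFS daemon runs as its own JVM, so jps is a quick way to check what came up. A sketch (run it on each node; which daemons you see depends on the node’s role):

```shell
# List running Java processes. Expect NameNode and SecondaryNameNode on
# master (plus a DataNode, since master is also listed in the workers file)
# and a DataNode on node1 and node2.
command -v jps >/dev/null 2>&1 && jps || echo "jps not found; is a JDK on the PATH?"
```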
Your Hadoop installation is now configured and ready to run. Setting up and configuring Hadoop is all complete. There is also an automated installation that I have made.