Setup And Configure Cluster Node Hadoop Installation

This article describes how to set up and configure a cluster-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).

A single-node cluster means only one DataNode is running, with the NameNode, DataNode, ResourceManager, and NodeManager all set up on a single machine. It handles a sequential workflow easily and efficiently in a small environment, as opposed to large environments that hold terabytes of data distributed across hundreds of machines.

A multi-node cluster means more than one DataNode is running, with each DataNode on a different machine. Multi-node clusters are what organizations use in practice for analyzing Big Data: when dealing with petabytes of data in real time, processing has to be distributed across thousands of machines.

Figure 1. MapR Data Platform. MapR, 2019.

For now, I will show you how to set up and configure a multi-node cluster.

Prerequisites

  • GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
  • Windows is also a supported platform, but the following steps are for Linux only.
  • Java™ must be installed. Recommended Java versions are described at Hadoop Java Versions.
  • ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons, if the optional start and stop scripts are to be used. Additionally, it is recommended that pdsh also be installed for better ssh resource management.

If your cluster doesn’t have the requisite software you will need to install it.

The steps below use example IPs for each node. Adjust each example according to your configuration.

Let's assume the nodes look like this:

  • master: 192.168.1.101
  • node1: 192.168.1.102
  • node2: 192.168.1.103

Now set up the FQDNs in /etc/hosts by adding the lines below. Do this on every node:

192.168.1.101 master
192.168.1.102 node1
192.168.1.103 node2
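
To confirm the entries work, you can ping each hostname once from the master. This is just a quick check; it assumes the nodes are already reachable on your network:

$ ping -c 1 node1
$ ping -c 1 node2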

Prepare to Start the Hadoop Cluster

To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors. For this example we will use Hadoop version 3.2.1, which is currently stable. You can download it with the command below; do this only on the master node:

$ wget https://www-eu.apache.org/dist/hadoop/common/stable/hadoop-3.2.1.tar.gz -O /tmp/hadoop-3.2.1.tar.gz

After you successfully download the distribution, you will find the file in your /tmp folder. Unpack the downloaded Hadoop distribution:

$ tar -xvf /tmp/hadoop-3.2.1.tar.gz -C /tmp

Now you can move the extracted directory to any folder you want, but I prefer /usr/local/hadoop. This tutorial uses that folder.

$ mv /tmp/hadoop-3.2.1 /usr/local/hadoop

In the distribution, edit the file /usr/local/hadoop/etc/hadoop/hadoop-env.sh to define some parameters as follows:

export JAVA_HOME=/usr/java/latest

If you don't know your Java installation path, you can run this command to find it:

$ dirname $(readlink -f $(which java))|sed 's^/bin^^'
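
If you prefer, you can append the detected path to hadoop-env.sh in one step. This is only a convenience sketch, assuming the command above prints the correct JDK directory on your machine:

$ echo "export JAVA_HOME=$(dirname $(readlink -f $(which java)) | sed 's^/bin^^')" >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh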

To make sure that Java and Hadoop are properly installed on your system and can be accessed from the terminal, execute the java -version and hadoop version commands. Try the following command to check that your Hadoop binary runs:

$ /usr/local/hadoop/bin/hadoop version

This will display the version information for the Hadoop distribution. By default, Hadoop is configured to run in a non-distributed mode, as a single Java process.
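
The commands later in this tutorial (hdfs, start-dfs.sh) assume the Hadoop binaries are on your PATH. A minimal sketch of the environment variables you could add to ~/.bashrc on each node, assuming the /usr/local/hadoop location used here:

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Reload it with source ~/.bashrc, and repeat this for the hadoop user created in the next step.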

Distribute Authentication Key-pairs for the Hadoop User

The master node will use an SSH connection with key-pair authentication to connect to the other nodes. This allows the master node to actively manage the cluster. Advanced users can use any user they want, except root, for security reasons; in this tutorial I will use the user hadoop. If the user does not exist, create it on every node:

$ adduser hadoop
$ su - hadoop 

You can skip the step above if the hadoop user already exists.

  1. Login to master as the hadoop user, and generate an SSH key:
    $ ssh-keygen -t rsa -b 2048 -C "master"
    When generating this key, leave the password field blank so your Hadoop user can communicate unprompted.
  2. Copy the public key to every node in the cluster:
    $ ssh-copy-id master
    $ ssh-copy-id node1
    $ ssh-copy-id node2
    This is an easy way to create the .ssh directory on every node.
  3. Replace the authorized_keys on every cluster node with the commands below (a quick verification follows this list):
    $ scp ~/.ssh/* node1:/home/hadoop/.ssh/
    $ scp ~/.ssh/* node2:/home/hadoop/.ssh/
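
To verify that passwordless SSH works, try connecting from the master to each node; this should print the hostname without prompting for a password, assuming the keys were copied as above:

$ ssh node1 hostname
$ ssh node2 hostname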

Configure the Master Node

Configuration will be performed on the master and replicated to the other nodes, so make sure the hadoop user has permission on the Hadoop folder. If it is currently owned by root, change the ownership to the hadoop user by running the command below on every node in your cluster:

$ chown -R hadoop:hadoop /usr/local/hadoop

Open the /usr/local/hadoop/etc/hadoop/core-site.xml file to set the NameNode location to master on port 9000. core-site.xml tells the Hadoop daemons where the NameNode runs in the cluster. It contains configuration settings for the Hadoop core, such as I/O settings common to HDFS and MapReduce.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://master:9000</value>
        </property>
    </configuration>

Set path for HDFS

Edit /usr/local/hadoop/etc/hadoop/hdfs-site.xml. hdfs-site.xml contains the configuration settings of the HDFS daemons (i.e. the NameNode, DataNodes, and Secondary NameNode), including the replication factor and block size of HDFS. Edit it to resemble the following configuration:

<configuration>
    <property>
            <name>dfs.namenode.name.dir</name>
            <value>/usr/local/hadoop/data/nameNode</value>
    </property>

    <property>
            <name>dfs.datanode.data.dir</name>
            <value>/usr/local/hadoop/data/dataNode</value>
    </property>

    <property>
            <name>dfs.replication</name>
            <value>3</value>
    </property>
</configuration>

The last property, dfs.replication, indicates how many times each block of data is replicated in the cluster. Setting it to 3 stores every block on three DataNodes, which in this cluster is all of them. Don't enter a value higher than the actual number of worker nodes.
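
Optionally, you can pre-create the storage directories referenced above so they exist with the right ownership; the daemons will also create them automatically on first start. The paths below are simply the ones assumed in this tutorial:

$ mkdir -p /usr/local/hadoop/data/nameNode /usr/local/hadoop/data/dataNode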

Configure Workers

The workers file is used by the startup scripts to start the required daemons on all nodes. Edit /usr/local/hadoop/etc/hadoop/workers to include every node that should run a DataNode (listing master here means the master also acts as a worker):

master
node1
node2

Duplicate Config Files on Each Node

Copy the Hadoop binaries to the worker nodes, or, for the simple case, copy the whole Hadoop directory instead of just the binaries:

$ scp -r /usr/local/hadoop node1:/usr/local/hadoop
$ scp -r /usr/local/hadoop node2:/usr/local/hadoop

HDFS needs to be formatted like any classical file system. On master, run the following commands to format it and start HDFS:

$ hdfs namenode -format
$ start-dfs.sh
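
Once the daemons are up, you can do a quick sanity check: jps should list a NameNode (plus a SecondaryNameNode and a DataNode on the master, since it is also a worker here), and dfsadmin should report three live DataNodes. This assumes the PATH setup sketched earlier:

$ jps
$ hdfs dfsadmin -report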

Your Hadoop installation is now configured and ready to run. The setup and configuration of Hadoop are complete. I have also made an automated installation for this setup.

 
