Install Hadoop On Ubuntu

The phrase "big data" is almost synonymous with Hadoop. As the term implies, big data is a collection of data too large to fit in conventional databases. Hadoop is a system for storing large amounts of data in a distributed manner and processing it in parallel, which makes it well suited to managing big data. If you plan to work with big data, you will need Hadoop installed on your machine. So how do you install an Apache Hadoop cluster on Ubuntu? There are several Hadoop distributions: you might set up the primary distribution, Apache Hadoop, a Cloudera distribution, or even a Hortonworks distribution (Hortonworks was acquired by Cloudera in 2018). In this guide, we will learn how to install Hadoop on Ubuntu. Although we are using a cloud platform here (specifically Amazon Web Services), you can follow the same procedure on a local Ubuntu system. We will install Hadoop 3.2.2, the most recent version at the time of writing, released on January 9th, 2021. So, let's get started, shall we?

What is Hadoop?

Hadoop is an Apache open-source framework used for storing, processing, and analyzing extremely large volumes of data. It is written in Java and is not an OLAP (online analytical processing) system; it is designed for batch or offline processing. It is used by Facebook, Yahoo, Twitter, LinkedIn, and many other companies. In addition, scaling up only requires adding nodes to the cluster.


There are 4 modules in Hadoop:

Hadoop Distributed File System, or HDFS:

HDFS was created following the publication of Google's GFS paper. In HDFS, files are split into blocks and stored across the nodes of the cluster in a distributed fashion.

YARN:

YARN (Yet Another Resource Negotiator) manages the cluster's resources and handles job scheduling.

MapReduce:

MapReduce is a framework that enables Java programs to perform parallel computation on data expressed as key-value pairs. The Map task transforms input data into intermediate key-value pairs; the Reduce task consumes the Map output and aggregates it into the required result.
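To make the flow concrete, here is a minimal sketch of the classic word count expressed with Hadoop Streaming, using ordinary Unix tools as the map and reduce steps; the /input and /output HDFS paths are hypothetical placeholders, and the streaming jar path assumes the Hadoop 3.2.2 layout installed later in this guide:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.2.jar \
  -input /input \
  -output /output \
  -mapper 'tr -s " " "\n"' \
  -reducer 'uniq -c'

The mapper emits one word per line (each word becomes a key), the framework sorts and groups identical keys between the two phases, and the reducer counts each group of identical lines, which is why a plain uniq -c works as the reduce step.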

Hadoop Common:

These are the shared Java libraries and utilities that the other Hadoop modules use to start and run Hadoop.


Install Hadoop on Ubuntu

For this installation, we used an Amazon EC2 t1.micro free-tier Ubuntu instance. You can also use your local Ubuntu system by following the same instructions.

Java Installation 

Step 1: The very first step is to install Java, because Hadoop is written in Java and needs it to run. We will install OpenJDK version 8. Run the following commands:

sudo apt update

sudo apt install openjdk-8-jdk -y
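Once the installation finishes, you can confirm it worked; the exact version string will vary, but it should report OpenJDK 1.8:

java -version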

Step 2: Create a new user.

sudo su

useradd hiberstack -m -d /home/hiberstack -s /bin/bash

Step 3: Add the new user to the sudoers file so that it can run tasks without encountering permission issues. In essence, this grants the user root privileges.

echo -e 'hiberstack ALL=(ALL)  NOPASSWD:  ALL' > /etc/sudoers.d/hiberstack

Step 4: Switch to the new user.

su - hiberstack

Configuring SSH

Step 5: We must configure SSH so that the new user can connect without a password. Generate a new key pair with the command below.

ssh-keygen

A new key pair is created. To enable SSH access without a password, append the public key to the authorized_keys file with the command below.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

To apply the changes, stop and restart your instance. Then run the command below to verify that passwordless SSH works.
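A minimal check, assuming you are testing against the same machine:

ssh localhost

It should log you in without asking for a password (on the first connection you may be asked to confirm the host fingerprint); type exit to return to the original session.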

Downloading and installing Hadoop on Ubuntu

Step 6: Create a new directory and download the Hadoop binary distribution package into it.

mkdir hadoop

cd hadoop/
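The download command is not shown in the original; the Hadoop 3.2.2 binary package can be fetched from the Apache archive, for example:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz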

Step 7: Extract the Hadoop binary package using the command below.

tar -xvf hadoop-3.2.2.tar.gz

Step 8: Set up the Hadoop environment variables. To edit the .bashrc file, first change to the home directory.

cd

nano .bashrc

Add the lines listed below to the file to configure the environment variables.

export HADOOP_HOME=/home/hiberstack/hadoop/hadoop-3.2.2

export HADOOP_INSTALL=$HADOOP_HOME

export HADOOP_MAPRED_HOME=$HADOOP_HOME

export HADOOP_COMMON_HOME=$HADOOP_HOME

export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Step 9: Use the command shown below to apply the modifications made to the .bashrc file.

source .bashrc
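As a quick sanity check, assuming the paths above match your installation directory, the variables should now resolve and the hadoop binary should be on the PATH:

echo $HADOOP_HOME

hadoop version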

Editing the hadoop-env.sh file

Step 10: Adding the Java path to the hadoop-env.sh file lets Hadoop know which Java installation to use. Open the file with the command below.

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

The file already contains an export JAVA_HOME line. Simply uncomment it and edit it as shown below. If you cannot find the line, add the one below at the end of the file.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
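If you are unsure of the exact JDK path on your machine, one common way to look it up is:

readlink -f /usr/bin/javac | sed "s:/bin/javac::"

This prints the JDK home, e.g. /usr/lib/jvm/java-8-openjdk-amd64 on a 64-bit Ubuntu system.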

Editing the core-site.xml file

Step 11: Edit the core-site.xml file to specify the default NameNode URL and the Hadoop temporary directory. This temporary directory is used by Hadoop's map and reduce operations.

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Between the <configuration> and </configuration> tags, insert the following properties.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hiberstack/tmpdata</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://172.31.25.86:9000</value>
</property>

Editing the hdfs-site.xml file

Step 12: Edit the hdfs-site.xml file to set the NameNode directory, the DataNode directory, and the default HDFS replication factor.

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Between the <configuration> and </configuration> tags, insert the following properties.

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hiberstack/dfsdata/namenode</value>
</property>

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hiberstack/dfsdata/datanode</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

Editing the mapred-site.xml file

Step 13: The MapReduce framework can be defined by editing the mapred-site.xml file.

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Between the <configuration> and </configuration> tags, add the following property.

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Editing the yarn-site.xml file

Step 14: To configure the Node Manager, Resource Manager, and Application Master, modify the yarn-site.xml file.

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Between the <configuration> and </configuration> tags, insert the following properties.

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>172.31.25.86</value>
</property>

<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>

<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>


Formatting the NameNode

Step 15: We must format the NameNode before launching the Hadoop services for the first time. Run the following command:

hdfs namenode -format

A shutdown notice will appear once the formatting is finished.

Starting Hadoop Services

Step 16: Now use the command listed below to start the Hadoop services.

$HADOOP_HOME/sbin/start-all.sh

Use the command listed below to check whether the services are running.

jps
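If all daemons started correctly, jps should list something like the following on a single-node setup (the process IDs shown here are placeholders and will differ):

12001 NameNode
12002 DataNode
12003 SecondaryNameNode
12004 ResourceManager
12005 NodeManager
12006 Jps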

Congratulations! Hadoop is now successfully installed on Ubuntu.
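As an optional smoke test, assuming the cluster is healthy, you can create a home directory in HDFS and run the pi estimator from the bundled examples jar:

hdfs dfs -mkdir -p /user/hiberstack

hdfs dfs -ls /

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar pi 2 10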

Setting Up Multiple Machines

Download the Ubuntu disc image (ISO) by searching for "Ubuntu disc image iso file download" and opening the official download link.

Download Oracle VM VirtualBox along with the Linux disc image, then open VirtualBox.

To set up the machine, select "New", name it, and choose Linux as the type and Ubuntu as the version.

Allocate the RAM.

Select "Next", click "Create a virtual hard disk now", select "Create", choose VMDK, and select "Next".

Choose the "Dynamically allocated" option, click "Next", and choose the size of the hard disk.

Select "Create". Go to "Settings --> System" to increase or decrease the RAM.

Select "Storage", click "Empty", and choose your disc image.

Select "Network". Choose "Bridged Adapter" so that all the machines have distinct IPs, and click "OK".

Select "Start" to boot the machine and begin setting up Ubuntu.

Look for "Install Ubuntu" and complete the installation.

Select "New", name it and select "Linux".

Select "Next" and choose the required RAM.

Develop a virtual hard disk, click on "Create".

Select VMDK, go to "Next".

Select "Dynamically allocated", click "Next", mention hard disk size and select "Create".

Repeat the steps.

Download Cloudera QuickStart VM

After completing the steps above, wait a while until a pop-up prompts you to reboot the machine. Apply the same procedure to the second machine, and wait until both machines are configured. In the meantime, you can search Google for the Cloudera QuickStart VM and download it.

After following the download link, choose VirtualBox as your platform and provide your details to obtain the QuickStart VM. Once the zip file is downloaded, unzip it to create a single-node Cloudera cluster. The two machines can be configured separately or by cloning; here, we demonstrate configuring them individually.

Now, restart your machine.

Download Java and Hadoop

On machine 1, open the browser, search for Oracle JDK 1.8, and open the Oracle download page.

Accept the licence agreement when the page appears; you can then select a stable version from the options shown. As this machine is 64-bit, we select the x64 tar file. This should download the Java tar file.

Open a separate browser tab and search for archive.apache.org, since we also need a Hadoop package. Open the archive link.

Search the archive for Hadoop and select it. Choose a stable version from the several that are shown; in this case, we pick hadoop-2.6.5/. Then click hadoop-2.6.5.tar.gz to download the Hadoop tar file. Hadoop and Java are now both downloading.

We already have the tar files and SSH configured, which lets us copy the tar files from machine one to machine two, so downloading Java and Hadoop again on the second machine is not necessary.


Conclusion

With this, we come to the end of this blog about how to install Hadoop on Ubuntu. We hope that you will be able to smoothly go through the installation with the help of our step-by-step process along with the commands we have mentioned with every step. 


FAQs

What is HDFS in Hadoop?

Hadoop is a Java framework that combines elements of the MapReduce computing paradigm and the Google File System (GFS) to run applications on large clusters of commodity hardware. HDFS in particular is a highly fault-tolerant distributed file system designed to be deployed on inexpensive hardware. It is appropriate for applications with large data volumes and offers high-throughput access to application data.

Where is the Hadoop configuration file located?

By default, the Hadoop configuration file can be found at /etc/hadoop/hdfs-site.xml. The dfs.namenode properties are located here.

Which version of Hadoop should I learn?

Hadoop 2.x is widely used, so it is a good version to learn.

What is Apache Hadoop?

Apache Hadoop is an open-source framework that can efficiently store and process datasets ranging in size from gigabytes to petabytes. By clustering many computers, Hadoop can analyze large datasets in parallel more quickly than a single powerful machine could store and process them.

Why does Hadoop require Java?

Hadoop is a Java-based framework for running programs on a large cluster of commodity hardware, and it resembles the Google File System in many ways. Java must be installed on Ubuntu first because Hadoop needs it to run.