The phrase "big data" is almost synonymous with Hadoop. As the term implies, big data is a collection of data too large to be handled by conventional databases. Hadoop is a system for storing large amounts of data in a distributed manner and processing it in parallel, which makes it well suited to managing big data, and it must be installed on your machine before you can start working with it. So how do you install an Apache Hadoop cluster on Ubuntu? There are several Hadoop distributions: you might set up Apache Hadoop itself, the primary distribution, a Cloudera distribution, or a Hortonworks distribution (acquired by Cloudera in 2018). In this guide, we will learn how to install Hadoop on Ubuntu. Although we are using a cloud platform here (specifically Amazon Web Services), you can follow the same procedure on a local Ubuntu system. We will install Hadoop 3.2.2, which was released on January 9th, 2021. So, let's get started, shall we?
Hadoop is an Apache open-source platform used for storing, processing, and analyzing extremely large volumes of data. It is written in Java and is not an OLAP (online analytical processing) system; it is designed for batch or offline processing. Facebook, Yahoo, Google, Twitter, LinkedIn, and many other companies use it. In addition, scaling up only requires adding nodes to the cluster.
There are four modules in Hadoop:
HDFS (Hadoop Distributed File System): HDFS was created following the publication of Google's GFS paper. In keeping with its distributed design, files are split into blocks and stored across the nodes of the cluster.
YARN (Yet Another Resource Negotiator): YARN manages the cluster and handles job scheduling.
MapReduce: a framework that enables Java programs to perform parallel computations on data using key-value pairs. The Map task transforms the input data into a data set that can be computed as key-value pairs, and the Reduce task consumes the output of the Map task to produce the required result (see the word-count example just after this list).
Hadoop Common: the Java libraries used by the other Hadoop modules to start Hadoop.
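As a minimal illustration of how a MapReduce job is run (assuming the cluster installed later in this guide is up, that an /input directory already exists in HDFS, and that /output does not yet exist), the word-count example that ships with the Hadoop 3.2.2 distribution can be launched as follows:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount /input /output
The job maps each word in /input to a (word, 1) pair, and the reducer sums the counts per word and writes the result to /output.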
For this installation, we used an Amazon EC2 t1.micro free-tier Ubuntu instance. You can also follow the same instructions on your local Ubuntu system.
Step 1: Installing Java is the very first step in installing Hadoop, because Hadoop is written in Java and needs Java to function. We'll set up OpenJDK version 8. Run the following commands to do so:
sudo apt update
sudo apt install openjdk-8-jdk -y
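To confirm that Java was installed correctly, you can check the version (the exact build string may differ on your system):
java -version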
Step 2: Create a new user
sudo su
useradd hiberstack -m -d /home/hiberstack -s /bin/bash
Step 3: Add the new user to the sudoers file so that we can run tasks without encountering any issues. In essence, this grants the user root privileges.
echo -e 'hiberstack ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/hiberstack
Step 4: Switch to the new user.
su - hiberstack
Step 5: To enable the new user to ssh without a password, we must configure SSH for it. Generate a new key with the command below.
ssh-keygen
A new key pair is created. To allow SSH access without a password, append the public key to the authorized_keys file using the command below.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Stop and restart your instance now so the changes take effect. Then run the command below to check that passwordless SSH works.
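A minimal check, assuming you simply want to verify passwordless access to the same machine, is:
ssh localhost
If you are logged in without a password prompt, the key setup worked; type exit to return to your previous shell.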
Step 6: Create a new directory and download the Hadoop binary distribution package into it.
mkdir hadoop
cd hadoop/
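The package itself can be fetched from the Apache archive; for example, assuming the 3.2.2 release path on archive.apache.org:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz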
Step 7: Extract the Hadoop binary package using the command below.
tar -xvf hadoop-3.2.2.tar.gz
Step 8: Set the Hadoop environment variables. To edit the .bashrc file and set the variables, first change to the home directory.
cd
nano .bashrc
Add the lines listed below to the file to configure the environment variables.
export HADOOP_HOME=/home/hiberstack/hadoop/hadoop-3.2.2
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Step 9: Use the command shown below to apply the changes made to the .bashrc file.
source .bashrc
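To quickly confirm that the variables are now in effect, you can, for example, print one of them:
echo $HADOOP_HOME
This should print /home/hiberstack/hadoop/hadoop-3.2.2.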
Editing the hadoop-env.sh file
Step 10: Hadoop determines which Java installation to use from the Java path set in the hadoop-env.sh file. To set it, open the file with the command below.
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
The file already contains an export JAVA_HOME line. Simply uncomment it and edit it as shown below. If you cannot find the line, add the one below at the end of the file.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
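If your Java installation lives elsewhere, one way to locate it (assuming the OpenJDK package installed in Step 1) is:
readlink -f /usr/bin/javac | sed "s:/bin/javac::"
Use the path it prints as the value of JAVA_HOME.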
Editing the core-site.xml file
Step 11: Edit the core-site.xml file to specify the default NameNode URL and the Hadoop temporary directory. The temporary directory is used by Hadoop's map and reduce operations.
nano $HADOOP_HOME/etc/hadoop/core-site.xml
Insert the properties below between the <configuration> and </configuration> tags.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hiberstack/tmpdata</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://172.31.25.86:9000</value>
</property>
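If you prefer to create the temporary directory configured above ahead of time (assuming the same path), you can do so with:
mkdir -p /home/hiberstack/tmpdata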
Editing the hdfs-site.xml file
Step 12: Set the NameNode directory, DataNode directory, and hdfs default replication factor by editing the hdfs-site.xml file.
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Insert the properties below between the <configuration> and </configuration> tags.
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hiberstack/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hiberstack/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
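Likewise, the NameNode and DataNode directories configured above can be created ahead of time:
mkdir -p /home/hiberstack/dfsdata/namenode /home/hiberstack/dfsdata/datanode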
Editing the mapred-site.xml file
Step 13: Edit the mapred-site.xml file to define which MapReduce framework to use.
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Insert the property below between the <configuration> and </configuration> tags.
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
Editing the yarn-site.xml file
Step 14: To configure the Node Manager, Resource Manager, and Application Master, modify the yarn-site.xml file.
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Insert the properties below between the <configuration> and </configuration> tags.
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>172.31.25.86</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
Step 15: We must format the NameNode before starting the Hadoop services for the first time. Run the following command to do so:
hdfs namenode -format
A shutdown notice will appear once the formatting is finished.
Starting Hadoop Services
Step 16: Now start the Hadoop services with the command below.
$HADOOP_HOME/sbin/start-all.sh
Use the command below to check whether the services are running.
jps
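If everything started correctly, the output of jps should list daemons similar to the following (process IDs are omitted here and will differ on your machine):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps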
Setting Up Multiple Machines:
Download the Ubuntu disk image by searching for "Ubuntu disk image ISO file download."
Open the download link from the search results.
Download Oracle VM VirtualBox along with the Ubuntu disk image.
To set up the machine, select "New", give it a name, and choose Linux as the type and Ubuntu as the version.
Allocate RAM space.
Select "Next", click "Create a virtual hard disk now", select "Create", choose VMDK, and select "Next".
Choose the "Dynamically allocated" option, click "Next", and choose the size of the hard disk.
Select "Create". Go to "Settings --> System" to increase or decrease the RAM.
Select "Storage", click "Empty", and choose your disk image.
Select "Network", choose "Bridged Adapter" so that each machine gets a distinct IP, and click "OK".
Select "Start" to boot the machine and begin the Ubuntu setup.
Choose "Install Ubuntu" and complete the installation.
Select "New", name it and select "Linux".
Select "Next" and choose the required RAM.
Develop a virtual hard disk, click on "Create".
Select VMDK, go to "Next".
Select "Dynamically allocated", click "Next", mention hard disk size and select "Create".
Repeat the steps.
You will need to wait a while after completing the steps above before a pop-up prompts you to reboot the machine. Apply the same procedure to the second machine as well, and wait until both machines are configured. In the meantime, you can search for the Cloudera QuickStart VM on Google and download it.
On the download page, choose VirtualBox as your platform and then fill in your details to obtain the QuickStart VM. After downloading the zip file, unzip it to get a single-node Cloudera cluster. The two machines can be configured separately or by cloning; here we have shown how to do it manually.
Now, restart your machine.
Download Java and Hadoop
On machine 1, open the browser and search for Oracle JDK 1.8, then open the download link.
Accept the license agreement on the download page; you can then select a stable version from the options shown. As this machine is 64-bit, we select the x64 tar file. This downloads the Java tar file.
We also need a Hadoop package, so open a separate browser tab and go to archive.apache.org.
Search for Hadoop and select it, then choose a stable version from the several that are shown. In this case, we pick hadoop-2.6.5/. After that, click hadoop-2.6.5.tar.gz to download the Hadoop tar file. Hadoop and Java are now both downloading.
Since the tar files are downloaded and SSH is already configured, we can copy them from machine one to machine two, so downloading Java and Hadoop on the second machine is not necessary.
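For example, the files can be copied with scp (the username, IP address of the second machine, and exact file names below are placeholders that you should replace with your own):
scp hadoop-2.6.5.tar.gz jdk-8u*-linux-x64.tar.gz user@<machine2-ip>:~/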
Conclusion
With this, we come to the end of this blog about how to install Hadoop on Ubuntu. We hope that you will be able to smoothly go through the installation with the help of our step-by-step process along with the commands we have mentioned with every step.
Hadoop is a Java framework that combines elements of the MapReduce computing paradigm and the Google File System (GFS) to run applications on large clusters of commodity hardware. HDFS in particular is a highly fault-tolerant distributed file system designed to run on inexpensive hardware. It is suited to applications with very large data volumes and offers high-throughput access to application data.
By default, the HDFS configuration file can be found at /etc/hadoop/hdfs-site.xml (in this guide, it is located under $HADOOP_HOME/etc/hadoop/hdfs-site.xml). The dfs.namenode properties are set here.
Hadoop 2.x is still widely used, so it is also worth learning.
Apache Hadoop is an open-source framework for efficiently storing and processing datasets ranging from gigabytes to petabytes. It clusters many computers together so that large datasets can be analyzed in parallel more quickly than with a single powerful machine for storage and processing.
Hadoop is a Java-based framework for running programs on a large cluster of commodity hardware, and its file system resembles the Google File System in many ways. Java must be installed on Ubuntu first because Hadoop requires it.