Azure HDInsight

Hadoop pioneered a fundamentally new way of storing and processing data. It is a software framework for storing and processing big data in a distributed fashion on large clusters of machines, and it makes it possible to run applications on systems with thousands of nodes and thousands of terabytes of data. Azure provides a Hadoop distribution in the cloud called Azure HDInsight. In this post, we briefly explain what HDInsight is and how it is used. You will get to know the types of clusters available in HDInsight and learn how to create an HDInsight cluster. Let us get started.

What is Azure HDInsight?

Azure HDInsight is a fully managed analytics service provided by Microsoft Azure. It helps you create Hadoop clusters that run on Linux (Ubuntu). It was originally based on the Hortonworks Data Platform (HDP) Hadoop distribution. Microsoft later announced its own Hadoop distribution on July 21st, 2020, during the Microsoft Inspire event. Users won't notice much of a difference, since the underlying distribution is still based on Apache open-source components.

Azure HDInsight provides Hadoop as a service on top of the Azure platform. It helps in managing and analyzing big data. It supports scenarios such as extract, transform, and load (ETL) on huge amounts of data, data warehousing, IoT, and machine learning. The frameworks available in HDInsight include Hadoop, Apache Spark, Apache Hive, Apache Storm, R, Apache Kafka, and many more.

Features of HDInsight

Here are some of the features that Azure HDInsight provides.

High availability

Applications run on thousands of cores and terabytes of memory in Azure. HDInsight continuously monitors cluster health to keep your applications reliable, and if it detects a failure, it recovers automatically.

End-to-end security

It provides enterprise-grade security and meets government compliance standards for data stored on clusters. You can restrict access to sensitive data and applications by defining role-based access policies. It also protects data with encryption and Active Directory authentication. It has received more than 30 industry-standard certifications for protecting data.

Global availability

HDInsight clusters can be deployed in more than 26 public regions and in Government regions across the world, so users can deploy a cluster in a data center near them.

Open source ecosystem

HDInsight is based on open-source Apache frameworks, so when a new stable release of a framework becomes available, HDInsight users get access to that release.

Native integrations

It provides native integrations for a lot of Azure services like Synapse Analytics, Data Lake Storage, Blob Storage, Azure Cosmos DB, Event Hubs, and Data Factory. It also allows integrations with other big data certified applications with one-click deployment.


Monitoring

All Hadoop-related jobs generate log files that contain detailed information, including cluster configuration, error states, and so on. To monitor the logs of different clusters, users can integrate with Azure Log Analytics and monitor all clusters through a single interface.

Advantages of HDInsight

The following are the key advantages of Azure HDInsight.

  • Users can easily run Hadoop-related frameworks like Spark, Kafka, Storm, etc., on the cloud.
  • Azure HDInsight makes it easy to process huge amounts of data.
  • Users can easily spin up clusters on Azure HDInsight within minutes. So, the users don’t have to worry about installations or managing infrastructure.
  • Users can scale up or down the clusters whenever they want.
  • Users can develop in languages and platforms such as Java, .NET, Python, R, C#, and Scala.
  • Around 30 Hadoop and Spark applications are available in Azure Marketplace. Users can pick any application and deploy it directly in their cluster.
  • Users can build applications in tools like Jupyter, Zeppelin, Visual Studio, Eclipse, and IntelliJ.
  • HDInsight can integrate with other Azure services like Data Lake Storage, Data Factory, etc.
  • Users can scale down the cluster when the workload is light, so they pay only for what they use.
  • It can run interactive queries over data in any format, whether structured or unstructured.
  • Users can process streaming data from various devices in real-time.
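
To make the pay-for-what-you-use point concrete, here is a small Python sketch of the arithmetic. The hourly rate and the node counts are made-up placeholders, not Azure prices; actual HDInsight billing depends on node sizes, cluster type, and region.

```python
def cluster_cost(schedule, rate_per_node_hour):
    """Return the cost of a cluster given (node_count, hours) periods.

    rate_per_node_hour is a hypothetical flat price, not a real Azure rate.
    """
    node_hours = sum(nodes * hours for nodes, hours in schedule)
    return node_hours * rate_per_node_hour


# Hypothetical day: 8 worker nodes during a 10-hour busy window,
# scaled down to 2 nodes for the remaining 14 hours.
RATE = 0.50  # assumed price per node-hour, for illustration only
fixed = cluster_cost([(8, 24)], RATE)
scaled = cluster_cost([(8, 10), (2, 14)], RATE)
print(f"fixed size: {fixed:.2f}, scaled: {scaled:.2f}, saved: {fixed - scaled:.2f}")
```

Under these assumed numbers, the scaled-down schedule uses 108 node-hours instead of 192, which is where the saving comes from.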

Types of clusters in HDInsight

Azure HDInsight offers several cluster types that serve different purposes. Users can customize the cluster according to their preference by adding components, features, languages, and utilities. Here are the available cluster types.

Apache Hadoop

The Apache Hadoop cluster type is a fully managed Hadoop cluster on Azure. It uses HDFS for storage, YARN for resource management, and MapReduce to process big data. MapReduce jobs can be implemented in various languages like Java, C#, and Python.
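
As a rough illustration of the MapReduce model, the following Python sketch runs a word count locally with separate map, shuffle, and reduce steps. It only mimics the programming model in a single process; the function names are illustrative, and a real job on the cluster would be submitted through Hadoop rather than called like this.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data on Azure", "Big clusters process big data"])
print(counts["big"], counts["data"])  # prints: 3 2
```

On a cluster, the mappers and reducers run in parallel on different nodes, and the shuffle step moves data between them over the network.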

Apache Spark

This cluster type provides Apache Spark, which enables in-memory processing and thereby boosts the performance of big data analysis. Spark clusters are compatible with Azure Storage and Azure Data Lake Storage, and they provide Jupyter and Apache Zeppelin notebooks for development.

Apache HBase

HBase is a NoSQL column-oriented database from Apache that can store huge amounts of unstructured and semi-structured data. Azure provides HBase as a managed cluster integrated into the Azure environment, and it follows HBase's scale-out architecture. Data written through an Apache HBase cluster is stored directly in Azure Storage.
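
To give a feel for how data is organized in this kind of store, here is a minimal in-memory sketch in Python: rows are addressed by a row key, cells by a "family:qualifier" column name, and each cell keeps timestamped versions. The class and the sample keys are hypothetical and this is not the HBase client API; it only models the shape of the data.

```python
import time


class TinyWideColumnTable:
    """In-memory stand-in for a wide-column table (not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self._rows = {}

    def put(self, row_key, column, value, timestamp=None):
        """Store a new version of one cell."""
        ts = time.time() if timestamp is None else timestamp
        versions = self._rows.setdefault(row_key, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda cell: cell[0], reverse=True)  # newest first

    def get(self, row_key, column):
        """Return the newest value for a cell, or None if it is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, prefix=""):
        """Yield (row_key, latest_cells) in sorted row-key order."""
        for row_key in sorted(self._rows):
            if row_key.startswith(prefix):
                cells = self._rows[row_key]
                yield row_key, {col: v[0][1] for col, v in cells.items()}


table = TinyWideColumnTable()
table.put("device42#2021-10-01", "reading:temp", "21.5", timestamp=1)
table.put("device42#2021-10-01", "reading:temp", "22.0", timestamp=2)
table.put("device42#2021-10-01", "meta:unit", "C", timestamp=1)
table.put("device07#2021-10-01", "reading:temp", "19.0", timestamp=1)
print(table.get("device42#2021-10-01", "reading:temp"))  # prints: 22.0
print([key for key, _ in table.scan("device")])
```

Rows need not share the same columns, which is why this model suits semi-structured data, and scanning in row-key order is why row keys are usually designed around the most common lookup pattern.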

ML Services

This cluster type provides on-demand access to scalable, distributed analytics methods on HDInsight. It offers R-based analytics on data stored in either Azure Blob Storage or Data Lake Storage, along with 8000+ open-source R packages. R scripts run on the cluster nodes, and the resulting predictions or models can be downloaded for local use.

Apache Storm

Apache Storm is a distributed and fault-tolerant computation system used to process streaming data in real-time. Apache Storm on HDInsight guarantees that every message is fully processed. Users can monitor and manage Storm topologies from the browser using the Storm UI.
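
The "fully processed" guarantee rests on acknowledging each message and replaying the ones that fail. The Python sketch below shows that idea in a single process; it does not use the Storm API, and the handler, the message values, and the retry limit are all hypothetical.

```python
from collections import deque


def run_with_replay(messages, handler, max_attempts=3):
    """Process every message, replaying any that the handler fails on.

    Returns (acked, failed): messages that completed, and messages that
    still failed after max_attempts tries.
    """
    pending = deque((msg, 1) for msg in messages)
    acked, failed = [], []
    while pending:
        msg, attempt = pending.popleft()
        try:
            handler(msg)
            acked.append(msg)  # "ack": processing completed
        except Exception:
            if attempt < max_attempts:
                pending.append((msg, attempt + 1))  # "fail": replay later
            else:
                failed.append(msg)
    return acked, failed


seen = {}


def flaky_handler(msg):
    """Hypothetical handler that fails the first time it sees 'b'."""
    seen[msg] = seen.get(msg, 0) + 1
    if msg == "b" and seen[msg] == 1:
        raise RuntimeError("transient failure")


acked, failed = run_with_replay(["a", "b", "c"], flaky_handler)
print(acked, failed)  # prints: ['a', 'c', 'b'] []
```

Note that a replayed message may be handled more than once, so handlers in this style of system are usually written to tolerate duplicates.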

Apache Interactive Query

Interactive Query in Azure is also known as Apache Hive LLAP (Low Latency Analytical Processing). It offers in-memory caching that makes Apache Hive queries run faster. Users can use this cluster type to run high-speed queries on data stored in Azure Storage and Azure Data Lake Storage. This cluster type contains only the Hive service.

Apache Kafka

Apache Kafka is used for building data pipelines on streaming data. It offers message queue functionality where data streams can be published and subscribed to. The backing store for Kafka is Azure Managed Disks. The cluster provides all the Kafka functionalities like partitioning, racks, replication, etc.
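
Partitioning and publish/subscribe can be pictured with a small in-memory model, sketched here in Python. It is not the Kafka client API: the class and key names are hypothetical, and CRC32 stands in for the murmur2 hash that Kafka's default partitioner applies to keyed records.

```python
import zlib


class TinyTopic:
    """In-memory stand-in for a partitioned topic (not the Kafka client API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Kafka's default partitioner uses a murmur2 hash; CRC32 is a stand-in.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key, value):
        """Append a record to the partition chosen by its key."""
        index = self.partition_for(key)
        self.partitions[index].append((key, value))
        return index

    def read(self, partition, offset):
        """Return records from `offset` onward; the consumer tracks offsets."""
        return self.partitions[partition][offset:]


topic = TinyTopic(num_partitions=3)
for i in range(4):
    topic.publish("sensor-1", f"reading-{i}")
topic.publish("sensor-2", "reading-0")

p = topic.partition_for("sensor-1")
print(topic.read(p, offset=2))
```

Because the partition is derived from the key, all records for one key land in the same partition and keep their order, while different keys can be spread across partitions and consumed in parallel.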

Creating an HDInsight cluster

Navigate to the Azure portal and log in to your account. Click '+ Create a resource' in the top menu. Click 'Analytics' and select the 'Azure HDInsight' option. The 'Create an HDInsight cluster' page opens. Select your Azure subscription for the subscription field. Select your resource group and enter a unique name in the 'Cluster Name' field.

Select a region for the 'Location' field and choose a cluster type from the drop-down. Enter the login username and password for the cluster. Set 'sshuser' as the Secure Shell (SSH) username and select the 'Use cluster login password for SSH' checkbox. Click the 'Next: Storage >>' button. Select 'Azure Storage' from the drop-down for 'Primary storage type' and choose the 'Use access key' radio button. Enter your storage account name and access key. Click the 'Next: Security + networking >>' button.

All the options under the networking tab are optional. Configure any that you need, then click the 'Next: Configuration + pricing >>' button. Select the node size and number of nodes for each node type. Click the 'Review + create >>' button to review all the settings configured so far. Go through them once and click 'Create'. It takes around 20 minutes to create the cluster. Once the process is complete, you will get a notification.
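
The same choices can also be scripted. As a hedged sketch, the Python snippet below only assembles an Azure CLI `az hdinsight create` command line from the portal inputs; it does not run anything. The flag names are given as best recalled and should be checked against `az hdinsight create --help`, and the cluster name, resource group, region, and storage account are hypothetical placeholders.

```python
import shlex


def build_create_command(name, resource_group, location, cluster_type,
                         http_password, storage_account,
                         ssh_user="sshuser", worker_nodes=2):
    """Build (but do not run) an `az hdinsight create` argument list."""
    return [
        "az", "hdinsight", "create",
        "--name", name,
        "--resource-group", resource_group,
        "--location", location,
        "--type", cluster_type,
        "--http-password", http_password,
        "--ssh-user", ssh_user,
        "--storage-account", storage_account,
        "--workernode-count", str(worker_nodes),
    ]


cmd = build_create_command(
    name="my-hdi-cluster",            # hypothetical cluster name
    resource_group="my-rg",           # hypothetical resource group
    location="eastus",                # hypothetical region
    cluster_type="spark",
    http_password="<cluster-login-password>",  # placeholder, never hard-code
    storage_account="mystorageacct",  # hypothetical storage account
)
print(shlex.join(cmd))
```

Building the command as a list keeps each value as a separate argument, so it can later be passed to `subprocess.run` without shell quoting problems once the flags have been verified.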

Install apps on HDInsight cluster

Navigate to the Azure portal and log in to your account. Click '+ Create a resource' in the top menu. Click 'Analytics' and select the 'Azure HDInsight' option. Select your HDInsight cluster from the list. Your cluster page opens, where you can see all the information about your cluster.

Go to 'Settings' and select the 'Applications' option. You will see a list of installed applications; if you have not installed any, the list will be empty. To add a new application, click '+Add' in the menu. Select the application that you want to install in your cluster and accept the terms and conditions. You can follow the installation status in the portal notifications. Once the installation is complete, the application appears in the Installed Apps list.


Conclusion

Azure HDInsight is a widely used Hadoop distribution in the cloud. LG, Roche, Virginia Tech, Blackball, Leeds Teaching Hospital, Fusionex, and iTrend are some of the companies that have adopted HDInsight in their business. Through this post, you have learned how to create HDInsight clusters. Go ahead and try creating different clusters that suit your business needs. Azure provides the Ambari UI to monitor and manage clusters.

Saritha Reddy
Research Analyst
A technical lead content writer at HKR Trainings with expertise in delivering content on in-demand technologies such as Networking, Storage & Virtualization, Cyber Security & SIEM Tools, Server Administration, Operating System & Administration, IAM Tools, Cloud Computing, etc. She creates content for users and keeps up to date with the latest trends in the market. To know more, connect with her on LinkedIn, Twitter, and Facebook.