Hadoop has pioneered a fundamentally new way of storing and processing data. Hadoop is a software framework for storing and processing big data in a distributed fashion on large clusters of systems. Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Azure provides a distribution of Hadoop on the cloud called Azure HDInsight. In this post, we will explain briefly what HDInsight is and how it is used. You will get to know the types of clusters available in HDInsight. You will also know how to create an HDInsight cluster. Let us get started.
Azure HDInsight is a fully managed analytics service provided by Microsoft Azure. It helps in creating Hadoop clusters using Linux with Ubuntu. It was originally based on the Hortonworks Data Platform (HDP) Hadoop distribution on Microsoft Azure. Microsoft later announced its Hadoop distribution on July 21st, 2020, during the Microsoft Inspire event. Users won't know much of a difference, but the underlying Hadoop distribution is based on Apache open source components.
Azure HDInsight provides Hadoop as a service on top of the Azure platform. It helps in managing and analyzing big data. It comes with features to extract, transform, and load (ETL) huge amounts of data, data warehousing, IoT, and machine learning. The frameworks that we can install in HDInsight are Hadoop, Apache Spark, Apache Hive, Apache Storm, R, Apache Kafka, and many more.
Interested in learning Azure Course ? Enroll in our Microsoft Azure Certification Training program now!
Here are some of the features that Azure HDInsight provides.
The applications run on thousands of cores and TBs of memory in Azure. It always makes sure that your applications are reliable by monitoring the health continuously. If it finds any failures at any point in time, it recovers from failures automatically.
It provides enterprise-grade security and government compliance standards for data stored on clusters. You can even restrict access to sensitive data and applications for some users by defining role-based access policies. It also enables protection with encryption, Active Directory authentication. It has received more than 30 industry-standard certifications for protecting data.
Get More information on Azure application insights
HDInsight clusters can be deployed in more than 26 public regions and Government regions across the world. So it will be easier for users to deploy a cluster in a data center near them.
HDInsight is based on open-source Apache frameworks. So if there is a new stable release of a framework in the market, users will be updated with the newest release of that framework.
It provides native integrations for a lot of Azure services like Synapse Analytics, Data Lake Storage, Blob Storage, Azure Cosmos DB, Event Hubs, and Data Factory. It also allows integrations with other big data certified applications with one-click deployment.
All the Hadoop related jobs generate log files that contain detailed information. They also include cluster configuration, error states, etc. To monitor all the logs of different clusters, users can integrate to Azure Log Analytics, where they can monitor all clusters through a single interface.
The following are the key advantages of Azure HDInsight.
Azure HDInsight offers several cluster types that serve different purposes. Users can customize the cluster according to their preference by adding components, features, languages, and utilities. Here are the available cluster types.
Apache Hadoop is a fully managed Hadoop cluster on Azure. It consists of HDFS for storage and YARN for resource management. It also consists of MapReduce to process big data. MapReduce can be implemented in various languages like Java, C#, and Python.
This cluster sources Apache Spark, which enables in-memory processing. This boosts the performance of big data analysis. These Spark clusters are compatible with Azure Storage and Azure Data Lake Storage. It also provides Jupyter and Apache Zeppelin notebooks for development.
HBase is a NoSQL columnar database from Apache that can store huge amounts of unstructured and semi-structured data. Azure provides HBase as a managed cluster integrated into the Azure environment. It implements the scale-out architecture of HBase. The data stored through the Apache HBase cluster will directly get stored in Azure Storage.
what are the Azure cognitive services
This cluster provides on-demand access for scalable and distributed methods of analytics on HDInsight. It offers R based analytics on the data stored in either Azure Blob or Data Lake storage. It offers 8000+ open-source R packages. The nodes make it possible to run R scripts. The resultant predictions or models can be downloaded for local use.
Apache Storm is a distributed and fault-tolerant computation system used to process streaming data in real-time. Apache Storm on HDInsight guarantees that every message is fully processed. Users can monitor and manage Storm topologies from the browser using the Storm UI.
The Interactive Query in Azure is also known as Apache Hive LLAP. It offers in-memory caching that makes the Apache Hive queries run faster. Users can use this cluster to run high-speed queries on data stored in Azure storage and Azure Data Lake Storage. This cluster only contains Hive service.
Apache Kafka is used for building data pipelines on streaming data. It offers message queue functionality where data can be published and subscribed. The backing store for Kafka is Azure Managed Disks. The cluster provides all the Kafka functionalities like partitioning, racks, replication, etc.
Navigate to the Azure portal https://portal.azure.com/ and login to your account. Click on '+ Create a resource' from the top menu. Click on 'Analytics' and select the 'Azure HDInsight' option. The 'Create an HDInsight cluster' page will open. Select the 'Azure' for the subscription field. Select your resource group and give a unique name for the 'Cluster Name' field.
Select a region for the 'Location' field and choose a cluster type from the drop-down. Give the username and password for the cluster. Set 'sshuser' as the Secure Shell (SSH) username. Select the checkbox for 'Use cluster login password for SSH' field. Click on the 'Next: Storage >>' button. Select 'Azure Storage' from the drop-down for 'Primary storage type'. Choose the 'Use access key' radio button. Give your Storage account name and Access key. Click on 'Next: Security + networking >>' button.
All the options under the networking tab are optional. So if you want to configure anything, set those options, and click on 'Next: Configuration + pricing >>' button. Select the node size and number of nodes for each node type. Click on 'Review + create >>' button. You can view all the settings configured until now. Go through them once and click on 'Create'. It takes around 20 minutes to create the cluster. Once the process is complete, you will get a notification.
[Related Article:Microsoft Azure tutotrial]
Navigate to the Azure portal https://portal.azure.com/ and login to your account. Click on '+ Create a resource' from the top menu. Click on 'Analytics' and select the 'Azure HDInsight' option. Select your HDInsight cluster from the list. Your cluster page will open where you can see all the information on your cluster.
Go to 'Settings' and select 'Applications' option. You will get a list of already installed applications. If you did not install any application, then the list will be empty. To add a new application, click on '+Add' from the menu. Select an application that you want to install in your cluster. Accept the terms and conditions. The installation status can be viewed on portal notifications. Once the application installation is complete, it will appear in the Installed Apps list.
Azure HDInsight is the most popular Hadoop distribution on the cloud. LG, Roche, Virginia Tech, Blackball, Leeds Teaching Hospital, Fusionex, iTrend are some of the top companies that have already adopted HDInsight in their business. Through this post, you have learned how to create HDInsight clusters. Go ahead and try creating different clusters that suit your business needs. Azure provides Ambari UI to monitor and manage all the clusters.
Related articles :
Batch starts on 2nd Oct 2022, Weekend batch
Batch starts on 6th Oct 2022, Weekday batch
Batch starts on 10th Oct 2022, Weekday batch