Azure Data Factory is Microsoft's cloud-based data integration service. In this blog, we explore the features of Azure Data Factory, compare it with SSIS (SQL Server Integration Services), and see how it helps solve real-life data integration problems. Azure Data Factory (ADF) is an integration platform that connects to many different data sources. You can use it to build hybrid Extract-Transform-Load (ETL), Extract-Load-Transform (ELT) and other data integration pipelines. With ADF you can: copy data from various on-premises and cloud sources; transform the data; publish the copied and transformed data to destination storage; and monitor the data flows.
Azure Data Factory (ADF) is a cloud integration system that moves data between on-premises and cloud systems and schedules and orchestrates data flows. ADF is more of an Extract-and-Load and Transform-and-Load platform than a traditional Extract-Transform-and-Load (ETL) platform. The following approaches are used to achieve Extract-and-Load.
Data transfer between file systems and database systems, whether on-premises or in the cloud, is configured with ADF's built-in features. ADF can connect to a variety of data stores, such as SQL Server, Oracle, MySQL, DB2, Azure SQL Database, Azure Data Lake, Blob storage, the local file system and HDFS.
For more sophisticated data movement and transformation tasks, ADF can launch SSIS packages.
Moving data to or from the cloud raises several challenges. The points below explain how ADF addresses them and why it is used.
Services such as SQL Agent, Azure Scheduler and Azure Automation can trigger data integration tasks that move data. ADF includes its own job scheduling features and also supports event-based data flows and dependencies.
ADF can transfer gigabytes of data into the cloud within a few hours. Built-in parallelism and time slicing let it handle large volumes of data.
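The idea behind time slicing and parallelism can be pictured with a small Python sketch. This is only an illustration of the general technique, not ADF's internal implementation: the `time_slices` and `copy_slice` functions, the two-hour slice size and the worker count are all hypothetical choices made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta


def time_slices(start, end, step):
    """Yield consecutive (slice_start, slice_end) windows covering start..end."""
    current = start
    while current < end:
        nxt = min(current + step, end)
        yield current, nxt
        current = nxt


def copy_slice(window):
    """Hypothetical stand-in for copying the rows that fall inside one window."""
    slice_start, slice_end = window
    return f"copied {slice_start:%H:%M}-{slice_end:%H:%M}"


# Split six hours of data into two-hour slices (assumed values).
slices = list(
    time_slices(
        datetime(2024, 1, 1, 0, 0),
        datetime(2024, 1, 1, 6, 0),
        timedelta(hours=2),
    )
)

# Process the slices in parallel; map() returns results in slice order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(copy_slice, slices))
```

Because each slice covers a separate time window, the slices are independent of each other and can be copied at the same time or retried individually.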
ADF ensures security by encrypting data in transit between on-premises and cloud sources.
ADF v2 provides an interactive interface in the Azure portal that reduces the coding needed to develop components. The components themselves are configured as JSON files.
ADF integrates with GitHub so that builds can be developed and deployed to Azure automatically. The entire configuration can be downloaded as an Azure Resource Manager (ARM) template and used to deploy ADF in other environments. Developers skilled in PowerShell can also use it to create and deploy all ADF components.
To understand how ADF works, it is important to know its components, which are described below. The picture below shows the ADF component resources: a pipeline, two datasets and one data flow.
Linked services are connectors that hold the settings for accessing a particular data source. The settings include the server/database name, file folder, credentials, etc. Each data flow may have one or more linked services, depending on the nature of the job.
The configuration settings for a dataset include the table name, file name, structure, etc. Each dataset refers to a linked service, which determines the list of possible dataset properties.
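To make the relationship between a linked service and a dataset concrete, here is a minimal sketch that builds the two JSON definitions as Python dictionaries. The names `ExampleBlobStorage`, `ExampleSalesCsv`, `example-container` and `sales.csv` are made up for illustration, and the property layout reflects the general name/properties/typeProperties shape of ADF definitions as I understand it, so check it against the official documentation before relying on it.

```python
import json

# Linked service: holds the connection settings for a data source.
linked_service = {
    "name": "ExampleBlobStorage",  # hypothetical name
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            # Placeholder only; real credentials belong in a secret store.
            "connectionString": "<connection-string-placeholder>"
        },
    },
}

# Dataset: describes the data itself and points at a linked service by name.
dataset = {
    "name": "ExampleSalesCsv",  # hypothetical name
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "example-container",  # assumed container
                "fileName": "sales.csv",  # assumed file
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}


def dataset_points_to(ds, ls):
    """Return True if the dataset references the given linked service."""
    ref = ds["properties"]["linkedServiceName"]
    return ref["type"] == "LinkedServiceReference" and ref["referenceName"] == ls["name"]


dataset_json = json.dumps(dataset, indent=2)
```

The dataset never repeats the connection details. It only names the linked service, which is why one linked service can back many datasets.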
Activities are the actions performed on the data, such as data movement and data transformation. Activity configuration settings include the database query, stored procedure name, parameters, script locations, etc.
Data flows allow data engineers to develop transformation logic visually without writing code. Data flow activities run within ADF pipelines on Azure Databricks for scaled-out processing using Spark. ADF takes care of the code translation and data flow execution, which is how it handles large amounts of data.
A pipeline is a logical grouping of activities. A data factory can have one or more pipelines, and each pipeline may have one or more activities. Pipelines make it easy to schedule and monitor multiple logically related activities.
Triggers hold the scheduling configuration for pipelines. The settings include the start/end date, execution frequency, etc.
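The way activities, pipelines and triggers fit together can be sketched as JSON definitions built in Python. All names here (`ExampleCopyPipeline`, `CopySalesToSql`, `ExampleSalesCsv`, `ExampleSalesTable`, `ExampleHourlyTrigger`) and the dates are hypothetical, and the property layout is my best understanding of the ADF definition format rather than a verified template.

```python
# Pipeline: a logical group of activities. This one has a single copy activity.
pipeline = {
    "name": "ExampleCopyPipeline",  # hypothetical name
    "properties": {
        "activities": [
            {
                "name": "CopySalesToSql",  # hypothetical activity name
                "type": "Copy",
                # Datasets are referenced by name (both names are assumed).
                "inputs": [{"referenceName": "ExampleSalesCsv", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "ExampleSalesTable", "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "AzureSqlSink"},
                },
            }
        ]
    },
}

# Trigger: start/end date and execution frequency for one or more pipelines.
trigger = {
    "name": "ExampleHourlyTrigger",  # hypothetical name
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Hour",
                "interval": 1,
                "startTime": "2024-01-01T00:00:00Z",  # assumed dates
                "endTime": "2024-12-31T00:00:00Z",
                "timeZone": "UTC",
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": pipeline["name"],
                    "type": "PipelineReference",
                }
            }
        ],
    },
}


def triggered_pipelines(trg):
    """List the names of the pipelines a trigger will start."""
    return [p["pipelineReference"]["referenceName"] for p in trg["properties"]["pipelines"]]
```

Keeping the schedule in the trigger rather than in the pipeline means the same pipeline can be run on several schedules, or on demand, without changing its definition.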
The integration runtime (IR) is the compute infrastructure through which ADF provides data movement and compute capabilities across different network environments. It comes in the following main types.
Azure IR: Provides fully managed, serverless compute in Azure and handles data movement activities in the cloud.
Self-hosted IR: Manages copy activities between a cloud data store and a data store in a private network, as well as transformation activities.
Azure-SSIS IR: Executes SSIS packages.
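As a rough rule of thumb, the three runtime types above can be turned into a small decision function. This is a simplification based only on the descriptions in this post; the function name and its two parameters are hypothetical, and real projects will have more factors to weigh.

```python
def choose_integration_runtime(runs_ssis_package: bool, touches_private_network: bool) -> str:
    """Pick an integration runtime type from two simplified yes/no questions."""
    if runs_ssis_package:
        # SSIS packages are executed by the Azure-SSIS IR.
        return "Azure-SSIS IR"
    if touches_private_network:
        # A data store inside a private network calls for the self-hosted IR.
        return "Self-hosted IR"
    # Purely cloud-to-cloud data movement can use the managed Azure IR.
    return "Azure IR"


# Example: copying from an on-premises database to a cloud data store.
choice = choose_integration_runtime(runs_ssis_package=False, touches_private_network=True)
```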
The picture below gives an overview of ADF and shows the relationship between its entities: dataset, activity, pipeline and linked service.
The following features distinguish ADF from other ETL tools.
Working with Azure Data Factory is straightforward. ADF is designed around a GUI for creating and managing activities and pipelines, which reduces the coding effort. Only complex transformations require coding skills.
ADF ships with built-in connectors for many data sources, including MySQL, SQL Server and Oracle databases.
ADF integrates well with other Azure compute and storage resources through linked services, which define the connection to the external resources. You can define two kinds of linked services.
Data Store service
This linked service represents a data store, such as Azure SQL Database, a data lake, Azure SQL Data Warehouse, a file system, an on-premises database, a NoSQL DB, etc.
Compute service
This linked service represents a compute resource used to transform and enrich data, such as Azure HDInsight, Azure Machine Learning, a stored procedure in any SQL database, a U-SQL activity, Data Lake Analytics, Azure Databricks and/or Azure Batch.
Azure Data Factory can transform and enrich complex data and makes it easy to integrate cloud and on-premises data. Its integration services are scalable and available at low cost, providing data flow building blocks for any data platform or machine learning project.