Did you know that more than 750 contributors from more than 200 companies work with Apache Spark across the world today? That is because Apache Spark has increased access to big data. It builds on existing big data investments and helps you keep pace with growing enterprise adoption. Many organizations, such as Amazon, Yahoo, and eBay, run Apache Spark on clusters with thousands of nodes, so there is a huge demand for Spark developers. Why wait? Let's get started with this Apache Spark tutorial and grab the opportunities this fascinating platform offers. In this tutorial, you will learn about Apache Spark, Spark installation, Spark architecture, Spark components, and more.
Apache Spark is a lightweight open-source framework for processing data, including data generated in real time. It was designed for fast computation and builds on the Hadoop MapReduce model. In other words, Apache Spark was developed to speed up the Hadoop computing process. Spark extends the MapReduce model so that it can be used efficiently for more types of computation, including stream processing and interactive queries. Its main feature is in-memory cluster computing, which increases the processing speed of an application.
Apache Spark covers a wide range of workloads, such as iterative algorithms, interactive queries, batch applications, and streaming. Because it supports all of these workloads in one engine, it also reduces the management burden of maintaining separate tools.
In 2009, Matei Zaharia started Spark as a research project at UC Berkeley's AMPLab. It was open-sourced in 2010 under a BSD license. Spark was donated to the Apache Software Foundation in 2013 and became a top-level Apache project in 2014.
The amount of data being generated is increasing day by day, and traditional methods cannot handle this huge volume. Big data tools such as Hadoop emerged to solve this problem, but they too have some limitations. Apache Spark addresses these limitations, and it has become popular because of its speed and lower complexity.
The Spark toolset is continuously expanding, which is attracting third-party interest. So boost your career by learning Apache Spark from this tutorial. You can write applications in whichever supported language you are comfortable with: Java, Python, R, or Scala. Moreover, Spark developers are paid high salaries.
Step 1: Before installing Apache Spark, verify whether Java is installed. If it is, proceed to the next step; otherwise, download Java and install it on your system.
Step 2: Then verify whether Scala is installed on your system. If it is, proceed; otherwise, download the latest version of Scala and install it.
Step 3: Now download the latest version of Apache Spark from the following link.
You can see the Spark zip file in your downloads folder.
Step 4: Extract it. Then create a folder named Spark under your user directory and copy the contents of the extracted archive into it.
Step 5: Now configure the path.
Go to Control Panel -> System and Security -> System -> Advanced Settings -> Environment Variables
Add a new user variable (or system variable) named SPARK_HOME whose value is the path of the Spark folder created in Step 4.
(To add a new user variable, click the New button in the user variables section.)
Then click OK.
Now, Add %SPARK_HOME%\bin to the path variable.
Then click OK.
Step 6: On Windows, Spark needs Hadoop's winutils.exe to run. For Hadoop 2.7, install the matching winutils.exe.
You can find winutils.exe at the following link. Download it.
Step 7: Create a folder named winutils on the C drive and create a folder named bin inside it. Move the downloaded winutils.exe file to the bin folder.
Now add a user (or system) variable named HADOOP_HOME that points to the winutils folder, in the same way as SPARK_HOME.
Click OK. This step completes the Spark installation.
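One way to double-check the variables configured in Steps 5 to 7 is a small plain-Scala program. This is only a sketch: the object name InstallCheck and its helper functions are made up for this illustration, and it assumes winutils.exe was placed under the bin folder as described in Step 7.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical helper for this tutorial; not part of Spark.
object InstallCheck {
  // Returns the trimmed value of an environment variable if it is set and non-empty.
  def envDir(name: String, env: Map[String, String] = sys.env): Option[String] =
    env.get(name).map(_.trim).filter(_.nonEmpty)

  // True when <home>/bin/<file> exists on disk.
  def hasBinary(home: String, file: String): Boolean =
    Files.exists(Paths.get(home, "bin", file))

  def main(args: Array[String]): Unit = {
    for (name <- Seq("SPARK_HOME", "HADOOP_HOME")) {
      envDir(name) match {
        case Some(dir) => println(s"$name = $dir")
        case None      => println(s"$name is not set")
      }
    }
    // winutils.exe is expected under %HADOOP_HOME%\bin (Step 7).
    envDir("HADOOP_HOME").foreach { home =>
      println(s"winutils.exe found: ${hasBinary(home, "winutils.exe")}")
    }
  }
}
```

If either variable is reported as not set, open a new terminal window first, because environment changes are only visible to processes started after the change.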
Apache Spark has a well-defined, layered architecture in which all the layers and components are loosely coupled. The architecture is integrated with various libraries and extensions. Spark follows a master-slave architecture, where a cluster consists of a single master and multiple worker nodes.
1. Directed Acyclic Graph (DAG):
Directed Acyclic Graph is a sequence of computations performed on data. Here each node is an RDD partition, and each edge is a transformation on top of data. DAG eliminates the Hadoop MapReduce multistage execution model and provides performance enhancements over Hadoop.
Let us understand it more clearly.
Here the driver program runs the main() function of the application and creates a SparkContext object. A Spark application runs as an independent set of processes on the cluster, and the SparkContext coordinates them. To run on a cluster, the SparkContext connects to one of several types of cluster manager. It then acquires executors on nodes in the cluster and sends the application code to the executors. The application code is defined by the JAR or Python files passed to the SparkContext. Finally, the SparkContext sends tasks to the executors to run.
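The flow described above can be sketched as a tiny driver program. This is a minimal sketch, not a production setup: the object name DriverSketch is made up, and the master URL "local[*]" is an assumption that runs Spark inside a single JVM instead of on a real cluster. It also prints the RDD lineage with toDebugString, which gives a textual view of the chain of transformations behind the DAG.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver program for illustration.
object DriverSketch {
  // Builds a small job and returns (result, lineage) so both can be inspected.
  def run(sc: SparkContext): (Int, String) = {
    val numbers = sc.parallelize(1 to 10)     // source RDD
    val evens   = numbers.filter(_ % 2 == 0)  // transformation
    val doubled = evens.map(_ * 2)            // transformation
    val lineage = doubled.toDebugString       // textual view of the RDD lineage
    val total   = doubled.reduce(_ + _)       // action: tasks are sent to executors
    (total, lineage)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: "local[*]" runs everything in this JVM, using all available cores.
    val conf = new SparkConf().setAppName("DriverSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (total, lineage) = run(sc)
      println(lineage)
      println(s"total = $total") // prints "total = 60"
    } finally sc.stop()
  }
}
```

On a real cluster, the master URL would name the cluster manager instead of local[*]; the rest of the driver code stays the same.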
2. Resilient Distributed Dataset (RDD):
Resilient Distributed Datasets are collections of data items that are split into partitions and stored in the memory of the Spark cluster's worker nodes.
RDDs can be created in two ways:
Parallelized Collection: Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in the driver program. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
Here is an example of how to create a parallelized collection holding the numbers 1 to 3.
val numbr = Array(1, 2, 3)
val distnumbr = sc.parallelize(numbr)
External Datasets: Distributed datasets can be created from any storage source supported by Hadoop, such as HDFS, HBase, Cassandra, or even the local file system. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine or an hdfs:// URI) and reads it as a collection of lines.
scala> val distFile = sc.textFile("data.txt")
distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD at textFile at
Once created, distFile can be acted on by dataset operations. For example, the sizes of all the lines can be added up using the map and reduce operations.
distFile.map(s => s.length).reduce((a, b) => a + b)
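A self-contained variant of the line-length sum might look like the sketch below. The object name LineLengths, the temporary file, and the "local[*]" master are assumptions made for this illustration so that the example does not depend on an existing data.txt.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object for illustration.
object LineLengths {
  // Reads a text file as an RDD of lines and adds up the length of every line.
  def totalLength(sc: SparkContext, uri: String): Int =
    sc.textFile(uri).map(line => line.length).reduce((a, b) => a + b)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so no cluster is required.
    val sc = new SparkContext(
      new SparkConf().setAppName("LineLengths").setMaster("local[*]"))
    try {
      // Write a two-line temporary file so the example has known input.
      val tmp = Files.createTempFile("data", ".txt")
      Files.write(tmp, "spark\nrdd\n".getBytes("UTF-8"))
      // Pass a file: URI so the path is read from the local file system.
      println(totalLength(sc, tmp.toUri.toString)) // prints 8 (5 + 3)
    } finally sc.stop()
  }
}
```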
RDD Operations: RDDs support two types of operations: transformations and actions.
In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are lazy: they are computed only when an action requires a result to be returned to the driver program.
Frequently used RDD transformations include map, filter, flatMap, distinct, union, groupByKey, reduceByKey, and join.
In Spark, the role of an action is to return a value to the driver program after running a computation on the dataset.
Frequently used RDD actions include reduce, collect, count, first, take, saveAsTextFile, and foreach.
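The difference between lazy transformations and actions can be observed with a small experiment. This is a sketch under assumptions: the object name LazyDemo is made up, and it uses a long accumulator (covered under shared variables later in this tutorial) purely as a counter to show when the map function actually runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object for illustration.
object LazyDemo {
  // Returns (count before the action, count after the action, collected result).
  def run(sc: SparkContext): (Long, Long, Array[Int]) = {
    val seen = sc.longAccumulator("elements mapped")
    // Transformation: only recorded at this point, nothing is computed yet.
    val squares = sc.parallelize(1 to 5).map { n => seen.add(1); n * n }
    val before = seen.sum           // still 0, because map has not run
    val result = squares.collect()  // action: triggers the computation
    val after = seen.sum            // 5, one update per element
    (before, after, result)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so no cluster is required.
    val sc = new SparkContext(
      new SparkConf().setAppName("LazyDemo").setMaster("local[*]"))
    try {
      val (before, after, result) = run(sc)
      println(s"before=$before after=$after result=${result.mkString(",")}")
    } finally sc.stop()
  }
}
```

Counting inside a transformation is done here only for demonstration; if a task is re-executed, such updates may be applied more than once.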
RDD Persistence: One of the important capabilities Spark provides is persisting a dataset in memory across operations. When an RDD is persisted, each node stores in memory any partitions of it that it computes and reuses them in other actions on that dataset. This makes future actions much faster. An RDD can be marked for persistence using the persist() or cache() methods. Spark's cache is fault-tolerant: if any partition of an RDD is lost, it is automatically recomputed using the transformations that originally created it. Persisted RDDs can be stored using different storage levels. A storage level is set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is shorthand for using the default storage level, StorageLevel.MEMORY_ONLY.
The available storage levels include MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, and OFF_HEAP.
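A minimal sketch of persisting an RDD with an explicit storage level is shown below. The object name PersistDemo and the sample words are made up for this illustration, and "local[*]" is assumed as the master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical example object for illustration.
object PersistDemo {
  // Returns (storage level in use, number of elements, sum of word lengths).
  def run(sc: SparkContext): (StorageLevel, Long, Int) = {
    val words = sc.parallelize(Seq("spark", "rdd", "dag", "spark"))
    // Mark the mapped RDD for persistence; nothing is stored until an action runs.
    val lengths = words.map(_.length).persist(StorageLevel.MEMORY_AND_DISK)
    val count = lengths.count()        // first action: computes and stores the partitions
    val total = lengths.reduce(_ + _)  // second action: can reuse the stored partitions
    val level = lengths.getStorageLevel
    lengths.unpersist()                // release the stored partitions when done
    (level, count, total)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("PersistDemo").setMaster("local[*]"))
    try println(run(sc)) finally sc.stop()
  }
}
```

Calling cache() instead of persist(StorageLevel.MEMORY_AND_DISK) would use the default MEMORY_ONLY level.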
RDD Shared Variables: When a function passed to a Spark operation is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program.
Spark provides two limited types of shared variables: broadcast variables and accumulators.
i) Broadcast variables: Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Spark attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication costs. Spark actions are executed through a set of stages, separated by distributed "shuffle" operations. Spark automatically broadcasts the common data needed by tasks within each stage. Data broadcast in this way is cached in serialized form and deserialized before each task runs.
A broadcast variable is created from a variable v by calling SparkContext.broadcast(v).
scala> val v = sc.broadcast(Array(1, 2, 3))
ii) Accumulators: An accumulator is a variable that is only added to through an associative and commutative operation, which makes it suitable for implementing counters or sums. Spark natively supports accumulators of numeric types. To create a numeric accumulator of type Long or Double, use SparkContext.longAccumulator() or SparkContext.doubleAccumulator().
scala> val a=sc.longAccumulator("Accumulator")
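The two kinds of shared variables can be combined in one small job. The following is a sketch under assumptions: the object name SharedVars, the lookup table, and the keys are made up for this illustration, and "local[*]" is assumed as the master.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object for illustration.
object SharedVars {
  // Returns (names looked up through the broadcast table, number of missing keys).
  def run(sc: SparkContext): (Array[String], Long) = {
    // Read-only lookup table, cached once per machine instead of shipped with each task.
    val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))
    // Counter that tasks can only add to; the driver reads the total.
    val missing = sc.longAccumulator("missing keys")
    val names = sc.parallelize(Seq(1, 2, 3, 4)).map { key =>
      // The default branch runs only when the key is absent from the table.
      lookup.value.getOrElse(key, { missing.add(1); "unknown" })
    }.collect()
    (names, missing.sum)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SharedVars").setMaster("local[*]"))
    try {
      val (names, missingCount) = run(sc)
      println(s"${names.mkString(",")} missing=$missingCount")
    } finally sc.stop()
  }
}
```

Tasks read the broadcast value through lookup.value, while only the driver reads the accumulator's total.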
The Spark project consists of different components that are tightly integrated. At its core, Spark is a computational engine that can distribute, monitor, and schedule multiple applications.
Spark cannot replace Hadoop, but it extends what Hadoop can do. From the beginning, Spark has been able to read data from and write data to the Hadoop Distributed File System (HDFS). We can say that Apache Spark is a Hadoop-compatible data processing engine that can handle both batch and streaming workloads. So running Spark over Hadoop provides enhanced functionality.
We can use Spark over Hadoop in three ways: Standalone, YARN, and SIMR.
In Standalone mode, we can allocate resources on all the machines or on a subset of machines in the Hadoop cluster. We can also run Spark side by side with Hadoop MapReduce.
We can run Spark on YARN without any prerequisites. This integrates Spark into the Hadoop stack, so that the stack can use Spark's facilities and advantages.
With Spark in MapReduce (SIMR), we can start using the Spark shell within a few minutes of downloading it, which reduces the deployment overhead.
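From the application's side, the choice between Standalone and YARN mostly shows up as the master URL in the Spark configuration. The sketch below illustrates this under assumptions: the object name DeployConf, the application name, and the host name "master-host" are placeholders, and 7077 is the default port of a standalone master. SIMR is not covered by this sketch.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper for illustration.
object DeployConf {
  // Configuration for a standalone cluster whose master listens on host:port.
  def standalone(host: String, port: Int = 7077): SparkConf =
    new SparkConf().setAppName("DeployDemo").setMaster(s"spark://$host:$port")

  // Configuration for YARN; the cluster location is taken from the Hadoop configuration.
  def onYarn(): SparkConf =
    new SparkConf().setAppName("DeployDemo").setMaster("yarn")

  def main(args: Array[String]): Unit = {
    // "master-host" is a placeholder host name.
    println(standalone("master-host").get("spark.master")) // spark://master-host:7077
    println(onYarn().get("spark.master"))                  // yarn
  }
}
```

Building a SparkConf does not contact the cluster; the connection is only made when a SparkContext is created from it.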
Spark provides high performance for both batch data and streaming data. It is easy to use and provides a collection of libraries. Moreover, the following are the uses of Apache Spark:
There is good demand for expert professionals in this field. We hope this tutorial helped you learn Apache Spark. In this tutorial, we have covered the topics required to enhance your professional skills in Apache Spark.