PySpark is a framework that runs on a cluster of commodity hardware and performs data unification, i.e., reading and writing a wide variety of data from multiple sources. In Spark, a task is a unit of work, such as a map task. Execution is handled by the Spark Context, which also provides APIs in different languages, i.e., Scala, Java and Python, to develop applications that execute faster compared to MapReduce.
In this article, you can go through the set of PySpark interview questions most frequently asked by interview panels. This will help you crack the interview, as these questions are curated by the topmost industry experts at HKR Trainings.
Let us have a quick review of the PySpark interview questions & PySpark coding interview questions.
Ans: An object is an instantiation of a class. A class can be instantiated by calling the class using the class name.
Syntax:
<object-name> = <class-name>(<arguments>)
Example:
class Student:
    id = 25
    name = "HKR Trainings"
    estb = 10

    def display(self):
        print("ID: %d \nName: %s \nEstb: %d" % (self.id, self.name, self.estb))

stud = Student()
stud.display()
Output:
ID: 25
Name: HKR Trainings
Estb: 10
Ans: In Python, a method is a function that is associated with an object. Any object type can have methods.
Example:
class Student:
    roll = 17
    name = "gopal"
    age = 25

    def display(self):
        print(self.roll, self.name, self.age)
In the above example, a class named Student is created. It contains three fields (the student's roll, name and age) and a method display(), which is used to display the student's information.
Ans: Encapsulation is used to restrict access to methods and variables. Here, the methods and data/variables are wrapped together within a single unit so as to prevent data from direct modification.
Below is the example of encapsulation whereby the max price of the product cannot be modified as it is set to 75.
Example:
class Product:
    def __init__(self):
        self.__maxprice = 75

    def sell(self):
        print("Selling Price: {}".format(self.__maxprice))

    def setMaxPrice(self, price):
        self.__maxprice = price

p = Product()
p.sell()

# try to change the price directly
p.__maxprice = 100
p.sell()
Output:
Selling Price: 75
Selling Price: 75
Ans: Inheritance refers to a concept where one class inherits the properties of another. It helps to reuse the code and establish a relationship between different classes.
Inheritance is performed between the following two types of classes: a base class (the class being inherited from) and a derived class (the class that inherits).
In Python, a derived class can inherit a base class by simply mentioning the base class in brackets after the derived class name.
The syntax to inherit a base class into the derived class is shown below:
Syntax:
class derived-class(base class):
The syntax to inherit multiple classes is shown below by specifying all of them inside the bracket.
Syntax:
class derived-class(<base class 1>, <base class 2>, ..... <base class n>):
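For illustration, here is a minimal sketch of single and multiple inheritance; the class names (Person, Employee, Student, TeachingAssistant) are hypothetical and not part of the original answer.
Example:
class Person:                                 # base class
    def __init__(self, name):
        self.name = name

class Employee:                               # second base class
    def __init__(self, emp_id):
        self.emp_id = emp_id

class Student(Person):                        # single inheritance
    pass

class TeachingAssistant(Person, Employee):    # multiple inheritance
    def __init__(self, name, emp_id):
        Person.__init__(self, name)
        Employee.__init__(self, emp_id)

s = Student("HKR")
print(s.name)                                 # prints: HKR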
Ans: A for loop in Python iterates over a sequence. It involves two things: an iterable object such as a list, tuple or string, and a loop variable that stores the successive values from the sequence on each iteration.
Syntax:
for iter in sequence:
    statements(iter)
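A minimal example (the list name and values are made up) that prints each element of a list:
Example:
courses = ["Python", "Spark", "SQL"]
for course in courses:       # course takes each value from the list in turn
    print(course)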
Ans: Python allows us to handle loops in an interesting manner by letting you attach an else block to a loop. The else block executes when the loop finishes without hitting a break statement, which includes the case below, where the loop body never runs because the sequence is empty.
Example :
x = []
for i in x:
    print("in for loop")
else:
    print("in else block")
Output:
in else block
Ans: In Python, there are two types of errors - syntax error and exceptions.
Syntax Error: Also known as a parsing error, this is an issue in a program that may cause it to exit abnormally. When a syntax error is detected, the parser repeats the offending line and displays an arrow pointing at the earliest point in the line where the error was detected.
Exceptions: An exception occurs when the normal flow of the program is interrupted by an external event. Even if the syntax of the program is correct, an error may still be detected during execution; such an error is an exception. Some examples of exceptions are ZeroDivisionError, TypeError and NameError.
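As a small illustration, a hedged sketch of catching one of these exceptions with a try/except block:
Example:
try:
    result = 10 / 0                    # raises ZeroDivisionError
except ZeroDivisionError as err:
    print("Cannot divide by zero:", err)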
Ans: Lists are mutable, whereas tuples are immutable, as the following example shows.
Example:
list_num = [1, 2, 5, 9]    # assumed initial values; only index 3 is changed below
tup_num = (1, 2, 5, 9)
list_num[3] = 7
print(list_num)
tup_num[3] = 7
Output:
[1,2,5,7]
Traceback (most recent call last):
File "python", line 6, in <module>
TypeError: 'tuple' object does not support item assignment
In this code, we assigned 7 to list_num at index 3, and in the output we can see 7 at index 3. However, when we assigned 7 to tup_num at index 3, we got a TypeError. This is because tuples are immutable and cannot be modified.
Ans: The int() function converts a value to an integer.
It can be called with a string containing a number as the argument, and it will return the number converted to an actual integer.
Example:
print(int("1") + 2)
The above prints 3.
Ans: Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.
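As an illustration, a minimal sketch of common cleaning steps using the PySpark DataFrame API; the file name and column names are hypothetical:
Example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cleaning-demo").getOrCreate()

# hypothetical input file and columns
df = spark.read.csv("students.csv", header=True, inferSchema=True)

cleaned = (df.dropDuplicates()             # remove duplicated rows
             .na.drop(subset=["name"])     # drop rows where name is missing
             .na.fill({"age": 0}))         # fill missing ages with a default
cleaned.show()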
Ans: Data visualization is the representation of data or information in a graph, chart, or other visual format. It communicates relationships in the data through images. Data visualization is important because it allows trends and patterns to be seen more easily.
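As a brief sketch (assuming matplotlib, which the answer does not name, and made-up data), a simple bar chart:
Example:
import matplotlib.pyplot as plt

# hypothetical monthly enrolment counts
months = ["Jan", "Feb", "Mar", "Apr"]
enrolments = [120, 150, 90, 180]

plt.bar(months, enrolments)
plt.xlabel("Month")
plt.ylabel("Enrolments")
plt.title("Enrolments per month")
plt.show()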
Ans: To support Python with Spark, the Spark community released a tool called PySpark. It is primarily used to process structured and semi-structured datasets and also supports an optimized API to read data from multiple data sources containing different file formats. Using PySpark, you can also work with RDDs in the Python programming language with the help of a library called Py4j.
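For example, a minimal sketch of reading data in different file formats through the DataFrame reader API (the file paths are hypothetical):
Example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

# hypothetical source files in different formats
json_df = spark.read.json("events.json")
parquet_df = spark.read.parquet("sales.parquet")
json_df.printSchema()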
The main characteristics of PySpark are listed below:
Ans: RDD stands for Resilient Distributed Datasets, a fault-tolerant collection of operational elements that are capable of running in parallel. These RDDs, in general, are portions of data that are stored in memory and distributed over many nodes.
All partitioned data in an RDD is distributed and immutable.
There are primarily two types of RDDs available:
Ans: Spark provides two methods to create an RDD: by parallelizing a collection in the driver program using the parallelize() method, and by loading an external dataset. For example, in Scala:
val DataArray = Array(2, 4, 6, 8, 10)
val DataRDD = sc.parallelize(DataArray)
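Since this article is about PySpark, the equivalent sketch in Python (assuming an existing SparkContext named sc, and a hypothetical file path) would be:
Example:
# 1. Parallelizing an in-memory collection
data_rdd = sc.parallelize([2, 4, 6, 8, 10])

# 2. Loading an external dataset
text_rdd = sc.textFile("data.txt")     # hypothetical file path

print(data_rdd.collect())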
Ans: The following are the components of Apache Spark.
Ans: When an action is called on a Spark RDD at a high level, Spark hands the lineage graph over to the DAG Scheduler.
The DAG Scheduler divides the operators into stages of tasks; a stage contains tasks based on the partitions of the input data. The DAG Scheduler pipelines operators together and submits the stages to the Task Scheduler, which launches the tasks via the cluster manager. The Task Scheduler does not know about the dependencies between stages. The workers execute the tasks on the slave nodes.
Ans: Spark Streaming is an extension of the Spark API that enables stream processing of live data streams. Data from multiple sources such as Flume, Kafka and Kinesis is processed and then pushed to live dashboards, file systems and databases. In terms of input data, it is similar to batch processing: the incoming data is divided into small batches (streams) before processing.
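A minimal sketch of a streaming word count reading from a socket source (the host, port and batch interval are made up for illustration):
Example:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")
ssc = StreamingContext(sc, batchDuration=5)        # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # hypothetical socket source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()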
Ans: MLlib is a scalable Machine Learning library offered by Spark. It makes Machine Learning simple and scalable with standard learning algorithms and use cases such as regression, collaborative filtering, clustering and dimensionality reduction.
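As an illustration, a hedged sketch of clustering with KMeans from pyspark.ml; the data values are made up:
Example:
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# hypothetical two-feature dataset
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)], ["x", "y"])

features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)
model = KMeans(k=2, featuresCol="features").fit(features)
print(model.clusterCenters())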
Ans:
Ans: Spark Core implements several key functions such as:
Moreover, additional libraries built on top of the core allow diverse workloads for streaming, machine learning and SQL. These are useful for:
Ans: The module used is Spark SQL, which integrates relational processing with Spark's functional programming API. It supports querying data either through SQL or the Hive Query Language. These are the four libraries of Spark SQL:
Ans: Spark SQL is capable of:
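As an illustration of querying data with Spark SQL, a minimal sketch (the view name and columns are hypothetical):
Example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# hypothetical DataFrame registered as a temporary view
df = spark.createDataFrame([("gopal", 25), ("asha", 30)], ["name", "age"])
df.createOrReplaceTempView("students")

spark.sql("SELECT name FROM students WHERE age >= 25").show()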
Ans: The different algorithms supported by PySpark are:
Ans: For improving performance, PySpark supports custom serializers to transfer data. They are:
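For instance, a hedged sketch of creating a context that transfers data with the MarshalSerializer:
Example:
from pyspark import SparkContext
from pyspark.serializers import MarshalSerializer

# use the faster (but less general) Marshal serializer for data transfer
sc = SparkContext("local", "serializer-demo", serializer=MarshalSerializer())
print(sc.parallelize(range(5)).map(lambda x: x * 2).collect())
sc.stop()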
Ans: PySpark StorageLevel controls how an RDD is stored: in memory, on disk, or both, and whether the RDD partitions are serialized and/or replicated. The class is defined as follows:
class pyspark.StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication = 1)
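A brief usage sketch, persisting an RDD to memory and disk with the partitions replicated on two nodes:
Example:
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "storage-demo")
rdd = sc.parallelize(range(100))
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)    # memory + disk, replication = 2
print(rdd.getStorageLevel())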
Ans: PySpark SparkContext is treated as the initial entry point for using any Spark functionality. The SparkContext uses the Py4j library to launch a JVM and then create a JavaSparkContext. By default, the SparkContext is available as 'sc'.
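A minimal sketch of creating a SparkContext explicitly (the application name and master URL are illustrative):
Example:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("context-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)     # entry point for RDD functionality

print(sc.version)
sc.stop()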
Ans: PySpark SparkFiles is used to load files onto the Apache Spark application. Files are added with sc.addFile (a method on SparkContext), and SparkFiles.get is used to resolve the path to a file that was added through sc.addFile. The class methods exposed by SparkFiles are get(filename) and getRootDirectory().
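A brief sketch of distributing a file and resolving its path (the file name is hypothetical):
Example:
from pyspark import SparkContext, SparkFiles

sc = SparkContext("local", "sparkfiles-demo")
sc.addFile("lookup.csv")                  # hypothetical file to distribute

print(SparkFiles.get("lookup.csv"))       # absolute path to the added file
print(SparkFiles.getRootDirectory())      # root directory holding added files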
Ans: Apache Spark is a graph (DAG) execution engine that enables users to analyze massive data sets with high performance. For this, data needs to be held in memory, which improves performance drastically when it is manipulated across multiple stages of processing.
Ans: SparkConf helps in setting a few configurations and parameters to run a Spark application on the local/cluster. In simple terms, it provides configurations to run a Spark application.
Ans: Few main attributes of SparkConf are listed below:
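A short sketch showing typical SparkConf usage (the configuration values are illustrative):
Example:
from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("conf-demo")
        .setMaster("local[4]")
        .set("spark.executor.memory", "2g"))

print(conf.get("spark.executor.memory"))
print(conf.toDebugString())               # dump all configured key/value pairs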
You can use the above PySpark interview questions for experienced data engineer roles as well.