PySpark is the Python API for Apache Spark: it lets you write Spark applications in Python and work with Resilient Distributed Datasets (RDDs) and DataFrames from Python code. This is made possible by the Py4J library, which is bundled with PySpark and lets Python interact dynamically with JVM objects. PySpark also offers a rich set of functions for writing efficient programs. In PySpark, a new column can be added to a DataFrame with the withColumn() or select() functions or through SQL expressions, while dropping or deleting a column is performed using the drop() function. In this post, let us go through the following things:
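All the examples below work on a DataFrame named df_orders. As a minimal sketch, you could build a small sample DataFrame like this (the column names column1, column2, order_name and order_id are placeholders chosen for illustration, not part of the original post):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("drop-columns-demo").getOrCreate()

# Small sample DataFrame reused in the examples that follow.
df_orders = spark.createDataFrame(
    [(1, "A", "laptop", 100), (2, "B", "phone", 101)],
    ["column1", "column2", "order_name", "order_id"],
)
df_orders.show()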
In PySpark, we can drop a single column using the drop() function. Passing the column name as an argument to drop() deletes that particular column.
Syntax: df_orders.drop('column1').show()
When we execute the above syntax, the column1 column will be dropped from the dataframe.
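Note that drop() does not modify the original DataFrame; it returns a new one. A quick sketch against the sample df_orders defined above:

# drop() returns a new DataFrame without column1; df_orders itself is unchanged.
df_no_col1 = df_orders.drop("column1")
print(df_no_col1.columns)   # column1 is gone
print(df_orders.columns)    # the original still contains column1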
We can also drop a single column with the drop() function by passing a column reference such as df.name_of_the_column as the argument.
Syntax: df_orders.drop(df_orders.column1).show()
If we execute the above syntax, then the column1 column will be dropped from the dataframe.
We can also drop several columns at once in PySpark using the drop() function. We simply pass the names of the columns to be removed as arguments to drop().
Syntax: df_orders.drop('column1', 'column2').show()
When we execute the above syntax, the column1 and column2 columns will be dropped from the dataframe.
We can also drop multiple columns using the drop() function in another way. The column names to be deleted are first collected in a list, for example columns_to_be_dropped, and this list is then unpacked into the drop() function.
Syntax: columns_to_be_dropped = ['column1', 'column2']
df_orders.drop(*columns_to_be_dropped).show()
When we execute the above syntax, the column1 and column2 columns will be dropped from the dataframe.
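As a brief sketch with the sample df_orders, the list form is equivalent to passing the names directly; the * operator simply unpacks the list into separate string arguments:

columns_to_be_dropped = ["column1", "column2"]

# These two calls produce the same result: *columns_to_be_dropped unpacks
# the list into individual arguments for drop().
df_orders.drop(*columns_to_be_dropped).show()
df_orders.drop("column1", "column2").show()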
There is another method for dropping multiple columns: chaining drop() calls. Each drop() call removes one column, so the columns are dropped one after the other in a single statement.
Syntax: df_orders.drop(df_orders.column1).drop(df_orders.column2).show()
If we execute the above syntax, then the column1 and column2 columns will be dropped from the dataframe one after the other.
Deleting more than one column by position in PySpark is done in a roundabout way. The columns (and rows) to keep are extracted by position using the select() function first, and the result is then converted into a new dataframe.
Syntax: spark.createDataFrame(df_orders.select(df_orders.columns[:2]).take(5)).show()
When we execute the above syntax, only the first two columns (column1 and column2) and the first five rows are kept, so all the remaining columns are effectively dropped from the dataframe.
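If the goal is to drop the columns at certain positions rather than keep them, a more direct sketch (my own variation, using the sample df_orders) is to slice df_orders.columns and unpack the slice into drop():

# Drop the first two columns by position: columns[:2] yields their names,
# and * unpacks them as arguments to drop().
df_orders.drop(*df_orders.columns[:2]).show()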
Deleting more than one column whose name starts with a particular string in PySpark is also done in a roundabout way. The list of column names that start with that string is built first with the help of the startswith() method, and this list is then passed to the drop() function.
Syntax: list_name = df_orders.columns
columns_to_be_deleted = [i for i in list_name if i.startswith('column')]
df_orders.drop(*columns_to_be_deleted).show()
When we execute the above syntax, columns that start with the name 'column' will be dropped from the dataframe.
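With the sample df_orders, any column whose name begins with the prefix 'column' would be removed while order_name and order_id remain; a short sketch:

# Keep only the columns that do NOT start with the given prefix.
prefix = "column"
to_drop = [c for c in df_orders.columns if c.startswith(prefix)]
df_orders.drop(*to_drop).show()   # drops column1 and column2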
Deleting more than one column whose name ends with a particular string in PySpark is done in a roundabout way as well. The list of column names that end with that string is built first with the help of the endswith() method, and this list is then passed to the drop() function.
Syntax: list_name = df_orders.columns
columns_to_be_deleted = [i for i in list_name if i.endswith('id')]
df_orders.drop(*columns_to_be_deleted).show()
When we execute the above syntax, columns whose names end with 'id' will be dropped from the dataframe.
Deleting more than one column whose name contains a particular string in PySpark is likewise done in a roundabout way. The column names that contain the string are collected into a list first, and this list is then passed to the drop() function.
Syntax: list_name = df_orders.columns
columns_to_be_deleted = [i for i in list_name if 'name' in i]
df_orders.drop(*columns_to_be_deleted).show()
When we execute the above syntax, columns that contain 'name' will be dropped from the dataframe.
We have the perfect professional PySpark Tutorial for you. Enroll now!
Deleting more than one column that contains null values in PySpark is done in a roundabout way by creating a user-defined function. The number of null values in each column is counted first using the isNull() function, and the names of the columns whose count is greater than zero are then passed to the drop() function.
Syntax: import pyspark.sql.functions as X

def drop_null_columns(df_ord):
    # Count the null values in every column of the DataFrame.
    null_counts = df_ord.select([X.count(X.when(X.col(a).isNull(), a)).alias(a) for a in df_ord.columns]).collect()[0].asDict()
    # Keep the names of the columns that contain at least one null.
    to_delete = [p for p, u in null_counts.items() if u > 0]
    return df_ord.drop(*to_delete)

drop_null_columns(df_orders).show()
When we execute the above syntax, columns that contain null values will be dropped from the dataframe.
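A quick illustrative sketch (with made-up data): column c below contains a null, so drop_null_columns() removes it and only a and b remain.

# Hypothetical example DataFrame: column "c" holds a null value.
df_with_nulls = spark.createDataFrame([(1, "x", None), (2, "y", "z")], ["a", "b", "c"])
drop_null_columns(df_with_nulls).show()   # only columns a and b are left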
Deleting more than one column that contains NaN values in PySpark is done in a roundabout way by creating a user-defined function. The number of NaN values in each column is counted first using the isnan() function, and the names of the columns whose count is greater than zero are then passed to the drop() function.
Syntax: import pyspark.sql.functions as X

def drop_nan_columns(df_ord):
    # Count the NaN values in every column of the DataFrame.
    nan_counts = df_ord.select([X.count(X.when(X.isnan(X.col(a)), a)).alias(a) for a in df_ord.columns]).collect()[0].asDict()
    # Keep the names of the columns that contain at least one NaN.
    to_delete = [p for p, u in nan_counts.items() if u > 0]
    return df_ord.drop(*to_delete)

drop_nan_columns(df_orders).show()
When we execute the above syntax, columns that contain NaN values will be dropped from the dataframe.
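One caveat: isnan() is only meaningful for floating-point columns, and applying it to other column types can fail. As a defensive sketch (the drop_nan_columns_safe name and the schema filtering are my own additions, not part of the original post), you could restrict the check to float and double columns:

from pyspark.sql import functions as X
from pyspark.sql.types import DoubleType, FloatType

def drop_nan_columns_safe(df_ord):
    # isnan() only applies to float/double columns, so check those only.
    float_cols = [f.name for f in df_ord.schema.fields
                  if isinstance(f.dataType, (FloatType, DoubleType))]
    if not float_cols:
        return df_ord
    nan_counts = df_ord.select([X.count(X.when(X.isnan(X.col(c)), c)).alias(c) for c in float_cols]).collect()[0].asDict()
    to_delete = [c for c, n in nan_counts.items() if n > 0]
    return df_ord.drop(*to_delete)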
Conclusion:
In this blog, we have learned how to drop a single column, multiple columns, columns whose names begin or end with or contain a particular string, columns by position, and columns containing null or NaN/NA values in PySpark. I hope the information provided in this blog is helpful. Feel free to comment if you have any queries.