PySpark Drop Column

PySpark is the Python API for Apache Spark: it was released so that Spark could be used from Python. PySpark lets you work with Resilient Distributed Datasets (RDDs) and DataFrames from Python code, which it achieves through the Py4J library. Py4J is a popular library bundled with PySpark that lets Python interact dynamically with JVM objects. PySpark also ships a set of libraries for writing efficient programs.

PySpark Drop Column - Table of Contents

In PySpark, a new column can be added to a DataFrame with functions such as withColumn() and select(), or through Spark SQL. Dropping or deleting a column is done with the drop() function. In this post, let us go through the following topics:

  • Dropping a Single Column in PySpark
  • Dropping Multiple Columns in PySpark
  • Dropping a column with a name that begins with a particular string in PySpark
  • Dropping a column with a name that terminates with a particular string in PySpark
  • Dropping a column with a name that contains a particular string in PySpark
  • Dropping a Column Using the Position in PySpark
  • Dropping Columns that have Null values in PySpark
  • Dropping Columns with NaN/NA values in PySpark

Dropping a Single Column in PySpark:

In PySpark, we can drop a single column using the drop() function. Passing the column name as an argument to drop() deletes that particular column.

Syntax: df_orders.drop('column1').show()

When we execute the above syntax, column1 column will be dropped from the dataframe.

We can also drop a single column by passing a column reference, df.name_of_the_column, as the argument to the drop() function.

Syntax: df_orders.drop(df_orders.column1).show()

If we execute the above syntax, then column1 column will be dropped from the dataframe.

Dropping Multiple Columns in PySpark:

We can also drop multiple columns in PySpark using the drop() function. To do so, we pass the names of the columns to drop as arguments to the drop() function.

Syntax: df_orders.drop('column1', 'column2').show()

When we execute the above syntax, column1 and column2 columns will be dropped from the dataframe.


We can also drop multiple columns with the drop() function in another way. The column names to delete are collected in a list (here, "columns_to_be_dropped"), which is then unpacked into the drop() function.

Syntax: columns_to_be_dropped = ['column1', 'column2']

df_orders.drop(*columns_to_be_dropped).show()

When we execute the above syntax, column1 and column2 columns will be dropped from the dataframe.

There is yet another way to drop multiple columns: chaining two drop() calls, which removes the columns one after the other.

Syntax: df_orders.drop(df_orders.column1).drop(df_orders.column2).show()

If we execute the above syntax, column1 and column2 will be dropped from the dataframe one after the other.

Dropping Column Using the Position in PySpark:

Deleting columns by position in PySpark is done in a roundabout way. The required columns (and rows) are first extracted with the select() function, and the result is then converted back into a dataframe.

Syntax: spark.createDataFrame(df_orders.select(df_orders.columns[:2]).take(5)).show()

When we execute the above syntax, only the first two columns are retained (and, because of take(5), only the first five rows); every other column is dropped from the dataframe.

Dropping column with column name that begins with a particular string in PySpark:

Deleting columns whose names start with a particular string in PySpark is done in a roundabout way. The list of column names that start with the string is first built with the help of the startswith() function, and this list is then passed to the drop() function.

Syntax: list_name = df_orders.columns
             columns_to_be_deleted = [i for i in list_name if i.startswith('column')]
             df_orders.drop(*columns_to_be_deleted).show()

When we execute the above syntax, columns that start with the name 'column' will be dropped from the dataframe.

Dropping column with column name that terminates with a particular string in PySpark:

Deleting columns whose names terminate with a particular string in PySpark is done in a roundabout way. The list of column names that end with the string is first built with the help of the endswith() function, and this list is then passed to the drop() function.

Syntax: list_name = df_orders.columns
             columns_to_be_deleted = [i for i in list_name if i.endswith('id')]
             df_orders.drop(*columns_to_be_deleted).show()

When we execute the above syntax, columns that terminate with the name 'id' will be dropped from the dataframe.

Dropping column name that contains a particular string in PySpark:

Deleting columns whose names contain a particular string in PySpark is done in a roundabout way. The list of column names that contain the string is built first, and this list is then passed to the drop() function.

Syntax: list_name = df_orders.columns
             columns_to_be_deleted = [i for i in list_name if 'name' in i]
             df_orders.drop(*columns_to_be_deleted).show()

When we execute the above syntax, columns that contain 'name' will be dropped from the dataframe.

Dropping Columns that have Null values in PySpark:

Deleting columns that contain Null values in PySpark is done in a roundabout way by creating a user-defined function. The columns that contain null values are identified first using the isNull() function, and the resulting list of names is then passed to the drop() function.

Syntax: import pyspark.sql.functions as X
             def drop_null_columns(df_ord):
                 null_counts = df_ord.select([X.count(X.when(X.col(a).isNull(), a)).alias(a) for a in df_ord.columns]).collect()[0].asDict()
                 to_delete = [p for p, u in null_counts.items() if u > 0]
                 return df_ord.drop(*to_delete)
             drop_null_columns(df_orders).show()

When we execute the above syntax, columns that contain null values will be dropped from the dataframe.

Dropping Column with the NaN/NA values in PySpark:

Deleting columns that contain NaN/NA values in PySpark is done in a roundabout way by creating a user-defined function. The columns that contain NaN values are identified first using the isnan() function (which applies only to float and double columns), and the resulting list of names is then passed to the drop() function.

Syntax: import pyspark.sql.functions as X
             def drop_nan_columns(df_ord):
                 numeric_cols = [a for a, t in df_ord.dtypes if t in ('float', 'double')]
                 nan_counts = df_ord.select([X.count(X.when(X.isnan(X.col(a)), a)).alias(a) for a in numeric_cols]).collect()[0].asDict()
                 to_delete = [p for p, u in nan_counts.items() if u > 0]
                 return df_ord.drop(*to_delete)
             drop_nan_columns(df_orders).show()

When we execute the above syntax, columns that contain NaN/NA values will be dropped from the dataframe.

Conclusion:

In this blog, we learned how to drop a single column, multiple columns, columns whose names begin with, terminate with, or contain a particular string, columns by position, and columns with null or NaN/NA values in PySpark. I hope the information provided in this blog is helpful. Feel free to comment if you have any queries.

Manikanth
Research Analyst
As a Senior Writer for HKR Trainings, Sai Manikanth has a great understanding of today’s data-driven environment, which includes key aspects such as Business Intelligence and data management. He manages the task of creating great content in the areas of Digital Marketing, Content Management, Project Management & Methodologies, Product Lifecycle Management Tools. Connect with him on LinkedIn and Twitter.