- Dropping a Single Column in PySpark
- Dropping Multiple Columns in PySpark
- Dropping Columns by Position in PySpark
- Dropping Columns Whose Names Begin with a Particular String in PySpark
- Dropping Columns Whose Names End with a Particular String in PySpark
- Dropping Columns Whose Names Contain a Particular String in PySpark
- Dropping Columns that Have Null Values in PySpark
- Dropping Columns with NaN/NA Values in PySpark
- Conclusion
Dropping a Single Column in PySpark:
In PySpark, we can drop a single column using the drop() function. Passing the column name as an argument to drop() deletes that particular column.
Syntax: df_orders.drop('column1').show()
When we execute the above syntax, the column1 column will be dropped from the dataframe.
We can also drop a single column by passing a column reference, df_orders.name_of_the_column, as the argument to drop().
Syntax: df_orders.drop(df_orders.column1).show()
If we execute the above syntax, the column1 column will be dropped from the dataframe.
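For instance, here is a minimal runnable sketch of both variants, assuming a hypothetical df_orders dataframe whose columns (order_id, customer, amount) are made up for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical sample data; the column names are assumptions for illustration.
df_orders = spark.createDataFrame(
    [(1, "Alice", 250.0), (2, "Bob", 125.5)],
    ["order_id", "customer", "amount"],
)
df_orders.drop("customer").show()           # drop by column name
df_orders.drop(df_orders.customer).show()   # drop by column reference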
Dropping Multiple Columns in PySpark:
We can also drop several columns at once in PySpark using the drop() function. We pass the names of the columns to be dropped as arguments to drop().
Syntax: df_orders.drop('column1', 'column2').show()
When we execute the above syntax, the column1 and column2 columns will be dropped from the dataframe.
We can also drop multiple columns with the drop() function in another way. We first collect the names of the columns to be deleted in a list (here, columns_to_be_dropped) and then unpack this list into the drop() function.
Syntax: columns_to_be_dropped = ['column1', 'column2']
df_orders.drop(*columns_to_be_dropped).show()
When we execute the above syntax, the column1 and column2 columns will be dropped from the dataframe.
There is one more way to drop multiple columns: chaining drop() calls. Each drop() call removes one column, so the columns are dropped one after the other in a single statement.
Syntax: df_orders.drop(df_orders.column1).drop(df_orders.column2).show()
If we execute the above syntax, the column1 and column2 columns will be dropped from the dataframe one after the other.
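A short sketch showing all three variants side by side, again using a hypothetical df_orders (recreated here so the snippet runs on its own):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_orders = spark.createDataFrame(
    [(1, "Alice", 250.0), (2, "Bob", 125.5)],
    ["order_id", "customer", "amount"],  # hypothetical columns for illustration
)
df_orders.drop("customer", "amount").show()                        # several names at once
columns_to_be_dropped = ["customer", "amount"]
df_orders.drop(*columns_to_be_dropped).show()                      # unpack a list of names
df_orders.drop(df_orders.customer).drop(df_orders.amount).show()   # chained drop() calls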
Dropping Columns by Position in PySpark:
Dropping columns by their position in PySpark is done in a roundabout way. We first extract the columns (and rows) we want to keep with the select() function, using their position in df_orders.columns, and then convert the result back into a dataframe.
Syntax: spark.createDataFrame(df_orders.select(df_orders.columns[:2]).take(5)).show()
When we execute the above syntax, every column except the first two is dropped, so the new dataframe keeps only the first two columns (and the first five rows).
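As a runnable sketch, with the same hypothetical df_orders as before; the last line shows a more direct variant that drops by position instead of selecting:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_orders = spark.createDataFrame(
    [(1, "Alice", 250.0), (2, "Bob", 125.5)],
    ["order_id", "customer", "amount"],  # hypothetical columns for illustration
)
# Keep only the first two columns (and at most five rows), as in the syntax above.
spark.createDataFrame(df_orders.select(df_orders.columns[:2]).take(5)).show()
# A more direct alternative: drop every column from position 2 onwards.
df_orders.drop(*df_orders.columns[2:]).show()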
Dropping Columns Whose Names Begin with a Particular String in PySpark:
Dropping every column whose name starts with a particular string is done in a roundabout way. We first build the list of column names that start with that string using the startswith() string method, and then pass this list to the drop() function.
Syntax: list_name = df_orders.columns
columns_to_be_deleted = [i for i in list_name if i.startswith('column')]
df_orders.drop(*columns_to_be_deleted).show()
When we execute the above syntax, the columns whose names start with 'column' will be dropped from the dataframe. A combined runnable sketch for this pattern and the next two appears after the "contains" section below.
Dropping Columns Whose Names End with a Particular String in PySpark:
Dropping every column whose name ends with a particular string is done the same way. We first build the list of column names that end with that string using the endswith() string method, and then pass this list to the drop() function.
Syntax: list_name = df_orders.columns
columns_to_be_deleted = [i for i in list_name if i.endswith('id')]
df_orders.drop(*columns_to_be_deleted).show()
When we execute the above syntax, the columns whose names end with 'id' will be dropped from the dataframe.
Dropping Columns Whose Names Contain a Particular String in PySpark:
Dropping every column whose name contains a particular string follows the same pattern. We first build the list of column names that contain that string using Python's in operator, and then pass this list to the drop() function.
Syntax: list_name = df_orders.columns
columns_to_be_deleted = [i for i in list_name if 'name' in i]
df_orders.drop(*columns_to_be_deleted).show()
When we execute the above syntax, the columns whose names contain 'name' will be dropped from the dataframe.
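Here is the combined sketch for the three name-matching patterns, assuming a hypothetical df_orders whose made-up column names match each pattern:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_orders = spark.createDataFrame(
    [(1, 10, "Alice", "NY")],
    ["order_id", "customer_id", "customer_name", "ship_city"],  # hypothetical columns
)
df_orders.drop(*[i for i in df_orders.columns if i.startswith("customer")]).show()  # prefix match
df_orders.drop(*[i for i in df_orders.columns if i.endswith("id")]).show()          # suffix match
df_orders.drop(*[i for i in df_orders.columns if "name" in i]).show()               # substring match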
Dropping Columns that Have Null Values in PySpark:
Dropping every column that contains null values is done in a roundabout way, by writing a small user-defined function. We first count the null values in each column using the isNull() function, collect the names of the columns whose count is greater than zero, and then pass this list to the drop() function.
Syntax: import pyspark.sql.functions as X
def drop_null_columns(df_ord):
    # Count the null values in every column of the dataframe.
    null_counts = df_ord.select([X.count(X.when(X.col(a).isNull(), a)).alias(a) for a in df_ord.columns]).collect()[0].asDict()
    # Keep the names of the columns that contain at least one null value.
    to_delete = [p for p, u in null_counts.items() if u > 0]
    return df_ord.drop(*to_delete)
drop_null_columns(df_orders).show()
When we execute the above syntax, every column that contains at least one null value will be dropped from the dataframe.
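For example, calling the drop_null_columns() function defined above on a hypothetical sample dataframe (the data here is made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical sample: the customer column contains a null value.
df_orders = spark.createDataFrame(
    [(1, None, 250.0), (2, "Bob", 125.5)],
    ["order_id", "customer", "amount"],
)
drop_null_columns(df_orders).show()  # customer is dropped; order_id and amount remain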
Dropping Columns with NaN/NA Values in PySpark:
Dropping every column that contains NaN/NA values works the same way, with a user-defined function. We first count the NaN values in each column using the isnan() function, collect the names of the columns whose count is greater than zero, and then pass this list to the drop() function.
Syntax: import pyspark.sql.functions as X
def drop_nan_columns(df_ord):
    # Count the NaN values in every column; isnan() applies to float/double columns.
    nan_counts = df_ord.select([X.count(X.when(X.isnan(X.col(a)), a)).alias(a) for a in df_ord.columns]).collect()[0].asDict()
    to_delete = [p for p, u in nan_counts.items() if u > 0]
    return df_ord.drop(*to_delete)
drop_nan_columns(df_orders).show()
When we execute the above syntax, every column that contains at least one NaN value will be dropped from the dataframe.
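Note that isNull() and isnan() catch different things: isNull() matches SQL NULL values, while isnan() matches the floating-point NaN value and is only defined for float/double columns. Below is a sketch that drops columns containing either, guarding isnan() by data type; the function name and sample data are assumptions for this example:
import pyspark.sql.functions as X
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical sample: customer holds a null, amount holds a NaN.
df_orders = spark.createDataFrame(
    [(1, None, float("nan")), (2, "Bob", 125.5)],
    ["order_id", "customer", "amount"],
)

def drop_null_or_nan_columns(df_ord):
    # isnan() is only defined for float/double columns, so guard by data type.
    floats = {f.name for f in df_ord.schema.fields if f.dataType.typeName() in ("float", "double")}
    checks = [
        X.count(X.when(X.col(a).isNull() | (X.isnan(X.col(a)) if a in floats else X.lit(False)), a)).alias(a)
        for a in df_ord.columns
    ]
    counts = df_ord.select(checks).collect()[0].asDict()
    return df_ord.drop(*[c for c, n in counts.items() if n > 0])

drop_null_or_nan_columns(df_orders).show()  # only order_id remains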
Conclusion:
In this blog, we have learned how to drop a single column, multiple columns, columns whose names begin with, end with, or contain a particular string, columns by position, and columns with null or NaN/NA values in PySpark. I hope you found this information helpful. Feel free to comment if you have any queries.