Python: Simplify Your Data Cleaning with Pyjanitor
Data cleaning is an essential part of any data science project. Without clean data, any insights derived from the data are likely to be inaccurate.
However, data cleaning can be a time-consuming and tedious process, often involving writing lengthy and complex code.
Fortunately, Pyjanitor is a powerful library that simplifies the process of data cleaning, making it easier and more efficient for data scientists and analysts.
In this article, I will explore Pyjanitor and how it can help streamline your data cleaning process. I will start by discussing what Pyjanitor is and its key features. Then I will dive into some practical examples of how to use Pyjanitor to clean and transform your data.
By the end of this article, you’ll have a solid understanding of how to use Pyjanitor to simplify your data cleaning workflow and spend more time analyzing and interpreting your data.
What is Pyjanitor
Pyjanitor is a Python library that simplifies the process of data cleaning. It is an extension to the popular Pandas library and provides additional functionality for cleaning and preparing data.
Pyjanitor is a popular choice for data scientists and analysts because it is easy to use, efficient, and highly customizable.
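Pyjanitor's functions become DataFrame methods as soon as the janitor module is imported; as far as I know, it relies on the pandas_flavor package to register them. The sketch below illustrates the same general idea using only Pandas' own accessor API. It is not Pyjanitor's actual code, and the tidy accessor and snake_case_columns method are hypothetical names made up for this illustration.

```python
import pandas as pd


# Hypothetical accessor: after registration, every DataFrame exposes `df.tidy`.
@pd.api.extensions.register_dataframe_accessor("tidy")
class TidyAccessor:
    def __init__(self, pandas_obj):
        self._df = pandas_obj

    def snake_case_columns(self):
        """Return a copy with lower-case, underscore-separated column names."""
        renamed = {
            col: str(col).strip().lower().replace(" ", "_")
            for col in self._df.columns
        }
        return self._df.rename(columns=renamed)


df = pd.DataFrame({"First Name": ["Alice"], "Salary": [50000]})
result = df.tidy.snake_case_columns()
print(list(result.columns))
```

Pyjanitor goes one step further and attaches its functions directly to the DataFrame, which is why calls such as df.clean_names() work without an accessor prefix.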
Pyjanitor is a highly versatile library that provides a wide range of functions for data cleaning. Some of the key features of Pyjanitor include:
- Adding and Removing Columns
- Renaming Columns
- Handling Missing Values
- Filtering Data
- Grouping Data
- Reshaping Data
- Handling Strings and Text Data
Pyjanitor Benefits
Some of the key benefits of using Pyjanitor for data cleaning include:
- Simplifies the process of data cleaning
- Saves time and effort
- Provides a wide range of functions for cleaning and preparing data
- Highly customizable and flexible
- Compatible with Pandas and other popular Python libraries
How to use Pyjanitor in Python
Suppose we have a dataset of employees and their salaries. The dataset has a few missing values and some columns have inconsistent names.
Installation:
pip install pyjanitor
Here’s how we can use Pyjanitor to clean the dataset:
import pandas as pd
import janitor
# Read the dataset
df = pd.read_csv('employees.csv')
# Clean the column names
df = df.clean_names()
# Fill missing values with the median salary
df = df.impute('salary', statistic_column_name='median')
# Drop the unnecessary columns
df = df.remove_columns(['ssn', 'dob'])
# Convert the salary to a float
df['salary'] = df['salary'].astype(float)
# Sort the dataframe by the salary column in descending order
df = df.sort_values(by='salary', ascending=False)
# Save the cleaned dataframe to a new CSV file
df.to_csv('cleaned_employees.csv', index=False)
In this example, I first imported the necessary libraries, including Pyjanitor, and read in the dataset using the read_csv function from Pandas. I then used Pyjanitor's clean_names function to standardize the column names. Next, I used the impute function with statistic_column_name='median' to fill in any missing values in the salary column with the median salary, and the remove_columns function to remove the unnecessary columns.
After that, I used the astype method to convert the salary column to a float and the sort_values method to sort the dataframe by salary in descending order. Finally, I saved the cleaned dataframe to a new CSV file using the to_csv method.
Here’s another simple example demonstrating the use of Pyjanitor:
import pandas as pd
import janitor
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, None, 30],
'Salary': [50000, 60000, 75000]
})
# cleaning operations with Pyjanitor
cleaned_df = (
df.clean_names() # Cleaning column names
.remove_empty() # Removing rows and columns that are entirely empty
.set_index('name') # Setting column-'name' as the index
.rename_column('age', 'years_old') # Renaming column-'age'
)
print(cleaned_df)
# output -
# years_old salary
# name
# Alice 25.0 50000
# Bob NaN 60000
# Charlie 30.0 75000
Let's explore some more features of Pyjanitor
Reshaping Data:
In addition to cleaning data, Pyjanitor can also be used to reshape and transform data. Pyjanitor provides a variety of functions that allow you to reshape data in various ways, such as pivoting, melting, and splitting columns.
Here’s an example of how to use Pyjanitor to reshape data:
import pandas as pd
import janitor
# Reading the dataset
df = pd.read_csv('my_dataset.csv')
Suppose the dataset is in long format with the columns id, date, type, and value, where each row holds one value for one type. We want one row per id and date, with a separate column for each type. You can achieve this by using Pyjanitor's pivot_wider() function:
df = df.pivot_wider(index=['id', 'date'], names_from='type', values_from='value')
print(df.head())
The data has been reshaped: there is now one row per id and date combination, and each distinct entry in the type column has become its own column holding the matching value.
Handling Strings and Text Data:
Let’s use a different dataset to demonstrate how to handle string and text data. Suppose we have a dataset that contains information about movies, including the title, year, and genre.
However, the genre column contains a mix of uppercase and lowercase letters, as well as whitespace. We want to standardize the genre column such that all genres are in title case, with no leading or trailing whitespace. You can achieve this by using Pyjanitor’s transform_column() function together with the Pandas string methods:
# Read the dataset
df = pd.read_csv('movies.csv')
# Clean the column names, then strip whitespace and title-case the genre column
df = df.clean_names().transform_column('genre', lambda s: s.str.strip().str.title(), elementwise=False)
print(df.head())
OUTPUT -
title year genre
0 The Godfather 1972 Crime
1 The Shawshank Redemption 1994 Drama
2 The Dark Knight 2008 Action
3 The Godfather: Part II 1974 Crime
4 12 Angry Men 1957 Drama|Crime
You can see that the genre column has been standardized, with all genres in title case and no leading or trailing whitespace.
Data cleaning does not have to be slow or tedious. Pyjanitor provides a suite of functions that simplify and automate much of the process, which makes cleaning faster and more efficient.
By integrating Pyjanitor into your workflow, you can spend less time on data cleaning and more time on data analysis and interpretation.
So if you’re looking to streamline your data cleaning process, give Pyjanitor a try and see how it can enhance your data analysis capabilities.