Python: Simplify Your Data Cleaning with Pyjanitor

An API for cleaning data, built on top of Pandas

5 min read · Jan 31, 2024

Data cleaning is an essential part of any data science project. Without clean data, any insights derived from the data are likely to be inaccurate.

However, data cleaning can be a time-consuming and tedious process, often involving writing lengthy and complex code.

Fortunately, Pyjanitor is a powerful library that simplifies the process of data cleaning, making it easier and more efficient for data scientists and analysts.

In this article, I will explore Pyjanitor and how it can help streamline your data cleaning process. I will start by discussing what Pyjanitor is and its key features. Then I will dive into some practical examples of how to use Pyjanitor to clean and transform your data.

By the end of this article, you’ll have a solid understanding of how to use Pyjanitor to simplify your data cleaning workflow and spend more time analyzing and interpreting your data.

What is Pyjanitor?

Pyjanitor is a Python library that simplifies the process of data cleaning. It extends the popular Pandas library: importing janitor registers additional methods on Pandas DataFrames, so cleaning steps can be written as a single, readable method chain.

This makes it easy for data scientists and analysts to pick up, and its methods can be mixed freely with regular Pandas methods in the same chain.

Pyjanitor provides a wide range of functions for data cleaning. Some of its key features include:

Adding and Removing Columns
Renaming Columns
Handling Missing Values
Filtering Data
Grouping Data
Reshaping Data
Handling Strings and Text Data

Pyjanitor Benefits

Some of the key benefits of using Pyjanitor for data cleaning include:

Simplifies the process of data cleaning
Saves time and effort
Provides a wide range of functions for cleaning and preparing data
Highly customizable and flexible
Compatible with Pandas and other popular Python libraries

How to use Pyjanitor in Python

Suppose we have a dataset of employees and their salaries. The dataset has a few missing values and some columns have inconsistent names.

Installation:

pip install pyjanitor

Here’s how we can use Pyjanitor to clean the dataset:

import pandas as pd
import janitor

# Read the dataset
df = pd.read_csv('employees.csv')

# Clean the column names
df = df.clean_names()

# Fill missing values with the median salary
df = df.impute('salary', statistic_column_name='median')

# Drop the unnecessary columns
df = df.remove_columns(['ssn', 'dob'])

# Convert the salary to a float
df['salary'] = df['salary'].astype(float)

# Sort the dataframe by the salary column in descending order
df = df.sort_values(by='salary', ascending=False)

# Save the cleaned dataframe to a new CSV file
df.to_csv('cleaned_employees.csv', index=False)

In this example, I first imported the necessary libraries, including Pyjanitor, and read in the dataset using the read_csv function from Pandas. I then used Pyjanitor's clean_names function to standardize the column names. Next, I used the impute function with statistic_column_name='median' to fill in any missing values in the salary column with the median salary, and the remove_columns function to remove the unnecessary columns.

After that, I used the astype method to convert the salary column to a float. Finally, I used the sort_values method to sort the dataframe by salary in descending order, and saved the cleaned dataframe to a new CSV file using the to_csv method.

Here’s another simple example demonstrating the use of Pyjanitor:

import pandas as pd
import janitor

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, None, 30],
    'Salary': [50000, 60000, 75000]
})

# cleaning operations with Pyjanitor
cleaned_df = (
    df.clean_names()  # Cleaning column names
      .remove_empty()  # Removing rows and columns that are entirely empty
      .set_index('name')  # Setting column-'name' as the index
      .rename_column('age', 'years_old')  # Renaming column-'age'
)

print(cleaned_df)

# output -
#         years_old  salary
# name                      
# Alice         25.0   50000
# Bob            NaN   60000
# Charlie       30.0   75000

Let's explore some more features of Pyjanitor.

Reshaping Data:

In addition to cleaning data, Pyjanitor can also be used to reshape and transform data. Pyjanitor provides a variety of functions that allow you to reshape data in various ways, such as pivoting, melting, and splitting columns.

Here’s an example of how to reshape data:

import pandas as pd
import janitor

# Reading the dataset
df = pd.read_csv('my_dataset.csv')

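Suppose the dataset has the following columns: id, date, type_1, type_2, value_1, and value_2. We want to reshape the data so that there is a separate column for each type and value combination. Pyjanitor offers pivot_wider() and pivot_longer() for chainable reshaping; the version below uses Pandas' own pivot() method, once per type/value pair, and joins the results:

# Pivot each type/value pair into prefixed columns, then join on (id, date)
wide_1 = df.pivot(index=['id', 'date'], columns='type_1', values='value_1').add_prefix('type_1_')
wide_2 = df.pivot(index=['id', 'date'], columns='type_2', values='value_2').add_prefix('type_2_')
df = wide_1.join(wide_2).rename_axis(columns=None).reset_index()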

print(df.head())

OUTPUT -

   id        date  type_1_a  type_1_b  type_2_a  type_2_b
0   1  2021-01-01         3         4         5         6
1   2  2021-01-02         7         8         9        10
2   3  2021-01-03        11        12        13        14

You can see that the data has been reshaped, with separate columns for each type and value combination.
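
Because my_dataset.csv is not shown here, the following sketch rebuilds a small input with the same columns and reshapes it with Pandas' pivot, which is one way to arrive at the wide layout above. The sample values are assumptions chosen to match the first two output rows.

```python
import pandas as pd

# Hypothetical long-format input: two rows per (id, date), one per type
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'type_1': ['a', 'b', 'a', 'b'],
    'type_2': ['a', 'b', 'a', 'b'],
    'value_1': [3, 4, 7, 8],
    'value_2': [5, 6, 9, 10],
})

# Pivot each type/value pair separately and prefix the new columns
wide_1 = df.pivot(index=['id', 'date'], columns='type_1', values='value_1').add_prefix('type_1_')
wide_2 = df.pivot(index=['id', 'date'], columns='type_2', values='value_2').add_prefix('type_2_')

# Join on the shared (id, date) index and turn the index back into columns
wide = wide_1.join(wide_2).rename_axis(columns=None).reset_index()

print(wide)
```

Passing a list to index requires a reasonably recent Pandas release (1.1 or later, as far as I know).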

Handling Strings and Text Data:

Let’s use a different dataset to demonstrate how to handle string and text data. Suppose we have a dataset that contains information about movies, including the title, year, and genre. However, the genre column contains a mix of uppercase and lowercase letters, as well as stray whitespace. We want to standardize the genre column so that all genres are in title case, with no leading or trailing whitespace. You can achieve this by combining Pyjanitor’s transform_column() function with Pandas string methods:

# Read the dataset
df = pd.read_csv('movies.csv')

# Strip whitespace from the genre column and convert it to title case
df = df.clean_names().transform_column(
    'genre', lambda s: s.str.strip().str.title(), elementwise=False
)

print(df.head())

OUTPUT -

                      title  year        genre
0             The Godfather  1972        Crime
1  The Shawshank Redemption  1994        Drama
2           The Dark Knight  2008       Action
3    The Godfather: Part II  1974        Crime
4              12 Angry Men  1957  Drama|Crime

You can see that the genre column has been standardized, with all genres in title case and no leading or trailing whitespace.

Data cleaning is often the slowest and most tedious part of a project. Pyjanitor addresses this with a suite of chainable functions that simplify and automate common cleaning steps, making the process faster and more efficient.

By integrating Pyjanitor into your workflow, you can spend less time on data cleaning and more time on data analysis and interpretation.

So if you’re looking to streamline your data cleaning process, give Pyjanitor a try and see how it can enhance your data analysis capabilities.