Beginner’s Guide to Data Cleaning with Pyjanitor

Is data cleaning too time-consuming and frustrating for you? Try Pyjanitor to enhance your data cleaning skills.




 

Have you ever dealt with messy datasets? They are one of the biggest hurdles in any data science project. These datasets can contain inconsistencies, missing values, or irregularities that hinder analysis. Data cleaning is the essential first step that lays the foundation for accurate and reliable insights, but it is often tedious and time-consuming.

Fear not! Let me introduce you to Pyjanitor, a fantastic Python library that can save the day. It is a convenient Python package, providing a simple remedy to these data-cleaning challenges. In this article, I am going to discuss the importance of Pyjanitor along with its features and practical usage.

By the end of this article, you will have a clear understanding of how Pyjanitor simplifies data cleaning and its application in everyday data-related tasks.

 

What is Pyjanitor?

 

Pyjanitor is a Python port of the R package janitor, built on top of pandas, that simplifies data cleaning and preprocessing tasks. It extends pandas' functionality by offering a variety of useful functions that streamline the process of cleaning, transforming, and preparing datasets. Think of it as an upgrade to your data-cleaning toolkit. Are you eager to learn about Pyjanitor? Me too. Let’s start.

 

Getting Started

 

First things first, you need to install Pyjanitor. Open your terminal or command prompt and run the following command:

pip install pyjanitor

 

The next step is to import Pyjanitor and pandas into your Python script. Note that the package is installed as pyjanitor but imported as janitor:

import janitor
import pandas as pd

 

Now, you are ready to use Pyjanitor for your data cleaning tasks. Next, I will cover some of Pyjanitor's most useful features:

 

1. Cleaning Column Names

Raise your hand if you have ever been frustrated by inconsistent column names. Yup, me too. With Pyjanitor's clean_names() method, you can quickly standardize your column names, making them uniform and consistent with a single call. It replaces spaces with underscores, converts all characters to lowercase, strips leading and trailing whitespace, and even replaces dots with underscores. Let’s understand it with a basic example.

# Create a data frame with inconsistent column names
student_df = pd.DataFrame({
    'Student.ID': [1, 2, 3],
    'Student Name': ['Sara', 'Hanna', 'Mathew'],
    'Student Gender': ['Female', 'Female', 'Male'],
    'Course': ['Algebra', 'Data Science', 'Geometry'],
    'Grade': ['A', 'B', 'C']
})

# Clean the column names
clean_df = student_df.clean_names()
print(clean_df)

 

Output:

   student_id    student_name    student_gender        course    grade
0           1            Sara            Female       Algebra        A
1           2           Hanna            Female  Data Science        B
2           3          Mathew              Male      Geometry        C
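To get a feel for what happens under the hood, here is a rough pandas-only approximation of the steps described above. It is a sketch under my own assumptions, not Pyjanitor's actual implementation, and approx_clean_names is a hypothetical helper name:

```python
import pandas as pd


def approx_clean_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: a rough, pandas-only approximation of clean_names()."""
    cleaned = (
        df.columns.str.strip()                   # drop leading/trailing whitespace
        .str.lower()                             # convert to lowercase
        .str.replace(r"[ .]+", "_", regex=True)  # spaces and dots -> underscores
    )
    # Map each old name to its cleaned version and return a new data frame
    return df.rename(columns=dict(zip(df.columns, cleaned)))


demo_df = pd.DataFrame({"Student.ID": [1], " Student Name ": ["Sara"]})
cleaned_columns = list(approx_clean_names(demo_df).columns)
print(cleaned_columns)  # ['student_id', 'student_name']
```

The real clean_names() handles more cases than this sketch, so prefer it in practice.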

 

2. Renaming Columns

At times, renaming columns not only enhances our understanding of the data but also improves its readability and consistency. Thanks to the rename_column() function, this task becomes effortless. A simple example showcasing the usability of this function is as follows:

student_df = pd.DataFrame({
    'stu_id': [1, 2],
    'stu_name': ['Ryan', 'James'],
})
# Renaming the columns
student_df = student_df.rename_column('stu_id', 'Student_ID')
student_df = student_df.rename_column('stu_name', 'Student_Name')
print(student_df.columns)

 

Output:

Index(['Student_ID', 'Student_Name'], dtype='object')
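Since a Pyjanitor data frame is still an ordinary pandas data frame, you can also fall back on pandas' own rename() when several columns need new names at once. A small sketch of that alternative, using the same column names as above:

```python
import pandas as pd

student_df = pd.DataFrame({
    "stu_id": [1, 2],
    "stu_name": ["Ryan", "James"],
})

# pandas' rename() accepts a mapping of old name -> new name
renamed_df = student_df.rename(
    columns={"stu_id": "Student_ID", "stu_name": "Student_Name"}
)
print(list(renamed_df.columns))  # ['Student_ID', 'Student_Name']
```

Which one you pick is mostly a matter of taste: rename_column() reads nicely for a single column, while a mapping keeps many renames in one place.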

 

3. Handling Missing Values

Missing values are a real headache when dealing with datasets. Fortunately, the fill_empty() function comes in handy for addressing these issues. Let's explore how to handle missing values using Pyjanitor with a practical example. First, we will create a dummy data frame and populate it with some missing values.

# Create a data frame with missing values
employee_df = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': [None, 'James', 'Alicia'],
    'department': ['HR', None, 'Engineering'],
    'salary': [60000, 55000, None]
})

 

Now, let's see how Pyjanitor can assist in filling up these missing values:

# Fill missing values in 'department' and 'name' with 'Unknown' and 'salary' with the mean salary
employee_df = employee_df.fill_empty(column_names=['name', 'department'], value='Unknown')
employee_df = employee_df.fill_empty(column_names='salary', value=employee_df['salary'].mean())

print(employee_df)

 

Output:

   employee_id     name   department   salary
0            1  Unknown           HR  60000.0
1            2    James      Unknown  55000.0
2            3   Alicia  Engineering  57500.0

 

In this example, the missing name in the first row and the missing department of employee ‘James’ are replaced with ‘Unknown’, and the missing salary of ‘Alicia’ is replaced with the mean of the two known salaries (60000 and 55000, which gives 57500). You can use various strategies for handling missing values, such as forward fill, backward fill, or filling with a specific value.
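The forward and backward strategies mentioned above are available in plain pandas as ffill() and bfill(). Here is a short sketch on a made-up series of readings (the data is purely illustrative):

```python
import pandas as pd

# A small series with two consecutive gaps in the middle
readings = pd.Series([10.0, None, None, 40.0])

# Forward fill: carry the last known value forward
forward_filled = readings.ffill()

# Backward fill: pull the next known value backward
backward_filled = readings.bfill()

print(forward_filled.tolist())   # [10.0, 10.0, 10.0, 40.0]
print(backward_filled.tolist())  # [10.0, 40.0, 40.0, 40.0]
```

Forward and backward filling make the most sense for ordered data such as time series, where neighboring rows are related.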

 

4. Filtering Rows & Selecting Columns

Filtering rows and selecting columns are crucial tasks in data analysis. Because Pyjanitor is built on pandas, familiar pandas tools such as query() and .loc work alongside it, and Pyjanitor adds chainable helpers such as filter_on() and select_columns(). Suppose you have a data frame containing student records, and you want to filter out the students (rows) whose marks are less than 60. The example below does this with pandas' query() method.

# Create a data frame with student data
students_df = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'name': ['John', 'Julia', 'Ali', 'Sara', 'Sam'],
    'subject': ['Maths', 'General Science', 'English', 'History','Biology'],
    'marks': [85, 58, 92, 45, 75],
    'grade': ['A', 'C', 'A+', 'D', 'B']
})

# Keep only rows where marks are at least 60 (drops marks below 60)
filtered_students_df = students_df.query('marks >= 60')
print(filtered_students_df)

 

Output:

   student_id  name  subject  marks grade
0           1  John    Maths     85     A
2           3   Ali  English     92    A+
4           5   Sam  Biology     75     B

 

Now suppose you also want to output only specific columns, such as the name and ID, rather than the entire record. With pandas' .loc indexer, this can be done as follows:

# Select specific columns
selected_columns_df = filtered_students_df.loc[:, ['student_id', 'name']]
print(selected_columns_df)

 

Output:

   student_id  name
0           1  John
2           3   Ali
4           5   Sam

 

5. Chaining Methods

With Pyjanitor's method chaining feature, you can perform multiple operations in a single expression. This capability stands out as one of its best features. To illustrate, let's consider a data frame containing data about cars:

# Create a data frame with sample car data
cars_df = pd.DataFrame({
    'Car ID': [101, None, 103, 104, 105],
    'Car Model': ['Toyota', 'Honda', 'BMW', 'Mercedes', 'Tesla'],
    'Price': [25000, 30000, None, 40000, 45000],
    'Year': [2018, 2019, 2017, 2020, None]
})
print("Cars Data Before Applying Method Chaining:")
print(cars_df)

 

Output:

Cars Data Before Applying Method Chaining:
   Car ID Car Model    Price    Year
0   101.0    Toyota  25000.0  2018.0
1     NaN     Honda  30000.0  2019.0
2   103.0       BMW      NaN  2017.0
3   104.0  Mercedes  40000.0  2020.0
4   105.0     Tesla  45000.0     NaN

 

As we can see, the data frame contains missing values and inconsistent column names. We could fix this by applying operations such as clean_names(), rename_column(), and dropna() one at a time, each on its own line. Alternatively, we can chain these methods together into a single expression for a fluent workflow and cleaner code.

# Chain methods to clean column names, drop rows with missing values, select specific columns, and rename columns
cleaned_cars_df = (
    cars_df
    .clean_names()  # Clean column names
    .dropna()  # Drop rows with missing values
    .select_columns(['car_id', 'car_model', 'price'])  # Select columns
    .rename_column('price', 'price_usd')  # Rename column
)

print("Cars Data After Applying Method Chaining:")
print(cleaned_cars_df)

 

Output:

Cars Data After Applying Method Chaining:
   car_id car_model  price_usd
0   101.0    Toyota    25000.0
3   104.0  Mercedes    40000.0

 

In this pipeline, the following operations have been performed:

  • clean_names() function standardizes the column names.
  • dropna() function drops the rows with missing values.
  • select_columns() function selects specific columns which are ‘car_id’, ‘car_model’ and ‘price’.
  • rename_column() function renames the column ‘price’ to ‘price_usd’.
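A chain like this is not limited to built-in methods. If you need a custom step, pandas' pipe() lets you slot your own function into the chain. Here is a pandas-only sketch; add_price_in_thousands is a hypothetical helper I made up for illustration:

```python
import pandas as pd


def add_price_in_thousands(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical custom step: add the price expressed in thousands of USD."""
    return df.assign(price_k_usd=df["price_usd"] / 1000)


cars_df = pd.DataFrame({
    "car_id": [101.0, 104.0],
    "car_model": ["Toyota", "Mercedes"],
    "price_usd": [25000.0, 40000.0],
})

result_df = (
    cars_df
    .pipe(add_price_in_thousands)               # custom step inside the chain
    .sort_values("price_usd", ascending=False)  # most expensive car first
    .reset_index(drop=True)                     # tidy up the row labels
)
print(result_df)
```

Each step takes a data frame and returns a new one, so the original cars_df is left untouched.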

 

Wrapping Up

 
Pyjanitor proves to be a handy library for anyone working with data. It offers many more features than discussed in this article, such as encoding categorical variables, obtaining features and labels, identifying duplicate rows, and much more. All of these advanced features and methods can be explored in its documentation. The deeper you delve into its features, the more you will be surprised by its powerful functionality. Lastly, enjoy manipulating your data with Pyjanitor.
 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.