Data Preprocessing with pandas

Lerekoqholosha
4 min read · Aug 2, 2021

Data cleaning with pandas

Pandas is one of the most widely used data analysis and manipulation libraries, and it offers several functions for preprocessing raw data.

In this article, we will focus on one particular function that organizes multiple preprocessing operations into a single one: the pipe function.

Let’s start with creating a data frame with toy data.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [100, 100, 101, 102, 103, 104, 105, 106, np.nan, 107],
    "A": [1, 2, 3, 4, 5, 2, np.nan, 5, np.nan, 8],
    "B": [45, 56, 48, 47, 62, 112, 54, 49, 12, 50],
    "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5, np.nan, 2.1]
})

df

Our data frame contains some missing values indicated by a standard missing value representation (i.e. NaN). The id column includes duplicate values. Last but not least, 112 in column B seems like an outlier.

We will be creating a pipe that handles the issues we have just described.

For each task, we need a function. Thus, the first step is to create the functions that will be placed in the pipe.

It is important to note that each function used in the pipe needs to take a data frame as its first argument and return a data frame.
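To make that requirement concrete, here is a minimal sketch. The function `add_total` and the toy columns `x` and `y` are made up for illustration and are not part of the article's data. It shows that calling `pipe` with a function gives the same result as calling the function on the data frame directly.

```python
import pandas as pd

def add_total(df):
    # Hypothetical step: takes a data frame and returns a data frame
    out = df.copy()  # work on a copy so the input is left untouched
    out["total"] = out["x"] + out["y"]
    return out

toy = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

via_pipe = toy.pipe(add_total)  # pipe passes toy as the first argument
via_call = add_total(toy)       # the same call written directly

print(via_pipe.equals(via_call))
```

Any function with this shape can be placed in a pipe, whatever it does internally.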

The first function handles the missing values.

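def fill_missing_values(df):
    # Fill missing values in each numeric column with that column's mean
    for col in df.select_dtypes(include="number").columns:
        val = df[col].mean()
        # Assign the result back to the column; calling fillna with
        # inplace=True on df[col] is not reliable in recent pandas versions
        df[col] = df[col].fillna(val)
    return df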

I prefer to replace the missing values in the numerical columns with the mean value of the column. Feel free to customize this function. It will work in the pipe as long as it takes a data frame as an argument and returns a data frame.
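As one possible customization, the sketch below fills with the median instead of the mean. The name `fill_missing_with_median` and the toy data are assumptions made for this example. Unlike the function above, this variant returns a new data frame rather than changing the one it receives.

```python
import numpy as np
import pandas as pd

def fill_missing_with_median(df):
    # Hypothetical alternative: use each numeric column's median
    medians = df.median(numeric_only=True)
    # fillna with a Series fills each column named in the Series index
    # and returns a new data frame
    return df.fillna(medians)

toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 100.0],
    "name": ["a", "b", None, "d"],
})

filled = toy.pipe(fill_missing_with_median)
print(filled)
```

The median is less sensitive to extreme values such as the 100.0 here, and the non-numeric column is left as it was.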

The second function will help us remove the duplicate values.

def drop_duplicates(df, column_name):
    # Keep only the first row for each value in the given column(s)
    df = df.drop_duplicates(subset=column_name)
    return df

This function relies on the built-in drop_duplicates method of Pandas, which removes rows that have duplicate values in the given column or columns. In addition to the data frame, our function takes a column name as an argument. We can pass such additional arguments through the pipe as well.
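Which of the duplicated rows survives depends on the `keep` parameter of `DataFrame.drop_duplicates`, whose default is `"first"`. A small sketch with toy data (not the article's full data frame):

```python
import pandas as pd

toy = pd.DataFrame({"id": [100, 100, 101], "B": [45, 56, 48]})

# Default behaviour: keep the first row seen for each id
first = toy.drop_duplicates(subset="id")

# Alternative: keep the last row seen for each id
last = toy.drop_duplicates(subset="id", keep="last")

print(first)
print(last)
```

The wrapper function above uses the default, so for id 100 the earlier row is the one that remains.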

The last function in the pipe will be used for eliminating the outliers.

def remove_outliers(df, column_list):
    for col in column_list:
        avg = df[col].mean()
        std = df[col].std()
        low = avg - 2 * std
        high = avg + 2 * std
        # between() includes both bounds by default
        df = df[df[col].between(low, high)]
    return df

What this function does is as follows:

  1. It takes a data frame and a list of columns
  2. For each column in the list, it calculates the mean and standard deviation
  3. It calculates a lower and upper bound using the mean and standard deviation
  4. It removes the rows whose values fall outside the range defined by the lower and upper bounds
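The steps above can be traced on a single column. This sketch applies the same calculation to the raw values of column B from the toy data, held in a standalone Series. Inside the pipe the numbers differ slightly because duplicates are dropped first.

```python
import pandas as pd

b = pd.Series([45, 56, 48, 47, 62, 112, 54, 49, 12, 50], name="B")

avg = b.mean()
std = b.std()  # sample standard deviation (ddof=1), the pandas default
low = avg - 2 * std
high = avg + 2 * std

# Keep only the values inside [low, high]
kept = b[b.between(low, high)]

print(f"bounds: [{low:.2f}, {high:.2f}]")
print(kept.tolist())
```

For these values the mean is 53.5 and the bounds come out at roughly 4.5 and 102.5, so 112 is dropped while 12 stays. Note that a two-standard-deviation rule needs a reasonable number of rows to be meaningful, because an extreme value also inflates the standard deviation it is compared against.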

We now have three functions, each of which handles one data preprocessing task. The next step is to create a pipe with these functions.

df_processed = (
    df
    .pipe(fill_missing_values)
    .pipe(drop_duplicates, "id")
    .pipe(remove_outliers, ["A", "B"])
)

This pipe executes the functions in the given order. We can pass the arguments to the pipe along with the function names.
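Extra arguments given to `pipe` are forwarded to the function after the data frame, and they can be positional or keyword arguments. A small sketch, using a made-up function `keep_rows_above` and toy data:

```python
import pandas as pd

def keep_rows_above(df, column, threshold):
    # Hypothetical step: keep the rows where column exceeds threshold
    return df[df[column] > threshold]

toy = pd.DataFrame({"B": [45, 56, 48, 62]})

# Arguments after the function are passed along in order...
positional = toy.pipe(keep_rows_above, "B", 50)

# ...or by name, which tends to be easier to read in longer pipes
keyword = toy.pipe(keep_rows_above, column="B", threshold=50)

print(positional.equals(keyword))
```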

One thing to mention here is that some functions in the pipe modify the original data frame. Thus, using the pipe as indicated above will update df as well.

One option to overcome this issue is to use a copy of the original data frame in the pipe. If you do not care about keeping the original data frame as is, you can just use it in the pipe.

I will update the pipe as below:

my_df = df.copy()

df_processed = (
    my_df
    .pipe(fill_missing_values)
    .pipe(drop_duplicates, "id")
    .pipe(remove_outliers, ["A", "B"])
)
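To see why the copy matters, here is a minimal sketch with a made-up step, `fill_in_place`, that assigns into the data frame it receives. The toy data is also an assumption for this example.

```python
import numpy as np
import pandas as pd

def fill_in_place(df):
    # Hypothetical step that writes into the frame it is given
    df["A"] = df["A"].fillna(df["A"].mean())
    return df

original = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

# Piping a copy leaves the original untouched
safe = original.copy().pipe(fill_in_place)
n_missing_before = original["A"].isna().sum()

# Piping the original itself changes it
changed = original.pipe(fill_in_place)
n_missing_after = original["A"].isna().sum()

print(n_missing_before, n_missing_after)
```

After the second call, `changed` and `original` are the same object, so the missing value is gone from both.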

Let’s take a look at the original and processed data frames:

Conclusion

You can, of course, accomplish the same tasks by applying these functions separately. However, the pipe function offers a structured and organized way for combining several functions into a single operation.

Depending on the raw data and the tasks, the preprocessing may include more steps. You can add as many steps as you need to the pipe. As the number of steps increases, the syntax with the pipe function stays cleaner than executing the functions separately.

Thank you for reading.

Please let me know if you have any feedback.

Lerekoqholosha

I am a data scientist with 1 year of experience working with Python.