Exploratory Data Analysis Using Python Functions

Lerekoqholosha
ILLUMINATION’S MIRROR
4 min readSep 23, 2021


In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets in order to summarize their main characteristics, often using statistical graphics and other data-visualization methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond the formal modeling or hypothesis-testing task.


In this article, we will look at some basic Python functions for doing exploratory data analysis (EDA).

These functions require pandas, NumPy, and seaborn.

Import Packages

# Import libraries
import pandas as pd
import numpy as np
import seaborn as sns

Let's read the data

# read data
df = pd.read_csv('/content/housingprice.csv')

Show the Data

# show the first 5 rows of the dataframe
df.head()

# info about the data
df.info()

For any data frame, the info() function will tell you how many entries you have, the names of the columns, the data type of each column, and how many non-null values each column has. You can compare the non-null count to the total number of entries to find which columns contain null values.
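As a sketch of that comparison, here is a small hypothetical frame (standing in for the housing data) where one column is missing a value; info() flags it immediately in the Non-Null Count column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the actual housingprice.csv
demo = pd.DataFrame({'LotArea': [8450, 9600, np.nan],
                     'YrSold': [2008, 2007, 2006]})

demo.info()
# LotArea shows 2 non-null out of 3 entries, so it has one null;
# YrSold shows 3 non-null, so it is complete.
```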

Find Duplicates

# duplicates
df.duplicated().sum()
0

The function above is the easiest check, as it finds all the duplicate rows and prints how many there are. If it prints 0, there are no duplicates and you are good to go!
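A quick sketch on a toy frame (not the housing data) shows what the count means, and how drop_duplicates() removes the extra rows:

```python
import pandas as pd

# Toy frame where the second row repeats the first
toy = pd.DataFrame({'YrSold': [2008, 2008, 2007],
                    'LotArea': [8450, 8450, 9600]})

n_dupes = toy.duplicated().sum()  # one duplicate row -> 1
toy = toy.drop_duplicates()       # keep the first occurrence of each row
```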

Find Unique Values in a Column

# unique values in a column
df['YrSold'].unique()
array([2008, 2007, 2006, 2009, 2010])

Let's visualize this to understand it better.

# visualize how many houses were sold in each year
sns.countplot(x='YrSold', data=df)

Years Sold

This function quickly prints all the unique values of the column, so you can understand the breadth and range of the values, as the graph above shows.

Find the Counts of Unique Values in a Column

# unique value counts
df['YearBuilt'].value_counts()

# visualize the counts per year built
sns.countplot(x='YearBuilt', data=df)

This function builds on the previous one by showing how often each unique value occurs, sorted from the most frequent to the least. This is a great way to look for outliers.
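Because value_counts() sorts from most to least frequent, the ends of the result are where the rare values (potential outliers) turn up. A sketch on a toy series:

```python
import pandas as pd

s = pd.Series([2006, 2006, 2006, 2007, 2007, 2010])
counts = s.value_counts()  # sorted descending by frequency

most_common = counts.index[0]   # the value that appears most often
rarest = counts.index[-1]       # the value that appears least often
```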

Find Null Values in the Data Frame

# null values
df.isnull().sum()

Let's visualize this to see which columns have missing values.

This line combines isnull() and sum() and returns, for each column in the data frame, the number of null values it contains. Finding null values is an important part of EDA and data cleaning.
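One way to draw this, sketched on a hypothetical toy frame rather than the housing data, is to pass the boolean isnull() matrix to a seaborn heatmap, where the contrasting cells mark the missing entries:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Toy frame with one null per column
demo = pd.DataFrame({'LotArea': [8450, np.nan, 11250],
                     'YrSold': [2008, 2007, np.nan]})

null_counts = demo.isnull().sum()     # nulls per column
sns.heatmap(demo.isnull(), cbar=False)  # contrasting cells mark missing values
```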

Fill Null Values With Zeros

# fill missing values with zeros
df.fillna(0, inplace=True)

This function takes the entire data frame and fills the null values with zeros, or whatever value you pass to fillna(). It is certainly the fastest way to get rid of your null values, putting your dataset in a state that avoids more errors and dead ends in your analysis.
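Zero is not always a sensible stand-in, though; for a numeric column like LotArea, filling with the column median is a common alternative. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing lot size
toy = pd.DataFrame({'LotArea': [8450.0, np.nan, 11250.0]})

# fill the gap with the column median instead of zero
toy['LotArea'] = toy['LotArea'].fillna(toy['LotArea'].median())
```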

Filter Rows in Your Data Frame

# filter rows
df[df['YrSold'] > 2008]

The line of code above creates a new data frame that holds all the rows where the YrSold column is greater than 2008. You can, of course, filter on other conditionals, such as less than or equal to, and build more complex filters with multiple conditions.
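In pandas, multiple conditions are combined with & (and) or | (or), with each condition wrapped in parentheses. A sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'YrSold': [2006, 2009, 2010],
                    'LotArea': [8450, 9600, 11250]})

# rows sold after 2008 AND with a lot area of at least 10000
subset = toy[(toy['YrSold'] > 2008) & (toy['LotArea'] >= 10000)]
```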

Create a Boxplot for any Column

First, let's select the numerical columns in our data frame.

# check column data types
df.dtypes

# select numeric columns
numeric_cols = df.select_dtypes(include='number')
print(numeric_cols)

# boxplot of all numeric columns
numeric_cols.boxplot()

The code above returns box plots for all the numerical columns in the dataset.

To create a box plot for only a certain column, use:

# boxplot for a single column
df[['LotArea']].boxplot()

Create a Correlation Matrix

numeric_cols.corr()

This pandas method returns correlations only for pairs of numeric columns.

sns.heatmap(numeric_cols.corr())
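To read the actual coefficients off the heatmap, you can pass annot=True, which prints the value inside each cell. A sketch on a hypothetical two-column frame with a perfect linear relationship (so the off-diagonal coefficient is exactly 1.0):

```python
import pandas as pd
import seaborn as sns

# Toy frame: SalePrice rises linearly with YrSold
toy = pd.DataFrame({'YrSold': [2006, 2007, 2008, 2009],
                    'SalePrice': [150, 160, 170, 180]})

corr = toy.corr()
# annot=True writes the coefficient inside each heatmap cell
sns.heatmap(corr, annot=True, cmap='coolwarm')
```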

Wikipedia

Pandas

Colab Notebook

GitHub

Dataset Link

Thank you for reading :).

Please let me know if you have any feedback.


I am a data scientist with 1 year of experience working with Python.