Introduction to data cleaning using Pandas

阿新 • • 發佈：2018-12-29

Introduction to data cleaning using Pandas

I’ve been using Excel for data cleaning until I discovered how powerful pandas are for data analysis and data cleaning. In this article I want to go over basics of how to use pandas for cleaning data in excel files.

Importing data and taking a look

As a first step, lets look at the raw data we have, to understand what needs to be done as part of cleaning. I’m skipping installation of python and Jupyter notebook in this article. If you are completely new to python and want to install Jupyter, here are some links to guide you through the procees-

http://jupyter.readthedocs.io/en/latest/install.html

I have the Anaconda Navigator installed on my Mac. Follow along this link to install the software- https://docs.anaconda.com/anaconda/install/

Once you have Anaconda installed, click on the jupyter notebook

If you are opening up jupyter for the first time, command prompt should provide you with a local host url. Enter this url in your browser window and the jupyter notebook should open up.

To follow along , download the Austin_Animal_Center_Intakes.csv file here. Save the file to your local and run this piece of code in jupyter.

Importing the file into pandas

import pandas as pd
import numpy as np
import os
filepath = “/Users/ayyagamv/Desktop/Tableau/Assignments/Capstone”
filename = “Austin_Animal_Center_Intakes.csv”
filePathFileName = filepath + “/” + filename
outputFile=os.path.join(“/Users/ayyagamv/Desktop/Tableau/Assignments/Capstone/Austin_Animal_Center_Intakes_cleaned.csv”)
df = pd.read_csv(filePathFileName)

Here we import pandas using the alias ‘pd’, then we read in our data.

df.head() - shows us the first 5 rows and headers - it gives us an idea what to expect. df.tail() - shows us the last 5 rows. Lets try df.head.

df.head()-Output

df.tail()-Output

df.columns -Output

df.shape- Output

df.isnull()- Output

Displays True is a cell does not have a value

Looking at the first and last 5 rows, I can see that some cells in the Name column are blank, panda displays this as NaN. The second obvious issue I see is that some cells in the name field have an asterisk*. We would ideally want to remove the asterisk.

Lets replace the cells without any data in the Name column with the value ‘None’ using this line

df = df.astype(object).where(pd.notnull(df),None)

Onto the next issue with data. I want to remove the asterisk in the Name column. It can be done with this line

df[‘Name’]=df[‘Name’].apply(lambda x:str(x).replace(‘*’,’’))

Lets take a look now at what we have

I can also see that the MonthYear column has time which is not required in that column. Lets ged rid of that.

Firs step is to convert the data frame object to date.time format and then replace the date time with just date. This can be done with just 2 lines of code.

df[‘MonthYear’] = pd.to_datetime(df[‘MonthYear’])
df[‘MonthYear’] = df[‘MonthYear’].apply(lambda x: x.date())

Finally, write the output to an other file and specify the location of the file.

outputFile=os.path.join(“/Users/ayyagamv/Desktop/Tableau/Assignments/Capstone/Austin_Animal_Center_Intakes_cleaned.csv”)

df.to_csv(outputFile)

This was a brief introduction to data cleaning using pandas. The data is now at a point where it can be imported into Tableau for further analysis.

Introduction to data cleaning using Pandas

Introduction to data cleaning using Pandas

Introduction to data cleaning using Pandas

Note 1 for <Pratical Programming : An Introduction to Computer Science Using Python 3>

Note 2 for <Pratical Programming : An Introduction to Computer Science Using Python 3>

Text Mining 101: A Stepwise Introduction to Topic Modeling using Latent Semantic Analysis (using…

Data Handling using Pandas; Machine Learning in Real Life

A gentle introduction to decision trees using R

1.0 An Introduction to Web Scraping using Python

INF6027 Introduction to Data Science Analysis of the UK Police Dataset

【讀書筆記】資料探勘導論（Introduction to Data Mining） 1

An Introduction to Scientific Python – Pandas

《Spring Data官方文件》5.3. Connecting to Cassandra with Spring至5.5. Introduction to CassandraTemplate

Data Cleaning with Python and Pandas: Detecting Missing Values

An intuitive introduction to support vector machines using R

An Introduction to Using Form Elements in React

Explore and get value out of your raw data: An Introduction to Splunk

Introduction to Image Caption Generation using the Avenger’s Infinity War Characters

Introduction to Reinforcement Learning – Towards Data Science

Introduction to Google Data Studio

Using Pandas to Read Large Excel Files in Python β Real Python

Pythonic Data Cleaning With NumPy and Pandas

Introduction to data cleaning using Pandas

Introduction to data cleaning using Pandas

相關推薦