Pandas Crosstab Explained

阿新 • • 發佈：2018-12-28

Start the Process

Let’s get started by importing all the modules we need. If you want to follow along on your own, I have placed the notebook on github:

import pandas as pd
import seaborn as sns

Now we’ll read in the automobile data set from the UCI Machine Learning Repository and make some label changes for clarity:

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type" 
, "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df_raw = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data" 
,
                     header=None, names=headers, na_values="?" )

# Define a list of models that we want to review
models = ["toyota","nissan","mazda", "honda", "mitsubishi", "subaru", "volkswagen", "volvo"]

# Create a copy of the data with only the top 8 manufacturers
df = df_raw[df_raw.make.isin(models)].copy()

For this example, I wanted to shorten the table so I only included the 8 models listed above. This is done solely to make the article more compact and hopefully more understandable.

For the first example, let’s use pd.crosstab to look at how many different body styles these car makers made in 1985 (the year this dataset contains).

pd.crosstab(df.make, df.body_style)

body_style	convertible	hardtop	hatchback	sedan	wagon
make
honda	0	0	7	5	1
mazda	0	0	10	7	0
mitsubishi	0	0	9	4	0
nissan	0	1	5	9	3
subaru	0	0	3	5	4
toyota	1	3	14	10	4
volkswagen	1	0	1	9	1
volvo	0	0	0	8	3

The crosstab function can operate on numpy arrays, series or columns in a dataframe. For this example, I pass in df.make for the crosstab index and df.body_style for the crosstab’s columns. Pandas does that work behind the scenes to count how many occurrences there are of each combination. For example, in this data set Volvo makes 8 sedans and 3 wagons.

Before we go much further with this example, more experienced readers may wonder why we use the crosstab instead of a another pandas option. I will address that briefly by showing two alternative approaches.

First, we could use a groupby followed by an unstack to get the same results:

df.groupby(['make', 'body_style'])['body_style'].count().unstack().fillna(0)

The output for this example looks very similar to the crosstab but it took a couple of extra steps to get it formatted correctly.

It is also possible to do something similar using a pivot_table :

df.pivot_table(index='make', columns='body_style', aggfunc={'body_style':len}, fill_value=0)

Make sure to review my previous article on pivot_tables if you would like to understand how this works.

The question still remains, why even use a crosstab function? The short answer is that it provides a couple of handy functions to more easily format and summarize the data.

The longer answer is that sometimes it can be tough to remember all the steps to make this happen on your own. The simple crosstab API is the quickest route to the solution and provides some useful shortcuts for certain types of analysis.

In my experience, it is important to know about the options and use the one that flows most naturally from the analysis. I have had experiences where I struggled trying to make a pivot_table solution and then quickly got what I wanted by using a crosstab. The great thing about pandas is that once the data is in a dataframe all these manipulations are 1 line of code so you are free to experiment.

Pandas Crosstab Explained

Start the Process

Pandas Crosstab Explained

2018.03.29 python-pandas 數據透視pivot table / 交叉表crosstab

Pandas分組統計函式：groupby、pivot_table及crosstab

Pandas Pivot Table Explained

Pandas 基礎(13) - Crosstab 交叉列表取值

Python pandas

pandas 運算

pandas 常用函數

Numpy - Pandas - Matplot 功能與函數名速查

pandas中pd.read_excel()方法中的converters參數

使用pandas處理大型CSV文件

6-感覺身體被掏空，但還是要學Pandas（上）

python pandas模塊,nba數據處理（1）

7-感覺身體被掏空，但還是要學Pandas（下）

linux下安裝numpy,pandas,scipy,matplotlib,scikit-learn

Python： Pandas的DataFrame如何按指定list排序

Pandas DataFrame學習筆記

pandas基礎操作

centos6.5離線安裝pandas

pandas Series KeyError: -1

Pandas Crosstab Explained

Start the Process

相關推薦