Pandas Crosstab Explained
Start the Process
Let’s get started by importing all the modules we need. If you want to follow along on your own, I have placed the notebook on GitHub.
import pandas as pd
import seaborn as sns
Now we’ll read in the automobile data set from the UCI Machine Learning Repository and make some label changes for clarity:
# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df_raw = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                     header=None, names=headers, na_values="?")

# Define a list of models that we want to review
models = ["toyota", "nissan", "mazda", "honda", "mitsubishi", "subaru",
          "volkswagen", "volvo"]

# Create a copy of the data with only the top 8 manufacturers
df = df_raw[df_raw.make.isin(models)].copy()
For this example, I wanted to shorten the table, so I only included the 8 manufacturers listed above. This is done solely to make the article more compact and hopefully more understandable.
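If you cannot reach the data file, here is a minimal sketch of the same loading steps on a tiny in-memory sample. The three rows, the shortened column list, and the names raw_text, sample_raw and sample are made up for illustration; only the read_csv arguments and the isin filter mirror the code above.

```python
import io

import pandas as pd

# Three made-up rows in the same headerless, comma-separated layout.
# The "?" stands in for a missing price.
raw_text = "toyota,sedan,9988\nvolvo,wagon,?\nbmw,sedan,16430\n"

sample_raw = pd.read_csv(
    io.StringIO(raw_text),
    header=None,                            # the data has no header row
    names=["make", "body_style", "price"],  # so we supply the column names
    na_values="?",                          # treat "?" as NaN
)

# Keep only the makes we care about; .copy() returns an independent DataFrame
sample = sample_raw[sample_raw.make.isin(["toyota", "volvo"])].copy()

print(sample)
print("Missing prices:", sample.price.isna().sum())
```

Because the "?" is converted to NaN while reading, the price column comes back as a numeric column instead of text.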
For the first example, let’s use pd.crosstab to look at how many different body styles these car makers made in 1985 (the year this dataset contains).
pd.crosstab(df.make, df.body_style)
body_style | convertible | hardtop | hatchback | sedan | wagon |
---|---|---|---|---|---|
make | | | | | |
honda | 0 | 0 | 7 | 5 | 1 |
mazda | 0 | 0 | 10 | 7 | 0 |
mitsubishi | 0 | 0 | 9 | 4 | 0 |
nissan | 0 | 1 | 5 | 9 | 3 |
subaru | 0 | 0 | 3 | 5 | 4 |
toyota | 1 | 3 | 14 | 10 | 4 |
volkswagen | 1 | 0 | 1 | 9 | 1 |
volvo | 0 | 0 | 0 | 8 | 3 |
The crosstab function can operate on numpy arrays, series or columns in a dataframe. For this example, I pass in df.make for the crosstab index and df.body_style for the crosstab’s columns. Pandas does the work behind the scenes to count how many occurrences there are of each combination. For example, in this data set Volvo makes 8 sedans and 3 wagons.
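To make the point about input types concrete, here is a small sketch that runs crosstab once on DataFrame columns and once on plain numpy arrays. The toy DataFrame is made up for illustration and is not the automobile data; rownames and colnames are regular crosstab parameters used here to label the array version.

```python
import numpy as np
import pandas as pd

# Made-up toy data, not the automobile data set
toy = pd.DataFrame({
    "make": ["honda", "honda", "volvo", "volvo", "volvo"],
    "body_style": ["sedan", "hatchback", "sedan", "sedan", "wagon"],
})

# Passing two Series (DataFrame columns): axis names come from the Series
from_series = pd.crosstab(toy.make, toy.body_style)

# Passing plain numpy arrays: there are no names to pick up,
# so we supply them with rownames and colnames
from_arrays = pd.crosstab(
    np.array(toy.make),
    np.array(toy.body_style),
    rownames=["make"],
    colnames=["body_style"],
)

print(from_series)
print(from_arrays)
```

Both calls count the same combinations. In this toy data, the volvo row shows 2 sedans and 1 wagon.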
Before we go much further with this example, more experienced readers may wonder why we use crosstab instead of another pandas option. I will address that briefly by showing two alternative approaches.
First, we could use a groupby followed by an unstack to get the same results:
df.groupby(['make', 'body_style'])['body_style'].count().unstack().fillna(0)
The output for this example looks very similar to the crosstab but it took a couple of extra steps to get it formatted correctly.
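One of those extra steps is worth a closer look. The unstack leaves NaN for combinations that never occur, so the counts become floats before fillna(0) runs. Here is a sketch of a variation that keeps integer counts by using size() and the fill_value argument of unstack. The toy DataFrame is made up, and this is just one possible way to write it.

```python
import pandas as pd

# Made-up toy data, not the automobile data set
toy = pd.DataFrame({
    "make": ["honda", "honda", "volvo", "volvo", "volvo"],
    "body_style": ["sedan", "hatchback", "sedan", "sedan", "wagon"],
})

counts = (
    toy.groupby(["make", "body_style"])
    .size()                  # number of rows per (make, body_style) pair
    .unstack(fill_value=0)   # body styles become columns, missing pairs become 0
)

expected = pd.crosstab(toy.make, toy.body_style)

print(counts)
print("Same counts as crosstab:", (counts.values == expected.values).all())
```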
It is also possible to do something similar using a pivot_table:
df.pivot_table(index='make', columns='body_style', aggfunc={'body_style':len}, fill_value=0)
Make sure to review my previous article on pivot_tables if you would like to understand how this works.
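As a sketch of a variation on the pivot_table approach, passing aggfunc="size" counts the rows in each group without naming a column to aggregate. The toy DataFrame is made up for illustration, and I am assuming a reasonably recent pandas version here.

```python
import pandas as pd

# Made-up toy data, not the automobile data set
toy = pd.DataFrame({
    "make": ["honda", "honda", "volvo", "volvo", "volvo"],
    "body_style": ["sedan", "hatchback", "sedan", "sedan", "wagon"],
})

# aggfunc="size" counts rows per (make, body_style) group;
# fill_value=0 replaces combinations that never occur
by_pivot = toy.pivot_table(
    index="make",
    columns="body_style",
    aggfunc="size",
    fill_value=0,
)

by_crosstab = pd.crosstab(toy.make, toy.body_style)

print(by_pivot)
print("Same counts as crosstab:", (by_pivot.values == by_crosstab.values).all())
```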
The question still remains: why even use a crosstab function? The short answer is that it provides a couple of handy options to more easily format and summarize the data.
The longer answer is that sometimes it can be tough to remember all the steps to make this happen on your own. The simple crosstab API is the quickest route to the solution and provides some useful shortcuts for certain types of analysis.
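To give a flavor of those shortcuts, here is a short sketch using two crosstab parameters, margins and normalize, on a made-up toy DataFrame. The margins_name value "Total" is just a label I picked.

```python
import pandas as pd

# Made-up toy data, not the automobile data set
toy = pd.DataFrame({
    "make": ["honda", "honda", "volvo", "volvo", "volvo"],
    "body_style": ["sedan", "hatchback", "sedan", "sedan", "wagon"],
})

# margins=True adds a row and a column of totals
totals = pd.crosstab(toy.make, toy.body_style,
                     margins=True, margins_name="Total")

# normalize="index" turns each row into fractions that sum to 1
shares = pd.crosstab(toy.make, toy.body_style, normalize="index")

print(totals)
print(shares)
```

Getting the same totals or row percentages out of the groupby version would take a few more manual steps.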
In my experience, it is important to know about the options and use the one that flows most naturally from the analysis. I have had experiences where I struggled trying to make a pivot_table solution and then quickly got what I wanted by using a crosstab. The great thing about pandas is that once the data is in a dataframe, all these manipulations are one line of code, so you are free to experiment.