Pandas Crosstab Explained

Start the Process

Let’s get started by importing the modules we need. If you want to follow along on your own, I have placed the notebook on GitHub:

import pandas as pd
import seaborn as sns

Now we’ll read in the automobile data set from the UCI Machine Learning Repository and make some label changes for clarity:

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df_raw = pd.read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data",
                     header=None, names=headers, na_values="?")

# Define a list of models that we want to review
models = ["toyota", "nissan", "mazda", "honda", "mitsubishi", "subaru",
          "volkswagen", "volvo"]

# Create a copy of the data with only the top 8 manufacturers
df = df_raw[df_raw.make.isin(models)].copy()

For this example, I wanted to shorten the table, so I only included the 8 manufacturers in the models list above. This is done solely to make the article more compact and hopefully more understandable.

For the first example, let’s use pd.crosstab to look at how many different body styles these car makers made in 1985 (the year this dataset contains).

pd.crosstab(df.make, df.body_style)

body_style  convertible  hardtop  hatchback  sedan  wagon
make
honda                 0        0          7      5      1
mazda                 0        0         10      7      0
mitsubishi            0        0          9      4      0
nissan                0        1          5      9      3
subaru                0        0          3      5      4
toyota                1        3         14     10      4
volkswagen            1        0          1      9      1
volvo                 0        0          0      8      3

The crosstab function can operate on NumPy arrays, Series or columns in a DataFrame. For this example, I pass in df.make for the crosstab index and df.body_style for the crosstab’s columns. Behind the scenes, pandas counts how many occurrences there are of each combination. For example, in this data set Volvo makes 8 sedans and 3 wagons.
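To see the counting behavior without downloading the data, here is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are hypothetical, not part of the automobile data set). It also shows the NumPy-array form, where the row and column names have to be supplied with `rownames` and `colnames` because plain arrays carry no names:

```python
import pandas as pd

# Hypothetical toy data standing in for the automobile data set
toy = pd.DataFrame({
    "make": ["honda", "honda", "volvo", "volvo", "volvo", "toyota"],
    "body_style": ["sedan", "hatchback", "sedan", "wagon", "sedan", "sedan"],
})

# Passing two Series: the axis names come from the Series names
by_series = pd.crosstab(toy["make"], toy["body_style"])

# Passing NumPy arrays: the axis names are given explicitly
by_array = pd.crosstab(
    toy["make"].to_numpy(),
    toy["body_style"].to_numpy(),
    rownames=["make"],
    colnames=["body_style"],
)

print(by_series)
```

Combinations that never occur in the data, such as a honda wagon in this toy frame, show up as 0 rather than as missing values.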

Before we go much further with this example, more experienced readers may wonder why we use crosstab instead of another pandas option. I will address that briefly by showing two alternative approaches.

First, we could use a groupby followed by an unstack to get the same results:

df.groupby(['make', 'body_style'])['body_style'].count().unstack().fillna(0)

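The output for this example looks very similar to the crosstab, but it took a couple of extra steps to get it formatted correctly. The counts also come back as floats, because the missing combinations are NaN until fillna(0) replaces them.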
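As a rough check that the two approaches agree, the sketch below compares them on a small made-up DataFrame (the `toy` frame is hypothetical). It uses a slight variant of the groupby line, `size()` with `unstack(fill_value=0)`, which is my substitution rather than the code above; it fills the gaps during the unstack so the counts stay integers:

```python
import pandas as pd

# Hypothetical toy data standing in for the automobile data set
toy = pd.DataFrame({
    "make": ["honda", "honda", "volvo", "volvo", "volvo", "toyota"],
    "body_style": ["sedan", "hatchback", "sedan", "wagon", "sedan", "sedan"],
})

counts = pd.crosstab(toy["make"], toy["body_style"])

# size() counts the rows in each (make, body_style) group;
# unstack(fill_value=0) pivots body_style into columns and fills the gaps with 0
grouped = toy.groupby(["make", "body_style"]).size().unstack(fill_value=0)

# Both tables hold the same counts in the same row and column order
same = bool((grouped.to_numpy() == counts.to_numpy()).all())
print(same)
```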

It is also possible to do something similar using a pivot_table :

df.pivot_table(index='make', columns='body_style', aggfunc={'body_style':len}, fill_value=0)

Make sure to review my previous article on pivot_tables if you would like to understand how this works.
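For comparison, here is one way to sketch the pivot_table route on a small made-up DataFrame. This is a variant, not the call shown above: it counts the non-missing values of a separate column (a hypothetical `price` column here), so it only matches the crosstab when that column has no NaN values.

```python
import pandas as pd

# Hypothetical toy data; price has no missing values in this sketch
toy = pd.DataFrame({
    "make": ["honda", "honda", "volvo", "volvo", "volvo", "toyota"],
    "body_style": ["sedan", "hatchback", "sedan", "wagon", "sedan", "sedan"],
    "price": [7000, 6500, 13000, 14000, 12500, 8000],
})

counts = pd.crosstab(toy["make"], toy["body_style"])

# Count the non-missing price values for each make / body_style pair
by_pivot = toy.pivot_table(
    index="make",
    columns="body_style",
    values="price",
    aggfunc="count",
    fill_value=0,
)

# The counts agree with the crosstab because price has no NaN here
matches = bool((by_pivot.to_numpy() == counts.to_numpy()).all())
print(matches)
```

The extra `values` argument and the caveat about missing data are part of why the crosstab call tends to be easier to remember for simple counts.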

The question still remains, why even use a crosstab function? The short answer is that it provides a couple of handy functions to more easily format and summarize the data.

The longer answer is that sometimes it can be tough to remember all the steps to make this happen on your own. The simple crosstab API is the quickest route to the solution and provides some useful shortcuts for certain types of analysis.
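As an illustration of the kind of shortcut meant here, the sketch below uses two crosstab parameters, `margins` and `normalize`, on a small made-up DataFrame (the `toy` frame is hypothetical). It is only a sample of what is available, not a full tour:

```python
import pandas as pd

# Hypothetical toy data standing in for the automobile data set
toy = pd.DataFrame({
    "make": ["honda", "honda", "volvo", "volvo", "volvo", "toyota"],
    "body_style": ["sedan", "hatchback", "sedan", "wagon", "sedan", "sedan"],
})

# margins=True appends an "All" row and column holding the totals
with_totals = pd.crosstab(toy["make"], toy["body_style"], margins=True)

# normalize="index" turns each row into shares that sum to 1
row_shares = pd.crosstab(toy["make"], toy["body_style"], normalize="index")

print(with_totals)
print(row_shares)
```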

In my experience, it is important to know about the options and use the one that flows most naturally from the analysis. I have had experiences where I struggled to build a pivot_table solution and then quickly got what I wanted by using a crosstab. The great thing about pandas is that once the data is in a DataFrame, all these manipulations are one line of code, so you are free to experiment.