Pandas Pivot Table Explained
Pivot the data
As we build up the pivot table, I think it’s easiest to take it one step at a time. Add items and check each step to verify you are getting the results you expect. Don’t be afraid to play with the order and the variables to see what presentation makes the most sense for your needs.
The simplest pivot table must have a dataframe and an
index
pd.pivot_table(df,index=["Name"])
Account | Price | Quantity | |
---|---|---|---|
Name | |||
Barton LLC | 740150 | 35000 | 1.000000 |
Fritsch, Russel and Anderson | 737550 | 35000 | 1.000000 |
Herman LLC | 141962 | 65000 | 2.000000 |
Jerde-Hilpert | 412290 | 5000 | 2.000000 |
Kassulke, Ondricka and Metz | 307599 | 7000 | 3.000000 |
Keeling LLC | 688981 | 100000 | 5.000000 |
Kiehn-Spinka | 146832 | 65000 | 2.000000 |
Koepp Ltd | 729833 | 35000 | 2.000000 |
Kulas Inc | 218895 | 25000 | 1.500000 |
Purdy-Kunde | 163416 | 30000 | 1.000000 |
Stokes LLC | 239344 | 7500 | 1.000000 |
Trantow-Barrows | 714466 | 15000 | 1.333333 |
You can have multiple indexes as well. In fact, most of the
pivot_table
pd.pivot_table(df,index=["Name","Rep","Manager"])
Account | Price | Quantity | |||
---|---|---|---|---|---|
Name | Rep | Manager | |||
Barton LLC | John Smith | Debra Henley | 740150 | 35000 | 1.000000 |
Fritsch, Russel and Anderson | Craig Booker | Debra Henley | 737550 | 35000 | 1.000000 |
Herman LLC | Cedric Moss | Fred Anderson | 141962 | 65000 | 2.000000 |
Jerde-Hilpert | John Smith | Debra Henley | 412290 | 5000 | 2.000000 |
Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | 307599 | 7000 | 3.000000 |
Keeling LLC | Wendy Yule | Fred Anderson | 688981 | 100000 | 5.000000 |
Kiehn-Spinka | Daniel Hilton | Debra Henley | 146832 | 65000 | 2.000000 |
Koepp Ltd | Wendy Yule | Fred Anderson | 729833 | 35000 | 2.000000 |
Kulas Inc | Daniel Hilton | Debra Henley | 218895 | 25000 | 1.500000 |
Purdy-Kunde | Cedric Moss | Fred Anderson | 163416 | 30000 | 1.000000 |
Stokes LLC | Cedric Moss | Fred Anderson | 239344 | 7500 | 1.000000 |
Trantow-Barrows | Craig Booker | Debra Henley | 714466 | 15000 | 1.333333 |
This is interesting but not particularly useful. What we probably want
to do is look at this by Manager and Rep. It’s easy enough to do by
changing the
index
.
pd.pivot_table(df,index=["Manager","Rep"])
Account | Price | Quantity | ||
---|---|---|---|---|
Manager | Rep | |||
Debra Henley | Craig Booker | 720237.0 | 20000.000000 | 1.250000 |
Daniel Hilton | 194874.0 | 38333.333333 | 1.666667 | |
John Smith | 576220.0 | 20000.000000 | 1.500000 | |
Fred Anderson | Cedric Moss | 196016.5 | 27500.000000 | 1.250000 |
Wendy Yule | 614061.5 | 44250.000000 | 3.000000 |
You can see that the pivot table is smart enough to start aggregating the data and summarizing it by grouping the reps with their managers. Now we start to get a glimpse of what a pivot table can do for us.
For this purpose, the Account and Quantity columns aren’t really useful.
Let’s remove it by explicitly defining the columns we care about using
the
values
field.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])
Price | ||
---|---|---|
Manager | Rep | |
Debra Henley | Craig Booker | 20000 |
Daniel Hilton | 38333 | |
John Smith | 20000 | |
Fred Anderson | Cedric Moss | 27500 |
Wendy Yule | 44250 |
The price column automatically averages the data but we can do a count
or a sum. Adding them is simple using
aggfunc
and
np.sum
.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)
Price | ||
---|---|---|
Manager | Rep | |
Debra Henley | Craig Booker | 80000 |
Daniel Hilton | 115000 | |
John Smith | 40000 | |
Fred Anderson | Cedric Moss | 110000 |
Wendy Yule | 177000 |
aggfunc
can take a list of functions. Let’s try a mean using the numpy
mean
function and
len
to get a count.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])
mean | len | ||
---|---|---|---|
Price | Price | ||
Manager | Rep | ||
Debra Henley | Craig Booker | 20000 | 4 |
Daniel Hilton | 38333 | 3 | |
John Smith | 20000 | 2 | |
Fred Anderson | Cedric Moss | 27500 | 4 |
Wendy Yule | 44250 | 4 |
If we want to see sales broken down by the products, the
columns
variable allows us to define one or more columns.
pivot_table
is the use of
columns
and
values
.
Remember,
columns
are optional - they provide an additional way to segment the actual values you care about.
The aggregation functions are applied to the
values
you list.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"], columns=["Product"],aggfunc=[np.sum])
sum | |||||
---|---|---|---|---|---|
Price | |||||
Product | CPU | Maintenance | Monitor | Software | |
Manager | Rep | ||||
Debra Henley | Craig Booker | 65000 | 5000 | NaN | 10000 |
Daniel Hilton | 105000 | NaN | NaN | 10000 | |
John Smith | 35000 | 5000 | NaN | NaN | |
Fred Anderson | Cedric Moss | 95000 | 5000 | NaN | 10000 |
Wendy Yule | 165000 | 7000 | 5000 | NaN |
The NaN’s are a bit distracting. If we want to remove them, we could use
fill_value
to set them to 0.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"], columns=["Product"],aggfunc=[np.sum],fill_value=0)
sum | |||||
---|---|---|---|---|---|
Price | |||||
Product | CPU | Maintenance | Monitor | Software | |
Manager | Rep | ||||
Debra Henley | Craig Booker | 65000 | 5000 | 0 | 10000 |
Daniel Hilton | 105000 | 0 | 0 | 10000 | |
John Smith | 35000 | 5000 | 0 | 0 | |
Fred Anderson | Cedric Moss | 95000 | 5000 | 0 | 10000 |
Wendy Yule | 165000 | 7000 | 5000 | 0 |
I think it would be useful to add the quantity as well. Add Quantity to
the
values
list.
pd.pivot_table(df,index=["Manager","Rep"],values=["Price","Quantity"], columns=["Product"],aggfunc=[np.sum],fill_value=0)
sum | |||||||||
---|---|---|---|---|---|---|---|---|---|
Price | Quantity | ||||||||
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Rep | ||||||||
Debra Henley | Craig Booker | 65000 | 5000 | 0 | 10000 | 2 | 2 | 0 | 1 |
Daniel Hilton | 105000 | 0 | 0 | 10000 | 4 | 0 | 0 | 1 | |
John Smith | 35000 | 5000 | 0 | 0 | 1 | 2 | 0 | 0 | |
Fred Anderson | Cedric Moss | 95000 | 5000 | 0 | 10000 | 3 | 1 | 0 | 1 |
Wendy Yule | 165000 | 7000 | 5000 | 0 | 7 | 3 | 2 | 0 |
What’s interesting is that you can move items to the index to get a
different visual representation. Remove Product from the
columns
and add to the
index
.
pd.pivot_table(df,index=["Manager","Rep","Product"], values=["Price","Quantity"],aggfunc=[np.sum],fill_value=0)
sum | ||||
---|---|---|---|---|
Price | Quantity | |||
Manager | Rep | Product | ||
Debra Henley | Craig Booker | CPU | 65000 | 2 |
Maintenance | 5000 | 2 | ||
Software | 10000 | 1 | ||
Daniel Hilton | CPU | 105000 | 4 | |
Software | 10000 | 1 | ||
John Smith | CPU | 35000 | 1 | |
Maintenance | 5000 | 2 | ||
Fred Anderson | Cedric Moss | CPU | 95000 | 3 |
Maintenance | 5000 | 1 | ||
Software | 10000 | 1 | ||
Wendy Yule | CPU | 165000 | 7 | |
Maintenance | 7000 | 3 | ||
Monitor | 5000 | 2 |
For this data set, this representation makes more sense. Now, what if I
want to see some totals?
margins=True
does that for us.
pd.pivot_table(df,index=["Manager","Rep","Product"], values=["Price","Quantity"], aggfunc=[np.sum,np.mean],fill_value=0,margins=True)
sum | mean | |||||
---|---|---|---|---|---|---|
Price | Quantity | Price | Quantity | |||
Manager | Rep | Product | ||||
Debra Henley | Craig Booker | CPU | 65000 | 2 | 32500.000000 | 1.000000 |
Maintenance | 5000 | 2 | 5000.000000 | 2.000000 | ||
Software | 10000 | 1 | 10000.000000 | 1.000000 | ||
Daniel Hilton | CPU | 105000 | 4 | 52500.000000 | 2.000000 | |
Software | 10000 | 1 | 10000.000000 | 1.000000 | ||
John Smith | CPU | 35000 | 1 | 35000.000000 | 1.000000 | |
Maintenance | 5000 | 2 | 5000.000000 | 2.000000 | ||
Fred Anderson | Cedric Moss | CPU | 95000 | 3 | 47500.000000 | 1.500000 |
Maintenance | 5000 | 1 | 5000.000000 | 1.000000 | ||
Software | 10000 | 1 | 10000.000000 | 1.000000 | ||
Wendy Yule | CPU | 165000 | 7 | 82500.000000 | 3.500000 | |
Maintenance | 7000 | 3 | 7000.000000 | 3.000000 | ||
Monitor | 5000 | 2 | 5000.000000 | 2.000000 | ||
All | 522000 | 30 | 30705.882353 | 1.764706 |
Let’s move the analysis up a level and look at our pipeline at the manager level. Notice how the status is ordered based on our earlier category definition.
pd.pivot_table(df,index=["Manager","Status"],values=["Price"], aggfunc=[np.sum],fill_value=0,margins=True)
sum | ||
---|---|---|
Price | ||
Manager | Status | |
Debra Henley | declined | 70000 |
pending | 50000 | |
presented | 50000 | |
won | 65000 | |
Fred Anderson | declined | 65000 |
pending | 5000 | |
presented | 45000 | |
won | 172000 | |
All | 522000 |
A really handy feature is the ability to pass a dictionary to the
aggfunc
so you can perform different functions on each of the values you
select. This has a side-effect of making the labels a little cleaner.
pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"], aggfunc={"Quantity":len,"Price":np.sum},fill_value=0)
Price | Quantity | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Status | ||||||||
Debra Henley | declined | 70000 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
pending | 40000 | 10000 | 0 | 0 | 1 | 2 | 0 | 0 | |
presented | 30000 | 0 | 0 | 20000 | 1 | 0 | 0 | 2 | |
won | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | |
Fred Anderson | declined | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
pending | 0 | 5000 | 0 | 0 | 0 | 1 | 0 | 0 | |
presented | 30000 | 0 | 5000 | 10000 | 1 | 0 | 1 | 1 | |
won | 165000 | 7000 | 0 | 0 | 2 | 1 | 0 | 0 |
You can provide a list of aggfunctions to apply to each value too:
table = pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"], aggfunc={"Quantity":len,"Price":[np.sum,np.mean]},fill_value=0) table
Price | Quantity | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | sum | len | |||||||||||
Product | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | CPU | Maintenance | Monitor | Software | |
Manager | Status | ||||||||||||
Debra Henley | declined | 35000 | 0 | 0 | 0 | 70000 | 0 | 0 | 0 | 2 | 0 | 0 | 0 |
pending | 40000 | 5000 | 0 | 0 | 40000 | 10000 | 0 | 0 | 1 | 2 | 0 | 0 | |
presented | 30000 | 0 | 0 | 10000 | 30000 | 0 | 0 | 20000 | 1 | 0 | 0 | 2 | |
won | 65000 | 0 | 0 | 0 | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | |
Fred Anderson | declined | 65000 | 0 | 0 | 0 | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
pending | 0 | 5000 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 1 | 0 | 0 | |
presented | 30000 | 0 | 5000 | 10000 | 30000 | 0 | 5000 | 10000 | 1 | 0 | 1 | 1 | |
won | 82500 | 7000 | 0 | 0 | 165000 | 7000 | 0 | 0 | 2 | 1 | 0 | 0 |
It can look daunting to try to pull this all together at one time but as
soon as you start playing with the data and slowly add the items, you
can get a feel for how it works. My general rule of thumb is that once
you use multiple
grouby
you should evaluate whether a pivot table
is a useful approach.