SQL from the Perspective of Python Pandas

1. Introduction

Recently, I came across a group with a heavy SQL culture. They have many workflows based on legacy SQL code, waiting for people to read or rebuild them. Although I have some knowledge of SQL, it is not my first choice.

SQL and Pandas are very different things, and SQL is designed mostly for working with stored data. Nevertheless, both are commonly used in data analysis. What's more, some people like SQL-style data analysis so much that they use it outside databases, for example with pandasql in Python or sqldf in R. There is also "Business Intelligence (BI)" software, such as Tableau, that is weak at data cleaning and transformation, so its users simply let SQL do that work in advance.

If you are in a situation like the above (forced to use SQL although it is not your "first data language"), this article may help.

2. Begin with "FROM", not "SELECT"

From the perspective of Pandas, the mindset is always data-oriented: we first have a DataFrame, then we do something to or from it.

But SQL always begins with a "SELECT" clause, so it is heavily column-oriented, or result-oriented. (What you "SELECT" is what you get at the end.)

To reconcile this difference, we can read or write the "FROM" clause first, getting clear on what our data is, and then do something to it, just like in Pandas.

A typical SQL query has both "SELECT" and "FROM". Although "FROM" is not on the first line, we should begin with it.

We use the classic dataset "tips.csv" as an example:

(Preview of tips.csv omitted; the dataset has more rows than the preview shows.)

Python Pandas:

import pandas as pd
tips = pd.read_csv('tips.csv')
tips_nonsmoker = tips.query('smoker=="No"')
tips_nonsmoker.groupby('sex')[['total_bill', 'tip']].sum()

SQL (begin with "FROM"):

SELECT tips_nonsmoker.sex, SUM(tips_nonsmoker.total_bill), SUM(tips_nonsmoker.tip)
FROM (
    SELECT *
    FROM tips
    WHERE smoker='No'
) AS tips_nonsmoker
GROUP BY tips_nonsmoker.sex
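
To check that the two versions really agree, here is a minimal sketch that runs both on a few hand-made rows. The rows are invented for illustration (they are not the real tips.csv), and the in-memory SQLite database is only an assumption about where the SQL runs; any SQL engine should behave similarly.

```python
import sqlite3

import pandas as pd

# Hand-made sample rows (NOT the real tips.csv), same column names.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No"],
})

# Pandas: get the data ready first, then aggregate it.
tips_nonsmoker = tips.query('smoker == "No"')
pandas_result = (
    tips_nonsmoker.groupby("sex")[["total_bill", "tip"]].sum().reset_index()
)

# SQL: the same two steps; the "get the data ready" step lives inside FROM.
conn = sqlite3.connect(":memory:")
tips.to_sql("tips", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT t.sex AS sex, SUM(t.total_bill) AS total_bill, SUM(t.tip) AS tip
    FROM (
        SELECT * FROM tips WHERE smoker = 'No'
    ) AS t
    GROUP BY t.sex
    ORDER BY t.sex
    """,
    conn,
)
conn.close()

print(pandas_result)
print(sql_result)
```

Reading the SQL from the inner "FROM" outwards gives the same order of steps as the Pandas code.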

This trick is even more useful when we have two tables to merge. In Pandas, we first merge them into one DataFrame, then do something to or from it.

SQL has no such independent step. But if we begin with the "FROM" clause, we can easily see how the tables are joined, and then work out what is done to the result.

The same holds for even more complex problems: beginning with "FROM" makes them easier to understand.
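
As a rough sketch of the two-table case: the lookup table day_type below is hypothetical, invented only for this illustration, and the rows of tips are hand-made rather than taken from tips.csv. In Pandas the merge is a separate step; in SQL the same merge sits in the "FROM ... JOIN" part of the query.

```python
import sqlite3

import pandas as pd

# Hand-made sample rows (NOT the real tips.csv).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "day": ["Sun", "Sat", "Thur", "Fri"],
})
# Hypothetical lookup table, invented for this illustration.
day_type = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sun"],
    "kind": ["weekday", "weekday", "weekend", "weekend"],
})

# Pandas: merge first (an independent step), then aggregate.
merged = tips.merge(day_type, on="day", how="inner")
pandas_result = merged.groupby("kind")["total_bill"].sum().reset_index()

# SQL: the merge is the FROM ... JOIN part; read it first.
conn = sqlite3.connect(":memory:")
tips.to_sql("tips", conn, index=False)
day_type.to_sql("day_type", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT d.kind AS kind, SUM(t.total_bill) AS total_bill
    FROM tips AS t
    JOIN day_type AS d ON t."day" = d."day"
    GROUP BY d.kind
    ORDER BY d.kind
    """,
    conn,
)
conn.close()

print(pandas_result)
print(sql_result)
```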

3. "WHERE" is like .query(), but more powerful

"WHERE" controls how we filter data. In Pandas, there are two styles for doing it.

First style:

filter1 = tips['sex'] == 'Female'
tips[filter1]

Second style:

tips.query('sex=="Female"')

The "WHERE" clause is like the second one, but more powerful, because it can use the LIKE keyword for pattern matching.

As of pandas 1.1.0, .query() cannot do this directly. It needs a detour that hurts consistency and efficiency: calling a string method inside the expression and explicitly passing engine='python', which is slower than the default engine.

Python Pandas:

# wrong
# tips.query('sex LIKE "Fe"')

# right (str.startswith corresponds to LIKE 'Fe%'; str.contains would correspond to LIKE '%Fe%')
tips.query('sex.str.startswith("Fe")', engine='python')

SQL:

SELECT *
FROM tips
WHERE sex LIKE 'Fe%'
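
A small sketch comparing the two on hand-made rows (not the real tips.csv), with an in-memory SQLite database assumed as the SQL engine. Since 'Fe%' is a prefix pattern, the closest Pandas string method is str.startswith.

```python
import sqlite3

import pandas as pd

# Hand-made sample rows (NOT the real tips.csv).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 24.59],
    "sex": ["Female", "Male", "Male", "Female"],
})

# Pandas: string methods inside .query() need engine='python'.
pandas_rows = tips.query('sex.str.startswith("Fe")', engine="python")

# The same filter without .query(), using a boolean mask.
mask_rows = tips[tips["sex"].str.startswith("Fe")]

# SQL: LIKE with a trailing % matches a prefix.
# Note: SQLite's LIKE is case-insensitive for ASCII letters by default,
# while str.startswith is case-sensitive.
conn = sqlite3.connect(":memory:")
tips.to_sql("tips", conn, index=False)
sql_rows = pd.read_sql_query("SELECT * FROM tips WHERE sex LIKE 'Fe%'", conn)
conn.close()

print(pandas_rows)
print(sql_rows)
```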

4. What to "GROUP BY"? Those in "SELECT" but not in aggregate functions

Another operation we can compare is "GROUP BY".

Again, in Pandas our mindset is data-oriented. After we call .groupby() on a DataFrame, we get an intermediate object (a DataFrameGroupBy object), from which we can select columns or run aggregations.

In SQL things are different. To understand how it works, read "GROUP BY" together with what we "SELECT": only the selected columns that are not aggregated should go inside "GROUP BY".

This idea stays the same, and stays useful, when we "GROUP BY" more than one column. The result of "GROUP BY" is the same as what we get in Pandas from .groupby() plus .reset_index().

Python Pandas:

tips.groupby(['sex', 'day'])[['total_bill', 'tip']].sum().reset_index()

SQL:

SELECT sex, "day", SUM(total_bill), SUM(tip)
FROM tips
GROUP BY sex, "day"
ORDER BY sex, "day"

(The "ORDER BY" clause is not necessary; it is only used here for clearer output.)
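
The claim that "GROUP BY" matches .groupby() plus .reset_index() can be checked with a short sketch. The rows are hand-made (not the real tips.csv) and SQLite in memory is assumed as the SQL engine.

```python
import sqlite3

import pandas as pd

# Hand-made sample rows (NOT the real tips.csv).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "day": ["Sun", "Sun", "Sat", "Sat", "Sun", "Sat"],
})

# Pandas: group keys become ordinary columns again after reset_index().
pandas_result = (
    tips.groupby(["sex", "day"])[["total_bill", "tip"]].sum().reset_index()
)

# SQL: sex and "day" are selected but not aggregated, so both go in GROUP BY.
conn = sqlite3.connect(":memory:")
tips.to_sql("tips", conn, index=False)
sql_result = pd.read_sql_query(
    """
    SELECT sex, "day", SUM(total_bill) AS total_bill, SUM(tip) AS tip
    FROM tips
    GROUP BY sex, "day"
    ORDER BY sex, "day"
    """,
    conn,
)
conn.close()

print(pandas_result)
print(sql_result)
```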

5. A pivot table is no more than many "GROUP BY" results combined together

Microsoft Excel is famous for its pivot tables, and according to Wikipedia, Microsoft even holds a trademark on the name.

In Python, we have a function called pd.pivot_table(). It is easy to use and easy to understand from the perspective of an Excel pivot table.

The problem is that being so comfortable to use may make us ignore what a pivot table really does under the hood.

If a groupby is a one-dimensional aggregation, a pivot table can be called a two-dimensional aggregation.

If we look carefully at how SQL builds a pivot table, we will see that a pivot table is no more than many "GROUP BY" results combined together.

Python Pandas:

# easy to call, but it hides what happens under the hood
pd.pivot_table(tips, index=['sex'], columns=['smoker'], values='total_bill', aggfunc='sum')

SQL:

SELECT sex,
       SUM(CASE WHEN smoker='No' THEN total_bill END) AS "No",
       SUM(CASE WHEN smoker='Yes' THEN total_bill END) AS "Yes"
FROM tips
GROUP BY sex

The CASE WHEN expression suggests that a pivot table is equivalent to two groupby results combined together:

Result 1:

SELECT sex, SUM(total_bill)
FROM tips
WHERE smoker='No'
GROUP BY sex

Result 2:

SELECT sex, SUM(total_bill)
FROM tips
WHERE smoker='Yes'
GROUP BY sex

If we combine result 1 with result 2 side by side, matching rows on sex, we get the pivot table.
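
The same combination can be sketched in Pandas. The rows are hand-made (not the real tips.csv), and "combine" is taken here to mean placing the two groupby results side by side, aligned on sex; that reading is an assumption for this sketch.

```python
import pandas as pd

# Hand-made sample rows (NOT the real tips.csv).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No"],
})

# Result 1: one-dimensional aggregation for non-smokers.
result_no = (
    tips[tips["smoker"] == "No"].groupby("sex")["total_bill"].sum().rename("No")
)
# Result 2: one-dimensional aggregation for smokers.
result_yes = (
    tips[tips["smoker"] == "Yes"].groupby("sex")["total_bill"].sum().rename("Yes")
)

# Combine the two results side by side, aligned on the sex index.
combined = pd.concat([result_no, result_yes], axis=1)

# The two-dimensional aggregation done in one call.
pivot = pd.pivot_table(
    tips, index="sex", columns="smoker", values="total_bill", aggfunc="sum"
)

print(combined)
print(pivot)
```

The two printed tables should hold the same numbers; only the name of the column index differs.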