Pandas Cookbook -- 09 Combining Pandas Objects and Connecting to Databases
Based on SeanCheney's translation on Jianshu; I adjusted the formatting and restructured the sections to suit my own reading, so it will be easier to look things up later.
import pandas as pd
import numpy as np
Inserting into a DataFrame
Read the names dataset
names = pd.read_csv('data/names.csv')
names
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
Assign a new row directly with loc
new_data_list = ['Aria', 1]
names.loc[4] = new_data_list
names
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
A new row can also be assigned with a dictionary
names.loc[len(names)] = {'Name':'Zayd', 'Age':2}
The dictionary keys need not follow the column order
names.loc[len(names)] = pd.Series({'Age':32, 'Name':'Dean'})
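The loc-based appends above can be sketched on a small throwaway frame (the values here are made up, not the names dataset): a list is matched to columns by position, a dict by column name.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Cornelia'], 'Age': [70]})
df.loc[len(df)] = ['Abbas', 69]                    # list: matched by position
df.loc[len(df)] = {'Age': 4, 'Name': 'Penelope'}   # dict: matched by column name
print(df.shape)  # (3, 2)
```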
Adding new rows with append
1. Append a dictionary directly
names.append({'Name':'Aria2222', 'Age':12222},ignore_index=True)
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
5 | Zayd | 2 |
6 | Dean | 32 |
7 | Aria2222 | 12222 |
2. append can also concatenate a DataFrame with a Series
Create a Series object
s = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
names.append(s)
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
5 | Zayd | 2 |
6 | Dean | 32 |
7 | Zach | 3 |
append can add several rows at once: just put the objects in a list
s1 = pd.Series({'Name': 'Zach', 'Age': 3}, name=len(names))
s2 = pd.Series({'Name': 'Zayd', 'Age': 2}, name='USA')
names.append([s1, s2])
| | Name | Age |
---|---|---|
0 | Cornelia | 70 |
1 | Abbas | 69 |
2 | Penelope | 4 |
3 | Niko | 2 |
4 | Aria | 1 |
5 | Zayd | 2 |
6 | Dean | 32 |
7 | Zach | 3 |
USA | Zayd | 2 |
Adding a new column with insert
DataFrame.insert(loc, column, value, allow_duplicates=False)
Insert column into DataFrame at specified location.
Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.
- loc : int
- Insertion index. Must verify 0 <= loc <= len(columns)
- column : string, number, or hashable object
- label of the inserted column
- value : int, Series, or array-like
- allow_duplicates : bool, optional
names.insert(0,'id',np.random.randint(len(names)))
names
| | id | Name | Age |
---|---|---|---|
0 | 6 | Cornelia | 70 |
1 | 6 | Abbas | 69 |
2 | 6 | Penelope | 4 |
3 | 6 | Niko | 2 |
4 | 6 | Aria | 1 |
5 | 6 | Zayd | 2 |
6 | 6 | Dean | 32 |
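Note that `np.random.randint(len(names))` returns a single scalar, which is why every row above received the same id. Passing a `size` argument draws one value per row instead. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Cornelia', 'Abbas', 'Penelope'],
                   'Age': [70, 69, 4]})
# size=len(df) draws one random value per row instead of a single scalar
df.insert(0, 'id', np.random.randint(0, 100, size=len(df)))
print(df.columns.tolist())  # ['id', 'Name', 'Age']
```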
Combining DataFrames
Read the stocks_2016, stocks_2017, and stocks_2018 datasets, using Symbol as the row index
years = 2016, 2017, 2018
stock_tables = [pd.read_csv('data/stocks_{}.csv'.format(year), index_col='Symbol') for year in years]
stocks_2016, stocks_2017, stocks_2018 = stock_tables
names = ['prices', 'transactions']
food_tables = [pd.read_csv('data/food_{}.csv'.format(name)) for name in names]
concat
Put the two DataFrames in a list and concatenate them with pandas' concat function
s_list = [stocks_2016, stocks_2017]
pd.concat(s_list)
Symbol | Shares | Low | High |
---|---|---|---|
AAPL | 80 | 95 | 110 |
TSLA | 50 | 80 | 130 |
WMT | 40 | 55 | 70 |
AAPL | 50 | 120 | 140 |
GE | 100 | 30 | 40 |
IBM | 87 | 75 | 95 |
SLB | 20 | 55 | 85 |
TXN | 500 | 15 | 23 |
TSLA | 100 | 100 | 300 |
concat is the only function that can concatenate DataFrames vertically
The keys parameter assigns a label to each DataFrame; the labels appear at the outermost level of the row index, producing a MultiIndex. The names parameter renames each index level.
pd.concat(s_list, keys=['2016', '2017'], names=['Year', 'Symbol'])
Year | Symbol | Shares | Low | High |
---|---|---|---|---|
2016 | AAPL | 80 | 95 | 110 |
2016 | TSLA | 50 | 80 | 130 |
2016 | WMT | 40 | 55 | 70 |
2017 | AAPL | 50 | 120 | 140 |
2017 | GE | 100 | 30 | 40 |
2017 | IBM | 87 | 75 | 95 |
2017 | SLB | 20 | 55 | 85 |
2017 | TXN | 500 | 15 | 23 |
2017 | TSLA | 100 | 100 | 300 |
Concatenation can also be horizontal: just set the axis parameter to 'columns' or 1
pd.concat(s_list, keys=['2016', '2017'], axis='columns', names=['Year', None],sort=True)
Symbol | Shares (2016) | Low (2016) | High (2016) | Shares (2017) | Low (2017) | High (2017) |
---|---|---|---|---|---|---|
AAPL | 80.0 | 95.0 | 110.0 | 50.0 | 120.0 | 140.0 |
GE | NaN | NaN | NaN | 100.0 | 30.0 | 40.0 |
IBM | NaN | NaN | NaN | 87.0 | 75.0 | 95.0 |
SLB | NaN | NaN | NaN | 20.0 | 55.0 | 85.0 |
TSLA | 50.0 | 80.0 | 130.0 | 100.0 | 100.0 | 300.0 |
TXN | NaN | NaN | NaN | 500.0 | 15.0 | 23.0 |
WMT | 40.0 | 55.0 | 70.0 | NaN | NaN | NaN |
By default, concat performs an outer join, keeping every row from each DataFrame. An inner join can be used instead by setting the join parameter:
pd.concat(s_list, join='inner', keys=['2016', '2017'], axis='columns', names=['Year', None])
Symbol | Shares (2016) | Low (2016) | High (2016) | Shares (2017) | Low (2017) | High (2017) |
---|---|---|---|---|---|---|
AAPL | 80 | 95 | 110 | 50 | 120 | 140 |
TSLA | 50 | 80 | 130 | 100 | 100 | 300 |
Tips for concat
Name each DataFrame, then concatenate
pd.concat(dict(zip(years,stock_tables)), axis='columns',sort=True)
Symbol | Shares (2016) | Low (2016) | High (2016) | Shares (2017) | Low (2017) | High (2017) | Shares (2018) | Low (2018) | High (2018) |
---|---|---|---|---|---|---|---|---|---|
AAPL | 80.0 | 95.0 | 110.0 | 50.0 | 120.0 | 140.0 | 40.0 | 135.0 | 170.0 |
AMZN | NaN | NaN | NaN | NaN | NaN | NaN | 8.0 | 900.0 | 1125.0 |
GE | NaN | NaN | NaN | 100.0 | 30.0 | 40.0 | NaN | NaN | NaN |
IBM | NaN | NaN | NaN | 87.0 | 75.0 | 95.0 | NaN | NaN | NaN |
SLB | NaN | NaN | NaN | 20.0 | 55.0 | 85.0 | NaN | NaN | NaN |
TSLA | 50.0 | 80.0 | 130.0 | 100.0 | 100.0 | 300.0 | 50.0 | 220.0 | 400.0 |
TXN | NaN | NaN | NaN | 500.0 | 15.0 | 23.0 | NaN | NaN | NaN |
WMT | 40.0 | 55.0 | 70.0 | NaN | NaN | NaN | NaN | NaN | NaN |
append is a heavily simplified version of concat; internally, append simply calls concat. The second pd.concat example in this section can also be written as:
stocks_2016.append(stocks_2017)
Symbol | Shares | Low | High |
---|---|---|---|
AAPL | 80 | 95 | 110 |
TSLA | 50 | 80 | 130 |
WMT | 40 | 55 | 70 |
AAPL | 50 | 120 | 140 |
GE | 100 | 30 | 40 |
IBM | 87 | 75 | 95 |
SLB | 20 | 55 | 85 |
TXN | 500 | 15 | 23 |
TSLA | 100 | 100 | 300 |
join
Use join to connect DataFrames; if any column names overlap, set lsuffix or rsuffix to distinguish them
stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer')
Symbol | Shares_2016 | Low_2016 | High_2016 | Shares_2017 | Low_2017 | High_2017 |
---|---|---|---|---|---|---|
AAPL | 80.0 | 95.0 | 110.0 | 50.0 | 120.0 | 140.0 |
GE | NaN | NaN | NaN | 100.0 | 30.0 | 40.0 |
IBM | NaN | NaN | NaN | 87.0 | 75.0 | 95.0 |
SLB | NaN | NaN | NaN | 20.0 | 55.0 | 85.0 |
TSLA | 50.0 | 80.0 | 130.0 | 100.0 | 100.0 | 300.0 |
TXN | NaN | NaN | NaN | 500.0 | 15.0 | 23.0 |
WMT | 40.0 | 55.0 | 70.0 | NaN | NaN | NaN |
merge
Inspect the two small datasets, food_prices and food_transactions
food_prices, food_transactions = food_tables
food_prices
| | item | store | price | Date |
---|---|---|---|---|
0 | pear | A | 0.99 | 2017 |
1 | pear | B | 1.99 | 2017 |
2 | peach | A | 2.99 | 2017 |
3 | peach | B | 3.49 | 2017 |
4 | banana | A | 0.39 | 2017 |
5 | banana | B | 0.49 | 2017 |
6 | steak | A | 5.99 | 2017 |
7 | steak | B | 6.99 | 2017 |
8 | steak | B | 4.99 | 2015 |
food_transactions
| | custid | item | store | quantity |
---|---|---|---|---|
0 | 1 | pear | A | 5 |
1 | 1 | banana | A | 10 |
2 | 2 | steak | B | 3 |
3 | 2 | pear | B | 1 |
4 | 2 | peach | B | 2 |
5 | 2 | steak | B | 1 |
6 | 2 | coconut | B | 4 |
Merge food_transactions and food_prices on the keys item and store
food_transactions.merge(food_prices, on=['item', 'store'])
| | custid | item | store | quantity | price | Date |
---|---|---|---|---|---|---|
0 | 1 | pear | A | 5 | 0.99 | 2017 |
1 | 1 | banana | A | 10 | 0.39 | 2017 |
2 | 2 | steak | B | 3 | 6.99 | 2017 |
3 | 2 | steak | B | 3 | 4.99 | 2015 |
4 | 2 | steak | B | 1 | 6.99 | 2017 |
5 | 2 | steak | B | 1 | 4.99 | 2015 |
6 | 2 | pear | B | 1 | 1.99 | 2017 |
7 | 2 | peach | B | 2 | 3.49 | 2017 |
Because steak appears twice in each table, the merge forms a Cartesian product of those rows, producing four steak rows in the result.
Because coconut has no matching price, it is absent from the result.
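The Cartesian-product behaviour is easy to reproduce on toy frames (made-up rows, not the food data): each of the two left 'steak' rows pairs with each of the two right 'steak' rows, giving 2 × 2 = 4 rows, while unmatched keys drop out of the default inner join.

```python
import pandas as pd

left = pd.DataFrame({'item': ['steak', 'steak', 'pear'],
                     'quantity': [3, 1, 5]})
right = pd.DataFrame({'item': ['steak', 'steak', 'banana'],
                      'price': [6.99, 4.99, 0.39]})

# default how='inner': 'pear' and 'banana' have no partner and disappear
merged = left.merge(right, on='item')
print(len(merged))  # 4
```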
Below, only the 2017 data is merged
food_transactions.merge(food_prices.query('Date == 2017'), how='left')
| | custid | item | store | quantity | price | Date |
---|---|---|---|---|---|---|
0 | 1 | pear | A | 5 | 0.99 | 2017.0 |
1 | 1 | banana | A | 10 | 0.39 | 2017.0 |
2 | 2 | steak | B | 3 | 6.99 | 2017.0 |
3 | 2 | pear | B | 1 | 1.99 | 2017.0 |
4 | 2 | peach | B | 2 | 3.49 | 2017.0 |
5 | 2 | steak | B | 1 | 6.99 | 2017.0 |
6 | 2 | coconut | B | 4 | NaN | NaN |
Differences between concat, join, and merge
concat:
- A pandas function
- Concatenates two or more pandas objects vertically or horizontally
- Aligns only on the index
- Raises an error when the index contains duplicate values
- Defaults to an outer join (an inner join is also available)
- concat is the only function that can concatenate DataFrames vertically
join:
- A DataFrame method
- Joins two or more pandas objects horizontally only
- Aligns the calling DataFrame's columns or row index with the other object's row index (not its columns)
- Handles duplicate index values by forming a Cartesian product
- Defaults to a left join (inner, outer, and right joins are also available)
merge:
- A DataFrame method
- Joins exactly two DataFrames horizontally
- Aligns the calling DataFrame's columns or index with the other DataFrame's columns or index
- Handles duplicate keys by forming a Cartesian product
- Defaults to an inner join (left, outer, and right joins are also available)
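The differing defaults can be checked directly on toy frames (these are made up, not the stock data): concat keeps the union of the index, join keeps the caller's rows, and merge keeps only the intersection.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=['k1', 'k2'])
b = pd.DataFrame({'y': [10, 30]}, index=['k1', 'k3'])

c = pd.concat([a, b], axis='columns')              # outer join by default -> k1, k2, k3
j = a.join(b)                                      # left join by default  -> k1, k2
m = a.merge(b, left_index=True, right_index=True)  # inner join by default -> k1
print(len(c), len(j), len(m))  # 3 2 1
```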
Connecting to a SQL database
Create a SQLAlchemy engine
Before reading the chinook database, a SQLAlchemy engine must be created
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/chinook.db')
The read_sql_table function
read_sql_table reads a whole table: the first argument is the table name and the second is the engine; it returns a DataFrame
tracks = pd.read_sql_table('tracks', engine)
tracks.iloc[:5,:5]
| | TrackId | Name | AlbumId | MediaTypeId | GenreId |
---|---|---|---|---|---|
0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 |
1 | 2 | Balls to the Wall | 2 | 2 | 1 |
2 | 3 | Fast As a Shark | 3 | 2 | 1 |
3 | 4 | Restless and Wild | 3 | 2 | 1 |
4 | 5 | Princess of the Dawn | 3 | 2 | 1 |
genres = pd.read_sql_table('genres', engine)
genres.head()
| | GenreId | Name |
---|---|---|
0 | 1 | Rock |
1 | 2 | Jazz |
2 | 3 | Metal |
3 | 4 | Alternative & Punk |
4 | 5 | Rock And Roll |
Find the average track length for each genre
genre_track = genres.merge(
tracks[['GenreId', 'Milliseconds']],
on='GenreId',
how='left').drop('GenreId', axis='columns')
genre_track.head()
| | Name | Milliseconds |
---|---|---|
0 | Rock | 343719 |
1 | Rock | 342562 |
2 | Rock | 230619 |
3 | Rock | 252051 |
4 | Rock | 375418 |
Convert the mean of the Milliseconds column to the timedelta data type
genre_time = genre_track.groupby('Name')['Milliseconds'].mean()
pd.to_timedelta(genre_time, unit='ms').dt.floor('s').sort_values()[:10]
Name
Rock And Roll 00:02:14
Opera 00:02:54
Hip Hop/Rap 00:02:58
Easy Listening 00:03:09
Bossa Nova 00:03:39
R&B/Soul 00:03:40
World 00:03:44
Pop 00:03:49
Latin 00:03:52
Alternative & Punk 00:03:54
Name: Milliseconds, dtype: timedelta64[ns]
Find each customer's total spending
cust = pd.read_sql_table('customers', engine, columns=['CustomerId', 'FirstName', 'LastName'])
invoice = pd.read_sql_table('invoices', engine, columns=['InvoiceId','CustomerId'])
ii = pd.read_sql_table('invoice_items', engine, columns=['InvoiceId', 'UnitPrice', 'Quantity'])
cust_inv = cust.merge(invoice, on='CustomerId').merge(ii, on='InvoiceId')
cust_inv.head()
| | CustomerId | FirstName | LastName | InvoiceId | UnitPrice | Quantity |
---|---|---|---|---|---|---|
0 | 1 | Luís | Gonçalves | 98 | 1.99 | 1 |
1 | 1 | Luís | Gonçalves | 98 | 1.99 | 1 |
2 | 1 | Luís | Gonçalves | 121 | 0.99 | 1 |
3 | 1 | Luís | Gonçalves | 121 | 0.99 | 1 |
4 | 1 | Luís | Gonçalves | 121 | 0.99 | 1 |
Now multiply quantity by unit price to find each customer's total spending
total = cust_inv['Quantity'] * cust_inv['UnitPrice']
cols = ['CustomerId', 'FirstName', 'LastName']
cust_inv.assign(Total = total).groupby(cols)['Total'].sum().sort_values(ascending=False).head()
CustomerId FirstName LastName
6 Helena Holý 49.62
26 Richard Cunningham 47.62
57 Luis Rojas 46.62
46 Hugh O'Reilly 45.62
45 Ladislav Kovács 45.62
Name: Total, dtype: float64
The read_sql_query function
pd.read_sql_query('select * from tracks limit 5', engine).iloc[:,:5]
| | TrackId | Name | AlbumId | MediaTypeId | GenreId |
---|---|---|---|---|---|
0 | 1 | For Those About To Rock (We Salute You) | 1 | 1 | 1 |
1 | 2 | Balls to the Wall | 2 | 2 | 1 |
2 | 3 | Fast As a Shark | 3 | 2 | 1 |
3 | 4 | Restless and Wild | 3 | 2 | 1 |
4 | 5 | Princess of the Dawn | 3 | 2 | 1 |
A long query string can also be passed to read_sql_query
sql_string1 = '''
select
Name,
time(avg(Milliseconds) / 1000, 'unixepoch') as avg_time
from (
select
g.Name,
t.Milliseconds
from
genres as g
join
tracks as t
on
g.genreid == t.genreid
)
group by
Name
order by
avg_time
'''
pd.read_sql_query(sql_string1, engine)[:10]
| | Name | avg_time |
---|---|---|
0 | Rock And Roll | 00:02:14 |
1 | Opera | 00:02:54 |
2 | Hip Hop/Rap | 00:02:58 |
3 | Easy Listening | 00:03:09 |
4 | Bossa Nova | 00:03:39 |
5 | R&B/Soul | 00:03:40 |
6 | World | 00:03:44 |
7 | Pop | 00:03:49 |
8 | Latin | 00:03:52 |
9 | Alternative & Punk | 00:03:54 |
sql_string2 = '''
select
c.customerid,
c.FirstName,
c.LastName,
sum(ii.quantity * ii.unitprice) as Total
from
customers as c
join
invoices as i
on c.customerid = i.customerid
join
invoice_items as ii
on i.invoiceid = ii.invoiceid
group by
c.customerid, c.FirstName, c.LastName
order by
Total desc
'''
pd.read_sql_query(sql_string2, engine)[:10]
| | CustomerId | FirstName | LastName | Total |
---|---|---|---|---|
0 | 6 | Helena | Holý | 49.62 |
1 | 26 | Richard | Cunningham | 47.62 |
2 | 57 | Luis | Rojas | 46.62 |
3 | 45 | Ladislav | Kovács | 45.62 |
4 | 46 | Hugh | O'Reilly | 45.62 |
5 | 37 | Fynn | Zimmermann | 43.62 |
6 | 24 | Frank | Ralston | 43.62 |
7 | 28 | Julia | Barnett | 43.62 |
8 | 25 | Victor | Stevens | 42.62 |
9 | 7 | Astrid | Gruber | 42.62 |