Python Data Analysis 3
Section overview
Introduction to pandas
Installation
pip install pandas
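pandas has two main data structures: DataFrame and Series.
Series
A Series is a one-dimensional, array-like object made up of a sequence of values and an associated array of data labels, called its index. The simplest Series is built from just an array of data: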
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

obj = Series([4, 7, 9, -1])
print(obj)

0    4
1    7
2    9
3   -1
dtype: int64
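In its string form, a Series shows the index on the left and the values on the right. If no index is specified, a default integer index running from 0 to N-1 is created automatically.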
obj = Series([4, 7, 9, -1])
print(obj.values)
print(obj.index)

[ 4  7  9 -1]
RangeIndex(start=0, stop=4, step=1)
Creating a Series with a custom index
obj = Series([4, 7, 9, -1], index=['a', 'b', 'c', 'd'])
print(obj)
print(obj.index)

a    4
b    7
c    9
d   -1
dtype: int64
Index(['a', 'b', 'c', 'd'], dtype='object')
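The two construction styles above can be compared side by side. This is a minimal sketch; the variable names default_obj and labelled_obj are illustrative and do not come from the original notes.

```python
import pandas as pd

# No index given: pandas builds a RangeIndex running from 0 to N-1.
default_obj = pd.Series([4, 7, 9, -1])

# Explicit index: one label per value, in the same order as the data.
labelled_obj = pd.Series([4, 7, 9, -1], index=["a", "b", "c", "d"])

print(default_obj.index)
print(list(labelled_obj.index))
print(labelled_obj.values)
```

The values are identical in both objects; only the labels attached to them differ.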
Selecting values by index label
obj['a']          ==> 4
obj['c']          ==> 9
obj[['a', 'd']]   ==> 4, -1 (pass the labels as a list; the result is a Series)
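As a small runnable sketch of label-based selection: a single label returns a scalar, while a list of labels returns a new Series that keeps those labels. The names single and subset are illustrative only.

```python
import pandas as pd

obj = pd.Series([4, 7, 9, -1], index=["a", "b", "c", "d"])

single = obj["a"]          # one label -> a scalar value
subset = obj[["a", "d"]]   # list of labels -> a new Series

print(single)
print(subset)
```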
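NumPy-style array operations, such as filtering with a boolean array or multiplying by a scalar, preserve the link between index and values: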
obj[obj > 2]

a    4
b    7
c    9
dtype: int64

obj * 2

a     8
b    14
c    18
d    -2
dtype: int64
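The same behaviour holds for NumPy universal functions. The sketch below assumes np.sqrt as an example ufunc (it is not used in the original notes) and shows that the labels survive each operation.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, 9, -1], index=["a", "b", "c", "d"])

positive = obj[obj > 0]     # boolean mask keeps the labels of matching rows
doubled = obj * 2           # scalar arithmetic is applied element-wise
roots = np.sqrt(positive)   # a NumPy ufunc also returns a labelled Series

print(positive)
print(doubled)
print(roots)
```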
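A Series can also be thought of as an ordered dict, because it maps index labels to values. It can be used in many functions that would otherwise expect a dict argument:
'b' in obj  ==> True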
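A few more dict-like operations, sketched under the assumption that Series.get and Series.to_dict are acceptable stand-ins for the dict interface (neither appears in the original notes). Note that the in operator checks the labels, not the values.

```python
import pandas as pd

obj = pd.Series({"a": 100, "b": 20})

has_label = "b" in obj       # True: membership tests look at the index
has_value = 20 in obj        # False: values are not searched
fallback = obj.get("z", 0)   # like dict.get, returns the default for a missing label
as_dict = obj.to_dict()      # convert back to a plain Python dict

print(has_label, has_value, fallback, as_dict)
```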
If the data is stored in a Python dict, a Series can be created directly from it:
dict_obj = {"a": 100, "b": 20, "c": 50, "d": 69}
obj = Series(dict_obj)
obj

a    100
b     20
c     50
d     69
dtype: int64
If you pass both a dict and a list of index labels:
dict_obj = {"a": 100, "b": 20, "c": 50, "d": 69}
states = ['LA', 'b', 'a', 'NY']
obj = Series(dict_obj, index=states)
obj

LA      NaN
b      20.0
a     100.0
NY      NaN
dtype: float64
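Labels that match a dict key are placed at the corresponding positions, while labels with no match are filled with NaN (not a number), which pandas uses to mark missing values. The pandas isnull and notnull functions detect missing data: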
pd.isnull(obj)

LA     True
b     False
a     False
NY     True
dtype: bool
Series offers the same checks as instance methods:
obj.isnull()

LA     True
b     False
a     False
NY     True
dtype: bool
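The boolean result of isnull or notnull is itself a Series, so it can be summed or used as a filter. A minimal sketch, with n_missing and present as illustrative names:

```python
import pandas as pd

data = {"a": 100, "b": 20, "c": 50, "d": 69}
obj = pd.Series(data, index=["LA", "b", "a", "NY"])

missing = obj.isnull()           # True where the label had no match in the dict
n_missing = int(missing.sum())   # True counts as 1, so the sum is the number of gaps
present = obj[obj.notnull()]     # keep only the rows that have a value

print(n_missing)
print(present)
```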
One of the most important features of a Series is that it automatically aligns data by index label in arithmetic operations:
dict_obj = {"a": 100, "b": 20, "c": 50, "d": 69}
dict_obj1 = {"e": 100, "b": 20, "c": 50, "f": 69}
obj = Series(dict_obj)
obj1 = Series(dict_obj1)
obj + obj1

a      NaN
b     40.0
c    100.0
d      NaN
e      NaN
f      NaN
dtype: float64
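Alignment produces NaN wherever a label exists on only one side. If a missing operand should instead count as zero, one option is Series.add with fill_value; this goes slightly beyond the original notes and is shown only as a sketch.

```python
import pandas as pd

s1 = pd.Series({"a": 100, "b": 20, "c": 50, "d": 69})
s2 = pd.Series({"e": 100, "b": 20, "c": 50, "f": 69})

plain = s1 + s2                     # labels found on one side only become NaN
filled = s1.add(s2, fill_value=0)   # a missing operand is treated as 0 instead

print(plain)
print(filled)
```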
The name attribute of a Series (and of its index)
obj.name = 'qty'
obj.index.name = 'types'
obj

types
a    100
b     20
c     50
d     69
Name: qty, dtype: int64
The index of a Series can be modified in place by assignment:
obj.index = ['dandy', 'renee', 'Jeff', 'Steve']
obj

dandy    100
renee     20
Jeff      50
Steve     69
Name: qty, dtype: int64
DataFrame
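A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type. A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.
Building a DataFrame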
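data = {'states': ['NY', 'LA', 'CA', 'BS', 'CA'],
        'year': [2000, 2001, 2002, 2001, 2000],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

# Current pandas keeps the dict's insertion order; older releases sorted the columns alphabetically (pop, states, year).
  states  year  pop
0     NY  2000  1.5
1     LA  2001  1.7
2     CA  2002  3.6
3     BS  2001  2.4
4     CA  2000  2.9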
Specifying the column order
frame2 = DataFrame(data, columns=['year', 'pop', 'states', 'test'])
frame2
frame2.columns

   year  pop states test
0  2000  1.5     NY  NaN
1  2001  1.7     LA  NaN
2  2002  3.6     CA  NaN
3  2001  2.4     BS  NaN
4  2000  2.9     CA  NaN
Index(['year', 'pop', 'states', 'test'], dtype='object')
# A requested column that does not exist in the data is filled with NaN
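The effect of the columns argument can be checked programmatically. A minimal sketch reusing the same data; the checks at the end are illustrative.

```python
import pandas as pd

data = {
    "states": ["NY", "LA", "CA", "BS", "CA"],
    "year": [2000, 2001, 2002, 2001, 2000],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}

# columns= fixes the column order; "test" is not a key of data, so it is all NaN.
frame = pd.DataFrame(data, columns=["year", "pop", "states", "test"])

print(list(frame.columns))
print(frame["test"].isnull().all())
```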
Selecting values:
# Two ways to select a single column; each returns a Series
frame2['states']
frame2.year

0    NY
1    LA
2    CA
3    BS
4    CA
Name: states, dtype: object
0    2000
1    2001
2    2002
3    2001
4    2000
Name: year, dtype: int64

# Setting the row index
frame2 = DataFrame(data, columns=['year', 'pop', 'states', 'test'], index=['one', 'two', 'three', 'four', 'five'])
frame2

       year  pop states test
one    2000  1.5     NY  NaN
two    2001  1.7     LA  NaN
three  2002  3.6     CA  NaN
four   2001  2.4     BS  NaN
five   2000  2.9     CA  NaN

# Selecting a row by label (the .ix indexer seen in older tutorials was removed in pandas 1.0; use .loc)
frame2.loc['three']

year      2002
pop        3.6
states      CA
test       NaN
Name: three, dtype: object

# A column can be modified by assignment
frame2['test'] = '11'
frame2

       year  pop states test
one    2000  1.5     NY   11
two    2001  1.7     LA   11
three  2002  3.6     CA   11
four   2001  2.4     BS   11
five   2000  2.9     CA   11
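Row selection can be done by label with .loc or by position with .iloc. The sketch below uses both on the same frame; the names row_by_label and row_by_position are illustrative.

```python
import pandas as pd

data = {
    "states": ["NY", "LA", "CA", "BS", "CA"],
    "year": [2000, 2001, 2002, 2001, 2000],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame = pd.DataFrame(
    data,
    columns=["year", "pop", "states", "test"],
    index=["one", "two", "three", "four", "five"],
)

row_by_label = frame.loc["three"]   # select a row by its index label
row_by_position = frame.iloc[2]     # the same row, selected by integer position

frame["test"] = "11"                # a scalar is broadcast to every row

print(row_by_label)
print(frame)
```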
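Column operations:
When you assign a list or an array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are aligned exactly to the DataFrame's index, and every position without a match is filled with a missing value.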
val = Series([-1, -2, 3], index=['two', 'one', 'three'])
frame2['test'] = val
frame2

       year  pop states  test
one    2000  1.5     NY  -2.0
two    2001  1.7     LA  -1.0
three  2002  3.6     CA   3.0
four   2001  2.4     BS   NaN
five   2000  2.9     CA   NaN
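The difference between assigning a list and assigning a Series can be sketched as follows. The column names from_list, from_series, and bad are hypothetical, chosen only for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]}, index=["one", "two", "three"])

# A list is assigned by position, so its length must equal the number of rows.
frame["from_list"] = [10, 20, 30]

# A Series is aligned by label; "three" has no match and becomes NaN.
frame["from_series"] = pd.Series([-1, -2], index=["two", "one"])

# A list of the wrong length is rejected.
try:
    frame["bad"] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame)
print(length_error)
```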
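Assigning to a column that does not exist creates a new column. The del keyword deletes a column, much as it deletes a key from a Python dict.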
frame2['test1'] = frame2.test.notnull()
frame2

       year  pop states  test  test1
one    2000  1.5     NY  -2.0   True
two    2001  1.7     LA  -1.0   True
three  2002  3.6     CA   3.0   True
four   2001  2.4     BS   NaN  False
five   2000  2.9     CA   NaN  False

del frame2['test1']
frame2

       year  pop states  test
one    2000  1.5     NY  -2.0
two    2001  1.7     LA  -1.0
three  2002  3.6     CA   3.0
four   2001  2.4     BS   NaN
five   2000  2.9     CA   NaN
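A compact sketch of creating and deleting a column. The column name large and the 1.6 threshold are made up for the example. New columns need the bracket syntax; attribute-style assignment (frame.large = ...) does not create a column.

```python
import pandas as pd

frame = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

# Bracket assignment to a name that does not exist yet creates the column.
frame["large"] = frame["pop"] > 1.6
columns_before = list(frame.columns)

# del removes the column in place, like deleting a dict key.
del frame["large"]
columns_after = list(frame.columns)

print(columns_before, columns_after)
```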
Creating a DataFrame from a nested dict (the outer keys become columns, the inner keys become the row index)
pop = {
    "dandy": {"age": 18, "gender": "male"},
    "elina": {"age": 16, "gender": "female"},
    "renee": {"age": 16, "gender": "female"},
    "taylor": {"age": 18, "gender": "female"},
}
frame3 = DataFrame(pop)
frame3

        dandy   elina   renee  taylor
age        18      16      16      18
gender   male  female  female  female

frame3.T  # transpose

        age  gender
dandy    18    male
elina    16  female
renee    16  female
taylor   18  female
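The mapping from nested-dict keys to the two axes can be verified directly. A minimal sketch with a two-person dict; the name people is illustrative.

```python
import pandas as pd

people = {
    "dandy": {"age": 18, "gender": "male"},
    "elina": {"age": 16, "gender": "female"},
}

frame = pd.DataFrame(people)   # outer keys -> columns, inner keys -> row index
transposed = frame.T           # swap rows and columns

print(list(frame.columns), list(frame.index))
print(transposed)
```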
Creating a DataFrame from a dict of Series:
pdata = {'dandy': frame3['dandy'][:-1], 'elina': frame3['elina']}
frame4 = DataFrame(pdata)
frame4

       dandy   elina
age       18      16
gender   NaN  female
Setting the axis names
frame3.index.name = 'detail'
frame3.columns.name = 'name'
frame3

name   dandy   elina   renee  taylor
detail
age       18      16      16      18
gender  male  female  female  female
The values attribute
frame3.values  # returns the DataFrame's data as a two-dimensional ndarray

[[18 16 16 18]
 ['male' 'female' 'female' 'female']]
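When the columns hold different types, the returned array needs a dtype that can hold all of them, which in practice means object. The sketch below contrasts a mixed frame with an all-integer one; to_numpy() is used as the method form of the same conversion, which is an addition to the original notes.

```python
import pandas as pd

mixed = pd.DataFrame({"age": [18, 16], "gender": ["male", "female"]})
numeric = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

mixed_array = mixed.values          # integers and strings together -> object dtype
numeric_array = numeric.to_numpy()  # all integers -> an integer array

print(mixed_array.dtype, numeric_array.dtype)
print(numeric_array)
```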
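Index objects
pandas Index objects hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index internally. Index objects are immutable.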
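obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index[0]

'a'

Assigning to an element, for example index[0] = 'x', raises a TypeError, because an Index does not support mutable operations.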
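The immutability can be demonstrated by catching the error. To change a label, build a new object instead; Series.rename is one way to do that and is an assumption of this sketch rather than something the original notes use.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# Item assignment on an Index is not allowed.
try:
    index[0] = "x"
    mutation_error = None
except TypeError as exc:
    mutation_error = exc

# rename returns a new Series with relabelled index; obj itself is unchanged.
renamed = obj.rename({"a": "x"})

print(type(mutation_error).__name__)
print(list(renamed.index), list(obj.index))
```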
This immutability is precisely what makes it safe to share an Index object among multiple data structures:
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2, 0], index=index)
obj2.index is index

True
Besides looking like an array, an Index also behaves like a fixed-size set:
pop = {
    "dandy": {"age": 18, "gender": "male"},
    "elina": {"age": 16, "gender": "female"},
    "renee": {"age": 16, "gender": "female"},
    "taylor": {"age": 18, "gender": "female"},
}
frame3 = DataFrame(pop)
'dandy' in frame3.columns

True
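Beyond membership tests, an Index also offers set-style methods. A minimal sketch with two hypothetical label sets, left and right:

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

common = left.intersection(right)   # labels present in both
combined = left.union(right)        # labels present in either
only_left = left.difference(right)  # labels present in left but not in right

print(list(common), list(combined), list(only_left))
```

Unlike a Python set, an Index may contain duplicate labels, so the analogy is only partial.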
Basic functionality
obj = Series([4, 6, 9.9, 7], index=['a', 'v', 'b', 'd'])
obj

a    4.0
v    6.0
b    9.9
d    7.0
dtype: float64
reindex
# Labels that do not exist in the original index get NaN
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'v'])
obj2

a    4.0
b    9.9
c    NaN
d    7.0
v    6.0
dtype: float64

# fill_value=0 replaces the NaN for missing labels
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'v'], fill_value=0)
obj2

a    4.0
b    9.9
c    0.0
d    7.0
v    6.0
dtype: float64

# method='ffill' carries the previous value forward into the new labels
obj = Series(['aa', 'bb', 'cc', 'dd'], index=[0, 2, 4, 6])
obj2 = obj.reindex(range(7), method='ffill')
obj2

0    aa
1    aa
2    bb
3    bb
4    cc
5    cc
6    dd
dtype: object
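Forward fill has a mirror image, backward fill, which takes the next valid value instead of the previous one. The sketch below compares the two on a shorter Series; method='bfill' is not used in the original notes and is included only for contrast.

```python
import pandas as pd

obj = pd.Series(["aa", "bb", "cc"], index=[0, 2, 4])

# ffill: each new label takes the value of the closest earlier label.
filled_forward = obj.reindex(range(6), method="ffill")

# bfill: each new label takes the value of the closest later label;
# label 5 has nothing after it, so it stays missing.
filled_backward = obj.reindex(range(6), method="bfill")

print(filled_forward)
print(filled_backward)
```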
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'b', 'c'], columns=['Ohio', 'Texas', 'California'])
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame

   Ohio  Texas  California
a     0      1           2
b     3      4           5
c     6      7           8

frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   3.0    4.0         5.0
c   6.0    7.0         8.0
d   NaN    NaN         NaN

text = ['Texas', 'LA', 'California']
frame3 = frame.reindex(columns=text)
frame3

   Texas  LA  California
a      1 NaN           2
b      4 NaN           5
c      7 NaN           8
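Rows and columns can both be reindexed, but interpolation (method='ffill') is applied only along the rows (axis 0). The outputs in this part correspond to a frame whose rows are labelled a, c, e rather than a, b, c, so the frame is rebuilt with those labels first:

frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'e'], columns=['Ohio', 'Texas', 'California'])
text = ['Texas', 'LA', 'California']
frame.reindex(index=['a', 'b', 'c', 'd', 'e', 'f'], method='ffill').reindex(columns=text)

   Texas  LA  California
a      1 NaN           2
b      1 NaN           2
c      4 NaN           5
d      4 NaN           5
e      7 NaN           8
f      7 NaN           8

Older tutorials shortened this kind of reindexing with the label-based .ix indexer, as in frame.ix[['a', 'b', 'c', 'd'], text]. The .ix indexer was removed in pandas 1.0, and .loc raises a KeyError when a requested label is missing, so on current pandas the equivalent is a single reindex call:

frame.reindex(index=['a', 'b', 'c', 'd'], columns=text)

   Texas  LA  California
a    1.0 NaN         2.0
b    NaN NaN         NaN
c    4.0 NaN         5.0
d    NaN NaN         NaN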
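The contrast between reindex and .loc for labels that do not exist can be sketched as follows. The example assumes pandas 1.0 or later, where .loc with a missing label raises a KeyError; the frame uses row labels a, c, e purely for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "e"],
    columns=["Ohio", "Texas", "California"],
)

# reindex accepts labels that do not exist and fills them with NaN.
result = frame.reindex(index=["a", "b", "c", "d"], columns=["Texas", "LA", "California"])

# .loc is strict: a missing label is an error (assumes pandas >= 1.0).
try:
    frame.loc[["a", "b"]]
    loc_error = None
except KeyError as exc:
    loc_error = exc

print(result)
print(type(loc_error).__name__)
```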
Dropping entries from an axis
The drop method returns a new object with the specified values removed from the given axis:
obj = Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj

a    0
b    1
d    3
e    4
dtype: int32
For a DataFrame, index values can be dropped from either axis:
data = DataFrame(np.arange(16).reshape(4, 4), index=['LA', 'UH', 'NY', 'BS'], columns=['one', 'two', 'three', 'four'])
data.drop(['LA', 'BS'])

    one  two  three  four
UH    4    5      6     7
NY    8    9     10    11

# For columns, pass axis=1
data.drop(['one', 'three'], axis=1)

    two  four
LA    1     3
UH    5     7
NY    9    11
BS   13    15
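The same two drops can also be written with the index= and columns= keywords, which some readers find clearer than axis=1. A minimal sketch that also confirms the original frame is left untouched:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["LA", "UH", "NY", "BS"],
    columns=["one", "two", "three", "four"],
)

rows_dropped = data.drop(index=["LA", "BS"])        # same as data.drop(["LA", "BS"])
cols_dropped = data.drop(columns=["one", "three"])  # same as data.drop([...], axis=1)

print(rows_dropped)
print(cols_dropped)
print(data.shape)   # drop returned new objects, so data still has 4 rows and 4 columns
```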
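Indexing, selection, and filtering
Series indexing works much like NumPy array indexing, except that the index values of a Series do not have to be integers.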
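obj = Series(np.arange(4), index=['a', 'b', 'c', 'd'])
obj

a    0
b    1
c    2
d    3
dtype: int32

obj['b']  # by label; the positional equivalent is obj.iloc[1]

1

obj[2:4]  # positional slice; selects the same rows as obj[['c', 'd']]

c    2
d    3
dtype: int32

obj.iloc[[1, 3]]  # by position; plain obj[[1, 3]] on a label index is deprecated in recent pandas

b    1
d    3
dtype: int32

obj[obj > 2]

d    3
dtype: int32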
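Using .loc for labels and .iloc for positions makes the intent explicit. One detail worth seeing in code is that a label slice includes its end point, while a positional slice does not. The variable names below are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=["a", "b", "c", "d"])

by_position = obj.iloc[1:3]    # positions 1 and 2; the end position is excluded
by_label = obj.loc["b":"c"]    # labels b through c; the end label is included
picked = obj.iloc[[1, 3]]      # explicit positions
filtered = obj[obj > 2]        # boolean mask

print(by_position)
print(by_label)
print(picked)
print(filtered)
```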