Python Data Processing: pandas Basics

阿新 • Published: 2017-07-18


Sources for this article:

　　Python for Data Analysis: Chapter 5

　　10 Minutes to pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html#min

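1. Introduction to pandas

After several years of development, pandas has become the most commonly used Python package for data processing. The following were the original goals behind pandas, and they remain its most-used features today: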

　　a: Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and working with differently-indexed data coming from different sources.

　　b: Integrated time series functionality

　　c: The same data structures handle both time series data and non-time series data.

　　d: Arithmetic operations and reductions (like summing across an axis) would pass on the metadata (axis labels).

　　e: Flexible handling of missing data

　　f: Merge and other relational operations found in popular databases (SQL-based, for example)

An article titled "Don't use Hadoop when your data isn't that big" argues that Hadoop is a reasonable technical choice only once the data exceeds about 5 TB. For datasets below that size, pandas is generally sufficient.

2. pandas data structures

2.1 Series

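A Series is a one-dimensional array-like object made up of two parts: (1) an array of any NumPy data type, and (2) an associated array of data labels, called its index.

A Series therefore has two main attributes: values and index.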

A Series can be created from a list-like object; its data and labels can then be read from the values and index attributes.
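
The screenshot that originally illustrated this step is not available, so here is a minimal sketch of the same idea. The variable names and sample numbers are illustrative assumptions, not taken from the original post.

```python
import pandas as pd

# Build a Series from a plain list; pandas supplies a default integer index.
obj = pd.Series([4, 7, -5, 3])
print(obj.values)  # the underlying array of data
print(obj.index)   # the default index 0..3

# Passing an explicit index attaches a label to each value.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
print(obj2["a"])   # look a value up by its label
```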

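A Series can also be created by passing a dict, which can be converted to a sequence-like structure.

The dict keys become the index. An index argument can also be passed to Series to specify the order of the index; values are then matched to it automatically by key, and any label with no matching key gets NaN.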
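
As a rough illustration of the dict-based construction described above, the following sketch builds a Series from a dict and then again with an explicit index. The state names and figures are made-up sample data.

```python
import pandas as pd

# Hypothetical sample data: the dict keys become the index labels.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# An explicit index fixes the order; values are matched by key.
# "California" has no key in sdata, so it becomes NaN;
# "Utah" is not in the requested index, so it is left out.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
print(obj4)
```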

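An important feature of Series is data alignment: in arithmetic operations, it automatically aligns data by index label, so that values with the same label are combined even when the two objects are indexed differently.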

Both the Series object itself and its index have a name attribute, for example obj.name = 'population' and obj.index.name = 'state'.
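
The alignment behaviour and the name attributes can be sketched as follows. The two small Series and their labels are illustrative assumptions.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on labels: only "b" and "c" exist in both operands,
# so "a" and "d" come out as NaN in the result.
total = s1 + s2
print(total)

# Both the Series and its index can carry a name.
total.name = "population"
total.index.name = "state"
```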

2.2 DataFrame

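A DataFrame represents tabular, spreadsheet-like or relational (database-style) data. It contains an ordered collection of columns; the values within a column share one data type, but different columns can have different types.

A DataFrame has two indexes: one for rows and one for columns.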

Ways to create a DataFrame: from a dict of equal-length lists, arrays, or tuples; from a nested dict of dicts; from a dict of Series; and so on. See Table 5-1 in the book for details.
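
Two of the construction routes listed above might look like this. The sample data is assumed for illustration only.

```python
import pandas as pd

# Route 1: a dict of equal-length lists; each key becomes a column.
data = {
    "state": ["Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2001, 2002],
    "pop": [1.5, 1.7, 2.4, 2.9],
}
frame = pd.DataFrame(data)

# Route 2: a nested dict of dicts; outer keys become columns,
# inner keys become row labels. Missing combinations are NaN.
pop = {
    "Nevada": {2001: 2.4, 2002: 2.9},
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
}
frame2 = pd.DataFrame(pop)
print(frame)
print(frame2)
```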

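Extracting a column: use obj3['state'] or obj3.year. The result is a Series with the same index as the DataFrame.

Extracting a row: use the ix indexer with the row's position or label. (ix was deprecated in pandas 0.20 and removed in 1.0; use loc for labels and iloc for positions.)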
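
A short sketch of column and row extraction, using loc and iloc for the row lookups. The frame contents and the row labels "one" to "four" are assumptions for the example.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "state": ["Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2001, 2002],
    },
    index=["one", "two", "three", "four"],
)

col = frame["state"]   # column as a Series; frame.state is equivalent
years = frame.year     # attribute-style access to the "year" column

row_by_label = frame.loc["three"]  # row selected by its label
row_by_pos = frame.iloc[2]         # the same row selected by position
```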

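Common operations:

del: deletes a column, for example del obj['year']

Common attributes: index and columns each have a name attribute; values returns the data as a two-dimensional ndarray.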
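
The operations above can be tried on a small frame like the one below. The column names and numbers are illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Nevada": {2001: 2.4, 2002: 2.9}, "Ohio": {2001: 1.7, 2002: 3.6}}
)

# Name the row index and the column index.
frame.index.name = "year"
frame.columns.name = "state"

# values exposes the data as a 2-D NumPy array.
arr = frame.values
print(arr.shape)

# del removes a column in place.
del frame["Nevada"]
print(frame.columns)
```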

2.3 Index objects and reindexing

The role of a pandas Index: holding the axis labels and other metadata (like the axis name or names).

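Index objects are immutable, meaning the user cannot modify them in place, so an assignment such as index[1] = 'd' raises an error. This ties in with item a in the introduction.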
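
A minimal sketch of that immutability, assuming a small Series with labels "a", "b", "c" (the names are illustrative):

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

error = None
try:
    index[1] = "d"  # attempt to change a label in place
except TypeError as exc:
    # pandas rejects item assignment on an Index
    error = exc
    print("Index is immutable:", exc)
```

Because labels cannot be changed in place, an Index can be shared safely between several data structures.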

The reindex() method can change, add, or remove labels along the specified axis. It returns a new object and leaves the original data unmodified.

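Parameters of reindex():

　　　　index: the new index, used in place of the old one. An Index passed here is used as is, without copying. pandas operations generally copy the underlying values automatically, which differs from ndarray.

　　　　method: fill method for newly introduced labels, either ffill (forward fill) or bfill (backward fill)

　　　　fill_value: the value used to fill in NaN entries introduced by reindexing

　　　　copy, and so on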
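
The parameters above can be sketched on two small Series. The labels and values are illustrative assumptions.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# Reorder the labels and add a new one; "e" has no data, so it is NaN.
obj2 = obj.reindex(["a", "b", "c", "d", "e"])

# fill_value replaces the NaN that reindexing would introduce.
obj3 = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)

# method="ffill" carries the last known value forward;
# it requires the original index to be sorted.
colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
filled = colors.reindex(range(6), method="ffill")
print(filled)
```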

3. Viewing data

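　　3.1 sorting: returns a new, sorted object

　　　　a: sorting by axis labels (rows or columns)

　　　　　　sort_index()

　　　　　　Parameters: sorts by row labels by default; axis=1 sorts by column labels

　　　　　　　　　　　ascending order by default; pass ascending=False for descending order

　　　　b: sorting by value

　　　　　　sort_values(): missing values are placed at the end (older pandas versions called this method order(), which has since been removed)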
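
A brief sketch of both kinds of sorting. The frame, labels and values are made up for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    [[4, 5, 6, 7], [0, 1, 2, 3]],
    index=["three", "one"],
    columns=["d", "a", "b", "c"],
)

by_row = frame.sort_index()                         # sort by row labels
by_col = frame.sort_index(axis=1, ascending=False)  # column labels, descending

# Sorting a Series by value; NaN entries go to the end by default.
s = pd.Series([4, np.nan, 7, np.nan, -3, 2])
by_value = s.sort_values()
print(by_value)
```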

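　　3.2 ranking

　　　　rank(): assigns each value a rank according to its position in sorted order and returns a new object. When values are tied, each of them gets the mean of the tied ranks by default.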
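
A small worked example of ranking with ties, using assumed sample values. The second call shows method="first", one of the alternatives to the default mean-based tie handling.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default: tied values share the mean of the ranks they span,
# e.g. the two 4s occupy ranks 4 and 5, so each gets 4.5.
ranks = obj.rank()

# method="first" breaks ties by order of appearance instead.
ranks_first = obj.rank(method="first")
print(ranks)
print(ranks_first)
```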

　　3.3 unique

　　　　is_unique: tells you whether the values are unique or not; returns True or False

　　　　unique(): returns the distinct values as an array
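
For instance, with an assumed Series of repeated letters:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

print(obj.is_unique)        # False: several values repeat
print(obj.index.is_unique)  # True: the default index has no duplicates

# unique() keeps each value once, in order of first appearance.
uniques = obj.unique()
print(uniques)
```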

　　3.4 value_counts(): counts how many times each value occurs in a Series

　　3.5 describe(): produces a quick statistical summary of the data
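
A minimal sketch of value_counts() and describe() on assumed sample data:

```python
import pandas as pd

letters = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Frequency of each distinct value, most frequent first.
counts = letters.value_counts()
print(counts)

# describe() on numeric data reports count, mean, std, min,
# quartiles and max in one call.
nums = pd.Series([1.0, 2.0, 3.0, 4.0])
summary = nums.describe()
print(summary)
```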

4. Selecting data

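　　4.1 drop

　　Dropping rows:

　　pandas operations generally copy the original values automatically, which differs from ndarray. For example, after dropping a row with drop(), the original object turns out to be unchanged.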

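　　Dropping columns: obj4.drop('Nevada', axis=1)

　　　　　　Many pandas functions work on rows by default, which is why they take an axis parameter.

　　　　　　axis=1 refers to the column labels, that is, columns

　　　　　　axis=0 refers to the row labels, that is, rows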
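
The following sketch shows both forms of drop and checks that the original object is left alone. The frame and its labels are illustrative assumptions rather than the obj4 from the text.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Drop rows by label (axis=0 is the default).
dropped_rows = data.drop(["Colorado", "Ohio"])

# Drop a column by label with axis=1.
dropped_col = data.drop("two", axis=1)

# drop returns new objects; the original is unchanged.
print(data.shape)
```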

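　　4.2 Selection, slicing, and indexing

　　a: Selecting a single column returns a Series; df['A'] and df.A mean the same thing

　　b: Selecting with a slice inside [] slices the rows

　　c: Selecting by label: the endpoint is inclusive, so obj['b':'c'] includes row 'c'

　　d: Selecting a subset of rows and columns: ix (removed in pandas 1.0; use loc or iloc instead)

　　e: Indexing by label: loc

　　f: Indexing by position: iloc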
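
The selection styles listed above can be compared side by side on a small frame. The frame, the Series and their labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=list("abcd"),
    columns=list("ABCD"),
)

col = df["A"]        # single column as a Series, same as df.A
first_two = df[0:2]  # a slice inside [] selects rows

# Label slices include the endpoint.
s = pd.Series(range(4), index=list("abcd"))
label_slice = s["b":"c"]

# Subsets of rows and columns by label (loc) and by position (iloc).
by_label = df.loc[["a", "c"], ["A", "B"]]
by_pos = df.iloc[[0, 2], [0, 1]]
```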

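　　4.3 Filtering with the isin() method:

　　　　isin() returns a boolean mask marking which values belong to a given collection; the mask can then be used to filter the data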
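
A minimal filtering sketch with isin(), using made-up state data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "state": ["Ohio", "Nevada", "Texas", "Ohio"],
        "pop": [1.5, 2.4, 3.1, 1.7],
    }
)

# True where the state is one of the listed values.
mask = df["state"].isin(["Ohio", "Texas"])

# Keep only the rows where the mask is True.
subset = df[mask]
print(subset)
```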

5. Handling missing values

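　　5.1 Missing values

　　　　pandas uses NaN (a floating-point value) to represent missing data

　　5.2 Dropping rows or columns that contain missing values

　　　　dropna

　　　　Parameters: how='all' drops only the rows in which every value is NA

　　　　　　　　 axis=1 drops columns instead of rows

　　　　　　　　 thresh=3 keeps only the rows that have at least 3 non-NA observations

　　5.3 Filling in missing values

　　　　fillna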

　　5.4 isnull: returns an object of the same type holding boolean values that indicate which values are missing

　　　 notnull: the inverse of isnull
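
The missing-data tools in this section can be exercised on one small frame. The data is an assumed example; thresh=2 is used here instead of 3 only because the sample frame has three columns.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame(
    [[1.0, 6.5, 3.0], [1.0, NA, NA], [NA, NA, NA], [NA, 6.5, 3.0]]
)

complete = data.dropna()              # rows with no NA at all
not_all_na = data.dropna(how="all")   # drop only rows that are entirely NA
at_least_two = data.dropna(thresh=2)  # rows with >= 2 non-NA values
cols = data.dropna(axis=1, how="all") # same idea applied to columns

filled = data.fillna(0)               # replace every NA with 0

is_missing = data.isnull()            # boolean frame: True where NA
is_present = data.notnull()           # the inverse mask
```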

6. Computation functions

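　　a: Adding two DataFrame objects with different indexes using "+" gives a result whose labels are the union of both, similar to a union in a database; positions not present in both are NaN

　　b: For explicit addition or subtraction use add() or sub(); missing values can be substituted with fill_value

　　c: sum, count, min, max, and other summary methods

　　d: correlation and covariance

　　　　　.corr()

　　　　　.cov()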
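
A compact sketch of the four items above. The frames, labels and numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.0).reshape(2, 2), index=["a", "b"], columns=["x", "y"])
df2 = pd.DataFrame(np.arange(4.0).reshape(2, 2), index=["b", "c"], columns=["y", "z"])

# "+" aligns on both axes; only (b, y) exists in both frames.
plain = df1 + df2

# add() with fill_value treats a value missing from one side as 0;
# cells missing from both sides are still NaN.
filled = df1.add(df2, fill_value=0)

# Reductions work down the rows by default, or across with axis=1.
col_sums = df1.sum()
row_sums = df1.sum(axis=1)

# Correlation and covariance between two Series.
a = pd.Series([1.0, 2.0, 3.0, 4.0])
b = pd.Series([2.0, 4.0, 6.0, 8.0])
corr_ab = a.corr(b)
cov_ab = a.cov(b)
```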

7. Merging and reshaping

8. Grouping

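　　A "group by" operation usually refers to one or more of the following steps:

　　(Splitting) splitting the data into groups according to some criteria;

　　(Applying) applying a function to each group independently;

　　(Combining) combining the results into a data structure.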
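
The three steps above can be sketched with a single groupby call. The column names "key" and "value" and the data are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "b", "a", "b", "a"],
        "value": [1, 2, 3, 4, 5],
    }
)

# Split rows by "key", apply sum to each group's "value" column,
# and combine the per-group results into one Series indexed by key.
totals = df.groupby("key")["value"].sum()
print(totals)
```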

Note: this article is not comprehensive; it only summarizes the parts I currently need.
