pandas Study Series (1): Getting Started with pandas

Posted by 阿新 • Published: 2018-11-10

Author: chen_h
WeChat & QQ: 862251340
WeChat official account: coderpai

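Introduction

pandas is a package of fast, efficient data analysis tools for Python. Its popularity has surged in recent years, in step with the rise of fields such as data science and machine learning.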

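Just as NumPy provides the basic array data type and core array operations, pandas defines the fundamental structures for working with data and equips them with methods that facilitate operations such as:

reading in data
adjusting indices
working with dates and time series
sorting, grouping, re-ordering and general data munging
dealing with missing values, and so on

More sophisticated statistical and analytical functionality is left to other packages, such as statsmodels and scikit-learn, which are built on top of pandas. Let's get started by importing the packages we need: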

import pandas as pd
import numpy as np

Series

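Two important data types defined by pandas are Series and DataFrame. You can think of a Series as a single column, for example a collection of observations on one variable. A DataFrame is a collection of several related Series.

Let's start with Series.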

s = pd.Series(np.random.randn(4), name = "daily returns")
s

0    1.528827
1   -0.836487
2   -1.932910
3   -1.006040
Name: daily returns, dtype: float64

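Here you can think of the index values 0, 1, 2, 3 as identifying four listed companies, with the corresponding values being the daily returns on their shares. A pandas Series is built on top of NumPy arrays and supports many similar operations.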

s * 100

0    152.882717
1    -83.648681
2   -193.290987
3   -100.603970
Name: daily returns, dtype: float64

np.abs(s)

0    1.528827
1    0.836487
2    1.932910
3    1.006040
Name: daily returns, dtype: float64

But a Series offers more than a NumPy array does: it also has some additional, statistically oriented methods.

s.describe()

count    4.000000
mean    -0.561652
std      1.474615
min     -1.932910
25%     -1.237757
50%     -0.921263
75%     -0.245158
max      1.528827
Name: daily returns, dtype: float64
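The individual statistics reported by describe() are also available as separate methods. The following is a minimal sketch; it hard-codes the four values printed above so that the result is reproducible (an assumption made for illustration, since the original example draws them at random).

```python
import pandas as pd

# Fixed values copied from the sample output above, so results are reproducible.
s = pd.Series([1.528827, -0.836487, -1.932910, -1.006040], name="daily returns")

summary = s.describe()      # count, mean, std, min, quartiles, max
mean_value = s.mean()       # same number as summary["mean"]
std_value = s.std()         # sample standard deviation (ddof=1), as in describe()
median_value = s.median()   # same number as summary["50%"]

print(summary)
print(mean_value, std_value, median_value)
```

Calling the individual methods is handy when only one number is needed, while describe() gives a quick overview of all of them.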

We can also set the index values ourselves, for example:

s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
s

AMZN    1.528827
AAPL   -0.836487
MSFT   -1.932910
GOOG   -1.006040
Name: daily returns, dtype: float64

Viewed this way, a Series is like a fast, efficient Python dictionary. In fact, you can work with it using much the same syntax as a Python dictionary.

s['AMZN']

1.528827

s['AMZN'] = 0
s

AMZN    0.000000
AAPL   -0.836487
MSFT   -1.932910
GOOG   -1.006040
Name: daily returns, dtype: float64

'AAPL' in s

True
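The dictionary analogy also runs the other way: a Series can be built from a dict, and membership tests look at the labels rather than the values. Here is a small sketch under those assumptions; the return figures and the "TSLA" label are made up for illustration.

```python
import pandas as pd

# Hypothetical return values, chosen only for illustration.
returns = {"AMZN": 1.5, "AAPL": -0.8, "MSFT": -1.9, "GOOG": -1.0}

# The dict keys become the index, the dict values become the data.
s = pd.Series(returns, name="daily returns")

has_label = "AAPL" in s          # membership tests the index labels
has_value = -0.8 in s            # ... not the stored values
missing = s.get("TSLA", 0.0)     # like dict.get: default when the label is absent
labels = list(s.keys())          # keys() returns the index

print(has_label, has_value, missing, labels)
```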

DataFrames

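Although a Series is very useful, it holds only a single column of data. What if we want to work with several columns? The DataFrame solves this problem: it holds multiple columns, one for each variable. In essence, a DataFrame in pandas is analogous to a (highly optimized) Excel spreadsheet. It is therefore a powerful tool for representing and analyzing data that is naturally organized into rows and columns, often with descriptive indexes for the individual rows and columns. As an example, consider a CSV file with the following contents: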

"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
"India","IND","2000","1006300.297","44.9416","1728144.3748","64.575551328","14.072205773"
"Israel","ISR","2000","6114.57","4.07733","129253.89423","64.436450847","10.266688415"
"Malawi","MWI","2000","11801.505","59.543808333","5026.2217836","74.707624181","11.658954494"
"South Africa","ZAF","2000","45064.098","6.93983","227242.36949","72.718710427","5.7265463933"
"United States","USA","2000","282171.957","1","9898700","72.347054303","6.0324539789"
"Uruguay","URY","2000","3219.793","12.099591667","25255.961693","78.978740282","5.108067988"

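If you save this data as test_pwt.csv in the current working directory (type %pwd in Jupyter to see what that is), you can read it with pd.read_csv('test_pwt.csv'). Here we read it directly from its URL instead: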

df = pd.read_csv('https://github.com/QuantEcon/QuantEcon.lectures.code/raw/master/pandas/data/test_pwt.csv')
type(df)

pandas.core.frame.DataFrame
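If the URL is not reachable, the same call also works on a file-like object. The sketch below assumes the first rows of the CSV listing above have been pasted into a string; the name csv_text is introduced only for this example.

```python
import io

import pandas as pd

# First three lines of the CSV listing above, held in memory instead of on disk.
csv_text = '''"country","country isocode","year","POP","XRAT","tcgdp","cc","cg"
"Argentina","ARG","2000","37335.653","0.9995","295072.21869","75.716805379","5.5788042896"
"Australia","AUS","2000","19053.186","1.72483","541804.6521","67.759025993","6.7200975332"
'''

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))
print(df.dtypes)   # shows which type pandas inferred for each column
```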

df

	country	country isocode	year	POP	XRAT	tcgdp	cc	cg
0	Argentina	ARG	2000	37335.653	0.999500	2.950722e+05	75.716805	5.578804
1	Australia	AUS	2000	19053.186	1.724830	5.418047e+05	67.759026	6.720098
2	India	IND	2000	1006300.297	44.941600	1.728144e+06	64.575551	14.072206
3	Israel	ISR	2000	6114.570	4.077330	1.292539e+05	64.436451	10.266688
4	Malawi	MWI	2000	11801.505	59.543808	5.026222e+03	74.707624	11.658954
5	South Africa	ZAF	2000	45064.098	6.939830	2.272424e+05	72.718710	5.726546
6	United States	USA	2000	282171.957	1.000000	9.898700e+06	72.347054	6.032454
7	Uruguay	URY	2000	3219.793	12.099592	2.525596e+04	78.978740	5.108068

We can select particular rows using standard Python slice notation:

df[2:5]

	country	country isocode	year	POP	XRAT	tcgdp	cc	cg
2	India	IND	2000	1006300.297	44.941600	1.728144e+06	64.575551	14.072206
3	Israel	ISR	2000	6114.570	4.077330	1.292539e+05	64.436451	10.266688
4	Malawi	MWI	2000	11801.505	59.543808	5.026222e+03	74.707624	11.658954

To select columns, we can pass a list containing the names of the desired columns as strings:

df[['country', 'tcgdp']]

	country	tcgdp
0	Argentina	2.950722e+05
1	Australia	5.418047e+05
2	India	1.728144e+06
3	Israel	1.292539e+05
4	Malawi	5.026222e+03
5	South Africa	2.272424e+05
6	United States	9.898700e+06
7	Uruguay	2.525596e+04

To select rows and columns by integer position, we can use the iloc attribute, in the form .iloc[rows, columns]:

df.iloc[2:5,0:4]

	country	country isocode	year	POP
2	India	IND	2000	1006300.297
3	Israel	ISR	2000	6114.570
4	Malawi	MWI	2000	11801.505

To select rows and columns using a mixture of integers and labels, we can use the loc attribute in a similar way:

df.loc[df.index[2:5], ['country', 'tcgdp']]

	country	tcgdp
2	India	1.728144e+06
3	Israel	1.292539e+05
4	Malawi	5.026222e+03
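One difference between the two selectors is easy to miss: iloc slices by position and excludes the end point, while loc slices by label and includes it. A minimal sketch on a small stand-in frame (the first five countries and their POP values from the table above):

```python
import pandas as pd

# Small stand-in frame with a default integer index.
df = pd.DataFrame(
    {"country": ["Argentina", "Australia", "India", "Israel", "Malawi"],
     "POP": [37335.653, 19053.186, 1006300.297, 6114.570, 11801.505]}
)

by_position = df.iloc[2:5, 0:2]       # positions 2, 3, 4 (end point excluded)
by_label = df.loc[2:4, ["country"]]   # labels 2, 3, 4 (end point included)

print(by_position)
print(by_label)
```

With the default integer index, positions and labels coincide, which is why 2:5 with iloc and 2:4 with loc pick the same three rows here.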

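Suppose we are only interested in population (POP) and total GDP (tcgdp). One way to strip the data frame df down to just these variables, along with the country name, is to overwrite it using the selection method described above.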

df = df[['country','POP','tcgdp']]
df

	country	POP	tcgdp
0	Argentina	37335.653	2.950722e+05
1	Australia	19053.186	5.418047e+05
2	India	1006300.297	1.728144e+06
3	Israel	6114.570	1.292539e+05
4	Malawi	11801.505	5.026222e+03
5	South Africa	45064.098	2.272424e+05
6	United States	282171.957	9.898700e+06
7	Uruguay	3219.793	2.525596e+04

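Here the index 0, 1, ..., 7 is redundant, because we can use the country names as the index. To do this, we set the index to the country variable in the data frame: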

df = df.set_index('country')
df

	POP	tcgdp
country
Argentina	37335.653	2.950722e+05
Australia	19053.186	5.418047e+05
India	1006300.297	1.728144e+06
Israel	6114.570	1.292539e+05
Malawi	11801.505	5.026222e+03
South Africa	45064.098	2.272424e+05
United States	282171.957	9.898700e+06
Uruguay	3219.793	2.525596e+04
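Once the country names form the index, rows can be looked up by name with loc, and reset_index() undoes the change. A short sketch, using three rows from the table above as a stand-in for the full data:

```python
import pandas as pd

# Three rows from the table above, indexed by country name.
df = pd.DataFrame(
    {"country": ["Argentina", "Australia", "India"],
     "POP": [37335.653, 19053.186, 1006300.297],
     "tcgdp": [295072.21869, 541804.6521, 1728144.3748]}
).set_index("country")

india_pop = df.loc["India", "POP"]         # single value by row and column label
subset = df.loc[["Argentina", "India"]]    # several rows by label
restored = df.reset_index()                # move 'country' back into a column

print(india_pop)
print(subset)
print(restored)
```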

Let's give the columns slightly better names:

df.columns = 'population', 'total GDP'
df

	population	total GDP
country
Argentina	37335.653	2.950722e+05
Australia	19053.186	5.418047e+05
India	1006300.297	1.728144e+06
Israel	6114.570	1.292539e+05
Malawi	11801.505	5.026222e+03
South Africa	45064.098	2.272424e+05
United States	282171.957	9.898700e+06
Uruguay	3219.793	2.525596e+04
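Assigning to df.columns relies on the new names being listed in the same order as the existing columns. An alternative that names each column explicitly is DataFrame.rename; the sketch below uses two rows from the table above as a stand-in.

```python
import pandas as pd

# Two rows from the table above, with the original column names.
df = pd.DataFrame(
    {"POP": [37335.653, 19053.186], "tcgdp": [295072.21869, 541804.6521]},
    index=pd.Index(["Argentina", "Australia"], name="country"),
)

# Map old names to new names explicitly; rename returns a new DataFrame
# and leaves the original untouched.
renamed = df.rename(columns={"POP": "population", "tcgdp": "total GDP"})

print(renamed)
```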

Population in the table is given in thousands; let's convert it to single units:

df['population'] = df['population'] * 1e3
df

	population	total GDP
country
Argentina	3.733565e+07	2.950722e+05
Australia	1.905319e+07	5.418047e+05
India	1.006300e+09	1.728144e+06
Israel	6.114570e+06	1.292539e+05
Malawi	1.180150e+07	5.026222e+03
South Africa	4.506410e+07	2.272424e+05
United States	2.821720e+08	9.898700e+06
Uruguay	3.219793e+06	2.525596e+04

Next we add a column showing real GDP per capita, multiplying by 1,000,000 as we go because total GDP is in millions:

df['GDP percap'] = df['total GDP'] * 1e6 / df['population']
df

	population	total GDP	GDP percap
country
Argentina	3.733565e+07	2.950722e+05	7903.229085
Australia	1.905319e+07	5.418047e+05	28436.433261
India	1.006300e+09	1.728144e+06	1717.324719
Israel	6.114570e+06	1.292539e+05	21138.672749
Malawi	1.180150e+07	5.026222e+03	425.896679
South Africa	4.506410e+07	2.272424e+05	5042.647686
United States	2.821720e+08	9.898700e+06	35080.381854
Uruguay	3.219793e+06	2.525596e+04	7843.970620
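The column arithmetic is applied row by row, which can be checked by hand for a single country. A small sketch using the raw CSV values for Argentina and Malawi (the variable argentina_by_hand is introduced only for this check):

```python
import pandas as pd

# Population converted to persons, total GDP left in millions, as in the text.
df = pd.DataFrame(
    {"population": [37335.653 * 1e3, 11801.505 * 1e3],
     "total GDP": [295072.21869, 5026.2217836]},
    index=pd.Index(["Argentina", "Malawi"], name="country"),
)

# Element-wise arithmetic: each row's GDP is divided by that row's population.
df["GDP percap"] = df["total GDP"] * 1e6 / df["population"]

# The same number for one row, computed directly from the raw CSV values.
argentina_by_hand = 295072.21869 * 1e6 / (37335.653 * 1e3)

print(df)
print(argentina_by_hand)
```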

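One of the nice things about pandas DataFrame and Series objects is that they have plotting and visualization methods that work through Matplotlib. For example, we can easily generate a bar chart of GDP per capita.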

import matplotlib.pyplot as plt

df['GDP percap'].plot(kind='bar')
plt.show()


At the moment the data frame is ordered alphabetically by country; let's sort it by GDP per capita instead, from highest to lowest.

df = df.sort_values(by='GDP percap', ascending=False)
df

	population	total GDP	GDP percap
country
United States	2.821720e+08	9.898700e+06	35080.381854
Australia	1.905319e+07	5.418047e+05	28436.433261
Israel	6.114570e+06	1.292539e+05	21138.672749
Argentina	3.733565e+07	2.950722e+05	7903.229085
Uruguay	3.219793e+06	2.525596e+04	7843.970620
South Africa	4.506410e+07	2.272424e+05	5042.647686
India	1.006300e+09	1.728144e+06	1717.324719
Malawi	1.180150e+07	5.026222e+03	425.896679
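When only the top few rows are needed, nlargest gives them without sorting the whole table explicitly. A minimal sketch on a Series holding five of the GDP per capita values from the table above:

```python
import pandas as pd

# GDP per capita values copied from the table above.
percap = pd.Series(
    {"Argentina": 7903.229085, "Australia": 28436.433261,
     "India": 1717.324719, "Malawi": 425.896679,
     "United States": 35080.381854},
    name="GDP percap",
)

top_three = percap.nlargest(3)     # three largest values, largest first
ascending = percap.sort_values()   # smallest first (the default order)

print(top_three)
print(ascending)
```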

Let's plot it again:

df['GDP percap'].plot(kind='bar')
plt.show()
