Python Data Analysis 3

阿新 • Published: 2018-01-09


Section overview

    Introduction to pandas

Installation

pip install pandas

The two main pandas data structures: DataFrame and Series

Series

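A Series is a one-dimensional array-like object made up of a sequence of values and an associated array of data labels (its index). The simplest Series is built from just an array of data:

import pandas as pd
from pandas import Series, DataFrame
obj = Series([4, 7, 9, -1])
print(obj)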

0    4
1    7
2    9
3   -1
dtype: int64

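The string representation of a Series shows the index on the left and the values on the right. If no index is specified, an integer index running from 0 to N-1 is created automatically.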

obj = Series([4, 7, 9, -1])
print(obj.values)
print(obj.index)

[ 4  7  9 -1]
RangeIndex(start=0, stop=4, step=1)

Creating a Series with a custom index

obj = Series([4, 7, 9, -1], index=['a', 'b', 'c', 'd'])
print(obj)
print(obj.index)

a    4
b    7
c    9
d   -1
dtype: int64
Index(['a', 'b', 'c', 'd'], dtype='object')

Accessing values by index label

obj['a']           ==> 4
obj['c']           ==> 9
obj[['a', 'd']]    ==> a Series holding 4 and -1 (note the list of labels)

NumPy-style array operations preserve the link between each index label and its value:

obj[obj>2]

a    4
b    7
c    9
dtype: int64

obj*2

a     8
b    14
c    18
d    -2
dtype: int64

A Series can be thought of as an ordered dict, because it maps index labels to values. It can be used in many places where a function expects a dict:

'b' in obj        ==> True
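The dict-like behaviour described above can be sketched a little further. This is a minimal illustration, not part of the original post; the variable names and the missing label "z" are assumptions made up for the example.

```python
import pandas as pd

# Illustrative data matching the Series used earlier in the text.
s = pd.Series([4, 7, 9, -1], index=["a", "b", "c", "d"])

# Membership tests look at the index labels, like dict keys.
has_b = "b" in s      # True: "b" is a label
has_9 = 9 in s        # False: 9 is a value, not a label

# dict-style lookup with a default for a label that does not exist.
missing = s.get("z", 0)

# Convert back to a plain Python dict.
as_dict = s.to_dict()
print(has_b, has_9, missing, as_dict)
```

Note that `in` checks labels only; to test for a value, compare against `s.values` instead.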

If the data is already stored in a Python dict, a Series can be created from it directly:

dict_obj = {"a":100,"b":20,"c":50,"d":69}
obj = Series(dict_obj)
obj

a    100
b     20
c     50
d     69
dtype: int64

If a dict is passed together with an index list:

dict_obj = {"a":100,"b":20,"c":50,"d":69}
states = ['LA', 'b', 'a', 'NY']
obj = Series(dict_obj, index=states)
obj

LA      NaN
b      20.0
a     100.0
NY      NaN
dtype: float64

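Keys that match an index label are placed in the corresponding positions, while labels with no match are marked as missing with NaN (not a number). The pandas isnull and notnull functions detect missing data: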

pd.isnull(obj)

LA     True
b     False
a     False
NY     True
dtype: bool

Series also provides these as instance methods:

obj.isnull()

LA     True
b     False
a     False
NY     True
dtype: bool
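As a small follow-up to the missing-data checks above, the boolean result of notnull can be used as a filter, and isnull can be summed to count gaps. This is a hedged sketch reusing the example data from the text; the variable names `present` and `n_missing` are made up for illustration.

```python
import pandas as pd

data = {"a": 100, "b": 20, "c": 50, "d": 69}
s = pd.Series(data, index=["LA", "b", "a", "NY"])

# Keep only the labels that found a match in the dict.
present = s[s.notnull()]

# True counts as 1, so summing the mask counts the missing entries.
n_missing = int(s.isnull().sum())

print(present)
print(n_missing)
```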

One of the most important Series features is that it automatically aligns data by index label in arithmetic operations:

dict_obj = {"a":100,"b":20,"c":50,"d":69}
dict_obj1 = {"e":100,"b":20,"c":50,"f":69}

obj = Series(dict_obj)
obj1 = Series(dict_obj1)

obj+obj1

a      NaN
b     40.0
c    100.0
d      NaN
e      NaN
f      NaN
dtype: float64
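When the NaN results of automatic alignment are not wanted, one option is the add method with fill_value, which treats a label missing from one side as that value. The sketch below reuses the two dicts from the text; `total` is a name chosen for illustration.

```python
import pandas as pd

s1 = pd.Series({"a": 100, "b": 20, "c": 50, "d": 69})
s2 = pd.Series({"e": 100, "b": 20, "c": 50, "f": 69})

# Plain + yields NaN wherever a label exists on only one side.
plain = s1 + s2

# add(..., fill_value=0) substitutes 0 for the side that lacks the label.
total = s1.add(s2, fill_value=0)

print(plain)
print(total)
```

Whether 0 is the right stand-in depends on the data; for quantities that should not be assumed zero, keeping the NaN is the safer choice.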

The name attribute of Series objects

obj.name = 'qty'
obj.index.name = 'types'
obj


types
a    100
b     20
c     50
d     69
Name: qty, dtype: int64

A Series index can be modified in place by assignment:

obj.index = ['dandy', 'renee', 'Jeff', 'Steve']
obj

dandy    100
renee     20
Jeff      50
Steve     69
Name: qty, dtype: int64

DataFrame

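A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type. A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that all share the same index.

Constructing a DataFrame

data = {'states': ['NY', 'LA', 'CA', 'BS', 'CA'],
        'year': [2000, 2001, 2002, 2001, 2000],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame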


   pop states  year
0  1.5     NY  2000
1  1.7     LA  2001
2  3.6     CA  2002
3  2.4     BS  2001
4  2.9     CA  2000

Specifying the column order

frame2 = DataFrame(data, columns=['year', 'pop', 'states', 'test'])
frame2
frame2.columns

   year  pop states test
0  2000  1.5     NY  NaN
1  2001  1.7     LA  NaN
2  2002  3.6     CA  NaN
3  2001  2.4     BS  NaN
4  2000  2.9     CA  NaN

Index(['year', 'pop', 'states', 'test'], dtype='object')
# A column that does not exist in the data is filled with NaN

Selecting values:

# Two ways to retrieve a single column
frame2['states']
frame2.year


0    NY
1    LA
2    CA
3    BS
4    CA
Name: states, dtype: object

0    2000
1    2001
2    2002
3    2001
4    2000
Name: year, dtype: int64
# Each of these returns a Series
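The two access styles above are not fully interchangeable. Attribute access only works when the column name does not collide with an existing DataFrame attribute, and the 'pop' column used in this example happens to collide with the DataFrame.pop method. A minimal sketch, with a shortened made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

# 'year' is not a DataFrame attribute, so attribute access returns the column.
year_attr = frame.year

# Bracket access always returns the column, whatever its name.
pop_col = frame["pop"]

# frame.pop is the DataFrame.pop method, not the 'pop' column.
pop_attr_is_method = callable(frame.pop)

print(type(pop_col).__name__, pop_attr_is_method)
```

For that reason bracket access is the safer default in scripts, with attribute access kept for interactive exploration.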

# Specify the row index (assigned back to frame2 so the following examples can use it)
frame2 = DataFrame(data, columns=['year', 'pop', 'states', 'test'], index=['one', 'two', 'three', 'four', 'five'])

       year  pop states test
one    2000  1.5     NY  NaN
two    2001  1.7     LA  NaN
three  2002  3.6     CA  NaN
four   2001  2.4     BS  NaN
five   2000  2.9     CA  NaN

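Selecting a row by label (the ix indexer has been removed from pandas; loc is the label-based replacement)
frame2.loc['three']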

year      2002
pop        3.6
states      CA
test       NaN
Name: three, dtype: object

Columns can be modified by assignment
frame2['test'] = '11'

       year  pop states test
one    2000  1.5     NY   11
two    2001  1.7     LA   11
three  2002  3.6     CA   11
four   2001  2.4     BS   11
five   2000  2.9     CA   11

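Column operations:

When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, it is aligned exactly to the DataFrame's index, and any gaps are filled with missing values.

val = Series([-1, -2, 3], index=['two', 'one', 'three'])
frame2['test'] = val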

frame2

       year  pop states  test
one    2000  1.5     NY  -2.0
two    2001  1.7     LA  -1.0
three  2002  3.6     CA   3.0
four   2001  2.4     BS   NaN
five   2000  2.9     CA   NaN
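The difference between assigning a list and assigning a Series can be checked directly. The sketch below uses a shortened, made-up frame; it assumes a pandas version in which a length mismatch raises ValueError.

```python
import pandas as pd

frame = pd.DataFrame({"year": [2000, 2001, 2002]},
                     index=["one", "two", "three"])

# A list needs exactly one value per row; two values for three rows fails.
try:
    frame["test"] = [1, 2]
    length_error = False
except ValueError:
    length_error = True

# A Series is matched by label instead; rows without a label get NaN.
frame["test"] = pd.Series([-1, -2], index=["two", "one"])

print(length_error)
print(frame)
```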

Assigning to a column that does not exist creates a new column. The del keyword deletes a column, much as it does with a Python dict.

frame2['test1'] = frame2.test.notnull()
frame2

       year  pop states  test  test1
one    2000  1.5     NY  -2.0   True
two    2001  1.7     LA  -1.0   True
three  2002  3.6     CA   3.0   True
four   2001  2.4     BS   NaN  False
five   2000  2.9     CA   NaN  False

del frame2['test1']
frame2

       year  pop states  test
one    2000  1.5     NY  -2.0
two    2001  1.7     LA  -1.0
three  2002  3.6     CA   3.0
four   2001  2.4     BS   NaN
five   2000  2.9     CA   NaN

Creating a DataFrame from a nested dict

pop = {
    "dandy":{"age":18, "gender":"male"},
    "elina": {"age": 16, "gender": "female"},
    "renee": {"age": 16, "gender": "female"},
    "taylor": {"age": 18, "gender": "female"},
}
frame3 = DataFrame(pop)

frame3

       dandy   elina   renee  taylor
age       18      16      16      18
gender  male  female  female  female

frame3.T  # transpose

       age  gender
dandy   18    male
elina   16  female
renee   16  female
taylor  18  female

Creating a DataFrame from a dict of Series:

pdata = {'dandy': frame3['dandy'][:-1],
         'elina': frame3['elina']}
frame4 = DataFrame(pdata)
frame4

       dandy   elina
age       18      16
gender   NaN  female

Setting the name attributes of the index and columns

frame3.index.name = 'detail'
frame3.columns.name = 'name'

frame3

name   dandy   elina   renee  taylor
detail                              
age       18      16      16      18
gender  male  female  female  female

The values attribute

frame3.values  # returns the data in the DataFrame as a two-dimensional ndarray

[[18 16 16 18]
 ['male' 'female' 'female' 'female']]

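Index objects

A pandas Index object holds the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index internally. Index objects are immutable.

obj = Series(range(3), index=['a', 'b', 'c'])
Index = obj.index
Index[0]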

a

Attempting an assignment such as Index[0] = 'x' raises a TypeError, because an Index does not support mutable operations.

Because an Index is immutable, it can be shared safely among multiple data structures:

import numpy as np
Index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2, 0], index=Index)

obj2.index is Index

True
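The immutability that makes this sharing safe can be observed with a short check. This is a sketch with made-up names (`idx`, `s1`, `s2`); it assumes a pandas version where item assignment on an Index raises TypeError.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])

# Item assignment on an Index is rejected.
try:
    idx[0] = "x"
    raised = False
except TypeError:
    raised = True

# Two Series built on the same Index always agree on their labels,
# since neither can change the labels underneath the other.
s1 = pd.Series([1, 2, 3], index=idx)
s2 = pd.Series([4, 5, 6], index=idx)
same_labels = s1.index.equals(s2.index)

print(raised, same_labels)
```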

Besides looking like an array, an Index also behaves like a fixed-size set:

pop = {
    "dandy":{"age":18, "gender":"male"},
    "elina": {"age": 16, "gender": "female"},
    "renee": {"age": 16, "gender": "female"},
    "taylor": {"age": 18, "gender": "female"},
}
frame3 = DataFrame(pop)
'dandy' in frame3.columns

True
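The set-like side of Index goes beyond membership tests. The sketch below shows intersection, union, and difference on two small Index objects; the second Index and the label "steve" are assumptions invented for the example.

```python
import pandas as pd

cols = pd.Index(["dandy", "elina", "renee", "taylor"])
other = pd.Index(["renee", "taylor", "steve"])  # hypothetical second set of labels

common = cols.intersection(other)   # labels present in both
combined = cols.union(other)        # labels present in either
only_cols = cols.difference(other)  # labels in cols but not in other

print(list(common))
print(list(combined))
print(list(only_cols))
```

Each call returns a new Index and leaves both inputs unchanged.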


Basic functionality

obj = Series([4, 6, 9.9, 7], index=['a', 'v', 'b', 'd'])
obj

a    4.0
v    6.0
b    9.9
d    7.0
dtype: float64

reindex

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'v'])  # labels missing from the original index get NaN

obj2

a    4.0
b    9.9
c    NaN
d    7.0
v    6.0
dtype: float64


# Use fill_value=0 to fill the missing entries instead
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'v'], fill_value=0)
obj2

a    4.0
b    9.9
c    0.0
d    7.0
v    6.0
dtype: float64

# The method option interpolates while reindexing; 'ffill' carries the previous value forward
obj = Series(['aa', 'bb', 'cc', 'dd'], index=[0, 2, 4, 6])
obj2 = obj.reindex(range(7), method='ffill')
obj2

0    aa
1    aa
2    bb
3    bb
4    cc
5    cc
6    dd
dtype: object
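For comparison with the forward fill above, the same reindex can be run with method='bfill', which takes the next valid value instead of the previous one. A minimal sketch using the same data:

```python
import pandas as pd

s = pd.Series(["aa", "bb", "cc", "dd"], index=[0, 2, 4, 6])

# ffill: each new label takes the value of the nearest earlier label.
forward = s.reindex(range(7), method="ffill")

# bfill: each new label takes the value of the nearest later label.
backward = s.reindex(range(7), method="bfill")

print(forward.tolist())
print(backward.tolist())
```

Both options require the original index to be sorted (monotonic), which is the case here.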


frame = DataFrame(np.arange(9).reshape((3,3)), index=['a', 'b', 'c'], columns=['Ohio', 'Texas', 'California'])
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame

   Ohio  Texas  California
a     0      1           2
b     3      4           5
c     6      7           8

frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   3.0    4.0         5.0
c   6.0    7.0         8.0
d   NaN    NaN         NaN

text = ['Texas', 'LA', 'California']
frame3 = frame.reindex(columns=text)
frame3

   Texas  LA  California
a      1 NaN           2
b      4 NaN           5
c      7 NaN           8

Rows and columns can both be reindexed, but interpolation is applied along the rows only (axis 0):

text = ['Texas', 'LA', 'California']
frame.reindex(index=['a', 'b', 'c', 'd', 'e', 'f'], method='ffill').reindex(columns=text)

   Texas  LA  California
a      1 NaN           2
b      4 NaN           5
c      7 NaN           8
d      7 NaN           8
e      7 NaN           8
f      7 NaN           8

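Reindexing both axes by label can be written more concisely in a single call. The ix indexer originally used for this has been removed from pandas, so pass index and columns to reindex instead:

frame.reindex(index=['a', 'b', 'c', 'd'], columns=text)

   Texas  LA  California
a    1.0 NaN         2.0
b    4.0 NaN         5.0
c    7.0 NaN         8.0
d    NaN NaN         NaN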


Dropping entries from an axis

The drop method returns a new object with the specified values removed from the given axis:

obj = Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj

a    0
b    1
d    3
e    4
dtype: int32

With a DataFrame, index values can be dropped from either axis:

data = DataFrame(np.arange(16).reshape(4,4),
                 index=['LA', 'UH', 'NY', 'BS'],
                 columns=['one', 'two', 'three', 'four'])
data.drop(['LA', 'BS'])

    one  two  three  four
UH    4    5      6     7
NY    8    9     10    11

# For columns, pass axis=1
data.drop(['one', 'three'], axis=1)

    two  four
LA    1     3
UH    5     7
NY    9    11
BS   13    15
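Because drop returns a new object, the original DataFrame is left untouched. The sketch below checks this, and also uses the index= and columns= keywords as an alternative to axis; the name `trimmed` is made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["LA", "UH", "NY", "BS"],
                    columns=["one", "two", "three", "four"])

# Drop two rows and one column in a single call.
trimmed = data.drop(index=["LA", "BS"], columns=["one"])

print(trimmed)
print(data.shape)  # the original still has all rows and columns
```

Assigning the result back (data = data.drop(...)) is the usual way to make the removal stick.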

Indexing, selection, and filtering

Series indexing works much like NumPy array indexing, except that Series index labels need not be integers.

obj = Series(np.arange(4), index=['a', 'b', 'c', 'd'])
obj

a    0
b    1
c    2
d    3
dtype: int32

obj['b']  # or obj.iloc[1] to select by integer position
1


obj[2:4]  # or obj[['c', 'd']]
c    2
d    3
dtype: int32

obj.iloc[[1, 3]]  # select by integer position
b    1
d    3
dtype: int32


obj[obj>2]
d    3
dtype: int32
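One difference from NumPy worth keeping in mind with the selections above: slicing by label includes the end label, while slicing by position excludes the stop position. A minimal sketch using loc and iloc to make the two cases explicit:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=["a", "b", "c", "d"])

# Label slice: both endpoints "b" and "c" are included.
by_label = s.loc["b":"c"]

# Position slice: position 1 is included, position 2 is not.
by_pos = s.iloc[1:2]

print(by_label.tolist())
print(by_pos.tolist())
```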
