Python for Data Analysis, Part 2: NumPy arrays, vectorized computation, indexing, slicing, transposition, axis swapping, and file input/output
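Introduction to NumPy
NumPy is a general-purpose array-processing package designed to efficiently manipulate large multidimensional arrays of arbitrary records without sacrificing too much speed for small multidimensional arrays. NumPy is built on the Numeric code base and adds the features introduced by numarray, an extended C-API, and the ability to create arrays of arbitrary type, which also makes NumPy suitable for interfacing with general-purpose database applications. It also provides basic facilities for discrete Fourier transforms, basic linear algebra, and random number generation. All numpy wheels distributed from PyPI are BSD licensed. The Windows wheels are linked against the ATLAS BLAS / LAPACK library and restricted to SSE2 instructions, so they may not give the best linear algebra performance on your machine. For alternatives, see http://docs.scipy.org/doc/numpy/user/install.html.
ndarray: a multidimensional array object
One of NumPy's most important features is its multidimensional array object, ndarray. An ndarray is a generic container for homogeneous data: all of its elements must have the same type. Every array has a shape, a tuple giving the size of each dimension, and a dtype, an object describing the data type of the array's elements.
Creating an ndarray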
>>> data1 = [2,3,4,5,6]
>>> import numpy as np
>>> arr1 = np.array(data1)
>>> print(arr1)
[2 3 4 5 6]
>>> arr1
array([2, 3, 4, 5, 6])
>>> data2 = [[1,2,3,4],[5,6,7,8]]
>>> arr2 = np.array(data2)
>>> arr2
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
>>> arr2.ndim
2
>>> arr2.shape
(2L, 4L)
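As a small sketch of how np.array infers the dtype from its input, the example below reuses the nested list from the transcript and adds a hypothetical mixed list (mixed is not in the original). The exact integer dtype depends on the platform, so it is printed rather than assumed.

```python
import numpy as np

# Nested lists of equal length become a two-dimensional array.
arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# A list mixing ints and floats is upcast to a floating-point dtype.
mixed = np.array([1, 2.5, 3])

print(arr2.ndim, arr2.shape)    # number of dimensions and size of each one
print(arr2.dtype, mixed.dtype)  # the integer dtype depends on the platform
```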
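Other array creation functions
Function | Description |
array | Converts the input (a list, tuple, array, or other sequence type) to an ndarray, either inferring the dtype or using an explicitly specified one; copies the input data by default |
asarray | Converts the input to an ndarray; does not copy if the input is already an ndarray |
arange | Like the built-in range, but returns an ndarray |
ones, ones_like | ones creates an array of all 1s with the given shape; ones_like takes another array and creates an array of all 1s with the same shape and dtype |
zeros, zeros_like | Like ones and ones_like, but produce arrays of all 0s |
empty, empty_like | Create a new array by allocating memory only, without filling in any values |
eye, identity | Create an N x N identity matrix |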
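The functions in the table can be tried out with a short sketch like the following. The variable names are made up for illustration. The last two lines show the copy behavior that distinguishes array from asarray.

```python
import numpy as np

zeros = np.zeros((2, 3))               # 2x3 array of 0.0
ones_like_zeros = np.ones_like(zeros)  # same shape and dtype, filled with 1.0
steps = np.arange(5)                   # array([0, 1, 2, 3, 4])
ident = np.eye(3)                      # 3x3 identity matrix

source = np.array([1, 2, 3])
copied = np.array(source)    # array() copies its input by default
same = np.asarray(source)    # asarray() hands back the input ndarray itself

print(copied is source, same is source)  # False True
```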
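Data types for ndarrays
A dtype holds the information needed to interpret an ndarray's block of memory as a particular data type. In most cases dtypes map directly onto the corresponding machine representation.
Type | Type code | Description |
int8, uint8 | i1, u1 | Signed and unsigned 8-bit (1-byte) integers |
int16, uint16 | i2, u2 | Signed and unsigned 16-bit (2-byte) integers |
int32, uint32 | i4, u4 | Signed and unsigned 32-bit (4-byte) integers |
int64, uint64 | i8, u8 | Signed and unsigned 64-bit (8-byte) integers |
float16 | f2 | Half-precision floating point |
float32 | f4 or f | Standard single-precision floating point |
float64 | f8 or d | Standard double-precision floating point |
float128 | f16 or g | Extended-precision floating point |
complex64, complex128, complex256 | c8, c16, c32 | Complex numbers represented by two 32-bit, 64-bit, or 128-bit floats, respectively |
bool | ? | Boolean type storing True and False values |
object | O | Python object type |
string_ | S | Fixed-length string type (1 byte per character) |
unicode_ | U | Fixed-length Unicode type (4 bytes per character) |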
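To make the table concrete, here is a minimal sketch of passing a dtype explicitly, either as a NumPy type or as a type code, and converting with astype. The arrays are invented for illustration.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype=np.int32)   # explicit dtype
floats = ints.astype(np.float64)             # astype returns a new array by default
from_code = np.array([1, 2, 3], dtype='i2')  # 'i2' is the type code for int16

print(ints.dtype, ints.itemsize)             # itemsize is the size in bytes
print(floats.dtype, floats.itemsize)
print(from_code.dtype, from_code.itemsize)

# Strings that hold numbers can be converted with astype as well.
numeric = np.array(['1.25', '-9.6', '42']).astype(np.float64)
```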
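Operations between arrays and scalars
Arrays are important because they let you run batch operations on data without writing any loops. This is usually called vectorization. Any arithmetic operation between arrays of equal size is applied element by element. Arithmetic between an array and a scalar propagates the scalar to every element.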
>>> arr = np.array([[1.,2.,3.],[4.,5.,6.]])
>>> arr
array([[1., 2., 3.],
[4., 5., 6.]])
>>> arr*arr
array([[ 1., 4., 9.],
[16., 25., 36.]])
>>> arr*3
array([[ 3., 6., 9.],
[12., 15., 18.]])
>>> arr**0.5
array([[1. , 1.41421356, 1.73205081],
[2. , 2.23606798, 2.44948974]])
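For comparison, the sketch below writes the element-wise square once as an explicit loop and once as a vectorized expression. It is only meant to show that the two give the same result. The names looped and vectorized are illustrative.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

# Explicit Python loop over every element
looped = np.empty_like(arr)
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        looped[i, j] = arr[i, j] * arr[i, j]

# Vectorized form: one expression, no Python-level loop
vectorized = arr * arr

print(np.array_equal(looped, vectorized))  # True
print(1 / arr)  # the scalar 1 is divided by each element in turn
```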
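Basic indexing and slicing
The example below creates a one-dimensional array and assigns a scalar to a slice, which broadcasts the value across the whole selection. A slice is a view on the original data: nothing is copied, and any modification to the view is reflected directly in the source array.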
>>> arr =np.arange(15)
>>> arr[3:7] = 0
>>> arr
array([ 0, 1, 2, 0, 0, 0, 0, 7, 8, 9, 10, 11, 12, 13, 14])
>>> array_slice = arr[4:6]
>>> array_slice[1] = 999
>>> arr
array([ 0, 1, 2, 0, 0, 999, 0, 7, 8, 9, 10, 11, 12, 13, 14])
>>> array_slice[:] = 888
>>> arr
array([ 0, 1, 2, 0, 888, 888, 0, 7, 8, 9, 10, 11, 12, 13, 14])
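When an independent copy is wanted instead of a view, the slice can be copied explicitly. The following sketch uses np.shares_memory to check which of the two objects still points at the original buffer. The names view and independent are illustrative.

```python
import numpy as np

arr = np.arange(10)

view = arr[2:5]                # a slice is a view on arr
independent = arr[2:5].copy()  # an explicit copy owns its own data

view[:] = -1
independent[:] = 99

print(arr)  # positions 2..4 are now -1; the 99s never reach arr
print(np.shares_memory(arr, view), np.shares_memory(arr, independent))
```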
Slicing and indexing in higher dimensions
>>> add2d=[[1,2,3],[4,5,6],[7,8,9]]
>>> arr2d = np.array(add2d)
>>> arr2d
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> arr2d[:2,1:]=0
>>> arr2d
array([[1, 0, 0],
[4, 0, 0],
[7, 8, 9]])
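A few more ways of addressing a two-dimensional array, sketched on a fresh copy of the same 3 x 3 data. The variable names are illustrative.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row = arr2d[1]         # second row, a one-dimensional view
elem_a = arr2d[0][2]   # index the row first, then the column
elem_b = arr2d[0, 2]   # equivalent comma-separated form
block = arr2d[:2, 1:]  # first two rows, last two columns
col = arr2d[:, 0]      # a bare colon selects an entire axis
```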
Boolean indexing
>>> names =np.array(['Bob','Joe','Will','Bob','Will','Joe','Joe'])
>>> data = np.random.randn(7,4)
>>> data
array([[-0.07293883, -0.7612633 , -0.29319602, -0.0042023 ],
[ 2.289825 , -0.79544618, -1.07545136, -0.90398504],
[-0.16304643, -0.32437501, -1.74858425, 0.98331551],
[ 1.38958392, -0.45864779, 0.84023555, -1.21870602],
[ 1.33682575, 1.06778095, 1.97012061, 0.1859616 ],
[ 0.59551277, -1.09129405, 1.1283531 , -1.65953415]])
>>> data[names == 'Bob']
array([[ 0.27565828, 0.76888988, 0.39861839, 1.17158988],
[ 0.8282542 , 1.32392267, -0.04900376, -0.08355354]])
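Because the transcript uses random data, the selected rows are hard to check by eye. The sketch below swaps in a deterministic array (an assumption made purely for illustration) so that the effect of the boolean mask is easy to verify.

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
# Deterministic stand-in for the random data: row i holds 4*i .. 4*i+3
data = np.arange(28).reshape((7, 4))

mask = names == 'Bob'  # one boolean per row
bob_rows = data[mask]  # rows 0 and 3
not_bob = data[~mask]  # ~ negates the mask
# Combine conditions with | and &; the keywords or/and do not work on arrays
bob_or_will = data[(names == 'Bob') | (names == 'Will')]
```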
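Fancy indexing
Fancy indexing uses the elements of an integer array or list as indices.
Unlike slicing, fancy indexing always copies the data into a new array.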
>>> arr = np.empty((8,4))
>>> for i in range(8):
... arr[i] = i
...
>>> arr
array([[0., 0., 0., 0.],
[1., 1., 1., 1.],
[2., 2., 2., 2.],
[3., 3., 3., 3.],
[4., 4., 4., 4.],
[5., 5., 5., 5.],
[6., 6., 6., 6.],
[7., 7., 7., 7.]])
>>> arr[[4,3,0,6]]
array([[4., 4., 4., 4.],
[3., 3., 3., 3.],
[0., 0., 0., 0.],
[6., 6., 6., 6.]])
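The copy semantics can be checked with a short sketch such as this one. It uses an array built with arange so that every element is distinct. The names are illustrative.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

picked = arr[[4, 3, 0, 6]]         # rows in the requested order
from_end = arr[[-1, -2]]           # negative indices count from the last row
pairs = arr[[1, 5, 7], [0, 3, 1]]  # elements (1,0), (5,3), (7,1)

picked[:] = 0                      # modifying the copy leaves arr untouched
print(arr[4])                      # still [16 17 18 19]
```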
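Transposing arrays and swapping axes
Transposing is a special form of reshaping. It returns a view on the source data and does not copy anything. Besides the transpose method, arrays also have the T attribute.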
>>> arr = np.arange(15).reshape((5,3))
>>> arr
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
[12, 13, 14]])
>>> arr = np.arange(15).reshape((3,5))
>>> arr
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> arr.T
array([[ 0, 5, 10],
[ 1, 6, 11],
[ 2, 7, 12],
[ 3, 8, 13],
[ 4, 9, 14]])
Computing the inner product of matrices
>>> arr = np.random.randn(6,3)
>>> arr
array([[ 0.79621331, -0.83430358, -0.80911319],
[ 0.03574342, 1.84386643, 0.0981496 ],
[ 2.57203239, 1.25346891, 0.75237162],
[-0.06301713, -0.64254258, -0.03910239],
[-0.26404073, 0.24409075, -1.62509342],
[-0.68939168, 0.89466497, -0.7718325 ]])
>>> np.dot(arr.T,arr)
array([[7.79953341, 1.98485182, 2.25805555],
[1.98485182, 6.93995684, 0.73701842],
[2.25805555, 0.73701842, 4.46854357]])
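The same product on a small deterministic matrix, as a sketch that can be checked by hand. It assumes Python 3.5 or later for the @ operator, which the transcript does not use.

```python
import numpy as np

arr = np.arange(6, dtype=float).reshape((3, 2))  # [[0, 1], [2, 3], [4, 5]]

gram = np.dot(arr.T, arr)  # shape (2, 2)
same = arr.T @ arr         # @ is the matrix-multiplication operator

print(gram)                          # symmetric, like the result in the transcript
print(np.shares_memory(arr, arr.T))  # True: the transpose is a view
```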
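The transpose function
The tuple argument gives the new order of the axes. (0,1,2) is the original order, so the array is unchanged; (1,0,2) swaps the first two axes. The axis numbers refer to positions in the shape (2,2,4) passed to reshape: axis 0 has 2 elements, axis 1 has 2 elements, and axis 2 has 4 elements.
In the original array, the index triples 000 001 002 003 010 011 ... hold the values 0 1 2 3 4 5 ...
After transpose((1,0,2)), the elements appear in this order of original indices:
000 001 002 003
100 101 102 103
010 011 012 013
110 111 112 113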
>>> arr = np.arange(16).reshape((2,2,4))
>>> arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
>>> arr.transpose((0,1,2))
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15]]])
>>> arr.transpose((0,2,1))
array([[[ 0, 4],
[ 1, 5],
[ 2, 6],
[ 3, 7]],
[[ 8, 12],
[ 9, 13],
[10, 14],
[11, 15]]])
>>> arr.transpose((1,0,2))
array([[[ 0, 1, 2, 3],
[ 8, 9, 10, 11]],
[[ 4, 5, 6, 7],
[12, 13, 14, 15]]])
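One way to convince yourself of what the axis tuple means is to check the index mapping directly, as in this sketch. The comparison with swapaxes is an addition for illustration.

```python
import numpy as np

arr = np.arange(16).reshape((2, 2, 4))
swapped = arr.transpose((1, 0, 2))

# Every element moves from position (i, j, k) to position (j, i, k).
for i in range(2):
    for j in range(2):
        for k in range(4):
            assert swapped[j, i, k] == arr[i, j, k]

# Swapping two axes can also be written with swapaxes.
same_as_swapaxes = np.array_equal(swapped, arr.swapaxes(0, 1))
print(arr.transpose((0, 2, 1)).shape)  # (2, 4, 2)
```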
Universal functions: fast element-wise array functions
>>> import numpy as np
>>> arr = np.arange(10)
>>> arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.sqrt(arr)
array([0. , 1. , 1.41421356, 1.73205081, 2. ,
2.23606798, 2.44948974, 2.64575131, 2.82842712, 3. ])
>>> np.exp(arr)
array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
2.98095799e+03, 8.10308393e+03])
Element-wise maximum of two arrays
>>> x = np.random.randn(8)
>>> y = np.random.randn(8)
>>> x
array([-0.26723909, -1.65511147, -0.04990455, -0.42501926, 0.97194785,
0.39073749, 1.06175327, 0.5585866 ])
>>> y
array([ 1.01009041, -1.62671653, 0.23241848, 0.80207752, -0.96744722,
0.16301932, 1.06355945, -0.57478033])
>>> np.maximum(x,y)
array([ 1.01009041, -1.62671653, 0.23241848, 0.80207752, 0.97194785,
0.39073749, 1.06355945, 0.5585866 ])
Returning the fractional and integer parts
>>> arr = np.random.randn(7)*5
>>> arr
array([-7.66209967, 5.64896005, 3.55067973, -8.23018358, 0.16836742,
5.2033361 , 4.0949317 ])
>>> np.modf(arr)
(array([-0.66209967, 0.64896005, 0.55067973, -0.23018358, 0.16836742,
0.2033361 , 0.0949317 ]), array([-7., 5., 3., -8., 0., 5., 4.]))
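Other functions
Unary ufuncs take a single array, like np.sqrt, np.exp, and np.modf above.
Binary ufuncs take two arrays, like np.maximum above.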
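A few more ufuncs of each kind, as a hedged sampler rather than a complete list. The input arrays are invented for illustration.

```python
import numpy as np

x = np.array([-1.5, 0.0, 2.25])
y = np.array([2.0, -3.0, 1.0])

# Unary ufuncs operate on one array
absolute = np.abs(x)
squared = np.square(x)
signs = np.sign(x)

# Binary ufuncs operate on two arrays
total = np.add(x, y)        # same as x + y
smaller = np.minimum(x, y)  # element-wise minimum
greater = np.greater(x, y)  # same as x > y
```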
Data processing using arrays
>>> import numpy as np
>>> points = np.arange(-5,5,0.01)
>>> xs,ys = np.meshgrid(points,points)
>>> xs
array([[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
...,
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99],
[-5. , -4.99, -4.98, ..., 4.97, 4.98, 4.99]])
>>> ys
array([[-5. , -5. , -5. , ..., -5. , -5. , -5. ],
[-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
[-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
...,
[ 4.97, 4.97, 4.97, ..., 4.97, 4.97, 4.97],
[ 4.98, 4.98, 4.98, ..., 4.98, 4.98, 4.98],
[ 4.99, 4.99, 4.99, ..., 4.99, 4.99, 4.99]])
>>> import matplotlib.pyplot as plt
Backend TkAgg is interactive backend. Turning interactive mode on.
>>> z = np.sqrt(xs **2+ys **2)
>>> plt.imshow(z,cmap=plt.cm.gray);plt.colorbar()
>>> plt.title(r"Image plot of $\sqrt{x^2+y^2}$ for a grid of values")
Text(0.5,1,'Image plot of $\\sqrt{x^2+y^2}$ for a grid of values')
Expressing conditional logic as array operations
>>> xarr = np.arange(1.1,1.6,0.1)
>>> yarr = np.arange(2.1,2.6,0.1)
>>> cond = np.array([True,False,True,True,False])
>>> result = np.where(cond,xarr,yarr)
>>> result
array([1.1, 2.2, 1.3, 1.4, 2.5])
>>> arr = np.random.randn(4,4)
>>> arr
array([[ 0.23409964, -0.05202383, -0.14870559, 0.79573558],
[ 0.55759966, 1.77630082, -1.02818888, 1.9391484 ],
[-1.61633658, 1.22474876, -1.43399786, -1.06554536],
[-0.59160076, -0.97505352, -0.17524749, -1.0561203 ]])
>>> np.where(arr>0,'+','-')
array([['+', '-', '-', '+'],
['+', '+', '-', '+'],
['-', '+', '-', '-'],
['-', '-', '-', '-']], dtype='|S1')
>>> np.where(arr>0,'+',arr)
array([['+', '-0.05202383380539802', '-0.1487055947867779','+'],
['+', '+', '-1.0281888793706875', '+'],
['-1.6163365764782625', '+', '-1.4339978621839753','-1.0655453648238764'],
['-0.5916007573384776', '-0.9750535166810507','-0.17524749337499476', '-1.0561202973574124']], dtype='|S32')
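The last call in the transcript mixes a string with a float array, so NumPy converts every element to a string. The sketch below shows the pure-Python equivalent of np.where and a variant that keeps the result numeric by replacing positive values with 2.0. The small arr here is a made-up example.

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

# Pure-Python equivalent: slower on large arrays and limited to one dimension
looped = [x if c else y for x, y, c in zip(xarr, yarr, cond)]
vectorized = np.where(cond, xarr, yarr)

# Scalars and arrays can be mixed: positive values become 2.0, the rest stay
arr = np.array([[0.5, -1.0], [-2.0, 3.0]])
replaced = np.where(arr > 0, 2.0, arr)
```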
Mathematical and statistical methods
>>> arr = np.random.randn(5,4)
>>> arr
array([[ 1.08258166, -1.3217334 , -0.52350247, 0.03050462],
[ 0.11465261, 0.96859544, -0.46915886, -0.64741787],
[-0.71023988, -1.86821324, 0.96951363, 0.75352188],
[ 0.21555276, -1.61668268, 0.87062487, 0.82324383],
[-1.39872473, 1.23463811, -0.12252616, 0.07202626]])
>>> arr.mean()
-0.07713718068750831
>>> arr.sum()
-1.5427436137501662
>>> arr.mean(axis=1)
array([-0.1830374 , -0.00833217, -0.2138544 , 0.0731847 , -0.05364663])
>>> arr.sum(0)
array([-0.69617757, -2.60339578, 0.72495101, 1.03187872])
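The axis argument is easier to follow on a small array whose sums can be done by hand. This sketch assumes nothing beyond the methods used in the transcript, plus cumsum as an extra illustration.

```python
import numpy as np

arr = np.arange(6).reshape((2, 3))  # [[0, 1, 2], [3, 4, 5]]

total = arr.sum()             # over all elements
col_sums = arr.sum(axis=0)    # collapse the rows: one value per column
row_means = arr.mean(axis=1)  # collapse the columns: one value per row
running = arr.cumsum()        # cumulative sum over the flattened array
```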
Methods for boolean arrays
>>> arr = np.random.randn(100)
>>> arr
array([ 0.34073972, 0.34988851, -0.24126835, -0.42443041, -0.82233812,
-1.21461717, 0.70067547, 0.27200361, -0.78803519, 2.72967498,
0.23312249, 1.18763919, 0.55894897, 2.53258942, 0.36844006,
-0.67321937, 0.49786976, 1.31297101, 0.27737939, 0.39658457,
0.43270061, 1.36756408, 0.52557057, -0.38479557, -0.54033742,
2.36014817, 0.38723984, 1.39320484, -2.14569269, -1.43343552,
0.44446276, -0.42993059, 0.56459971, 0.83332985, 0.98949477,
2.60815978, 1.26375065, -0.88059805, -1.14111095, -1.65499809,
0.63864394, 0.47778961, 0.26342211, -1.76634124, -0.26068543,
0.5670814 , 1.04007051, -0.80613633, -0.32673813, 0.9117205 ,
-0.75458016, 1.25012221, 0.69612343, 1.06615896, 0.13390071,
-0.454111 , 0.14655905, -0.4580414 , 0.07454767, -0.27025394,
-1.04844553, 1.57240204, 1.18913241, -0.78432448, -0.43894174,
0.66986533, 0.20814651, 0.92518062, 0.12918228, 0.27310124,
1.1493472 , 0.85226379, 0.03587044, 0.05448845, 0.82835153,
1.20158862, -2.5518186 , 0.00477461, -2.04586305, -0.67640765,
-0.34065765, -2.03171558, -0.67235383, -1.09601531, -1.89471508,
1.19177494, 0.23241942, 0.34659145, 0.3189491 , -1.78125371,
-0.40714885, 1.07899036, -0.42497074, -2.30353161, 0.63488171,
0.72633715, -0.95954112, 1.3100279 , 1.3475652 , -0.19139045])
>>> (arr>0).sum()
61
>>> bools = np.array([True,False])
>>> bools.all()
False
>>> bools.any()
True
Sorting
>>> arr = np.random.rand(8)
>>> arr.sort()
>>> arr
array([0.21615835, 0.38790504, 0.48986001, 0.62345955, 0.72247371,
0.76378606, 0.8537614 , 0.98389717])
>>> arr = np.random.rand(5,3)
>>> arr
array([[0.20842559, 0.62874868, 0.09412693],
[0.74471062, 0.8824011 , 0.07132945],
[0.55527621, 0.08276499, 0.68830341],
[0.89926682, 0.55918536, 0.57398518],
[0.3620882 , 0.50525962, 0.14761893]])
>>> arr.sort()
>>> arr
array([[0.09412693, 0.20842559, 0.62874868],
[0.07132945, 0.74471062, 0.8824011 ],
[0.08276499, 0.55527621, 0.68830341],
[0.55918536, 0.57398518, 0.89926682],
[0.14761893, 0.3620882 , 0.50525962]])
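The sort method works in place and, by default, along the last axis, which is why each row above ends up sorted. A sketch contrasting it with np.sort, which returns a copy, on a small invented array:

```python
import numpy as np

arr = np.array([[9, 1, 8], [3, 7, 2]])

sorted_copy = np.sort(arr)        # sorted copy along the last axis; arr unchanged
by_column = np.sort(arr, axis=0)  # sort down each column instead

arr.sort(axis=1)                  # the method sorts arr itself, row by row
```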
Unique and other set logic
>>> name = np.array(['Bob','Jon','Tom','Bob','Bob'])
>>> np.unique(name)
array(['Bob', 'Jon', 'Tom'], dtype='|S3')
>>> ints = np.array([1,1,2,3,4,4,5,5])
>>> np.unique(ints)
array([1, 2, 3, 4, 5])
>>> value = np.array([2,3,4,5,6,7])
>>> np.in1d(value,[3,5])
array([False, True, False, True, False, False])
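Some related set routines, sketched on the same values. np.isin is used here in place of np.in1d as an assumption about what newer NumPy releases prefer; for one-dimensional input the two give the same result.

```python
import numpy as np

values = np.array([2, 3, 4, 5, 6, 7])
others = np.array([3, 5, 9])

membership = np.isin(values, others)      # True where a value occurs in others
common = np.intersect1d(values, others)   # sorted values present in both
combined = np.union1d(values, others)     # sorted union
only_left = np.setdiff1d(values, others)  # in values but not in others
```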
File input and output with arrays
Saving arrays to disk in binary format
>>> arr = np.arange(10)
>>> arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.save(r'D:\python\DataAnalysis\some_arr',arr)
>>> np.load(r'D:\python\DataAnalysis\some_arr.npy')
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Reading and writing text files
>>> arr = np.loadtxt(r'D:\python\DataAnalysis\data\1.txt',delimiter=',')
>>> arr
array([1., 2., 3., 4., 5., 6., 7.])
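The paths in the transcript are specific to one Windows machine. As a portable sketch, the round trip can be tried in a temporary directory. The file names and the table array are made up for illustration, and np.savez and np.savetxt are additions not shown in the transcript.

```python
import os
import tempfile

import numpy as np

arr = np.arange(10)
table = np.array([[1.0, 2.0], [3.0, 4.0]])

# A temporary directory stands in for a real data folder.
with tempfile.TemporaryDirectory() as folder:
    npy_path = os.path.join(folder, 'some_arr.npy')
    np.save(npy_path, arr)              # binary .npy file
    loaded = np.load(npy_path)

    npz_path = os.path.join(folder, 'archive.npz')
    np.savez(npz_path, a=arr, b=table)  # several arrays in one archive
    with np.load(npz_path) as archive:
        loaded_b = archive['b']

    txt_path = os.path.join(folder, 'table.txt')
    np.savetxt(txt_path, table, delimiter=',')  # plain text, comma separated
    from_text = np.loadtxt(txt_path, delimiter=',')
```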
Linear algebra
>>> x = np.array([[1,2,3],[4,5,6]])
>>> y = np.array([[7,8],[9,10],[11,12]])
>>> np.dot(x,y)
array([[ 58, 64],
[139, 154]])
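Beyond np.dot, the numpy.linalg module covers the usual matrix operations. A minimal sketch follows; the matrix m is an invented example, and the @ operator assumes Python 3.5 or later.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([[7, 8], [9, 10], [11, 12]])

product = x @ y           # same as np.dot(x, y)

m = np.array([[2.0, 1.0], [1.0, 3.0]])
m_inv = np.linalg.inv(m)  # matrix inverse
det = np.linalg.det(m)    # determinant: 2*3 - 1*1 = 5
# Solve the linear system m @ solution = [3, 5]
solution = np.linalg.solve(m, np.array([3.0, 5.0]))
```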
Random number generation
>>> sample = np.random.normal(size=(4,4))
>>> sample
array([[ 0.77132371, 1.15235977, -0.28535321, -0.58087207],
[-0.09853563, -0.78486528, 0.24612461, 1.22643528],
[ 0.13219711, -2.65805317, 0.05154038, 2.10351203],
[-1.65812602, 0.66330672, 1.62199991, 0.29079451]])
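Random output like the sample above differs on every run. When reproducible numbers are needed, one option is a seeded generator, sketched here with np.random.default_rng. This assumes NumPy 1.17 or later; the seed value is arbitrary.

```python
import numpy as np

rng_a = np.random.default_rng(12345)  # the seed value is arbitrary
rng_b = np.random.default_rng(12345)

sample_a = rng_a.normal(size=(4, 4))
sample_b = rng_b.normal(size=(4, 4))  # same seed, same numbers

uniform = rng_a.random(5)             # floats in [0, 1)
dice = rng_a.integers(1, 7, size=10)  # integers from 1 to 6 inclusive

print(np.array_equal(sample_a, sample_b))  # True
```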