Python: Plotting with pandas (Summary): Plotting Tools
Python for Data Analysis
Chapter 8: Plotting and Visualization
pandas plotting tools
>>> from pandas.plotting import scatter_matrix
>>> from pandas import Series, DataFrame
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
1,Scatter Matrix Plot
These functions can be imported from pandas.plotting. Each of them takes a Series or DataFrame as its argument.
>>> df = pd.DataFrame(np.random.randn(1000, 4), columns=['a', 'b', 'c', 'd'])
>>> scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal='kde')
>>> plt.show()
This produces a 4x4 grid of 16 subplots: the panels on the diagonal are density plots and all the others are scatter plots.
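The grid layout described above can be checked from the return value. The following is a minimal sketch, assuming a headless environment (hence the Agg backend) and an arbitrary seed. It uses diagonal='hist' instead of 'kde' so that it runs without SciPy; scatter_matrix returns an array of matplotlib Axes, one per panel.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so use a non-interactive backend
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable
df = pd.DataFrame(rng.standard_normal((200, 4)), columns=["a", "b", "c", "d"])

# One Axes object per panel, arranged as a 2-D NumPy array.
axes = scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal="hist")
n_rows, n_cols = axes.shape
print(n_rows, n_cols)
```

Each entry of axes can be customized afterwards like any other matplotlib Axes, for example to change a label on a single panel.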
2,Density Plot
You can create density plots using the Series.plot.kde() and DataFrame.plot.kde() methods.
np.random.randn(1000) draws 1000 samples from the standard normal distribution, so the estimated density should resemble a bell curve.
>>> ser = pd.Series(np.random.randn(1000))
>>> ser.plot.kde()
This draws a density curve that approximates the normal distribution.
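To see what the curve represents numerically, here is a small sketch that evaluates a Gaussian kernel density estimate directly with scipy.stats.gaussian_kde, which as far as I know is what the pandas KDE plot relies on. The grid range and the seed are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
samples = rng.standard_normal(1000)

kde = gaussian_kde(samples)            # bandwidth is chosen automatically
grid = np.linspace(-6.0, 6.0, 1201)    # assumed evaluation range
density = kde(grid)

dx = grid[1] - grid[0]
area = float(density.sum() * dx)        # a density should integrate to about 1
peak_x = float(grid[density.argmax()])  # the peak should sit near the mean, 0
print(area, peak_x)
```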
3,Andrews Curves
Andrews curves allow one to plot multivariate data as a large number of curves that are created using the attributes of samples as coefficients for Fourier series. By coloring these curves differently for each class it is possible to visualize data clustering. Curves belonging to samples of the same class will usually be closer together and form larger structures.
Use the andrews_curves() function to draw them.
>>> from pandas.plotting import andrews_curves
>>> df=DataFrame(np.random.rand(10,10), columns=range(1,11))
>>> df
         1        2        3        4        5        6        7  \
0 0.657668 0.234840 0.187963 0.480384 0.676935 0.644506 0.849955
1 0.347819 0.278945 0.482548 0.856854 0.369824 0.921871 0.195208
2 0.481188 0.886892 0.269874 0.992266 0.663039 0.285274 0.222589
3 0.999133 0.932073 0.656683 0.607936 0.362180 0.756532 0.479407
4 0.918229 0.965718 0.243416 0.042666 0.932310 0.734750 0.142455
5 0.393881 0.821673 0.598786 0.715335 0.525187 0.763766 0.570982
6 0.998222 0.770152 0.803504 0.932111 0.629249 0.632741 0.230093
7 0.730399 0.127948 0.586990 0.890208 0.885532 0.821200 0.216378
8 0.823925 0.741674 0.690356 0.269986 0.530224 0.446307 0.265048
9 0.497035 0.830702 0.399065 0.242242 0.192078 0.622756 0.867983
         8        9       10
0 0.428669 0.921396 0.865082
1 0.897575 0.000369 0.019511
2 0.004554 0.093646 0.152874
3 0.376975 0.512618 0.385439
4 0.314657 0.032770 0.406077
5 0.087637 0.525262 0.095010
6 0.841192 0.115266 0.358726
7 0.957213 0.709480 0.013137
8 0.483483 0.687900 0.431011
9 0.924797 0.119433 0.386189
>>> plt.figure()
>>> andrews_curves(df, 1)
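Column 1 of df serves as the class column, and one curve is drawn for each row (index) from the remaining columns. Because column 1 holds random floats, every row ends up in its own class.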
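The phrase "attributes as coefficients for Fourier series" can be made concrete. The sketch below evaluates the usual Andrews formula f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + x5*cos(2t) + ... for a single row. The function name andrews_curve is hypothetical, and the range of t from -pi to pi is an assumption; this is not the pandas implementation itself.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate one Andrews curve for a single sample (hypothetical helper)."""
    row = np.asarray(row, dtype=float)
    # The first attribute gives the constant term x1 / sqrt(2).
    result = np.full_like(t, row[0] / np.sqrt(2.0), dtype=float)
    # The remaining attributes alternate between sin and cos terms.
    for i, coeff in enumerate(row[1:]):
        harmonic = i // 2 + 1
        if i % 2 == 0:
            result += coeff * np.sin(harmonic * t)
        else:
            result += coeff * np.cos(harmonic * t)
    return result

t = np.linspace(-np.pi, np.pi, 200)
flat = andrews_curve([np.sqrt(2.0), 0.0, 0.0], t)  # only the constant term is non-zero
wave = andrews_curve([0.0, 1.0, 0.0], t)           # only the sin(t) term is non-zero
```

Samples with similar attribute values produce similar coefficients and therefore similar curves, which is why rows of the same class tend to bundle together.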
4,Parallel Coordinates
Parallel coordinates is a plotting technique for plotting multivariate data. It allows one to see clusters in data and to estimate other statistics visually. Using parallel coordinates points are represented as connected line segments. Each vertical line represents one attribute. One set of connected line segments represents one data point. Points that tend to cluster will appear closer together.
>>> from pandas.plotting import parallel_coordinates
>>> df=DataFrame(np.random.rand(10,10), columns=range(1,11))
>>> df
         1        2        3        4        5        6        7  \
0 0.467659 0.978732 0.179538 0.685182 0.229915 0.882398 0.924433
1 0.863878 0.992446 0.732572 0.543559 0.164539 0.710433 0.220690
2 0.816937 0.866524 0.561880 0.136630 0.972659 0.352004 0.650383
3 0.351081 0.341353 0.004663 0.600008 0.880758 0.440976 0.111892
4 0.226553 0.014078 0.379845 0.598606 0.341625 0.675299 0.708234
5 0.170063 0.342096 0.813045 0.860868 0.905096 0.737247 0.652726
6 0.797142 0.777763 0.737259 0.100391 0.551292 0.739408 0.266556
7 0.130778 0.201388 0.896418 0.549645 0.587309 0.548748 0.009598
8 0.467129 0.298170 0.861704 0.217054 0.761984 0.110673 0.493671
9 0.778196 0.456548 0.171519 0.745076 0.905559 0.390150 0.727006
         8        9       10
0 0.494924 0.612457 0.026332
1 0.430576 0.064443 0.970996
2 0.776737 0.251197 0.410517
3 0.763297 0.365974 0.889982
4 0.947055 0.200605 0.179035
5 0.435712 0.694421 0.101725
6 0.581694 0.719693 0.588572
7 0.998294 0.138834 0.059504
8 0.549928 0.096064 0.312498
9 0.854901 0.985777 0.691980
>>> plt.figure()
>>> parallel_coordinates(df, 1)
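In the result, column 1 of df serves as the class column, and each row (index) is drawn as one polyline that crosses the vertical axes for columns 2 through 10.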
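Parallel coordinates are easier to read when the class column holds a few discrete labels rather than random floats. The sketch below builds two artificial groups; the column names x, y, z and group, the value ranges, and the colors are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(0)  # arbitrary seed
low = pd.DataFrame(rng.uniform(0.0, 0.4, size=(5, 3)), columns=["x", "y", "z"])
high = pd.DataFrame(rng.uniform(0.6, 1.0, size=(5, 3)), columns=["x", "y", "z"])
low["group"] = "low"
high["group"] = "high"
labeled = pd.concat([low, high], ignore_index=True)

# One polyline per row, colored by the value in the "group" column.
ax = parallel_coordinates(labeled, "group", color=["tab:blue", "tab:red"])
n_lines = len(ax.get_lines())
```

With this data the two groups should show up as two separate bands of lines, one near the bottom of each axis and one near the top.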
5,Lag Plot
Lag plots are used to check if a data set or time series is random. Random data should not exhibit any structure in the lag plot. Non-random structure implies that the underlying data are not random.
In the example below the series is mostly a sine wave with a small random component, so the points form a clear structure instead of a shapeless cloud.
>>> from pandas.plotting import lag_plot
>>> plt.figure()
>>> data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(np.linspace(-99 * np.pi, 99 * np.pi, num=1000)))
>>> lag_plot(data)
The x-axis of the plot is y(t) and the y-axis is y(t+1).
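The same comparison of y(t) against y(t+1) can be summarized as a single number. This is a rough sketch using Series.autocorr with lag=1, comparing pure noise with the sine-based series from the example above; the seed is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
noise = pd.Series(rng.random(1000))
wave = pd.Series(
    0.1 * rng.random(1000)
    + 0.9 * np.sin(np.linspace(-99 * np.pi, 99 * np.pi, num=1000))
)

# Correlation between each value and its successor.
noise_r = noise.autocorr(lag=1)  # expected to be close to 0
wave_r = wave.autocorr(lag=1)    # expected to be clearly positive
print(noise_r, wave_r)
```

A value near zero corresponds to a structureless lag plot, while a large value corresponds to points that line up along a visible shape.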
6,Autocorrelation Plot
Autocorrelation plots are often used for checking randomness in time series. This is done by computing autocorrelations for data values at varying time lags. If the time series is random, such autocorrelations should be near zero for any and all time-lag separations. If the time series is non-random, then one or more of the autocorrelations will be significantly non-zero. The horizontal lines displayed in the plot correspond to 95% and 99% confidence bands. The dashed line is the 99% confidence band.
>>> from pandas.plotting import autocorrelation_plot
>>> data = pd.Series(0.7 * np.random.rand(1000) + 0.3 * np.sin(np.linspace(-9 * np.pi, 9 * np.pi, num=1000)))
>>> autocorrelation_plot(data)
The x-axis of the resulting plot is labeled Lag and the y-axis is labeled Autocorrelation.
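The quantities behind the plot can be sketched by hand. The helper below, autocorrelation, is a hypothetical name; it computes the sample autocorrelation at a given lag, and the bands are taken as the standard normal quantiles for 95% and 99% divided by sqrt(n), which is an assumption about how such bands are usually drawn rather than a statement about pandas internals.

```python
import numpy as np

def autocorrelation(values, lag):
    """Sample autocorrelation of `values` at the given lag (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean = values.mean()
    c0 = ((values - mean) ** 2).sum() / n
    ch = ((values[: n - lag] - mean) * (values[lag:] - mean)).sum() / n
    return ch / c0

rng = np.random.default_rng(0)  # arbitrary seed
data = 0.7 * rng.random(1000) + 0.3 * np.sin(np.linspace(-9 * np.pi, 9 * np.pi, num=1000))

n = len(data)
band_95 = 1.959963984540054 / np.sqrt(n)   # two-sided 95% normal quantile
band_99 = 2.5758293035489004 / np.sqrt(n)  # two-sided 99% normal quantile

r1 = autocorrelation(data, 1)
significant = abs(r1) > band_99  # outside the 99% band suggests non-random data
```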
7,Bootstrap Plot
Bootstrap plots are used to visually assess the uncertainty of a statistic, such as mean, median, midrange, etc. A random subset of a specified size is selected from a data set, the statistic in question is computed for this subset and the process is repeated a specified number of times. Resulting plots and histograms are what constitutes the bootstrap plot.
>>> from pandas.plotting import bootstrap_plot
>>> data = pd.Series(np.random.rand(1000))
>>> bootstrap_plot(data, size=50, samples=500, color='green')
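The resampling procedure described above can be sketched without any plotting. This example assumes the subsets are drawn without replacement (the text only says "random subset") and reuses size=50 and samples=500 from the call above; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
data = rng.random(1000)
size, samples = 50, 500

means, medians, midranges = [], [], []
for _ in range(samples):
    # Assumption: each subset is drawn without replacement.
    subset = rng.choice(data, size=size, replace=False)
    means.append(subset.mean())
    medians.append(np.median(subset))
    midranges.append((subset.min() + subset.max()) / 2.0)

means = np.array(means)
spread = means.std()  # how much the mean varies from subset to subset
print(means.mean(), spread)
```

The spread of each list is the uncertainty that the histograms in the bootstrap plot visualize.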
8,RadViz
RadViz is a way of visualizing multi-variate data. It is based on a simple spring tension minimization algorithm. Basically you set up a bunch of points in a plane. In our case they are equally spaced on a unit circle. Each point represents a single attribute. You then pretend that each sample in the data set is attached to each of these points by a spring, the stiffness of which is proportional to the numerical value of that attribute (they are normalized to unit interval). The point in the plane, where our sample settles to (where the forces acting on our sample are at an equilibrium) is where a dot representing our sample will be drawn. Depending on which class that sample belongs it will be colored differently.
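>>> from pandas.plotting import radviz
>>> # 'a' is the class column, so 'b' is the only attribute placed on the circle
>>> df=DataFrame(np.array([[2,4,6,79,23,190,552,1314,23457], [4,9,6,97,32,110,555,1210,4325]]).T, columns=['a','b'])
>>> radviz(df, 'a')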
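The spring description above amounts to a weighted average: if anchor i sits at position s_i and its spring has stiffness w_i, the forces balance at the point sum(w_i * s_i) / sum(w_i). Here is a minimal sketch of that projection, assuming the weights are already normalized to the unit interval; radviz_point is a hypothetical helper, not a pandas function.

```python
import numpy as np

def radviz_point(weights):
    """Equilibrium position of one sample (hypothetical helper)."""
    weights = np.asarray(weights, dtype=float)
    m = len(weights)
    # Anchors are equally spaced on the unit circle, one per attribute.
    angles = 2.0 * np.pi * np.arange(m) / m
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    # Weighted average of the anchors; undefined when all weights are zero.
    return (anchors * weights[:, None]).sum(axis=0) / weights.sum()

balanced = radviz_point([0.5, 0.5, 0.5, 0.5])  # equal pull from all sides
pulled = radviz_point([1.0, 0.0, 0.0, 0.0])    # only the first attribute pulls
print(balanced, pulled)
```

A sample with equal values on every attribute lands at the center, while a sample dominated by one attribute is drawn toward that attribute's anchor.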