Python Assignment - Jupyter - Dataset Analysis
By 阿新 · Published 2018-12-31
Goal: learn to use Jupyter Notebook and the data-analysis functions in Python libraries.
Exercise link:
https://nbviewer.jupyter.org/github/schmit/cme193-ipython-notebooks-lecture/blob/master/Exercises.ipynb
Assignment requirements:
1.
Part 1
For each of the four datasets...
- Compute the mean and variance of both x and y
- Compute the correlation coefficient between x and y
- Compute the linear regression line: y = β0 + β1 x + ε (hint: use statsmodels and look at the Statsmodels notebook)
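As a quick sanity check on the regression step, the ordinary least squares line also has a closed form in terms of the statistics the exercise asks for: β1 = r · s_y / s_x and β0 = ȳ − β1 · x̄. A minimal pandas sketch of that, assuming the same anscombe.csv file (with columns dataset, x, y) that the full solution below loads; the relative path here is hypothetical:

import pandas as pd

# Closed-form simple linear regression from summary statistics:
# slope = correlation * (std of y / std of x), intercept from the means.
def ols_by_hand(group):
    r = group['x'].corr(group['y'])                        # Pearson correlation
    beta1 = r * group['y'].std() / group['x'].std()        # slope
    beta0 = group['y'].mean() - beta1 * group['x'].mean()  # intercept
    return beta0, beta1

anscombe = pd.read_csv('anscombe.csv')  # hypothetical path to the same file
for name, group in anscombe.groupby('dataset'):
    b0, b1 = ols_by_hand(group)
    print(f'{name}: y = {b0:.4f} + {b1:.4f} x')

Each dataset should come out near y = 3.00 + 0.50x, matching the statsmodels fits shown below.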
Python implementation:
%matplotlib inline
import random
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

sns.set_context("talk")

# Load Anscombe's quartet (columns: dataset, x, y)
anascombe = pd.read_csv('C:/Users/Administrator/Desktop/data/anscombe.csv')
anascombe.head()

# Mean and variance of x and y within each dataset
print('The mean of x and y:')
print(anascombe.groupby(['dataset'])[['x', 'y']].mean())

print('\nThe variance of x and y:')
print(anascombe.groupby(['dataset'])[['x', 'y']].var())

# Correlation coefficient between x and y within each dataset
print('\nThe correlation coefficient between x and y:')
print(anascombe.groupby(['dataset'])[['x', 'y']].corr())

# Fit the linear regression y ~ x separately for each dataset
# (hint: use statsmodels and look at the Statsmodels notebook)
datasets = ['I', 'II', 'III', 'IV']
for dataset in datasets:
    lin_model = smf.ols('y ~ x', anascombe[anascombe['dataset'] == dataset]).fit()
    print(lin_model.summary())
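One portability note: the hard-coded Windows path above only works on that machine. Seaborn bundles the same Anscombe quartet as a built-in example dataset, so the load step can be swapped for the following sketch (load_dataset fetches the data from seaborn's online data repository, so it needs network access on first use; the results below were produced from the local CSV):

import seaborn as sns

# Built-in copy of Anscombe's quartet, with the same dataset/x/y columns
anascombe = sns.load_dataset('anscombe')
print(anascombe.head())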
The results are as follows:
The mean of x and y:
           x         y
dataset
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909

The variance of x and y:
            x         y
dataset
I        11.0  4.127269
II       11.0  4.127629
III      11.0  4.122620
IV       11.0  4.123249

The correlation coefficient between x and y:
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.667
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.99
Date:                Mon, 11 Jun 2018   Prob (F-statistic):            0.00217
Time:                        00:06:58   Log-Likelihood:                -16.841
No. Observations:                  11   AIC:                             37.68
Df Residuals:                       9   BIC:                             38.48
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0001      1.125      2.667      0.026       0.456       5.544
x              0.5001      0.118      4.241      0.002       0.233       0.767
==============================================================================
Omnibus:                        0.082   Durbin-Watson:                   3.212
Prob(Omnibus):                  0.960   Jarque-Bera (JB):                0.289
Skew:                          -0.122   Prob(JB):                        0.865
Kurtosis:                       2.244   Cond. No.                         29.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.97
Date:                Mon, 11 Jun 2018   Prob (F-statistic):            0.00218
Time:                        00:06:58   Log-Likelihood:                -16.846
No. Observations:                  11   AIC:                             37.69
Df Residuals:                       9   BIC:                             38.49
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0009      1.125      2.667      0.026       0.455       5.547
x              0.5000      0.118      4.239      0.002       0.233       0.767
==============================================================================
Omnibus:                        1.594   Durbin-Watson:                   2.188
Prob(Omnibus):                  0.451   Jarque-Bera (JB):                1.108
Skew:                          -0.567   Prob(JB):                        0.575
Kurtosis:                       1.936   Cond. No.                         29.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.666
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.97
Date:                Mon, 11 Jun 2018   Prob (F-statistic):            0.00218
Time:                        00:06:58   Log-Likelihood:                -16.838
No. Observations:                  11   AIC:                             37.68
Df Residuals:                       9   BIC:                             38.47
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0025      1.124      2.670      0.026       0.459       5.546
x              0.4997      0.118      4.239      0.002       0.233       0.766
==============================================================================
Omnibus:                       19.540   Durbin-Watson:                   2.144
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               13.478
Skew:                           2.041   Prob(JB):                      0.00118
Kurtosis:                       6.571   Cond. No.                         29.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.667
Model:                            OLS   Adj. R-squared:                  0.630
Method:                 Least Squares   F-statistic:                     18.00
Date:                Mon, 11 Jun 2018   Prob (F-statistic):            0.00216
Time:                        00:06:58   Log-Likelihood:                -16.833
No. Observations:                  11   AIC:                             37.67
Df Residuals:                       9   BIC:                             38.46
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      3.0017      1.124      2.671      0.026       0.459       5.544
x              0.4999      0.118      4.243      0.002       0.233       0.766
==============================================================================
Omnibus:                        0.555   Durbin-Watson:                   1.662
Prob(Omnibus):                  0.758   Jarque-Bera (JB):                0.524
Skew:                           0.010   Prob(JB):                        0.769
Kurtosis:                       1.931   Cond. No.                         29.1
==============================================================================
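The four full summaries are hard to compare at a glance. A short sketch that pulls just the fitted intercept, slope, and R² from each results object (params and rsquared are standard attributes of a fitted statsmodels OLS model):

# Condense the four fits into a single comparison table
rows = []
for dataset in ['I', 'II', 'III', 'IV']:
    fit = smf.ols('y ~ x', anascombe[anascombe['dataset'] == dataset]).fit()
    rows.append({'dataset': dataset,
                 'intercept': fit.params['Intercept'],
                 'slope': fit.params['x'],
                 'R^2': fit.rsquared})
print(pd.DataFrame(rows).set_index('dataset').round(4))

All four datasets land on essentially the same line, y ≈ 3.00 + 0.50x with R² ≈ 0.67, which is exactly what makes Anscombe's quartet interesting once the scatter plots in Part 2 are drawn.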
2.
Part 2
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
Python code:
(refer to the Statsmodels notebook)
# One scatter panel per dataset, two panels per row
graph = sns.FacetGrid(anascombe, col='dataset', col_wrap=2)
graph.map(plt.scatter, 'x', 'y')
The result is a 2×2 grid of scatter plots, one panel per dataset.
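To make Part 1's finding visible in the plots, the fitted line can be overlaid on each panel; seaborn's lmplot combines the faceting and the regression fit in one call. A sketch using the same anascombe frame (ci=None suppresses the confidence band):

# Scatter plus the fitted regression line, one panel per dataset
sns.lmplot(data=anascombe, x='x', y='y', col='dataset', col_wrap=2, ci=None)
plt.show()

Despite the nearly identical fitted lines, the panels show four very different structures: a clean linear trend, a smooth curve, a line distorted by a single outlier, and a vertical cluster with one high-leverage point.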