Python: Datasets Bundled with sklearn

阿新 • Published: 2018-08-12


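Datasets bundled with sklearn

sklearn offers several kinds of datasets:

Small packaged datasets: sklearn.datasets.load_<name>
Datasets that can be downloaded online (downloaded datasets): sklearn.datasets.fetch_<name>
Computer-generated datasets (generated datasets): sklearn.datasets.make_<name>
Datasets in svmlight/libsvm format: sklearn.datasets.load_svmlight_file(...)
Datasets downloaded from mldata.org: sklearn.datasets.fetch_mldata(...) (removed in scikit-learn 0.22; newer releases provide sklearn.datasets.fetch_openml(...) for openml.org instead)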
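The naming convention above can be checked against whatever scikit-learn version is installed. The following is a small sketch rather than anything from the original post: it groups the public names in sklearn.datasets by prefix. The helper name names_with_prefix is made up for illustration, and the exact lists depend on the installed version.

```python
import sklearn.datasets as datasets


def names_with_prefix(prefix):
    """Return the names exposed by sklearn.datasets that start with prefix."""
    return sorted(name for name in dir(datasets) if name.startswith(prefix))


loaders = names_with_prefix("load_")      # small packaged datasets
fetchers = names_with_prefix("fetch_")    # datasets downloaded on demand
generators = names_with_prefix("make_")   # synthetic data generators

for prefix, names in [("load_", loaders), ("fetch_", fetchers), ("make_", generators)]:
    # Print the count and the first few names of each family
    print(prefix, len(names), names[:5])
```

Only the load_ and make_ families work offline; the fetch_ functions download data the first time they are called.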

① Built-in datasets

The small built-in datasets are loaded with sklearn.datasets.load_<name>.


All of these datasets are documented on the official site. Taking the iris dataset as an example, a demo is available at http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html


from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
iris = load_iris()
print(iris.keys())  # includes 'data', 'target', 'target_names', 'feature_names', 'DESCR'

# Number of samples and number of features
n_samples, n_features = iris.data.shape
print("Number of samples:", n_samples)    # Number of samples: 150
print("Number of features:", n_features)  # Number of features: 4

# The first sample
print(iris.data[0])       # [5.1 3.5 1.4 0.2]
print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)

# Class labels: 50 zeros, then 50 ones, then 50 twos
print(iris.target)
print(iris.target_names)         # ['setosa' 'versicolor' 'virginica']
print(np.bincount(iris.target))  # [50 50 50]

# Histogram of a single feature, one color per class.
# x_index selects the feature and may be 0, 1, 2 or 3.
x_index = 3
colors = ['blue', 'red', 'green']
for label, color in zip(range(len(iris.target_names)), colors):
    plt.hist(iris.data[iris.target == label, x_index],
             label=iris.target_names[label], color=color)

plt.xlabel(iris.feature_names[x_index])
plt.legend(loc="upper right")
plt.show()

# Scatter plot: the first feature on the x-axis, the second on the y-axis
x_index = 0
y_index = 1
for label, color in zip(range(len(iris.target_names)), colors):
    plt.scatter(iris.data[iris.target == label, x_index],
                iris.data[iris.target == label, y_index],
                label=iris.target_names[label],
                c=color)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index])
plt.legend(loc='upper left')
plt.show()
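When only the feature matrix and the labels are needed, the Bunch container can be skipped. A minimal sketch, assuming a scikit-learn version whose loaders accept the return_X_y argument:

```python
from sklearn.datasets import load_iris

# return_X_y=True returns the (data, target) pair instead of a Bunch
X, y = load_iris(return_X_y=True)

print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```

This form is convenient when the arrays are passed straight to an estimator and the feature names or description are not needed.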


Handwritten digits dataset, load_digits(): a dataset for multi-class classification tasks

from sklearn.datasets import load_digits
import numpy as np
import matplotlib.pyplot as plt

digits = load_digits()
print(digits.keys())
n_samples, n_features = digits.data.shape
print((n_samples, n_features))  # (1797, 64)
print(digits.images.shape)      # (1797, 8, 8)

# Show the first image in grayscale
plt.gray()
plt.matshow(digits.images[0])
plt.show()

# data holds the same pixels as images, with each 8x8 image flattened to 64 values
print(np.all(digits.images.reshape((1797, 64)) == digits.data))  # True

fig = plt.figure(figsize=(6, 6))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# Plot the first 64 digits; each image is 8x8 pixels
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # Label each image with its target value
    ax.text(0, 7, str(digits.target[i]))
plt.show()


Breast cancer dataset, load_breast_cancer(): a simple, classic dataset for binary classification tasks
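As a quick illustration (a sketch, not part of the original post), the breast cancer dataset can be loaded and its class balance inspected in the same way as iris:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

print(X.shape)              # (569, 30)
print(cancer.target_names)  # ['malignant' 'benign']

# Number of samples per class, in the same order as target_names
class_counts = np.bincount(y)
print(dict(zip(cancer.target_names, class_counts)))
```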

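Diabetes dataset, load_diabetes(): a classic dataset for regression tasks. Note that each of its 10 features has already been centered to zero mean and normalized in variance (in scikit-learn's copy, each column is scaled so that its sum of squares equals 1)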
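The preprocessing of the diabetes features can be verified numerically. The check below is a sketch: it assumes the default (scaled) form of the data, and the tolerance values are arbitrary choices for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target
print(X.shape)  # (442, 10)

# Every column should have (numerically) zero mean ...
column_means = X.mean(axis=0)
print(np.allclose(column_means, 0.0, atol=1e-6))

# ... and a sum of squares of 1, which is the variance normalization
column_sum_sq = (X ** 2).sum(axis=0)
print(np.allclose(column_sum_sq, 1.0, atol=1e-6))
```

Because the features are already on a common scale, an extra standardization step usually changes little for this dataset; the target y is left in its original units.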

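Boston house-price dataset, load_boston(): a classic dataset for regression tasks (load_boston was removed in scikit-learn 1.2, so it is only available in older releases)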

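Physical exercise dataset, load_linnerud(): a classic dataset for multivariate regression tasks. It contains two small datasets: exercise holds 20 observations of 3 exercise variables (chin-ups, sit-ups, jumps), and physiological holds 20 observations of 3 physiological variables (weight, waist, pulse)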
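Which variables belong to which half is easy to mix up, so it helps to print the names stored in the dataset itself. A short sketch: the exercise variables are exposed as data and the physiological variables as target.

```python
from sklearn.datasets import load_linnerud

linnerud = load_linnerud()

# Exercise variables are the features
print(linnerud.data.shape, list(linnerud.feature_names))

# Physiological variables are the (multi-output) regression targets
print(linnerud.target.shape, list(linnerud.target_names))
```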

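Each sample line in an svmlight/libsvm file is stored in this format:

<label> <feature-id>:<feature-value> <feature-id>:<feature-value> ...

This format is well suited to storing sparse data. In sklearn, X is stored as a scipy sparse CSR matrix and y as a numpy array.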

from sklearn.datasets import load_svmlight_file
# To load several files at once, pass a list of paths to load_svmlight_files instead
x_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
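To make the format concrete, here is a self-contained sketch that writes a two-line svmlight file to a temporary directory and reads it back. The file name toy.svmlight and the sample values are made up for illustration, and the feature ids are assumed to be 1-based, which is declared explicitly through zero_based=False.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file

# Two samples: "<label> <feature-id>:<feature-value> ..."
# Features that are not listed are implicitly zero.
content = "1 1:0.5 3:1.2\n-1 2:0.7\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.svmlight")  # hypothetical file name
    with open(path, "w") as f:
        f.write(content)
    # n_features fixes the number of columns; zero_based=False means ids start at 1
    X, y = load_svmlight_file(path, n_features=3, zero_based=False)

print(X.format, X.shape)  # sparse storage format and shape of X
print(X.toarray())        # dense view, for inspection only
print(y)                  # labels as a numpy array
```

Passing n_features explicitly is useful when a training file and a test file do not mention the same highest feature id, since both matrices then end up with the same number of columns.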

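② Generated datasets

Generated datasets can be used for classification, regression, clustering, manifold learning, and factorization (decomposition) tasks.

For classification and clustering tasks, these functions produce a matrix of sample feature vectors together with the corresponding class labels.

make_blobs: a multi-class, single-label dataset in which each class is assigned one or more normally distributed clusters of points

make_classification: a multi-class, single-label dataset in which each class is assigned one or more normally distributed clusters of points; it also provides ways to add noise to the data, including correlated dimensions, uninformative features, and redundant features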

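make_gaussian_quantiles: splits a single Gaussian cluster of points into classes of nearly equal size (the number is set by n_classes, e.g. two), separated by concentric hyperspheres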
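A brief sketch of make_gaussian_quantiles, with parameter values chosen only for illustration. It shows that the classes have equal size and that they form nested shells around the mean (the origin by default):

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

X, y = make_gaussian_quantiles(n_samples=300, n_features=2,
                               n_classes=3, random_state=0)

print(X.shape)
print(np.bincount(y))  # number of samples in each class

# Distance of every sample from the origin; a higher label means a farther shell
radius = np.linalg.norm(X, axis=1)
for label in range(3):
    print(label, radius[y == label].min(), radius[y == label].max())
```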

make_hastie_10_2: generates a similar binary classification dataset with 10 dimensions
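The labelling rule behind make_hastie_10_2 can be reproduced by hand: the features are standard Gaussian, and a sample is labelled 1 when the sum of its squared features exceeds 9.34 and -1 otherwise. The sketch below checks that rule; the sample size is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(n_samples=1000, random_state=0)
print(X.shape)       # (1000, 10)
print(np.unique(y))  # [-1.  1.]

# Recompute the labels from the features using the 9.34 threshold
rule = np.where((X ** 2.0).sum(axis=1) > 9.34, 1.0, -1.0)
print(np.array_equal(rule, y))
```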

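make_circles and make_moons generate two-dimensional binary classification datasets for testing how certain algorithms perform, and noise can be added to the data. make_circles produces data with a spherical decision boundary for binary classifiers, while make_moons produces two interleaving half circles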
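A minimal make_moons sketch, with the noise level and sample count picked only for illustration:

```python
import numpy as np
from sklearn.datasets import make_moons

# noise is the standard deviation of the Gaussian noise added to each point
X, labels = make_moons(n_samples=200, noise=0.1, random_state=0)

print(X.shape)              # (200, 2)
print(np.bincount(labels))  # [100 100]
```

The two half circles are not linearly separable, which is why this generator is a common test case for nonlinear classifiers and clustering methods.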

import numpy as np
import matplotlib.pyplot as plt
# sklearn.datasets.samples_generator no longer exists in current releases;
# the generators are imported from sklearn.datasets directly
from sklearn.datasets import make_blobs, make_classification, make_circles


def plot_labeled_points(X, labels, title):
    # Draw the points of each class in their own color
    unique_labels = sorted(set(labels))
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    for k, col in zip(unique_labels, colors):
        x_k = X[labels == k]
        plt.plot(x_k[:, 0], x_k[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor="k", markersize=14)
    plt.title(title)
    plt.show()


# Generate a multi-class, single-label dataset
centers = [[1, 1], [-1, -1], [1, -1]]
cluster_std = 0.3
X, labels = make_blobs(n_samples=200, centers=centers, n_features=2,
                       cluster_std=cluster_std, random_state=0)
print('X.shape', X.shape)  # X.shape (200, 2)
print("labels", set(labels))
plot_labeled_points(X, labels, 'data by make_blobs()')

# Generate a dataset for classification
X, labels = make_classification(n_samples=200, n_features=2, n_redundant=0,
                                n_informative=2, random_state=1,
                                n_clusters_per_class=2)
# Add uniform noise to spread the points out
rng = np.random.RandomState(2)
X += 2 * rng.uniform(size=X.shape)
plot_labeled_points(X, labels, 'data by make_classification()')

# Generate data with a spherical decision boundary
X, labels = make_circles(n_samples=200, noise=0.2, factor=0.2, random_state=1)
print("X.shape:", X.shape)  # X.shape: (200, 2)
print("labels:", set(labels))
plot_labeled_points(X, labels, 'data by make_circles()')
