08-04 Breaking Down the Machine Learning Application Workflow: Data Collection

阿新 • Published: 2020-12-11

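Breaking Down the Machine Learning Application Workflow: Data Collection
1.1 Generating random data with sklearn
1.2 Datasets bundled with sklearn
1.3 Importing data from the UCI repository
1.4 Importing CSV data from a Tianchi competition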

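A newer and more complete version of this Machine Learning series, along with tutorials on Python, Go, data structures and algorithms, web scraping, and artificial intelligence, is available at: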

https://www.cnblogs.com/nickchen121/p/11686958.html

Breaking Down the Machine Learning Application Workflow: Data Collection

Official sklearn datasets documentation: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

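Overview of sklearn datasets

Type	How to obtain
Randomly generated datasets	sklearn.datasets.make_…
Datasets bundled with sklearn	sklearn.datasets.load_…
Datasets downloaded online	sklearn.datasets.fetch_…
Datasets in svmlight format	sklearn.datasets.load_svmlight_file(…)
Datasets downloaded from mldata.org	sklearn.datasets.fetch_mldata(…) (removed in scikit-learn 0.22; fetch_openml(…) is its replacement)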
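
The svmlight row of the table can be tried without any download. The following is a minimal sketch, assuming a tiny made-up matrix and a hypothetical file name (`toy.svmlight`): it writes the data with `dump_svmlight_file` and reads it back with `load_svmlight_file`.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Tiny illustrative dataset (made up for this sketch).
X = np.array([[1.0, 0.0, 2.5],
              [0.0, 3.0, 0.0]])
y = np.array([1, -1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.svmlight")  # hypothetical file name
    # Write the data in svmlight/libsvm text format with zero-based feature indices.
    dump_svmlight_file(X, y, path, zero_based=True)
    # load_svmlight_file returns a SciPy sparse matrix and a label array.
    X_loaded, y_loaded = load_svmlight_file(path, n_features=3, zero_based=True)

print(X_loaded.toarray())
print(y_loaded)
```

Because the loader returns a sparse matrix, `toarray()` is only appropriate for small data such as this example.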

1.1 Generating random data with sklearn

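By changing the parameters of sklearn's random data generators you can produce an unlimited supply of data, and the number of samples, the number of features, the number of label classes, and the noise level are all configurable. The generators used most often are introduced briefly below.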

Method	Use
make_classification()	classification
make_multilabel_classification()	multilabel classification
make_regression()	regression
make_blobs()	clustering and classification
make_circles()	classification
make_moons()	classification

1.1.1 make_classification()

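Parameter	Meaning
n_features	Total number of features: n_informative + n_redundant + n_repeated, with any remaining features filled with random noise
n_informative	Number of informative features
n_redundant	Number of redundant features, generated as random linear combinations of the informative features
n_repeated	Number of duplicated features, drawn at random from the informative and redundant features
n_classes	Number of classes
n_clusters_per_class	Number of clusters that make up each class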
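
The parameter table can be checked empirically. The sketch below assumes `shuffle=False`, so that the columns keep their generation order (informative, redundant, repeated, then noise), and leaves `shift` and `scale` at their defaults; the sample size and seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant, n_repeated = 2, 2, 1
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=n_informative,
    n_redundant=n_redundant, n_repeated=n_repeated,
    n_clusters_per_class=1, shuffle=False, random_state=0)

informative = X[:, :n_informative]
redundant = X[:, n_informative:n_informative + n_redundant]
repeated = X[:, n_informative + n_redundant]

# Redundant columns should be reproduced by a least-squares fit on the informative ones.
coef, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
max_residual = np.abs(informative @ coef - redundant).max()

# The repeated column should be a copy of one informative or redundant column.
is_copy = [bool(np.allclose(repeated, X[:, j]))
           for j in range(n_informative + n_redundant)]

print(X.shape)        # 2 informative + 2 redundant + 1 repeated + 1 noise column
print(max_residual)
print(is_copy)
```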

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties
from sklearn import datasets
%matplotlib inline
font = FontProperties(fname='/Library/Fonts/Heiti.ttc')

from sklearn import datasets
try:
    X1, y1 = datasets.make_classification(
        n_samples=50, n_classes=3, n_clusters_per_class=2, n_informative=2)
    print(X1.shape)
except Exception as e:
    print('error:{}'.format(e))
# The call above raises: n_classes * n_clusters_per_class must be smaller or equal 2 ** n_informative.
# With n_informative=2 and n_clusters_per_class=2 the limit is 2 ** 2 = 4,
# so n_classes can be at most 2; n_classes=3 gives 3 * 2 = 6 > 4.

error:n_classes * n_clusters_per_class must be smaller or equal 2 ** n_informative
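
The constraint in the error message can be turned into a small helper. This is a sketch, not part of sklearn: `min_informative` is a hypothetical function that returns the smallest `n_informative` satisfying `n_classes * n_clusters_per_class <= 2 ** n_informative`.

```python
from sklearn.datasets import make_classification


def min_informative(n_classes, n_clusters_per_class):
    """Smallest n_informative with n_classes * n_clusters_per_class <= 2 ** n_informative."""
    total_clusters = n_classes * n_clusters_per_class
    # (k - 1).bit_length() equals ceil(log2(k)) for every integer k >= 1.
    return max(1, (total_clusters - 1).bit_length())


needed = min_informative(n_classes=3, n_clusters_per_class=2)
print(needed)

# With enough informative features the failing call from above goes through.
X, y = make_classification(
    n_samples=50, n_classes=3, n_clusters_per_class=2,
    n_informative=needed, random_state=0)
print(X.shape)
```

For the failing call above, 3 classes with 2 clusters each need 6 clusters, so at least 3 informative features are required (2 ** 3 = 8).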

import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(10, 10))

plt.subplot(221)
plt.title("One informative feature, one cluster per class", fontsize=12)
X1, y1 = datasets.make_classification(n_samples=1000, random_state=1, n_features=2, n_redundant=0, n_informative=1,
                                      n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1)

plt.subplot(222)
plt.title("Two informative features, one cluster per class", fontsize=12)
X1, y1 = datasets.make_classification(n_samples=1000, random_state=1, n_features=2, n_redundant=0, n_informative=2,
                                      n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1)

plt.subplot(223)
plt.title("Two informative features, two clusters per class", fontsize=12)
X1, y1 = datasets.make_classification(
    n_samples=1000, random_state=1, n_features=2, n_redundant=0, n_informative=2)
plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1)


plt.subplot(224)
plt.title("Multi-class, two informative features, one cluster",
          fontsize=12)
X1, y1 = datasets.make_classification(n_samples=1000, random_state=1, n_features=2, n_redundant=0, n_informative=2,
                                      n_clusters_per_class=1, n_classes=4)
plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1)
plt.show()

1.1.2 make_multilabel_classification()

X1, y1 = datasets.make_multilabel_classification(
    n_samples=1000, n_classes=4, n_features=2, random_state=1)
print('Sample shape:{}'.format(X1.shape))
# A single sample may carry several labels at once
print(y1[0:5, :])

Sample shape:(1000, 2)
[[1 1 0 0]
 [0 0 0 0]
 [1 1 0 0]
 [0 0 0 1]
 [0 0 0 0]]

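# Color each point by how many labels it carries. Passing the (n, 4) indicator
# matrix directly as c would be interpreted as one RGBA color per point.
plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1.sum(axis=1))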
plt.show()
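
The label matrix printed above is a binary indicator matrix with one column per class. The following sketch shows one way to read it, assuming `allow_unlabeled=False` so that every sample receives at least one label (the original call leaves this at its default, which permits all-zero rows).

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, Y = make_multilabel_classification(
    n_samples=1000, n_classes=4, n_features=2,
    allow_unlabeled=False, random_state=1)

# Number of labels attached to each sample.
labels_per_sample = Y.sum(axis=1)

# Convert the first few indicator rows into explicit lists of class indices.
label_lists = [np.flatnonzero(row).tolist() for row in Y[:5]]

print(Y.shape)
print(labels_per_sample.min(), labels_per_sample.max())
print(label_lists)
```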

1.1.3 make_regression()

import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets
X1, y1 = datasets.make_regression(n_samples=500, n_features=1, noise=20)
plt.scatter(X1, y1, color='r', s=10, marker='*')
plt.show()
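
To see what the `noise` parameter does, it helps to ask `make_regression` for the underlying coefficients with `coef=True`. The sketch below uses three features (an arbitrary choice, so that the returned coefficient vector is one-dimensional) and compares a noise-free run with a run at `noise=20`.

```python
import numpy as np
from sklearn.datasets import make_regression

# Without noise, y is an exact linear function of X (the bias defaults to 0).
X, y, coef = make_regression(
    n_samples=500, n_features=3, n_informative=3,
    noise=0.0, coef=True, random_state=0)
max_error = np.abs(X @ coef - y).max()

# With noise=20, Gaussian noise with standard deviation 20 is added to y.
X_noisy, y_noisy, coef_noisy = make_regression(
    n_samples=500, n_features=3, n_informative=3,
    noise=20.0, coef=True, random_state=0)
residual_std = np.std(y_noisy - X_noisy @ coef_noisy)

print(max_error)
print(residual_std)
```

The residual standard deviation of the noisy run should come out close to the requested noise level.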

1.1.4 make_blobs()

# Centers of the 3 clusters to generate
centers = [[1, 1], [-1, -2], [1, -2]]
X1, y1 = datasets.make_blobs(
    n_samples=1500, centers=centers, n_features=2, random_state=1, shuffle=False, cluster_std=0.5)
print('Sample shape:{}'.format(X1.shape))
# plt.scatter(X1[0:500, 0], X1[0:500, 1], s=10, label='cluster1')
# plt.scatter(X1[500:1000, 0], X1[500:1000, 1], s=10, label='cluster2')
# plt.scatter(X1[1000:1500, 0], X1[1000:1500, 1], s=10, label='cluster3')
plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1)
plt.show()

Sample shape:(1500, 2)
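
The commented-out slices above rely on `shuffle=False` keeping the samples grouped by center. A short sketch, using the same centers and settings, checks that ordering and compares the empirical cluster means with the requested centers.

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = np.array([[1, 1], [-1, -2], [1, -2]])
X, y = make_blobs(
    n_samples=1500, centers=centers, cluster_std=0.5,
    shuffle=False, random_state=1)

# 1500 samples over 3 centers gives 500 samples per cluster.
counts = np.bincount(y)

# Empirical mean of each cluster; should sit close to the requested center.
means = np.array([X[y == k].mean(axis=0) for k in range(len(centers))])

print(counts)
print(np.round(means, 2))
```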

1.1.5 make_circles()

X1, y1 = datasets.make_circles(
    n_samples=1000, random_state=1, factor=0.5, noise=0.1)
print('Sample shape:{}'.format(X1.shape))
plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1)
plt.title('make_circles()', fontsize=20)
plt.show()

Sample shape:(1000, 2)
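
The `factor` argument sets the radius of the inner circle relative to the outer one. A minimal sketch, assuming the noise is switched off so the geometry is exact:

```python
import numpy as np
from sklearn.datasets import make_circles

# No noise: every point lies exactly on one of the two circles.
X, y = make_circles(n_samples=1000, factor=0.5, noise=None, random_state=1)

radii = np.linalg.norm(X, axis=1)
outer_radii = radii[y == 0]  # label 0 is the outer circle
inner_radii = radii[y == 1]  # label 1 is the inner circle

print(outer_radii.min(), outer_radii.max())
print(inner_radii.min(), inner_radii.max())
```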

1.1.6 make_moons()

X1, y1 = datasets.make_moons(n_samples=1000, noise=0.1, random_state=1)
print('Sample shape:{}'.format(X1.shape))
plt.scatter(X1[:, 0], X1[:, 1], marker='*', c=y1)
plt.title('make_moons()', fontsize=20)
plt.show()

Sample shape:(1000, 2)

1.2 Datasets bundled with sklearn

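Method	Description	Use
load_iris()	Iris dataset	classification or clustering
load_digits()	Handwritten digits dataset	classification
load_breast_cancer()	Breast cancer dataset	classification
load_boston()	Boston housing prices dataset (removed in scikit-learn 1.2)	regression
load_linnerud()	Linnerud physical exercise dataset	regression
load_sample_image(name)	Sample image dataset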

# Iris dataset
iris = datasets.load_iris()
iris['target_names']

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

# Handwritten digits dataset
digits = datasets.load_digits()
digits['target_names']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

# Breast cancer dataset
breast = datasets.load_breast_cancer()
breast['target_names']

array(['malignant', 'benign'], dtype='<U9')

# Boston housing prices dataset
boston = datasets.load_boston()
boston['feature_names']

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

# Linnerud physical exercise dataset
linnerud = datasets.load_linnerud()
linnerud['feature_names']

['Chins', 'Situps', 'Jumps']

# Sample image dataset
china = datasets.load_sample_image('china.jpg')
plt.axis('off')
plt.title('Image of the Summer Palace, China', fontproperties=font, fontsize=20)
plt.imshow(china)
plt.show()
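
The loaders in this section return dictionary-like objects, which is why the examples index them with keys such as `'target_names'`. As a sketch of how such an object might be turned into something easier to explore, the code below builds a pandas DataFrame from the iris data (the `species` column name is an assumption of this example) and uses `return_X_y=True` to get plain arrays for the breast cancer data.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Features as named columns, plus a readable species column built from the integer targets.
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.shape)
print(df["species"].value_counts())

# return_X_y=True skips the container and returns the arrays directly.
X, y = datasets.load_breast_cancer(return_X_y=True)
print(X.shape, y.shape)
```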

1.3 Importing data from the UCI repository

UCI repository: http://archive.ics.uci.edu/ml/datasets.html

df = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
# Take the fifth column of the first 100 rows as the label vector
y = df.iloc[0:100, 4].values
# Assign 1 if the label is 'Iris-versicolor', otherwise -1
y = np.where(y == 'Iris-versicolor', 1, -1)
# Take the third and fourth columns (petal length and petal width) of all rows as the feature matrix
X = df.iloc[:, [2, 3]].values
plt.scatter(X[:50, 0], X[:50, 1], color='b', s=50, marker='x', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1], color='r',
            s=50, marker='o', label='versicolor')
plt.scatter(X[100:150, 0], X[100:150, 1], color='g',
            s=50, marker='*', label='virginica')
plt.xlabel('Petal length', fontproperties=font, fontsize=15)
plt.ylabel('Petal width', fontproperties=font, fontsize=15)
plt.title('Petal length vs. petal width', fontproperties=font, fontsize=20)
plt.legend(prop=font)
plt.show()
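
The same loading and label-encoding steps can be tried offline. The sketch below assumes a three-row excerpt in the `iris.data` layout (one row per species) held in memory, and the column names are chosen for this example only, since the file itself has no header.

```python
import io

import numpy as np
import pandas as pd

# Three rows in the iris.data layout: four measurements followed by the class name.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Column names are an assumption of this sketch; header=None tells pandas the file has none.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = pd.read_csv(sample, header=None, names=columns)

# Encode versicolor as 1 and everything else as -1, as in the code above.
y = np.where(df["species"] == "Iris-versicolor", 1, -1)

# Selecting by name avoids mixing up positional column indices.
X = df[["petal_length", "petal_width"]].to_numpy()

print(y)
print(X)
```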

1.4 Importing CSV data from a Tianchi competition

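This example uses the wine quality data from a Tianchi competition. Download page: https://tianchi.aliyun.com/dataset/dataDetail?dataId=44

As the dataset page shows, the wine quality data is stored in a .csv file. We first download the file to the local machine and then process it with pandas.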

# A csv file is an ordinary text file; pandas reads it with read_csv
df = pd.read_csv('winequality-red.csv')
df[:2]

	fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
0	7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
1	7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5

# The sep parameter specifies the field delimiter used in the csv file
df = pd.read_csv('winequality-red.csv', sep=';')
df[:2]

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.0	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.0	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5

# Feature values (every column except the last) of the first two rows
X = df.iloc[:2, :-1].values
X

array([[ 7.4   ,  0.7   ,  0.    ,  1.9   ,  0.076 , 11.    , 34.    ,
         0.9978,  3.51  ,  0.56  ,  9.4   ],
       [ 7.8   ,  0.88  ,  0.    ,  2.6   ,  0.098 , 25.    , 67.    ,
         0.9968,  3.2   ,  0.68  ,  9.8   ]])

# Label values (the last column) of the first two rows
y = df.iloc[:2, -1].values
y

array([5, 5])
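
The effect of `sep` can be reproduced without the downloaded file. This sketch feeds the header and the two rows shown above to `read_csv` from an in-memory string, once with the default comma delimiter and once with `sep=';'`, and then selects the label by column name rather than by position.

```python
import io

import pandas as pd

# Header and first two rows as they appear in the output above.
raw = (
    'fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";'
    '"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";'
    '"alcohol";"quality"\n'
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
)

# Default delimiter is a comma, so each line ends up in a single column.
wrong = pd.read_csv(io.StringIO(raw))

# With the right delimiter the twelve fields are split into twelve columns.
right = pd.read_csv(io.StringIO(raw), sep=";")

X = right.drop(columns="quality").to_numpy()
y = right["quality"].to_numpy()

print(wrong.shape, right.shape)
print(X.shape, y)
```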
