cuML: A Machine Learning Acceleration Library

阿新 · Published: 2020-10-18

https://github.com/rapidsai/cudf

https://github.com/cupy/cupy

https://github.com/rapidsai/cuml

Source: nvidia-rapids︱cuML machine learning acceleration library, 素質雲筆記 / Recorder... (CSDN blog)

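cuML is a suite of libraries implementing machine learning algorithms and mathematical primitives whose APIs are compatible with the other RAPIDS projects.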

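cuML lets data scientists, researchers and software engineers run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API of scikit-learn.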
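Because the estimators mirror scikit-learn, moving between CPU and GPU is often just a matter of which module you import. The sketch below uses KMeans as the example estimator (an assumption; any estimator present in both libraries would work). It uses cuML when the package can be imported and otherwise falls back to scikit-learn. The helper name cluster is hypothetical.

```python
import numpy as np

try:
    from cuml.cluster import KMeans      # GPU version (needs RAPIDS and a CUDA GPU)
    BACKEND = "cuml"
except ImportError:
    from sklearn.cluster import KMeans   # CPU version with the same constructor/fit API
    BACKEND = "sklearn"


def cluster(X, n_clusters=3, seed=0):
    """Fit KMeans on whichever backend was imported and return the labels as NumPy."""
    model = KMeans(n_clusters=n_clusters, random_state=seed)
    model.fit(X)
    # With NumPy input, cuML returns NumPy-typed outputs by default,
    # so np.asarray works for both backends.
    return np.asarray(model.labels_)
```

This only handles a missing cuML package. If cuML is installed but no GPU is visible, the import may fail with a different error.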

For large datasets, these GPU-based implementations can run 10-50x faster than their CPU equivalents. For performance details, see the cuML benchmark notebooks.
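To reproduce such comparisons on your own data, time the fit step the same way for both libraries. Below is a minimal sketch, assuming estimators with a scikit-learn style fit method; best_fit_time is a hypothetical helper. The warm-up fit keeps one-off start-up costs (such as CUDA context creation) out of the measurement.

```python
import time


def best_fit_time(make_estimator, X, repeats=3, warmup=True):
    """Return the fastest wall-clock fit time in seconds over `repeats` runs.

    make_estimator is a zero-argument callable returning a fresh estimator.
    An optional warm-up fit runs first and is not timed.
    """
    if warmup:
        make_estimator().fit(X)
    times = []
    for _ in range(repeats):
        est = make_estimator()
        start = time.perf_counter()
        est.fit(X)
        times.append(time.perf_counter() - start)
    return min(times)
```

For instance, best_fit_time(lambda: skDBSCAN(eps=0.15), host_data) divided by best_fit_time(lambda: cuDBSCAN(eps=0.15), device_data) gives a speed-up factor for the DBSCAN example in section 2.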

Official documentation:
rapidsai/cuml
cuML API Reference

There are quite a few official examples, and the cuML API Reference lists the available models.

Related articles:

nvidia-rapids︱cuDF: a pandas-like DataFrame library
NVIDIA's Python GPU algorithm ecosystem︱RAPIDS 0.10
nvidia-rapids︱cuML: a machine learning acceleration library
nvidia-rapids︱cuGraph (NetworkX-like) graph models


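1 Installation and Background

1.1 Installation

Reference: https://github.com/rapidsai/cuml/blob/branch-0.13/BUILD.md (the command below creates a development environment and is run from the root of a cuml source checkout)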

conda env create -n cuml_dev python=3.7 --file=conda/environments/cuml_dev_cuda10.0.yml

For the Docker image, see: https://rapids.ai/start.html#prerequisites

docker pull rapidsai/rapidsai:cuda10.1-runtime-ubuntu16.04-py3.7
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/rapidsai:cuda10.1-runtime-ubuntu16.04-py3.7

1.2 Background

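Training is not the whole story. To truly scale data science on GPUs, end-to-end applications need to be accelerated as well.

cuML 0.9 brings the next step in GPU-based tree model support, including the new Forest Inference Library (FIL). FIL is a lightweight GPU-accelerated engine that performs inference for tree-based models, including gradient-boosted decision trees and random forests. With a single V100 GPU and two lines of Python code, users can load a saved XGBoost or LightGBM model and run inference on new data 36x faster than on a dual 20-core CPU node. Building on the open-source Treelite package, the next version of FIL will also add support for scikit-learn and cuML random forest models.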

Figure 3: Inference speed comparison, XGBoost on CPU vs. Forest Inference Library (FIL) on GPU
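This is what "inference on tree models" means concretely:

- Each tree routes a row from the root to a leaf. At every node it compares one feature against a threshold.
- The forest then combines the leaf values. Gradient boosting sums them into a margin; a random forest averages them.

The NumPy sketch below evaluates all rows in lock-step over flat node arrays. It only illustrates the computation FIL parallelizes on the GPU and is not FIL's actual data layout. The encoding (feature == -1 marks a leaf) is an assumption of this sketch.

```python
import numpy as np


def predict_tree(X, feature, threshold, left, right, value):
    """Evaluate one decision tree for every row of X at once.

    The tree is stored as flat arrays indexed by node id; feature[node] == -1
    marks a leaf, whose output is value[node]. A row goes left when
    X[row, feature] < threshold, otherwise right.
    """
    rows = np.arange(X.shape[0])
    node = np.zeros(X.shape[0], dtype=np.int64)   # every row starts at the root
    while True:
        f = feature[node]
        active = f >= 0                           # rows not yet at a leaf
        if not active.any():
            return value[node]
        cur = node[active]
        go_left = X[rows[active], f[active]] < threshold[cur]
        node[active] = np.where(go_left, left[cur], right[cur])


def predict_forest(X, trees, average=False):
    """Combine per-tree outputs: summed (boosting margin) or averaged (random forest)."""
    out = np.sum([predict_tree(X, *tree) for tree in trees], axis=0)
    return out / len(trees) if average else out
```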

2 DBSCAN

The DBSCAN algorithm is a clustering algorithm that works well on datasets containing regions of high density: it groups points lying in dense regions and labels points in sparse regions as noise.
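DBSCAN's rule fits in a few lines:

- A point is a core point if at least min_samples points, itself included, lie within eps of it.
- Clusters grow outward from core points through their eps-neighbourhoods.
- Points that no core point can reach are labelled -1 (noise).

Below is a minimal brute-force NumPy sketch of that rule, for illustration only. It builds the full n x n distance matrix, loosely the same pairwise work that the cuML example below processes in batches on the GPU (see max_mbytes_per_batch).

```python
import numpy as np


def dbscan(X, eps, min_samples):
    """Minimal brute-force DBSCAN. Returns integer labels; -1 marks noise."""
    n = X.shape[0]
    # Pairwise squared distances: O(n^2) memory, fine for a small demo
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    neighbors = [np.flatnonzero(row <= eps * eps) for row in d2]  # includes the point itself
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Start a new cluster at core point i and expand it with a stack
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join the cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels
```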

The model accepts array-like objects, either in host memory as NumPy arrays or in device memory (Numba device arrays or any __cuda_array_interface__-compliant array), as well as cuDF DataFrames.
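Host and device inputs can be told apart by the array protocols they expose. GPU arrays such as CuPy and Numba device arrays implement __cuda_array_interface__, while NumPy arrays implement __array_interface__. Below is a small sketch of that check; memory_location is a hypothetical helper and not part of cuML.

```python
import numpy as np


def memory_location(x):
    """Rough guess of where an array-like lives, based on the protocols it exposes."""
    if hasattr(x, "__cuda_array_interface__"):
        return "device"   # e.g. CuPy arrays, Numba device arrays
    if hasattr(x, "__array_interface__") or isinstance(x, (list, tuple)):
        return "host"     # NumPy arrays and plain Python sequences
    return "unknown"
```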

import cudf
import matplotlib.pyplot as plt
import numpy as np
from cuml.datasets import make_blobs
from cuml.cluster import DBSCAN as cuDBSCAN
from sklearn.cluster import DBSCAN as skDBSCAN
from sklearn.metrics import adjusted_rand_score

%matplotlib inline

# Define parameters
n_samples = 10**4
n_features = 2

eps = 0.15
min_samples = 3
random_state = 23

%%time
# Generate data on the GPU
device_data, device_labels = make_blobs(n_samples=n_samples,
                                        n_features=n_features,
                                        centers=5,
                                        cluster_std=0.1,
                                        random_state=random_state)

# Wrap the device arrays in cuDF objects (DataFrame.from_gpu_matrix is no
# longer available in recent cuDF; the constructor accepts device arrays)
device_data = cudf.DataFrame(device_data)
device_labels = cudf.Series(device_labels)
# Copy dataset from GPU memory to host memory.
# This is done to later compare CPU and GPU results.
host_data = device_data.to_pandas()
host_labels = device_labels.to_pandas()

%%time
# Fit the scikit-learn model (CPU); %%time must be the first line of the cell
clustering_sk = skDBSCAN(eps=eps,
                         min_samples=min_samples,
                         algorithm="brute",
                         n_jobs=-1)

clustering_sk.fit(host_data)

%%time
# Fit the cuML model (GPU)
clustering_cuml = cuDBSCAN(eps=eps,
                           min_samples=min_samples,
                           verbose=True,
                           max_mbytes_per_batch=13e3)

clustering_cuml.fit(device_data, out_dtype="int32")

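# Visualization
fig = plt.figure(figsize=(16, 10))

X = np.array(host_data)
# labels_ is a cuDF Series for cuDF input; copy it to host memory as a NumPy array
labels = clustering_cuml.labels_.to_pandas().values

# Number of clusters found, not counting the noise label (-1)
unique_labels = np.unique(labels)
n_clusters_ = len(unique_labels) - (1 if -1 in unique_labels else 0)

# Black removed and is used for noise instead.
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]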
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markersize=5, markeredgecolor=tuple(col))

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

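Evaluating the results:

%%time
# Adjusted Rand index of each clustering against the true blob labels
sk_score = adjusted_rand_score(host_labels, clustering_sk.labels_)
cuml_score = adjusted_rand_score(host_labels, clustering_cuml.labels_.to_pandas())
sk_score, cuml_score

Output:

(0.9998750031236718, 0.9998750031236718)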

The two scores are exactly the same, i.e., scikit-learn and cuML produce consistent results.
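The adjusted Rand index compares two labelings through a contingency table of pair counts and corrects for chance. A score of 1.0 means the partitions are identical up to renaming of the labels, and values near 0 mean chance-level agreement. This is why both libraries can score almost exactly 1 even though cluster IDs are arbitrary. Below is a from-scratch NumPy sketch for illustration; adjusted_rand_index is a hypothetical name, while the example above uses scikit-learn's adjusted_rand_score.

```python
from math import comb

import numpy as np


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings (the label values themselves are irrelevant)."""
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)
    # table[i, j] = number of points in true class i and predicted cluster j
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)

    def pairs(counts):
        return sum(comb(int(c), 2) for c in np.ravel(counts))

    index = pairs(table)                    # pairs grouped together in both labelings
    sum_rows = pairs(table.sum(axis=1))
    sum_cols = pairs(table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(len(t), 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:               # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (index - expected) / (max_index - expected)
```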

3 Using the TSNE Algorithm on Fashion MNIST

TSNE (t-distributed Stochastic Neighbor Embedding) is a popular dimensionality reduction algorithm for visualizing large, complex datasets, including medical scans, neural network weights, gene expression data and much more.

cuML's TSNE algorithm supports both the faster Barnes-Hut $O(n \log n)$ algorithm and the slower exact $O(n^2)$ algorithm.

The model accepts array-like objects as input, such as NumPy arrays in host memory, as well as cuDF DataFrames.
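The exact method is O(n^2) because t-SNE's low-dimensional similarities q_ij are normalized over every pair of points. For the 60,000 Fashion MNIST images, a 60,000 x 60,000 matrix has 3.6 x 10^9 entries, roughly 14 GB in float32. Barnes-Hut avoids this by approximating the repulsive forces from groups of distant points with a space-partitioning tree.

Below is a NumPy sketch of the exact q_ij matrix, for illustration only; tsne_q_matrix is a hypothetical helper.

```python
import numpy as np


def tsne_q_matrix(Y):
    """Exact low-dimensional affinities q_ij for a t-SNE embedding Y of shape (n, d).

    q_ij = (1 + ||y_i - y_j||^2)^-1 / sum_{k != l} (1 + ||y_k - y_l||^2)^-1
    Materializing this n x n matrix is what makes the exact method O(n^2).
    """
    sq = np.sum(Y ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T   # pairwise squared distances
    num = 1.0 / (1.0 + np.maximum(d2, 0.0))          # Student-t kernel, 1 degree of freedom
    np.fill_diagonal(num, 0.0)                       # q_ii is defined as 0
    return num / num.sum()
```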

import gzip
import matplotlib.pyplot as plt
import numpy as np
import os
from cuml.manifold import TSNE

%matplotlib inline

# https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py
def load_mnist_train(path):
    """Load MNIST data from path"""
    labels_path = os.path.join(path, 'train-labels-idx1-ubyte.gz')
    images_path = os.path.join(path, 'train-images-idx3-ubyte.gz')

    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)
    return images, labels


# Load the data
images, labels = load_mnist_train("data/fashion")

plt.figure(figsize=(5,5))
plt.imshow(images[100].reshape((28, 28)), cmap = 'gray')

# Build the model and compute the 2-D embedding
tsne = TSNE(n_components = 2, method = 'barnes_hut', random_state=23)
%time embedding = tsne.fit_transform(images)

print(embedding[:10], embedding.shape)



CPU times: user 2.41 s, sys: 2.57 s, total: 4.98 s
Wall time: 4.98 s
[[-13.577632    39.87483   ]
 [ 26.136728   -17.68164   ]
 [ 23.164072    22.151243  ]
 [ 28.361032    11.134571  ]
 [ 35.419216     5.6633983 ]
 [ -0.15575314 -11.143476  ]
 [-24.30308     -1.584903  ]
 [ -5.9438944  -27.522072  ]
 [  2.0439444   29.574451  ]
 [ -3.0801039   27.079374  ]] (60000, 2)
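The offset=8 and offset=16 arguments in load_mnist_train skip the IDX file headers. All header fields are big-endian 32-bit integers:

- A label file starts with a magic number (2049) and an item count.
- An image file starts with a magic number (2051), the count, the number of rows and the number of columns.

The round-trip sketch below writes tiny synthetic files in that format and reads them back the same way. write_idx_labels, write_idx_images and read_idx are hypothetical helpers for illustration.

```python
import gzip
import os
import struct
import tempfile

import numpy as np


def write_idx_labels(path, labels):
    """Write a gzipped IDX1 label file: 8-byte header, then one uint8 per label."""
    labels = np.asarray(labels, dtype=np.uint8)
    with gzip.open(path, "wb") as f:
        f.write(struct.pack(">II", 2049, len(labels)))       # magic, item count
        f.write(labels.tobytes())


def write_idx_images(path, images):
    """Write a gzipped IDX3 image file: 16-byte header, then uint8 pixels row by row."""
    images = np.asarray(images, dtype=np.uint8)
    n, rows, cols = images.shape
    with gzip.open(path, "wb") as f:
        f.write(struct.pack(">IIII", 2051, n, rows, cols))   # magic, count, rows, cols
        f.write(images.tobytes())


def read_idx(path, header_bytes):
    """Read the payload of a gzipped IDX file, skipping its header like load_mnist_train."""
    with gzip.open(path, "rb") as f:
        return np.frombuffer(f.read(), dtype=np.uint8, offset=header_bytes)


# Round trip with two synthetic 28 x 28 images
with tempfile.TemporaryDirectory() as tmp:
    demo_images = (np.arange(2 * 28 * 28) % 256).astype(np.uint8).reshape(2, 28, 28)
    demo_labels = np.array([3, 9], dtype=np.uint8)
    write_idx_labels(os.path.join(tmp, "labels.gz"), demo_labels)
    write_idx_images(os.path.join(tmp, "images.gz"), demo_images)
    loaded_labels = read_idx(os.path.join(tmp, "labels.gz"), 8)
    loaded_images = read_idx(os.path.join(tmp, "images.gz"), 16).reshape(-1, 784)
```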

Visualize the embedding:

# Visualize Embedding


classes = [
    'T-shirt/top',
    'Trouser',
    'Pullover',
    'Dress',
    'Coat',
    'Sandal',
    'Shirt',
    'Sneaker',
    'Bag',
    'Ankle boot'
]

fig, ax = plt.subplots(1, figsize = (14, 10))
plt.scatter(embedding[:,1], embedding[:,0], s = 0.3, c = labels, cmap = 'Spectral')
plt.setp(ax, xticks = [], yticks = [])
cbar = plt.colorbar(boundaries = np.arange(11)-0.5)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(classes)
plt.title('Fashion MNIST Embedded via TSNE');

4 XGBoost

import numpy as np; print('numpy Version:', np.__version__)
import pandas as pd; print('pandas Version:', pd.__version__)
import xgboost as xgb; print('XGBoost Version:', xgb.__version__)


# helper function for simulating data
def simulate_data(m, n, k=2, numerical=False):
    if numerical:
        features = np.random.rand(m, n)
    else:
        features = np.random.randint(2, size=(m, n))
    labels = np.random.randint(k, size=m)
    return np.c_[labels, features].astype(np.float32)


# helper function for loading data
def load_data(filename, n_rows):
    if n_rows >= 1e9:
        df = pd.read_csv(filename)
    else:
        df = pd.read_csv(filename, nrows=n_rows)
    return df.values.astype(np.float32)

# settings
LOAD = False
n_rows = int(1e5)
n_columns = int(100)
n_categories = 2

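%%time
# Load data (simulated when LOAD is False)

if LOAD:
    # '/tmp' is a placeholder: load_data expects the path of a CSV file whose first column is the label
    dataset = load_data('/tmp', n_rows)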
else:
    dataset = simulate_data(n_rows, n_columns, n_categories)
print(dataset.shape)

# Train/validation split
# identify shape and indices
n_rows, n_columns = dataset.shape
train_size = 0.80
train_index = int(n_rows * train_size)

# split X, y
X, y = dataset[:, 1:], dataset[:, 0]
del dataset

# split train data
X_train, y_train = X[:train_index, :], y[:train_index]

# split validation data
X_validation, y_validation = X[train_index:, :], y[train_index:]

# Sanity checks
# check dimensions
print('X_train: ', X_train.shape, X_train.dtype, 'y_train: ', y_train.shape, y_train.dtype)
print('X_validation', X_validation.shape, X_validation.dtype, 'y_validation: ', y_validation.shape, y_validation.dtype)

# check the proportions
total = X_train.shape[0] + X_validation.shape[0]
print('X_train proportion:', X_train.shape[0] / total)
print('X_validation proportion:', X_validation.shape[0] / total)

%%time
# Convert NumPy data to DMatrix format

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalidation = xgb.DMatrix(X_validation, label=y_validation)

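# Set parameters
# instantiate params
params = {}

# general params ('verbosity' replaces the deprecated 'silent' flag)
general_params = {'verbosity': 0}
params.update(general_params)

# booster params
n_gpus = 1
booster_params = {}

if n_gpus != 0:
    # GPU histogram algorithm (XGBoost 1.x). XGBoost >= 2.0 deprecates it in
    # favour of tree_method='hist' plus device='cuda'. The old 'n_gpus'
    # parameter is no longer used; pick the GPU with CUDA_VISIBLE_DEVICES.
    booster_params['tree_method'] = 'gpu_hist'
params.update(booster_params)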

# learning task params
learning_task_params = {'eval_metric': 'auc', 'objective': 'binary:logistic'}
params.update(learning_task_params)
print(params)


# Model training
# model training settings
evallist = [(dvalidation, 'validation'), (dtrain, 'train')]
num_round = 10

%%time

bst = xgb.train(params, dtrain, num_round, evallist)

Output:

[0]	validation-auc:0.504014	train-auc:0.542211
[1]	validation-auc:0.506166	train-auc:0.559262
[2]	validation-auc:0.501638	train-auc:0.570375
[3]	validation-auc:0.50275	train-auc:0.580726
[4]	validation-auc:0.503445	train-auc:0.589701
[5]	validation-auc:0.503413	train-auc:0.598342
[6]	validation-auc:0.504258	train-auc:0.605253
[7]	validation-auc:0.503157	train-auc:0.611937
[8]	validation-auc:0.502372	train-auc:0.617561
[9]	validation-auc:0.501949	train-auc:0.62333
CPU times: user 1.12 s, sys: 195 ms, total: 1.31 s
Wall time: 360 ms
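simulate_data draws the labels independently of the features (np.random.randint), so the data carries no signal. That is why the validation AUC stays around 0.5 while the training AUC climbs as the trees fit noise.

The NumPy sketch below computes ROC AUC through the Mann-Whitney rank statistic and shows that random scores land near 0.5. roc_auc is a hypothetical helper; XGBoost computes the 'auc' metric itself.

```python
import numpy as np


def roc_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney U statistic; tied scores get their average rank."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    ends = np.cumsum(counts)                 # last 1-based rank of each group of equal scores
    starts = ends - counts + 1               # first 1-based rank of each group
    ranks = ((starts + ends) / 2.0)[inverse]
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```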

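Several parameter names in this example have changed across XGBoost releases:

- 'silent' was superseded by 'verbosity'.
- 'n_gpus' was dropped in 1.0.
- XGBoost 2.0 deprecated tree_method='gpu_hist' in favour of tree_method='hist' plus device='cuda'.

Below is a small sketch that picks the spelling from a version string. gpu_params is a hypothetical helper; you would pass it xgb.__version__.

```python
def gpu_params(xgb_version, use_gpu=True):
    """Build XGBoost training params for the given XGBoost version string."""
    major = int(xgb_version.split(".")[0])
    params = {"objective": "binary:logistic", "eval_metric": "auc", "verbosity": 0}
    if not use_gpu:
        params["tree_method"] = "hist"                    # CPU histogram algorithm
    elif major >= 2:
        params.update(tree_method="hist", device="cuda")  # XGBoost 2.x style
    else:
        params["tree_method"] = "gpu_hist"                # XGBoost 1.x style
    return params
```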

5 Image Retrieval with KNN

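Reference: Using RAPIDS to Accelerate Image Search on GPU Instances (Alibaba Cloud documentation)

The Alibaba Cloud documentation covers this in detail, so only a brief outline is given here.
Image features are extracted with the open-source frameworks TensorFlow and Keras, using a ResNet50 (notop) model pre-trained on the ImageNet dataset.
The model (about 91 MB) is downloaded from the internet on first use and saved by default to the /root/.keras/models/ directory.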

Download the data:

import os
import tarfile
import numpy as np
from urllib.request import urlretrieve


def download_and_extract(data_dir):
    """doc"""
    def _progress(count, block_size, total_size):
        print('\r>>> Downloading %s  (total:%.0fM) %.1f%%' % (
            filename, total_size / 1024 / 1024, 100.0 * count * block_size / total_size), end='')
    
    url = 'http://ai.stanford.edu/~acoates/stl10/stl10_binary.tar.gz'
    filename = url.split('/')[-1]
    filepath = os.path.join(data_dir, filename)
    decom_dir = os.path.join(data_dir, filename.split('.')[0])
    
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    if os.path.exists(filepath):
        print('>>> {} already exists in the current directory.'.format(filename))
    else:
        urlretrieve(url, filepath, _progress)
        print("\nSuccessfully downloaded.")
    if not os.path.exists(decom_dir):
        # Decompress
        print(">>> Decompressing from {}....".format(filepath))
        tar = tarfile.open(filepath, 'r')
        tar.extractall(data_dir)
        print("Successfully decompressed")
        tar.close()
    else:
        print('>>> Directory "{}" already exists.'.format(decom_dir))


def read_all_images(path_to_data):
    """get all images from binary path"""
    with open(path_to_data, 'rb') as f:
        everything = np.fromfile(f, dtype=np.uint8)
        images = np.reshape(everything, (-1, 3, 96, 96))
        images = np.transpose(images, (0, 3, 2, 1))
    return images


# the directory to save data
data_dir = './data'
# download and decompression
download_and_extract(data_dir)

# Read the data
# the path of unlabeled data
path_unlabeled = os.path.join(data_dir, 'stl10_binary/unlabeled_X.bin')
# get images from binary
images = read_all_images(path_unlabeled)
print('>>> images shape: ', images.shape)

# Look at a sample image
import random
import matplotlib.pyplot as plt
%matplotlib inline

def show_image(image):
    """show image"""
    fig = plt.figure(figsize=(3, 3))
    plt.imshow(image)
    plt.show()
    fig.clear()


# show a random image (randrange excludes the upper bound; randint would not)
rand_image_index = random.randrange(images.shape[0])
show_image(images[rand_image_index])

# Split the data
from sklearn.model_selection import train_test_split

train_images, query_images = train_test_split(images, test_size=0.1, random_state=123)
print('train_images shape: ', train_images.shape)
print('query_images shape: ', query_images.shape)

# Image features
# Limit TensorFlow's GPU memory usage: with the default settings TensorFlow
# grabs nearly all GPU memory, and some must be left for cuML.
import os
# only use device 0
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# TensorFlow 1.x / standalone Keras API (tf.ConfigProto and tf.Session are not
# top-level in TensorFlow 2.x; there, use tf.config.experimental.set_memory_growth)
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
# method 1: allocate GPU memory on demand, based on runtime allocations
# config.gpu_options.allow_growth = True
# method 2: cap the fraction of the overall amount of memory
# that each visible GPU may allocate
config.gpu_options.per_process_gpu_memory_fraction = 0.3
set_session(tf.Session(config=config))

# Feature extraction
from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input

# download resnet50(notop) model(first running) and load model
model = ResNet50(weights='imagenet', include_top=False, input_shape=(96, 96, 3), pooling='max')
# network summary
model.summary()

%%time
train_features = model.predict(train_images)
print('train features shape: ', train_features.shape)

%%time
query_features = model.predict(query_images)
print('query features shape: ', query_features.shape)
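The layout handling in read_all_images works as follows. STL-10 stores each image channel by channel, and each 96 x 96 channel in column-major order. So reshape(-1, 3, 96, 96) yields axes (image, channel, column, row), and transpose((0, 3, 2, 1)) turns that into the (image, row, column, channel) layout that Matplotlib and Keras expect.

The sketch below checks this on a tiny synthetic 2 x 2 image. stl10_to_hwc is a hypothetical name for the same two steps.

```python
import numpy as np


def stl10_to_hwc(raw, size=96):
    """Turn raw STL-10 bytes into an (n, height, width, 3) image array.

    Each image is stored channel by channel (R, G, B), and each channel in
    column-major order, i.e. the first `size` values are the first column.
    """
    images = np.reshape(raw, (-1, 3, size, size))   # axes: (image, channel, column, row)
    return np.transpose(images, (0, 3, 2, 1))       # axes: (image, row, column, channel)
```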

Next comes the KNN stage, with both scikit-learn KNN and cuML KNN:

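from cuml.neighbors import NearestNeighbors as cuNearestNeighbors
# import under distinct names so the sklearn class does not shadow cuML's
from sklearn.neighbors import NearestNeighbors as skNearestNeighbors

%%time
# cuML KNN on the GPU
knn_cuml = cuNearestNeighbors(n_neighbors=3)
knn_cuml.fit(train_features)

%%time
distances_cuml, indices_cuml = knn_cuml.kneighbors(query_features)

%%time
# scikit-learn KNN on the CPU
knn_sk = skNearestNeighbors(n_neighbors=3, metric='sqeuclidean', n_jobs=-1)
knn_sk.fit(train_features)

%%time
distances_sk, indices_sk = knn_sk.kneighbors(query_features)

# Compare the neighbours found by sklearn and cuML (fraction of matching indices).
# Indices are compared rather than distances because, depending on the cuML
# version, kneighbors returns Euclidean or squared Euclidean distances.
(np.asarray(indices_cuml) == indices_sk).mean()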

# Show the results
def show_images(query, sim_images, sim_dists):
    """Show a query image followed by its most similar images and their distances."""
    simi_num = len(sim_images)
    fig = plt.figure(figsize=(3 * (simi_num + 1), 3))

    axes = fig.subplots(1, simi_num + 1)
    for index, ax in enumerate(axes):
        if index == 0:
            ax.imshow(query)
            ax.set_title('query')
        else:
            ax.imshow(sim_images[index - 1])
            ax.set_title('dist: %.1f' % (sim_dists[index - 1]))
    plt.show()
    fig.clear()

# get random indices
random_show_index = np.random.randint(0, query_images.shape[0], size=5)
random_query = query_images[random_show_index]
# np.int was removed in NumPy 1.24; use an explicit integer type
random_indices = np.asarray(indices_cuml)[random_show_index].astype(np.int64)
random_distances = distances_cuml[random_show_index]

# show result images
for query_image, sim_indices, sim_dists in zip(random_query, random_indices, random_distances):
    sim_images = train_images[sim_indices]
    show_images(query_image, sim_images, sim_dists)
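With their default settings, both models above return exact (not approximate) nearest neighbours of the ResNet50 features. The NumPy sketch below is a CPU version of the same exact search. It uses the expansion ||q - t||^2 = ||q||^2 - 2 q·t + ||t||^2, so all query-to-train distances come from one matrix product. This is a common way to implement brute-force KNN and is assumed here for illustration; knn_bruteforce is a hypothetical helper.

```python
import numpy as np


def knn_bruteforce(train, query, k=3):
    """Exact k nearest neighbours under squared Euclidean distance.

    Returns (distances, indices), each of shape (n_query, k), nearest first.
    """
    # ||q - t||^2 = ||q||^2 - 2 q.t + ||t||^2, evaluated for all pairs at once
    d2 = (np.sum(query ** 2, axis=1)[:, None]
          - 2.0 * query @ train.T
          + np.sum(train ** 2, axis=1)[None, :])
    d2 = np.maximum(d2, 0.0)                           # clip tiny negatives from rounding
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]    # the k smallest, in any order
    order = np.argsort(np.take_along_axis(d2, idx, axis=1), axis=1)
    idx = np.take_along_axis(idx, order, axis=1)       # sort those k by distance
    return np.take_along_axis(d2, idx, axis=1), idx
```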

More will be added as I use more of the library.

End of article.
