Learning sklearn: make_multilabel_classification, a method for generating multilabel datasets

阿新 • Published: 2017-08-29


Generate a random multilabel classification problem.

For each sample, the generative process is:

pick the number of labels: n ~ Poisson(n_labels)
n times, choose a class c: c ~ Multinomial(theta)
pick the document length: k ~ Poisson(length)
k times, choose a word: w ~ Multinomial(theta_c)

In the above process, rejection sampling is used to make sure that n is never zero or more than n_classes, and that the document length is never zero. Likewise, we reject classes which have already been chosen.
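The four steps and the rejection rules can be sketched in plain NumPy. This is only an illustration, not scikit-learn's actual implementation: the helper name `sample_document` is hypothetical, the class prior `p_c` and the per-class word distributions `p_w_c` are drawn from a Dirichlet distribution purely for the demo, and the word distribution for a set of labels is assumed to be the normalized sum of the per-class distributions.

```python
import numpy as np


def sample_document(rng, p_c, p_w_c, n_labels=2, length=50):
    """Draw one (word_counts, labels) pair following the four steps."""
    n_features, n_classes = p_w_c.shape

    # Step 1: n ~ Poisson(n_labels); reject 0 and anything above n_classes.
    n = 0
    while n == 0 or n > n_classes:
        n = rng.poisson(n_labels)

    # Step 2: draw classes from Multinomial(p_c); the set silently
    # rejects classes that have already been chosen.
    labels = set()
    while len(labels) < n:
        labels.add(int(rng.choice(n_classes, p=p_c)))
    labels = sorted(labels)

    # Step 3: k ~ Poisson(length); reject a document length of 0.
    k = 0
    while k == 0:
        k = rng.poisson(length)

    # Step 4: draw k words. ASSUMPTION: the word distribution for a label
    # set is the normalized sum of the chosen classes' distributions.
    p_w = p_w_c[:, labels].sum(axis=1)
    p_w /= p_w.sum()
    counts = rng.multinomial(k, p_w)
    return counts, labels


rng = np.random.default_rng(0)
n_features, n_classes = 20, 5
p_c = rng.dirichlet(np.ones(n_classes))                       # demo class prior
p_w_c = rng.dirichlet(np.ones(n_features), size=n_classes).T  # (n_features, n_classes)

counts, labels = sample_document(rng, p_c, p_w_c)
print(counts.sum(), labels)
```

Each call returns a vector of word counts (one row of X) and the list of classes assigned to that document (one row of Y before it is converted to indicator format).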

Parameters:


n_samples : int, optional (default=100)

The number of samples to generate.

n_features : int, optional (default=20)

The total number of features in each sample.

n_classes : int, optional (default=5)

The number of classes (labels) in the classification problem.

n_labels : int, optional (default=2)

The average number of labels per instance. More precisely, the number of labels per sample is drawn from a Poisson distribution with n_labels as its expected value, but samples are bounded (using rejection sampling) by n_classes, and must be nonzero if allow_unlabeled is False.

length : int, optional (default=50)

The sum of the features (number of words if documents) is drawn from a Poisson distribution with this expected value.

allow_unlabeled : bool, optional (default=True)

If True, some instances might not belong to any class.

sparse : bool, optional (default=False)

If True, return a sparse feature matrix.

New in version 0.17: parameter to allow sparse output.

return_indicator : 'dense' (default) | 'sparse' | False

If 'dense', return Y in the dense binary indicator format. If 'sparse', return Y in the sparse binary indicator format. False returns a list of lists of labels.

return_distributions : bool, optional (default=False)

If True, return the prior class probability and conditional probabilities of features given classes, from which the data was drawn.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
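A few of these parameters are easier to understand by trying them out. The sketch below assumes scikit-learn and SciPy are installed; the variable names are illustrative only. It checks that a fixed integer random_state reproduces the same data, that allow_unlabeled=False leaves no sample without a label, and that sparse and return_indicator control the format of X and Y respectively.

```python
from scipy import sparse as sp
from sklearn.datasets import make_multilabel_classification

# The same integer seed should give the same dataset twice.
X1, Y1 = make_multilabel_classification(n_samples=50, random_state=0)
X2, Y2 = make_multilabel_classification(n_samples=50, random_state=0)
same = bool((X1 == X2).all() and (Y1 == Y2).all())

# With allow_unlabeled=False every row of Y has at least one label.
_, Y_labeled = make_multilabel_classification(
    n_samples=50, allow_unlabeled=False, random_state=0
)
min_labels = int(Y_labeled.sum(axis=1).min())

# sparse=True affects X; return_indicator="sparse" affects Y.
X_sp, Y_sp = make_multilabel_classification(
    n_samples=50, sparse=True, return_indicator="sparse", random_state=0
)

print(same, min_labels, sp.issparse(X_sp), sp.issparse(Y_sp))
```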

Returns:

X : array of shape [n_samples, n_features]

The generated samples, one row per sample and one column per feature.

Y : array or sparse CSR matrix of shape [n_samples, n_classes]

The label sets.

p_c : array, shape [n_classes]

The probability of each class being drawn. Only returned if return_distributions=True.

p_w_c : array, shape [n_features, n_classes]

The probability of each feature being drawn given each class. Only returned if return_distributions=True.
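As a quick check of the return values described above, the following sketch (assuming scikit-learn and NumPy are installed) requests the distributions as well and inspects the shapes. The sizes are just the defaults written out explicitly.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

X, Y, p_c, p_w_c = make_multilabel_classification(
    n_samples=100,
    n_features=20,
    n_classes=5,
    return_distributions=True,
    random_state=0,
)

print(X.shape, Y.shape)        # (100, 20) (100, 5)
print(p_c.shape, p_w_c.shape)  # (5,) (20, 5)

# p_c is a distribution over classes; each column of p_w_c is a
# distribution over features for one class.
p_c_is_normalized = bool(np.isclose(p_c.sum(), 1.0))
columns_are_normalized = bool(np.allclose(p_w_c.sum(axis=0), 1.0))

# Each row of X sums to that sample's "document length", which is never 0.
shortest_document = X.sum(axis=1).min()
```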

From the official tutorial:

"""
==============================================
Plot randomly generated multilabel dataset
==============================================
This illustrates the `datasets.make_multilabel_classification` dataset generator. Each sample consists of counts of two features (up to 50 in total), which are differently distributed in each of three classes. Points are labeled as follows, where Y means the class is present:

=====  =====  =====  ======
  1      2      3    Color
=====  =====  =====  ======
  Y      N      N    Red
  N      Y      N    Blue
  N      N      Y    Yellow
  Y      Y      N    Purple
  Y      N      Y    Orange
  N      Y      Y    Green
  Y      Y      Y    Brown
=====  =====  =====  ======
A star marks the expected sample for each class; its size reflects the probability of selecting that class label.
The left and right examples highlight the ``n_labels`` parameter: more of the samples in the right plot have 2 or 3 labels. Note that this two-dimensional example is very degenerate: generally the number of features would be much greater than the "document length", while here we have much larger documents than vocabulary.
Similarly, with ``n_classes > n_features`` (here 3 > 2), it is much less likely that a feature distinguishes a particular class.

"""
