Learning sklearn: make_multilabel_classification, a multilabel dataset generator
Generate a random multilabel classification problem.
For each sample, the generative process is:
- pick the number of labels: n ~ Poisson(n_labels)
- n times, choose a class c: c ~ Multinomial(theta)
- pick the document length: k ~ Poisson(length)
- k times, choose a word: w ~ Multinomial(theta_c)
In the above process, rejection sampling ensures that n is never zero or greater than n_classes, and that the document length is never zero. Classes that have already been chosen are likewise rejected.
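To make the four steps and the rejection sampling concrete, here is a minimal NumPy sketch of drawing a single sample. It is not sklearn's actual implementation: the function name `sample_one` is hypothetical, and averaging the word distributions of the chosen classes when a sample has several labels is an assumption made for illustration.

```python
import numpy as np


def sample_one(rng, p_c, p_w_c, n_labels=2, length=50):
    """Draw one sample (feature counts, label list) by following the four steps.

    p_c   : class probabilities theta, shape (n_classes,)
    p_w_c : word probabilities theta_c per class, shape (n_features, n_classes)
    """
    n_features, n_classes = p_w_c.shape

    # Step 1: n ~ Poisson(n_labels); reject 0 and values above n_classes.
    n = rng.poisson(n_labels)
    while n == 0 or n > n_classes:
        n = rng.poisson(n_labels)

    # Step 2: draw classes from theta, rejecting classes already chosen.
    labels = set()
    while len(labels) < n:
        labels.add(int(rng.choice(n_classes, p=p_c)))
    labels = sorted(labels)

    # Step 3: k ~ Poisson(length); reject 0.
    k = rng.poisson(length)
    while k == 0:
        k = rng.poisson(length)

    # Step 4: draw k words. ASSUMPTION: with several labels, the word
    # distribution is the average of the chosen classes' distributions.
    p_w = p_w_c[:, labels].mean(axis=1)
    counts = rng.multinomial(k, p_w)
    return counts, labels


# Hypothetical toy setup: 3 classes, 2 features, random normalized distributions.
rng = np.random.default_rng(0)
n_classes, n_features = 3, 2
p_c = rng.random(n_classes)
p_c /= p_c.sum()
p_w_c = rng.random((n_features, n_classes))
p_w_c /= p_w_c.sum(axis=0)  # each column is one class's word distribution

counts, labels = sample_one(rng, p_c, p_w_c)
print("labels:", labels, "feature counts:", counts)
```

The `while` loops are the rejection steps: a draw that violates a constraint is thrown away and repeated until a valid one appears.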
Parameters:
- n_samples : int, optional (default=100)
- n_features : int, optional (default=20)
- n_classes : int, optional (default=5)
- n_labels : int, optional (default=2)
- length : int, optional (default=50)
- allow_unlabeled : bool, optional (default=True)
- sparse : bool, optional (default=False)
- return_indicator : 'dense' (default) | 'sparse' | False
- return_distributions : bool, optional (default=False)
- random_state : int, RandomState instance or None, optional (default=None)
Returns:
- X : array of shape [n_samples, n_features]
- Y : array or sparse CSR matrix of shape [n_samples, n_classes]
- p_c : array, shape [n_classes] (only returned if return_distributions=True)
- p_w_c : array, shape [n_features, n_classes] (only returned if return_distributions=True)
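As a quick check of the parameters and return values listed above, the following sketch calls the generator with `return_distributions=True` and inspects the shapes. The specific values (100 samples, seed 0) are arbitrary choices for illustration, and keyword arguments are used throughout because newer sklearn releases accept most of these parameters by keyword only.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_multilabel_classification

# Dense output plus the class prior p_c and per-class feature distributions p_w_c.
X, Y, p_c, p_w_c = make_multilabel_classification(
    n_samples=100,
    n_features=20,
    n_classes=5,
    n_labels=2,
    return_distributions=True,
    random_state=0,
)

print(X.shape)      # (100, 20)
print(Y.shape)      # (100, 5)
print(p_c.shape)    # (5,)
print(p_w_c.shape)  # (20, 5)

# Sparse variant: both the features and the label indicator come back sparse.
X_sp, Y_sp = make_multilabel_classification(
    n_samples=100,
    sparse=True,
    return_indicator="sparse",
    random_state=0,
)
print(sparse.issparse(X_sp), sparse.issparse(Y_sp))
```

Each row of `Y` is a binary indicator vector, so `Y.sum(axis=1)` gives the number of labels per sample.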
Official tutorial:
"""
==============================================
Plot randomly generated multilabel dataset
==============================================
This illustrates the `datasets.make_multilabel_classification` dataset
generator. Each sample consists of counts of two features (about 50 in
total, on average), which are differently distributed in each of the
three classes. Points are labeled as follows, where Y means the class is
present:
    =====  =====  =====  ======
      1      2      3    Color
    =====  =====  =====  ======
      Y      N      N    Red
      N      Y      N    Blue
      N      N      Y    Yellow
      Y      Y      N    Purple
      Y      N      Y    Orange
      N      Y      Y    Green
      Y      Y      Y    Brown
    =====  =====  =====  ======
A star marks the expected sample for each class; its size reflects the
probability of selecting that class label.
The left and right examples highlight the ``n_labels`` parameter: more of
the samples in the right plot have 2 or 3 labels. Note that this
two-dimensional example is very degenerate: generally the number of
features would be much greater than the "document length", while here we
have much larger documents than vocabulary. Similarly, with
``n_classes > n_features`` (here 3 > 2), it is much less likely that a
feature distinguishes a particular class.
"""
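The tutorial text above describes a two-feature, three-class dataset and the effect of ``n_labels``. The sketch below reproduces that setup without plotting. The helper `make_dataset`, the sample count of 150, the seed, and the `n_labels` values 1 and 3 are all assumptions for illustration, not taken from the official example code. Each sample's color name follows the table, by encoding the three labels as the bit pattern class1*1 + class2*2 + class3*4.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

# Color names indexed by class1*1 + class2*2 + class3*4, matching the table.
# Index 0 (no label) does not occur below because allow_unlabeled=False.
COLOR_NAMES = np.array(
    ["(none)", "Red", "Blue", "Purple", "Yellow", "Orange", "Green", "Brown"]
)


def make_dataset(n_labels, seed=0):
    """Generate a 2-feature, 3-class dataset and name each sample's color."""
    X, Y = make_multilabel_classification(
        n_samples=150,
        n_features=2,
        n_classes=3,
        n_labels=n_labels,
        length=50,
        allow_unlabeled=False,
        random_state=seed,
    )
    color_idx = (Y * np.array([1, 2, 4])).sum(axis=1)
    return X, Y, COLOR_NAMES[color_idx]


X1, Y1, colors1 = make_dataset(n_labels=1)
X3, Y3, colors3 = make_dataset(n_labels=3)

# A larger n_labels should give more labels per sample on average.
print("mean labels per sample, n_labels=1:", Y1.sum(axis=1).mean())
print("mean labels per sample, n_labels=3:", Y3.sum(axis=1).mean())

# How many samples fall into each color in the n_labels=3 dataset.
names, freq = np.unique(colors3, return_counts=True)
print(dict(zip(names.tolist(), freq.tolist())))
```

To draw the actual scatter plot, `X[:, 0]` and `X[:, 1]` can be passed to a plotting library, with one color per sample taken from this mapping.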