multilabel of sklearn

阿新 • Published: 2020-12-30

multilabel

https://scikit-learn.org/stable/modules/multiclass.html#multilabel-classification

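Multilabel: a single sample can carry several labels at once.

For example, a piece of text or a video may be about one of religion, politics, finance, or education, about several of them, or about all of them.

The labels may be correlated with each other or independent, and a different kind of model suits each case.

When the labels are independent, MultiOutputClassifier can be used.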

Multilabel classification (closely related to multioutput classification) is a classification task labeling each sample with m labels from n_classes possible classes, where m can be 0 to n_classes inclusive. This can be thought of as predicting properties of a sample that are not mutually exclusive. Formally, a binary output is assigned to each class, for every sample. Positive classes are indicated with 1 and negative classes with 0 or -1. It is thus comparable to running n_classes binary classification tasks, for example with MultiOutputClassifier. This approach treats each label independently whereas multilabel classifiers may treat the multiple classes simultaneously, accounting for correlated behavior among them.

For example, prediction of the topics relevant to a text document or video. The document or video may be about one of ‘religion’, ‘politics’, ‘finance’ or ‘education’, several of the topic classes or all of the topic classes.

An intuitive explanation

https://towardsdatascience.com/building-a-multi-label-text-classifier-using-bert-and-tensorflow-f188e0ecdc5d

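Take the weather. In multiclass classification the weather is exactly one of sunny, cloudy, rainy, snowy, and so on; a forecast of this kind reports only a coarse description.

A more detailed description asks separate questions, such as whether the sun is visible, whether there are clouds, and whether the moon is visible. One of these may hold, or several at once. That is multilabel classification.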

In a multi-label classification problem, the training set is composed of instances that can each be assigned multiple categories, represented as a set of target labels, and the task is to predict the label set of the test data. For example:

A text might be about any of religion, politics, finance or education at the same time or none of these.

A movie can be categorized into the action, comedy and romance genres based on its summary content. A movie may also fall into multiple genres, like romcoms [romance & comedy].

How is it different from a multi-class classification problem?

In multi-class classification each sample is assigned to one and only one label: a fruit can be either an apple or a pear but not both at the same time. Consider an example with three classes, C = [“Sun”, “Moon”, “Cloud”]. In the multi-class case each sample can belong to only one of the classes in C. In the multi-label case each sample can belong to one or more classes.
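
To make the contrast concrete, here is a minimal sketch of how the two kinds of target typically look as arrays, and how scikit-learn's type_of_target helper tells them apart. The four sample observations are invented for illustration; the all-zero row relies on the scikit-learn definition quoted earlier, which allows a sample to have no labels.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

classes = ["Sun", "Moon", "Cloud"]

# Multi-class: one class index per sample (0 = Sun, 1 = Moon, 2 = Cloud).
y_multiclass = np.array([0, 1, 2, 0])

# Multi-label: one row per sample, one 0/1 column per class.
y_multilabel = np.array([
    [1, 0, 1],  # Sun and Cloud
    [0, 1, 0],  # Moon only
    [0, 0, 0],  # none of the three
    [1, 1, 1],  # all three
])

kind_multiclass = type_of_target(y_multiclass)
kind_multilabel = type_of_target(y_multilabel)
print(kind_multiclass)  # multiclass
print(kind_multilabel)  # multilabel-indicator
```

The shape of y is what matters: a 1-D vector of class indices is read as multi-class, while a 2-D binary indicator matrix is read as multi-label.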

Image labeling

https://towardsdatascience.com/approaches-to-multi-label-classification-1cf981ff2108

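For example, a photo may contain several objects and therefore needs several labels. When choosing a model, make sure it supports multilabel targets (which are also a form of multioutput).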

A multi-label classification for an image deals with a situation where an image can belong to more than one class. For example the below image has a train, woman, girl and Jacuzzi all in the same photo.

Photo Credit: Open Image Dataset V4 (License)

There are multiple ways to solve this problem. The first approach is that of binary classification. In this approach we can use ‘k’ independent binary classifiers corresponding to k classes in our data.

In this approach the final layer consists of k independent sigmoid (logistic) activations. The class prediction is based on a threshold applied to each sigmoid output.

This approach is easy to understand and train. However, it has a deficiency: it ignores the semantic relations between labels. For example, ‘Human Face’ and ‘Woman’ are related, and so are ‘Human Face’ and ‘Girl’.
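
The thresholding step can be sketched in plain NumPy. This is only an illustration of the idea: the logits below are made-up numbers standing in for the raw outputs of a final layer, and the 0.5 threshold is an assumption, not a value taken from the article.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

class_names = ["train", "woman", "girl", "jacuzzi"]

# Hypothetical raw scores for 2 images and k = 4 classes.
logits = np.array([[ 2.0, -1.0, 0.5, -3.0],
                   [-0.5,  1.5, 1.0, -2.0]])

# Each class gets its own independent probability; rows need not sum to 1.
probs = sigmoid(logits)

# A class is predicted whenever its probability reaches the threshold.
threshold = 0.5
predictions = (probs >= threshold).astype(int)

predicted_labels = [
    [name for name, flag in zip(class_names, row) if flag]
    for row in predictions
]
print(predictions)
print(predicted_labels)
```

Because every column is thresholded on its own, any number of classes, including none, can be switched on for a given image.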

Target format

The multilabel target is a matrix in which each column represents one label.

If the matrix is very sparse, it can be stored as a sparse matrix.

A valid representation of multilabel y is an either dense or sparse binary matrix of shape (n_samples, n_classes). Each column represents a class. The 1’s in each row denote the positive classes a sample has been labeled with. An example of a dense matrix y for 3 samples:
>>> import numpy as np
>>> y = np.array([[1, 0, 0, 1], [0, 0, 1, 1], [0, 0, 0, 0]])
>>> print(y)
[[1 0 0 1]
 [0 0 1 1]
 [0 0 0 0]]
Dense binary matrices can also be created using MultiLabelBinarizer. For more information, refer to Transforming the prediction target (y).
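
As a small sketch of that route, the same matrix can be built from per-sample label lists with MultiLabelBinarizer. The label lists are chosen here to reproduce the dense example; classes is passed explicitly because class 1 never occurs in the data and would otherwise get no column.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One list of positive class ids per sample; the third sample has no labels.
label_sets = [[0, 3], [2, 3], []]

# Fix the column order so that unused classes still get a column.
mlb = MultiLabelBinarizer(classes=[0, 1, 2, 3])
y_from_sets = mlb.fit_transform(label_sets)
print(y_from_sets)

# The transformation can be reversed to recover the label sets as tuples.
recovered = mlb.inverse_transform(y_from_sets)
print(recovered)
```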

An example of the same y in sparse matrix form:
>>> from scipy import sparse
>>> y_sparse = sparse.csr_matrix(y)
>>> print(y_sparse)
  (0, 0)      1
  (0, 3)      1
  (1, 2)      1
  (1, 3)      1

MultiOutputClassifier

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html#sklearn.multioutput.MultiOutputClassifier

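For estimators that do not natively support multilabel / multi-target classification, this wrapper turns them into a classifier that does.

It works by building a separate classifier for each column of the multilabel target, that is, for each target. There are therefore as many independent models as there are labels.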

Multi target classification

This strategy consists of fitting one classifier per target. This is a simple strategy for extending classifiers that do not natively support multi-target classification.

Multilabel classification support can be added to any classifier with MultiOutputClassifier. This strategy consists of fitting one classifier per target. This allows multiple target variable classifications. The purpose of this class is to extend estimators to be able to estimate a series of target functions (f1,f2,f3…,fn) that are trained on a single X predictor matrix to predict a series of responses (y1,y2,y3…,yn).

>>> from sklearn.datasets import make_classification
>>> from sklearn.multioutput import MultiOutputClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.utils import shuffle
>>> import numpy as np
>>> X, y1 = make_classification(n_samples=10, n_features=100, n_informative=30, n_classes=3, random_state=1)
>>> y2 = shuffle(y1, random_state=1)
>>> y3 = shuffle(y1, random_state=2)
>>> Y = np.vstack((y1, y2, y3)).T
>>> n_samples, n_features = X.shape # 10,100
>>> n_outputs = Y.shape[1] # 3
>>> n_classes = 3
>>> forest = RandomForestClassifier(random_state=1)
>>> multi_target_forest = MultiOutputClassifier(forest, n_jobs=-1)
>>> multi_target_forest.fit(X, Y).predict(X)
array([[2, 2, 0],
       [1, 2, 1],
       [2, 1, 0],
       [0, 0, 2],
       [0, 2, 1],
       [0, 0, 2],
       [1, 1, 0],
       [1, 1, 1],
       [0, 0, 2],
       [2, 0, 0]])
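
The "one classifier per target" idea can be checked with a short sketch: fit the wrapper, then fit a fresh clone of the same base estimator on each column by hand and compare. The dataset and the choice of LogisticRegression (picked because it is deterministic) are assumptions made for this illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

X, Y = make_multilabel_classification(n_samples=100, n_classes=3, random_state=0)
base = LogisticRegression(max_iter=1000)

# The wrapper fits one copy of the base estimator per target column.
wrapped = MultiOutputClassifier(base).fit(X, Y)

# The same thing by hand: an independent clone for every column of Y.
manual_predictions = np.column_stack([
    clone(base).fit(X, Y[:, j]).predict(X) for j in range(Y.shape[1])
])

n_fitted = len(wrapped.estimators_)
same = np.array_equal(wrapped.predict(X), manual_predictions)
print(n_fitted, same)
```

The fitted per-target models are exposed as wrapped.estimators_, which is convenient for inspecting a single label's classifier.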

MultiOutputClassifier Example

https://www.datatechnotes.com/2020/03/multi-output-classification-with-multioutputclassifier.html

code

from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# Generate a synthetic dataset with 10 features and 2 binary labels per sample
x, y = make_multilabel_classification(n_samples=5000, n_features=10,
                                      n_classes=2, random_state=0)

# Show the first 10 samples and their label vectors
for i in range(10):
    print(x[i], " => ", y[i])

# Hold out 5% of the data (250 samples) for testing
xtrain, xtest, ytrain, ytest = train_test_split(x, y, train_size=0.95, random_state=0)
print(len(xtest))

# Wrap a single-output SVC so that one copy is fitted per label column
svc = SVC(gamma="scale")
model = MultiOutputClassifier(estimator=svc)
print(model)

model.fit(xtrain, ytrain)
# score() counts a sample as correct only if every one of its labels matches
print(model.score(xtrain, ytrain))

# Evaluate each label column separately on the held-out data
yhat = model.predict(xtest)
auc_y1 = roc_auc_score(ytest[:, 0], yhat[:, 0])
auc_y2 = roc_auc_score(ytest[:, 1], yhat[:, 1])
print("ROC AUC y1: %.4f, y2: %.4f" % (auc_y1, auc_y2))

cm_y1 = confusion_matrix(ytest[:, 0], yhat[:, 0])
cm_y2 = confusion_matrix(ytest[:, 1], yhat[:, 1])
print(cm_y1)
print(cm_y2)

cr_y1 = classification_report(ytest[:, 0], yhat[:, 0])
cr_y2 = classification_report(ytest[:, 1], yhat[:, 1])
print(cr_y1)
print(cr_y2)

output

[ 5. 11.  8.  7.  7.  9.  0.  8.  5.  5.]  =>  [1 1]
[1. 2. 6. 1. 6. 8. 1. 9. 3. 8.]  =>  [0 1]
[8. 3. 7. 6. 4. 7. 0. 4. 7. 6.]  =>  [1 1]
[3. 4. 9. 4. 3. 7. 0. 2. 7. 8.]  =>  [1 1]
[ 8.  7. 10.  8.  7.  4.  1.  4. 10.  9.]  =>  [1 1]
[ 6.  5. 10.  5.  5.  3.  7.  6.  1.  9.]  =>  [0 0]
[ 7.  4. 13.  6.  5.  4.  1.  4.  5. 10.]  =>  [1 1]
[ 5.  2.  3. 14. 10.  4.  2.  0.  6. 12.]  =>  [1 0]
[10.  3.  1.  5.  7.  9.  3.  3.  4.  3.]  =>  [0 0]
[ 5.  4.  9.  5.  8. 10.  0.  8.  3.  9.]  =>  [0 1]
250
MultiOutputClassifier(estimator=SVC())
0.8688421052631579
ROC AUC y1: 0.9206, y2: 0.9202
[[ 80   8]
 [ 11 151]]
[[ 77   9]
 [  9 155]]
              precision    recall  f1-score   support

           0       0.88      0.91      0.89        88
           1       0.95      0.93      0.94       162

    accuracy                           0.92       250
   macro avg       0.91      0.92      0.92       250
weighted avg       0.92      0.92      0.92       250

              precision    recall  f1-score   support

           0       0.90      0.90      0.90        86
           1       0.95      0.95      0.95       164

    accuracy                           0.93       250
   macro avg       0.92      0.92      0.92       250
weighted avg       0.93      0.93      0.93       250
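
The reports above score each label column on its own. As a complementary sketch, scikit-learn's metrics can also score the whole label matrix at once. The tiny y_true and y_pred arrays below are invented for illustration and are not taken from the run above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Four samples, two labels each; exactly one label of one sample is wrong.
y_true = np.array([[1, 1], [0, 1], [1, 0], [0, 0]])
y_pred = np.array([[1, 1], [0, 0], [1, 0], [0, 0]])

# Subset accuracy: a sample counts only if all of its labels are correct.
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss: fraction of individual label slots that are wrong.
ham = hamming_loss(y_true, y_pred)

# Micro-averaged F1: pools true/false positives over all labels.
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(subset_acc)  # 0.75
print(ham)         # 0.125
print(micro_f1)
```

Subset accuracy is the strictest of the three, which is why it can sit noticeably below the per-label accuracies for the same predictions.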

ClassifierChain

https://scikit-learn.org/stable/modules/generated/sklearn.multioutput.ClassifierChain.html#sklearn.multioutput.ClassifierChain

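MultiOutputClassifier rests on the assumption that the labels are independent.

In practice, however, labels are often correlated, unless a tool such as PCA has been used to decompose them into a smaller set of uncorrelated columns.

ClassifierChain targets exactly this situation, where the labels are correlated.

There are as many classifiers as labels, but the classifiers are ordered: the first classifier predicts the first label, and that prediction is fed, together with the sample features, into the second model, and so on down the chain.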

A multi-label model that arranges binary classifiers into a chain.

Each model makes a prediction in the order specified by the chain using all of the available features provided to the model plus the predictions of models that are earlier in the chain.

Classifier chains (see ClassifierChain) are a way of combining a number of binary classifiers into a single multi-label model that is capable of exploiting correlations among targets.

For a multi-label classification problem with N classes, N binary classifiers are assigned an integer between 0 and N-1. These integers define the order of models in the chain. Each classifier is then fit on the available training data plus the true labels of the classes whose models were assigned a lower number.

When predicting, the true labels will not be available. Instead the predictions of each model are passed on to the subsequent models in the chain to be used as features.

Clearly the order of the chain is important. The first model in the chain has no information about the other labels while the last model in the chain has features indicating the presence of all of the other labels. In general one does not know the optimal ordering of the models in the chain so typically many randomly ordered chains are fit and their predictions are averaged together.

>>> from sklearn.datasets import make_multilabel_classification
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.multioutput import ClassifierChain
>>> X, Y = make_multilabel_classification(
...    n_samples=12, n_classes=3, random_state=0
... )
>>> X_train, X_test, Y_train, Y_test = train_test_split(
...    X, Y, random_state=0
... )
>>> base_lr = LogisticRegression(solver='lbfgs', random_state=0)
>>> chain = ClassifierChain(base_lr, order='random', random_state=0)
>>> chain.fit(X_train, Y_train).predict(X_test)
array([[1., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.]])
>>> chain.predict_proba(X_test)
array([[0.8387..., 0.9431..., 0.4576...],
       [0.8878..., 0.3684..., 0.2640...],
       [0.0321..., 0.9935..., 0.0625...]])
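
The description above notes that many randomly ordered chains are typically fit and their predictions averaged. A minimal sketch of that ensemble follows; the number of chains (5), the dataset size, and the 0.5 decision threshold are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Each chain draws its own random label order from its seed.
chains = [
    ClassifierChain(LogisticRegression(max_iter=1000),
                    order="random", random_state=seed)
    for seed in range(5)
]
for chain in chains:
    chain.fit(X_train, Y_train)

# Average the per-label probabilities across chains, then threshold.
mean_proba = np.mean([chain.predict_proba(X_test) for chain in chains], axis=0)
Y_ensemble = (mean_proba >= 0.5).astype(int)

print([list(chain.order_) for chain in chains])
print(Y_ensemble[:5])
```

Averaging reduces the dependence on any single ordering, at the cost of fitting several chains instead of one.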

ClassifierChain: an intuitive explanation

https://towardsdatascience.com/multi-label-text-classification-5c505fdedca8

The classifier chain model learns classifiers as in the binary relevance method. However, all classifiers are linked in a chain.

Fig. 4 Classifier Chain rule

Firstly, all the features (X1, X2, …, Xm) are used to predict y1.

Then, all the features (X1, X2, …, Xm, y1) are used to predict y2.

Finally, (X1, X2, …, Xm, y1, y2) are used to predict y3.

The order in which labels are predicted has a large impact on the results.
┌────────────────────────────┬──────────────────────────┐
│         Advantages         │      Disadvantages       │
├────────────────────────────┼──────────────────────────┤
│ - label correlation taken  │ - accuracy heavily       │
│   into consideration       │   depends on the order   │
│ - acceptable computational │ - for n labels there     │
│   complexity               │   are n! possible orders │
└────────────────────────────┴──────────────────────────┘
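
The chain rule described above can be written out by hand in a few lines. This is a sketch of the idea rather than scikit-learn's ClassifierChain implementation; the dataset, the fixed label order 0, 1, 2, and the LogisticRegression base model are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression

X, Y = make_multilabel_classification(n_samples=200, n_classes=3, random_state=0)
X_train, X_test = X[:150], X[150:]
Y_train = Y[:150]

# Training: model j sees the features plus the TRUE labels y1..y(j-1).
models = []
for j in range(Y_train.shape[1]):
    X_aug = np.hstack([X_train, Y_train[:, :j]])
    models.append(LogisticRegression(max_iter=1000).fit(X_aug, Y_train[:, j]))

# Prediction: true labels are unavailable, so each model receives the
# PREDICTIONS of the models that come before it in the chain.
Y_pred = np.zeros((X_test.shape[0], len(models)), dtype=int)
for j, model in enumerate(models):
    X_aug = np.hstack([X_test, Y_pred[:, :j]])
    Y_pred[:, j] = model.predict(X_aug)

print(Y_pred[:5])
```

Changing the loop order changes which labels each model gets to see, which is exactly why the results depend on the chain order.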

Dedicated libraries for multilabel learning

https://towardsdatascience.com/multi-label-text-classification-5c505fdedca8

http://scikit.ml/

Multi-Label Classification in Python

Scikit-multilearn is a BSD-licensed library for multi-label classification that is built on top of the well-known scikit-learn ecosystem.
pip install scikit-multilearn
