
Column Transformer with Heterogeneous Data Sources -- of sklearn


https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer.html#sphx-glr-auto-examples-compose-plot-column-transformer-py

Datasets often contain heterogeneous components that require different feature extraction and processing pipelines. For example:

1. The dataset contains heterogeneous data, such as text and images.

2. The dataset is stored in a pandas DataFrame, and different columns require different processing pipelines.

ColumnTransformer handles exactly this kind of per-column preprocessing.

Datasets can often contain components that require different feature extraction and processing pipelines. This scenario might occur when:

  1. your dataset consists of heterogeneous data types (e.g. raster images and text captions),

  2. your dataset is stored in a pandas.DataFrame and different columns require different processing pipelines.

This example demonstrates how to use ColumnTransformer on a dataset containing different types of features. The choice of features is not particularly helpful, but serves to illustrate the technique.
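Before walking through the example, here is a minimal standalone sketch of the idea (the DataFrame columns 'city' and 'title' and their contents are made up for illustration): each column is routed to its own transformer, and the outputs are concatenated side by side.

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'city': ['London', 'Paris', 'London'],
    'title': ['His Last Bow', 'A Moveable Feast', 'The Lost World'],
})

ct = ColumnTransformer([
    # one-hot encode the categorical column; a list selects a 2-D slice
    ('categories', OneHotEncoder(), ['city']),
    # bag-of-words on the text column; a plain string selects a 1-D
    # column, which is what CountVectorizer expects
    ('title_bow', CountVectorizer(), 'title'),
])

X = ct.fit_transform(df)
print(X.shape)  # (3, number of one-hot columns + vocabulary size)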

20 newsgroups dataset

The 20 newsgroups dataset contains posts on 20 topics; we fetch its train and test subsets.

We will use the 20 newsgroups dataset, which comprises posts from newsgroups on 20 topics. This dataset is split into train and test subsets based on messages posted before and after a specific date. We will only use posts from 2 categories to speed up running time.

from sklearn.datasets import fetch_20newsgroups

categories = ['sci.med', 'sci.space']
X_train, y_train = fetch_20newsgroups(random_state=1,
                                      subset='train',
                                      categories=categories,
                                      remove=('footers', 'quotes'),
                                      return_X_y=True)
X_test, y_test = fetch_20newsgroups(random_state=1,
                                    subset='test',
                                    categories=categories,
                                    remove=('footers', 'quotes'),
                                    return_X_y=True)

Each feature comprises meta information about that post, such as the subject, and the body of the news post.

print(X_train[0])

Out:

From: [email protected] (fred j mccall 575-3539)
Subject: Re: Metric vs English
Article-I.D.: mksol.1993Apr6.131900.8407
Organization: Texas Instruments Inc
Lines: 31




American, perhaps, but nothing military about it.  I learned (mostly)
slugs when we talked English units in high school physics and while
the teacher was an ex-Navy fighter jock the book certainly wasn't
produced by the military.

[Poundals were just too flinking small and made the math come out
funny; sort of the same reason proponents of SI give for using that.]

--
"Insisting on perfect safety is for people who don't have the balls to live
 in the real world."   -- Mary Shafer, NASA Ames Dryden

Creating transformers

For unstructured or semi-structured data, you have to define the parsing rules yourself, i.e. write the data transformation by hand.

In this example, we need to extract the subject and the body from each post.

FunctionTransformer turns such a function into a scikit-learn transformer.

First, we would like a transformer that extracts the subject and body of each post. Since this is a stateless transformation (does not require state information from training data), we can define a function that performs the data transformation then use FunctionTransformer to create a scikit-learn transformer.

The subject/body extraction transformer subject_body_extractor

import numpy as np
from sklearn.preprocessing import FunctionTransformer


def subject_body_extractor(posts):
    # construct object dtype array with two columns
    # first column = 'subject' and second column = 'body'
    features = np.empty(shape=(len(posts), 2), dtype=object)
    for i, text in enumerate(posts):
        # temporary variable `_` stores '\n\n'
        headers, _, body = text.partition('\n\n')
        # store body text in second column
        features[i, 1] = body

        prefix = 'Subject:'
        sub = ''
        # save text after 'Subject:' in first column
        for line in headers.split('\n'):
            if line.startswith(prefix):
                sub = line[len(prefix):]
                break
        features[i, 0] = sub

    return features


subject_body_transformer = FunctionTransformer(subject_body_extractor)
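
As a quick sanity check (illustrative, using the training data loaded above), the transformer maps each post to a two-column array holding the subject and the body:

extracted = subject_body_transformer.fit_transform(X_train)
print(extracted[0, 0])  # e.g. ' Re: Metric vs English' for the post shown above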

A transformer extracting the text length and the number of sentences: text_stats_transformer

We will also create a transformer that extracts the length of the text and the number of sentences.

def text_stats(posts):
    # simple per-post statistics: character count, plus the number of
    # '.' characters as a rough proxy for the number of sentences
    return [{'length': len(text),
             'num_sentences': text.count('.')}
            for text in posts]


text_stats_transformer = FunctionTransformer(text_stats)
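
For instance (illustrative input, using the text_stats function defined above), each post maps to a dict of statistics that DictVectorizer can later turn into a feature matrix:

print(text_stats(['Hello. World.']))
# [{'length': 13, 'num_sentences': 2}]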

FunctionTransformer

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer

Constructs a transformer from an arbitrary callable.

A FunctionTransformer forwards its X argument to a user-defined function and returns the result of that function.

Constructs a transformer from an arbitrary callable.

A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.

Note: If a lambda is used as the function, then the resulting transformer will not be pickleable.
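
A minimal sketch of this caveat: a transformer built from a named function pickles fine, while a lambda-based one does not.

import pickle

import numpy as np
from sklearn.preprocessing import FunctionTransformer

pickle.dumps(FunctionTransformer(np.log1p))           # works
# pickle.dumps(FunctionTransformer(lambda x: x + 1))  # raises PicklingError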

An inverse function can also be supplied to support inverse_transform(X), which transforms X using the inverse function.

func : callable, default=None

The callable to use for the transformation. This will be passed the same arguments as transform, with args and kwargs forwarded. If func is None, then func will be the identity function.

inverse_func : callable, default=None

The callable to use for the inverse transformation. This will be passed the same arguments as inverse_transform, with args and kwargs forwarded. If inverse_func is None, then inverse_func will be the identity function.

>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[0.       , 0.6931...],
       [1.0986..., 1.3862...]])
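
To illustrate inverse_func, here is a minimal continuation of the example above using np.expm1, the inverse of np.log1p:

>>> transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
>>> Xt = transformer.transform(X)
>>> transformer.inverse_transform(Xt)
array([[0., 1.],
       [2., 3.]])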

Classification pipeline

The first step of the pipeline, subject_body_transformer, extracts the subject and body from the semi-structured posts.

ColumnTransformer then assembles the features produced by several sub-pipelines:

1. a bag-of-words (tf-idf) vector of the subject;

2. a bag-of-words (tf-idf) vector of the body, reduced in dimensionality by truncated SVD;

3. simple statistics computed on the body.

The three groups of features are then weighted, combined, and fed to the classifier.

The pipeline below extracts the subject and body from each post using subject_body_transformer, producing a (n_samples, 2) array.

This array is then used to compute standard bag-of-words features for the subject and body as well as text length and number of sentences on the body, using ColumnTransformer.

We combine them, with weights, then train a classifier on the combined set of features.

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # Extract subject & body
    ('subjectbody', subject_body_transformer),
    # Use ColumnTransformer to combine the subject and body features
    ('union', ColumnTransformer(
        [
            # bag-of-words for subject (col 0)
            ('subject', TfidfVectorizer(min_df=50), 0),
            # bag-of-words with decomposition for body (col 1)
            ('body_bow', Pipeline([
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ]), 1),
            # Pipeline for pulling text stats from post's body
            ('body_stats', Pipeline([
                ('stats', text_stats_transformer),  # returns a list of dicts
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ]), 1),
        ],
        # weight the components in the ColumnTransformer
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        }
    )),
    # Use a SVC classifier on the combined features
    ('svc', LinearSVC(dual=False)),
], verbose=True)

Finally, we train the model and evaluate its performance on the test set.

Finally, we fit our pipeline on the training data and use it to predict topics for X_test. Performance metrics of our pipeline are then printed.

from sklearn.metrics import classification_report

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print('Classification report:\n\n{}'.format(
    classification_report(y_test, y_pred))
)

The classification performance report:

Out:

[Pipeline] ....... (step 1 of 3) Processing subjectbody, total=   0.0s
[Pipeline] ............. (step 2 of 3) Processing union, total=   0.4s
[Pipeline] ............... (step 3 of 3) Processing svc, total=   0.0s
Classification report:

              precision    recall  f1-score   support

           0       0.84      0.87      0.86       396
           1       0.87      0.83      0.85       394

    accuracy                           0.85       790
   macro avg       0.85      0.85      0.85       790
weighted avg       0.85      0.85      0.85       790

Total running time of the script: ( 0 minutes 2.529 seconds)

TruncatedSVD

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD

Linear dimensionality reduction via truncated singular value decomposition.

Unlike PCA, this algorithm does not require centering the data, so it can work with sparse matrices. It is commonly used for latent semantic analysis (LSA) in text feature extraction.

Dimensionality reduction using truncated SVD (aka LSA).

This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Contrary to PCA, this estimator does not center the data before computing the singular value decomposition. This means it can work with sparse matrices efficiently.

In particular, truncated SVD works on term count/tf-idf matrices as returned by the vectorizers in sklearn.feature_extraction.text. In that context, it is known as latent semantic analysis (LSA).

This estimator supports two algorithms: a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on X * X.T or X.T * X, whichever is more efficient.

>>> from sklearn.decomposition import TruncatedSVD
>>> from scipy.sparse import random as sparse_random
>>> X = sparse_random(100, 100, density=0.01, format='csr',
...                   random_state=42)
>>> svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
>>> svd.fit(X)
TruncatedSVD(n_components=5, n_iter=7, random_state=42)
>>> print(svd.explained_variance_ratio_)
[0.0646... 0.0633... 0.0639... 0.0535... 0.0406...]
>>> print(svd.explained_variance_ratio_.sum())
0.286...
>>> print(svd.singular_values_)
[1.553... 1.512...  1.510... 1.370... 1.199...]
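
In the LSA setting mentioned above, TruncatedSVD is applied directly to a sparse tf-idf matrix. A minimal sketch with a made-up corpus:

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat on the mat',
        'the dog sat on the log',
        'cats and dogs']
X_tfidf = TfidfVectorizer().fit_transform(docs)  # sparse matrix, shape (3, n_terms)
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X_tfidf)               # dense array, shape (3, 2)
print(X_lsa.shape)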