Sample pipeline for text feature extraction and evaluation of sklearn

阿新 • Published: 2021-01-21

Sample pipeline for text feature extraction and evaluation

https://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html#sphx-glr-auto-examples-model-selection-grid-search-text-feature-extraction-py

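Download the 20 newsgroups data.

Define the pipeline: count vectorization, term-frequency (TF-IDF) transformation, linear classification model.

Use grid search to find the best parameters.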
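The three pipeline steps listed above can be tried on a tiny in-memory corpus before running them on the real dataset. This is only a minimal sketch: the documents and labels are made up for illustration, and `random_state=0` is an assumption added to make the run repeatable. The step names match the pipeline defined in the Code section.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 0 = space-related, 1 = cooking-related.
toy_docs = [
    "the rocket reached orbit after launch",
    "astronauts repaired the space station",
    "simmer the sauce and add fresh basil",
    "bake the bread until the crust is golden",
]
toy_labels = [0, 0, 1, 1]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),             # raw text -> token count matrix
    ("tfidf", TfidfTransformer()),           # counts -> TF-IDF weighted features
    ("clf", SGDClassifier(random_state=0)),  # linear classifier trained with SGD
])

# fit() runs every step in order; predict() reuses the fitted vocabulary.
toy_pipeline.fit(toy_docs, toy_labels)
toy_prediction = toy_pipeline.predict(["the rocket left the launch pad"])

print([name for name, _ in toy_pipeline.steps])
print(toy_prediction)
```

Because the vectorizer, the transformer, and the classifier live in one `Pipeline` object, a grid search can tune parameters of all three at once.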

The dataset used in this example is the 20 newsgroups dataset, which is downloaded automatically, then cached and reused for the document classification example.

You can adjust the number of categories by giving their names to the dataset loader, or by setting them to None to get all 20 of them.

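A sample output of a run on a quad-core machine is listed in the Output section after the code. That run used an older and larger parameter grid than the code shown here, so its parameter names and values (for example vect__max_n) differ from the ones in the code.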

Code

# Author: Olivier Grisel <[email protected]>
#         Peter Prettenhofer <[email protected]>
#         Mathieu Blondel <[email protected]>
# License: BSD 3 clause
from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

print(__doc__)

# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')


# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
]
# Uncomment the following to do the analysis on all the categories
#categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
print()

# #############################################################################
# Define a pipeline combining a text feature extractor with a simple
# classifier
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    # 'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    'clf__max_iter': (20,),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    # 'clf__max_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(data.data, data.target)
    print("done in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
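The comment in the code warns that uncommenting more parameters increases processing time in a combinatorial way. A rough way to see this is to count the candidates with `ParameterGrid` from `sklearn.model_selection`. In the sketch below, `base_grid` copies the active entries of `parameters`, while `wider_grid` is a hypothetical variant with three of the commented-out entries switched on.

```python
from sklearn.model_selection import ParameterGrid

# Same active entries as the `parameters` dict in the code above.
base_grid = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__ngram_range": ((1, 1), (1, 2)),
    "clf__max_iter": (20,),
    "clf__alpha": (0.00001, 0.000001),
    "clf__penalty": ("l2", "elasticnet"),
}

# Hypothetical wider grid: three of the commented-out entries enabled.
wider_grid = dict(
    base_grid,
    **{
        "vect__max_features": (None, 5000, 10000, 50000),
        "tfidf__use_idf": (True, False),
        "tfidf__norm": ("l1", "l2"),
    },
)

# The number of candidates is the product of the option counts.
n_base = len(ParameterGrid(base_grid))    # 3 * 2 * 1 * 2 * 2
n_wider = len(ParameterGrid(wider_grid))  # n_base * 4 * 2 * 2
print(n_base, n_wider)  # prints: 24 384
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is the candidate count multiplied by the number of folds.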

Output

Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']
1427 documents
2 categories

Performing grid search...
pipeline: ['vect', 'tfidf', 'clf']
parameters:
{'clf__alpha': (1.0000000000000001e-05, 9.9999999999999995e-07),
 'clf__max_iter': (10, 50, 80),
 'clf__penalty': ('l2', 'elasticnet'),
 'tfidf__use_idf': (True, False),
 'vect__max_n': (1, 2),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__max_features': (None, 5000, 10000, 50000)}
done in 1737.030s

Best score: 0.940
Best parameters set:
    clf__alpha: 9.9999999999999995e-07
    clf__max_iter: 50
    clf__penalty: 'elasticnet'
    tfidf__use_idf: True
    vect__max_n: 2
    vect__max_df: 0.75
    vect__max_features: 50000
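
The keys in the best-parameter listing follow the pattern `<step name>__<parameter name>`, which is how a pipeline routes a value to one of its steps. The sketch below applies a few such keys by hand with `set_params`. The values are illustrative only and are not the result of any search; `vect__ngram_range` is used because that is the name in the code, whereas the `vect__max_n` key in the sample output appears to come from an older scikit-learn release.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier()),
])

# "vect__max_df" means: parameter max_df of the step named "vect".
# Illustrative values only, not tuned results.
demo_pipeline.set_params(
    vect__max_df=0.75,
    vect__ngram_range=(1, 2),
    clf__alpha=1e-6,
    clf__penalty="elasticnet",
)

# The values are now stored on the individual step objects.
vect_max_df = demo_pipeline.named_steps["vect"].max_df
clf_penalty = demo_pipeline.get_params()["clf__penalty"]
print(vect_max_df, clf_penalty)
```

`GridSearchCV` uses the same mechanism internally: for every candidate it clones the pipeline and sets the candidate's parameters before fitting.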
