
Clustering text documents using k-means in scikit-learn

Clustering text documents using k-means

https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html#sphx-glr-auto-examples-text-plot-document-clustering-py

This post downloads the 20 Newsgroups dataset, extracts document features with a term-frequency vectorizer and related tools, runs k-means clustering on those features, and finally evaluates the clustering quality.

This is an example showing how scikit-learn can be used to cluster documents by topic using a bag-of-words approach. The example uses a scipy.sparse matrix to store the features instead of standard numpy arrays.

Two feature extraction methods can be used in this example:

  • TfidfVectorizer uses an in-memory vocabulary (a Python dict) to map the most frequent words to feature indices and hence computes a sparse word-occurrence-frequency matrix. The word frequencies are then reweighted using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.

  • HashingVectorizer hashes word occurrences to a fixed-dimensional space, possibly with collisions. The word-count vectors are then normalized so that each has an l2-norm equal to one (i.e. projected onto the Euclidean unit ball), which seems to be important for k-means to work in high-dimensional space.

    HashingVectorizer does not provide IDF weighting, as it is a stateless model (its fit method does nothing). When IDF weighting is needed, it can be added by pipelining its output to a TfidfTransformer instance, as sketched below.
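
As a rough sketch of the two feature-extraction options (a hypothetical two-document toy corpus, with a small n_features chosen only for readability; the full script below wires the same classes to command-line switches):

from sklearn.feature_extraction.text import (
    HashingVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import make_pipeline

docs = ["the cat sat on the mat", "the dog chased the cat"]  # toy corpus

# Option 1: TfidfVectorizer builds an in-memory vocabulary and applies IDF.
tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(docs)

# Option 2: HashingVectorizer is stateless; IDF weighting is added by
# pipelining its raw counts into a TfidfTransformer.
hashing_idf = make_pipeline(
    HashingVectorizer(n_features=2 ** 10, stop_words='english',
                      alternate_sign=False, norm=None),
    TfidfTransformer())
X_hashed = hashing_idf.fit_transform(docs)

print(X_tfidf.shape, X_hashed.shape)  # e.g. (2, 5) (2, 1024)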

Two algorithms are demoed: ordinary k-means and its more scalable cousin minibatch k-means.

Additionally, latent semantic analysis can also be used to reduce dimensionality and discover latent patterns in the data.

It can be noted that k-means (and minibatch k-means) are very sensitive to feature scaling and that in this case the IDF weighting helps improve the quality of the clustering by quite a lot as measured against the “ground truth” provided by the class label assignments of the 20 newsgroups dataset.

This improvement is not visible in the Silhouette Coefficient, which is small for both, as this measure seems to suffer from the phenomenon called “Concentration of Measure” or “Curse of Dimensionality” for high-dimensional datasets such as text data. Other measures, such as the V-measure and the Adjusted Rand Index, are information-theoretic evaluation scores: since they are based only on cluster assignments rather than distances, they are not affected by the curse of dimensionality.

Note: as k-means is optimizing a non-convex objective function, it will likely end up in a local optimum. Several runs with independent random initializations might be necessary to converge to a good solution.
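
A minimal sketch of that mitigation (toy blobs data and hypothetical parameter values; the script below keeps n_init=1 for speed): increasing n_init makes scikit-learn run several independent initializations and keep the solution with the lowest inertia.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_toy, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

# n_init independent k-means++ initializations; the fit with the lowest
# inertia (within-cluster sum of squares) is retained.
km = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=0)
km.fit(X_toy)
print("best inertia over 10 inits: %.3f" % km.inertia_)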

Code

# Author: Peter Prettenhofer <[email protected]>
#         Lars Buitinck
# License: BSD 3 clause
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn import metrics

from sklearn.cluster import KMeans, MiniBatchKMeans

import logging
from optparse import OptionParser
import sys
from time import time

import numpy as np


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

# parse commandline arguments
op = OptionParser()
op.add_option("--lsa",
              dest="n_components", type="int",
              help="Preprocess documents with latent semantic analysis.")
op.add_option("--no-minibatch",
              action="store_false", dest="minibatch", default=True,
              help="Use ordinary k-means algorithm (in batch mode).")
op.add_option("--no-idf",
              action="store_false", dest="use_idf", default=True,
              help="Disable Inverse Document Frequency feature weighting.")
op.add_option("--use-hashing",
              action="store_true", default=False,
              help="Use a hashing feature vectorizer")
op.add_option("--n-features", type=int, default=10000,
              help="Maximum number of features (dimensions)"
                   " to extract from text.")
op.add_option("--verbose",
              action="store_true", dest="verbose", default=False,
              help="Print progress reports inside k-means algorithm.")

print(__doc__)
op.print_help()


def is_interactive():
    return not hasattr(sys.modules['__main__'], '__file__')


# work-around for Jupyter notebook and IPython console
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)


# #############################################################################
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]
# Uncomment the following to do the analysis on all the categories
# categories = None

print("Loading 20 newsgroups dataset for categories:")
print(categories)

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

print("%d documents" % len(dataset.data))
print("%d categories" % len(dataset.target_names))
print()

labels = dataset.target
true_k = np.unique(labels).shape[0]

print("Extracting features from the training dataset "
      "using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    if opts.use_idf:
        # Perform an IDF normalization on the output of HashingVectorizer
        hasher = HashingVectorizer(n_features=opts.n_features,
                                   stop_words='english', alternate_sign=False,
                                   norm=None)
        vectorizer = make_pipeline(hasher, TfidfTransformer())
    else:
        vectorizer = HashingVectorizer(n_features=opts.n_features,
                                       stop_words='english',
                                       alternate_sign=False, norm='l2')
else:
    vectorizer = TfidfVectorizer(max_df=0.5, max_features=opts.n_features,
                                 min_df=2, stop_words='english',
                                 use_idf=opts.use_idf)
X = vectorizer.fit_transform(dataset.data)

print("done in %fs" % (time() - t0))
print("n_samples: %d, n_features: %d" % X.shape)
print()

if opts.n_components:
    print("Performing dimensionality reduction using LSA")
    t0 = time()
    # Vectorizer results are normalized, which makes KMeans behave as
    # spherical k-means for better results. Since LSA/SVD results are
    # not normalized, we have to redo the normalization.
    svd = TruncatedSVD(opts.n_components)
    normalizer = Normalizer(copy=False)
    lsa = make_pipeline(svd, normalizer)

    X = lsa.fit_transform(X)

    print("done in %fs" % (time() - t0))

    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(
        int(explained_variance * 100)))

    print()


# #############################################################################
# Do the actual clustering

if opts.minibatch:
    km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                         init_size=1000, batch_size=1000, verbose=opts.verbose)
else:
    km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                verbose=opts.verbose)

print("Clustering sparse data with %s" % km)
t0 = time()
km.fit(X)
print("done in %0.3fs" % (time() - t0))
print()

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, km.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, km.labels_, sample_size=1000))

print()


if not opts.use_hashing:
    print("Top terms per cluster:")

    if opts.n_components:
        original_space_centroids = svd.inverse_transform(km.cluster_centers_)
        order_centroids = original_space_centroids.argsort()[:, ::-1]
    else:
        order_centroids = km.cluster_centers_.argsort()[:, ::-1]

    terms = vectorizer.get_feature_names()
    for i in range(true_k):
        print("Cluster %d:" % i, end='')
        for ind in order_centroids[i, :10]:
            print(' %s' % terms[ind], end='')
        print()

Output

3387 documents
4 categories

Extracting features from the training dataset using a sparse vectorizer
done in 0.820565s
n_samples: 3387, n_features: 10000

Clustering sparse data with MiniBatchKMeans(batch_size=1000, init_size=1000, n_clusters=4, n_init=1,
                verbose=False)
done in 0.065s

Homogeneity: 0.219
Completeness: 0.338
V-measure: 0.266
Adjusted Rand-Index: 0.113
Silhouette Coefficient: 0.005

Top terms per cluster:
Cluster 0: cc ibm au buffalo monash com vnet software nicho university
Cluster 1: space nasa henry access digex toronto gov pat alaska shuttle
Cluster 2: com god university article don know graphics people posting like
Cluster 3: sgi keith livesey morality jon solntze wpd caltech objective moral

Normalizer

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html

Normalizes each sample (feature vector) to unit norm, which is useful when comparing vectors by similarity.

Normalize samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.

This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion).

Scaling inputs to unit norms is a common operation for text classification or clustering. For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors, which is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.

The default norm is l2.

>>> from sklearn.preprocessing import Normalizer
>>> X = [[4, 1, 2, 2],
...      [1, 3, 9, 3],
...      [5, 7, 5, 1]]
>>> transformer = Normalizer().fit(X)  # fit does nothing.
>>> transformer
Normalizer()
>>> transformer.transform(X)
array([[0.8, 0.2, 0.4, 0.4],
       [0.1, 0.3, 0.9, 0.3],
       [0.5, 0.7, 0.5, 0.1]])
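
A quick, illustrative check of the cosine-similarity remark above, reusing the first two rows of the example (assumption: plain NumPy is available):

import numpy as np
from sklearn.preprocessing import Normalizer

a = np.array([[4.0, 1.0, 2.0, 2.0]])
b = np.array([[1.0, 3.0, 9.0, 3.0]])

a_n = Normalizer(norm='l2').fit_transform(a)
b_n = Normalizer(norm='l2').fit_transform(b)

# Dot product of the l2-normalized rows ...
dot = (a_n @ b_n.T).item()
# ... equals the cosine similarity of the raw rows.
cos = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(dot, cos))  # True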

https://www.jianshu.com/p/6cf5d60db634

Norms

Understanding the L1 and L2 norms

The L1 and L2 norms are the L1-norm and the L2-norm; naturally, besides L1 and L2 there are also L0, L3, and so on. The L1 and L2 norms see the most use in machine learning, for example as regularization terms in regression: Lasso regression (L1) and Ridge regression (L2).

The distinction between the two therefore comes up often, whether in discussion or in exams. But before defining and contrasting them, let us first talk about what a norm is.

What is a norm?

In linear algebra and related areas of mathematics, the definition of a norm is

a function that assigns a strictly positive length or size to each vector in a vector space, except for the zero vector. ——Wikipedia

Put simply, the norm of a vector maps that vector to a value in the range [0, ∞), where the value 0 is attained only by the zero vector. Given such a range, the analogy with real-world distance comes naturally, so in machine learning norms are routinely used to express distance relationships: under such-and-such a norm, how far apart are these two vectors?

That "such-and-such" is the kind of norm. It is usually called the p-norm, whose strict definition is:

$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$

When p = 1 this is called the 1-norm, i.e. the L1-norm mentioned above, and the L2-norm follows analogously.

Definitions of the L1 and L2 norms

From the formula above, the definitions of the L1-norm and the L2-norm follow directly.

Substituting p = 1 gives the definition of the L1-norm:

$\|x\|_1 = \sum_{i=1}^{n} |x_i|$

Substituting p = 2 then gives the L2-norm:

$\|x\|_2 = \left( \sum_{i=1}^{n} |x_i|^2 \right)^{1/2}$

Expanded, the L2-norm is the familiar Euclidean norm:

$\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$

As an aside, the L1-norm is also known as the taxicab-norm or Manhattan-norm, presumably because whoever coined the name used riding a taxi through Manhattan blocks as the analogy. In the figure that accompanied the original post (not reproduced here), the green line is the L2 distance between two points, while the other lines are taxicab paths, i.e. the L1 distance, which indeed look much like the routes we actually follow on a map.
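
A tiny illustration of the taxicab-vs-Euclidean idea (hypothetical points, using NumPy's norm helper):

import numpy as np

p1 = np.array([0.0, 0.0])
p2 = np.array([3.0, 4.0])
diff = p2 - p1

l1 = np.linalg.norm(diff, ord=1)  # taxicab / Manhattan distance: |3| + |4| = 7
l2 = np.linalg.norm(diff, ord=2)  # Euclidean distance: sqrt(3**2 + 4**2) = 5
print(l1, l2)  # 7.0 5.0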

The main applications of the L1 and L2 norms in machine learning fall roughly into the following two categories:

  • as loss functions;

  • as regularization terms, i.e. the so-called L1-regularization and L2-regularization (sketched below).
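
As a hedged sketch of the regularization use just mentioned (toy regression data, arbitrary alpha values), Lasso applies an L1 penalty and Ridge an L2 penalty:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Toy data: only 3 of the 10 features carry signal.
X_reg, y_reg = make_regression(n_samples=100, n_features=10,
                               n_informative=3, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X_reg, y_reg)  # L1 penalty: can zero out coefficients
ridge = Ridge(alpha=1.0).fit(X_reg, y_reg)  # L2 penalty: shrinks but rarely zeroes

print("Lasso zero coefficients:", int((lasso.coef_ == 0).sum()))
print("Ridge zero coefficients:", int((ridge.coef_ == 0).sum()))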