
Computing the Cosine Similarity of Two Documents (TF-IDF)
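The script below turns each document into a TF-IDF weight vector and then measures the angle between two such vectors. For reference, the cosine similarity of vectors $A$ and $B$ is the standard

$$\cos(A, B) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}$$

which is exactly what the coutcos function at the end computes by hand.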

# -*- coding:utf-8 -*-
"""
@author: Linlifang
"""
import os
import jieba
import re
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

'''
First, read the documents in a folder and segment them with jieba, saving the
segmentation results to files. Next, use sklearn to compute the TF-IDF values
of each document and store them, one file per document. Finally, pick any two
of those txt files and compute their cosine similarity.
'''


def getFileList(path):
    # List the files in path, skipping hidden files
    filelist = []
    for f in os.listdir(path):
        if not f.startswith('.'):
            filelist.append(f)
    return filelist, path


def segment(filename, path, segPath):
    # Segment one document with jieba and save the tokens to segPath
    with open(path + "/" + filename, encoding='utf-8') as f:
        text = f.read()
    if not os.path.exists(segPath):
        os.mkdir(segPath)
    # Precise-mode segmentation, then clean up whitespace and punctuation
    seg_list = jieba.cut(text, cut_all=False)
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())
        # re.ASCII keeps the Python 2 behaviour of \w (ASCII letters/digits),
        # so tokens containing ASCII word characters are dropped
        r = re.search(r'\w+', seg, re.ASCII)
        if seg not in ('', ' = ', '[', ']', '(', ')') and not r:
            result.append(seg)
    # Remove stop words and keep only Chinese tokens (U+4E00..U+9FA5)
    finalresult = []
    stopword = open('stopworda.txt', encoding='utf-8').read()
    for word in result:
        if word in stopword:
            continue
        if u'\u4e00' <= word <= u'\u9fa5':
            finalresult.append(word)
    # Save the tokens to disk, separated by spaces
    with open(segPath + "/" + filename + "-seg.txt", "w", encoding='utf-8') as f:
        f.write(' '.join(finalresult))
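The segment function relies on jieba.cut, which returns a generator of tokens; cut_all=False selects precise mode. A minimal sketch of what it produces (the sample sentence is jieba's own README example):

import jieba

# Precise mode: the sentence is split into the fewest, most natural words
tokens = list(jieba.cut("我来到北京清华大学", cut_all=False))
print(tokens)  # ['我', '来到', '北京', '清华大学']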
# Read the segmented documents and compute TF-IDF for each one
def Tfidf(filelist, sFilePath, path):
    corpus = []
    for ff in filelist:
        with open(path + "/" + ff + "-seg.txt", encoding='utf-8') as f:
            corpus.append(f.read())
    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    word = vectorizer.get_feature_names()  # keywords of the whole corpus
    weight = tfidf.toarray()               # one row of TF-IDF weights per document
    if not os.path.exists(sFilePath):
        os.mkdir(sFilePath)
    for i in range(len(weight)):
        outname = sFilePath + "/" + str(i).zfill(2) + ".txt"
        print('writing the tf-idf of document', i, 'into', outname)
        with open(outname, 'w', encoding='utf-8') as f:
            for j in range(len(word)):
                f.write(word[j] + " " + str(weight[i][j]) + "\n")
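Tfidf chains CountVectorizer (raw term counts) and TfidfTransformer (TF-IDF reweighting); sklearn also ships TfidfVectorizer, which fuses the two steps. A minimal sketch with a made-up, pre-segmented two-document corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

# Equivalent to CountVectorizer + TfidfTransformer in one step
corpus = ["小貓 喜歡 吃魚", "小狗 喜歡 骨頭"]  # space-separated tokens, as written by segment()
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())  # renamed get_feature_names_out() in sklearn >= 1.2
print(tfidf.toarray())                 # one row of weights per document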
def coutcos(file1, file2):
    # Read the weight column of two TF-IDF files and compute their cosine similarity
    cipin1 = open(file1, encoding='utf-8').readlines()
    cipin2 = open(file2, encoding='utf-8').readlines()
    list1 = [line.split(' ')[1] for line in cipin1]
    list2 = [line.split(' ')[1] for line in cipin2]
    dot_product = 0.0
    normA = 0.0
    normB = 0.0
    for a, b in zip(list1, list2):
        a = float(a)
        b = float(b)
        dot_product += a * b
        normA += a ** 2
        normB += b ** 2
    if normA == 0.0 or normB == 0.0:
        return None
    return dot_product / ((normA * normB) ** 0.5)


if __name__ == "__main__":
    # Folder for the TF-IDF results
    sFilePath = "C:/Users/llfang1/PycharmProjects/untitled2/corpus/tfidffile"
    # Folder for the segmented documents
    segPath = 'C:/Users/llfang1/PycharmProjects/untitled2/corpus/segfile'
    (allfile, path) = getFileList('C:/Users/llfang1/PycharmProjects/untitled2/corpus/allkeyword')
    for ff in allfile:
        print("Using jieba on " + ff)
        segment(ff, path, segPath)
    Tfidf(allfile, sFilePath, segPath)
    file1 = sFilePath + "/" + "04.txt"
    file2 = sFilePath + "/" + "05.txt"
    similar = coutcos(file1, file2)
    print(similar)
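As a sanity check, coutcos should agree with sklearn's built-in cosine_similarity. A minimal sketch with two hypothetical weight vectors standing in for rows of the TF-IDF matrix:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF rows; in the script these come from 04.txt and 05.txt
a = np.array([[0.0, 0.5, 0.5, 0.7]])
b = np.array([[0.4, 0.0, 0.5, 0.7]])
print(cosine_similarity(a, b)[0][0])  # matches coutcos applied to the same weights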
Note: this program was adapted from a fellow developer's code, with some changes and additions.