A Method for Extracting Tags from a Large Number of Article Titles

阿新 • Published: 2018-12-22

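Data

The data is a large collection of academic paper titles.

Approach

A tag captures a feature shared by several articles, so we cannot look at any single title in isolation; we have to consider the whole collection.
Titles contain many stopwords and punctuation marks, which should be removed.

The n-gram model

With these pieces in place, we can make a rough search for article tags.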
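The three ideas above (look across all titles, drop stopwords and punctuation, count n-grams) can be sketched with the standard library alone. This is only an illustration: the `STOPWORDS` set, the sample titles, and the function names are made up for the example, and unlike the nltk code in this post it builds bigrams per title, so no word pair spans two different titles.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); nltk ships a much fuller one.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "to", "with"}


def title_bigrams(title):
    """Lowercase a title, strip punctuation and stopwords, return adjacent word pairs."""
    cleaned = re.sub(r"[:,.?]+", " ", title.lower())
    words = [w for w in cleaned.split() if w not in STOPWORDS]
    return list(zip(words, words[1:]))


def count_bigrams(titles):
    """Count bigrams across all titles so that shared phrases surface."""
    counts = Counter()
    for title in titles:
        counts.update(title_bigrams(title))
    return counts


# Hypothetical sample titles, not taken from the real data set.
titles = [
    "The Design of a Distributed File System",
    "A Fast File System for UNIX",
    "Fault Tolerance in Distributed Systems",
]
print(count_bigrams(titles).most_common(1))
# [(('file', 'system'), 2)]
```

A pair that shows up in several titles is a tag candidate; a pair that occurs only once is most likely specific to a single paper.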

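Tools

Python with nltk, its natural language processing library, makes this very convenient.
The nltk documentation is also available in Chinese.

Method 1

Follow the approach above, using the nltk library.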

import re
from pymongo import MongoClient
from nltk import FreqDist, bigrams, word_tokenize  # English tokenizer and 2-gram helper
from nltk.corpus import stopwords  # English stopword list

# One-time setup: the tokenizer and stopword data must be fetched with nltk.download().
client = MongoClient("192.168.33.131", 27017)  # connect to the database that holds the articles
cursor = client.ccf.article.find()
text = ""
for a in cursor:
    text += " " + a['title']  # join all titles into one text
stop_words = set(stopwords.words('english'))  # English stopwords
text = text.lower()  # lowercase to simplify the analysis
text = re.sub(r'[:,.?]+', ' ', text)  # strip unwanted punctuation with a regex
tokens = word_tokenize(text)  # tokenize

tokens = [w for w in tokens if w not in stop_words]  # drop stopwords
title_bigrams = bigrams(tokens)  # build 2-grams (named so nltk's bigrams is not shadowed)
bigram_dist = FreqDist(title_bigrams)  # frequency count, essentially a map<key, value>

print(bigram_dist.most_common(100))  # print the 100 most common bigrams
client.close()

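Here is the output. The bigram ('file', 'system') occurs 15 times and ('operating', 'system') 14 times, so roughly 15 titles are about file systems, 14 are about operating systems, and so on.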

[(('file', 'system'), 15), (('operating', 'system'), 14), (('distributed', 'systems'), 13), (('fault', 'tolerance'), 7), (('preface', 'special'), 6), (('special', 'issue'), 5), (('virtual', 'memory'), 5), (('mutual', 'exclusion'), 5), (('design', 'implementation'), 4), (('distributed', 'file'), 4), (('shared', 'memory'), 4), (('reuse', 'distance'), 3), (('storage', 'system'), 3), (('operating', 'systems'), 3), (('memory', 'management'), 3), (('run-time', 'support'), 3), (('distributed', 'system'), 3), (('shared-memory', 'multiprocessors'), 3), (('network', 'file'), 3), (('issue', 'operating'), 3), (('distributed', 'mutual'), 3), (('interprocess', 'communication'), 3), (('optimal', 'parallel'), 2), (('warehouse-scale', 'computers'), 2), (('power', 'energy'), 2), (('ix', 'operating'), 2), (('system', 'combining'), 2), (('combining', 'low'), 2), (('low', 'latency'), 2), (('latency', 'high'), 2), (('high', 'throughput'), 2), (('throughput', 'efficiency'), 2), (('efficiency', 'protected'), 2), (('protected', 'dataplane'), 2), (('cache', 'hierarchies'), 2), (('distance', 'analysis'), 2), (('value', 'prediction'), 2), (('virtual', 'machine'), 2), (('content-based', 'publish/subscribe'), 2), (('scheduling', 'improve'), 2), (('multicore', 'systems'), 2), (('memory', 'systems'), 2), (('garbage', 'collection'), 2), (('networks', 'efficient'), 2), (('wireless', 'ad'), 2), (('ad', 'hoc'), 2), (('hoc', 'networks'), 2), (('load', 'balancing'), 2), (('byzantine', 'fault'), 2), (('thread-level', 'speculation'), 2), (('membership', 'service'), 2), (('multiprocessor', 'cache'), 2), (('cache', 'miss'), 2), (('real-time', 'systems'), 2), (('case', 'study'), 2), (('performance', 'analysis'), 2), (('replicated', 'services'), 2), (('multimedia', 'applications'), 2), (('speculative', 'execution'), 2), (('system', 'using'), 2), (('commodity', 'operating'), 2), (('data', 'structures'), 2), (('branch', 'prediction'), 2), (('area', 'networks'), 2), (('storage', 'systems'), 2), (('performance', 
'prediction'), 2), (('hardware', 'support'), 2), (('support', 'network'), 2), (('secure', 'distributed'), 2), (('design', 'evaluation'), 2), (('shared', 'virtual'), 2), (('network', 'interface'), 2), (('file', 'systems'), 2), (('automatically', 'parallelized'), 2), (('parallelized', 'programs'), 2), (('programs', 'using'), 2), (('performance', 'evaluation'), 2), (('system', 'based'), 2), (('traffic', 'control'), 2), (('control', 'systems'), 2), (('disk', 'scheduling'), 2), (('heterogeneous', 'distributed'), 2), (('systems', 'using'), 2), (('lightweight', 'recoverable'), 2), (('recoverable', 'virtual'), 2), (('i/o', 'performance'), 2), (('kernel', 'support'), 2), (('continuous', 'media'), 2), (('multiprocessors', 'preface'), 2), (('architectural', 'support'), 2), (('system', 'principles'), 2), (('concurrency', 'control'), 2), (('data', 'types'), 2), (('exclusion', 'algorithms'), 2), (('special', 'section'), 2), (('measurement', 'modeling'), 2), (('modeling', 'computer'), 2), (('cache', 'performance'), 2), (('systems', 'disk'), 2), (('naming', 'service'), 2)]
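The frequency list still has to be turned into usable tags. Here is a minimal sketch of one way to do that, assuming a minimum count and a small set of words to discard (pairs such as ('preface', 'special') describe journal front matter rather than a topic). The threshold, the `noise` set, and the function name are my own choices for illustration; the sample pairs are copied from the output above.

```python
def bigrams_to_tags(ranked, min_count=3, noise=frozenset({"preface", "special", "issue"})):
    """Turn (bigram, count) pairs into tag strings, skipping rare or noisy pairs."""
    tags = []
    for (first, second), count in ranked:
        if count < min_count:
            continue  # too rare to be a shared feature
        if first in noise or second in noise:
            continue  # front-matter wording, not a topic
        tags.append(f"{first} {second}")
    return tags


# A few entries copied from the most_common() output above.
ranked = [
    (("file", "system"), 15),
    (("operating", "system"), 14),
    (("preface", "special"), 6),
    (("special", "issue"), 5),
    (("virtual", "memory"), 5),
    (("optimal", "parallel"), 2),
]
print(bigrams_to_tags(ranked))
# ['file system', 'operating system', 'virtual memory']
```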

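Method 2

A ready-made nltk method: fully automatic?

Once the tokenized text is wrapped in nltk's Text class, its collocations method becomes available. It runs the analysis automatically and filters out stopwords and punctuation for you.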

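from pymongo import MongoClient
from nltk import Text, word_tokenize  # nltk, the Python natural language processing library
import util  # project-local module (not shown); util.mongodb is presumably the MongoDB host


def get_article_tags():
    client = MongoClient(util.mongodb, 27017)
    cursor = client.ccf.article.find()  # connect to the mongo database
    text = ""
    for a in cursor:  # join all titles into one text
        text += " " + a['title']
    text = text.lower()  # lowercase so case variants collapse
    tokens = word_tokenize(text)  # tokenize
    nltk_text = Text(tokens)  # wrap the tokens as an nltk Text
    nltk_text.collocations(num=1000)  # prints the collocations itself and returns None
    client.close()


get_article_tags()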

The output is much the same:

operating system; file system; fault tolerance; mutual exclusion;
distributed systems; special issue; reuse distance; virtual memory;
interprocess communication; automatically parallelized; content-based
publish/subscribe; garbage collection; protected dataplane; warehouse-
scale computers; load balancing; thread-level speculation; run-time
support; shared memory; case study; shared-memory multiprocessors;
continuous media; combining low; branch prediction; lightweight
recoverable; parallelized programs; special section; low latency;
byzantine fault; recoverable virtual; virtual machine; naming service;
replicated services; area networks; hoc networks; multimedia
applications; value prediction; data types; cache hierarchies;
speculative execution; commodity operating; high throughput;
concurrency control; distributed mutual; distance analysis; optimal
parallel; traffic control; data structures; membership service; cache
miss; network interface; replicated data; memory management; network
file; architectural support; kernel support; multiprocessor cache;
distributed file; shared virtual; hardware support; disk scheduling;
fault-tolerant distributed; system principles; heterogeneous
distributed; secure distributed; programs using; performance
prediction; storage system; operating systems; performance evaluation;
system based; real-time systems; i/o performance; performance
analysis; control systems; multicore systems; cache performance;
storage systems; distributed system; memory systems; file systems
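The order here differs from Method 1 because collocations does not rank by raw count. As far as I know, it scores each pair by how strongly the two words are associated (a likelihood-ratio measure in nltk). The sketch below is not nltk's implementation: it swaps in pointwise mutual information (PMI), a simpler association score, just to show how a tightly bound pair can outrank a more frequent one. The word list and function name are invented for the example.

```python
import math
from collections import Counter


def pmi_ranked_bigrams(words, min_count=2):
    """Rank adjacent word pairs by PMI, ignoring pairs seen fewer than min_count times."""
    word_counts = Counter(words)
    pair_counts = Counter(zip(words, words[1:]))
    total_words = len(words)
    total_pairs = max(len(words) - 1, 1)
    scored = []
    for (w1, w2), n in pair_counts.items():
        if n < min_count:
            continue
        p_pair = n / total_pairs
        p_w1 = word_counts[w1] / total_words
        p_w2 = word_counts[w2] / total_words
        # PMI: how much more often the pair occurs than chance would predict.
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up token stream: "file system" is the most frequent pair, but "file" and
# "system" also appear apart, while "mutual" and "exclusion" always appear together.
words = ("file system mutual exclusion file system mutual exclusion "
         "file server system call file system").split()
ranked = pmi_ranked_bigrams(words)
print(ranked[0][0])
# ('mutual', 'exclusion')
```

By raw count "file system" would come first in this example; by association strength it comes last among the repeated pairs.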

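Once you have the tags

Tag the articles in the database

Run an exact-phrase search against a text index to find the articles that match each tag, then attach the tag to them.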
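A rough sketch of that step, assuming a pymongo collection such as `client.ccf.article` from the earlier code. `create_index` and `update_many` are standard pymongo collection methods and `$text` / `$addToSet` are standard MongoDB operators, but the `tags` field name and both function names are assumptions of mine, not something the original database is known to use.

```python
def phrase_query(tag):
    """Build a MongoDB $text filter that matches the tag as an exact phrase."""
    # Wrapping the search string in double quotes asks $text for a phrase match.
    return {"$text": {"$search": f'"{tag}"'}}


def tag_articles(collection, tags):
    """Add each tag to every article whose title contains it as a phrase."""
    # $text queries need a text index on the searched field.
    collection.create_index([("title", "text")])
    for tag in tags:
        # $addToSet avoids duplicate tags if the script is run more than once.
        collection.update_many(phrase_query(tag), {"$addToSet": {"tags": tag}})


print(phrase_query("file system"))
# {'$text': {'$search': '"file system"'}}
```

With a live connection this would be called as `tag_articles(client.ccf.article, ["file system", "operating system"])`.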
