Study Notes CB005: Keyword and Corpus Extraction
Keyword extraction. The pynlpir library provides keyword extraction.
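# coding:utf-8
import pynlpir

# Load the NLPIR data files before calling any other pynlpir function.
pynlpir.open()
s = '怎麼才能把電腦裡的垃圾檔案刪除'  # "How can I delete the junk files on my computer?"
# weighted=True returns (keyword, weight) tuples instead of bare strings.
key_words = pynlpir.get_key_words(s, weighted=True)
for key_word in key_words:
    # Print each keyword and its weight, separated by a tab.
    print(key_word[0], '\t', key_word[1])
# Release the resources held by NLPIR.
pynlpir.close()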
Baidu search interface: https://www.baidu.com/s?wd=機器學習 資料探勘 資訊檢索
Install scrapy: pip install scrapy. Create a scrapy project: scrapy startproject baidu_search. To build the crawler, create the file baidu_search/baidu_search/spiders/baidu_search.py.
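The step from extracted keywords to a search URL can be sketched as follows. This is a minimal illustration, not part of the original notes: the helper names top_keywords and build_search_url are hypothetical, and the weights are made-up sample values rather than real pynlpir output. Only the wd parameter and the base URL come from the interface shown above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def top_keywords(weighted, n=3):
    """Return the n highest-weighted words from (word, weight) pairs."""
    ranked = sorted(weighted, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def build_search_url(keywords, base="https://www.baidu.com/s"):
    """Join keywords with spaces and percent-encode them as the wd parameter."""
    return base + "?" + urlencode({"wd": " ".join(keywords)})


# Hypothetical weights, for illustration only (not real pynlpir output).
weighted = [("電腦", 2.0), ("垃圾", 2.5), ("刪除", 1.5), ("檔案", 1.8)]
keywords = top_keywords(weighted, n=3)
url = build_search_url(keywords)

# Decoding the query string recovers the space-separated keywords.
decoded = parse_qs(urlsplit(url).query)["wd"][0]
print(url)
print(decoded)
```

Encoding the query explicitly avoids relying on the HTTP client to escape the spaces and non-ASCII characters in the URL.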
# coding:utf-8
import scrapy


class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=電腦 垃圾 檔案 刪除"
    ]

    def parse(self, response):
        # Save the raw search result page so it can be inspected by hand.
        filename = "result.html"
        with open(filename, 'wb') as f:
            f.write(response.body)
Edit settings.py and set ROBOTSTXT_OBEY = False, USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36', and DOWNLOAD_TIMEOUT = 5.
Change into the baidu_search/baidu_search/ directory and run scrapy crawl baidu_search. This generates result.html, which confirms that the page was fetched correctly.
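Written out as a settings.py fragment, the three overrides might look like the following. The values are taken from the notes; the comments explaining why each one is needed are my own reading, not something the notes state.

```python
# settings.py (fragment): the three overrides described above.

# Do not let robots.txt rules filter the request (assumption: with the
# default obey-robots behaviour the search URL would not be fetched).
ROBOTSTXT_OBEY = False

# Present a regular desktop browser User-Agent instead of Scrapy's default.
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/50.0.2661.102 Safari/537.36'
)

# Give up on any download that takes longer than 5 seconds.
DOWNLOAD_TIMEOUT = 5
```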
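Corpus extraction. The search results are only an index; the real content lives behind each link. Inspecting the fetched page shows that every link sits in the href attribute of an a tag, inside an h3, inside a div with class=c-container. Each url is added to the crawl queue and fetched. From the fetched page, extract the body text and strip the tags; the abstract is saved from the result page. When extracting the url, also extract the title and abstract and pass them through the meta of scrapy.Request to the handler function parse_url, so that both values are available once the fetch completes and content can be extracted. The complete record: url, title, abstract, content.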
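As a rough offline illustration of the page structure just described, the sketch below pulls url, title, and abstract out of a small hand-written snippet using only the standard library. The markup and the function name extract_results are hypothetical, and the snippet is deliberately well-formed; real Baidu pages are not valid XML, which is why the spider itself relies on Scrapy's XPath selectors. Class names are matched as whole tokens here, whereas an XPath contains() test matches substrings.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a Baidu result page.
SAMPLE = """<html><body>
<div class="result c-container">
  <h3><a href="http://www.baidu.com/link?url=abc">How to <em>delete</em> junk files</a></h3>
  <div><div><div class="c-abstract">Open the cleanup tool first.</div></div></div>
</div>
<div class="c-container">
  <h3><a href="http://www.baidu.com/link?url=def">Second result</a></h3>
</div>
</body></html>"""


def has_class(element, name):
    """True if the element's class attribute contains the given token."""
    return name in element.get("class", "").split()


def extract_results(markup):
    """Return one dict (url, title, abstract) per c-container div."""
    root = ET.fromstring(markup)
    results = []
    for div in root.iter("div"):
        if not has_class(div, "c-container"):
            continue
        link = div.find("h3/a")
        if link is None:
            continue  # container without a result link
        # itertext() drops nested tags such as <em> and keeps their text.
        title = "".join(link.itertext()).strip()
        abstract = ""
        for inner in div.iter("div"):
            if has_class(inner, "c-abstract"):
                abstract = "".join(inner.itertext()).strip()
                break
        results.append({"url": link.get("href"), "title": title, "abstract": abstract})
    return results


results = extract_results(SAMPLE)
for item in results:
    print(item)
```

The second container has no abstract, so its abstract field stays an empty string, mirroring the default used in the spider.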
# coding:utf-8
import scrapy
# remove_tags lives in w3lib (a Scrapy dependency); older Scrapy releases
# also exposed it as scrapy.utils.markup.remove_tags.
from w3lib.html import remove_tags


class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
        "https://www.baidu.com/s?wd=電腦 垃圾 檔案 刪除"
    ]

    def parse(self, response):
        # filename = "result.html"
        # with open(filename, 'wb') as f:
        #     f.write(response.body)
        hrefs = response.selector.xpath('//div[contains(@class, "c-container")]/h3/a/@href').extract()
        # for href in hrefs:
        #     print(href)
        #     yield scrapy.Request(href, callback=self.parse_url)
        # Each search result sits in a div whose class contains "c-container".
        containers = response.selector.xpath('//div[contains(@class, "c-container")]')
        for container in containers:
            links = container.xpath('h3/a/@href').extract()
            if not links:
                continue  # skip containers that carry no result link
            href = links[0]
            title = remove_tags(container.xpath('h3/a').extract()[0])
            c_abstract = container.xpath('div/div/div[contains(@class, "c-abstract")]').extract()
            abstract = ""
            if len(c_abstract) > 0:
                abstract = remove_tags(c_abstract[0])
            # Carry title and abstract along to the callback through meta.
            request = scrapy.Request(href, callback=self.parse_url)
            request.meta['title'] = title
            request.meta['abstract'] = abstract
            yield request

    def parse_url(self, response):
        print(len(response.body))
        print("url:", response.url)
        print("title:", response.meta['title'])
        print("abstract:", response.meta['abstract'])
        # Strip all tags from the body to get the plain-text content.
        content = remove_tags(response.selector.xpath('//body').extract()[0])
        print("content_len:", len(content))
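The spider only prints the four fields. One possible way to assemble them into a storable record is sketched below, using only the standard library. The names TextExtractor, strip_tags, and build_record are hypothetical, and the tag stripper is a stand-in for remove_tags rather than a copy of it: unlike remove_tags, it also discards the contents of script and style elements. The URL and HTML are sample values.

```python
import json
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(markup):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def build_record(url, title, abstract, body_html):
    """Bundle the four fields named in the notes into one dict."""
    return {
        "url": url,
        "title": title,
        "abstract": abstract,
        "content": strip_tags(body_html),
    }


# Sample values for illustration only.
record = build_record(
    "http://example.com/page",
    "Sample title",
    "Sample abstract",
    "<body><h1>Hello</h1><script>var x = 1;</script><p>world</p></body>",
)
# ensure_ascii=False keeps Chinese text readable in the output file.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing one JSON object per line keeps the corpus easy to append to and to stream back later.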
References:
"Python Natural Language Processing" (《Python 自然語言處理》)
http://www.shareditor.com/blogshow/?blogId=43
http://www.shareditor.com/blogshow?blogId=76