
Study Notes CB005: Keyword and Corpus Extraction

Keyword extraction. The pynlpir library implements keyword extraction:

# coding:utf-8
import pynlpir

pynlpir.open()
s = '怎麼才能把電腦裡的垃圾檔案刪除'
# weighted=True returns (keyword, weight) pairs
key_words = pynlpir.get_key_words(s, weighted=True)
for key_word in key_words:
    print(key_word[0], '\t', key_word[1])
pynlpir.close()

Baidu search interface: https://www.baidu.com/s?wd=機器學習 資料探勘 資訊檢索
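In an actual request the query string needs percent-encoding; a minimal sketch of building that URL with the standard library (the keyword list is just the example query above):

```python
from urllib.parse import quote

# example keywords from the search URL above
keywords = ['機器學習', '資料探勘', '資訊檢索']
# Baidu's /s endpoint takes the query in the wd parameter
url = 'https://www.baidu.com/s?wd=' + quote(' '.join(keywords))
```

`quote` percent-encodes both the CJK characters and the spaces separating the keywords.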

Install scrapy with pip install scrapy. Create a scrapy project with scrapy startproject baidu_search. To build the crawler, create the file baidu_search/baidu_search/spiders/baidu_search.py:

# coding:utf-8
import scrapy

class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
            "https://www.baidu.com/s?wd=電腦 垃圾 檔案 刪除"
    ]

    def parse(self, response):
        # dump the raw search-result page to disk for inspection
        filename = "result.html"
        with open(filename, 'wb') as f:
            f.write(response.body)

Modify settings.py: set ROBOTSTXT_OBEY = False, USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36', and DOWNLOAD_TIMEOUT = 5.
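Collected in settings.py, the three overrides above look like this:

```python
# settings.py — the three overrides described above
ROBOTSTXT_OBEY = False  # Baidu's robots.txt would otherwise block the crawl
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/50.0.2661.102 Safari/537.36')
DOWNLOAD_TIMEOUT = 5    # give up on slow pages after 5 seconds
```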

Enter the baidu_search/baidu_search/ directory and run scrapy crawl baidu_search. This generates result.html, confirming the page is crawled correctly.

Corpus extraction. The search results are only an index; the real content requires following each link. Inspecting the crawled page shows the links are embedded in the href attribute of the a tag under h3 inside each div with class=c-container. These urls are added to the crawl queue. For each fetched page, extract the body text, strip the tags, and save the abstract. When extracting a url, also extract its title and abstract and pass them through scrapy.Request's meta dict to the handler parse_url; once the page is fetched, both values are available there and the content can be extracted. The complete record: url, title, abstract, content.

# coding:utf-8
import scrapy
from scrapy.utils.markup import remove_tags

class BaiduSearchSpider(scrapy.Spider):
    name = "baidu_search"
    allowed_domains = ["baidu.com"]
    start_urls = [
            "https://www.baidu.com/s?wd=電腦 垃圾 檔案 刪除"
    ]

    def parse(self, response):
        # each search result sits in a div with class "c-container"
        containers = response.selector.xpath('//div[contains(@class, "c-container")]')
        for container in containers:
            href = container.xpath('h3/a/@href').extract()[0]
            title = remove_tags(container.xpath('h3/a').extract()[0])
            c_abstract = container.xpath('div/div/div[contains(@class, "c-abstract")]').extract()
            abstract = ""
            if len(c_abstract) > 0:
                abstract = remove_tags(c_abstract[0])
            # pass title and abstract along to parse_url via request meta
            request = scrapy.Request(href, callback=self.parse_url)
            request.meta['title'] = title
            request.meta['abstract'] = abstract
            yield request

    def parse_url(self, response):
        print("url:", response.url)
        print("title:", response.meta['title'])
        print("abstract:", response.meta['abstract'])
        # strip all tags from the body to get plain-text content
        content = remove_tags(response.selector.xpath('//body').extract()[0])
        print("content_len:", len(content))
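remove_tags from scrapy.utils.markup simply drops the markup and keeps the text nodes. If scrapy is not at hand, the same idea can be sketched with the standard library's html.parser (strip_tags here is a hypothetical helper, not a scrapy API):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text nodes, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return ''.join(self.parts)

def strip_tags(html):
    # hypothetical stand-in for scrapy's remove_tags
    p = TagStripper()
    p.feed(html)
    return p.text()

strip_tags('<div><h3><a href="#">電腦 垃圾 檔案</a></h3></div>')  # → '電腦 垃圾 檔案'
```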

References:

"Natural Language Processing with Python" (《Python 自然語言處理》)

http://www.shareditor.com/blogshow/?blogId=43

http://www.shareditor.com/blogshow?blogId=76

Recommendations for machine learning jobs in Shanghai are welcome. My WeChat: qingxingfengzi