Python Crawler for News Pages (jieba Word Segmentation + Keyword Search Ranking)

阿新 • Published: 2020-08-22

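Preface

I recently did a Python 3 homework assignment that involves:

Web crawling
Extracting Chinese text from web pages
Building a text index
Keyword search

The libraries involved are:

Crawling: requests
Parsing: BeautifulSoup (bs4)
Regular expressions: re
Word segmentation: jieba
...

I am posting the code as a quick reference for building a small demo.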

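Problem description

Design and implementation of a search engine

Input: links to Tencent Sports pages, given as a list of arbitrary length, for example:

["http://fiba.qq.com/a/20190420/001968.htm",
"http://sports.qq.com/a/20190424/000181.htm",
"http://sports.qq.com/a/20190423/007933.htm",
"http://new.qq.com/omn/SPO2019042400075107"]

Process: crawl the pages, parse them, extract and analyze the Chinese text, and build an index. Third-party libraries from the textbook must be used, the intermediate steps must run in memory, and the running time of this stage must be printed.
Search: prompt for a keyword and search for it.
Output: the input links sorted by how often the keyword occurs, from high to low, with the word-frequency data and other auxiliary information output as JSON. Links to documents that do not contain the keyword are not printed. Finally, print the search time. For example:

1 "http:xxxxxx.htm" 3
2 "https:xxxx.htm" 2
3 "https:xxxxx.htm" 1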
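The required output can be sketched with a small in-memory example. The index below is hypothetical (made-up URLs and counts, not crawl results), and the function name rank_by_keyword is an assumption for illustration. It shows the ranking rule (most frequent first, documents without the keyword skipped) and the JSON dump of the frequency data.

```python
import json
import time

# Hypothetical in-memory index: document id -> URL and word counts.
# The URLs and counts are made up for illustration only.
sample_index = {
    1: {"url": "http://example.com/a.htm", "word": {"籃球": 3, "比賽": 1}},
    2: {"url": "http://example.com/b.htm", "word": {"足球": 2}},
    3: {"url": "http://example.com/c.htm", "word": {"籃球": 5}},
}


def rank_by_keyword(index, keyword):
    """Return [url, count] pairs for documents containing keyword, most frequent first."""
    hits = [[doc["url"], doc["word"][keyword]]
            for doc in index.values() if keyword in doc["word"]]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits


start = time.perf_counter()
ranking = rank_by_keyword(sample_index, "籃球")
elapsed = time.perf_counter() - start

# One line per matching document: rank, quoted URL, keyword count
for position, (url, count) in enumerate(ranking, start=1):
    print(position, json.dumps(url), count)
print("Search time:", elapsed)
# ensure_ascii=False keeps the Chinese words readable in the JSON output
print(json.dumps(sample_index, ensure_ascii=False))
```

The document with id 2 never appears in the ranking because it does not contain the keyword at all.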

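Code

The main steps of the implementation are:

Crawl the pages: the crawler function
Clean the page text, removing leftover English characters and tags: the bs4_page_clean function
Extract the Chinese text with a regular expression: the re_chinese function
Store each page's Chinese characters and words in a dict as the index: the jieba_create_index function
Search for the entered keyword: the search function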
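The extraction and indexing steps can be tried in isolation, without network access. This is a minimal sketch, not the article's code: the names extract_chinese and build_index are hypothetical, the pattern is narrowed to the CJK Unified Ideographs block (U+4E00 to U+9FFF) as an assumption about what counts as Chinese, and the token list is hard-coded where the article calls jieba.lcut_for_search.

```python
import re
from collections import Counter

# Assumption: "Chinese" means the CJK Unified Ideographs block only.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_chinese(text):
    """Return the runs of Chinese characters found in text, in order."""
    return CHINESE_RUN.findall(text)


def build_index(tokens):
    """Count how often each token occurs, like the manual dict loop does."""
    return dict(Counter(tokens))


runs = extract_chinese("<p>NBA季後賽:勇士 vs 火箭</p>")
print(runs)

# Hard-coded tokens standing in for the output of a segmenter such as jieba.
tokens = ["勇士", "火箭", "勇士", "季後賽"]
print(build_index(tokens))
```

Keeping the runs separate rather than joining them right away preserves the boundaries that punctuation and markup provided, which a segmenter could otherwise mistake for part of a word.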

import requests
from bs4 import BeautifulSoup
import json
import re
import jieba
import time

USER_AGENT = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) '
                            'Chrome/20.0.1092.0 Safari/536.6'}
URL_TIMEOUT = 10
SLEEP_TIME = 2

# dict_result format: {1:
#                       {"url": "xxxxx", "word": {"word1": x, "word2": x, "word3": x}},
#                      2:
#                       {"url": "xxxxx", "word": {"word1": x, "word2": x, "word3": x}}
#                     }
# (keys are ints in memory; json.dumps writes them as strings)
dict_result = {}

# list_search_result format: [
#                              [url, count],
#                              [url, count]
#                            ]
list_search_result = []


def crawler(list_URL):
    for i, url in enumerate(list_URL):
        print("Crawling page:", url, "...")
        page = requests.get(url, headers=USER_AGENT, timeout=URL_TIMEOUT)
        page.encoding = page.apparent_encoding  # guess the encoding from the body to avoid garbled text
        result_clean_page = bs4_page_clean(page)
        result_chinese = re_chinese(result_clean_page)
        # print("Chinese content of the page:", result_chinese)
        dict_result[i + 1] = {"url": url, "word": jieba_create_index(result_chinese)}
        print("Crawler sleeping...")
        time.sleep(SLEEP_TIME)


def bs4_page_clean(page):
    print("Cleaning: removing page tags and other irrelevant content...")
    soup = BeautifulSoup(page.text, "html.parser")
    # Drop <script> and <style> elements so their contents are not indexed
    for tag in soup.find_all(["script", "style"]):
        tag.extract()
    # Strip whatever tags remain from the prettified markup
    reg1 = re.compile("<[^>]*>")
    content = reg1.sub('', soup.prettify())
    return str(content)


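def re_chinese(content):
    print("Regular expression: extracting Chinese...")
    # Broad range: besides CJK ideographs it also matches kana, Hangul,
    # full-width punctuation and many symbols
    pattern = re.compile('[\u1100-\uFFFD]+')
    result = pattern.findall(content)
    return ''.join(result)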


def jieba_create_index(string):
    list_word = jieba.lcut_for_search(string)
    dict_word_temp = {}
    for word in list_word:
        if word in dict_word_temp:
            dict_word_temp[word] += 1
        else:
            dict_word_temp[word] = 1
    return dict_word_temp


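def search(string):
    list_search_result.clear()  # start from an empty result list on every search
    for k, v in dict_result.items():
        if string in v["word"]:
            list_search_result.append([v["url"], v["word"][string]])
    # Sort the list by word frequency, highest first
    list_search_result.sort(key=lambda x: x[1], reverse=True)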

if __name__ == "__main__":

    list_URL_sport = input("Enter the list of URLs: ")
    list_URL_sport = list_URL_sport.split(",")
    print(list_URL_sport)
    # Strip whitespace, brackets and quotes from both ends of each URL
    for i in range(len(list_URL_sport)):
        list_URL_sport[i] = list_URL_sport[i].strip(' \t[]"\'')
    print(list_URL_sport)
    # list_URL_sport = ["http://fiba.qq.com/a/20190420/001968.htm",
    #                   "http://sports.qq.com/a/20190424/000181.htm",
    #                   "http://sports.qq.com/a/20190423/007933.htm",
    #                   "http://new.qq.com/omn/SPO2019042400075107"]
    time_start_crawler = time.time()
    crawler(list_URL_sport)
    time_end_crawler = time.time()
    print("Crawl and analysis time:", time_end_crawler - time_start_crawler)
    word = input("Enter the keyword to search for: ")
    time_start_search = time.time()
    search(word)
    time_end_search = time.time()
    print("Search time:", time_end_search - time_start_search)
    for i, row in enumerate(list_search_result):
        print(i+1, row[0], row[1])
    print("Word frequency information:")
    print(json.dumps(dict_result, ensure_ascii=False))
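Since the task supplies the URLs as a bracketed, quoted list, the input line can also be parsed as JSON. A minimal sketch, with a hypothetical helper name parse_url_list, that falls back to splitting on commas when the line is not a valid JSON list:

```python
import json


def parse_url_list(raw):
    """Parse a line such as '["http://a", "http://b"]' into a list of URL strings."""
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, list):
            return [str(item).strip() for item in parsed]
    except json.JSONDecodeError:
        pass
    # Fallback: comma-separated values, optionally quoted or bracketed
    parts = (part.strip(' \t[]"\'') for part in raw.split(","))
    return [part for part in parts if part]


print(parse_url_list('["http://fiba.qq.com/a/20190420/001968.htm", '
                     '"http://sports.qq.com/a/20190424/000181.htm"]'))
print(parse_url_list('http://fiba.qq.com/a/20190420/001968.htm, '
                     '"http://sports.qq.com/a/20190424/000181.htm"'))
```

Both calls yield the same two URLs, so the list from the problem statement can be pasted as-is, with or without the brackets.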

Results

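Follow me

I am currently a back-end developer, mainly interested in back-end development, data security, web crawlers, the Internet of Things, and edge computing.

WeChat: yangzd1102

Github: @qqxx6661

Personal blog:

CSDN: @Rude3Knife
Zhihu: @Zhendong
Jianshu: @蠻三刀把刀
Juejin: @蠻三刀把刀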

