Building a Daily News Hot-Topic Crawler Script in Python

阿新 • Published: 2020-11-25

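Preface

The text and images in this article come from the internet and are for learning and exchange only, with no commercial purpose. If there is any problem, please contact us promptly so we can handle it.

If you are new to Python, you can copy the link below to watch an introductory Python video course.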

https://v.douyu.com/author/y6AZ4jn9jwKW

Basic development environment

Python 3.6
PyCharm

import parsel
import requests
import re

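Target page analysis

Today we crawl the International News section of the news site.

Click the button that shows more news.

The browser then requests the corresponding data interface, whose response contains the news titles and the URLs of the news detail pages.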

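How to extract the URLs

1. Convert the response to JSON and read the values by key;
2. Match the URLs with a regular expression.

Either method works; it is a matter of personal preference.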
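The two approaches can be compared side by side in a minimal sketch. The sample payload below is hypothetical (the `docs` key and the example.com URLs are assumptions for illustration), so the real interface response may be shaped differently:

```python
import json
import re

# Hypothetical sample of an interface response; the real payload may differ.
SAMPLE = json.dumps({
    "docs": [
        {"title": "News A", "url": "https://example.com/a.shtml"},
        {"title": "News B", "url": "https://example.com/b.shtml"},
    ]
})


def urls_from_json(text):
    """Method 1: parse the text as JSON and read each "url" value by key."""
    data = json.loads(text)
    return [doc["url"] for doc in data["docs"]]


def urls_from_regex(text):
    """Method 2: match every "url":"..." pair with a regular expression."""
    return re.findall(r'"url"\s*:\s*"(.*?)"', text)


print(urls_from_json(SAMPLE))
print(urls_from_regex(SAMPLE))
```

The JSON route only works when the response body is valid JSON and the key layout is known, while the regular expression also tolerates extra text wrapped around the data, which may be why the script in this article uses it.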

Pagination follows the pager parameter in the interface URL: its value is the page number.
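As a small sketch of that pagination, the list of interface URLs can be generated by changing only `pager`. The URL template is the one used in this article's script; the helper name `page_urls` is hypothetical:

```python
# URL template from this article's script; only the pager value changes.
LIST_URL = 'https://channel.chinanews.com/cns/cjs/gj.shtml?pager={}&pagenum=9&t=5_58'


def page_urls(first_page, last_page):
    """Return the interface URLs for first_page..last_page (inclusive)."""
    return [LIST_URL.format(page) for page in range(first_page, last_page + 1)]


for url in page_urls(1, 3):
    print(url)
```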

On the detail page, the news body sits in p tags inside a div tag, so ordinary page parsing is enough to extract the news content.

Ways to save the data

1. You can save it as plain txt files;
2. You can also save it as PDF.

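Summary of the overall crawling approach

On the section list page, click to load more news and find the URL of the data interface.

Match the news detail-page URLs in the data returned by the interface.

Extract the news content with the usual parsing tools (re, css, xpath).

Save the data.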

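Code implementation

Fetching the page source

# Request headers; the User-Agent value here is only an example.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}


def get_html(html_url):
    """
    Fetch a page and return the response.
    :param html_url: URL of the page
    :return: the response object (use .text or .content to read the source)
    """
    response = requests.get(url=html_url, headers=headers, timeout=10)
    return response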

Getting the URL of each news article

def get_page_url(html_data):
    """
    Get the URL of each news article.
    :param html_data: response.text of the data interface
    :return: list of news article URLs
    """
    page_url_list = re.findall('"url":"(.*?)"', html_data)
    return page_url_list

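File names cannot contain special characters, so the news title has to be cleaned first.

def file_name(name):
    """
    File names must not contain special characters.
    :param name: news title
    :return: the title with special characters replaced by '_'
    """
    replace = re.compile(r'[\\\/\:\*\?\"\<\>\|]')
    new_name = re.sub(replace, '_', name)
    return new_name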
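To see what the substitution does to a title, here is a standalone sketch. It uses an equivalent character class without the redundant backslashes; the name `sanitize` and the sample title are hypothetical:

```python
import re

# Characters that Windows does not allow in file names.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def sanitize(name):
    """Replace every character that is invalid in a file name with '_'."""
    return INVALID_CHARS.sub('_', name)


print(sanitize('World News: "A/B" talks?'))
```

Each invalid character is replaced one for one, so the cleaned title keeps its original length and stays recognizable.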

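Saving the data

import os


def download(content, title):
    """
    Save the news content to a txt file using with open.
    :param content: news content
    :param title: news title (already cleaned by file_name)
    :return: None
    """
    os.makedirs('新聞', exist_ok=True)  # create the output folder if it does not exist
    path = os.path.join('新聞', title + '.txt')
    with open(path, mode='a', encoding='utf-8') as f:
        f.write(content)
        print('Saving', title)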

Main function

def main(url):
    """
    Main function.
    :param url: URL of the news list interface
    :return: None
    """
    html_data = get_html(url).text   # response.text of the data interface
    lis = get_page_url(html_data)   # list of news detail-page URLs
    for li in lis:
        page_data = get_html(li).content.decode('utf-8', 'ignore')  # HTML of the news detail page
        selector = parsel.Selector(page_data)
        titles = re.findall('<title>(.*?)</title>', page_data, re.S)  # news title
        if not titles:  # skip pages without a <title> instead of raising IndexError
            continue
        new_title = file_name(titles[0].strip())
        new_data = selector.css('#cont_1_1_2 div.left_zw p::text').getall()
        content = ''.join(new_data)
        download(content, new_title)


if __name__ == '__main__':
    for page in range(1, 101):
        url_1 = 'https://channel.chinanews.com/cns/cjs/gj.shtml?pager={}&pagenum=9&t=5_58'.format(page)
        main(url_1)