
[Web Scraping] Scraping the news site of Huaian College of Information Technology (蘇電院) with Python

This script crawls all of the news content on the Huaian College of Information Technology site, across all of its news sections. It prints the URL of every news detail page and every file-download URL it finds, and it also includes a (commented-out) file-download routine that you can adapt yourself; for now only the DOC/XLS/PDF attachments discovered during the crawl are supported (see the download sketch after the script).

The script is a skeleton: it solves the URL-discovery part of the crawl, and you can adjust the code to fit your actual needs. To make it easy to store results in a database, the news_*[] lists are reserved for database writes; a sketch of that step follows this paragraph. Everyone is welcome to extend the script, and feel free to leave a link to your GitHub repository in the comments.
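For instance, here is a minimal sketch of that database step, assuming a local SQLite file named hcit_news.db and a hypothetical news table; neither is part of the original script:

import sqlite3

def save_news(news_link, news_title, news_time, news_text):
    # Hypothetical schema; adjust the columns to your needs
    conn = sqlite3.connect("hcit_news.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS news (
                        link  TEXT PRIMARY KEY,
                        title TEXT,
                        time  TEXT,
                        body  TEXT)""")
    rows = zip(news_link, news_title, news_time,
               (str(t) for t in news_text))  # each body is a bs4 Tag; store it as HTML
    conn.executemany("INSERT OR IGNORE INTO news VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()

Calling save_news(*reptiles(responses, addr)) after each list page would persist the four lists that reptiles() returns.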

When iterating over the list-page URLs, the author uses a somewhat contrived method to handle hcit's reverse-numbered news pages (N.htm); if you have a cleaner approach, I would be happy to discuss it. One alternative is sketched below.
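The sketch assumes the section index page links to its numbered siblings as plain "N.htm" hrefs (the pattern is a guess, not verified against the site); it collects the page numbers from the index instead of probing URLs until one fails:

import re
import requests

def list_page_numbers(addr):
    # Assumes numbered list pages are linked as href="N.htm" on the index page;
    # adjust the pattern if the markup differs.
    res = requests.get(addr + ".htm")
    res.encoding = "utf-8"
    return sorted({int(n) for n in re.findall(r'href="(\d+)\.htm"', res.text)})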

#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup


def reptiles(responses,addr):
    soup = BeautifulSoup(responses.text,"html.parser")
    k = 0
    news_link   = []   # reserved for database writes (see the note above)
    news_title  = []
    news_time   = []
    news_text   = []
    while True:
        # List items on each page carry the ids line_u8_0, line_u8_1, ...
        soup_li = soup.find(id="line_u8_%d" % k)
        if soup_li is None:  # no more items on this page
            break
        # Title; the leading characters are list markup, so strip them off
        title = soup_li.a.get_text()
        title = title[10:]
        news_title.append(title)
        # Publication date
        time = soup_li.span.get_text()
        news_time.append(time)
        # Detail page: resolve the relative link against the section base
        link = soup_li.a['href']
        news_link.append(link)
        newsText_url = addr + link
        res = requests.get(newsText_url)
        res.encoding = "utf-8"
        print("[detail page] %s  [%d]" % (newsText_url, res.status_code))
        soup_newsText = BeautifulSoup(res.text, "html.parser")
        newsText = soup_newsText.find(id="vsb_content")
        news_text.append(newsText)
        # Find an attachment link inside the article body (if any) and download it
        try:
            newsText_a = newsText.a['href']
            newsText_name = newsText.a.string
            newsText_link = "http://www.hcit.edu.cn" + newsText_a
            print("[attachment] " + newsText_link)
            # res_link = requests.get(newsText_link)
            # with open(newsText_name, 'wb') as f:
            #     f.write(res_link.content)
        except (AttributeError, TypeError, KeyError):
            print("[no attachment]")
        
        k += 1
    return news_link, news_title, news_time, news_text


def main():
    url_addr = [
        # "http://www.hcit.edu.cn/sdxw/xyyw",
        # "http://www.hcit.edu.cn/sdxw/ybdt",
        # "http://www.hcit.edu.cn/sdxw/mtjj",
        "http://www.hcit.edu.cn/sdxw/ggtz"
    ]

    for addr in url_addr:
        page = 1
        while True:
            addr_url = addr + "/" + str(page) + ".htm"
            # print(addr_url)
            responses = requests.get(addr_url)
            if responses.status_code != 200:
                # Page N does not exist; the section index (addr.htm) carries the
                # remaining entries, and its links resolve one level up ("/../").
                res_index = requests.get(addr + ".htm")
                res_index.encoding = "utf-8"
                print("[list page] %s [%s]" % (addr + ".htm", res_index.status_code))
                reptiles(res_index, addr + "/../")
                print("+++++++++++++++++++++++++[None]++++++++++++++++++++++++++++")
                break
            responses.encoding = "utf-8"
            # Confirm the request succeeded before parsing
            print("[list page] %s [%s]" % (addr_url, responses.status_code))
            reptiles(responses,addr + "/")
            page += 1
            print("=============================================")

if __name__ == "__main__":
    main()
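The download lines inside reptiles() are left commented out. If you enable them, a slightly more defensive variant might look like the sketch below: it streams the response to disk and enforces the DOC/XLS/PDF restriction mentioned at the top. The extension whitelist and the filename fallback are assumptions, not part of the original script:

import os
import requests

ALLOWED = (".doc", ".docx", ".xls", ".xlsx", ".pdf")  # assumed whitelist

def download_attachment(url, name):
    # Skip anything outside the supported extensions
    if not url.lower().endswith(ALLOWED):
        print("[skipped] " + url)
        return
    # Fall back to the URL's last path segment when no link text is available
    filename = name or os.path.basename(url)
    res = requests.get(url, stream=True, timeout=30)
    res.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in res.iter_content(chunk_size=8192):
            f.write(chunk)
    print("[downloaded] " + filename)

Inside the try block you would then call download_attachment(newsText_link, newsText_name) in place of the commented-out lines.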