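Scraping Website News with a Python Crawler

阿新 • Published: 2019-01-17

Site analysis

Crawling process

Getting the news link addresses

Use the requests package to fetch the news list page, then use a regular expression to extract the links to the individual news pages and return them as the list urls.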

def getList(url):
    # Fetch the list page and pull out every news-page link it contains
    li = requests.get(url)
    res = r'url":"(http:.*?\.html)'  # the group captures only the link itself
    urls = re.findall(res, li.text)
    return urls
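
To see what the regular expression captures, here is a minimal sketch that runs the same kind of pattern against a made-up response fragment instead of the live list page. The SAMPLE string, the example.com links, and the helper name extract_links are assumptions for illustration only; the real page payload may look different.

```python
import re

# Hypothetical fragment of a list-page response (not taken from the real site).
SAMPLE = (
    '{"title":"a","url":"http://example.com/news/A1.html"},'
    '{"title":"b","url":"http://example.com/news/B2.html"}'
)

# The capture group keeps only the link, so no slicing is needed afterwards.
LINK_PATTERN = re.compile(r'url":"(http:.*?\.html)')


def extract_links(text):
    """Return every link that follows a url":" key in the given text."""
    return LINK_PATTERN.findall(text)


links = extract_links(SAMPLE)
print(links)
```

The non-greedy `.*?` stops at the first `.html`, which is what keeps two links on the same line from being merged into one match.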

Getting the news content

Use requests to fetch the news page, then parse the page content with the BeautifulSoup package.

def getNews(url):
    url = url[:-5] + "_0.html"  # the _0 variant shows the whole article on one page
    ss = requests.get(url)
    soup = BeautifulSoup(ss.text, "html.parser")
    title = soup.title.string[:-6]  # drop the trailing 6 characters of the page title
    time = soup.find("div", "about").contents[0][9:]  # skip the 9-character prefix
    type = soup.find("div", "position lBlue").contents[3].string
    content = soup.find("div", "content").get_text()[1:-1]
    news = News(title, time, type, content)
    return news

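The mobile (simplified) version of the site usually splits one news story across several pages, which makes scraping the content awkward. Analysis shows that appending _0 to the news link address displays the full story on a single page, so the link is rewritten first. requests then fetches the page, and BeautifulSoup extracts the title, time, category, and content.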
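
The link rewrite described above can be sketched as a small helper. The name to_full_text_url and the example.com address are hypothetical; the only fact taken from the article is that inserting _0 before .html yields the single-page version. Unlike the bare slice url[:-5], this version leaves a link untouched if it does not end in .html or already carries the _0 suffix.

```python
def to_full_text_url(url):
    """Insert _0 before the .html suffix; return other links unchanged."""
    suffix = ".html"
    if not url.endswith(suffix):
        return url  # not a page link of the expected shape
    stem = url[:-len(suffix)]
    if stem.endswith("_0"):
        return url  # already points at the full-text page
    return stem + "_0" + suffix


# Hypothetical article link, used only to show the transformation.
print(to_full_text_url("http://example.com/news/ABC123.html"))
```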

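Saving the results

def saveAsTxt(news):
    # Append one record per line; the with-block closes the file after writing
    with open('E:/news.txt', 'a', encoding='utf-8') as file:
        file.write("Title:" + news.title +
                   "\tTime:" + news.time +
                   "\tCategory:" + news.type +
                   "\tContent:" + news.content +
                   "\n")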

Running the program

Program code

# encoding: utf-8
import requests
import re
from bs4 import BeautifulSoup
import time

class News:
    def __init__(self, title, time, type, content):
        self.title = title      # news title
        self.time = time        # publication time
        self.type = type        # news category
        self.content = content  # news body

def getList(url):   # get the news link addresses
    li = requests.get(url)
    res = r'url":"(http:.*?\.html)'    # the group captures only the link itself
    urls = re.findall(res, li.text)
    return urls

def getNews(url):   # get the news content
    url = url[:-5] + "_0.html"    # rewrite the link to get the full text
    ss = requests.get(url)
    soup = BeautifulSoup(ss.text, "html.parser")
    title = soup.title.string[:-6]
    time = soup.find("div", "about").contents[0][9:]
#    type = soup.find("div","position lBlue").contents[3].string
    type = ""    # category extraction is disabled above, so store an empty placeholder
    content = soup.find("div", "content").get_text()[1:-1]
    news = News(title, time, type, content)
    return news

def saveAsTxt(news):    # save the news content
    with open('E:/news.txt', 'a', encoding='utf-8') as file:
        file.write("Title:" + news.title +
                   "\tTime:" + news.time +
#                   "\tCategory:" + news.type +
                   "\tContent:" + news.content +
                   "\n")

start = time.perf_counter()
total = 0
for i in range(1, 40):
    wangzhi = "http://3g.163.com/touch/article/list/BA8J7DG9wangning/%s-40.html" % i
    urls = getList(wangzhi)
    total = total + len(urls)
#    print("Parsed %s links on the current page" % len(urls))
    j = 1
    for url in urls:
        print("Reading page %s, item %s/%s: %s" % (i, j, len(urls), url))
        news = getNews(url)
        saveAsTxt(news)
        j = j + 1
end = time.perf_counter()
print("Crawled %s news items in %f s" % (total, end - start))

Results

Program output
The running time depends mainly on how quickly the pages load; on a good connection the program runs quite fast.
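
Because most of the time is spent waiting for pages to load, the work is I/O-bound, and one possible way to shorten the run is to fetch several pages at once with a thread pool. The sketch below is not part of the original program: fetch_all and fake_fetch are hypothetical names, and fake_fetch only sleeps to imitate network latency so the example runs offline. In the real program, getNews would take its place.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=5):
    """Call fetch on every URL using a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for getNews(url): sleep briefly to imitate waiting on the network.
    time.sleep(0.05)
    return "content of " + url


# Hypothetical links, used only to drive the demonstration.
demo_urls = ["http://example.com/news/%d.html" % n for n in range(10)]

start = time.perf_counter()
pages = fetch_all(demo_urls, fake_fetch)
elapsed = time.perf_counter() - start
print("fetched %d pages in %.2f s" % (len(pages), elapsed))
```

Keeping max_workers small is a reasonable default, since a larger pool also puts more load on the site being crawled.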

The crawled news

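Note

This program is still an entry-level crawler: it does not address proxy IP pools or multithreaded performance. Things get harder once you also need follow-up processing such as:
Efficient storage (how should the database be organized?)
Efficient deduplication (here meaning page-level deduplication; we don't want to crawl both 人民日報 (People's Daily) and the 大民日報 that plagiarizes it)
Efficient information extraction (for example, how to extract every address on a page, such as "朝陽區奮進路中華道"); a search engine usually doesn't need to store everything, so why would I keep images…
Timely updates (predicting how often a page will be updated)
As you might expect, each of these points could occupy many researchers for a decade or more. (Zhihu: 謝科)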
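
As a rough illustration of the storage and deduplication points above, the sketch below keeps articles in SQLite and lets the database reject repeats. Everything here is an assumption rather than part of the original program: the table layout, the names open_store, content_fingerprint, and save_if_new, and the example.com links. It only catches exact repeats (same link, or same text after whitespace is removed); spotting a rewritten copy of an article would need a near-duplicate technique, which this sketch does not attempt.

```python
import hashlib
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database; url and content_hash must both be unique."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news ("
        " url TEXT PRIMARY KEY,"
        " title TEXT,"
        " published TEXT,"
        " content TEXT,"
        " content_hash TEXT UNIQUE)"
    )
    return conn


def content_fingerprint(content):
    """Hash the text with all whitespace removed, so reflowed copies still match."""
    normalized = "".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def save_if_new(conn, url, title, published, content):
    """Insert the article and return True, or return False if it is a duplicate."""
    row = (url, title, published, content, content_fingerprint(content))
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO news VALUES (?, ?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return False
    return True


conn = open_store()
results = [
    save_if_new(conn, "http://example.com/a.html", "A", "2019-01-17", "hello world"),
    save_if_new(conn, "http://example.com/a.html", "A", "2019-01-17", "hello world"),
    save_if_new(conn, "http://example.com/b.html", "B", "2019-01-17", "hello  world\n"),
    save_if_new(conn, "http://example.com/c.html", "C", "2019-01-17", "something else"),
]
print(results)  # [True, False, False, True]
```

The second call is rejected because the link repeats, and the third because the text matches the first article once whitespace is ignored.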
