Python Crawler in Practice: Scraping Bilibili Anime Info (Detailed Walkthrough)

阿新 • Published: 2019-01-21

Goal: scrape the recently updated anime series on Bilibili.
Output format: name + play count + synopsis.
Let's get started.

Libraries used:
requests: HTTP requests
pyquery: parses XML/HTML documents with a jQuery-like API

1. Analyze the page layout and find the content to scrape

Design the Video class:

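import requests
from pyquery import PyQuery as pq

class Video(object):
    """One anime entry: name, play count, and synopsis."""
    def __init__(self, name, see, intro):
        self.name = name    # title
        self.see = see      # play count, as shown on the page
        self.intro = intro  # synopsis

    def __str__(self):
        return "{}--{}--{}".format(self.name, self.see, self.intro)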
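As a quick check of the output format, the sketch below repeats the class and prints one instance. The sample values are made up for illustration and do not come from Bilibili.

```python
class Video(object):
    def __init__(self, name, see, intro):
        self.name = name
        self.see = see
        self.intro = intro

    def __str__(self):
        return "{}--{}--{}".format(self.name, self.see, self.intro)

# Hypothetical values, only to show the name--plays--synopsis layout
v = Video("Example Show", "12345", "A short synopsis.")
print(v)  # Example Show--12345--A short synopsis.
```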

With the page analyzed, design the crawler class:

class bilibili(object):
    host = "https://bangumi.bilibili.com"

    def __init__(self):
        # Fetch and parse the anime index page once
        self.dom = pq(requests.get('https://bangumi.bilibili.com/22/').text)

    def get_recent(self):
        '''Recently updated'''
        items = self.dom('#list_bangumi_new .c-list .new .c-item')
        videos = []
        # .items() yields PyQuery objects; plain iteration yields raw lxml elements
        for i in items.items():
            name = i.find('.r-i .t').attr('title')
            link = self.host + i.find('.r-i .t').attr('href')
            # Follow the link to the detail page for the play count and synopsis
            d = pq(requests.get(url=link).text)
            see = d(".info-count .info-count-item").eq(1).find('em').text()
            intro = d('.info-row').eq(3).find('.info-desc').text()
            videos.append(Video(name=name, see=see, intro=intro))
        return videos

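Run a quick test:

Oops, the result comes back empty.
Don't panic. If the code itself has no errors, an empty result usually has one of two causes:
the selector does not match the target, or the page content is loaded dynamically by JavaScript.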
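One way to tell the two causes apart from Python is to check whether the container id is present in the raw HTML at all. The sketch below uses only the standard library; the two page sources are hypothetical stand-ins, and in practice you would pass in the text returned by requests.get(...).

```python
from html.parser import HTMLParser

class IdFinder(HTMLParser):
    """Records whether any start tag carries the given id attribute."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.found = True

def id_in_static_html(html, target_id):
    """Return True if an element with this id exists in the raw HTML."""
    finder = IdFinder(target_id)
    finder.feed(html)
    return finder.found

# Hypothetical page sources, for illustration only
static_page = '<div id="list_bangumi_new"><ul class="c-list"></ul></div>'
js_page = '<div id="app"></div><script src="bundle.js"></script>'

print(id_in_static_html(static_page, "list_bangumi_new"))  # True
print(id_in_static_html(js_page, "list_bangumi_new"))      # False
```

If the id is missing from the raw HTML but the same selector works in the browser console, the content is most likely being filled in by JavaScript after the page loads.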

Check the first case first: open the browser, press F12, and run the selector string in the console. Here that is $('#list_bangumi_new .c-list .new .c-item')

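The selector matches the target in the browser, so the page must be loading its content dynamically with JavaScript. That makes things easier: we only need to find the API it calls. Open the browser, press F12, and look through the Network tab:
url: https://bangumi.bilibili.com/api/timeline_v2_global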
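Before writing the crawler against that endpoint, it helps to look at which fields one entry carries. The sketch below does this on a hypothetical, heavily trimmed response body; the 'result', 'title', and 'season_id' names are the ones used later in this article, and everything else about the sample is an assumption.

```python
import json

# Hypothetical, heavily trimmed response body; the real one has many more fields
SAMPLE_RESPONSE = '{"result": [{"title": "Example Show", "season_id": 6439}]}'

def item_fields(response_text):
    """Return the sorted field names of the first entry under 'result'."""
    items = json.loads(response_text)["result"]
    return sorted(items[0].keys()) if items else []

print(item_fields(SAMPLE_RESPONSE))  # ['season_id', 'title']
```

Against the live endpoint, the same function could be fed the response text from requests.get(...), assuming the response still has that shape.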

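This is the data for a single item, and it contains the name we want. The next step is to get the play count and synopsis from the detail page, but where is the detail page link? The API response does not include it, so press F12 and inspect the element.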

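The link here is /anime/6439. The API response has no such link, so it must be assembled from other fields. The key is the number 6439: searching the API response for it turns up a matching season_id field. The detail page URL is therefore built as:
detail_url = "https://bangumi.bilibili.com/anime/{season_id}"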
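As a small illustration of that template, the sketch below fills it in from one item. The item dict is a made-up stand-in for an API entry, and build_detail_url is a helper name introduced here, not part of the original code.

```python
DETAIL_URL = "https://bangumi.bilibili.com/anime/{season_id}"

def build_detail_url(item):
    """Fill the URL template with the item's season_id."""
    return DETAIL_URL.format(season_id=item["season_id"])

# Hypothetical API entry, reduced to the two fields discussed here
item = {"title": "Example Show", "season_id": 6439}
print(build_detail_url(item))  # https://bangumi.bilibili.com/anime/6439
```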

Next, analyze the detail page and scrape the play count and synopsis. The extraction code is:
see = d(".info-count .info-count-item").eq(1).find('em').text()
intro = d('.info-desc-wrp').find('.info-desc').text()

The key code of the final crawler class:

import json  # needed to decode the API response

class bilibili(object):
    recent_url = "https://bangumi.bilibili.com/api/timeline_v2_global"  # recently updated
    detail_url = "https://bangumi.bilibili.com/anime/{season_id}"

    def get_recent(self):
        '''Recently updated'''
        # The API returns JSON; the entries live under the 'result' key
        items = json.loads(requests.get(self.recent_url).text)['result']
        videos = []
        for i in items:
            name = i['title']
            link = self.detail_url.format(season_id=i['season_id'])
            # The detail page holds the play count and synopsis
            d = pq(requests.get(url=link).text)
            see = d(".info-count .info-count-item").eq(1).find('em').text()
            intro = d('.info-desc-wrp').find('.info-desc').text()
            videos.append(Video(name=name, see=see, intro=intro))
        return videos

Run it:

It works. Next, turn it into a command-line tool.

2. Build the command-line version

Libraries used:
argparse: parses command-line arguments

The main code:

import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--recent', help="get the recent info", action="store_true")
    parser.add_argument('--num', help="The number of results returned, default show all", type=int, default=0)
    parser.add_argument('-v', '--version', help="show version", action="store_true")
    args = parser.parse_args()

    if args.version:
        print("bilibili 1.0")
    elif args.recent:
        b = bilibili()
        videos = b.get_recent()
        # --num 0 (the default) means show everything
        if args.num > 0:
            videos = videos[:args.num]
        for v in videos:
            print(v)
Here is the result:

OK, all done. From here, feel free to add more features of your own :)
