Scraping the Douyin App with Python in Just Ten Lines of Code

阿新 • Published: 2018-11-07

Environment notes

Environment:

Python 3.7.1

CentOS 7.4

pip 10.0.1

Deployment

# python3.7 --version
Python 3.7.1

# pip3 install douyin

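The installation sometimes fails because of network problems. If it does, rerun the command above until the installation completes.

Importing the douyin module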

# python3.7
>>> import douyin
>>>

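If the import raises an error, the douyin module was probably not installed successfully.

Now let's start scraping Douyin's short videos and music.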

# python3.7 dou.py

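A few minutes later, let's look at what was scraped.

The music attached to each video is saved as an mp3 file, and each Douyin video is saved as an mp4 file.

Hmm, not bad, haha.

The py script

According to the author, the library can scrape all related videos under Douyin's trending topics and trending music and download them, and it also downloads each video's background music separately. In addition, it scrapes each video's metadata, such as the publisher, like count, comment count, publish time, and publish location, and stores it in a MongoDB database.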

import douyin
from douyin.structures import Topic, Music

# Define the handlers for video download, audio download, and MongoDB storage
video_file_handler = douyin.handlers.VideoFileHandler(folder='./videos')
music_file_handler = douyin.handlers.MusicFileHandler(folder='./musics')
# mongo_handler = douyin.handlers.MongoHandler()
# Define the downloader and pass the handlers to it as arguments
# downloader = douyin.downloaders.VideoDownloader([mongo_handler, video_file_handler, music_file_handler])
downloader = douyin.downloaders.VideoDownloader([video_file_handler, music_file_handler])
# Loop over Douyin's trending list, then download and save the results
for result in douyin.hot.trend():
    for item in result.data:
        # Scrape the videos under each trending topic or trending music,
        # at most 10 related videos per topic or music.
        downloader.download(item.videos(max=10))

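Because I don't have MongoDB installed here, I commented out the MongoDB-related configuration.

The author's GitHub repository: https://github.com/Python3WebSpider/DouYin

==== The following is excerpted from the author ====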

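Code walkthrough

This library depends on the following libraries:

aiohttp: asynchronous data downloads, which speed up downloading
dateparser: conversion of dates in arbitrary formats
motor: asynchronous MongoDB storage, which speeds up saving
requests: basic HTTP request simulation
tqdm: progress bar display

Data structure definitions

When building a library, one important point is to give the key pieces of information a structured definition and to wrap certain objects using object-oriented thinking. Scraping Douyin is no exception.

Douyin actually has many kinds of objects, such as videos, music, topics, users, and comments, and they are linked by relationships. For example, if a video uses a piece of background music, the video and the music have a "uses" relationship; if a user publishes a video, the user and the video have a "publishes" relationship. We can wrap each object using object-oriented thinking. A video, for example, can be defined with the following structure: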

class Video(Base):
    def __init__(self, **kwargs):
        """
        init video object
        :param kwargs:
        """
        super().__init__()
        self.id = kwargs.get('id')
        self.desc = kwargs.get('desc')
        self.author = kwargs.get('author')
        self.music = kwargs.get('music')
        self.like_count = kwargs.get('like_count')
        self.comment_count = kwargs.get('comment_count')
        self.share_count = kwargs.get('share_count')
        self.hot_count = kwargs.get('hot_count')
        ...
        self.address = kwargs.get('address')

    def __repr__(self):
        """
        video to str
        :return: str
        """
        return '<Video: <%s, %s>>' % (self.id, self.desc[:10].strip() if self.desc else None)

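Here some key attributes are defined as part of the Video class, including id (the identifier), desc (the description), author (the publisher), and music (the background music). Note that author and music are not plain strings; they are separately defined data structures. For example, author is an object of type User, and User is in turn defined as follows: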

class User(Base):

    def __init__(self, **kwargs):
        """
        init user object
        :param kwargs:
        """
        super().__init__()
        self.id = kwargs.get('id')
        self.gender = kwargs.get('gender')
        self.name = kwargs.get('name')
        self.create_time = kwargs.get('create_time')
        self.birthday = kwargs.get('birthday')
        ...

    def __repr__(self):
        """
        user to str
        :return:
        """
        return '<User: <%s, %s>>' % (self.alias, self.name)

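So by linking objects through their attributes, we get a clear logical structure, and we no longer need to maintain separate dictionaries to store the data. This is similar to how Item is defined in Scrapy.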
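As a minimal sketch of this idea (not the library's actual code), the following uses simplified stand-in classes to show how a Video can hold a User object rather than a plain string, and how such a nested structure might be flattened into a dictionary for storage. The `Base` class and the way its `json` method works are assumptions made for illustration, since the real `Base` is not shown in this article.

```python
class Base:
    """Hypothetical base class; the library's real Base is not shown here."""

    def json(self):
        # Recursively convert nested Base objects into plain dictionaries.
        result = {}
        for key, value in vars(self).items():
            result[key] = value.json() if isinstance(value, Base) else value
        return result


class User(Base):
    def __init__(self, **kwargs):
        self.id = kwargs.get('id')
        self.name = kwargs.get('name')


class Video(Base):
    def __init__(self, **kwargs):
        self.id = kwargs.get('id')
        self.desc = kwargs.get('desc')
        # author is a User object, not a plain string
        self.author = kwargs.get('author')


author = User(id='u1', name='alice')
video = Video(id='v1', desc='demo clip', author=author)
print(video.author.name)
print(video.json())
```

Because the relationship is stored as an attribute, code that receives a video can reach the publisher directly through `video.author` without a separate lookup table.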

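Requests and retries

There is not much to say about the scraping itself. It relies on the simplest packet-capture technique: capture the traffic with Charles, observe the corresponding API requests, and then simulate them.

This raises a question: do I have to write a separate request method for every API endpoint? Each one would also need headers, a timeout, and so on, which would be tedious. So the request logic can be wrapped in a single method. Here I defined a _fetch method: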

def _fetch(url, **kwargs):
    """
    fetch api response
    :param url: fetch url
    :param kwargs: other requests params
    :return: json of response
    """
    response = requests.get(url, **kwargs)
    if response.status_code != 200:
        raise requests.ConnectionError('Expected status code 200, but got {}'.format(response.status_code))
    return response.json()

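This method takes one required parameter, url, and leaves all other options as kwargs, so anything can be passed in and will be forwarded to the requests call. It also checks the status code and raises an exception if it is not 200; on success it returns the parsed JSON of the response.

With this method defined, other calling methods only need to call _fetch and no longer have to care about error handling or the return type.

So what if a request fails? The conventional approach is to wrap another method around it that counts the failed _fetch calls and calls _fetch again to retry. But there is a handier library for this called retrying, which lets us implement retries with a decorator.

For example, I can decorate the _fetch method with the retry decorator like this: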

from retrying import retry

@retry(stop_max_attempt_number=retry_max_number, wait_random_min=retry_min_random_wait,
           wait_random_max=retry_max_random_wait, retry_on_exception=need_retry)
def _fetch(url, **kwargs):
    pass

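Four of the decorator's parameters are used here:

stop_max_attempt_number: the maximum number of attempts; once it is reached, no further retries are made
wait_random_min: the minimum of the random wait before the next retry, in milliseconds
wait_random_max: the maximum of the random wait before the next retry, in milliseconds
retry_on_exception: a function that decides which exceptions trigger a retry

The retry_on_exception parameter here is given a function called need_retry, defined as follows: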

def need_retry(exception):
    """
    need to retry
    :param exception:
    :return:
    """
    result = isinstance(exception, (requests.ConnectionError, requests.ReadTimeout))
    if result:
        print('Exception', type(exception), 'occurred, retrying...')
    return result

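This checks whether the exception is a requests ConnectionError or ReadTimeout. If it is, need_retry returns True and the call is retried; otherwise no retry is made.

So this gives us a wrapped request with automatic retries. Isn't that very Pythonic?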
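To make the mechanism concrete, here is a hand-rolled sketch of what such a retry decorator does, written with the standard library only. It is not the retrying library's implementation: `simple_retry`, `flaky_fetch`, and their parameter names are hypothetical, and the waits here are in seconds rather than retrying's milliseconds.

```python
import functools
import random
import time


def simple_retry(max_attempts, wait_min, wait_max, should_retry):
    """Hypothetical stand-in showing the idea behind a retry decorator."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up if the exception is not retryable
                    # or if this was the last allowed attempt.
                    if not should_retry(exc) or attempt == max_attempts:
                        raise
                    # Wait a random time (in seconds) before the next attempt.
                    time.sleep(random.uniform(wait_min, wait_max))
        return wrapper
    return decorator


calls = []


@simple_retry(max_attempts=3, wait_min=0.0, wait_max=0.01,
              should_retry=lambda exc: isinstance(exc, ConnectionError))
def flaky_fetch():
    # Fails twice with a retryable error, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError('temporary failure')
    return {'status': 'ok'}


result = flaky_fetch()
print(result, 'after', len(calls), 'attempts')
```

The predicate plays the same role as need_retry: it separates transient failures worth retrying from errors that should surface immediately.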

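Downloader design

To download videos, we need a downloader that takes the video links that have already been scraped. Its input is batch after batch of video links; the downloader receives these links, downloads them, and saves the videos to the right location. It can also perform some information-storage operations.

The design has two requirements: fast downloads and strong extensibility. We design for each in turn:
Fast downloads: to download quickly, we can either use multiple threads or processes, or download asynchronously. For network-bound work like this, the latter clearly has the advantage.
Strong extensibility: the downloader must be able to download audio and video and also support storage such as a database. To decouple these, video download, audio download, and database storage are split out as independent features, and the downloader is only responsible for the main logic of processing and dispatching the video links.

To achieve fast downloads we can use the aiohttp library. Asynchronous downloading also should not fetch too much at once, or the network load fluctuates too much, so we download in batches to avoid a flood of simultaneous requests and network congestion. The main download function is as follows: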

def download(self, inputs):
    """
    download video or video lists
    :param inputs: a generator, a list, or a single video object
    :return:
    """
    if isinstance(inputs, types.GeneratorType):
        temps = []
        for result in inputs:
            print('Processing', result, '...')
            temps.append(result)
            if len(temps) == self.batch:
                self.process_items(temps)
                temps = []
        # process the final partial batch, if any, so no items are dropped
        if temps:
            self.process_items(temps)
    else:
        inputs = inputs if isinstance(inputs, list) else [inputs]
        self.process_items(inputs)

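This download method accepts several input types: a generator, a single video object, or a list of video objects. It then calls the process_items method to download asynchronously, implemented as follows: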

def process_items(self, objs):
    """
    process items
    :param objs: objs
    :return:
    """
    # define progress bar
    with tqdm(total=len(objs)) as self.bar:
        # init event loop
        loop = asyncio.get_event_loop()
        # get num of batches
        total_step = int(math.ceil(len(objs) / self.batch))
        # for every batch
        for step in range(total_step):
            start, end = step * self.batch, (step + 1) * self.batch
            print('Processing %d-%d of files' % (start + 1, end))
            # get batch of objs
            objs_batch = objs[start: end]
            # define tasks and run loop
            tasks = [asyncio.ensure_future(self.process_item(obj)) for obj in objs_batch]
            for task in tasks:
                task.add_done_callback(self.update_progress)
            loop.run_until_complete(asyncio.wait(tasks))

This uses asyncio for asynchronous processing and handles the video links in batches to keep traffic steady. It also uses tqdm to display a progress bar.
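The batching idea can be sketched on its own with a simulated download. This is a simplified illustration rather than the library's code: it uses `asyncio.gather` instead of `ensure_future` plus `asyncio.wait`, it leaves out the progress bar, and `fake_download` and `run_in_batches` are hypothetical names.

```python
import asyncio


async def fake_download(item):
    # Stand-in for a real download; yields control like network I/O would.
    await asyncio.sleep(0)
    return 'done:%s' % item


async def run_in_batches(items, batch_size):
    """Run at most batch_size coroutines concurrently, one batch at a time."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # gather runs the whole batch concurrently and keeps input order
        results.extend(await asyncio.gather(*(fake_download(i) for i in batch)))
    return results


results = asyncio.run(run_in_batches(list(range(5)), batch_size=2))
print(results)
```

Each batch finishes before the next one starts, so the number of requests in flight never exceeds the batch size.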

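As we can see, the method that actually handles the download is process_item. It calls the video download, audio download, and database storage components to do the work. Because we use asyncio for asynchronous processing, process_item must also be a method that supports asynchronous processing, defined as follows: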

async def process_item(self, obj):
    """
    process item
    :param obj: single obj
    :return:
    """
    if isinstance(obj, Video):
        print('Processing', obj, '...')
        for handler in self.handlers:
            if isinstance(handler, Handler):
                await handler.process(obj)

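Here we can see that the real processing logic lives in the individual handlers. Each separate feature is extracted and defined as a Handler, which gives good decoupling: to add or turn off a feature, we only need to configure different Handlers, without changing the code. This is a decoupling idea from design patterns, similar to the factory pattern.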
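A small self-contained sketch shows how this handler list behaves. The classes below are hypothetical stand-ins, not the library's real handlers; they only print or collect the objects they receive.

```python
import asyncio


class Handler:
    """Hypothetical base handler; subclasses override process."""
    async def process(self, obj):
        raise NotImplementedError


class PrintHandler(Handler):
    async def process(self, obj):
        print('Handling', obj)


class CollectHandler(Handler):
    def __init__(self):
        self.seen = []

    async def process(self, obj):
        self.seen.append(obj)


async def process_item(obj, handlers):
    # Each enabled handler gets a chance to process the object in turn.
    for handler in handlers:
        if isinstance(handler, Handler):
            await handler.process(obj)


collector = CollectHandler()
# Enabling or disabling a feature only means changing this list.
asyncio.run(process_item('video-1', [PrintHandler(), collector]))
asyncio.run(process_item('video-2', [collector]))
```

This mirrors the example script earlier in the article, where leaving the MongoDB handler out of the list was enough to switch storage off.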

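Handler design

As just mentioned, a Handler is responsible for one specific feature, such as video download, audio download, or data storage, so we can define them as different Handlers. Since video download and audio download are both file downloads, we can use inheritance and design a file-download Handler, defined as follows: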

from os.path import join, exists
from os import makedirs
from douyin.handlers import Handler
from douyin.utils.type import mime_to_ext
import aiohttp


class FileHandler(Handler):

    def __init__(self, folder):
        """
        init save folder
        :param folder:
        """
        super().__init__()
        self.folder = folder
        if not exists(self.folder):
            makedirs(self.folder)

    async def _process(self, obj, **kwargs):
        """
        download to file
        :param obj: object whose play_url is downloaded and whose id names the file
        :param kwargs: extra arguments passed to session.get
        :return:
        """
        print('Downloading', obj, '...')
        kwargs.update({'ssl': False})
        kwargs.update({'timeout': 10})
        async with aiohttp.ClientSession() as session:
            async with session.get(obj.play_url, **kwargs) as response:
                if response.status == 200:
                    extension = mime_to_ext(response.headers.get('Content-Type'))
                    full_path = join(self.folder, '%s.%s' % (obj.id, extension))
                    with open(full_path, 'wb') as f:
                        f.write(await response.content.read())
                    print('Downloaded file to', full_path)
                else:
                    print('Cannot download %s, response status %s' % (obj.id, response.status))

    async def process(self, obj, **kwargs):
        """
        process obj
        :param obj:
        :param kwargs:
        :return:
        """
        return await self._process(obj, **kwargs)

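We use aiohttp again here because the downloader needs its Handlers to support asynchronous operation. The download simply requests the file link, determines the file type from the response's Content-Type header, and saves the file.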
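The article does not show how mime_to_ext works, so the following is only a guess at the general shape of such a helper. The function name `mime_to_ext_sketch`, the mapping table, and the `bin` fallback are all assumptions for illustration; the library's real table may differ.

```python
# Hypothetical mapping; the library's real table may differ.
MIME_EXTENSIONS = {
    'video/mp4': 'mp4',
    'audio/mpeg': 'mp3',
    'audio/mp4': 'm4a',
}


def mime_to_ext_sketch(content_type, default='bin'):
    """Map a Content-Type header value to a file extension without the dot."""
    if not content_type:
        return default
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(';', 1)[0].strip().lower()
    return MIME_EXTENSIONS.get(mime, default)


print(mime_to_ext_sketch('video/mp4'))
print(mime_to_ext_sketch('Audio/MPEG; charset=binary'))
print(mime_to_ext_sketch(None))
```

Handling a missing header matters because `response.headers.get('Content-Type')` returns None when the server omits it.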

The video download Handler only needs to inherit from this FileHandler:

from douyin.handlers import FileHandler
from douyin.structures import Video

class VideoFileHandler(FileHandler):

    async def process(self, obj, **kwargs):
        """
        process video obj
        :param obj:
        :param kwargs:
        :return:
        """
        if isinstance(obj, Video):
            return await self._process(obj, **kwargs)

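This just adds a type check to keep the data types consistent. Audio download works the same way.

Asynchronous MongoDB storage

The sections above covered the Handlers for video and audio. One more Handler remains: MongoDB storage. We would normally use PyMongo for storage, but to speed things up we need asynchronous operation, and there is a library for asynchronous MongoDB storage called Motor. Its usage is much the same, except that the MongoDB connection object is no longer PyMongo's MongoClient but Motor's AsyncIOMotorClient; the rest of the configuration is largely the same.

Saving uses the update_one method with the upsert parameter enabled, so an existing record is updated and a missing one is inserted, which keeps the data free of duplicates.

The complete MongoDB storage Handler is defined as follows: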

from douyin.handlers import Handler
from motor.motor_asyncio import AsyncIOMotorClient
from douyin.structures import *


class MongoHandler(Handler):

    def __init__(self, conn_uri=None, db='douyin'):
        """
        init mongodb client
        :param conn_uri: MongoDB connection URI, defaults to 'localhost'
        :param db: database name
        """
        super().__init__()
        if not conn_uri:
            conn_uri = 'localhost'
        self.client = AsyncIOMotorClient(conn_uri)
        self.db = self.client[db]

    async def process(self, obj, **kwargs):
        """
        save obj to mongodb
        :param obj: Video or Music object to save
        :param kwargs:
        :return:
        """
        collection_name = 'default'
        if isinstance(obj, Video):
            collection_name = 'videos'
        elif isinstance(obj, Music):
            collection_name = 'musics'
        collection = self.db[collection_name]
        # save to mongodb
        print('Saving', obj, 'to mongodb...')
        if await collection.update_one({'id': obj.id}, {'$set': obj.json()}, upsert=True):
            print('Saved', obj, 'to mongodb successfully')
        else:
            print('Error occurred while saving', obj)

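As you can see, the class creates an AsyncIOMotorClient object and exposes the conn_uri connection string and the db database name, so the MongoDB address and database name can be specified when instantiating MongoHandler.

The process method follows the same pattern as before; here await is applied to the update_one call, which completes the asynchronous MongoDB save.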
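The upsert behaviour can be illustrated without a running database. The class below is a hypothetical in-memory stand-in that imitates only the small part of `update_one` used here (a filter on `id` and a `$set` update); it is not Motor and does not reproduce real MongoDB semantics beyond that.

```python
import asyncio


class InMemoryCollection:
    """Hypothetical stand-in for a MongoDB collection, keyed by 'id'."""

    def __init__(self):
        self.docs = {}

    async def update_one(self, query, update, upsert=False):
        key = query['id']
        if key not in self.docs:
            if not upsert:
                return False
            # Upsert: create the document from the query fields.
            self.docs[key] = dict(query)
        # Apply the $set fields on top of the stored document.
        self.docs[key].update(update.get('$set', {}))
        return True


async def save_twice():
    collection = InMemoryCollection()
    await collection.update_one({'id': 'v1'}, {'$set': {'like_count': 1}}, upsert=True)
    # Saving the same id again updates the document instead of duplicating it.
    await collection.update_one({'id': 'v1'}, {'$set': {'like_count': 5}}, upsert=True)
    return collection.docs


docs = asyncio.run(save_twice())
print(docs)
```

Running the scraper repeatedly therefore refreshes counts such as likes and comments on records that already exist rather than piling up copies.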

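That covers all the key parts of the douyin library. This walkthrough should help you understand how the core of the library is implemented, and it may also be useful as an illustration of design patterns, object-oriented thinking, and some practical libraries.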
