千千音樂 (Qianqian Music) Spider Project

GitHub: https://github.com/Norni/spider_project/tree/master/qianqianyinyue

1. Project Overview

  1.1 Purpose

  1.2 Development Environment

2. Project Design

  2.1 Flow Design

  2.2 Workflow Overview

3. Model Design

  3.1 Artist Model

  3.2 Song Model

4. settings.py

5. utils Module

  5.1 random_useragent.py

  5.2 log.py

  5.3 mongo_pool.py

6. core Module

  6.0 URL Analysis

  6.1 artist_total.py

  6.2 author_info_update.py

  6.3 song_total.py

  6.4 song_info_total_update.py

7. bin Module

  7.1 run.py

8. Summary


1. Project Overview

1.1 Purpose

  • Enter an artist's name at the program entry point to retrieve detailed information on every one of that artist's songs

  • Preview

    python run.py
    # The results are written to the file song_list.txt, in the same directory as run.py
    # Note: the download links carry a timestamp and expire the next day
    {
    "song_name": "演員",
    "song_id": "242078437",
    "song_href": "http://music.taihe.com/song/242078437",
    "author_name": "薛之謙",
    "author_tinguid": "2517",
    "author_url": "http://music.taihe.com/artist/2517",
    "song_album": "初學者",
    "song_publish_date": "2016-07-18",
    "song_publish_company": "海蝶(天津)文化傳播有限公司",
    "download_url": "http://audio04.dmhmusic.com/71_53_T10040589078_128_4_1_0_sdk-cpm/cn/0206/M00/90/77/ChR47F1_nqiAfD0hAD_MGBybIdk026.mp3?xcode=aefbd591c37efa806a6c2b65cae142a973974a6",
    "song_lrc_link": "http://qukufile2.qianqian.com/data2/lrc/bed1fcb36f51259eefab8ba6d95f524f/672457403/672457403.lrc",
    "mv_download_url_info": {},
    "song_mv_id": null
    }
    ********************
    {
    "song_name": "你還要我怎樣",
    "song_id": "100575177",
    "song_href": "http://music.taihe.com/song/100575177",
    "author_name": "薛之謙",
    "author_tinguid": "2517",
    "author_url": "http://music.taihe.com/artist/2517",
    "song_album": "意外",
    "song_publish_date": "2013-11-11",
    "song_publish_company": "華宇世博音樂文化(北京)有限公司",
    "download_url": "http://audio04.dmhmusic.com/71_53_T10038986648_128_4_1_0_sdk-cpm/cn/0208/M00/E5/61/ChR46119DrGAW4d4AEvErRDwLyg867.mp3?xcode=6b725779223e2de86a6bbb3ac7a1959393d994b",
    "song_lrc_link": "http://qukufile2.qianqian.com/data2/lrc/0a6ef3d9a86dd4f1aa782a114d9f288e/672463341/672463341.lrc",
    "mv_download_url_info": {},
    "song_mv_id": null
    }
    ......

1.2 Development Environment

  • Packet-capture platform: Windows

  • Packet-capture tool: Chrome

  • Development platform: Linux

  • IDE: PyCharm

  • Tech stack

    • HTTP requests: requests

    • Data storage: MongoDB

    • Data processing: lxml, re, json, pymongo, logging

2. Project Design

2.1 Flow Design

2.2 Workflow Overview

  1. Project directory

  2. Workflow

    • bin module

      • run.py is the program entry point; it drives the spider modules in core.spiders

      • log.log holds the log output

      • song_list.txt holds the exported song data, as the user-facing result

    • core module

      • spiders contains the spider modules

        • artist_total.py fetches every artist's info and stores it in MongoDB

        • author_info_update.py updates artist details such as birthday and bio

        • song_total.py fetches all songs for a single artist

        • song_info_total_update.py updates the artist's song info, including download links and MV links (if any)

    • utils module

      • log.py provides logging

      • mongo_pool.py provides the database operations

      • random_useragent.py provides a random User-Agent

    • settings.py

      Provides the global configuration

3. Model Design

3.1 Artist Model

  • Screenshot

  • Field reference

    • author_name: artist name

    • author_url: artist homepage

    • author_tinguid: artist id

    • author_birthday: birthday

    • author_constellation: zodiac sign

    • author_from_area: nationality

    • author_gender: gender

    • author_hot: popularity

    • author_image_url: artist image url

    • author_intro: bio

    • author_share_num: number of times the artist page was shared

    • author_songs_total: total number of songs

    • author_stature: height

    • author_weight: weight
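To make the schema concrete, here is a hypothetical sketch of one document in the artist collection. Every value below is invented purely for illustration; only the field names follow the model above.

```python
# Hypothetical example of one document in the "artist" collection;
# all values are made up to illustrate the schema.
artist_doc = {
    "author_name": "example_artist",
    "author_url": "http://music.taihe.com/artist/0000",
    "author_tinguid": "0000",
    "author_birthday": "1990-01-01",
    "author_constellation": "Capricorn",
    "author_from_area": "example_country",
    "author_gender": "male",
    "author_hot": 0,
    "author_image_url": "http://example.com/avatar.jpg",
    "author_intro": "example intro",
    "author_share_num": 0,
    "author_songs_total": 0,
    "author_stature": "",
    "author_weight": "",
}

# author_tinguid is just the last path segment of author_url,
# which is how artist_total.py derives it
assert artist_doc["author_url"].rsplit("/", 1)[-1] == artist_doc["author_tinguid"]
```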

3.2 Song Model

  • Screenshot

  • Field reference

    • author_info: the artist's info

      • author_name: artist name

      • author_tinguid: artist id

      • author_url: artist homepage url

      • Example

    • song_list: all of the artist's songs

      • song_name: song title

      • song_id: song id

      • song_href: song page

      • author_name: artist name

      • author_tinguid: artist id

      • author_url: artist homepage

      • song_album: album

      • song_publish_date: release date

      • song_publish_company: publisher

      • download_url: song download url

      • song_lrc_link: lyrics url

      • mv_download_url_info: MV download info

      • song_mv_id: MV id

      • Example
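The nesting matters for queries: the mongo_pool look-ups elsewhere in the project use MongoDB dot notation such as `author_info.author_tinguid`. Below is a hypothetical sketch of one document in the songs collection (all values invented for illustration):

```python
# Hypothetical sketch of one document in the "songs" collection:
# one document per artist, with that artist's songs nested in song_list.
song_doc = {
    "author_info": {
        "author_name": "example_artist",
        "author_tinguid": "0000",
        "author_url": "http://music.taihe.com/artist/0000",
    },
    "song_list": [
        {
            "song_name": "example_song",
            "song_id": "1111",
            "song_href": "http://music.taihe.com/song/1111",
            "author_name": "example_artist",
            "author_tinguid": "0000",
            "author_url": "http://music.taihe.com/artist/0000",
        },
    ],
}

# Nested fields are queried with dot notation, e.g. the conditions
# passed to MongoPool.find_one:
conditions = {"author_info.author_tinguid": song_doc["author_info"]["author_tinguid"]}
```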

4. settings.py

  • Notes

    • Provides the global configuration

  • Code

import logging

# MongoDB connection URL
MONGO_URL = "mongodb://127.0.0.1:27017"

# Logging configuration
LOG_FMT = "%(asctime)s %(filename)s [line:%(lineno)d] %(levelname)s:%(message)s"
LOG_DATE_FMT = "%Y-%m-%d %H:%M:%S"
LOG_FILENAME = 'log.log'
LOG_LEVEL = logging.DEBUG

# User-Agent pool
USER_AGENT_LIST = [
"Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; rv:11.0) like Gecko",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60",
"Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36"
]

5. utils Module

5.1 random_useragent.py

  • Notes

    • Returns a random User-Agent

    • The method is wrapped with @property so it reads like an attribute; note that the module-level random_user_agent is evaluated once at import, so a single run shares one User-Agent

  • Code

    import random
    from qianqianyinyue.settings import USER_AGENT_LIST


    class RandomUserAgent(object):
        def __init__(self):
            self.user_agent_list = USER_AGENT_LIST

        @property
        def get_random_user_agent(self):
            return random.choice(self.user_agent_list)


    # chosen once at import time: every consumer of this name sees the same value
    random_user_agent = RandomUserAgent().get_random_user_agent

    if __name__ == "__main__":
        # obj = RandomUserAgent()
        data = {
            "a": random_user_agent
        }
        print(data)

5.2 log.py

  • Notes

    • File handler

      • keeps logs on disk for later review

    • Console handler

      • surfaces data changes in real time

  • Code

    import logging
    import sys
    from qianqianyinyue.settings import LOG_FMT, LOG_LEVEL, LOG_DATE_FMT, LOG_FILENAME


    class Logger(object):
        def __init__(self):
            self._logger = logging.getLogger()
            self.formatter = logging.Formatter(fmt=LOG_FMT, datefmt=LOG_DATE_FMT)
            self._logger.addHandler(hdlr=self.get_file_handler(filename=LOG_FILENAME))
            self._logger.addHandler(hdlr=self.get_console_handler())
            self._logger.setLevel(level=LOG_LEVEL)

        def get_file_handler(self, filename):
            file_handler = logging.FileHandler(filename=filename, encoding='utf8')
            file_handler.setFormatter(fmt=self.formatter)
            return file_handler

        def get_console_handler(self):
            console_handler = logging.StreamHandler(stream=sys.stdout)
            console_handler.setFormatter(fmt=self.formatter)
            return console_handler

        @property
        def logger(self):
            return self._logger


    logger = Logger().logger

    if __name__ == "__main__":
        logger.info("test message")

5.3 mongo_pool.py

  • Notes

    • The collection name is passed in as a parameter, so each instance can target the relevant data

    • Implements

      • insert, with logging

      • update, with logging

        • conditional update

      • single-document lookup

        • finds one document matching the given conditions

      • fetch-all

        • returned as a generator

  • Code

    from pymongo import MongoClient
    from qianqianyinyue.settings import MONGO_URL
    from qianqianyinyue.utils.log import logger


    class MongoPool(object):

        def __init__(self, collection):
            self.mongo_client = MongoClient(MONGO_URL)
            self.collections = self.mongo_client['qianqianyinyue'][collection]

        def __del__(self):
            self.mongo_client.close()

        def insert_one(self, document):
            self.collections.insert_one(document)
            logger.info("inserted new document: {}".format(document))

        def update_one(self, conditions, document):
            self.collections.update_one(filter=conditions, update={"$set": document})
            logger.info("updated document matching ->{}<-".format(conditions))

        def find_one(self, conditions):
            collections = self.collections.find(filter=conditions)
            for item in collections:
                item.pop('_id')
                return item  # first match; implicitly None when nothing matches

        def find_all(self):
            collections = self.collections.find()
            for item in collections:
                item.pop('_id')
                yield item

6. core Module

6.0 URL Analysis

  • Artist index page

    • url: http://music.taihe.com/artist

      • returns an HTML string

      • Request

        • GET

        • User-Agent

      • Target data

        • author_name

        • author_url

        • author_tinguid

  • Artist homepage

    • url: http://music.taihe.com/artist/2517

      • Request

        • GET

        • Referer: http://music.taihe.com/artist

      • Target data

        • song_name

        • song_url

        • song_id

        • song_mv_url

        • song_mv_id

    • url: http://music.taihe.com/data/user/getsongs?start=15&size=15&ting_uid=2517

      • Notes

        • This is the AJAX response for the second page: start is (current page − 1) × 15, size is a fixed value, and ting_uid is the artist's author_tinguid

        • The payload sits in the data field as a markup string; extract it with re or lxml

        • How to find the maximum page count? Take the largest page index from the first page of the artist's homepage, then use it to build the next-page requests

      • Request

        • GET

        • Referer: http://music.taihe.com/artist/2517

        • User-Agent

    • url: http://music.taihe.com/data/tingapi/v1/restserver/ting?method=baidu.ting.artist.getInfo&from=web&tinguid=2517

      • Notes

        • Each artist's author_tinguid maps to one JSON document

        • Sending too many requests at once gets the IP banned

          • Adding time.sleep() lets you collect more data

          • I kept the delay between 1 and 3 seconds and retrieved more data; if you do not need the data immediately, a longer delay still gets the data while lowering the chance of being noticed by the server

            time.sleep(random.uniform(1, 3))
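The getsongs pagination described above can be sketched as follows. The ting_uid 2517 is taken from the example URL; the helper name is mine:

```python
def build_getsongs_urls(ting_uid, page_count, page_size=15):
    """Build the AJAX urls for pages 2..page_count.

    start = (page - 1) * page_size; page 1 comes from the artist
    homepage itself, so it is skipped here.
    """
    base = "http://music.taihe.com/data/user/getsongs?start={}&size={}&ting_uid={}"
    return [base.format((page - 1) * page_size, page_size, ting_uid)
            for page in range(2, page_count + 1)]

# page 2 -> start=15, page 3 -> start=30
urls = build_getsongs_urls("2517", page_count=3)
```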
  • Song page

    • url: http://music.taihe.com/song/242078437

      • Request

        • User-Agent

      • Target data

        • album

        • release date

        • publisher

    • url: http://musicapi.taihe.com/v1/restserver/ting?method=baidu.ting.song.playAAC&format=jsonp&songid=242078437&from=web

      • Request

        • Referer: http://music.taihe.com/song/242078437

        • User-Agent

      • Target data

        • download link

        • lyrics link

    • url: http://music.taihe.com/mv/601422013

      • Target data

        • the MV id, stored as song_mv_id

    • url: http://musicapi.taihe.com/v1/restserver/ting?method=baidu.ting.mv.playMV&mv_id=XXXX

      • Notes

        • Parse the mv_id out of the source page with a regular expression. Note that the mv_id stored in the database may not match the target mv_id; querying with the stored one may return no data

      • Target data

        • the MV download url
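The re-extraction of the playable mv_id from the MV page can be sketched like this. The regular expression mirrors the one used in song_info_total_update.py; the sample HTML fragment is made up:

```python
import re

def extract_mv_id(mv_html_str):
    """Pull the playable mv_id out of the MV page's /playmv/ link."""
    mv_id = re.findall(r'href="/playmv/(.*?)"', mv_html_str, re.S)
    return mv_id[0] if mv_id else None

# made-up fragment resembling the MV page markup
sample = '<a class="play" href="/playmv/601422013">play</a>'
extract_mv_id(sample)  # -> "601422013"
```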

6.1 artist_total.py

  • Code

    import requests
    from lxml import etree
    from qianqianyinyue.utils.mongo_pool import MongoPool
    from qianqianyinyue.utils.random_useragent import random_user_agent
    from qianqianyinyue.utils.log import logger


    class ArtistTotal(object):
        """
        Run first: fetches every artist in the site's catalogue.
        Fields collected:
            artist id:       database field --->>> "author_tinguid"
            artist homepage: database field --->>> "author_url"
            artist name:     database field --->>> "author_name"
        """
        def __init__(self):
            self.mongo_pool = MongoPool(collection='artist')
            self.url = "http://music.taihe.com/artist"
            self.user_agent = random_user_agent

        def get_response(self, url, user_agent):
            headers = {
                "User-Agent": user_agent
            }
            try:
                response = requests.get(url=url, headers=headers)
                if response.ok:
                    return response.content.decode()
            except Exception as e:
                logger.warning(e)

        def parse_html_str(self, html_str):
            html_str = etree.HTML(html_str)
            b_li_list = html_str.xpath("//ul[@class=\"container\"]/li[position()>1]")
            for b_li in b_li_list:
                s_li = b_li.xpath('./ul/li')
                for li in s_li:
                    item = dict()
                    item["author_name"] = li.xpath('./a/@title')[0] if li.xpath('./a/@title') else None
                    author_url = "http://music.taihe.com" + li.xpath('./a/@href')[0] if li.xpath('./a/@href') else "-1"
                    item["author_url"] = author_url
                    item["author_tinguid"] = author_url.rsplit('/', 1)[-1]
                    yield item

        def insert_to_mongodb(self, documents):
            for document in documents:
                self.mongo_pool.insert_one(document=document)

        def run(self):
            response = self.get_response(url=self.url, user_agent=self.user_agent)
            author_infos = self.parse_html_str(html_str=response)
            self.insert_to_mongodb(documents=author_infos)


    if __name__ == "__main__":
        obj = ArtistTotal()
        obj.run()

6.2 author_info_update.py

  • Code

    import requests
    import json
    import time
    import random
    from qianqianyinyue.utils.mongo_pool import MongoPool
    from qianqianyinyue.utils.random_useragent import random_user_agent
    from qianqianyinyue.utils.log import logger
    from pprint import pprint
    
    
    class AuthorInfoUpdate(object):
        """Update the remaining artist details in the database."""
        def __init__(self):
            self.mongo_pool = MongoPool(collection='artist')
            self.url = "http://music.taihe.com/data/tingapi/v1/restserver/ting?method=baidu.ting.artist.getInfo&from=web&tinguid={}"
            self.user_agent = random_user_agent
    
        def get_response(self, url, user_agent):
            headers = {
                "User-agent": user_agent
            }
            # throttle before each request: too many requests in a short window gets the IP banned
            time.sleep(random.uniform(1, 3))
            try:
                response = requests.get(url=url, headers=headers)
                # print(response.text)
                if response.ok:
                    response = json.loads(response.content.decode())
                    if "name" in response.keys():
                        return response
                    return dict()
            except Exception as e:
                logger.warning(e)
    
        def parse_data(self, data):
            item = dict()
            item['author_tinguid'] = data['ting_uid']
            item['author_name'] = data['name']
            item['author_share_num'] = data['share_num']
            item['author_hot'] = data['hot']
            item['author_from_area'] = data['country']
            item['author_gender'] = "男" if data['gender'] == "0" else "女"  # "0" means male ("男")
            item['author_image_url'] = data['avatar_s1000']
            item["author_intro"] = data['intro']
            item['author_birthday'] = data['birth']
            item['author_constellation'] = data['constellation']
            item['author_stature'] = data['stature']
            item['author_weight'] = data['weight']
            item['author_songs_total'] = data['songs_total']
            return item
    
        def update_data_to_mongodb(self, item):
            conditions = {
                "author_name": item['author_name'],
                'author_tinguid': item['author_tinguid']
            }
            self.mongo_pool.update_one(conditions=conditions, document=item)
    
        def run(self):
            artists = self.mongo_pool.find_all()
            for artist in artists:
                ting_uid = artist['author_tinguid']
                # only fetch artists whose details are still missing
                if "author_intro" not in artist:
                    url = self.url.format(ting_uid)
                    dict_data = self.get_response(url=url, user_agent=self.user_agent)
                    if dict_data:
                        item = self.parse_data(data=dict_data)
                        self.update_data_to_mongodb(item=item)
    
    
    if __name__ == "__main__":
        obj = AuthorInfoUpdate()
        obj.run()
    

6.3 song_total.py

  • Code

    import requests
    from lxml import etree
    import json
    import time
    import random
    from qianqianyinyue.utils.mongo_pool import MongoPool
    from qianqianyinyue.utils.random_useragent import random_user_agent
    from qianqianyinyue.utils.log import logger
    from pprint import pprint
    
    
    class SongTotal(object):
        """根據作者的名字,獲取作者的所有音樂作品"""
    
        def __init__(self):
            self.mongo_pool_ = MongoPool(collection="artist")
            self.mongo_pool = MongoPool(collection="songs")
            self.user_agent = random_user_agent
    
        def get_author_info(self, author_name):
            conditions = {"author_name": author_name}
            author_info = self.mongo_pool_.find_one(conditions)
            if author_info:
                author_info = {
                    "author_name": author_info['author_name'],
                    "author_tinguid": author_info['author_tinguid'],
                    "author_url": author_info['author_url']
                }
                return author_info
            return None
    
        def get_requests(self, url, referer=None):
            headers = {
                "User-Agent": self.user_agent,
                "Referer": referer,
            }
            # brief pause before each request to avoid tripping the rate limit
            time.sleep(random.uniform(0, 2))
            try:
                response = requests.get(url=url, headers=headers)
                if response.ok:
                    html_str = response.content.decode()
                    return html_str
                return None
            except Exception as e:
                logger.warning(e)
    
        def parse_one_li(self, li):
            item = dict()
            song_name = li.xpath('./div[contains(@class,"songlist-title")]/span[@class="songname"]/a[1]/@title')
            item["song_name"] = song_name[0] if song_name else None
            song_href = li.xpath('./div[contains(@class,"songlist-title")]/span[@class="songname"]/a[1]/@href')
            song_id = song_href[0].rsplit('/', 1)[-1] if song_href else None
            item["song_id"] = song_id
            item['song_href'] = "http://music.taihe.com" + song_href[0] if song_href else "-1"
            # take the id from the MV link itself (song_mv_href); note it may still
            # differ from the playable mv_id, which song_info_total_update.py re-extracts
            song_mv_href = li.xpath('./div[contains(@class,"songlist-title")]/span[@class="songname"]/a[2]/@href')
            if song_mv_href:
                song_mv_id = song_mv_href[0].rsplit('/', 1)[-1]
                item["song_mv_id"] = song_mv_id
                item['song_mv_href'] = "http://music.taihe.com" + song_mv_href[0]
            return item
    
        def parse_firstpage_html_str(self, html_str):
            html_str = etree.HTML(html_str)
            page_count = html_str.xpath(
                '//div[@class="list-box song-list-box active"]//div[@class="page_navigator-box"]//div[@class="page-inner"]/a[last()-1]/text()')
            page_count = int(page_count[0]) if page_count else -1
            li_list = html_str.xpath('//div[contains(@class,"song-list-wrap")]/ul/li')
            song_list = list()
            for li in li_list:
                item = self.parse_one_li(li=li)
                song_list.append(item)
            return song_list, page_count
    
        def parse_nextpage_html_str(self, data):
            data = json.loads(data)
            html_str = data["data"]["html"] if data else None
            if html_str:
                html_str = etree.HTML(html_str)
                li_list = html_str.xpath('//div[contains(@class, "song-list")]/ul/li')
                song_list = list()
                for li in li_list:
                    item = self.parse_one_li(li=li)
                    song_list.append(item)
                return song_list
    
        def parse_song_list(self, song_list, author_info):
            # attach the artist info to each song and build the storage structure
            song_ = {
                "author_info": author_info,
                "song_list": [],
            }
            for song in song_list:
                song.update(author_info)
                song_['song_list'].append(song)
            return song_
    
        def construct_url_list(self, ting_uid, page_count):
            base_url = "http://music.taihe.com/data/user/getsongs?start={}&size=15" + "&ting_uid={}".format(
                ting_uid)
            url_list = [base_url.format(i * 15) for i in range(1, page_count)]
            return url_list
    
        def insert_song_to_mongodb(self, song_):
            conditions = {"author_info.author_tinguid": song_["author_info"]["author_tinguid"]}
            song = self.mongo_pool.find_one(conditions)
            if not song:
                self.mongo_pool.insert_one(document=song_)
    
        def run(self, author_name=None):
            author_info = self.get_author_info(author_name)
            if author_info is None:
                print("Artist not found; please check the name and try again")
            else:
                # check whether this artist's songs are already in the database
                song__ = self.mongo_pool.find_one({'author_info.author_tinguid': author_info['author_tinguid']})
                if song__:
                    print("Song data for this artist already exists; no request needed")
                else:
                    firstpage_html_str = self.get_requests(url=author_info["author_url"])
                    if firstpage_html_str:
                        # parse the songs on the first page
                        song_list, page_count = self.parse_firstpage_html_str(html_str=firstpage_html_str)
                        # later pages come from an AJAX endpoint, so build the next-page urls
                        url_list = self.construct_url_list(author_info['author_tinguid'], page_count)
                        for url in url_list:
                            # fetch one page
                            nextpage_html_data = self.get_requests(url=url, referer=author_info["author_url"])
                            song_list_next = self.parse_nextpage_html_str(data=nextpage_html_data)
                            song_list.extend(song_list_next)
                        song_ = self.parse_song_list(song_list, author_info)
                        self.insert_song_to_mongodb(song_)
    
    
    if __name__ == "__main__":
        obj = SongTotal()
        obj.run(author_name="許嵩")
    

6.4 song_info_total_update.py

  • Code

    import requests
    import time
    import random
    import json
    import re
    from lxml import etree
    from qianqianyinyue.utils.mongo_pool import MongoPool
    from qianqianyinyue.utils.log import logger
    from qianqianyinyue.utils.random_useragent import random_user_agent
    from pprint import pprint
    
    
    class SongInfoTotalUpdate(object):
        """
        Update all of an artist's song details, given the artist's name.
        Fields added:
            release date
            publisher
            song download link and related fields
            MV download link and related fields
        """
        def __init__(self):
            self.mongo_pool = MongoPool(collection="songs")
            self.user_agent = random_user_agent
    
        def get_song_list(self, author_name):
            conditions = {"author_info.author_name": author_name}
            song_info = self.mongo_pool.find_one(conditions)
            song_list = song_info['song_list']
            return song_list
    
        def get_requests(self, url, referer=None):
            headers = {
                "User-Agent": self.user_agent,
                "Referer": referer,
            }
            # brief pause before each request to avoid tripping the rate limit
            time.sleep(random.uniform(0, 2))
            try:
                response = requests.get(url=url, headers=headers)
                if response.ok:
                    html_str = response.content.decode()
                    return html_str
                return None
            except Exception as e:
                logger.warning(e)
    
        def parse_html_str(self, html_str):
            html_str = etree.HTML(html_str)
            song_album = html_str.xpath('//div[@class="song-info-box fl"]/p[contains(@class, "album")]/a/text()')
            song_album = song_album[0] if song_album else None
            song_publish = html_str.xpath('//div[@class="song-info-box fl"]/p[contains(@class,"publish")]/text()')
            song_publish_date = song_publish[0].split(r":", 1)[-1].strip() if song_publish else None
            song_company = html_str.xpath('//div[@class="song-info-box fl"]/p[contains(@class,"company")]/text()')
            song_publish_company = song_company[0].split(r":", 1)[-1].strip() if song_company else None
            return song_album, song_publish_date, song_publish_company
    
        def parse_data(self, data):
            data = json.loads(data)
            if "bitrate" in data.keys():
                download_url = data['bitrate']['file_link']
                song_lrc_link = data['songinfo']['lrclink']
                return download_url, song_lrc_link
            return None, None
    
        def parse_mv_data(self, data):
            data = json.loads(data)
            mv_download_info = list()
            if "result" in data.keys():
                if "files" in data['result'].keys():
                    file_info = data['result']['files']
                    for k, v in file_info.items():
                        item = dict()
                        item['definition_name'] = v["definition_name"]
                        item['file_link'] = v['file_link']
                        mv_download_info.append(item)
            return mv_download_info
    
        def parse_mv_html_str(self, mv_html_str):
            mv_id = re.findall(r'href="/playmv/(.*?)"', mv_html_str, re.S)
            return mv_id[0] if mv_id else None
    
        def update_songs_info(self, info):
            conditions = {"author_info.author_tinguid": info["author_tinguid"]}
            songs_info = self.mongo_pool.find_one(conditions)
            for song_ in songs_info['song_list']:
                if info["song_id"] == song_['song_id']:
                    song_.update(info)
            self.mongo_pool.update_one(conditions, songs_info)
    
        def get_content(self, song):
            # default to None so these fields exist even when the song-page request fails
            song_album = song_publish_date = song_publish_company = None
            mv_download_url_info = dict()
            song_href = song['song_href']
            html_str = self.get_requests(url=song_href)
            if html_str:
                song_album, song_publish_date, song_publish_company = self.parse_html_str(html_str)
            url = "http://musicapi.taihe.com/v1/restserver/ting?method=baidu.ting.song.playAAC&format=jsonp&songid={}&from=web".format(
                song['song_id'])
            data = self.get_requests(url=url, referer=song['song_href'])
            # guard against a failed request before json.loads runs inside parse_data
            download_url, song_lrc_link = self.parse_data(data) if data else (None, None)
            if song.get("song_mv_href"):
                mv_html_str = self.get_requests(url=song['song_mv_href'])
                mv_id = self.parse_mv_html_str(mv_html_str)
                if mv_id:
                    song["mv_id"] = mv_id
                    mv_data_url = "http://musicapi.taihe.com/v1/restserver/ting?method=baidu.ting.mv.playMV&mv_id={}".format(
                        song['mv_id'])
                    mv_data = self.get_requests(url=mv_data_url, referer=song["song_mv_href"])
                    mv_download_url_info = self.parse_mv_data(mv_data)
            info = {
                "song_album": song_album,
                "song_publish_date": song_publish_date,
                "song_publish_company": song_publish_company,
                "download_url": download_url,
                "song_lrc_link": song_lrc_link,
                "mv_download_url_info": mv_download_url_info,
                'song_id': song['song_id'],
                "song_mv_id": song['song_mv_id'] if song.get('song_mv_id') else None,
                "author_tinguid": song["author_tinguid"],
            }
            return info
    
        def run(self, author_name=None):
            if author_name:
                song_list = self.get_song_list(author_name)
                for song in song_list:
                    info = self.get_content(song)
                    self.update_songs_info(info)
            else:
                print("Please enter an artist name")
    
    
    if __name__ == "__main__":
        obj = SongInfoTotalUpdate()
        obj.run(author_name="許嵩")
    

7. bin Module

The entry module; it wires the functional modules together.

7.1 run.py

  • Code

    import json
    import os
    from qianqianyinyue.utils.mongo_pool import MongoPool
    # from qianqianyinyue.utils.log import logger
    from qianqianyinyue.core.spiders.artist_total import ArtistTotal
    from qianqianyinyue.core.spiders.author_info_update import AuthorInfoUpdate
    from qianqianyinyue.core.spiders.song_total import SongTotal
    from qianqianyinyue.core.spiders.song_info_total_update import SongInfoTotalUpdate
    
    
    class ProcessConsole():
        def __init__(self):
            self.mongo_pool_ = MongoPool(collection="artist")
            self.mongo_pool = MongoPool(collection="songs")
    
        def _save_info_to_file(self, conditions):
            song_info = self.mongo_pool.find_one(conditions)
            if song_info:
                with open('song_list.txt', 'w+', encoding='utf8') as f:
                    for song in song_info["song_list"]:
                        f.write(json.dumps(song, ensure_ascii=False, indent=2))
                        f.write(os.linesep)
                        f.write('*' * 20)
                        f.write(os.linesep)
                return "File written"
            return "Write failed: data not found"
    
        def _check_artist_data(self):
            # find_all() returns a generator, which is always truthy;
            # materialize it so the emptiness check actually works
            artist = list(self.mongo_pool_.find_all())
            if not artist:
                ArtistTotal().run()
                AuthorInfoUpdate().run()
            return "Data loaded, welcome"
    
        def _check_author_by_author_name(self, author_name):
            conditions = {"author_name": author_name}
            artist = self.mongo_pool_.find_one(conditions)
            if artist:
                conditions = {"author_info.author_name": author_name}
                if self.mongo_pool.find_one(conditions):
                    print("Artist data found; updating, please check song_list.txt shortly")
                    SongInfoTotalUpdate().run(author_name)
                    self._save_info_to_file(conditions)
                else:
                    print("Artist found; downloading song data, please check song_list.txt shortly")
                    SongTotal().run(author_name)
                    SongInfoTotalUpdate().run(author_name)
                    self._save_info_to_file(conditions)
            else:
                print("Artist not found in the database; please check the name and try again")
    
        def run(self):
            print("Welcome to the song lookup system: enter an artist's name to fetch all of that artist's song info")
            print("Loading data... please wait...")
            message = self._check_artist_data()
            print(message)
            author_name = input("Enter the artist name to look up: ")
            self._check_author_by_author_name(author_name)
    
    
    if __name__ == "__main__":
        ProcessConsole().run()

8. Summary

  • Summary

    • Use Chrome to capture and analyse the traffic and locate the data sources

    • The artist index page returns an HTML string; the fields can be extracted with lxml and re

    • The artist detail page yields two kinds of data: one response carries the bio-type info, the other carries the publisher-type info

      • On inspection, the first page of the artist detail page returns an HTML string from which the publisher-type data can be extracted, so I took that data straight from the HTML

      • Clicking "next page" fires an AJAX request; building the matching url returns the data. Note that inside the returned JSON, the data field is a markup string, which can be parsed with lxml before extracting the fields.

      • The extracted artist bio info is written into the corresponding MongoDB collection

      • The extracted song info, after formatting, is stored in the corresponding MongoDB collection

      • Together this yields all of the artist's songs plus the artist's personal info

    • The song detail page returns JSON containing the song's download link and the lyrics link

    • Visiting the MV page via the collected mv_url yields the MV download link

    • Finally, run.py in the bin module handles the user interaction

  • Improvements

    • Add coroutines to cut the crawl time

    • The site bans IPs when crawled too fast, so add a proxy middleware

    • Users can currently only fetch by artist; this could be refined to fetching and updating single songs

      • Option 1: build the request from song_id and update only that part, which is faster

      • Option 2: reuse the current interface and change the database query, at the cost of slower data retrieval

    • Build a web API for a visual front end
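The coroutine improvement could be sketched with asyncio plus a semaphore to cap concurrency. This is a minimal sketch with a simulated fetch; a real version would swap in an async HTTP client in place of the asyncio.sleep stand-in:

```python
import asyncio
import random

async def fetch(url, sem):
    """Simulated fetch: the semaphore caps how many requests run at once,
    which also acts as a crude politeness limit toward the server."""
    async with sem:
        await asyncio.sleep(random.uniform(0.01, 0.05))  # stands in for the real request
        return url, "ok"

async def crawl(urls, max_concurrency=5):
    sem = asyncio.Semaphore(max_concurrency)
    # gather preserves input order, so results line up with urls
    return await asyncio.gather(*(fetch(url, sem) for url in urls))

urls = ["http://music.taihe.com/song/{}".format(i) for i in range(10)]
results = asyncio.run(crawl(urls))
# 10 (url, "ok") pairs, fetched at most 5 at a time
```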