Python Crawler Beginner (2): Crawler Basics, Selenium and PhantomJS

By 阿新 • Published: 2017-07-04


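I. Preface

Some time ago I tried scraping songs from NetEase Cloud Music; this time I plan to scrape song information from QQ Music. NetEase Cloud Music shows its song list inside an iframe, whose page elements can be reached with Selenium, whereas QQ Music loads its data asynchronously. That is a different approach and the mainstream way of loading pages today. It is somewhat harder to scrape, but it is also a good challenge for me.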

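II. Scraping QQ Music singles with Python

A video I watched earlier on imooc (慕課網) explained the usual steps for writing a crawler very well, so we follow them here.

Crawler steps: define the target (Baidu Baike, apparently the example used in the video) -> analyze the target (strategy: URL format (scope), data format, page encoding) -> write the code -> run the crawler

1. Define the target

First we need to be clear about the target: this time it is the singles of the singer Andy Lau (劉德華) on QQ Music.

2. Analyze the target

Song link: https://y.qq.com/n/yqq/singer/003aQYLo2x8izP.html#tab=song&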

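The screenshot on the left shows that the singles are listed with pagination, 30 per page and 30 pages in total. Clicking a page number or the ">" at the far right moves to the next page, and the browser sends an asynchronous ajax request to the server. The request link carries the parameters begin and num, which are the index of the first song (the screenshot shows page 2, so begin is 30) and the number of songs returned per page (30). The server responds with the song information as JSON wrapped in a callback, i.e. JSONP (MusicJsonCallbacksinger_track({"code":0,"data":{"list":[{"Flisten_count1":......]})). If you only want the song information, you can build the request link yourself and parse the returned JSON directly. I do not take that route here. Instead I use Selenium from Python: after fetching and parsing one page of singles, click ">" to go to the next page and keep parsing until every single has been recorded. Finally, request each collected link to get the detailed information for the singles.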
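For readers who prefer the direct route, the following is a minimal sketch of the two pieces it involves: computing begin and num for a given page, and stripping the MusicJsonCallbacksinger_track(...) wrapper before decoding the JSON. The helper names page_params and parse_jsonp and the sample payload are assumptions made up for illustration. The real response contains many more fields, and the request URL is deliberately not reproduced here.

```python
import json
import re

PAGE_SIZE = 30  # the article reports 30 singles per page


def page_params(page_index, page_size=PAGE_SIZE):
    """Return the begin/num query parameters for a zero-based page index."""
    return {'begin': page_index * page_size, 'num': page_size}


def parse_jsonp(text):
    """Strip a JSONP wrapper such as callback({...}) and decode the JSON inside."""
    match = re.match(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', text, re.DOTALL)
    if match is None:
        raise ValueError('not a JSONP response')
    return json.loads(match.group(1))


# Hypothetical, heavily shortened payload: only the callback name and the
# keys "code", "data", "list" and "Flisten_count1" appear in the article.
sample = ('MusicJsonCallbacksinger_track('
          '{"code":0,"data":{"list":[{"Flisten_count1":0}]}})')
result = parse_jsonp(sample)

print(page_params(1))               # parameters for the second page
print(len(result['data']['list']))  # number of songs in the sample payload
```

The regular expression only removes the outer callback call, so everything between the first "(" and the last ")" is handed to json.loads unchanged.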

[Screenshot: paginated list of singles and the ajax request]

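The screenshot on the right shows the page source. All song information sits inside the div with class mod_songlist, under the unordered list ul with class songlist__list. Each li child element represents one single, and the a tag under the element with class songlist__album carries the link that the code below collects; the single's link, name, and duration are read from there.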

[Screenshot: page source of the song list]

3. Write the code

1) Download the page content. Here I use urllib from the Python standard library, wrapped in a download function of my own:

import urllib.error
import urllib.request

# Browser-like User-Agent sent with every request by default
DEFAULT_USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36')


def download(url, user_agent=DEFAULT_USER_AGENT, num_retries=2):
    """Download a page and return its HTML as text, or None on failure."""
    if url is None:
        return None
    print('Downloading:', url)

    headers = {'User-Agent': user_agent}  # set the user agent
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read().decode('utf-8')
    except urllib.error.URLError as e:
        print('Downloading Error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Retry only on 5xx HTTP errors; by default at most 2 retries.
                # Pass user_agent along so num_retries lands in the right parameter.
                return download(url, user_agent, num_retries - 1)
    return html

2) Parse the page content. Here I use the third-party library BeautifulSoup; see the BeautifulSoup API documentation for details.

from bs4 import BeautifulSoup
from selenium.common.exceptions import TimeoutException

# urls is a shared UrlManager instance (see the summary section below).
# next_page() is not shown in this article.


def music_scrapter(html, page_num=0):
    try:
        soup = BeautifulSoup(html, 'html.parser')
        mod_songlist_div = soup.find_all('div', class_='mod_songlist')
        songlist_ul = mod_songlist_div[1].find('ul', class_='songlist__list')
        # Parse the song information in each li
        lis = songlist_ul.find_all('li')
        for li in lis:
            a = li.find('div', class_='songlist__album').find('a')
            music_url = a['href']  # link in the album column of this single
            urls.add_new_url(music_url)  # store the link
        print('total music link num:%s' % len(urls.new_urls))
        next_page(page_num + 1)
    except TimeoutException as err:
        print('Error while parsing the page:', err.args)
        return next_page(page_num + 1)
    return None

def get_music():
    while urls.has_new_url():
        # The try block sits inside the loop so that one bad page is skipped
        # instead of ending the whole crawl.
        try:
            # Open the stored link and fetch the song details
            new_music_url = urls.get_new_url()
            print('urls left: %s' % len(urls.new_urls))
            html_data_info = download(new_music_url)
            # If the download failed, move on to the next link
            if html_data_info is None:
                continue
            soup_data_info = BeautifulSoup(html_data_info, 'html.parser')
            if soup_data_info.find('div', class_='none_txt') is not None:
                print(new_music_url, '   Sorry, this album cannot be viewed for copyright reasons!')
                continue
            mod_songlist_div = soup_data_info.find('div', class_='mod_songlist')
            songlist_ul = mod_songlist_div.find('ul', class_='songlist__list')
            lis = songlist_ul.find_all('li')
            del lis[0]  # remove the first li
            for li in lis:
                a_songname_txt = (li.find('div', class_='songlist__songname')
                                  .find('span', class_='songlist__songname_txt')
                                  .find('a'))
                song_url = a_songname_txt['href']
                if 'https' not in song_url:  # add the scheme if the link lacks it
                    song_url = 'https:' + song_url
                music_info = {
                    'song_name': a_songname_txt['title'],
                    'song_url': song_url,
                    'singer_name': li.find('div', class_='songlist__artist').find('a').get_text(),
                    'song_time': li.find('div', class_='songlist__time').get_text(),
                }
                collect_data(music_info)
        except Exception as err:  # on a parsing error, skip this link
            print('Downloading or parse music information error continue:', err.args)

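4. Run the crawler

The crawler is up and running. It goes through the pages one by one, collects the album links into a set, and finally get_music() extracts each single's name, link, singer name, and duration and saves them to an Excel file.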

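III. Summary of scraping QQ Music singles with Python

1. The singles are paginated, and switching to the next page triggers an asynchronous ajax request that fetches JSON data from the server and renders it into the page. The link in the browser's address bar does not change, so the next page cannot be requested by modifying the page URL. At first I considered simulating the ajax requests with Python's urllib, but in the end I went with Selenium. Selenium simulates real browser actions well and makes locating page elements easy: simulate a click on the next-page button to step through the pages of singles, then parse the page source with BeautifulSoup to extract the information.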
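The page-turning loop described here might look like the sketch below. It is not the author's next_page implementation, which the article does not show. The function names crawl_all_pages and wait_for_change, the CSS selector a.next for the ">" button, and the polling wait are all assumptions for illustration. The driver argument is expected to be a Selenium WebDriver, but it is only used through page_source and find_elements, so the snippet itself needs no Selenium import.

```python
import time

NEXT_BUTTON_SELECTOR = 'a.next'  # hypothetical selector for the ">" button


def wait_for_change(driver, old_source, timeout=10.0, poll=0.2):
    """Poll until the page source differs from old_source; True if it changed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.page_source != old_source:
            return True
        time.sleep(poll)
    return False


def crawl_all_pages(driver, parse_page, max_pages=30):
    """Parse the current page, click ">", wait for the ajax update, repeat.

    parse_page is called as parse_page(html, page_num).
    Returns the number of pages that were parsed.
    """
    parsed = 0
    for page_num in range(max_pages):
        html = driver.page_source
        parse_page(html, page_num)
        parsed += 1
        # 'css selector' is the string value of Selenium's By.CSS_SELECTOR.
        buttons = driver.find_elements('css selector', NEXT_BUTTON_SELECTOR)
        if not buttons:
            break  # no ">" button found: this was the last page
        buttons[0].click()
        if not wait_for_change(driver, html):
            break  # the page did not refresh in time; stop rather than re-parse
    return parsed
```

Polling page_source keeps the sketch free of dependencies, but a changed source does not guarantee that the new list has finished rendering. With a real browser, Selenium's WebDriverWait is the more usual way to wait for a specific element.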

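2. The URL manager stores the links in a set. Why a set? Because several singles may come from the same album (and therefore share the same album URL), so deduplicating the links reduces the number of requests.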

class UrlManager(object):
    def __init__(self):
        self.new_urls = set()  # URLs still to crawl; a set filters out duplicates
        self.old_urls = set()  # URLs already crawled; a set filters out duplicates

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
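A small sketch of the effect, using plain Python sets and made-up album links (the example.com URLs are placeholders, not real QQ Music links):

```python
# Each entry is the album link found next to one single on the list page.
album_links = [
    'https://example.com/album/A.html',
    'https://example.com/album/B.html',
    'https://example.com/album/A.html',  # another single from album A
    'https://example.com/album/A.html',
    'https://example.com/album/B.html',
]

pending = set()
for link in album_links:
    pending.add(link)  # adding a duplicate leaves the set unchanged

print(len(album_links), 'singles ->', len(pending), 'album pages to download')
```

One consequence of this choice is that set.pop() returns an arbitrary element, so the albums are not necessarily crawled in the order in which they appeared on the page.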

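3. Reading and writing Excel files is very convenient with the third-party library openpyxl, and an Excel file is a good place to keep the information about the singles.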

def write_to_excel(self, content):
    try:
        for row in content:
            self.workSheet.append([row['song_name'], row['song_url'], row['singer_name'], row['song_time']])
        self.workBook.save(self.excelName)  # save the singles to the Excel file
    except Exception as err:
        print('write to excel error', err.args)

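IV. Afterword

A small celebration is in order, since the QQ Music singles were scraped successfully. Selenium deserves most of the credit. This time I only used some of its simpler features; I plan to study Selenium in more depth later, not only for crawling but also for UI automation.

Points that still need optimizing:

1. There are many links to download, and fetching them one by one is slow. I plan to download them concurrently with multiple threads.

2. Downloading too fast risks having the IP banned by the server, so there should be a wait mechanism for frequent visits to the same domain, with a pause between requests.

3. Parsing the page is an important step, and regular expressions, BeautifulSoup, and lxml are all options. I currently use BeautifulSoup, which is less efficient than lxml, so I will try lxml later.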
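The first two points could be combined roughly as follows: a thread pool for concurrency, plus a per-domain delay so that parallel workers still space out their requests to the same host. This is a sketch under assumptions, not code from the article. Throttle, fetch_all, and the fetch callback are hypothetical names, and fetch stands in for a function such as download above.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


class Throttle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, delay):
        self.delay = delay
        self.next_allowed = {}  # domain -> earliest start time of the next request
        self.lock = threading.Lock()

    def wait(self, url):
        domain = urlparse(url).netloc
        with self.lock:
            now = time.monotonic()
            start = max(now, self.next_allowed.get(domain, now))
            # Reserve the slot before sleeping so other threads queue behind it.
            self.next_allowed[domain] = start + self.delay
        sleep_for = start - now
        if sleep_for > 0:
            time.sleep(sleep_for)


def fetch_all(urls, fetch, delay=1.0, max_workers=4):
    """Fetch every URL with a thread pool and return {url: result}."""
    throttle = Throttle(delay)

    def worker(url):
        throttle.wait(url)  # blocks until this domain may be requested again
        return fetch(url)

    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(worker, urls)))
```

Because every link in this crawl points at the same domain, the delay caps the request rate no matter how many threads run. The threads then mainly help by overlapping the time spent waiting for responses.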
