Data Analysis: When Huiyin Ge (迴音哥) Sings Music, What Is He Singing About?

阿新 • Published: 2018-12-21

The idea comes from the following source, credited here:

Respect original work.

——————————————————————————————

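In short, we want to take all the songs by one singer (those that can be found on mainstream sites) and find out which words appear most often, since those say the most about the singer's preferences. Let's get started: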

Step one: scrape the data

For scraping I use scrapy + selenium. Without further ado, here is the code:

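# Spider file, placed under the spiders directory of the Scrapy project

from scrapy import Spider, Request
from selenium import webdriver
from .. import process_text_format
from .. import items


class HuiyingeSpider(Spider):
    name = 'huiyinge'
    # allowed_domains takes bare domain names, not URLs with a scheme
    allowed_domains = ['y.qq.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One shared browser; the downloader middleware uses it to render pages
        self.browser = webdriver.Chrome()
        self.browser.set_page_load_timeout(30)

    def closed(self, reason):
        print("spider closed")
        # quit() ends the whole browser session, not just the current window
        self.browser.quit()

    def start_requests(self):
        # Lyric search result pages 1 to 10
        start_urls = ['https://y.qq.com/portal/search.html#page={}&searchid=1&remoteplace=txt.yqq.top&t=lyric&w=%E5%9B%9E%E9%9F%B3%E5%93%A5'.format(i) for i in range(1, 11)]
        for url in start_urls:
            # The URLs differ only in the #fragment, which the duplicate filter
            # ignores by default, so the filter is bypassed explicitly
            yield Request(url=url, dont_filter=True)

    def parse(self, response):
        titles = response.xpath('//*[@id="lyric_box"]/div[1]/ul/li/h3/a[1]/text()').extract()
        lrcs = response.xpath('//*[@id="lyric_box"]/div[1]/ul/li/div[2]/p').extract()

        for title, lrc in zip(titles, lrcs):
            item = items.HuiyingeItem()
            item['title'] = title
            # lrc is still raw HTML here; strip the tags before storing it
            item['lrc'] = process_text_format.prcessTextFormat(lrc)
            yield item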

Here we went with QQ Music (y.qq.com). Well, Huiyin Ge's catalogue is fairly complete on this site.
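
As a side note, the `w=` value in the search URL is simply the percent-encoded singer name (it decodes to the simplified form 回音哥). A minimal sketch of building the same ten page URLs for any keyword follows; the helper name `build_search_urls` and the constant `SEARCH_URL` are my own and not part of the original project.

```python
from urllib.parse import quote

# URL template copied from the spider; only the page number and keyword vary.
SEARCH_URL = ('https://y.qq.com/portal/search.html#page={page}'
              '&searchid=1&remoteplace=txt.yqq.top&t=lyric&w={keyword}')


def build_search_urls(keyword, pages=10):
    """Return the lyric-search URLs for pages 1..pages (hypothetical helper)."""
    encoded = quote(keyword)  # UTF-8 percent-encoding of the singer name
    return [SEARCH_URL.format(page=page, keyword=encoded)
            for page in range(1, pages + 1)]


urls = build_search_urls('回音哥')
print(urls[0])
```

Swapping the keyword is then enough to point the spider at a different singer, assuming the site keeps this URL scheme.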

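Because of how we store the data later (in txt files rather than in a database), we use the song title as the txt file name and the lyrics as the file content, hence the two fields title and lrc. The lrc field, however, is full of HTML tags such as <p> and <span>. To get reasonably clean lyrics out of it we normalize the text, which is what the prcessTextFormat function used above does.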

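# process_text_format.py, a file we create ourselves inside the Scrapy project

def prcessTextFormat(text):
    """Replace every HTML tag in text with a newline and return the stripped result."""
    indexStart = text.find('<')

    while indexStart != -1:
        # Look for the closing '>' only after the '<' that was just found
        indexEnd = text.find('>', indexStart)
        if indexEnd == -1:
            # Unclosed '<': leave the rest of the text untouched
            break
        text = text[:indexStart] + '\n' + text[indexEnd + 1:]
        indexStart = text.find('<')

    return text.strip()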

if __name__ == '__main__':
    text = '''<p>天后 - <span class="c_tx_highlight">迴音哥</span> (Echo)<br> 詞：彭學斌<br> 曲：彭學斌<br> 終於找到藉口趁著醉意上心頭<br> 表達我所有感受<br> 寂寞漸濃沉默留在舞池角落<br> 你說的太少或太多<br> 都會讓人更惶恐<br> 誰任由誰放縱誰會先讓出自由<br> 最後一定總是我<br> 雙腳懸空在你冷酷熱情間遊走<br> 被侵佔所有還要笑著接受<br> 我嫉妒你的愛氣勢如虹<br> 像個人氣高居不下的天后<br> 你要的不是我而是一種虛榮<br> 有人疼才顯得多麼出眾<br> 我陷入盲目狂戀的寬容<br> 成全了你萬眾寵愛的天后<br> 若愛只剩誘惑只剩彼此忍受<br> 別再互相折磨<br> 因為我們都有錯<br> 推開蒼白的手推開蒼白的廝守<br> 管你有多麼失措<br> 別再叫我心軟是最致命的脆弱<br> 我明明都懂卻仍拼死效忠<br> 我嫉妒你的愛氣勢如虹<br> 像個人氣高居不下的天后<br> 你要的不是我而是一種虛榮<br> 有人疼才顯得多麼出眾<br> 我陷入盲目狂戀的寬容<br> 成全了你萬眾寵愛的天后<br> 若愛只剩誘惑只剩彼此忍受<br> 別再互相折磨<br> 因為我們都有錯<br> 如果有一天愛不再迷惑<br> 足夠去看清所有是非對錯<br> 直到那個時候<br> 你在我的心中<br> 將不再被歌頌<br> 把你當作天后<br> 不會再是我</p>'''
    print(prcessTextFormat(text))
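
For comparison, here is a regex-based variant of the same idea. It is only a sketch and not the function the project uses: `strip_tags` is a name I made up, and unlike prcessTextFormat it also trims each line and drops empty ones, so its output is not byte-for-byte identical.

```python
import re

# Matches one HTML tag: '<', anything that is not '>', then '>'.
TAG_PATTERN = re.compile(r'<[^>]*>')


def strip_tags(text):
    """Replace each tag with a newline, trim every line, and drop empty lines."""
    lines = TAG_PATTERN.sub('\n', text).splitlines()
    return '\n'.join(line.strip() for line in lines if line.strip())


sample = ('<p>天后 - <span class="c_tx_highlight">迴音哥</span> (Echo)'
          '<br> 詞：彭學斌<br> 曲：彭學斌</p>')
print(strip_tags(sample))
```

Neither version is a real HTML parser; both assume the lyric markup contains no '<' or '>' characters outside of tags.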

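Since we load the pages with selenium (so we don't have to worry about JavaScript or other dynamic loading keeping us from the data), we need to modify the downloader middleware.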

# middlewares.py in the Scrapy project

from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == 'huiyinge':
            try:
                # Load the page in the spider's shared browser so its JavaScript runs
                spider.browser.get(request.url)
                # elem = spider.browser.find_element_by_class_name('next js_pageindex')
            except TimeoutException as e:
                print('timed out')
                # Stop loading and work with whatever has rendered so far
                spider.browser.execute_script('window.stop()')
            # Give dynamically loaded content a moment to appear
            time.sleep(2)
            # Returning a response here makes Scrapy skip its own download
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)
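
The middleware only takes effect once it is registered in settings.py, and the same goes for the item pipeline shown further down. The fragment below is a sketch: the package name `myproject` is a placeholder for whatever the Scrapy project is actually called, and the order values 543 and 300 are just conventional choices.

```python
# settings.py (fragment). "myproject" is a placeholder package name.

# Route every request through the selenium-backed middleware.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SeleniumMiddleware': 543,
}

# Send every scraped item to the pipeline that writes the txt files.
ITEM_PIPELINES = {
    'myproject.pipelines.ScrapySeleniumPipeline': 300,
}
```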

Here are items.py and pipelines.py as well. There isn't much to say about them; in the pipeline I chose to save the data straight to files rather than to a database.

# items.py in the Scrapy project

import scrapy


class HuiyingeItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    lrc = scrapy.Field()

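# pipelines.py in the Scrapy project

class ScrapySeleniumPipeline(object):

    def process_item(self, item, spider):
        # The song title becomes the file name; the 'file' directory must already exist
        fileName = item['title']

        with open('file/{}.txt'.format(fileName), 'w') as f:
            f.write(item['lrc'])

        # A pipeline must return the item so that later pipelines still receive it
        return item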
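
One thing to watch out for: the song title goes straight into the file name, so a title containing a character such as '/' or ':' would make open() fail or write somewhere unexpected. A small sketch of a sanitizing step follows; `safe_file_name` is a hypothetical helper of mine, not something from the original project.

```python
import re


def safe_file_name(title, replacement='_'):
    """Replace characters that are not allowed in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip()
    # Fall back to a fixed name if nothing usable is left
    return cleaned or 'untitled'


print(safe_file_name('天后 (Live) / Echo'))
```

In the pipeline this would wrap the title, for example `safe_file_name(item['title'])`, before it is formatted into the path.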

To make debugging easier while running, we add a small script:

# begin.py, a new script in the Scrapy project for running and debugging from PyCharm

from scrapy import cmdline
cmdline.execute("scrapy crawl huiyinge".split())

Then point the run/debug configuration at this script (Baidu has more detailed walkthroughs, so I won't repeat them here).

Here is one of the scraped songs, which also happens to be one I'm rather fond of at the moment:

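Then a problem shows up: lrc contains a lot of redundant fields, such as repeated names of the singer, the producer, the arranger and so on. These could skew the keyword selection later, so we do some simple preprocessing and strip some of them out (the passage before the actual lyrics).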

# pretreatment_huiyin.py: preprocess the scraped lyric files

import os

def remove_sundry(line):
    """Classify a line: 0 = lyric, 1 = 'role：name' credit, 2 = 'role：' with the name on the next line."""
    # Strip first: lines read from a file still end with '\n'
    stripped = line.strip()
    indexOfColon = stripped.find('：')
    if indexOfColon != -1:
        if len(stripped) == indexOfColon + 1:
            return 2
        return 1
    return 0

if __name__ == '__main__':
    for fileName in os.listdir('file'):
        if not fileName.endswith('.txt'):
            continue
        with open('file/{}'.format(fileName), 'r') as f:
            lines = f.readlines()

        newFile = ''
        # Reset for every file so a pending skip never leaks into the next song
        flagOfIsSkip = False
        # The first three lines are the song title, the singer and his English name
        for line in lines[3:]:
            if flagOfIsSkip:
                flagOfIsSkip = False
                continue

            flagOfIdAdd = remove_sundry(line)

            if flagOfIdAdd == 0:
                newFile += line
            if flagOfIdAdd == 2:
                # The credited name sits on the next line, so drop that one too
                flagOfIsSkip = True

        with open('file/{}'.format(fileName), 'w') as f:
            f.write(newFile)

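The result here is not ideal. First we drop the first three lines, since they are always the song title, the singer's name and his English name. Then, for the remaining lines, we check whether the line contains a full-width colon (：). If it does, the line is redundant information such as a producer or arranger credit, and we remove it. Some lines still slip through: sometimes nothing follows the colon and the musician's name is shown on the next line instead, so we add a check that deletes the next line when the colon is followed directly by a line break. Even so, a few still get through (sigh, I can only grumble that the site isn't consistent enough). Fix the rest by hand. (When the formats get this wild, manual intervention is the only option, hmph!)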
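
To make the rule concrete, here is a small self-contained sketch that applies it to a list of lines instead of files. The function name `drop_credit_lines` is mine, and the sample is partly made up: the "監製：" line with the name on the following line is an invented illustration of the second case.

```python
def drop_credit_lines(lines, header_lines=3, colon='：'):
    """Drop the header lines, 'role：name' lines, and the line after a bare 'role：'."""
    kept = []
    skip_next = False
    for line in lines[header_lines:]:
        text = line.strip()
        if skip_next:
            # Previous line was a bare 'role：', so this line holds the name
            skip_next = False
            continue
        position = text.find(colon)
        if position != -1:
            # Credit line; if the colon is the last character, skip the next line too
            skip_next = (position == len(text) - 1)
            continue
        kept.append(text)
    return kept


sample = [
    '天后 - ', '迴音哥', ' (Echo)',   # title, singer, English name
    ' 詞：彭學斌',                    # credit on one line
    ' 監製：',                        # invented: credit split over two lines
    ' (producer name)',
    ' 終於找到藉口趁著醉意上心頭',
    ' 表達我所有感受',
]
print(drop_credit_lines(sample))
```

Only the two lyric lines survive. A lyric line that itself contains a full-width colon would be dropped as well, which is one reason some manual cleanup remains necessary.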

Once that's cleaned up, we can start the actual analysis. Here is the code:

# analyze_huiyin.py: the analysis script

import jieba.posseg as psg
import os
from collections import Counter

def check_word_characteristic(word_flag):
    # Drop function words: pronouns (r), prepositions (p), conjunctions (c), particles (u).
    # This is a substring test, so compound tags containing these letters (e.g. 'nr') are dropped too.
    return not any(tag in word_flag for tag in 'rpcu')

if __name__ == '__main__':

    files = os.listdir('file')
    print(len(files))
    items = []
    for fileName in files:
        if fileName.endswith('.txt'):
            with open('file/{}'.format(fileName), 'r') as f:
                # A set keeps each word at most once per song
                itemSet = set()
                for line in f:
                    for word, flag in psg.cut(line.strip()):
                        if check_word_characteristic(flag):
                            itemSet.add(word + "_" + flag)
            # Append inside the .txt check so other files never re-add the previous song
            items.append(itemSet)
    # Each song adds 1 per distinct word, so a count is the number of songs containing that word
    counter_items = Counter()
    for item in items:
        counter_items.update(item)
    print(counter_items)

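The main idea was already laid out in the WeChat public account article credited earlier: read the data -> tokenize each song and deduplicate with a set -> accumulate over all songs -> count with collections.Counter. On top of that we add part-of-speech filtering to drop function words such as particles, pronouns, prepositions and conjunctions.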
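
The counting step on its own can be sketched without jieba. The token lists below are made up and simply stand in for the tokenizer's output; `document_frequency` is a name I chose for illustration.

```python
from collections import Counter


def document_frequency(songs):
    """Count, for each token, how many songs contain it at least once."""
    counter = Counter()
    for tokens in songs:
        # set() gives every token a single vote per song
        counter.update(set(tokens))
    return counter


# Made-up token lists, one per song
songs = [
    ['夢_n', '夢_n', '眼淚_n'],
    ['夢_n', '幸福_a'],
    ['幸福_a', '幸福_a', '幸福_a'],
]
counts = document_frequency(songs)
print(counts.most_common(2))
```

Deduplicating per song means a word repeated through one chorus counts once, so the totals reflect how many songs use a word rather than how often it is sung.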

Here are the final results (with some common verbs filtered out by hand):

'沒有_v': 21 (don't have)

'不會_v': 18 (won't)

'不要_df': 16 (don't)

'知道_v': 15 (know)

'幸福_a': 15 (happy)

'寂寞_a': 14 (lonely)

'不能_v': 14 (can't)

'夢_n': 13 (dream)

'眼淚_n': 12 (tears)

'永遠_d': 12 (forever)

Emmmm... well, Huiyin Ge's songs really are on the sad side. (laughing-crying emoji)
