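Python Web Crawler Course Project: Bilibili Danmaku Crawling + Map Word Cloud
I. Background
In the era of big data, living standards have risen considerably, and viewers have their own opinions about the videos they watch, which they express by commenting on them. This project scrapes the comments of a Bilibili video and displays them as a word cloud.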
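II. Development Environment, Hardware Support, and Feature Description
Development environment:
Python 3.7.4 + PyCharm 2020.1.3
Python is the runtime that executes the code; PyCharm is the IDE used to write it.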
III. Training Objective
Scrape the comment data of a specified Bilibili video and use stylecloud to generate a word cloud visualization.
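IV. Training Content
1. Use crawling techniques to scrape the comment data from a Bilibili video page:
Code screenshots and result screenshots: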
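A. A large User-Agent list, to reduce the chance of being blocked by anti-crawling measures
B. Imports
a) requests-html is the request module, used to send requests;
b) jsonpath is the parsing module, used to parse the comment data returned as JSON
c) wordcloud is the visualization module, used to draw the word cloud
d) NumPy is a numerical computing module, used here to turn the mask image into an array
e) os module: creates the folder in which the output is saved
f) xlutils, xlrd, and xlwt modules, used to save the comments to an Excel file
g) time module, used for time delays and for timestamp conversion
h) random module, used for random choices such as a random delay or a random User-Agent
i) re module, used to extract the video id (aid) from the page source with a regular expression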
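C. Initialization: sets the comment API URL template, the paging flag and starting page number, the comment containers, and the video URL
D. Send the request and get the response data; the video id (aid) is extracted from the response.
E. Request the video URL supplied by the user
F. Parse the comments, using is_running to control paging to the next page
G. Generate the word cloud
H. Parse the comment list with jsonpath, and convert the Unix timestamps it contains into readable date-time strings
I. After the comment content is obtained, save the data to an Excel spreadsheet
J. Generate the map word cloud, using a map image as the background mask
K. Screenshot of the run results:
Code of blblMapObjSpider.py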
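Steps D, F, and H can be sketched without touching the network. The snippet below is only an illustration under stated assumptions: `SAMPLE_PAGE`, `FAKE_PAGES`, `fetch_page`, and the helper names are hypothetical, the JSON layout (`data.replies[].content.message` and `ctime`) is assumed rather than taken from official documentation, and `time.gmtime` is used instead of local time so that the output is reproducible.

```python
import re
import time

# Hypothetical page source; a real video page is assumed to embed the id in inline JSON.
SAMPLE_PAGE = '{"aid":123456,"bvid":"BV1xx"} ... {"aid":123456,"cid":1}'

# Hypothetical canned pages standing in for the JSON returned by the comment API.
FAKE_PAGES = {
    1: {"data": {"replies": [{"ctime": 0, "content": {"message": "first"}}]}},
    2: {"data": {"replies": [{"ctime": 86400, "content": {"message": "second"}}]}},
    3: {"data": {"replies": None}},  # JSON null is decoded as None
}


def extract_aids(page_source):
    """Return the distinct aid values found in the page, sorted for stable output."""
    return sorted(set(re.findall(r'"aid":(\d+)', page_source)))


def fetch_page(aid, page):
    """Stub for the network call; returns canned data instead of sending a request."""
    return FAKE_PAGES.get(page, {"data": {"replies": None}})


def format_ctime(ctime):
    """Convert a Unix timestamp to a readable UTC string."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(int(ctime)))


def crawl_comments(aid, max_pages=10):
    """Collect (time, message) pairs page by page until the data runs out."""
    rows = []
    page = 1
    is_running = True
    while is_running:
        replies = fetch_page(aid, page)["data"]["replies"]
        if not replies:
            # No comments left: clear the flag so the loop ends
            is_running = False
            continue
        for reply in replies:
            rows.append((format_ctime(reply["ctime"]), reply["content"]["message"]))
        if page >= max_pages:
            is_running = False
        page += 1
    return rows


aids = extract_aids(SAMPLE_PAGE)
rows = crawl_comments(aids[0])
print(aids)
print(rows)
```

Keeping the request behind a small function such as `fetch_page` makes the paging logic easy to check offline before pointing it at the real API.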
import os
import random
import re
import time

import matplotlib.pyplot as plt
import numpy as np
import stylecloud
import xlrd
import xlwt
from jsonpath import jsonpath
from PIL import Image
from requests_html import HTMLSession
from wordcloud import WordCloud
from xlutils.copy import copy

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3451.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:57.0) Gecko/20100101 Firefox/57.0',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2999.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.70 Safari/537.36',
    'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2',
    'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36 OPR/31.0.1889.174',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; MS-RTC LM 8; InfoPath.2; Tablet PC 2.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.61',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.814.0 Safari/535.1',
    'Mozilla/5.0 (Macintosh; U; PPC Mac OS X; ja-jp) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0; Touch; MASMJS)',
    'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1041.0 Safari/535.21',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4093.3 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Swurl) Chrome/77.0.3865.120 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) coc_coc_browser/91.0.146 Chrome/85.0.4183.146 Safari/537.36',
    'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36 VivoBrowser/8.4.72.0 Chrome/62.0.3202.84',
    'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.60',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0',
    'Mozilla/5.0 (X11; CrOS x86_64 13505.63.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:68.0) Gecko/20100101 Firefox/68.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36',
]

session = HTMLSession()


class BZSpider(object):
    def __init__(self):
        # Comment texts used for the map word cloud
        self.yun_list = []
        # Comment API template; the two {} placeholders are the page number and the video aid
        self.start_url = 'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid={}&mode=3&plat=1&_=1623082600632'
        # Loop condition
        self.is_running = True
        # Page counter
        self.start_page = 1
        # Container for all comment texts
        self.big_list = []
        # self.pinglun_url = input('Enter the video URL: ')
        self.pinglun_url = 'https://www.bilibili.com/video/BV1PN411X7QW?from=search&seid=13620076173109636987'

    def parse_pl_url_response(self):
        """Request the video page and extract the video id (aid)."""
        headers = {'user-agent': random.choice(USER_AGENT_LIST)}
        response = session.get(self.pinglun_url, headers=headers).content.decode()
        aid_set = re.findall(r'"aid":(.*?),', response)
        aid_list = list(set(aid_set))
        for aid in aid_list:
            self.parse_start_url(aid)
        # Generate the stylecloud word cloud once the comments are collected
        self.parse_c_y_img()

    def parse_start_url(self, aid):
        """Request and parse the comments of the video."""
        while self.is_running:
            headers = {'user-agent': random.choice(USER_AGENT_LIST)}
            response = session.get(self.start_url.format(self.start_page, aid), headers=headers).json()
            # Extract the comment list with jsonpath (it returns False when nothing matches)
            matches = jsonpath(response, '$..replies')
            data_replies = matches[0] if matches else None
            if data_replies:
                # Parse the comment list
                self.parse_data_replies(data_replies)
            else:
                # Loop exit: JSON null is decoded as None, so stop when no comments remain
                self.is_running = False
            # Loop exit: stop after page 10
            if self.start_page == 10:
                self.is_running = False
            # Page counter +1
            self.start_page += 1
            # Leaves the loop after a single page per aid
            break

    def parse_c_y_img(self):
        """Generate the stylecloud word cloud."""
        print('--------------generating word cloud--------------')
        data = ''.join(self.big_list)
        stylecloud.gen_stylecloud(data, font_path="C:/Windows/Fonts/simfang.ttf")
        img = Image.open("stylecloud.png")
        img.show()
        print('\n' + '----------------------word cloud generated---------------------' + '\n')

    def parse_data_replies(self, data_replies):
        """Parse the comment list and store every comment."""
        for dict_data in data_replies:
            message = jsonpath(dict_data, '$..message')
            c_time = jsonpath(dict_data, '$..ctime')
            if not message or not c_time:
                # jsonpath returns False when a key is missing
                continue
            for text, temp in zip(message, c_time):
                # Convert the Unix timestamp to local time
                timeArray = time.localtime(int(temp))
                otherStyleTime = time.strftime("%Y--%m--%d %H:%M:%S", timeArray)
                self.big_list.append(text)
                data = {'comments': [otherStyleTime, text]}
                self.save_excel(data)
                self.yun_list.append(text)
                print('One comment saved')

    def save_excel(self, data):
        """Append one row to data/comments.xls, creating the file on first use."""
        os_path_1 = os.getcwd() + '/data/'
        if not os.path.exists(os_path_1):
            os.mkdir(os_path_1)
        os_path = os_path_1 + 'comments.xls'
        if not os.path.exists(os_path):
            # Create a new workbook (that is, a new Excel file)
            workbook = xlwt.Workbook(encoding='utf-8')
            # Create a new sheet
            worksheet1 = workbook.add_sheet("comments", cell_overwrite_ok=True)
            # Thin solid borders on all four sides
            borders = xlwt.Borders()
            borders.left = xlwt.Borders.THIN
            borders.right = xlwt.Borders.THIN
            borders.top = xlwt.Borders.THIN
            borders.bottom = xlwt.Borders.THIN
            borders.left_colour = 0x40
            borders.right_colour = 0x40
            borders.top_colour = 0x40
            borders.bottom_colour = 0x40
            style = xlwt.XFStyle()
            style.borders = borders
            # Centered alignment
            al = xlwt.Alignment()
            al.horz = 0x02  # horizontally centered
            al.vert = 0x01  # vertically centered
            style.alignment = al
            excel_data_1 = ('Comment time', 'Comment text')
            for i in range(0, len(excel_data_1)):
                worksheet1.col(i).width = 2560 * 3
                # row, column, content, style
                worksheet1.write(0, i, excel_data_1[i], style)
            workbook.save(os_path)
        # Append the row if the workbook exists
        if os.path.exists(os_path):
            # Open the workbook
            workbook = xlrd.open_workbook(os_path)
            # Names of all sheets in the workbook
            sheets = workbook.sheet_names()
            for i in range(len(sheets)):
                for name in data.keys():
                    worksheet = workbook.sheet_by_name(sheets[i])
                    # Match the sheet name against the key of the data
                    if worksheet.name == name:
                        # Number of rows already in the sheet
                        rows_old = worksheet.nrows
                        # Copy the xlrd object into a writable xlwt object
                        new_workbook = copy(workbook)
                        # The i-th sheet of the writable workbook
                        new_worksheet = new_workbook.get_sheet(i)
                        for num in range(0, len(data[name])):
                            new_worksheet.write(rows_old, num, data[name][num])
                        new_workbook.save(os_path)

    def show_img(self):
        """Generate the map word cloud."""
        data = ''.join(self.yun_list)
        bg = np.array(Image.open("qq.jpg"))
        mask = bg
        wc = WordCloud(
            width=500,  # word cloud width
            height=500,  # word cloud height
            mask=mask,  # mask image that shapes the word cloud
            background_color='white',  # background color (the library default is black)
            font_path=r'C:/Windows/Fonts/simfang.ttf',  # font (Chinese text needs a Chinese font installed locally)
            max_font_size=400,  # maximum font size
            random_state=50,  # seed for the random layout and colors, so runs are reproducible
        )
        wc.generate(data)
        # matplotlib is used to display the word cloud
        plt.imshow(wc)
        plt.axis("off")
        # Save the figure as a local image
        plt.savefig('bilibili_wordcloud.png')
        plt.show()


if __name__ == '__main__':
    b = BZSpider()
    b.parse_pl_url_response()
    b.show_img()
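The module list mentions random delays, while the listing above uses `random` only to pick a User-Agent. A minimal sketch of how the two ideas could be combined is given here. The helper names `build_headers` and `polite_delay`, the shortened `USER_AGENTS` pool, and the delay bounds are assumptions made for illustration; they are not part of the script.

```python
import random
import time

# A shortened pool for illustration; the script above uses a much longer list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4086.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:75.0) Gecko/20100101 Firefox/75.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36",
]


def build_headers(rng=random):
    """Return request headers carrying a randomly chosen User-Agent."""
    return {"user-agent": rng.choice(USER_AGENTS)}


def polite_delay(min_seconds=1.0, max_seconds=3.0, rng=random, sleep=time.sleep):
    """Sleep for a random duration within the given bounds and return it."""
    delay = rng.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


headers = build_headers()
waited = polite_delay(0.0, 0.01)  # tiny bounds so the demonstration finishes quickly
print(headers["user-agent"], round(waited, 4))
```

In a crawler, `polite_delay()` would be called between page requests. Passing `rng` and `sleep` as parameters keeps the helper testable without real waiting.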
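Scraped content:
V. Training Summary
With the help of classmates and the teacher, this training was completed successfully and I learned a great deal. I learned how to use the requests-html request library and the jsonpath data parsing library.
During the training I found that my understanding of wordcloud is not yet deep enough. I also came to understand the meaning of object-oriented programming and gained further insight into anti-crawling mechanisms.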