
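Python Crawler: Data Visualization Analysis of Kongfuzi Used Books (孔夫子舊書網)

I. Background

Nowadays there are many channels for buying books: JD.com, Taobao, Tmall, Dangdang, Xianyu, and others. For this project I chose to analyze and visualize listing data for second-hand periodicals.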

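II. Crawler Design

Crawler name: Kongfuzi used-book periodical data crawler

Goal: use a crawler to collect the prices of used periodicals, then analyze and visualize the data.

Approach:

1. Send HTTP requests with requests.

2. Parse the pages and extract the data with lxml's etree and XPath (html.xpath).

3. Store the data by appending rows to a CSV file (Kongfuzi.csv).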

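III. Page Structure Analysis

Structure type: content navigation page

Structure analysis and lookup method: each listing is a div under the element with id "listBox", and every field is located by an XPath built from the listing's position.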

# Fields: bookname (title), publishing_house (publisher), delivery (delivery rate), price, bookTime_on_shelf (listing time), bookShop (shop)
bookname = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[1]/a/text()'.format(count))
publishing_house = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[2]/div[1]/div/span[2]/text()'.format(count))
delivery = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[3]/div[2]/span[2]/i/text()'.format(count))
price = html.xpath('//*[@id="listBox"]/div[{}]/div[3]/div[1]/div[2]/span[2]/text()'.format(count))
bookTime_on_shelf = html.xpath('//*[@id="listBox"]/div[{}]/div[3]/div[4]/span[1]/text()'.format(count))
bookShop = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[3]/div[1]/div[3]/a/text()'.format(count))

Traversal:

for i in range(50):
    bookname = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[1]/a/text()'.format(count))
    for i in bookname:
        bookname = i
    publishing_house = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[2]/div[1]/div/span[2]/text()'.format(count))
    for i in publishing_house:
        publishing_house = i
    delivery = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[3]/div[2]/span[2]/i/text()'.format(count))
    for i in delivery:
        delivery = i.strip("%")
    price = html.xpath('//*[@id="listBox"]/div[{}]/div[3]/div[1]/div[2]/span[2]/text()'.format(count))
    for i in price:
        price = i
    bookTime_on_shelf = html.xpath('//*[@id="listBox"]/div[{}]/div[3]/div[4]/span[1]/text()'.format(count))
    for i in bookTime_on_shelf:
        bookTime_on_shelf = i
    bookShop = html.xpath('//*[@id="listBox"]/div[{}]/div[2]/div[3]/div[1]/div[3]/a/text()'.format(count))
    for i in bookShop:
        bookShop = i
    count += 1
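The traversal above rebuilds a full XPath for every field and every position. A possible alternative, sketched here under assumptions, is to select the listing nodes once and query each one with relative paths, falling back to an empty string when a field is missing. The markup in SAMPLE_HTML and the helper names first_text and parse_items are hypothetical and much simpler than the real page; only the idea carries over.

```python
from lxml import etree

# Hypothetical, simplified markup standing in for the real listing page.
SAMPLE_HTML = """
<div id="listBox">
  <div><div><a>Book A</a></div><span>12.00</span></div>
  <div><div><a>Book B</a></div></div>
</div>
"""


def first_text(node, path, default=""):
    """Return the first text match of a relative XPath, stripped, or a default."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else default


def parse_items(page_source):
    """Collect one dict per listing, using paths relative to each listing node."""
    html = etree.HTML(page_source)
    rows = []
    for item in html.xpath('//*[@id="listBox"]/div'):
        rows.append({
            "bookname": first_text(item, './div[1]/a/text()'),
            "price": first_text(item, './span/text()'),
        })
    return rows


rows = parse_items(SAMPLE_HTML)
```

With this shape a missing field yields an empty string instead of leaving a list in the variable, and the loop no longer depends on a fixed count of 50 listings per page.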

IV. Crawler Implementation

Data crawling and collection

Code:

import csv
import os
import random

import requests
from lxml import etree

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.2; rv:22.0) Gecko/20130405 Firefox/22.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:22.0) Gecko/20130328 Firefox/22.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:22.0) Gecko/20130405 Firefox/22.0',
    'Mozilla/5.0 (Microsoft Windows NT 6.2.9200.0); rv:22.0) Gecko/20130405 Firefox/22.0',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:16.0.1) Gecko/20121011 Firefox/21.0.1',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:21.0.0) Gecko/20121011 Firefox/21.0.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:21.0) Gecko/20130514 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; rv:21.0) Gecko/20130326 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20130401 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20130331 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20130330 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20130401 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20130328 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20130401 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20130331 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 5.0; rv:21.0) Gecko/20100101 Firefox/21.0',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64;) Gecko/20100101 Firefox/20.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20100101 Firefox/19.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:14.0) Gecko/20100101 Firefox/18.0.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0)  Gecko/20100101 Firefox/18.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
]
headers = {
    'User-Agent': random.choice(USER_AGENTS),
    'Connection': 'keep-alive',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2'
}

CSV_PATH = "Kongfuzi.csv"
FIELDS = ["bookname", "publishing_house", "price", "bookTime_on_shelf", "bookShop"]

# Create Kongfuzi.csv and write the header row, but only when the file is new or empty,
# so repeated runs do not append a second header.
# gbk matches the encoding used when the file is read back for analysis.
if not os.path.exists(CSV_PATH) or os.path.getsize(CSV_PATH) == 0:
    with open(CSV_PATH, "a", newline="", encoding="gbk", errors="ignore") as f:
        csv.writer(f).writerow(FIELDS)


def xpath_text(html, path):
    """Return the last text node matched by path, or "" when nothing matches."""
    values = html.xpath(path)
    return values[-1] if values else ""


def Kongfuzi(pages):
    for page in range(0, pages):
        url = "https://book.kongfz.com/Cqikan/cat_10002w{}".format(page)
        try:
            req = requests.get(url=url, headers=headers, timeout=10)
            req.raise_for_status()
        except requests.RequestException as exc:
            print("Network error:", exc)
            return
        html = etree.HTML(req.text)

        # csv.writer quotes fields, so a comma inside a title no longer breaks the row
        with open(CSV_PATH, "a", newline="", encoding="gbk", errors="ignore") as f2:
            writer = csv.writer(f2)
            # Fields: bookname (title), publishing_house (publisher), delivery (delivery rate),
            # price, bookTime_on_shelf (listing time), bookShop (shop)
            for count in range(1, 51):
                item = '//*[@id="listBox"]/div[{}]'.format(count)
                bookname = xpath_text(html, item + '/div[2]/div[1]/a/text()')
                publishing_house = xpath_text(html, item + '/div[2]/div[2]/div[1]/div/span[2]/text()')
                delivery = xpath_text(html, item + '/div[2]/div[3]/div[2]/span[2]/i/text()').strip("%")
                price = xpath_text(html, item + '/div[3]/div[1]/div[2]/span[2]/text()')
                bookTime_on_shelf = xpath_text(html, item + '/div[3]/div[4]/span[1]/text()')
                bookShop = xpath_text(html, item + '/div[2]/div[3]/div[1]/div[3]/a/text()')
                if not bookname:
                    continue  # no listing at this position

                # Save the row
                writer.writerow([bookname, publishing_house, price, bookTime_on_shelf, bookShop])

                # Show what was saved
                print(bookname,
                      "Publisher:", publishing_house, '\n',
                      "Delivery rate:", delivery, '%\n',
                      "Price:", price, 'yuan\n',
                      "Listed on:", bookTime_on_shelf, '\n',
                      "Shop:", bookShop)
                print('\n')


if __name__ == '__main__':
    keyword = input("Number of pages to crawl: ")
    Kongfuzi(int(keyword))
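One detail of the script is that the headers dictionary is built once at import time, so the same User-Agent is sent with every request. A minimal sketch of two possible refinements follows: choosing a User-Agent per request, and pausing a random interval between pages. The helper names build_headers, page_urls and polite_pause, the shortened USER_AGENTS list, and the 1 to 3 second range are all assumptions for illustration, not part of the original program.

```python
import random
import time

# Shortened list for illustration; the script's full list could be used instead.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.2; rv:22.0) Gecko/20130405 Firefox/22.0',
    'Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20130401 Firefox/21.0',
]

URL_TEMPLATE = "https://book.kongfz.com/Cqikan/cat_10002w{}"


def build_headers(user_agents=USER_AGENTS):
    """Build a fresh header dict, picking a User-Agent at call time."""
    return {
        'User-Agent': random.choice(user_agents),
        'Connection': 'keep-alive',
    }


def page_urls(pages, template=URL_TEMPLATE):
    """List the page URLs with the same numbering as the script (0 .. pages-1)."""
    return [template.format(i) for i in range(pages)]


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random interval and return the chosen delay in seconds."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

In the crawl loop this would amount to calling requests.get(url, headers=build_headers(), timeout=10) for each URL from page_urls(pages), followed by polite_pause() before the next page.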

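Data cleaning and processing

import pandas as pd
# xs holds the crawled listings
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
xs = pd.read_csv(r'D:\Py_project\Kongfuzi.csv', on_bad_lines='skip', encoding='gbk')
# Drop duplicate titles
xs = xs.drop_duplicates('bookname')
# Drop rows that contain NaN
xs = xs.dropna(axis=0)
# Sort by price in descending order
xs.sort_values(by=["price"], inplace=True, ascending=[False])
xs.head(20)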
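Sorting by price only gives a meaningful ranking when the column is numeric. If any crawled value fails to parse as a number, pandas keeps the whole column as strings, and "9" then sorts above "100". The following is a small sketch of an explicit conversion step, assuming the column names used above; the function name clean_prices and the sample rows are made up for illustration.

```python
import pandas as pd


def clean_prices(df):
    """Deduplicate by title, force price to numeric, drop bad rows, sort descending."""
    out = df.drop_duplicates("bookname").copy()
    # Values that cannot be parsed become NaN instead of raising an error
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out = out.dropna(subset=["price"])
    return out.sort_values("price", ascending=False).reset_index(drop=True)


# Made-up sample rows: one duplicate title and one unparseable price
sample = pd.DataFrame({
    "bookname": ["A", "A", "B", "C"],
    "price": ["12.50", "12.50", "n/a", "100"],
})
cleaned = clean_prices(sample)
```

Here the duplicate row for A and the unparseable row for B are dropped, and C (100.0) ranks above A (12.5) because the comparison is numeric rather than lexicographic.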

  

# Price ranking visualization
import matplotlib.pyplot as plt
x = xs['bookname'].head(20)
y = xs['price'].head(20)
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render the Chinese book titles
plt.rcParams['axes.unicode_minus'] = False
plt.bar(x, y, alpha=0.2, width=0.4, color='b', lw=3, label="price")
plt.plot(x, y, '-', color='r', label="price trend")
plt.xticks(rotation=90)  # rotate the long titles so they do not overlap
plt.legend(loc="best")  # legend
plt.title("Price trend")
plt.xlabel("Book title")  # x-axis label
plt.ylabel("Price")  # y-axis label
plt.show()
# Horizontal bar chart
plt.barh(x, y, alpha=0.2, height=0.4, color='g', label="price", lw=3)
plt.title("Price (horizontal bars)")
plt.legend(loc="best")  # legend
plt.xlabel("Price")  # x-axis label
plt.ylabel("Book title")  # y-axis label
plt.show()
# Scatter plot
plt.scatter(x, y, color='gray', marker='o', s=40, alpha=0.5)
plt.xticks(rotation=90)
plt.title("Price scatter plot")
plt.xlabel("Book title")  # x-axis label
plt.ylabel("Price")  # y-axis label
plt.show()
# Box plot
plt.boxplot(y)
plt.title("Price box plot")
plt.show()

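Word cloud:

import numpy as np
import wordcloud as wc
from PIL import Image
import matplotlib.pyplot as plt

# Mask image that defines the shape of the cloud
bk = np.array(Image.open(r"C:\Users\X0iaoyan\Downloads\111.jpg"))
mask = bk
# Configure the word cloud
word_cloud = wc.WordCloud(
    width=1000,  # canvas width (ignored when a mask is given)
    height=1000,  # canvas height (ignored when a mask is given)
    mask=mask,
    background_color='black',  # background color ('black' is also the default)
    font_path='msyhbd.ttc',  # font file; Chinese text needs a Chinese font installed locally
    max_font_size=400,  # largest font size allowed
    random_state=50,  # seed, so the layout and colors are reproducible
)
text = xs["bookname"]
text = " ".join(text)
word_cloud.generate(text)
plt.imshow(word_cloud)
plt.axis("off")  # hide the axes around the image
plt.show()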
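To check what the cloud will emphasize before drawing it, the token counts can be computed directly. This is a small sketch that assumes titles are split on whitespace only, which is a simplification; the function name token_frequencies and the sample titles are hypothetical.

```python
from collections import Counter


def token_frequencies(titles):
    """Count whitespace-separated tokens across all titles."""
    counts = Counter()
    for title in titles:
        counts.update(str(title).split())
    return counts


# Hypothetical titles standing in for xs["bookname"]
sample_titles = ["Weekly 1975", "Weekly 1976", "Monthly 1990"]
freqs = token_frequencies(sample_titles)
top = freqs.most_common(1)
```

A mapping like this could also be passed to WordCloud.generate_from_frequencies in place of calling generate on the joined text, which gives direct control over how the titles are tokenized.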

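V. Summary

1. What conclusions can be drawn from the analysis and visualization of the data, and was the expected goal reached? The results met expectations: the charts show the overall price trend of the listings.

2. What did I gain from this project, and what should be improved? I learned a great deal about filtering data during processing, in particular how to convert data types to get the result I wanted. The main thing to improve is my own speed: I was slow at writing the program, and I still lack programming experience.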