Python crawler: Ctrip Shanghai
阿新 · Published: 2018-12-22
The end of the year is coming up and I have no desire to grind away at research. As it happens, a lovely young lady is coming to visit Shanghai, so with nothing better to do I figured I'd scout the place out with a crawler first. I'm lazy, after all, and don't feel like surveying on foot.
Start with the travel journals. I noticed the page URL is http://you.ctrip.com/travels/shanghai2.html and got curious: the very first page is already shanghai2??? Then what is shanghai1? I clicked through to find out: http://you.ctrip.com/travels/shanghai1.html
(⊙o⊙)… it's the Beijing journals. I was genuinely stunned. Kudos to Ctrip's naming scheme. Okay, end of digression.
Flipping to page two gives http://you.ctrip.com/travels/shanghai2/t3-p2.html
It's a safe guess that p is the page number, so -p1, -p2, -p3… are the pages we want to crawl. Let's start with 20 pages:
urls=['http://you.ctrip.com/travels/shanghai2/t3-p'+str(i)+'.html' for i in range(1,21)]
Ctrip has only the most basic anti-crawler checks, so we just put on a disguise and add a headers field:
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
res=requests.get(url,headers=headers)
Parse with bs4 to get each journal's URL, taking the first page as an example:
tmp=soup.find_all('a',attrs={'class':'journal-item cf','target':'_blank'})
for t in tmp:
    detail_url.append(t.get('href'))
['/travels/shanghai2/3333236.html', '/travels/shanghai2/3534134.html', '/travels/shanghai2/3635663.html', '/travels/shanghai2/3742279.html', '/travels/tibet100003/1755676.html', '/travels/shanghai2/1560853.html', '/travels/shanghai2/1816039.html', '/travels/shanghai2/1578243.html', '/travels/shanghai2/1885378.html', '/travels/huangshan19/2189034.html']
Looks like something quite remarkable snuck in there. Ctrip really is a magical site; it truly embraces everything. Add a check for 'shanghai':
if 'shanghai' in t.get('href'):detail_url.append(t.get('href'))
Next, extract the Chinese text from each journal's body.
The text sits inside p tags, and the xpath paths are
/html/body/div[3]/div[4]/div[1]/div[1]/div[2]/p[2]/text()
/html/body/div[3]/div[4]/div[1]/div[1]/div[2]/p[3]/text()
The next ones should be p[4], p[5]…, so everything lines up neatly in a row. But lxml.etree kept getting these xpaths wrong, which was awkward (my skills only go so far), so I went back to my old standby, bs4. The body text turns out to live inside the class ctd_content; from there, test whether each character is Chinese and write it out if it is.
def isContainChinese(s):
    for c in s:
        if ('\u4e00' <= c <= '\u9fa5'):
            return True
    return False

def get_detail_content(url):
    res=requests.get('http://you.ctrip.com'+url,headers=headers)
    soup = BeautifulSoup(res.content,'html.parser')
    tmp=soup.find_all('div',attrs={'class':'ctd_content'})
    s=str(tmp[0])
    contain=''
    for c in s:
        if isContainChinese(c):
            contain+=c
    return contain
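As an aside, the xpath route may well have failed because of those brittle absolute /html/body/div[...] paths; a relative xpath scoped to the ctd_content div is usually sturdier. Here is a minimal, untested sketch with lxml (the function name get_detail_content_xpath is made up; headers is the dict defined above, and the assumption is that the body text really does sit in p tags under the ctd_content div, as the bs4 version found):
import requests
from lxml import etree

def get_detail_content_xpath(url):
    res = requests.get('http://you.ctrip.com' + url, headers=headers)
    tree = etree.HTML(res.text)
    # grab the text of every <p> under the ctd_content div, relative path only
    paragraphs = tree.xpath('//div[contains(@class,"ctd_content")]//p//text()')
    return ''.join(p.strip() for p in paragraphs)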
Save the results to a txt file.
Finally, speed things up with multithreading (then again, with only 20 pages it hardly seems necessary).
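If the crawl ever does need speeding up, a minimal sketch with concurrent.futures might look like this (it reuses get_detail_url and get_detail_content from the full code below; the pool size of 5 is an arbitrary choice):
from concurrent.futures import ThreadPoolExecutor

detail_url = get_detail_url(urls)
with ThreadPoolExecutor(max_workers=5) as pool:
    # map keeps the results in the same order as detail_url
    contents = list(pool.map(get_detail_content, detail_url))
txt = ''.join(contents)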
The complete code is as follows:
import requests
from bs4 import BeautifulSoup
from lxml import etree
import os

urls=['http://you.ctrip.com/travels/shanghai2/t3-p'+str(i)+'.html' for i in range(1,21)]
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
path=os.getcwd()

def isContainChinese(s):
    # True if the string contains at least one CJK character
    for c in s:
        if ('\u4e00' <= c <= '\u9fa5'):
            return True
    return False

def get_detail_url(urls):
    # collect journal links from each listing page, keeping only Shanghai ones
    detail_url=[]
    for url in urls:
        res=requests.get(url,headers=headers)
        soup = BeautifulSoup(res.content,'html.parser')
        tmp=soup.find_all('a',attrs={'class':'journal-item cf','target':'_blank'})
        for t in tmp:
            if 'shanghai' in t.get('href'):
                detail_url.append(t.get('href'))
    return detail_url

def get_detail_content(url):
    # fetch one journal page and keep only the Chinese characters in ctd_content
    print(url)
    res=requests.get('http://you.ctrip.com'+url,headers=headers)
    soup = BeautifulSoup(res.content,'html.parser')
    tmp=soup.find_all('div',attrs={'class':'ctd_content'})
    s=str(tmp[0])
    contain=''
    for c in s:
        if isContainChinese(c):
            contain+=c
    return contain

detail_url=get_detail_url(urls)
txt=''
for url in detail_url:
    txt+=get_detail_content(url)
# utf-8 so the Chinese text is written safely regardless of platform default
with open(path+'/shanghai.txt','a',encoding='utf-8') as f:
    f.write(txt)
With that, we have the body text of the journals (looks like I forgot to crawl the images; oh well, maybe in a later update). Now that the data is in hand, let's start processing it, beginning with something simple: a word-frequency count.
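One straightforward way to do the counting is jieba plus collections.Counter; a minimal sketch, assuming the shanghai.txt file written above (the length filter is just a cheap way to drop single characters left over from segmentation):
from collections import Counter
import jieba

with open('shanghai.txt', 'r', encoding='utf-8') as f:
    text = f.read()

words = [w for w in jieba.cut(text) if len(w) > 1]  # keep words of two or more characters
freq = Counter(words)
print(freq.most_common(20))  # the 20 most frequent words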
jieba handles the Chinese word segmentation, and the stopwords I set myself (there are so many, what a pain):
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
from os import path
import jieba
from scipy.misc import imread  # removed in newer SciPy; imageio.imread is a drop-in replacement

d = path.dirname(__file__)
ciyun1=''
lists=''
# hand-picked stopwords
remove=['點選','檢視','原圖','資訊','相關','一個','可以','因為','這個','一下','這裡','很多',
        '我們','沒有','自己','還是','還有','就是','最後','覺得','開始','現在','裡面','看到',
        '而且','一些','一種','一樣','所以','如果','不過','時候','大家','附近','這樣']

with open(path.join(d,"shanghai.txt"),'r',encoding='utf-8') as f1:
    lists = f1.read()
word1=jieba.cut(lists)
ciyun1 = ",".join(word1)
text=ciyun1

alice_coloring = imread(path.join(d, "氣球.png"))
wc = WordCloud(background_color="white",  # background color
               max_words=2000,            # maximum number of words shown in the cloud
               mask=alice_coloring,       # mask / background image
               font_path='simkai.ttf',
               stopwords=remove,
               max_font_size=40,          # maximum font size
               random_state=42)
# Generate the word cloud: generate() can take the whole text (Chinese does not segment
# well on its own), or we can compute the frequencies ourselves and use generate_from_frequencies()
wc.generate(text)
# wc.generate_from_frequencies(txt_freq)
# txt_freq example: [('詞a', 100),('詞b', 90),('詞c', 80)]
# derive the color values from the background image
image_colors = ImageColorGenerator(alice_coloring)
# the code below displays the images
plt.imshow(wc)
plt.axis("off")
# draw the recolored word cloud
plt.figure()
# recolor wordcloud and show
# we could also give color_func=image_colors directly in the constructor
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis("off")
# draw the background image in grayscale
plt.figure()
plt.imshow(alice_coloring, cmap=plt.cm.gray)
plt.axis("off")
#plt.show()
# save the image
wc.to_file(path.join(d, "上海.png"))
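As the comments in the script note, generate_from_frequencies is the alternative to generate. A sketch feeding it the Counter from the frequency sketch above (recent wordcloud versions take a dict of word to count; the output filename here is made up, and the stopwords have to be filtered by hand because generate_from_frequencies does not apply the stopword list):
# build the cloud from precomputed frequencies instead of raw text;
# `freq` is the Counter from the word-frequency sketch, `remove`, `wc`, `d` come from the script above
freqs = {w: c for w, c in freq.items() if w not in remove}
wc.generate_from_frequencies(freqs)
wc.to_file(path.join(d, "上海_freq.png"))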