
Python scraping: Ctrip Shanghai travel journals

The end of the year is coming up and I have zero appetite for research. Conveniently, a lovely friend is about to visit Shanghai, so with nothing better to do I decided to scout the city out with a scraper first. I'm lazy; actual field work is not happening.

Starting with the travel journals, I noticed the page URL is http://you.ctrip.com/travels/shanghai2.html. Wait, the very first page is already shanghai2??? Then what is shanghai1? Curiosity got the better of me, so I opened http://you.ctrip.com/travels/shanghai1.html

(⊙o⊙)… and it's Beijing travel journals. I was genuinely stunned. Full marks to Ctrip's naming scheme. Okay, end of digression.

Flipping to the second page gives http://you.ctrip.com/travels/shanghai2/t3-p2.html,
so it's a safe bet that p is the page number and -p1, -p2, -p3… are the pages we want to crawl. Let's start with 20 pages:

urls=['http://you.ctrip.com/travels/shanghai2/t3-p'+str(i)+'.html' for i in range(1,21)]

Ctrip has the most basic kind of anti-scraping check, so we put on a disguise and add a headers dict with a browser User-Agent:

headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
res=requests.get(url,headers=headers)
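
Just to check that the disguise works, one can fetch the first listing page and look at the status code. This little check and the timeout are my own additions, not part of the original post:

res=requests.get(urls[0],headers=headers,timeout=10)   # first listing page
print(res.status_code)                                 # expect 200 if the headers got us past the check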

Parse the listing with bs4 to get the URL of each journal. Taking the first page as an example:

soup=BeautifulSoup(res.content,'html.parser')
detail_url=[]
tmp=soup.find_all('a',attrs={'class':'journal-item cf','target':'_blank'})
for t in tmp:
    detail_url.append(t.get('href'))

detail_url then looks like this:
['/travels/shanghai2/3333236.html',
 '/travels/shanghai2/3534134.html',
 '/travels/shanghai2/3635663.html',
 '/travels/shanghai2/3742279.html',
 '/travels/tibet100003/1755676.html',
 '/travels/shanghai2/1560853.html',
 '/travels/shanghai2/1816039.html',
 '/travels/shanghai2/1578243.html',
 '/travels/shanghai2/1885378.html',
 '/travels/huangshan19/2189034.html']

It seems something rather remarkable has sneaked in. Ctrip really is a magical website that embraces everything. Add a check for 'shanghai':

if 'shanghai' in t.get('href'):detail_url.append(t.get('href'))
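
As an aside, instead of gluing 'http://you.ctrip.com' onto the relative hrefs later by string concatenation, urllib.parse.urljoin does the same job. A small sketch, my own variation rather than what the original code does:

from urllib.parse import urljoin
full_urls=[urljoin('http://you.ctrip.com',u) for u in detail_url]   # e.g. http://you.ctrip.com/travels/shanghai2/3333236.html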

Next, extract the Chinese characters from the journal body.
The text sits in p tags, and the xpath paths look like:

/html/body/div[3]/div[4]/div[1]/div[1]/div[2]/p[2]/text()
/html/body/div[3]/div[4]/div[1]/div[1]/div[2]/p[3]/text()

The next ones should be p[4], p[5]… which would line everything up nice and tidy. But lxml.etree kept giving me the wrong result for these xpaths, which is a little embarrassing (my skills are limited), so I went back to trusty old bs4. The body text all lives inside the div with class ctd_content; from there, check whether each character is Chinese and keep it if so:

def isContainChinese(s):
    # True if s contains at least one character in the common CJK range
    for c in s:
        if ('\u4e00' <= c <= '\u9fa5'):
            return True
    return False

def get_detail_content(url):
    # the hrefs are relative, so prepend the site root
    res=requests.get('http://you.ctrip.com'+url,headers=headers)
    soup = BeautifulSoup(res.content,'html.parser')
    tmp=soup.find_all('div',attrs={'class':'ctd_content'})
    s=str(tmp[0])
    # keep only the Chinese characters from the journal body
    contain=''
    for c in s:
        if isContainChinese(c):
            contain+=c
    return contain
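
A character-by-character loop works, but a regular expression over the same CJK range is a shorter way to pull out the Chinese text. A sketch of an alternative, my own rather than from the original post:

import re

def get_chinese(s):
    # grab every run of characters in the \u4e00-\u9fa5 range and join them
    return ''.join(re.findall(r'[\u4e00-\u9fa5]+', s))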

Save the result to a txt file.
Finally, speed it up with multithreading (on second thought, it's only 20 pages, so threads are hardly worth it; a sketch follows the full code anyway).
The full code:

import requests
from bs4 import BeautifulSoup
import os
urls=['http://you.ctrip.com/travels/shanghai2/t3-p'+str(i)+'.html' for i in range(1,21)]
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
path=os.getcwd()
def isContainChinese(s):
    for c in s:
        if ('\u4e00' <= c <= '\u9fa5'):
            return True
    return False
def get_detail_url(urls):
    detail_url=[]
    for url in urls:
        res=requests.get(url,headers=headers)
        soup = BeautifulSoup(res.content,'html.parser')
        tmp=soup.find_all('a',attrs={'class':'journal-item cf','target':'_blank'})
        for t in tmp:
            if 'shanghai' in t.get('href'):detail_url.append(t.get('href'))
    return detail_url

def get_detail_content(url):
    print(url)
    res=requests.get('http://you.ctrip.com'+url,headers=headers)
    soup = BeautifulSoup(res.content,'html.parser')
    tmp=soup.find_all('div',attrs={'class':'ctd_content'})
    s=str(tmp[0])
    contain=''
    for c in s:
        if isContainChinese(c):
            contain+=c
    return contain

detail_url=get_detail_url(urls)          # collect the links to the individual journals
txt=''
for url in detail_url:
    txt+=get_detail_content(url)         # concatenate the Chinese text of every journal
# utf-8 so the Chinese text is written the same way on any platform
with open(path+'/shanghai.txt','a',encoding='utf-8') as f:
    f.write(txt)
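
For the multithreading mentioned earlier, a minimal sketch with concurrent.futures would be enough, though for roughly 20 pages it really is overkill:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=5) as pool:
    # fetch the journals in parallel, keeping the original order
    parts=list(pool.map(get_detail_content, detail_url))
txt=''.join(parts)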

And with that we have the journal text (I seem to have forgotten about the pictures; oh well, maybe in a later update). Now that the data is in hand, let's process it. Starting simple: a word-frequency count.
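
Before drawing anything, a quick way to eyeball the top words is jieba plus collections.Counter. A small sketch; the single-character filter is my own shortcut, not something the original post does:

import jieba
from collections import Counter

with open('shanghai.txt','r',encoding='utf-8') as f:
    text=f.read()
words=[w for w in jieba.cut(text) if len(w)>1]   # drop single characters, which also removes most punctuation
print(Counter(words).most_common(20))            # the 20 most frequent words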

jieba handles the Chinese word segmentation, with a hand-picked stop-word list (there are so many of them, what a chore). Here is the word-cloud script:

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from os import path
import jieba
import numpy as np
from PIL import Image   # scipy.misc.imread has been removed from recent SciPy; PIL + numpy does the same job
d = path.dirname(__file__)
remove=['點選','檢視','原圖','資訊','相關','一個','可以','因為','這個','一下','這裡','很多',
        '我們','沒有','自己','還是','還有','就是','最後','覺得','開始','現在','裡面','看到',
        '而且','一些','一種','一樣','所以','如果','不過','時候','大家','附近','這樣']
with open(path.join(d,"shanghai.txt"),'r',encoding='utf-8') as f1:
    lists = f1.read()
word1=jieba.cut(lists)      # jieba segments the Chinese text into words
text=",".join(word1)        # WordCloud.generate wants one delimited string

alice_coloring = np.array(Image.open(path.join(d, "氣球.png")))   # the balloon-shaped mask

wc = WordCloud(background_color="white",  # background colour
               max_words=2000,            # maximum number of words in the cloud
               mask=alice_coloring,       # use the balloon image as the mask
               font_path='simkai.ttf',    # a font that can render Chinese
               stopwords=remove,          # the hand-picked stop words above
               max_font_size=40,          # largest font size
               random_state=42)
# Generate the cloud: generate() can take the full text (Chinese isn't segmented well on its own),
# or you can compute the frequencies yourself and call generate_from_frequencies (a sketch of that route follows the script)
wc.generate(text)
# wc.generate_from_frequencies(txt_freq)
# where txt_freq looks like [('word a', 100), ('word b', 90), ('word c', 80)]
# derive colour values from the mask image
image_colors = ImageColorGenerator(alice_coloring)

# show the plain word cloud
plt.imshow(wc)
plt.axis("off")
# show the recoloured word cloud
plt.figure()
# recolor wordcloud and show
# we could also give color_func=image_colors directly in the constructor
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis("off")
# show the mask image itself, in grey
plt.figure()
plt.imshow(alice_coloring, cmap=plt.cm.gray)
plt.axis("off")
#plt.show()
# save the word cloud as an image
wc.to_file(path.join(d, "上海.png"))
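
The generate_from_frequencies route mentioned in the comments above would look roughly like this, reusing the jieba output and the remove list. A sketch; the output filename is made up:

from collections import Counter
freqs=Counter(w for w in jieba.cut(lists) if len(w)>1 and w not in remove)
wc.generate_from_frequencies(freqs)              # frequencies instead of raw text
wc.to_file(path.join(d, "shanghai_freq.png"))    # hypothetical output name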

[Image: 氣球.png, the balloon mask]

[Image: 上海.png, the generated word cloud]

That's it for today. More in the next update; leave a comment below if there's something you'd like scraped.