Python Crawler Series: JD.com Product Crawler
阿新 • Published: 2019-02-10
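Goal: crawl the phone listings on JD.com's mobile phone channel and collect each product's name, price, review count, and seller name.
Two problems need to be solved first:
1. Downloading and saving the phone images.
2. Fetching and saving the phone prices. Prices are loaded asynchronously, so they cannot be read directly from the page's HTML source.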
Downloading and saving an image
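import requests

# A product thumbnail hosted on JD's image server
url = "https://img13.360buyimg.com/n7/jfs/t3391/79/1963324994/297093/187de6d4/583ced0fN27e50577.jpg"
res = requests.get(url)
# Write the raw bytes in binary mode ("wb") so the image is not corrupted
with open("E:\\jupyter-notebook\\PyCrawler\\jd1.jpg", "wb") as fd:
    fd.write(res.content)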
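The snippet above hard-codes the output file and assumes the target folder already exists. As a minimal sketch of one way to make that step reusable, the helpers below derive the file name from the image URL and create the folder on demand. The names `image_path_for`, `save_bytes`, and `download_image` are hypothetical, and `fetch` stands for any callable that returns the response bytes for a URL.

```python
from pathlib import Path


def image_path_for(url, out_dir):
    """Derive a local path from the last segment of an image URL."""
    filename = url.rsplit("/", 1)[-1].split("?", 1)[0]
    return Path(out_dir) / filename


def save_bytes(content, path):
    """Create the parent folder if needed, then write the bytes."""
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.write_bytes(content)


def download_image(url, out_dir, fetch):
    """Download one image; `fetch` maps a URL to the response bytes."""
    path = image_path_for(url, out_dir)
    save_bytes(fetch(url), path)
    return path


# With requests, as used in this article, `fetch` could be written as:
#
# def fetch(url):
#     res = requests.get(url, timeout=10)
#     res.raise_for_status()  # fail loudly instead of saving an error page
#     return res.content
```

Passing `fetch` in as an argument keeps the file handling separate from the network call, so the saving logic can be tried out without hitting JD's servers.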
Asynchronously loaded data: extracting JD.com prices as an example
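import re

# Price endpoint; the response is wrapped in the callback named in the URL
url = "https://p.3.cn/prices/mgets?callback=jQuery6775278&skuids=J_5089253"
res = requests.get(url)
# Capture the value of the "p" field, which holds the price
pat = '"p":"(.*?)"}'
price = re.compile(pat).findall(res.text)
print(price)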
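The regular expression above only works while "p" is the last field of each object. Since the URL passes a `callback` parameter, the response is presumably JSONP, so an alternative is to strip the callback wrapper and parse the rest as JSON. The sketch below illustrates that idea. The sample payload is an assumed shape for illustration, not a captured response, and `parse_jsonp` and `prices_by_sku` are hypothetical helper names.

```python
import json
import re


def parse_jsonp(text):
    """Strip a `callback(...)` wrapper and parse the JSON inside it."""
    match = re.search(r"^[^(]*\((.*)\)\s*;?\s*$", text.strip(), re.DOTALL)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))


def prices_by_sku(text):
    """Map each item's "id" to its "p" value, converted to float."""
    return {item["id"]: float(item["p"]) for item in parse_jsonp(text)}


# Assumed response shape, for illustration only
sample = 'jQuery6775278([{"id":"J_5089253","p":"1999.00"}]);'
print(prices_by_sku(sample))
```

Parsing the payload keeps working if the server reorders the fields, and it keeps each price attached to its SKU id.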
Collecting JD.com phone images
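url = "https://list.jd.com/list.html?cat=9987,653,655"
res = requests.get(url)
# The thumbnail address is read from the data-lazy-img attribute
imagepat = '<img width="220" height="220" data-img="1" data-lazy-img="//(.*?)">'
imagelist = re.compile(imagepat).findall(res.text)
print(imagelist)
# Save the images as 1.jpg, 2.jpg, ... in the jdpic folder
for x, imageurl in enumerate(imagelist, start=1):
    imagename = "E:\\jupyter-notebook\\PyCrawler\\jdpic\\" + str(x) + ".jpg"
    # The pattern drops the leading "//", so add the scheme back
    imageurl = "http://" + imageurl
    res = requests.get(imageurl)
    with open(imagename, 'wb') as fd:
        fd.write(res.content)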
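The image pattern above matches only when the attributes appear in exactly that order with exactly those values. As a hedged alternative, swapping the regular expression for the standard library's `html.parser`, the sketch below collects every `data-lazy-img` attribute regardless of attribute order. `LazyImageCollector` and `extract_lazy_images` are hypothetical names, and the sample markup is made up to mirror the pattern used in this article.

```python
from html.parser import HTMLParser


class LazyImageCollector(HTMLParser):
    """Collect the data-lazy-img attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("data-lazy-img")
        if src:
            # Protocol-relative addresses ("//host/path") need a scheme
            self.urls.append("https:" + src if src.startswith("//") else src)


def extract_lazy_images(html):
    collector = LazyImageCollector()
    collector.feed(html)
    return collector.urls


# Made-up markup shaped like the pattern above
sample = '<img width="220" height="220" data-img="1" data-lazy-img="//img13.360buyimg.com/n7/a.jpg">'
print(extract_lazy_images(sample))  # ['https://img13.360buyimg.com/n7/a.jpg']
```

The returned addresses can then be downloaded in a loop exactly as in the code above.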
The complete code is as follows
# JD.com phone data collection: name, price, review count, seller name
import re
import requests
from lxml import etree
from pandas import DataFrame
import pandas as pd

jdInfoAll = DataFrame()
# Crawl the first three listing pages
for page in range(1, 4):
    url = "https://list.jd.com/list.html?cat=9987,653,655&page=" + str(page)
    res = requests.get(url)
    res.encoding = 'utf-8'
    root = etree.HTML(res.text)
    # Product names, with all whitespace removed
    name = root.xpath('//li[@class="gl-item"]//div[@class="p-name"]/a/em/text()')
    for i in range(0, len(name)):
        name[i] = re.sub(r'\s', '', name[i])
    # SKU
    sku = root.xpath('//li[@class="gl-item"]/div/@data-sku')
    # Price and review count are loaded asynchronously: one request each per SKU
    price = []
    comment = []
    for i in range(0, len(sku)):
        thissku = sku[i]
        priceurl = "https://p.3.cn/prices/mgets?callback=jQuery6775278&skuids=J_" + str(thissku)
        pricedata = requests.get(priceurl)
        pricepat = '"p":"(.*?)"}'
        thisprice = re.compile(pricepat).findall(pricedata.text)
        price = price + thisprice
        commenturl = "https://club.jd.com/comment/productCommentSummaries.action?my=pinglun&referenceIds=" + str(thissku)
        commentdata = requests.get(commenturl)
        commentpat = '"CommentCount":(.*?),"'
        thiscomment = re.compile(commentpat).findall(commentdata.text)
        comment = comment + thiscomment
    # Seller name
    shopname = root.xpath('//li[@class="gl-item"]//div[@class="p-shop"]/@data-shop_name')
    print(shopname)
    # One row per product; the four lists are assumed to be the same length
    jdInfo = DataFrame([name, price, shopname, comment]).T
    jdInfo.columns = ['Product name', 'Price', 'Seller name', 'Review count']
    jdInfoAll = pd.concat([jdInfoAll, jdInfo])
# The original wrote 'jdInfoAll.xls'; recent pandas versions no longer write the legacy .xls format
jdInfoAll.to_excel('jdInfoAll.xlsx')
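One fragile point in the full script is that names, prices, seller names, and review counts are gathered into four separate lists and then lined up by position. If a single price or review lookup returns nothing, every later row shifts. A minimal sketch of an alternative is to key the asynchronous values by SKU and build one record per product. The function names, field names, and sample data below are hypothetical, and the sketch assumes the SKU, name, and seller lists from the same page are already aligned.

```python
import csv
import io

FIELDS = ["sku", "name", "price", "shop", "comments"]


def build_rows(skus, names, shops, price_by_sku, comments_by_sku):
    """Build one dict per product; a missing lookup becomes None."""
    rows = []
    for sku, name, shop in zip(skus, names, shops):
        rows.append({
            "sku": sku,
            "name": name,
            "price": price_by_sku.get(sku),
            "shop": shop,
            "comments": comments_by_sku.get(sku),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text; None is written as an empty cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up data: the price lookup for the second SKU returned nothing
rows = build_rows(
    ["1001", "1002"],
    ["PhoneA", "PhoneB"],
    ["ShopA", "ShopB"],
    {"1001": "1999.00"},
    {"1001": "120", "1002": "7"},
)
print(rows_to_csv(rows))
```

The same list of dicts can also be handed to `pandas.DataFrame(rows)` if a DataFrame is preferred over CSV.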