Crawler: Scraping the JD.com Product Search Page
阿新 • Published 2018-11-07
Difficulties:

1. JD's initial search response only renders 30 items. These can be taken straight from the page source, but note that the parsing rules may differ from page to page (the page structure varies, so the code has to branch on it).
2. Scrolling down fires an ajax request that returns the other 30 items, but reproducing it with requests needs a pile of parameters. I have not yet found a way to fill them in automatically, so for now they have to be rewritten by hand for each search.
3. Parsing the page information has many pitfalls: for example, some items have an incomplete price node, so the same page needs different parsing rules for different items.
4. The ajax parameters (path and referer in the headers, plus log_id and show_items in the query string) are pieced together from values that the first-30-items part of the code below extracts; a minimal extraction sketch follows this list.
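To make point 4 concrete, here is a minimal sketch of pulling log_id, cid, and the 30 SKU ids out of the first page and piecing them into the s_new.php URL. build_ajax_url is a hypothetical helper introduced here for illustration, and the regexes assume JD still embeds log_id:'...' and a LogParm block in the page's inline JS, which the full script below also relies on:

import re
import requests
from lxml import etree

def build_ajax_url(keyword):
    # Hypothetical helper, not part of the original script: fetch the first
    # search page and piece together the s_new.php URL for items 31-60.
    headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"}
    resp = requests.get("https://search.jd.com/Search",
                        params={"keyword": keyword, "enc": "utf-8", "page": 1},
                        headers=headers)
    resp.encoding = "utf-8"
    text = resp.text
    # Assumption: the page still embeds these values in its inline JS.
    log_id = re.findall(r"log_id:'(.*?)'", text, re.S)[0]
    cid = re.findall(r"LogParm.*?cid:(.*?),", text, re.S)[0]
    # The ajax endpoint expects the 30 already-shown SKU ids back as show_items.
    skus = etree.HTML(text).xpath('//*[@id="J_goodsList"]/ul/li/@data-sku')
    return ("https://search.jd.com/s_new.php?keyword=%s&enc=utf-8&qrst=1&rt=1"
            "&stop=1&vt=2&page=2&s=28&scrolling=y&log_id=%s&cid2=%s&tpl=3_M"
            "&show_items=%s" % (keyword, log_id, cid, ",".join(skus)))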
Code:
import requests
from lxml import etree
import re

keyword = input("Enter a product to search for: ")
print(type(keyword))
g = []
for i in range(1, 2):
    # JD maps "visible page" i to page=2*i-1: each visible page is really
    # two half-pages of 30 items each, so the URL parameter runs 1, 3, 5, ...
    url = "https://search.jd.com/Search?keyword={}&enc=utf-8&page={}".format(keyword, i * 2 - 1)
    header = {
        # ":authority": "search.jd.com",
        # ":method": "GET",
        # ":scheme": "https",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
    }
    html = requests.get(url=url, headers=header)
    html.encoding = "utf-8"
    # print(html.text)
    html = html.text
    newhtml = etree.HTML(html)
    print(newhtml)
    # log_id and cid sit in the page's inline JS; the ajax request below needs them.
    log_id = re.findall("log_id:'(.*?)'", html, re.S)[0]
    cid = re.findall("LogParm.*?cid:(.*?),", html, re.S)[0]
    print(cid)
    # SKU ids of the first 30 items, joined into show_items for the ajax request.
    sku_id = newhtml.xpath('//*[@id="J_goodsList"]/ul/li/@data-sku')
    p_list = ",".join('%s' % s for s in sku_id)
    print(sku_id)
    print(p_list)
    if len(newhtml.xpath('//*[@id="J_goodsList"]/ul/li//div[@class="p-price"]/strong/i/text()')) == 30:
        # Every item has a normal <i> price node: parse the whole list in one pass.
        img_url = newhtml.xpath('//div[@id="J_goodsList"]/ul/li//div[@class="p-img"]//img/@source-data-lazy-img')
        print(img_url)
        price = newhtml.xpath('//*[@id="J_goodsList"]/ul/li//div[@class="p-price"]/strong/i/text()')
        title = newhtml.xpath('//*[@id="J_goodsList"]/ul/li//div[contains(@class,"p-name")]/a/em/text()')  # .strip()
        # title = "".join(title)
        # title = title.split(":")
        product_url = newhtml.xpath('//*[@id="J_goodsList"]/ul/li//div[contains(@class,"p-name")]//a/@href')
        # commit = html.xpath('//*[@id="J_goodsList"]/ul/li[1]/div/div[4]/strong/text()')  # .strip()
        commit = newhtml.xpath('//*[@id="J_goodsList"]/ul/li//div[@class="p-commit"]/strong/a/text()')  # .strip()
        # shop_name = newhtml.xpath('//*[@id="J_goodsList"]/ul/li//div[@class="p-shop"]/span/a/text()')
        for i in range(30):
            list_1 = [keyword, "https:" + product_url[i], price[i], commit[i], "http:" + img_url[i]]
            print(list_1)
            g.append(list_1)
    else:
        # Some items are missing the <i> price node, so parse item by item.
        for n in range(30):
            if newhtml.xpath('//*[@id="J_goodsList"]/ul/li[%d]//div[@class="p-price"]/strong/i/text()' % (n + 1)):
                img_url = newhtml.xpath('//div[@id="J_goodsList"]/ul/li[%d]//div[@class="p-img"]//img/@source-data-lazy-img' % (n + 1))
                # print(img_url)
                price = newhtml.xpath('//*[@id="J_goodsList"]/ul/li[%d]//div[@class="p-price"]/strong/i/text()' % (n + 1))
                title = newhtml.xpath('//*[@id="J_goodsList"]/ul/li[%d]//div[contains(@class,"p-name")]/a/em/font[1]/text()' % (n + 1))  # .strip()
                # title = "".join(title)
                # title = title.split(":")
                product_url = newhtml.xpath('//*[@id="J_goodsList"]/ul/li[%d]//div[contains(@class,"p-name")]//a/@href' % (n + 1))
                commit = newhtml.xpath('//*[@id="J_goodsList"]/ul/li[%d]//div[@class="p-commit"]/strong/a/text()' % (n + 1))  # .strip()
                list_2 = [keyword, "https:" + product_url[0], price[0], commit[0], "http:" + img_url[0]]
                print(list_2)
                g.append(list_2)
            else:
                # No <i> price text: fall back to the data-price attribute on <strong>.
                img_url = newhtml.xpath('//div[@id="J_goodsList"]/ul/li[%d]//div[@class="p-img"]//img/@source-data-lazy-img' % (n + 1))
                # print(img_url)
                price = newhtml.xpath('//*[@id="J_goodsList"]/ul/li[%d]//div[@class="p-price"]/strong/@data-price' % (n + 1))
                title = newhtml.xpath('//*[@id="J_goodsList"]/ul/li[%d]//div[contains(@class,"p-name")]/a/em/font[1]/text()' % (n + 1))  # .strip()
                product_url = newhtml.xpath('//*[@id="J_goodsList"]/ul/li[%d]//div[contains(@class,"p-name")]//a/@href' % (n + 1))
                commit = newhtml.xpath('//*[@id="J_goodsList"]/ul/li[%d]//div[@class="p-commit"]/strong/a/text()' % (n + 1))  # .strip()
                list_3 = [keyword, "https:" + product_url[0], price[0], commit[0], "http:" + img_url[0]]
                print(list_3)
                g.append(list_3)

# Headers for the ajax request that loads the second batch of 30 items.
# path, referer, and Cookie were copied from a browser session and have to be
# rewritten by hand for a new search (see difficulty 2 above).
header1 = {
    'authority': 'search.jd.com',
    'method': 'GET',
    'scheme': 'https',
    "path": "/s_new.php?keyword=%E5%A5%B3%E9%9E%8B&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E5%A5%B3%E9%9E%8B&cid2=11731&page=2&s=27&scrolling=y&log_id=1541499650.39561&tpl=3_M",
    "referer": "https://search.jd.com/Search?keyword=%E5%A5%B3%E9%9E%8B&enc=utf-8&wq=%E5%A5%B3%E9%9E%8B&pvid=11f0d7bbd549489ea0ff9c18280008e3",
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
    'Cookie': '__jdv=122270672|direct|-|none|-|1541463047359; __jdc=122270672; __jdu=15414630473591816564566; PCSYCityID=country_2468; shshshfpa=06433229-fec9-71f1-ee80-5e4d2053f3b2-1541463051; xtest=1192.cf6b6759; ipLoc-djd=1-72-2799-0; shshshfpb=29d35c4b6bd0a4874aff06fde9a21bb415be00d7767a8105a1c51da983; rkv=V0000; qrsc=3; mt_xid=V2_52007VwMWV11dVVgeTB9eAW8DG1JaXFVfG04ebFVuVkJQVVFSRh5NSgsZYgERB0FQW1gYVRsJAjcFFFZZWQAKGHkaXQVuHxJSQVlVSx5AElgFbAcbYl9oUmocThBdBWAFE1RtWFdcGA%3D%3D; 3AB9D23F7A4B3C9B=3DVBCHQ2ZQDBE7WHQNBTXMZIG2LRSITXIEP5G2KLLX7F665PL45NH2F4HIBZ7GYW7TBTBPQEGC27GWLCFQV3UVL2EQ; _gcl_au=1.1.201025286.1541468212; shshshfp=72033013163d2ec74f4450e6f7114c1a; __jda=122270672.15414630473591816564566.1541463047.1541468202.1541471285.4; wlfstk_smdl=607dk6qcnoo82vqtspoz07g6hf0a8h6f; TrackID=1aDczZLZIOi53VMMgAJw6R6jU_JwW0j0Q3kPXr2DBxehnhKeoPkixGxlJ1XFOqdIqsW5IHw3HorqriaLnpP7qx_rF45aE522LK_J72xHV0XU; thor=A2F041014FF97AD3CBE36A18D7A197BD280A27D92876522948662EEBEF7FAAE9D4EE69FADC66DD46EC5FB2DA15E3A77A2B031AB32800A19FDD8BF76438EC46467045B795A654A74E62B5D2C1BEE34F0566FBA73C6ADB9AE74640B83FFF64DB25EF4E84890A70EC7A2A054562CA4A906EBC3E8B8DE2E06A32A741577FBDE89130428D846DC18B195004A8AFE75665A1DA43AAC5AC651F19D0CCB3FDF2AD68D88A; pinId=Gs1wY_18rJHCJb2AcSkHc7V9-x-f3wj7; pin=jd_605257796165b; unick=%E6%B6%9F%E6%BC%AA%E5%9E%84; ceshi3.com=103; _tp=wwla5rjsr%2FuWFA%2FmLBXQrei5UhIKee6ThwQXFShjs60%3D; _pst=jd_605257796165b; __jdb=122270672.4.15414630473591816564566|4.1541471285; shshshsID=eaceb1587ae4eaba8646ed1b230970b4_2_1541472058709'
}
# log_id, cid2, and show_items come from the values extracted on the first page.
url1 = "https://search.jd.com/s_new.php?keyword=%s&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&suggest=7.def.0.V00&cid2=%s&page=2&s=28&scrolling=y&log_id=%s&tpl=3_M&show_items=%s" % (str(keyword), cid, log_id, p_list)
html1 = requests.get(url=url1, headers=header1)
html1.encoding = "utf-8"
html2 = html1.text
# print(html2)
html3 = etree.HTML(html2)
product_url = html3.xpath('//div[contains(@class,"p-name")]//a/@href')
price = html3.xpath('//div[@class="p-price"]/strong/i/text()')
commit = html3.xpath('//div[@class="p-commit"]/strong/a/text()')
img_url = html3.xpath('//div[@class="p-img"]//img/@source-data-lazy-img')
title = html3.xpath('//div[contains(@class,"p-name")]/a/em')
for i in range(30):
    list_4 = [keyword, "https:" + product_url[i], price[i], commit[i], "http:" + img_url[i]]
    print(list_4)
    g.append(list_4)
print(url1)
print(g)
print(len(g))