Scraping mobile phone listings from JD.com (京東商城) with selenium + BeautifulSoup

阿新 • Published: 2018-11-23

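1. Preparation:

chromedriver download: mirror inside China: http://npm.taobao.org/mirrors/chromedriver/ VPN:
selenium
BeautifulSoup4 ("Beautiful Soup")

pip3 install selenium

pip3 install beautifulsoup4

The price lookup further down also calls requests and ujson, so install those as well:

pip3 install requests ujson

Installing chromedriver is left to the reader (a quick web search will do). Let's get straight to the point.

Before takeoff, make sure the preparation above is complete...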

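2. Analysing the page:

Target URL: https://www.jd.com/

Every item on the result page is stored in an element with class="gl-item".

Requirements:

Use selenium to drive the browser: locate the search input box, type "手機" (mobile phone) into it, and click the search button.
Use selenium to read the total page count from the returned result page and simulate manual paging.
Use BeautifulSoup to parse each page and extract the phone information.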
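To make the gl-item structure concrete, here is a minimal sketch that runs BeautifulSoup on a hand-written fragment. The markup is a simplified, hypothetical stand-in for JD's result list (only the class names gl-item and p-name p-name-type-2 come from this article, and the product ids are made up), so the real page will contain far more. The helper name list_items is also just for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the search-result markup.
SAMPLE_HTML = """
<ul>
  <li class="gl-item">
    <div class="p-name p-name-type-2"><a href="//item.jd.com/1.html"><em>Phone A</em></a></div>
  </li>
  <li class="gl-item">
    <div class="p-name p-name-type-2"><a href="//item.jd.com/2.html"><em>Phone B</em></a></div>
  </li>
</ul>
"""


def list_items(html):
    """Return (title, href) pairs for every .gl-item that has a title link."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.select(".gl-item"):
        anchor = item.select_one(".p-name.p-name-type-2 a")
        if anchor is None:
            # Item without a title link: nothing to collect.
            continue
        rows.append((anchor.get_text(strip=True), anchor["href"]))
    return rows


print(list_items(SAMPLE_HTML))
```

Note that the href values are protocol-relative (they start with //), which is why the link has to be completed before it can be opened.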

Entering the search from the home page

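# Imports used by the selenium snippets in this article
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Assumed setup: one Chrome session and a 10-second explicit wait
browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)


# Entry point: open the home page and run the search
def search():
    browser.get('https://www.jd.com/')
    try:
        # Locate the search box and the search button, type the keyword and click
        search_box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#key")))
        submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#search > div > div.form > button")))
        search_box.send_keys('手機')
        submit.click()

        # Read the total number of pages
        page = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > em:nth-child(1) > b')))
        return page.text
    # On timeout, call this function again and return its result
    except TimeoutException:
        return search()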
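Retrying by recursion never gives up: if the page keeps timing out, the function calls itself until Python's recursion limit is hit. As a hedged alternative, the sketch below retries a bounded number of times and then re-raises the last error. The helper name call_with_retries and its parameters are hypothetical, not part of selenium or of the original code.

```python
import time


def call_with_retries(func, retry_on, attempts=3, delay=1.0):
    """Call func(); if it raises an exception of type retry_on, try again.

    At most `attempts` calls are made in total. The last error is re-raised
    when every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # short pause before the next try
    raise last_error
```

With the except TimeoutException branch removed from search(), the call would look like call_with_retries(search, TimeoutException).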

Simulating paging once the search has finished

# Paging
def next_page(page_number):
    try:
        # Scroll to the bottom of the page so that all product entries are loaded
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(4)
        html = browser.page_source
        # Once page 100 is reached the "next page" button stops working, so end the program
        if page_number == 101:
            raise SystemExit
        # Locate the "next page" button and click it
        button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.pn-next > em')))
        button.click()
        # Wait until the last product item of the page is present (each page shows 60 items)
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#J_goodsList > ul > li:nth-child(60)")))
        # Confirm that the page switch succeeded
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#J_bottomPage > span.p-num > a.curr"), str(page_number)))
        return html
    except TimeoutException:
        return next_page(page_number)
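The fixed time.sleep(4) after a single scroll is a guess about how long lazy loading takes. One possible refinement, sketched here under the assumption that the page grows taller as more items load, is to keep scrolling until document.body.scrollHeight stops changing. The helper scroll_until_stable is hypothetical. It only relies on selenium's execute_script method, and the FakeBrowser class exists purely to exercise it without a real browser.

```python
import time


def scroll_until_stable(browser, pause=0.5, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `browser` is anything exposing an `execute_script` method.
    Returns the number of scroll rounds performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded items a moment to arrive
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no
        last_height = new_height
    return max_rounds


class FakeBrowser:
    """Stand-in used only to try the helper without a real browser."""

    def __init__(self, heights):
        self._heights = list(heights)
        self._index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[self._index]
        # A scroll call: pretend more content loaded, up to the final height.
        self._index = min(self._index + 1, len(self._heights) - 1)
        return None


print(scroll_until_stable(FakeBrowser([1000, 2000, 3000]), pause=0))
```

In next_page, a call such as scroll_until_stable(browser) could stand in for the scroll plus sleep, at the cost of a few extra script round trips.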

Parsing the a tags on the page

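from bs4 import BeautifulSoup


# Parse the a links on each result page
def parse_html(html):
    """
    Parse a product-list page and yield the detail-page URL of each product
    """

    soup = BeautifulSoup(html, 'html.parser')
    items = soup.select('.gl-item')
    for item in items:
        a = item.select('.p-name.p-name-type-2 a')
        if not a:
            continue
        link = str(a[0].attrs['href'])
        # Links that already contain "https:" are skipped; the remaining
        # protocol-relative links get the scheme prepended
        if 'https:' in link:
            continue
        link = "https:" + link
        yield link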
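parse_html completes the links by string concatenation. As an alternative sketch (not what the code above does), the standard library's urllib.parse.urljoin resolves protocol-relative, root-relative, and already absolute links in one call. The helper name absolute_link and the example ids are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.jd.com/"


def absolute_link(href, base=BASE_URL):
    """Resolve href against base; protocol-relative links inherit the base scheme."""
    return urljoin(base, href)


print(absolute_link("//item.jd.com/1.html"))
print(absolute_link("https://item.jd.com/2.html"))
```

Unlike parse_html, this helper does not skip links that already carry a scheme, so any filtering would still have to happen separately.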

Extracting the product id from the URL and fetching the price

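import requests
import ujson

# Assumed request headers; adjust as needed
headers = {'User-Agent': 'Mozilla/5.0'}

# Fetch the phone price; it is served from a separate address: https://p.3.cn/prices/mgets?skuIds=J_ + product_id
def get_price(product_id):
    url = 'https://p.3.cn/prices/mgets?skuIds=J_' + product_id
    # headers must be passed by keyword; the second positional argument of requests.get is params
    response = requests.get(url, headers=headers)
    return ujson.loads(response.text)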
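get_price expects a product id, but the step that cuts the id out of the link is not shown. A minimal sketch, assuming detail links have the form https://item.jd.com/<digits>.html (the helper name extract_product_id is hypothetical):

```python
import re

# Assumption: the product id is the run of digits right before ".html".
_PRODUCT_ID = re.compile(r"/(\d+)\.html")


def extract_product_id(link):
    """Return the numeric id from a detail-page link, or None if none is found."""
    match = _PRODUCT_ID.search(link)
    return match.group(1) if match else None


print(extract_product_id("https://item.jd.com/7336413.html"))
```

The result can then be passed on as get_price(extract_product_id(link)), after checking that it is not None.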

Opening an item's detail page

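# Open the detail page
def detail_page(link):
    """
    Open an item's detail page
    :param link: item link
    :return: html
    """
    try:
        browser.get(link)
        # Scroll to the bottom so that the lazily loaded parts of the page are rendered
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)
        html = browser.page_source
        return html
    except TimeoutException:
        # Retry and pass the result back to the caller
        return detail_page(link)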

# Extract the phone information from the detail page
def get_detail(html, result):
    """
    Collect the data on the detail page
    :param html: page source of the detail page
    :param result: price data returned by get_price
    :return: dict of specification names and values plus the two prices
    """
    dic = {}
    soup = BeautifulSoup(html, 'html.parser')
    item_list = soup.find_all('div', class_='Ptable-item')
    for item in item_list:
        contents1 = item.find_all('dt')
        contents2 = item.find_all('dd')
        # Pair each dt (parameter name) with the dd (value) at the same position
        for dt, dd in zip(contents1, contents2):
            dic[dt.string] = dd.string

    dic['price_jd'] = result[0]['p']
    dic['price_mk'] = result[0]['m']
    print(dic)
    return dic
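get_detail reads result[0]['p'] and result[0]['m'] directly, which raises an error if the price request comes back empty or malformed. The sketch below shows a more defensive way to pull the two prices out of the response text. The sample payload is hypothetical (only the keys p and m are taken from the code above), and extract_prices is an illustrative name.

```python
import json

# Hypothetical sample of the price endpoint's response body.
SAMPLE_RESPONSE = '[{"id": "J_100000000000", "p": "1999.00", "m": "2499.00"}]'


def _to_float(value):
    """Convert a price string to float, or return None if that is not possible."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def extract_prices(response_text):
    """Return (jd_price, market_price) as floats, or (None, None) if unavailable."""
    try:
        entries = json.loads(response_text)
    except ValueError:
        return None, None
    if not isinstance(entries, list) or not entries or not isinstance(entries[0], dict):
        return None, None
    first = entries[0]
    return _to_float(first.get("p")), _to_float(first.get("m"))


print(extract_prices(SAMPLE_RESPONSE))
```

Returning None instead of raising lets the crawl continue when a single price lookup fails; whether that is preferable depends on how the collected data is used afterwards.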

Beep beep.. that's basically the whole idea. As always, the full code is on GitHub: https://github.com/shinefairy/spider/tree/master/Crawler

end~
