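Scraping all JD product information with the Java crawler gecco

阿新 • Published: 2018-12-18

Reposted from: http://www.geccocrawler.com/demo-jd/

The gecco crawler

If you are not yet familiar with gecco, take a look at its GitHub home page. gecco is very easy to use: nine classes are enough to scrape all product information from JD.

Analyzing the JD site

To scrape all product information from the JD site, we first need to analyze the site. The JD (京東) site can be roughly divided into three levels: the home page links to product list pages by category, and each product on a list page has a detail page. So once we have found all the categories, we can scrape product information category by category.

Entry URL

Create AllSort, the HtmlBean class for the start page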

@Gecco(matchUrl="http://www.jd.com/allSort.aspx", pipelines={"consolePipeline", "allSortPipeline"})
public class AllSort implements HtmlBean {
    private static final long serialVersionUID = 665662335318691818L;
    @Request
    private HttpRequest request;
    // Mobile phones
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl")
    private List<Category> mobile;
    // Home appliances
    @HtmlField(cssPath=".category-items > div:nth-child(1) > div:nth-child(3) > div.mc > div.items > dl")
    private List<Category> domestic;
    public List<Category> getMobile() {
        return mobile;
    }
    public void setMobile(List<Category> mobile) {
        this.mobile = mobile;
    }
    public List<Category> getDomestic() {
        return domestic;
    }
    public void setDomestic(List<Category> domestic) {
        this.domestic = domestic;
    }
    public HttpRequest getRequest() {
        return request;
    }
    public void setRequest(HttpRequest request) {
        this.request = request;
    }
}

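As you can see, this example scrapes product information for two top-level categories, mobile phones and home appliances. Each top-level category contains several subcategories, represented by List<Category>. gecco supports nested Beans, which map well onto the structure of an HTML page. Category holds the subcategory information, and HrefBean is a shared link Bean.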

public class Category implements HtmlBean {
    private static final long serialVersionUID = 3018760488621382659L;
    @Text
    @HtmlField(cssPath="dt a")
    private String parentName;
    @HtmlField(cssPath="dd a")
    private List<HrefBean> categorys;
    public String getParentName() {
        return parentName;
    }
    public void setParentName(String parentName) {
        this.parentName = parentName;
    }
    public List<HrefBean> getCategorys() {
        return categorys;
    }
    public void setCategorys(List<HrefBean> categorys) {
        this.categorys = categorys;
    }
}

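A tip for obtaining an element's cssPath

The hard part of the two classes above is obtaining the cssPath, so here is a tip. Open the page you want to scrape in Chrome and press F12 to enter developer mode. Select the element you want, then in the panel on the right side of the browser right-click the selected element and choose Copy, then Copy selector, to obtain its cssPath: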

body > div:nth-child(5) > div.main-classify > div.list > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2) > div.mc > div.items

If you are familiar with jQuery selectors, you can shorten this; and since we only want the dl elements, it simplifies to:

.category-items > div:nth-child(1) > div:nth-child(2) > div.mc > div.items > dl
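
The manual simplification above can also be sketched in code. This is only an illustration using the JDK; SelectorShortener and its shorten method are hypothetical names and are not part of gecco. It assumes the copied selector contains a step with a distinctive class (here category-items) that is stable enough to serve as the new starting point.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: trims a selector copied from Chrome down to a shorter one. */
class SelectorShortener {

    /**
     * Drops every step before the first step that contains anchorClass,
     * reduces that step to the anchor class alone, and appends childTag.
     */
    static String shorten(String copied, String anchorClass, String childTag) {
        List<String> steps = Arrays.asList(copied.trim().split("\\s*>\\s*"));
        for (int i = 0; i < steps.size(); i++) {
            if (steps.get(i).contains("." + anchorClass)) {
                String head = "." + anchorClass;
                String tail = String.join(" > ", steps.subList(i + 1, steps.size()));
                String base = tail.isEmpty() ? head : head + " > " + tail;
                return base + " > " + childTag;
            }
        }
        throw new IllegalArgumentException("anchor class not found: " + anchorClass);
    }

    public static void main(String[] args) {
        String copied = "body > div:nth-child(5) > div.main-classify > div.list"
                + " > div.category-items.clearfix > div:nth-child(1) > div:nth-child(2)"
                + " > div.mc > div.items";
        // Prints the simplified selector used in AllSort for the mobile category.
        System.out.println(shorten(copied, "category-items", "dl"));
    }
}
```

A shorter selector that starts from a named class tends to survive small layout changes better than one that counts children all the way down from body.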

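Writing the business-logic class for AllSort

After AllSort has been populated, we need to process it. Here we do not persist the category information; we only extract the category links so that the product lists can be scraped next. Here is the code: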

@PipelineName("allSortPipeline")
public class AllSortPipeline implements Pipeline<AllSort> {
    @Override
    public void process(AllSort allSort) {
        List<Category> categorys = allSort.getMobile();
        for(Category category : categorys) {
            List<HrefBean> hrefs = category.getCategorys();
            for(HrefBean href : hrefs) {
                String url = href.getUrl()+"&delivery=1&page=1&JL=4_10_0&go=0";
                HttpRequest currRequest = allSort.getRequest();
                SchedulerContext.into(currRequest.subRequest(url));
            }
        }
    }
}

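@PipelineName defines the name of the pipeline, which is referenced in AllSort's @Gecco annotation. After gecco has downloaded the page and populated the Bean, it calls the pipelines listed in @Gecco one by one. The suffix "&delivery=1&page=1&JL=4_10_0&go=0" is appended to each sub-link so that only JD self-operated, in-stock products are scraped. SchedulerContext.into() puts the link into the queue to wait for further scraping.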
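
The pipeline appends the filter suffix with a fixed "&", which works because every category link already carries a ?cat=... query. As a minimal sketch, assuming a link might arrive without a query string, the separator can be chosen at run time. ListUrlBuilder and withFilters are hypothetical names used only for this illustration.

```java
/** Hypothetical helper: appends the list-page filters used in this article to a category link. */
class ListUrlBuilder {

    // Filter parameters from the article: JD self-operated products that are in stock.
    static final String FILTERS = "delivery=1&page=1&JL=4_10_0&go=0";

    /** Uses '&' when the link already has a query string, '?' otherwise. */
    static String withFilters(String href) {
        String separator = href.contains("?") ? "&" : "?";
        return href + separator + FILTERS;
    }

    public static void main(String[] args) {
        System.out.println(withFilters("http://list.jd.com/list.html?cat=9987,653,659"));
    }
}
```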

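Scraping product list information

AllSortPipeline has extracted the product list links that need further scraping. Their format is: http://list.jd.com/list.html?cat=9987,653,659&delivery=1&page=1&JL=4_10_0&go=0. So we create a Bean for the product list, ProductList: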

@Gecco(matchUrl="http://list.jd.com/list.html?cat={cat}&delivery={delivery}&page={page}&JL={JL}&go=0", pipelines={"consolePipeline", "productListPipeline"})
public class ProductList implements HtmlBean {
    private static final long serialVersionUID = 4369792078959596706L;
    @Request
    private HttpRequest request;
    /**
     * Details of each list item, including title, price, detail page URL, etc.
     */
    @HtmlField(cssPath="#plist .gl-item")
    private List<ProductBrief> details;
    /**
     * Current page of the product list
     */
    @Text
    @HtmlField(cssPath="#J_topPage > span > b")
    private int currPage;
    /**
     * Total number of pages in the product list
     */
    @Text
    @HtmlField(cssPath="#J_topPage > span > i")
    private int totalPage;
    public List<ProductBrief> getDetails() {
        return details;
    }
    public void setDetails(List<ProductBrief> details) {
        this.details = details;
    }
    public int getCurrPage() {
        return currPage;
    }
    public void setCurrPage(int currPage) {
        this.currPage = currPage;
    }
    public int getTotalPage() {
        return totalPage;
    }
    public void setTotalPage(int totalPage) {
        this.totalPage = totalPage;
    }
    public HttpRequest getRequest() {
        return request;
    }
    public void setRequest(HttpRequest request) {
        this.request = request;
    }
}

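currPage and totalPage are the pagination information on the page and support the paginated scraping that follows. The ProductBrief object is a brief description of a product, mainly the title, preview image, detail page URL, and so on.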

public class ProductBrief implements HtmlBean {
    private static final long serialVersionUID = -377053120283382723L;
    @Attr("data-sku")
    @HtmlField(cssPath=".j-sku-item")
    private String code;
    @Text
    @HtmlField(cssPath=".p-name> a > em")
    private String title;
    @Image({"data-lazy-img", "src"})
    @HtmlField(cssPath=".p-img > a > img")
    private String preview;
    @Href(click=true)
    @HtmlField(cssPath=".p-name > a")
    private String detailUrl;
    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    public String getPreview() {
        return preview;
    }
    public void setPreview(String preview) {
        this.preview = preview;
    }
    public String getDetailUrl() {
        return detailUrl;
    }
    public void setDetailUrl(String detailUrl) {
        this.detailUrl = detailUrl;
    }
    public String getCode() {
        return code;
    }
    public void setCode(String code) {
        this.code = code;
    }
}

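The click attribute of @Href(click=true) deserves an explanation. As the name suggests, it marks a link that we want gecco to "click" and keep scraping. gecco automatically adds links marked with click=true to the download queue, so there is no need to call SchedulerContext.into() manually.

Writing the business logic for ProductList

After ProductList has been scraped, it usually needs to be persisted, that is, the basic product information is stored in a database. There are many ways to do that and this example does not cover them. gecco supports Spring integration, so pipelines can be developed with Spring; see the gecco-spring project. This example simply prints to the console. ProductList's business logic has another important job: pagination. List pages usually span many pages, and to scrape them all we need to put the link to the next page into the crawl queue.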

@PipelineName("productListPipeline")
public class ProductListPipeline implements Pipeline<ProductList> {
    @Override
    public void process(ProductList productList) {
        HttpRequest currRequest = productList.getRequest();
        // Continue with the next page
        int currPage = productList.getCurrPage();
        int nextPage = currPage + 1;
        int totalPage = productList.getTotalPage();
        if(nextPage <= totalPage) {
            String nextUrl = "";
            String currUrl = currRequest.getUrl();
            if(currUrl.indexOf("page=") != -1) {
                nextUrl = StringUtils.replaceOnce(currUrl, "page=" + currPage, "page=" + nextPage);
            } else {
                nextUrl = currUrl + "&" + "page=" + nextPage;
            }
            SchedulerContext.into(currRequest.subRequest(nextUrl));
        }
    }
}

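JD list pages use the page parameter to specify the page number, and we paginate by replacing that parameter. At this point, the list information for all products can be scraped.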
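
The page-replacement step can be isolated and tried without gecco. The following is a sketch using only the JDK; Pager and nextPageUrl are hypothetical names. Unlike the pipeline above, it matches page only when it is a whole query parameter (preceded by ? or &), which is an added precaution rather than something the original code does.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: computes the URL of the next list page. */
class Pager {

    // The digits of a "page" query parameter that directly follows '?' or '&'.
    private static final Pattern PAGE_VALUE = Pattern.compile("(?<=[?&]page=)\\d+");

    /** Returns the next page URL, or empty when currPage is already the last page. */
    static Optional<String> nextPageUrl(String currUrl, int currPage, int totalPage) {
        int nextPage = currPage + 1;
        if (nextPage > totalPage) {
            return Optional.empty();
        }
        Matcher m = PAGE_VALUE.matcher(currUrl);
        if (m.find()) {
            // Replace only the page number, leaving the other parameters untouched.
            return Optional.of(m.replaceFirst(String.valueOf(nextPage)));
        }
        // No page parameter yet: append one.
        String separator = currUrl.contains("?") ? "&" : "?";
        return Optional.of(currUrl + separator + "page=" + nextPage);
    }

    public static void main(String[] args) {
        String url = "http://list.jd.com/list.html?cat=9987,653,659&delivery=1&page=1&JL=4_10_0&go=0";
        System.out.println(nextPageUrl(url, 1, 3).orElse("(last page)"));
    }
}
```

Returning an empty Optional on the last page keeps the stop condition in one place, so the caller only enqueues a request when a value is present.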

Scraping the detail page

@Gecco(matchUrl="http://item.jd.com/{code}.html", pipelines="consolePipeline")
public class ProductDetail implements HtmlBean {
    private static final long serialVersionUID = -377053120283382723L;
    /**
     * Product code
     */
    @RequestParameter
    private String code;
    /**
     * Title
     */
    @Text
    @HtmlField(cssPath="#name > h1")
    private String title;
    /**
     * Product price, fetched via ajax
     */
    @Ajax(url="http://p.3.cn/prices/get?skuIds=J_[code]")
    private JDPrice price;
    /**
     * Product promotional text
     */
    @Ajax(url="http://cd.jd.com/promotion/v2?skuId={code}&area=1_2805_2855_0&cat=737%2C794%2C798")
    private JDad jdAd;
    /*
     * Product specifications
     */
    @HtmlField(cssPath="#product-detail-2")
    private String detail;
    public JDPrice getPrice() {
        return price;
    }
    public void setPrice(JDPrice price) {
        this.price = price;
    }
    public String getTitle() {
        return title;
    }
    public void setTitle(String title) {
        this.title = title;
    }
    public JDad getJdAd() {
        return jdAd;
    }
    public void setJdAd(JDad jdAd) {
        this.jdAd = jdAd;
    }
    public String getDetail() {
        return detail;
    }
    public void setDetail(String detail) {
        this.detail = detail;
    }
    public String getCode() {
        return code;
    }
    public void setCode(String code) {
        this.code = code;
    }
}

@RequestParameter retrieves the URL variable {code} defined in @Gecco.
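
To make the idea of URL variables concrete, here is a small sketch of how a template with {name} placeholders can be matched against a URL and its variables captured. It is written with the JDK only and is not gecco's implementation; UrlTemplate and match are hypothetical names, and the assumption that a variable never contains '&', '/' or '?' is mine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: matches URLs against a template containing {name} placeholders. */
class UrlTemplate {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    private final Pattern regex;
    private final List<String> names = new ArrayList<>();

    UrlTemplate(String template) {
        StringBuilder sb = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text between placeholders is quoted so '.' and '?' match themselves.
            sb.append(Pattern.quote(template.substring(last, m.start())));
            // Assumption: a variable value contains no '&', '/' or '?'.
            sb.append("([^&/?]+)");
            names.add(m.group(1));
            last = m.end();
        }
        sb.append(Pattern.quote(template.substring(last)));
        regex = Pattern.compile(sb.toString());
    }

    /** Returns the captured variables, or empty if the URL does not match the template. */
    Optional<Map<String, String>> match(String url) {
        Matcher m = regex.matcher(url);
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> vars = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            vars.put(names.get(i), m.group(i + 1));
        }
        return Optional.of(vars);
    }

    public static void main(String[] args) {
        UrlTemplate detail = new UrlTemplate("http://item.jd.com/{code}.html");
        System.out.println(detail.match("http://item.jd.com/1861098.html"));
    }
}
```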

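@Ajax represents an ajax request in the page. JD's product prices and promotional text are both loaded asynchronously through ajax requests. gecco supports asynchronous ajax requests: you specify the URL of the ajax request, and variables in the URL can be given in two ways.

One is curly braces {}, which retrieve a request parameter, similar to @RequestParameter. In the example, the {code} used to fetch the promotional text is the code in matchUrl="http://item.jd.com/{code}.html";

The other is square brackets [], which can retrieve any property of the bean. In the example, the [code] used to fetch the price is the field private String code;.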
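
The two placeholder styles can be illustrated with a small resolver that fills {name} from the request parameters and [name] from the bean's properties. This is a sketch of the substitution rule as described above, not gecco's actual code; AjaxUrlResolver and resolve are hypothetical names, and the bean's properties are passed in as a plain map to keep the example self-contained.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: resolves {request} and [bean] placeholders in an ajax URL. */
class AjaxUrlResolver {

    // Group 1 captures a {name} placeholder, group 2 a [name] placeholder.
    private static final Pattern VAR = Pattern.compile("\\{(\\w+)\\}|\\[(\\w+)\\]");

    static String resolve(String template,
                          Map<String, String> requestParams,
                          Map<String, String> beanFields) {
        Matcher m = VAR.matcher(template);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(template, last, m.start());
            boolean fromRequest = m.group(1) != null;
            String name = fromRequest ? m.group(1) : m.group(2);
            String value = (fromRequest ? requestParams : beanFields).get(name);
            if (value == null) {
                throw new IllegalArgumentException("no value for " + m.group());
            }
            out.append(value);
            last = m.end();
        }
        out.append(template, last, template.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> code = Collections.singletonMap("code", "1861098");
        Map<String, String> none = Collections.emptyMap();
        // [code] is read from the bean, {code} from the request parameters.
        System.out.println(resolve("http://p.3.cn/prices/get?skuIds=J_[code]", none, code));
        System.out.println(resolve(
                "http://cd.jd.com/promotion/v2?skuId={code}&area=1_2805_2855_0&cat=737%2C794%2C798",
                code, none));
    }
}
```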

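Extracting elements from JSON data

The product price is fetched via ajax, and ajax generally returns JSON, so we need to extract the data from it. First we define the Bean for the price: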

public class JDPrice implements JsonBean {
    private static final long serialVersionUID = -5696033709028657709L;
    @JSONPath("$.id[0]")
    private String code;
    @JSONPath("$.p[0]")
    private float price;
    @JSONPath("$.m[0]")
    private float srcPrice;
    public float getPrice() {
        return price;
    }
    public void setPrice(float price) {
        this.price = price;
    }
    public float getSrcPrice() {
        return srcPrice;
    }
    public void setSrcPrice(float srcPrice) {
        this.srcPrice = srcPrice;
    }
    public String getCode() {
        return code;
    }
    public void setCode(String code) {
        this.code = code;
    }
}

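The JSON we get for the product price looks like this: [{"id":"J_1861098","p":"6488.00","m":"7488.00"}]. It is an array because this interface can actually fetch prices for multiple products in one call. JSON data is extracted with the @JSONPath annotation, which uses fastjson's JSONPath syntax.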
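
To see what the three fields contain, the sample payload can be picked apart with the JDK alone. The sketch below uses a regular expression purely for illustration; it is not a general JSON parser and not how gecco or fastjson work internally. PricePayload, field and price are hypothetical names. Note that the prices arrive as strings; the sketch parses them into BigDecimal, which is my choice for exact decimal values, whereas the Bean above maps them to float.

```java
import java.math.BigDecimal;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: reads string fields from the first object of the price payload. */
class PricePayload {

    /** Returns the first string value stored under key; only handles flat "key":"value" pairs. */
    static String field(String json, String key) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(key) + "\"\\s*:\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("missing key: " + key);
        }
        return m.group(1);
    }

    /** Current price ("p") as an exact decimal. */
    static BigDecimal price(String json) {
        return new BigDecimal(field(json, "p"));
    }

    public static void main(String[] args) {
        String json = "[{\"id\":\"J_1861098\",\"p\":\"6488.00\",\"m\":\"7488.00\"}]";
        System.out.println(field(json, "id") + " costs " + price(json)
                + ", list price " + field(json, "m"));
    }
}
```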

JDad is scraped in a similar way. Here is the code for the Bean:

public class JDad implements JsonBean {
    private static final long serialVersionUID = 2250225801616402995L;
    @JSONPath("$.ads[0].ad")
    private String ad;
    @JSONPath("$.ads")
    private List<JSONObject> ads;
    public String getAd() {
        return ad;
    }
    public void setAd(String ad) {
        this.ad = ad;
    }
    public List<JSONObject> getAds() {
        return ads;
    }
    public void setAds(List<JSONObject> ads) {
        this.ads = ads;
    }
}

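Learning to analyze ajax requests

Crawlers currently handle ajax-driven page content in two mainstream ways:

One is to emulate a browser and render the page completely, for example with htmlunit. The problem with this approach is low efficiency, because every ajax request in the page is issued and all the JavaScript has to be parsed. gecco can support this approach through a custom downloader.
The other is to execute only the ajax requests you need. This requires the developer to analyze the ajax requests in the page and find their URLs, such as the one for JD product prices, @Ajax(url="http://p.3.cn/prices/mgets?skuIds=J_[code]"). This URL may also change later.

Both approaches have their pros and cons, and gecco supports both through extensions. I still prefer the second.

Now for how to analyze the ajax requests in a page. Again we use Chrome's developer mode, where the Network tab shows every request the page makes. The request URL is: http://p.3.cn/prices/get?type=1&area=1_2805_2855&pdtk=&pduid=836516317&pdpin=&pdbp=0&skuid=J_1861098&callback=cnp. If we remove the other parameters and keep only the product code, it still works, so http://p.3.cn/prices/get?skuid=J_1861098 is the URL we need to request.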
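
Trimming a captured request down to the parameters that matter is easy to script when there are many URLs to try. The following is a minimal sketch using only the JDK; QueryTrimmer and keepOnly are hypothetical names, and the sample URL assumes the underscore-separated values shown in the price payload (J_1861098). Whether the trimmed URL still returns data has to be checked against the live service.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.StringJoiner;

/** Hypothetical helper: keeps only selected query parameters of a URL. */
class QueryTrimmer {

    static String keepOnly(String url, String... keep) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // nothing to trim
        }
        Set<String> wanted = new LinkedHashSet<>(Arrays.asList(keep));
        StringJoiner kept = new StringJoiner("&");
        for (String pair : url.substring(q + 1).split("&")) {
            // The parameter name is everything before the first '='.
            String name = pair.split("=", 2)[0];
            if (wanted.contains(name)) {
                kept.add(pair);
            }
        }
        String base = url.substring(0, q);
        return kept.length() == 0 ? base : base + "?" + kept;
    }

    public static void main(String[] args) {
        String captured = "http://p.3.cn/prices/get?type=1&area=1_2805_2855&pdtk="
                + "&pduid=836516317&pdpin=&pdbp=0&skuid=J_1861098&callback=cnp";
        System.out.println(keepOnly(captured, "skuid"));
    }
}
```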

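Some other useful gecco features

gecco supports extracting global JavaScript variables defined in a page, such as variables declared with var.
gecco supports distributed crawling, implemented by managing startRequest through redis.

Source code

The full source code can be downloaded from gecco's GitHub repository; the code is in the package under src/test/java/com/geccocrawler/gecco/demo/jd. If you find any bug while using it, pull requests are welcome, or you can ask through an Issue. You can of course also leave a comment on the blog.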