Crawler Study 5: Analyzing and Parsing JSON Data

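This post looks at the JSON data format and how to parse JSON in a Java web crawler. The tools commonly used to work with JSON in Java are org.json, Gson, and Fastjson. Here we deal with crawler responses that come back as JSON and see how to process them.

Web crawlers run into JSON data all the time. When the requested page wraps the JSON in something else, the response has to be preprocessed first so that it becomes standard JSON. For example, a response may look like this: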

jQuery18305886476962892728_1531402823026({
    "id":"07",
    "language": "C++",
    "edition": "second",
    "author": "E.Balagurusamy"
})
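A string like this, which contains JSON inside a callback wrapper, has to be preprocessed by trimming its head and tail. In Java, the string above can be handled as follows:

// Build the JSONP string
String json = "jQuery18305886476962892728_1531402823026({\"id\":\"07\",\"language\": \"C++\",\"edition\": \"second\",\"author\": \"E.Balagurusamy\"})";
// Trim head and tail: keep the text between the first "(" and the last ")",
// so that a "(" inside the payload does not cut the JSON short
String body = json.substring(json.indexOf('(') + 1, json.lastIndexOf(')'));
System.out.println(body);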
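The trimming step can also be wrapped in a small reusable helper. The sketch below is only an illustration and is not part of the original code: the class name `JsonpUnwrap` and the method `unwrap` are hypothetical, and the regular expression assumes the wrapper has the shape `callbackName( ... )` with an optional trailing semicolon. Input that does not match is returned unchanged (trimmed), on the assumption that it is already plain JSON.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not from the original post: removes a JSONP callback wrapper.
class JsonpUnwrap {

    // callback name, "(", payload (greedy, so it runs to the last ")"), ")", optional ";"
    private static final Pattern JSONP =
            Pattern.compile("^\\s*[\\w$.]+\\s*\\((.*)\\)\\s*;?\\s*$", Pattern.DOTALL);

    static String unwrap(String body) {
        Matcher m = JSONP.matcher(body);
        // No wrapper found: assume the response is already plain JSON.
        return m.matches() ? m.group(1).trim() : body.trim();
    }

    public static void main(String[] args) {
        String raw = "jQuery18305886476962892728_1531402823026("
                + "{\"id\":\"07\",\"language\": \"C++\"})";
        System.out.println(unwrap(raw));
    }
}
```

Because the payload group is greedy, parentheses that appear inside JSON string values stay part of the payload.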

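A site for validating JSON: JSON validator

The basics, such as converting a Java object to JSON, a JSON object to a Java object, a JSON string to a Java object, and a JSON string to a JSON object, are well covered elsewhere online, so they are not repeated here.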

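A practical crawler example

Here is an example that uses a real website:

Step 1: capture the network traffic and find the real URL that serves the comments

Open the developer tools (F12):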

{
	"status": 200,
	"data": {
		"total": 7,
		"data": {
			"_30376977": {
				"CommentId": 30376977,
				"ItemId": 853171,
				"UserId": 4003739,
				"ReplyId": 0,
				"Type": 0,
				"AtUserId": 0,
				"Content": "漂亮美味",
				"ImageNum": 0,
				"Platform": "iPhone客戶端",
				"Status": 1,
				"SubCommentCnt": 1,
				"OpenDataId": "",
				"OpenUserName": "yxeg5",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-4003739\/",
				"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/9b\/17\/4003739_70.jpg",
				"CreateTime": "2016-02-15 12:22",
				"Vip": "<a href=\"http:\/\/www.haodou.com\/recipe\/expert\/apply\" target=\"_blank\"><i class=\"ico12 mod_v\"><\/i><\/a> ",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-513793.html\" target=\"_blank\">【第119期】好問豆答:蜜三刀的製作技巧<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_29589112": {
				"CommentId": 29589112,
				"ItemId": 853171,
				"UserId": 9235790,
				"ReplyId": 0,
				"Type": 0,
				"AtUserId": 0,
				"Content": "紫菜是乾的還是",
				"ImageNum": 0,
				"Platform": "Android客戶端",
				"Status": 1,
				"SubCommentCnt": 1,
				"OpenDataId": "",
				"OpenUserName": "喻平凶",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-9235790\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/4e\/ed\/9235790_70.jpg",
				"CreateTime": "2015-12-26 09:36",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_0\"><\/span> 新手<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近釋出了菜譜專輯:<\/span> <a href=\"http:\/\/www.haodou.com\/recipe\/album\/9061657\/\" target=\"_blank\">炒飯<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_29407043": {
				"CommentId": 29407043,
				"ItemId": 853171,
				"UserId": 3342562,
				"ReplyId": 0,
				"Type": 0,
				"AtUserId": 0,
				"Content": "超市有乾貝和海蠣賣?",
				"ImageNum": 0,
				"Platform": "好豆網",
				"Status": 1,
				"SubCommentCnt": 1,
				"OpenDataId": "",
				"OpenUserName": "秋玉的美",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-3342562\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e2\/00\/3342562_70.jpg",
				"CreateTime": "2015-12-05 15:54",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_1\"><\/span> 豆芽<\/span>",
				"LastAct": "",
				"PlatformUrl": "http:\/\/www.haodou.com\/",
				"Admin": "non"
			},
			"_28188378": {
				"CommentId": 28188378,
				"ItemId": 853171,
				"UserId": 8008371,
				"ReplyId": 0,
				"Type": 0,
				"AtUserId": 0,
				"Content": "乾貝蝦米一般都是鹹的,要用水多泡會,泡軟",
				"ImageNum": 0,
				"Platform": "Android客戶端",
				"Status": 1,
				"SubCommentCnt": 1,
				"OpenDataId": "",
				"OpenUserName": "月上荒城6",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-8008371\/",
				"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/b3\/32\/8008371_70.jpg",
				"CreateTime": "2015-07-09 12:51",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_1\"><\/span> 豆芽<\/span>",
				"LastAct": "",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_27165505": {
				"CommentId": 27165505,
				"ItemId": 853171,
				"UserId": 3837,
				"ReplyId": 0,
				"Type": 0,
				"AtUserId": 0,
				"Content": "食材豐富--口感也豐富!",
				"ImageNum": 0,
				"Platform": "好豆網",
				"Status": 1,
				"SubCommentCnt": 3,
				"OpenDataId": "",
				"OpenUserName": "愛跳舞的老太",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-3837\/",
				"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/fd\/0e\/3837_70.jpg",
				"CreateTime": "2015-02-26 09:42",
				"Vip": "<a href=\"http:\/\/www.haodou.com\/recipe\/expert\/apply\" target=\"_blank\"><i class=\"ico12 mod_v\"><\/i><\/a> ",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-556709.html\" target=\"_blank\">【深秋食語】在朋友單位吃午餐<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/",
				"Admin": "non"
			},
			"_30383571": {
				"CommentId": 30383571,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 30376977,
				"Type": 0,
				"AtUserId": 4003739,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-4003739\/\" target=\"_blank\">yxeg5<\/a> 感謝你的分享。",
				"ImageNum": 0,
				"Platform": "Android客戶端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪紅",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2016-02-15 21:39",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_29596058": {
				"CommentId": 29596058,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 29589112,
				"Type": 0,
				"AtUserId": 9235790,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-9235790\/\" target=\"_blank\">喻平凶<\/a> 是乾的,要衝洗一下。",
				"ImageNum": 0,
				"Platform": "Android客戶端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪紅",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2015-12-26 23:15",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_29407675": {
				"CommentId": 29407675,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 29407043,
				"Type": 0,
				"AtUserId": 3342562,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-3342562\/\" target=\"_blank\">秋玉的美<\/a> 商店裡有網上也有。",
				"ImageNum": 0,
				"Platform": "Android客戶端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪紅",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2015-12-05 17:11",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_28189130": {
				"CommentId": 28189130,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 28188378,
				"Type": 0,
				"AtUserId": 8008371,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-8008371\/\" target=\"_blank\">月上荒城6<\/a> 我買的這種不是那種很硬的,很多鹽的,要根據情況而定。",
				"ImageNum": 0,
				"Platform": "Android客戶端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪紅",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2015-07-09 15:19",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_27729797": {
				"CommentId": 27729797,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 27165505,
				"Type": 0,
				"AtUserId": 7566907,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-7566907\/\" target=\"_blank\">haodou8704818142<\/a> 我在廈門,漳州吃的,每一次都不是不一樣的。都有紫菜",
				"ImageNum": 0,
				"Platform": "Android客戶端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪紅",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2015-05-07 01:53",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_27727527": {
				"CommentId": 27727527,
				"ItemId": 853171,
				"UserId": 7566907,
				"ReplyId": 27165505,
				"Type": 0,
				"AtUserId": 489704,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-489704\/\" target=\"_blank\">挪紅<\/a> 和我們的配料不一樣",
				"ImageNum": 0,
				"Platform": "Android客戶端",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "haodou8704818142",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-7566907\/",
				"OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/3b\/76\/7566907_70.jpg",
				"CreateTime": "2015-05-06 19:23",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_0\"><\/span> 新手<\/span>",
				"LastAct": "",
				"PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php",
				"Admin": "non"
			},
			"_27166153": {
				"CommentId": 27166153,
				"ItemId": 853171,
				"UserId": 489704,
				"ReplyId": 27165505,
				"Type": 0,
				"AtUserId": 3837,
				"Content": "@<a href=\"http:\/\/www.haodou.com\/cook-3837\/\" target=\"_blank\">愛跳舞的老太<\/a> 姐是這兒的人,不知我這樣做對嗎?",
				"ImageNum": 0,
				"Platform": "好豆網",
				"Status": 1,
				"SubCommentCnt": 0,
				"OpenDataId": "",
				"OpenUserName": "挪紅",
				"OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/",
				"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg",
				"CreateTime": "2015-02-26 11:26",
				"Vip": "",
				"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>",
				"LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>",
				"PlatformUrl": "http:\/\/www.haodou.com\/",
				"Admin": "non"
			}
		},
		"avatar": "",
		"page_nav": "<a href='javaScript:;' page='1' id='' class='cur'>1<\/a><a href='javaScript:;' page='2' id=''>2<\/a><span class='next'><a href='javaScript:;' page='2' id='' class='next'>下一頁<\/a><\/span>",
		"more": null,
		"offset": 0
	},
	"message": ""
}

Step 3: pick the fields you need from the response data and wrap them in a JavaBean

package com.jack.spiderone.entity;

import lombok.Data;

/**
 * create by jack 2018/11/18
 *
 * @author jack
 * @date: 2018/11/18 11:26
 * @Description:
 */
@Data
public class CommentModel {

    /**
     * ID of the comment
     */
    private String CommentId;
    //ID of the recipe being commented on
    private String ItemId;
    //text of the comment
    private String Content;
    //time the comment was posted
    private String CreateTime;
    //name of the comment author
    private String OpenUserName;
}

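Step 4:

Use HttpClient or another URL request tool to fetch the string returned by the real URL. Then trim the head and tail of the fetched string in code so that it becomes JSON that is easy to parse (regular expressions are often used for this).

Code: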

package com.jack.spiderone.service;

import com.alibaba.fastjson.JSONObject;
import com.jack.spiderone.entity.CommentModel;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.List;

/**
 * create by jack 2018/11/18
 *
 * @author jack
 * @date: 2018/11/18 11:35
 * @Description:
 */
public class CookBookSpider {

    /**
     * Fetch the JSON string from a URL
     * @param url
     * @return
     */
    public static String getJson(String url) throws IOException {
        //initialize the HttpClient
        HttpClient httpClient = HttpClients.custom().build();
        //request method to use
        HttpGet httpget = new HttpGet(url);
        //send the GET request
        HttpResponse response = httpClient.execute(httpget);
        //get the response entity
        HttpEntity httpEntity = response.getEntity();
        //read it as a string (the charset must be specified)
        String entity = EntityUtils.toString(httpEntity, "gbk");
        //release the content stream
        EntityUtils.consume(httpEntity);
        //return the JSON string
        return entity;
    }


    /**
     * Parse the JSON string into a list of objects
     * @param jsonStr
     * @return
     */
    public static List<CommentModel> parseData(String jsonStr){
        //convert Unicode escapes to Chinese characters
        jsonStr = decode(jsonStr);
        //use splitting and regex replacement to turn the response into a standard JSON array
        String jsondata  = "{"+jsonStr.split("data\":\\{")[2].split("\"avatar")[0].replaceAll("\"_\\d*[0-9]\":", "");
        jsonStr = jsondata.substring(0, jsondata.length()-2);
        //parse the JSON array into a list of objects
        List<CommentModel>  datalis = JSONObject.parseArray("["+jsonStr.substring(1,jsonStr.length())+"]", CommentModel.class);
        return datalis;
    }

    public static void spiderCookBook() throws IOException {
        //URL to crawl
        String url = "http://www.haodou.com/comment.php?do=list&callback=jQuery18304706379730622201_1542510303429&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common&_=1542510303816";
        //fetch the JSON data
        String jsonstring = getJson(url);
        //parse the JSON data
        List<CommentModel> datalist = parseData(jsonstring);
        //print the data
        for (CommentModel comm : datalist) {
            System.out.println(comm.getCommentId() + "\t" + comm.getItemId() + "\t" + comm.getContent());
        }
    }



    /**
     * Convert Unicode escapes to Chinese characters
     * @param unicodeStr
     * @return
     */
    public static String decode(String unicodeStr) {
        if (unicodeStr == null) {
            return null;
        }
        StringBuffer retBuf = new StringBuffer();
        int maxLoop = unicodeStr.length();
        for (int i = 0; i < maxLoop; i++) {
            if (unicodeStr.charAt(i) == '\\') {
                if ((i < maxLoop - 5) && ((unicodeStr.charAt(i + 1) == 'u') || (unicodeStr
                        .charAt(i + 1) == 'U')))
                    try {
                        retBuf.append((char) Integer.parseInt(
                                unicodeStr.substring(i + 2, i + 6), 16));
                        i += 5;
                    } catch (NumberFormatException localNumberFormatException) {
                        retBuf.append(unicodeStr.charAt(i));
                    }
                else
                    retBuf.append(unicodeStr.charAt(i));
            } else {
                retBuf.append(unicodeStr.charAt(i));
            }
        }
        return retBuf.toString();
    }

    public static void main(String[] args) throws IOException {
        spiderCookBook();
    }

}

Run the program. The output is as follows:

30376977	853171	漂亮美味
29589112	853171	紫菜是乾的還是
29407043	853171	超市有乾貝和海蠣賣?
28188378	853171	乾貝蝦米一般都是鹹的,要用水多泡會,泡軟
27165505	853171	食材豐富--口感也豐富!
30383571	853171	@<a href="http://www.haodou.com/cook-4003739/" target="_blank">yxeg5</a> 感謝你的分享。
29596058	853171	@<a href="http://www.haodou.com/cook-9235790/" target="_blank">喻平凶</a> 是乾的,要衝洗一下。
29407675	853171	@<a href="http://www.haodou.com/cook-3342562/" target="_blank">秋玉的美</a> 商店裡有網上也有。
28189130	853171	@<a href="http://www.haodou.com/cook-8008371/" target="_blank">月上荒城6</a> 我買的這種不是那種很硬的,很多鹽的,要根據情況而定。
27729797	853171	@<a href="http://www.haodou.com/cook-7566907/" target="_blank">haodou8704818142</a> 我在廈門,漳州吃的,每一次都不是不一樣的。都有紫菜
27727527	853171	@<a href="http://www.haodou.com/cook-489704/" target="_blank">挪紅</a> 和我們的配料不一樣
27166153	853171	@<a href="http://www.haodou.com/cook-3837/" target="_blank">愛跳舞的老太</a> 姐是這兒的人,不知我這樣做對嗎?

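Note that the Chinese text in this response is encoded as Unicode escapes, so it has to be converted to Chinese characters before any further processing. Readers may also wonder what to do in the usual case where only a recipe ID is known (http://www.haodou.com/recipe/853171/), that is, 853171.

The real URL captured from the network traffic contains &callback=jQuery183016721538977115902_1531563599327. How should this string be constructed? And where does the other string, &_=1531563599599, come from? While capturing traffic, you will notice that both values change from request to request; they are generated by the front-end JavaScript. However, both can simply be removed from the captured URL, which leaves this address: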

http://www.haodou.com/comment.php?do=list&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common
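Removing the two dynamic parameters can be done in code rather than by hand. The following is a minimal sketch, not part of the original program: the class name `CallbackParamStripper` and the method `stripDynamicParams` are hypothetical, and it assumes the only parameters to drop are `callback` and `_`, as in the captured URL above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original post: drops the dynamic
// "callback" and "_" query parameters from a captured URL.
class CallbackParamStripper {

    static String stripDynamicParams(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to remove
        }
        List<String> kept = new ArrayList<>();
        for (String pair : url.substring(q + 1).split("&")) {
            String name = pair.split("=", 2)[0];
            // Keep every parameter except the two that change per request.
            if (!name.equals("callback") && !name.equals("_")) {
                kept.add(pair);
            }
        }
        String base = url.substring(0, q);
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }

    public static void main(String[] args) {
        String captured = "http://www.haodou.com/comment.php?do=list"
                + "&callback=jQuery18304706379730622201_1542510303429"
                + "&channel=recipe&item=853171&sort=desc&page=1&size=5"
                + "&comment_id=0&cate=0&purify=common&_=1542510303816";
        System.out.println(stripDynamicParams(captured));
    }
}
```

The remaining parameters keep their original order, so the result has the same form as the cleaned URL shown above.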

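Requesting this address also returns the data successfully, and the result is standard JSON. Given the ID of another recipe (http://www.haodou.com/recipe/344953/), that is, 344953, the URL for its comments can be built by following the same pattern: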

http://www.haodou.com/comment.php?do=list&channel=recipe&item=344953&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common

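Furthermore, when the comments span several pages, you can loop over the page field in the URL above to fetch each page of comments. For example, the URL for the second page of comments on recipe 344953 is: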

http://www.haodou.com/comment.php?do=list&channel=recipe&item=344953&sort=desc&page=2&size=5&comment_id=0&cate=0&purify=common
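The URL pattern above can be expressed as a small builder that takes a recipe ID and a page number. This is a sketch under stated assumptions rather than code from the original program: the class `CommentUrlBuilder` and its methods are hypothetical, the other query parameters are copied unchanged from the URLs above, and the caller is assumed to know how many pages to request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original post: builds comment URLs
// for a recipe ID and a page number, following the pattern shown above.
class CommentUrlBuilder {

    static String commentUrl(long itemId, int page) {
        return "http://www.haodou.com/comment.php?do=list&channel=recipe"
                + "&item=" + itemId
                + "&sort=desc"
                + "&page=" + page
                + "&size=5&comment_id=0&cate=0&purify=common";
    }

    // Builds the URLs for pages firstPage..lastPage (inclusive).
    static List<String> commentUrls(long itemId, int firstPage, int lastPage) {
        List<String> urls = new ArrayList<>();
        for (int page = firstPage; page <= lastPage; page++) {
            urls.add(commentUrl(itemId, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // The page range 1..2 is only an example value.
        for (String url : commentUrls(344953L, 1, 2)) {
            System.out.println(url);
        }
    }
}
```

Each generated URL can then be passed to a fetch method such as `getJson` from the code above, one page at a time.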

Source code location:

Source code