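Crawler Learning 5: Analyzing and Parsing JSON Data
This post covers the JSON data format and how to parse JSON data in a Java web crawler. The libraries commonly used to work with JSON in Java are org.json, Gson, and Fastjson. Here we look at how to handle responses that a crawler receives in JSON format.
Web crawlers run into JSON data all the time. When the requested page wraps the JSON in something else, the response has to be preprocessed into standard JSON first. For example, the response may look like this: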
jQuery18305886476962892728_1531402823026({
"id":"07",
"language": "C++",
"edition": "second",
"author": "E.Balagurusamy"
})
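A string that wraps JSON like this needs preprocessing (trimming off the head and the tail). In Java, the string above can be handled as follows:
// Build the wrapped JSON string
String json = "jQuery18305886476962892728_1531402823026({\"id\":\"07\",\"language\": \"C++\",\"edition\": \"second\",\"author\": \"E.Balagurusamy\"})";
// Trim the head and tail: keep what lies between the first "(" and the last ")",
// so that parentheses inside the JSON values do not break the result
String payload = json.substring(json.indexOf('(') + 1, json.lastIndexOf(')'));
System.out.println(payload);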
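The trimming step can also be wrapped in a small reusable helper. The following is only a sketch under a few assumptions: the class name JsonpUnwrapper and the method unwrap are made up for illustration, the callback name is assumed to consist of letters, digits, "_", "$" and ".", and a response that is not wrapped at all is returned unchanged apart from trimming.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonpUnwrapper {
    // Callback name, "(", payload, ")", optional ";". DOTALL lets the payload span several lines.
    private static final Pattern JSONP =
            Pattern.compile("^\\s*[\\w$.]+\\s*\\((.*)\\)\\s*;?\\s*$", Pattern.DOTALL);

    /** Returns the text inside callback(...), or the trimmed input if it is not wrapped. */
    public static String unwrap(String body) {
        Matcher m = JSONP.matcher(body);
        return m.matches() ? m.group(1).trim() : body.trim();
    }

    public static void main(String[] args) {
        String raw = "jQuery18305886476962892728_1531402823026({\"id\":\"07\",\"language\": \"C++\"});";
        // Prints the JSON object without the callback wrapper
        System.out.println(unwrap(raw));
    }
}
```

Because the group is greedy and the pattern is anchored at both ends, parentheses that appear inside the JSON values stay part of the payload.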
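The result can be checked with an online JSON validator.
Converting Java objects to JSON, JSON objects to Java objects, JSON strings to Java objects, and JSON strings to JSON objects is basic material that is well covered online, so it is not repeated here.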
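A hands-on crawler example
Here is an example based on a real website:
Step 1: capture the network traffic to find the real URL behind the comments
Open the browser developer tools (F12):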
{ "status": 200, "data": { "total": 7, "data": { "_30376977": { "CommentId": 30376977, "ItemId": 853171, "UserId": 4003739, "ReplyId": 0, "Type": 0, "AtUserId": 0, "Content": "漂亮美味", "ImageNum": 0, "Platform": "iPhone客戶端", "Status": 1, "SubCommentCnt": 1, "OpenDataId": "", "OpenUserName": "yxeg5", "OpenUserHome": "http:\/\/www.haodou.com\/cook-4003739\/", "OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/9b\/17\/4003739_70.jpg", "CreateTime": "2016-02-15 12:22", "Vip": "<a href=\"http:\/\/www.haodou.com\/recipe\/expert\/apply\" target=\"_blank\"><i class=\"ico12 mod_v\"><\/i><\/a> ", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>", "LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-513793.html\" target=\"_blank\">【第119期】好問豆答:蜜三刀的製作技巧<\/a><\/span>", "PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php", "Admin": "non" }, "_29589112": { "CommentId": 29589112, "ItemId": 853171, "UserId": 9235790, "ReplyId": 0, "Type": 0, "AtUserId": 0, "Content": "紫菜是乾的還是", "ImageNum": 0, "Platform": "Android客戶端", "Status": 1, "SubCommentCnt": 1, "OpenDataId": "", "OpenUserName": "喻平凶", "OpenUserHome": "http:\/\/www.haodou.com\/cook-9235790\/", "OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/4e\/ed\/9235790_70.jpg", "CreateTime": "2015-12-26 09:36", "Vip": "", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_0\"><\/span> 新手<\/span>", "LastAct": "<span><span class=\"gray9\">最近釋出了菜譜專輯:<\/span> <a href=\"http:\/\/www.haodou.com\/recipe\/album\/9061657\/\" target=\"_blank\">炒飯<\/a><\/span>", "PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php", "Admin": "non" }, "_29407043": { "CommentId": 29407043, "ItemId": 853171, "UserId": 3342562, "ReplyId": 0, "Type": 0, "AtUserId": 0, "Content": "超市有乾貝和海蠣賣?", "ImageNum": 0, "Platform": "好豆網", "Status": 1, "SubCommentCnt": 1, "OpenDataId": "", "OpenUserName": "秋玉的美", "OpenUserHome": "http:\/\/www.haodou.com\/cook-3342562\/", 
"OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e2\/00\/3342562_70.jpg", "CreateTime": "2015-12-05 15:54", "Vip": "", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_1\"><\/span> 豆芽<\/span>", "LastAct": "", "PlatformUrl": "http:\/\/www.haodou.com\/", "Admin": "non" }, "_28188378": { "CommentId": 28188378, "ItemId": 853171, "UserId": 8008371, "ReplyId": 0, "Type": 0, "AtUserId": 0, "Content": "乾貝蝦米一般都是鹹的,要用水多泡會,泡軟", "ImageNum": 0, "Platform": "Android客戶端", "Status": 1, "SubCommentCnt": 1, "OpenDataId": "", "OpenUserName": "月上荒城6", "OpenUserHome": "http:\/\/www.haodou.com\/cook-8008371\/", "OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/b3\/32\/8008371_70.jpg", "CreateTime": "2015-07-09 12:51", "Vip": "", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_1\"><\/span> 豆芽<\/span>", "LastAct": "", "PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php", "Admin": "non" }, "_27165505": { "CommentId": 27165505, "ItemId": 853171, "UserId": 3837, "ReplyId": 0, "Type": 0, "AtUserId": 0, "Content": "食材豐富--口感也豐富!", "ImageNum": 0, "Platform": "好豆網", "Status": 1, "SubCommentCnt": 3, "OpenDataId": "", "OpenUserName": "愛跳舞的老太", "OpenUserHome": "http:\/\/www.haodou.com\/cook-3837\/", "OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/fd\/0e\/3837_70.jpg", "CreateTime": "2015-02-26 09:42", "Vip": "<a href=\"http:\/\/www.haodou.com\/recipe\/expert\/apply\" target=\"_blank\"><i class=\"ico12 mod_v\"><\/i><\/a> ", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>", "LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-556709.html\" target=\"_blank\">【深秋食語】在朋友單位吃午餐<\/a><\/span>", "PlatformUrl": "http:\/\/www.haodou.com\/", "Admin": "non" }, "_30383571": { "CommentId": 30383571, "ItemId": 853171, "UserId": 489704, "ReplyId": 30376977, "Type": 0, "AtUserId": 4003739, "Content": "@<a href=\"http:\/\/www.haodou.com\/cook-4003739\/\" target=\"_blank\">yxeg5<\/a> 感謝你的分享。", 
"ImageNum": 0, "Platform": "Android客戶端", "Status": 1, "SubCommentCnt": 0, "OpenDataId": "", "OpenUserName": "挪紅", "OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/", "OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg", "CreateTime": "2016-02-15 21:39", "Vip": "", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>", "LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>", "PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php", "Admin": "non" }, "_29596058": { "CommentId": 29596058, "ItemId": 853171, "UserId": 489704, "ReplyId": 29589112, "Type": 0, "AtUserId": 9235790, "Content": "@<a href=\"http:\/\/www.haodou.com\/cook-9235790\/\" target=\"_blank\">喻平凶<\/a> 是乾的,要衝洗一下。", "ImageNum": 0, "Platform": "Android客戶端", "Status": 1, "SubCommentCnt": 0, "OpenDataId": "", "OpenUserName": "挪紅", "OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/", "OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg", "CreateTime": "2015-12-26 23:15", "Vip": "", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>", "LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>", "PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php", "Admin": "non" }, "_29407675": { "CommentId": 29407675, "ItemId": 853171, "UserId": 489704, "ReplyId": 29407043, "Type": 0, "AtUserId": 3342562, "Content": "@<a href=\"http:\/\/www.haodou.com\/cook-3342562\/\" target=\"_blank\">秋玉的美<\/a> 商店裡有網上也有。", "ImageNum": 0, "Platform": "Android客戶端", "Status": 1, "SubCommentCnt": 0, "OpenDataId": "", "OpenUserName": "挪紅", "OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/", "OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg", "CreateTime": "2015-12-05 17:11", "Vip": "", 
"Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>", "LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>", "PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php", "Admin": "non" }, "_28189130": { "CommentId": 28189130, "ItemId": 853171, "UserId": 489704, "ReplyId": 28188378, "Type": 0, "AtUserId": 8008371, "Content": "@<a href=\"http:\/\/www.haodou.com\/cook-8008371\/\" target=\"_blank\">月上荒城6<\/a> 我買的這種不是那種很硬的,很多鹽的,要根據情況而定。", "ImageNum": 0, "Platform": "Android客戶端", "Status": 1, "SubCommentCnt": 0, "OpenDataId": "", "OpenUserName": "挪紅", "OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/", "OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg", "CreateTime": "2015-07-09 15:19", "Vip": "", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>", "LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>", "PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php", "Admin": "non" }, "_27729797": { "CommentId": 27729797, "ItemId": 853171, "UserId": 489704, "ReplyId": 27165505, "Type": 0, "AtUserId": 7566907, "Content": "@<a href=\"http:\/\/www.haodou.com\/cook-7566907\/\" target=\"_blank\">haodou8704818142<\/a> 我在廈門,漳州吃的,每一次都不是不一樣的。都有紫菜", "ImageNum": 0, "Platform": "Android客戶端", "Status": 1, "SubCommentCnt": 0, "OpenDataId": "", "OpenUserName": "挪紅", "OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/", "OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg", "CreateTime": "2015-05-07 01:53", "Vip": "", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>", "LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" 
target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>", "PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php", "Admin": "non" }, "_27727527": { "CommentId": 27727527, "ItemId": 853171, "UserId": 7566907, "ReplyId": 27165505, "Type": 0, "AtUserId": 489704, "Content": "@<a href=\"http:\/\/www.haodou.com\/cook-489704\/\" target=\"_blank\">挪紅<\/a> 和我們的配料不一樣", "ImageNum": 0, "Platform": "Android客戶端", "Status": 1, "SubCommentCnt": 0, "OpenDataId": "", "OpenUserName": "haodou8704818142", "OpenUserHome": "http:\/\/www.haodou.com\/cook-7566907\/", "OpenUserAvatar": "http:\/\/avatar1.hoto.cn\/3b\/76\/7566907_70.jpg", "CreateTime": "2015-05-06 19:23", "Vip": "", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_0\"><\/span> 新手<\/span>", "LastAct": "", "PlatformUrl": "http:\/\/www.haodou.com\/help\/mobile.php", "Admin": "non" }, "_27166153": { "CommentId": 27166153, "ItemId": 853171, "UserId": 489704, "ReplyId": 27165505, "Type": 0, "AtUserId": 3837, "Content": "@<a href=\"http:\/\/www.haodou.com\/cook-3837\/\" target=\"_blank\">愛跳舞的老太<\/a> 姐是這兒的人,不知我這樣做對嗎?", "ImageNum": 0, "Platform": "好豆網", "Status": 1, "SubCommentCnt": 0, "OpenDataId": "", "OpenUserName": "挪紅", "OpenUserHome": "http:\/\/www.haodou.com\/cook-489704\/", "OpenUserAvatar": "http:\/\/avatar0.hoto.cn\/e8\/78\/489704_70.jpg", "CreateTime": "2015-02-26 11:26", "Vip": "", "Level": "<span class=\"gray6 mgr10\"><span class=\"ico32 mod_level_7\"><\/span> 金豆<\/span>", "LastAct": "<span><span class=\"gray9\">最近發表了話題:<\/span> <a href=\"http:\/\/group.haodou.com\/topic-557724.html\" target=\"_blank\">【尋找溫暖】港仔後請客,品沙縣小吃<\/a><\/span>", "PlatformUrl": "http:\/\/www.haodou.com\/", "Admin": "non" } }, "avatar": "", "page_nav": "<a href='javaScript:;' page='1' id='' class='cur'>1<\/a><a href='javaScript:;' page='2' id=''>2<\/a><span class='next'><a href='javaScript:;' page='2' id='' class='next'>下一頁<\/a><\/span>", "more": null, "offset": 0 }, "message": "" }
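Step 3: pick the fields you need from the endpoint's data and wrap them in a JavaBean
package com.jack.spiderone.entity;

import lombok.Data;

/**
 * Created by jack 2018/11/18
 *
 * @author jack
 * @date: 2018/11/18 11:26
 * @Description: a single comment taken from the comment endpoint
 */
@Data
public class CommentModel {
    /**
     * ID of the comment
     */
    private String CommentId;
    // ID of the dish being commented on
    private String ItemId;
    // Text of the comment
    private String Content;
    // Time the comment was posted
    private String CreateTime;
    // Name of the comment's author
    private String OpenUserName;
}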
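Step 4:
Use HttpClient or another URL request tool to fetch the string returned by the real URL. Then trim the head and tail of that string in the program so that it becomes a JSON string that is easy to parse (regular expressions are often used for this).
Code: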
package com.jack.spiderone.service;

import com.alibaba.fastjson.JSONObject;
import com.jack.spiderone.entity.CommentModel;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.List;

/**
 * Created by jack 2018/11/18
 *
 * @author jack
 * @date: 2018/11/18 11:35
 * @Description: fetches and parses the comments of a recipe
 */
public class CookBookSpider {
    /**
     * Fetches the JSON string behind a URL
     * @param url the URL to request
     * @return the response body as a string
     */
    public static String getJson(String url) throws IOException {
        // Initialize the HttpClient
        HttpClient httpClient = HttpClients.custom().build();
        // Request method to use
        HttpGet httpget = new HttpGet(url);
        // Send the GET request
        HttpResponse response = httpClient.execute(httpget);
        // Get the response entity (the content stream)
        HttpEntity httpEntity = response.getEntity();
        // Read it as a string (an encoding must be specified)
        String entity = EntityUtils.toString(httpEntity, "gbk");
        // Close the content stream
        EntityUtils.consume(httpEntity);
        // Return the JSON string
        return entity;
    }
    /**
     * Parses the JSON string into a list of comment objects
     * @param jsonStr the raw response body
     * @return the parsed comments
     */
    public static List<CommentModel> parseData(String jsonStr) {
        // Convert the Unicode escapes to Chinese characters
        jsonStr = decode(jsonStr);
        // Use split and a regex replacement to reduce the response to the comment objects:
        // take the inner "data" object, cut it off before "avatar", and drop the "_<id>": keys
        String jsondata = "{" + jsonStr.split("data\":\\{")[2].split("\"avatar")[0].replaceAll("\"_\\d+\":", "");
        // Drop the last two characters, the "}," that closed the inner "data" object
        jsonStr = jsondata.substring(0, jsondata.length() - 2);
        // Replace the leading "{" with "[", append "]", and parse the JSON array into a list of objects
        List<CommentModel> datalis = JSONObject.parseArray("[" + jsonStr.substring(1, jsonStr.length()) + "]", CommentModel.class);
        return datalis;
    }

    public static void spiderCookBook() throws IOException {
        // URL to parse
        String url = "http://www.haodou.com/comment.php?do=list&callback=jQuery18304706379730622201_1542510303429&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common&_=1542510303816";
        // Fetch the JSON data
        String jsonstring = getJson(url);
        // Parse the JSON data
        List<CommentModel> datalist = parseData(jsonstring);
        // Print the data
        for (CommentModel comm : datalist) {
            System.out.println(comm.getCommentId() + "\t" + comm.getItemId() + "\t" + comm.getContent());
        }
    }
    /**
     * Converts Unicode escape sequences to the characters they stand for
     * @param unicodeStr text that may contain Unicode escapes
     * @return the decoded text, or null if the input is null
     */
    public static String decode(String unicodeStr) {
        if (unicodeStr == null) {
            return null;
        }
        StringBuilder retBuf = new StringBuilder();
        int maxLoop = unicodeStr.length();
        for (int i = 0; i < maxLoop; i++) {
            if (unicodeStr.charAt(i) == '\\') {
                // A backslash followed by u or U and four hex digits is one escaped character
                if ((i < maxLoop - 5) && ((unicodeStr.charAt(i + 1) == 'u') || (unicodeStr.charAt(i + 1) == 'U'))) {
                    try {
                        retBuf.append((char) Integer.parseInt(unicodeStr.substring(i + 2, i + 6), 16));
                        // Skip the rest of the escape sequence
                        i += 5;
                    } catch (NumberFormatException localNumberFormatException) {
                        // Not valid hex: keep the backslash as it is
                        retBuf.append(unicodeStr.charAt(i));
                    }
                } else {
                    retBuf.append(unicodeStr.charAt(i));
                }
            } else {
                retBuf.append(unicodeStr.charAt(i));
            }
        }
        return retBuf.toString();
    }

    public static void main(String[] args) throws IOException {
        spiderCookBook();
    }
}
Running the program prints the following:
30376977 853171 漂亮美味
29589112 853171 紫菜是乾的還是
29407043 853171 超市有乾貝和海蠣賣?
28188378 853171 乾貝蝦米一般都是鹹的,要用水多泡會,泡軟
27165505 853171 食材豐富--口感也豐富!
30383571 853171 @<a href="http://www.haodou.com/cook-4003739/" target="_blank">yxeg5</a> 感謝你的分享。
29596058 853171 @<a href="http://www.haodou.com/cook-9235790/" target="_blank">喻平凶</a> 是乾的,要衝洗一下。
29407675 853171 @<a href="http://www.haodou.com/cook-3342562/" target="_blank">秋玉的美</a> 商店裡有網上也有。
28189130 853171 @<a href="http://www.haodou.com/cook-8008371/" target="_blank">月上荒城6</a> 我買的這種不是那種很硬的,很多鹽的,要根據情況而定。
27729797 853171 @<a href="http://www.haodou.com/cook-7566907/" target="_blank">haodou8704818142</a> 我在廈門,漳州吃的,每一次都不是不一樣的。都有紫菜
27727527 853171 @<a href="http://www.haodou.com/cook-489704/" target="_blank">挪紅</a> 和我們的配料不一樣
27166153 853171 @<a href="http://www.haodou.com/cook-3837/" target="_blank">愛跳舞的老太</a> 姐是這兒的人,不知我這樣做對嗎?
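The Unicode-escape conversion that parseData performs before parsing can also be written with a regular expression instead of a character-by-character loop. This is a minimal sketch, not the article's implementation: the class name UnicodeEscapes is hypothetical, and only four-digit escapes introduced by a backslash and u (or U) are handled. A JSON parser normally resolves such escapes by itself, so a helper like this is mostly useful when the response is processed as plain text first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeEscapes {
    // A backslash, then u or U, then exactly four hex digits (captured as group 1)
    private static final Pattern ESCAPE = Pattern.compile("\\\\[uU]([0-9a-fA-F]{4})");

    /** Replaces every four-digit Unicode escape with the character it encodes. */
    public static String decode(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char ch = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded backslash or dollar sign literal
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal below is the 12-character text: backslash u4e2d backslash u6587
        System.out.println(decode("\\u4e2d\\u6587"));
    }
}
```

Sequences that do not have four valid hex digits are left untouched, which matches the fallback in the hand-written loop.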
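Note that the Chinese text in this response is encoded as Unicode escape sequences, so it has to be converted to Chinese characters before further processing. Readers may also wonder: usually we only know a recipe's ID (http://www.haodou.com/recipe/853171/), that is, 853171. What do we do then?
The real URL captured from the traffic contains &callback=jQuery183016721538977115902_1531563599327. How should that string be assembled? And where does the other string, &_=1531563599599, come from? When capturing traffic you will notice that both strings change dynamically, which is the work of the front-end JavaScript. However, both can simply be removed from the captured URL, which gives the following address: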
http://www.haodou.com/comment.php?do=list&channel=recipe&item=853171&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common
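Requesting this address also returns the data successfully, and the result is standard JSON. Given the ID of another dish (http://www.haodou.com/recipe/344953/), that is, 344953, the URL for its comments can be assembled by the same pattern: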
http://www.haodou.com/comment.php?do=list&channel=recipe&item=344953&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common
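Removing the callback and _ parameters from a captured URL, as described above, can be automated. The sketch below is one possible way to do it with plain string handling; the class name CallbackParamStripper and the method dropParams are invented for this example, and the query string is assumed to be a simple list of key=value pairs joined by "&".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CallbackParamStripper {
    /** Drops the named query parameters and keeps the rest in their original order. */
    public static String dropParams(String url, String... names) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to remove
        }
        List<String> drop = Arrays.asList(names);
        List<String> kept = new ArrayList<>();
        for (String pair : url.substring(q + 1).split("&")) {
            // The key is everything before the first "="
            String key = pair.split("=", 2)[0];
            if (!drop.contains(key)) {
                kept.add(pair);
            }
        }
        String base = url.substring(0, q);
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }

    public static void main(String[] args) {
        String captured = "http://www.haodou.com/comment.php?do=list"
                + "&callback=jQuery18304706379730622201_1542510303429&channel=recipe&item=853171"
                + "&sort=desc&page=1&size=5&comment_id=0&cate=0&purify=common&_=1542510303816";
        // Prints the URL without the callback and _ parameters
        System.out.println(dropParams(captured, "callback", "_"));
    }
}
```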
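Furthermore, when the comments span several pages, we can loop over the page field in the URL above to fetch every page. For example, the URL for the second page of comments on dish 344953 is: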
http://www.haodou.com/comment.php?do=list&channel=recipe&item=344953&sort=desc&page=2&size=5&comment_id=0&cate=0&purify=common
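Assembling these URLs from a recipe ID and a page number can be sketched as follows. The class and method names (CommentUrls, commentUrl, pageUrls) are hypothetical, the fixed parameters are copied from the URLs shown above, and the number of the last page is an assumption the caller has to supply, since how to determine it is not covered here.

```java
import java.util.ArrayList;
import java.util.List;

public class CommentUrls {
    private static final String PREFIX =
            "http://www.haodou.com/comment.php?do=list&channel=recipe&item=";
    private static final String SUFFIX = "&size=5&comment_id=0&cate=0&purify=common";

    /** Builds the comment URL for one recipe ID and one page number. */
    public static String commentUrl(int recipeId, int page) {
        return PREFIX + recipeId + "&sort=desc&page=" + page + SUFFIX;
    }

    /** Builds the URLs for pages 1..lastPage; lastPage is supplied by the caller. */
    public static List<String> pageUrls(int recipeId, int lastPage) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= lastPage; page++) {
            urls.add(commentUrl(recipeId, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Prints one URL per page for recipe 344953, assuming two pages
        for (String url : pageUrls(344953, 2)) {
            System.out.println(url);
        }
    }
}
```

Each URL produced this way could then be passed to a fetch method such as getJson and parsed page by page.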