
Java Web Crawler Basics (Part 4)


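Using jsoup

Introduction to jsoup

  jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods.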

Main features

  1. Parse HTML from a URL, a file, or a string.
  2. Find and extract data using DOM traversal or CSS selectors.
  3. Manipulate HTML elements, attributes, and text.
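
The features listed above can be sketched in a few lines. This is a minimal example, assuming jsoup is on the classpath; the class name `ParseDemo` and the sample markup are made up for illustration and do not come from any real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseDemo {
    // Hypothetical sample markup, invented for this example.
    static final String HTML =
            "<html><head><title>Movies</title></head><body>"
          + "<ul id=\"list\">"
          + "<li class=\"movie\"><a href=\"/m/1\">First</a></li>"
          + "<li class=\"movie\"><a href=\"/m/2\">Second</a></li>"
          + "</ul></body></html>";

    /** Returns the link text of the first element matching "li.movie a", or "" if none. */
    static String firstMovieTitle(String html) {
        Document doc = Jsoup.parse(html);               // feature 1: parse from a string
        Element link = doc.select("li.movie a").first(); // feature 2: CSS selector
        return link == null ? "" : link.text();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Elements links = doc.select("li.movie a");
        for (Element link : links) {
            // Read both the text and an attribute of each matched element.
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
        // Feature 3: modify an attribute on an element found by id.
        Element list = doc.getElementById("list");
        if (list != null) {
            list.attr("data-count", String.valueOf(links.size()));
            System.out.println(list.attr("data-count"));
        }
    }
}
```

The same `select` call works on a `Document` fetched with `Jsoup.connect(url).get()`, so the parsing code does not need to change when the HTML comes from the network.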

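Requesting a URL directly

At first I called jsoup's connect method directly on the movie JSON endpoint from the previous part, and it threw an exception.

I did not understand the cause at first, because switching to a different URL, http://www.w3school.com.cn/b.asp, printed the HTML page without any problem.

After searching online and rereading the exception, I found that the problem was the response's content type: by default jsoup refuses to parse a response whose Content-Type is not HTML or XML. Adding ignoreContentType(true) to the request fixes it: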

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Simple {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup
                    .connect("https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=time&page_limit=20&page_start=0")
                    // Accept the JSON response instead of rejecting its content type
                    .ignoreContentType(true)
                    .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36")
                    .timeout(5000)
                    .get();
            //Document doc1 = Jsoup
            //        .connect("http://www.w3school.com.cn/b.asp").get();
            System.out.println(doc);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

With that setting, the request succeeds and the response is printed.
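
To make the cause of the original failure more concrete, here is a rough sketch of the content-type rule as I understand it: jsoup parses `text/*` and XML-like types by default and rejects everything else unless `ignoreContentType(true)` is set. The class `ContentTypeCheck` and its method are hypothetical and only approximate that behaviour; this is not jsoup's own code, and it assumes the movie endpoint replies with `application/json`.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class ContentTypeCheck {
    // Approximation of the XML-like types accepted by default,
    // e.g. application/xml or application/xhtml+xml.
    private static final Pattern XML_TYPE =
            Pattern.compile("(application|text)/\\w*\\+?xml.*");

    /** Returns true if a response with this Content-Type would be parsed by default. */
    static boolean parsedByDefault(String contentType) {
        if (contentType == null) {
            return true; // assumption: with no header, parsing is still attempted
        }
        String ct = contentType.trim().toLowerCase(Locale.ROOT);
        return ct.startsWith("text/") || XML_TYPE.matcher(ct).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/html; charset=utf-8",
            "application/xhtml+xml",
            "application/json; charset=utf-8"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + parsedByDefault(sample));
        }
    }
}
```

Under this rule the w3school page passes because it is served as HTML, while a JSON response does not, which would explain why only the JSON request needed the extra setting.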


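Putting it all together, the following method uses jsoup to fetch the movie information. Instead of get(), it calls execute() to obtain the raw response body and then parses that as JSON (the JSONObject and JSONArray classes appear to be fastjson's, from com.alibaba.fastjson).

Call it from main: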

public static void test2(){
        try {
            Response res = Jsoup
                    .connect("https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=time&page_limit=20&page_start=0")
                    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8")
                    .header("Host", "movie.douban.com")
                    .header("Accept-Encoding", "gzip, deflate")
                    .header("Accept-Language","zh-cn,zh;q=0.5")
                    //.header("Content-Type", "application/json;charset=UTF-8")
                    .header("User-Agent","Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36")
                    .header("Connection", "keep-alive")
                    .header("Cache-Control", "max-age=0")
                    .ignoreContentType(true)
                    .timeout(5000)
                    .execute();
            String body = res.body();
            JSONObject jsonObject = JSONObject.parseObject(body);
            JSONArray array = jsonObject.getJSONArray("subjects");
            
            for(int i=0;i<array.size();i++){ // iterate over the "subjects" JSON array
                JSONObject jo = array.getJSONObject(i);
                Movie movie = jo.toJavaObject(Movie.class);
                System.out.println(movie);
            }
            
            //System.out.println(array.get(1));
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
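
The long request URL above packs several query parameters into one string. As a small sketch, it can be assembled from its parts, which also shows where the encoded `tag` value comes from. The class `SearchUrl` is hypothetical, and treating `page_start` as a zero-based offset and `page_limit` as the page size is an assumption based only on the parameter names. `URLEncoder.encode(String, Charset)` requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrl {
    static final String BASE = "https://movie.douban.com/j/search_subjects";

    /** Builds the search URL for a tag, page size, and starting offset. */
    static String build(String tag, int pageLimit, int pageStart) {
        // Percent-encode the tag as UTF-8 so non-ASCII text is safe in a URL.
        String encodedTag = URLEncoder.encode(tag, StandardCharsets.UTF_8);
        return BASE
                + "?type=movie"
                + "&tag=" + encodedTag
                + "&sort=time"
                + "&page_limit=" + pageLimit
                + "&page_start=" + pageStart;
    }

    public static void main(String[] args) {
        // "\u70ed\u95e8" is the tag that encodes to %E7%83%AD%E9%97%A8.
        String tag = "\u70ed\u95e8";
        // Print the URLs of the first three pages, 20 entries each (assumed paging scheme).
        for (int page = 0; page < 3; page++) {
            System.out.println(build(tag, 20, page * 20));
        }
    }
}
```

Passing the result of `build` to `Jsoup.connect` in place of the literal string would make it easy to request further pages in a loop.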

Movie.java:

public class Movie implements Serializable{
    /**
     * 
     */
    private static final long serialVersionUID = 1L;
    private String rate;
    private String cover_x;
    private String title;
    private String url;
    private String playable;
    private String cover;
    private String id;
    private String cover_y;
    private String is_new;
    
    public Movie() {
        // TODO Auto-generated constructor stub
    }
    
    public Movie(String rate, String cover_x, String title, String url, String playable, String cover, String id,
            String cover_y, String is_new) {
        super();
        this.rate = rate;
        this.cover_x = cover_x;
        this.title = title;
        this.url = url;
        this.playable = playable;
        this.cover = cover;
        this.id = id;
        this.cover_y = cover_y;
        this.is_new = is_new;
    }



    public String getRate() {
        return rate;
    }

    public void setRate(String rate) {
        this.rate = rate;
    }

    public String getCover_x() {
        return cover_x;
    }

    public void setCover_x(String cover_x) {
        this.cover_x = cover_x;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getPlayable() {
        return playable;
    }

    public void setPlayable(String playable) {
        this.playable = playable;
    }

    public String getCover() {
        return cover;
    }

    public void setCover(String cover) {
        this.cover = cover;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getCover_y() {
        return cover_y;
    }

    public void setCover_y(String cover_y) {
        this.cover_y = cover_y;
    }

    public String getIs_new() {
        return is_new;
    }

    public void setIs_new(String is_new) {
        this.is_new = is_new;
    }

    @Override
    public String toString() {
        return "Movie [rating: " + rate + ", title: " + title + "]";
    }
    
    
}

The output is one line per movie, showing its rating and title.

That wraps up this simple jsoup test.
