Java Web Crawler (Spider)

阿新 • Published: 2018-12-27

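Lamp still burning at the third watch, rooster crowing at the fifth: that is exactly when a young man should be studying.

This time we crawl images from Baidu Tieba.

The target URL is https://tieba.baidu.com/f?ie=utf-8&kw=%E9%A3%8E%E6%99%AF%E5%9B%BE&fr=search

Tools used: IJ (IntelliJ IDEA), the URL class, IO streams, UUID (for random file names), and regular expressions.

First, fetch the HTML at https://tieba.baidu.com/f?ie=utf-8&kw=%E9%A3%8E%E6%99%AF%E5%9B%BE&fr=search by opening a connection: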

 HttpURLConnection connection =(HttpURLConnection)
                new URL("https://tieba.baidu.com/f?ie=utf-8&kw=%E9%A3%8E%E6%99%AF%E5%9B%BE&fr=search").openConnection();

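The page is then read and returned as a single HTML string, which makes the later steps easier (see getHtml in the complete code).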
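As a minimal sketch of that step, the helper below turns any InputStream into a String. The class name HtmlReader and the method readAll are hypothetical, not part of the original code, and the UTF-8 charset is an assumption based on the ie=utf-8 parameter in the URL. The demo feeds it an in-memory stream in place of connection.getInputStream(), so it runs without network access.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name HtmlReader does not appear in the original post.
final class HtmlReader {

    // Reads the whole stream as text in the given charset, line by line,
    // and closes the stream when done. Every line ends with '\n' in the result.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream(); UTF-8 is an assumption.
        byte[] page = "<html>\n<body>hello</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding, which may differ between machines.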

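Next, write a regular expression to match the image addresses. First we need to work out which address belongs to the images we want.

We want the scenery images in the middle of the page, not the QR code or avatar images, so open the browser's developer tools with F12.

Then pull out the relevant markup to write the regex against: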

data-original="https://imgsa.baidu.com/forum/wh%3D200%2C90%3B/sign=561eeed4c4ea15ce41bbe80b863016ca/57335443fbf2b211c636051fc78065380cd78e56.jpg"  



bpic="https://imgsa.baidu.com/forum/w%3D580%3B/sign=8436c6918f025aafd3327ec3cbd6aa64/a08b87d6277f9e2f522d9d591230e924b899f343.jpg"

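After checking, the addresses above are those of the scenery images we want, namely the value that follows data-original.

Note: when writing the regex, bring in the bpic attribute that follows it to make the expression precise. The code below matches on bpic="..." and stores each address it finds in a List.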

    private static ArrayList<String> getPictureUrl(String html) {
        Pattern patternRegex = Pattern.compile("bpic=\"(.*?)\""); // regex that matches the image address
        Matcher matcher = patternRegex.matcher(html);

        ArrayList<String> listUrl = new ArrayList<>();
        while (matcher.find()) {
            // group(1) is the text captured between the quotes, i.e. the bare URL
            listUrl.add(matcher.group(1));
        }
        return listUrl;
    }
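To check a pattern against the markup shown earlier without hitting the network, a small harness along these lines could be used. The class AttrExtractor and the method extract are hypothetical names, and wrapping the two attributes in an img tag is an assumption, since the post only shows the attributes themselves. The same method can pull out either bpic or data-original values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of the original code.
final class AttrExtractor {

    // Returns the value of every attr="..." occurrence in html, in document order.
    static List<String> extract(String html, String attr) {
        // \b keeps the name from matching the tail of a longer word;
        // Pattern.quote treats the attribute name as literal text.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(attr) + "=\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        List<String> values = new ArrayList<>();
        while (m.find()) {
            values.add(m.group(1)); // group 1 is the text between the quotes
        }
        return values;
    }

    public static void main(String[] args) {
        // The enclosing <img> tag is assumed; the attribute values are the ones from the post.
        String html = "<img data-original=\"https://imgsa.baidu.com/forum/wh%3D200%2C90%3B/"
                + "sign=561eeed4c4ea15ce41bbe80b863016ca/57335443fbf2b211c636051fc78065380cd78e56.jpg\""
                + " bpic=\"https://imgsa.baidu.com/forum/w%3D580%3B/"
                + "sign=8436c6918f025aafd3327ec3cbd6aa64/a08b87d6277f9e2f522d9d591230e924b899f343.jpg\">";
        System.out.println(extract(html, "bpic"));
        System.out.println(extract(html, "data-original"));
    }
}
```

Using [^"]* instead of .*? makes it explicit that a match can never run past the closing quote.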

This returns listUrl. Next, loop over listUrl and download each image using IO streams.

    private static void getPicture(ArrayList<String> pictureUrl) throws IOException {
        byte[] buffer = new byte[1024]; // one buffer reused for every read
        for (String p : pictureUrl) {
            HttpURLConnection conn = (HttpURLConnection) new URL(p).openConnection();
            UUID uuid = UUID.randomUUID(); // random file name
            // try-with-resources closes both streams even if a read or write fails;
            // the extension is .jpg because the sample image URLs end in .jpg
            try (InputStream in = conn.getInputStream();
                 FileOutputStream out = new FileOutputStream("C:\\Users\\Administrator\\IdeaProjects\\homework\\picture\\" + uuid + ".jpg")) {
                int len;
                while ((len = in.read(buffer)) != -1) {
                    out.write(buffer, 0, len);
                }
            }
        }
    }
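The download loop above writes each file under a UUID name with a fixed extension into a hard-coded directory. As a variation, the sketch below takes the extension from the URL path and lets java.nio.file.Files.copy do the byte copying. The class ImageSaver, its two methods, and the .jpg fallback are all hypothetical choices, not part of the original code. The demo writes to a temporary directory from an in-memory stream, so it does not need the network.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper; not part of the original code.
final class ImageSaver {

    // Returns the extension (including the dot) of the last path segment of the URL,
    // or ".jpg" as an assumed fallback when the path has none.
    static String extensionOf(String url) {
        String path = URI.create(url).getPath(); // query string and fragment are excluded
        if (path == null) {
            return ".jpg";
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        return dot > slash ? path.substring(dot) : ".jpg";
    }

    // Copies the stream into dir under a random UUID-based name, closes the stream,
    // and returns the path of the new file.
    static Path save(InputStream in, Path dir, String extension) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(UUID.randomUUID() + extension);
        try (InputStream source = in) {
            Files.copy(source, target);
        }
        return target;
    }

    public static void main(String[] args) throws IOException {
        String url = "https://imgsa.baidu.com/forum/w%3D580%3B/"
                + "sign=8436c6918f025aafd3327ec3cbd6aa64/a08b87d6277f9e2f522d9d591230e924b899f343.jpg";
        Path dir = Files.createTempDirectory("picture");
        // Stand-in bytes instead of conn.getInputStream().
        Path saved = save(new ByteArrayInputStream(new byte[] {1, 2, 3}), dir, extensionOf(url));
        System.out.println(saved + " (" + Files.size(saved) + " bytes)");
    }
}
```

In the crawler itself, the first argument to save would be conn.getInputStream() and dir would be the picture folder.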

Complete code

package com.westos.danli;

import java.io.*;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Spider1 {
    public static void main(String[] args) throws IOException {
        HttpURLConnection connection = (HttpURLConnection)
                new URL("https://tieba.baidu.com/f?ie=utf-8&kw=%E9%A3%8E%E6%99%AF%E5%9B%BE&fr=search").openConnection();
        String html = Spider1.getHtml(connection); // the page HTML as a string
        ArrayList<String> pictureUrl = getPictureUrl(html);
        getPicture(pictureUrl); // saves into C:\Users\Administrator\IdeaProjects\homework\picture
    }

    private static void getPicture(ArrayList<String> pictureUrl) throws IOException {
        byte[] buffer = new byte[1024]; // one buffer reused for every read
        for (String p : pictureUrl) {
            HttpURLConnection conn = (HttpURLConnection) new URL(p).openConnection();
            UUID uuid = UUID.randomUUID(); // random file name
            // try-with-resources closes both streams even if a read or write fails;
            // the extension is .jpg because the sample image URLs end in .jpg
            try (InputStream in = conn.getInputStream();
                 FileOutputStream out = new FileOutputStream("C:\\Users\\Administrator\\IdeaProjects\\homework\\picture\\" + uuid + ".jpg")) {
                int len;
                while ((len = in.read(buffer)) != -1) {
                    out.write(buffer, 0, len);
                }
            }
        }
    }

    private static ArrayList<String> getPictureUrl(String html) {
        Pattern patternRegex = Pattern.compile("bpic=\"(.*?)\""); // regex that matches the image address
        Matcher matcher = patternRegex.matcher(html);

        ArrayList<String> listUrl = new ArrayList<>();
        while (matcher.find()) {
            // group(1) is the text captured between the quotes, i.e. the bare URL
            listUrl.add(matcher.group(1));
        }
        return listUrl;
    }

    /*
          Fetch the HTML page, save a copy to 123.html, and return it as a string.
     */
    private static String getHtml(HttpURLConnection connection) throws IOException {
        StringBuilder html = new StringBuilder(); // collects the page as a string
        // try-with-resources flushes and closes 123.html and closes the reader.
        // UTF-8 is assumed here, as the ie=utf-8 query parameter suggests.
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(connection.getInputStream(), "UTF-8"));
             BufferedOutputStream fileOut = new BufferedOutputStream(new FileOutputStream("123.html"))) {
            String s;
            while ((s = reader.readLine()) != null) { // read the HTML line by line
                fileOut.write((s + "\n").getBytes("UTF-8")); // keep line breaks in the saved copy
                html.append(s).append("\n");
            }
        }
        return html.toString();
    }
}
