爬蟲記錄(2)——簡單爬取一個頁面的圖片並儲存
阿新 • • 發佈:2019-02-08
1、爬蟲工具類,用來獲取網頁內容
package com.dyw.crawler.util;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
/**
* 爬蟲工具類
* Created by dyw on 2017/9/1.
*/
public class CrawlerUtils {
    /**
     * Fetches the page at {@code url} and returns the whole body as a String,
     * with lines re-joined by '\n'.
     *
     * @param url the URL to fetch
     * @return the whole page as a String
     * @throws Exception if the connection or the read fails
     */
    public static String getHtml(String url) throws Exception {
        URLConnection connection = new URL(url).openConnection();
        // try-with-resources guarantees the streams are closed even when reading
        // throws (the original closed them manually and leaked on exception).
        // Decode explicitly as UTF-8 instead of the platform-default charset.
        try (InputStream in = connection.getInputStream();
             BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }

    /**
     * Opens a GET connection to {@code urlStr} and returns the raw response stream.
     * The caller is responsible for closing the returned stream.
     *
     * @param urlStr url address to download from
     * @return the response body as an InputStream
     * @throws IOException if the connection fails
     */
    public static InputStream downLoadFromUrl(String urlStr) throws IOException {
        URL url = new URL(urlStr);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Pretend to be a browser so simple anti-crawler filters do not return 403.
        conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
        // 3-second connect timeout
        conn.setConnectTimeout(3 * 1000);
        conn.setRequestProperty("Accept",
                "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*");
        conn.setRequestProperty("Accept-Language", "zh-cn");
        conn.setRequestProperty("UA-CPU", "x86");
        // BUG FIX: the original sent "Accept-Encoding: gzip" but never decompressed
        // the response, so any server honouring it would deliver corrupted bytes.
        // The header is omitted now, so the body arrives identity-encoded.
        conn.setRequestProperty("Content-type", "text/html");
        conn.setRequestProperty("Connection", "keep-alive");
        return conn.getInputStream();
    }
}
2、正則工具類,用來匹配需要獲取的url地址
package com.dyw.crawler.util;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
* 正則表示式工具類
* Created by dyw on 2017/9/1.
*/
public class RegularUtils {
    // Matches a whole <img ...> tag (note: src value is captured loosely).
    // Patterns are compiled once and cached — the original recompiled them on
    // every call, and per element inside the list overload's loop.
    private static final Pattern IMGURL_PATTERN = Pattern.compile("<img.*src=(.*?)[^>]*?>");
    // Matches href="..." attributes (whole attribute is returned, quotes included).
    private static final Pattern AURL_PATTERN = Pattern.compile("href=\"(.*?)\"");
    // Matches a scheme://... URL ending in png|jpg|bmp|gif.
    // BUG FIX: the original char class was [a-zA-z], which also matched the
    // punctuation between 'Z' and 'a' ([ \ ] ^ _ `).
    private static final Pattern IMGSRC_PATTERN = Pattern.compile("[a-zA-Z]+://[^\\s]*(?:png|jpg|bmp|gif)");

    /**
     * Extracts every href="..." attribute from the given HTML.
     *
     * @param html content to scan
     * @return list of matched attribute strings (e.g. {@code href="http://..."})
     */
    public static List<String> getAUrl(String html) {
        return match(AURL_PATTERN, html);
    }

    /**
     * Extracts image URLs that appear inside {@code <img>} tags.
     *
     * @param html content to scan
     * @return list of image URLs found within img tags
     */
    public static List<String> getIMGUrl(String html) {
        // First isolate the <img> tags, then pull the http...png/jpg/bmp/gif URL
        // out of each tag.
        List<String> imgTags = match(IMGURL_PATTERN, html);
        return match(IMGSRC_PATTERN, imgTags);
    }

    /**
     * Extracts every image URL (scheme://...png|jpg|bmp|gif) from the given text.
     *
     * @param html content to scan
     * @return list of matched image URLs
     */
    public static List<String> getIMGSrc(String html) {
        return match(IMGSRC_PATTERN, html);
    }

    /**
     * Collects every match of {@code pattern} in {@code html}.
     *
     * @param pattern compiled pattern
     * @param html    content to scan
     * @return list of full-match strings, in order of appearance
     */
    private static List<String> match(Pattern pattern, String html) {
        Matcher matcher = pattern.matcher(html);
        List<String> list = new ArrayList<>();
        while (matcher.find()) {
            list.add(matcher.group());
        }
        return list;
    }

    /**
     * Collects every match of {@code pattern} across all strings in {@code list}.
     *
     * @param pattern compiled pattern
     * @param list    strings to scan
     * @return list of full-match strings, in order of appearance
     */
    private static List<String> match(Pattern pattern, List<String> list) {
        List<String> result = new ArrayList<>();
        list.forEach(string -> {
            Matcher matcher = pattern.matcher(string);
            while (matcher.find()) {
                result.add(matcher.group());
            }
        });
        return result;
    }
}
3、IO工具類,用來把獲取的html內容進行寫入到檔案中
package com.dyw.crawler.util;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
/**
* IO工具類
* Created by dyw on 2017/9/1.
*/
public class IOUtils {
    /**
     * Creates the file on disk if it does not exist yet.
     *
     * @param file file to create
     * @throws Exception wrapping any failure from the filesystem
     */
    public static void createFile(File file) throws Exception {
        try {
            if (!file.exists()) {
                file.createNewFile();
            }
        } catch (Exception e) {
            throw new Exception("建立檔案的時候錯誤!", e);
        }
    }

    /**
     * Writes a String to the given file, encoded as UTF-8.
     *
     * @param content  content to write
     * @param fileName destination file
     * @throws Exception wrapping any failure while writing
     */
    public static void writeFile(String content, File fileName) throws Exception {
        writeFile(content.getBytes("Utf-8"), fileName);
    }

    /**
     * Writes a byte array to the given file, replacing existing content.
     *
     * @param bytes    content to write
     * @param fileName destination file
     * @throws Exception wrapping any failure while writing
     */
    public static void writeFile(byte[] bytes, File fileName) throws Exception {
        // try-with-resources: the original leaked the FileOutputStream when
        // write() threw, because close() was only reached on success.
        try (FileOutputStream out = new FileOutputStream(fileName)) {
            out.write(bytes);
        } catch (Exception e) {
            throw new Exception("寫入檔案的時候錯誤!", e);
        }
    }

    /**
     * Drains an InputStream into the given file. The input stream is always
     * closed, even if reading fails.
     *
     * @param inputStream input to drain
     * @param fileName    destination file
     * @throws Exception wrapping any failure while reading or writing
     */
    public static void saveFile(InputStream inputStream, File fileName) throws Exception {
        writeFile(readInputStream(inputStream), fileName);
    }

    /**
     * Reads an entire InputStream into a byte array, closing it afterwards.
     *
     * @param inputStream input to drain (closed by this method)
     * @return the stream's full contents
     * @throws IOException if reading fails
     */
    private static byte[] readInputStream(InputStream inputStream) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try {
            byte[] buffer = new byte[1024];
            int len;
            while ((len = inputStream.read(buffer)) != -1) {
                bos.write(buffer, 0, len);
            }
        } finally {
            // the original only closed the stream on the success path
            inputStream.close();
        }
        return bos.toByteArray();
    }
}
4、main方法執行
package com.dyw.crawler.project;
import com.dyw.crawler.util.CrawlerUtils;
import com.dyw.crawler.util.IOUtils;
import com.dyw.crawler.util.RegularUtils;
import java.io.File;
import java.io.InputStream;
import java.util.List;
/**
* 下載網頁中的圖片
* Created by dyw on 2017/9/4.
*/
public class Project1 {
    /**
     * Crawls one page, extracts every image URL, and downloads each image
     * into a fixed local directory, named after the URL's last path segment.
     */
    public static void main(String[] args) {
        // directory where the downloaded images are stored
        String path = "C:\\Users\\dyw\\Desktop\\crawler";
        // page to crawl
        String url = "http://blog.csdn.net/juewang_love";
        // fetch the page HTML
        String htmlContent;
        try {
            htmlContent = CrawlerUtils.getHtml(url);
        } catch (Exception e) {
            throw new RuntimeException("獲取內容失敗!", e);
        }
        // every image URL found on the page
        List<String> imgUrls = RegularUtils.getIMGUrl(htmlContent);
        // download each image
        imgUrls.forEach(imgUrl -> {
            String[] segments = imgUrl.split("/");
            String imgName = segments[segments.length - 1];
            try {
                File target = new File(path + "/" + imgName);
                InputStream inputStream = CrawlerUtils.downLoadFromUrl(imgUrl);
                IOUtils.saveFile(inputStream, target);
                System.out.println("success:" + imgName);
            } catch (Exception e) {
                // BUG FIX: the original concatenated with "" (an empty string);
                // a space separator was clearly intended. Also report the cause
                // instead of silently discarding the exception.
                System.out.println("fail:" + imgUrl + " " + imgName + " (" + e + ")");
            }
        });
    }
}
5、修改 CrawlerUtils 工具類 用 httpclient 替代 urlConnection
package com.dyw.crawler.util;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
/**
* 爬蟲工具類
* Created by dyw on 2017/9/1.
*/
public class CrawlerUtils {
    /**
     * Applies the common request headers to an HTTP method.
     *
     * @param httpMethod request to decorate
     */
    private static void setHead(HttpMethod httpMethod) {
        // Pretend to be a browser so simple anti-crawler filters do not return 403.
        httpMethod.setRequestHeader("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
        // NOTE(review): "Utf-8" is not a valid Content-Type value (and the header is
        // unusual on a GET); kept as-is to preserve the exact bytes sent on the wire.
        httpMethod.setRequestHeader("Content-Type", "Utf-8");
        httpMethod.setRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    }

    /**
     * Fetches the page at {@code url} via GET and returns its body as a String,
     * decoded as UTF-8 with lines re-joined by '\n'.
     *
     * @param url url to fetch
     * @return the whole page as a String
     * @throws Exception if the request or the read fails
     */
    public static String getHtml(String url) throws Exception {
        // try-with-resources closes the reader and the underlying response
        // stream; the original never closed either.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(downLoadFromUrl(url), "Utf-8"))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }

    /**
     * Performs a GET request and returns the response body stream.
     * The caller must close the returned stream.
     *
     * @param urlStr url address to fetch
     * @return the response body as an InputStream
     * @throws IOException on connection failure or a non-200 status
     */
    public static InputStream downLoadFromUrl(String urlStr) throws IOException {
        // httpclient replaces the earlier URLConnection-based implementation
        HttpClient httpClient = new HttpClient();
        HttpMethod httpMethod = new GetMethod(urlStr);
        setHead(httpMethod);
        int status = httpClient.executeMethod(httpMethod);
        // BUG FIX: the original returned null for any non-200 status, which made
        // getHtml() fail later with a NullPointerException. Fail fast instead —
        // IOException is already part of this method's declared contract.
        if (status != HttpStatus.SC_OK) {
            throw new IOException("GET " + urlStr + " failed with HTTP status " + status);
        }
        return httpMethod.getResponseBodyAsStream();
    }
}