A Rather Trashy Whole-Site Crawler -- Java Crawler
阿新 • Published: 2019-01-13
Jsoup
Reads seed pages from a file, crawls the entire site's data, and saves it.
If you just want something quick to use, this will do; if you are studying it to learn from, I personally find it a bit messy.
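The seed URLs come from a plain-text file, one URL per line; the path E://crawler4j/房地產行業seeds.txt is hard-coded in main below. As a hypothetical example, the seeds file could contain:

    http://www.minagri.gov.rw/index.php?id=16
    http://example.com/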
package cn;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.FileUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class CrawlerUtils {

    public static int count = 0;

    // seed pages read from the seeds file
    public static List<String> list = new ArrayList<String>();

    // every URL crawled so far, used for deduplication
    public static HashSet<String> hashSet = new HashSet<String>();

    // thread pool (created but never actually used in this version)
    ExecutorService pool = Executors.newFixedThreadPool(5);

    /** Fetch a page and return its HTML, or null on failure. */
    public static String gethtml(String url) {
        String content;
        try {
            Connection con = Jsoup.connect(url);
            con.header("Accept", "text/html, application/xhtml+xml, */*");
            con.header("Content-Type", "application/x-www-form-urlencoded");
            con.header("User-Agent",
                    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)");
            con.header("Cookie", "");
            content = con.get().toString();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
        return content;
    }

    /**
     * Collect every URL that stays on the main site; returns them as a list.
     */
    public static List<String> geturl(String html, String url) {
        List<String> list = new ArrayList<String>();
        Pattern pattern = Pattern.compile("href=\"(.*?)\"");
        Matcher matcher = pattern.matcher(html);
        // find() iterates forward through the matches
        while (matcher.find()) {
            String urlline = matcher.group().replace("href=\"", "")
                    .replace("\"", "");
            if (urlline.startsWith("http")) {
                // Absolute link: keep it only if it contains the seed's host.
                // (The original code had a separate "https" branch that was
                // unreachable, because contains("http") also matches https.)
                String host = url.replace("http://", "").replace("https://", "");
                if (urlline.contains(host)) {
                    System.out.println(urlline);
                    list.add(urlline);
                }
            } else {
                // relative link: naively resolve it against the seed URL
                String urlString = url.substring(0, url.length() - 1) + urlline;
                System.out.println(urlString);
                list.add(urlString);
            }
        }
        return list;
    }

    /**
     * Save the HTML by appending it to a file.
     */
    public static void saveFile(String pathname, String html, String charset)
            throws IOException {
        FileUtils.write(new File(pathname), html, charset, true);
    }

    /**
     * Write data to a file through a raw byte stream (currently unused).
     */
    public static void WriteByte(String pathname, String date, String charset)
            throws IOException {
        File file = new File(pathname);
        OutputStream outputStream = new FileOutputStream(file);
        byte[] datebyte = date.getBytes(charset);
        outputStream.write(datebyte);
        outputStream.close();
    }

    /** Main worker: fetch a page, extract its links, and save each new page. */
    public static void mainUtil(String url, String maintitle) {
        try {
            String html = gethtml(url);
            System.out.println(html);
            List<String> urlList = geturl(html, url);
            for (String string : urlList) {
                // HashSet.add returns false for URLs we have already seen
                if (hashSet.add(string)) {
                    String htmlline = gethtml(string);
                    try {
                        String title = Jsoup.parse(htmlline)
                                .getElementsByTag("title").get(0).text();
                        saveFile("E://crawler4j/房地產行業/" + maintitle + "/"
                                + title + System.currentTimeMillis() + ".html",
                                htmlline, "utf8");
                        System.out.println("Saved file No. " + count++ + ": " + string);
                    } catch (Exception e) {
                        e.printStackTrace();
                        System.out.println("Write No. " + count + " failed!!!"
                                + " URL: " + string);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        try {
            FileReader reader = new FileReader("E://crawler4j/房地產行業seeds.txt");
            BufferedReader br = new BufferedReader(reader);
            // readLine() returns null at end of file; the original used
            // br.ready(), which is not a reliable end-of-stream test
            String line;
            while ((line = br.readLine()) != null) {
                list.add(line);
            }
            br.close();
            reader.close();
        } catch (Exception e1) {
            e1.printStackTrace();
            System.out.println("No seed pages!!");
        }
        // sample seed: http://www.minagri.gov.rw/index.php?id=16
        for (String url : list) {
            String maintitle = "untitled" + System.currentTimeMillis();
            try {
                maintitle = Jsoup.connect(url).get()
                        .getElementsByTag("title").get(0).text();
            } catch (Exception e) {
                e.printStackTrace();
                continue;
            }
            mainUtil(url, maintitle);
            String html = gethtml(url);
            // was html.equals(null), which throws a NullPointerException when
            // html is null and can never be true otherwise
            if (html == null) {
                continue;
            }
            System.out.println(html);
            List<String> urlList = geturl(html, url);
            for (int i = 0; i < urlList.size(); i++) {
                mainUtil(urlList.get(i), maintitle);
            }
        }
    }
}