Java Web Crawler crawler4j Study Notes: The RobotstxtParser Class

阿新 • Published: 2018-12-26

Source code

package edu.uci.ics.crawler4j.robotstxt;

import java.util.StringTokenizer;

// Builds the allow and disallow rule sets from the text of a site's robots.txt
public class RobotstxtParser {

  // When used with String.matches, the "(?i)" flag makes the match case-insensitive
  private static final String PATTERNS_USERAGENT = "(?i)^User-agent:.*";
  private static final String PATTERNS_DISALLOW = "(?i)Disallow:.*";
  private static final String PATTERNS_ALLOW = "(?i)Allow:.*";

  // "User-agent:" is 11 characters long
  private static final int PATTERNS_USERAGENT_LENGTH = 11;
  private static final int PATTERNS_DISALLOW_LENGTH = 9;
  private static final int PATTERNS_ALLOW_LENGTH = 6;

  public static HostDirectives parse(String content, String myUserAgent) {

    HostDirectives directives = null;
    boolean inMatchingUserAgent = false;

    // Extract each line of robots.txt in turn
    StringTokenizer st = new StringTokenizer(content, "\n\r");
    while (st.hasMoreTokens()) {
      String line = st.nextToken();

      // Everything after '#' is a comment
      int commentIndex = line.indexOf("#");
      if (commentIndex > -1) {
        line = line.substring(0, commentIndex);
      }

      // remove any html markup
      line = line.replaceAll("<[^>]+>", "");    // "<", one or more characters other than ">", then ">"
      line = line.trim();

      if (line.length() == 0) {
        continue;
      }

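      if (line.matches(PATTERNS_USERAGENT)) {   // a User-agent line
        String ua = line.substring(PATTERNS_USERAGENT_LENGTH).trim().toLowerCase();
        // Does this User-agent section apply to the current crawler?
        // Note: ua has been lower-cased but myUserAgent has not, so a myUserAgent
        // containing upper-case letters can only match through "*".
        if (ua.equals("*") || ua.contains(myUserAgent)) {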
          inMatchingUserAgent = true;
        } else {
          inMatchingUserAgent = false;
        }
      } else if (line.matches(PATTERNS_DISALLOW)) { // a Disallow line
        if (!inMatchingUserAgent) {
          continue;
        }
        String path = line.substring(PATTERNS_DISALLOW_LENGTH).trim();
        if (path.endsWith("*")) {
          // Keep only the part of the path before the trailing asterisk
          path = path.substring(0, path.length() - 1);
        }
        path = path.trim();
        if (path.length() > 0) {
          if (directives == null) {
            directives = new HostDirectives();
          }
          // Add a disallow rule
          directives.addDisallow(path);
        }
      } else if (line.matches(PATTERNS_ALLOW)) {    // an Allow line
        if (!inMatchingUserAgent) {
          continue;
        }
        String path = line.substring(PATTERNS_ALLOW_LENGTH).trim();
        // Keep only the part of the path before the trailing asterisk
        if (path.endsWith("*")) {
          path = path.substring(0, path.length() - 1);
        }
        path = path.trim();
        if (directives == null) {
          directives = new HostDirectives();
        }
        // Add an allow rule
        directives.addAllow(path);
      }
    }

    return directives;
  }
}
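
To see the parsing steps above in action without the rest of crawler4j, here is a minimal self-contained sketch. It is not the library's code: the class RobotsTxtSketch, the Rules holder (a hypothetical stand-in for HostDirectives) and the helper stripTrailingStar are names invented for illustration. It also makes three simplifying assumptions: it lower-cases the crawler's own name before comparing, it skips the HTML-markup removal, and it returns an empty Rules object instead of null when nothing matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class RobotsTxtSketch {

  /** Hypothetical stand-in for crawler4j's HostDirectives: just two lists. */
  static final class Rules {
    final List<String> allows = new ArrayList<>();
    final List<String> disallows = new ArrayList<>();
  }

  static Rules parse(String content, String myUserAgent) {
    Rules rules = new Rules();
    boolean inMatchingUserAgent = false;
    // Assumption: lower-case our own name so it can match the lower-cased ua.
    String me = myUserAgent.toLowerCase();

    // StringTokenizer splits on \n and \r and never returns empty tokens.
    StringTokenizer st = new StringTokenizer(content, "\n\r");
    while (st.hasMoreTokens()) {
      String line = st.nextToken();

      // Drop everything after '#', then surrounding whitespace.
      int commentIndex = line.indexOf('#');
      if (commentIndex > -1) {
        line = line.substring(0, commentIndex);
      }
      line = line.trim();
      if (line.isEmpty()) {
        continue;
      }

      if (line.matches("(?i)^User-agent:.*")) {
        String ua = line.substring("User-agent:".length()).trim().toLowerCase();
        inMatchingUserAgent = ua.equals("*") || ua.contains(me);
      } else if (inMatchingUserAgent && line.matches("(?i)Disallow:.*")) {
        String path = stripTrailingStar(line.substring("Disallow:".length()));
        if (!path.isEmpty()) {          // an empty Disallow adds no rule
          rules.disallows.add(path);
        }
      } else if (inMatchingUserAgent && line.matches("(?i)Allow:.*")) {
        rules.allows.add(stripTrailingStar(line.substring("Allow:".length())));
      }
    }
    return rules;
  }

  /** Trims the value and removes a single trailing '*', as the parser above does. */
  static String stripTrailingStar(String raw) {
    String path = raw.trim();
    if (path.endsWith("*")) {
      path = path.substring(0, path.length() - 1);
    }
    return path.trim();
  }

  public static void main(String[] args) {
    String robots = String.join("\n",
        "User-agent: otherbot",
        "Disallow: /everything/",
        "",
        "User-agent: *",
        "Disallow: /private/*   # trailing star is stripped",
        "Allow: /private/public/",
        "Disallow:");
    Rules rules = parse(robots, "MyCrawler");
    System.out.println("disallow = " + rules.disallows); // disallow = [/private/]
    System.out.println("allow    = " + rules.allows);    // allow    = [/private/public/]
  }
}
```

The rules under "User-agent: otherbot" are ignored because that section does not name the crawler, while the "*" section applies to every crawler. Because String.matches must match the whole line, a Disallow line can never be mistaken for an Allow line even though "Allow:" is a substring of "Disallow:".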
