網頁內容解析簡單實現

阿新 • • 發佈：2017-05-04

return end pro spa del crawl 測試節點 nod

概述

　　在日常開發工作中，有時候我們需要去一些網站上抓取數據，要想抓取數據，就必須先了解網頁結構，根據具體的網頁結構，編寫對應的程序對數據進行采集。最近剛好有一個需求，需要更新收貨地址。由於系統現有的收貨地址是很早以前的數據了，用戶在使用的過程中反映找不到用戶所在地的地址信息，因此對現有地址數據的更新也就提上了日程。

　　通過查找，最終找到了中華人民共和國國家統計局官網上有需要的地址數據，官方渠道，數據的準確性、完整性都有保障。本文采用國家統計局截止到2016年7月31日公布的數據（目前最新的數據）為例子進行演示，有關頁面結構的分析我這裏就不多說了（開發基本技能），直接上代碼吧，並附上完整的DEMO。

需要解析頁面效果

技術分享

POM 配置文件

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>crawler</groupId 
>
  <artifactId>crawler</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version 
>3.5.1</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
      <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.7.2</version>
    </dependency>

    <dependency>
        <groupId>net.sf.json-lib</groupId>
        <artifactId>json-lib</artifactId>
        <version>2.4</version>
        <classifier>jdk13</classifier><!--指定jdk版本-->
    </dependency>
  </dependencies>
</project>

地址節點 Location 文件

package crawler;

import java.util.List;
public class Location {
    String code;
    String name;
    List<Location> children;

    public Location(){

    }

    public Location(String code, String name){
        this.code = code;
        this.name = name;
    }

    public String getCode() {
        return code;
    }

    public void setCode(String code) {
        this.code = code;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public List<Location> getChildren() {
        return children;
    }

    public void setChildren(List<Location> children) {
        this.children = children;
    }
}

測試類 TestMain

package crawler;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import net.sf.json.JSONObject;

public class TestMain {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.stats.gov.cn/tjsj/tjbz/xzqhdm/201703/t20170310_1471429.html").get();
        Element masthead = doc.select("div.xilan_con").first().
                getElementsByClass("TRS_Editor").first().
                getElementsByClass("TRS_PreAppend").first();
        Elements allElements = masthead.getElementsByTag("p");

        List<Location> provinceList = new ArrayList<Location>();
        Location province = null;
        Location city = null;
        for(Element element : allElements){
            String html = element.select("span[lang]").first().html();
            String locationCode = TestMain.getLocationCode(html);
            String locationName = element.select("span[style]").last().html();
            if(locationCode.endsWith("0000")){    //省或直轄市
                province = new Location(locationCode, locationName);
                province.setChildren(new ArrayList<Location>());
                provinceList.add(province);
            }else if(locationCode.endsWith("00")){    //市
                city = new Location(locationCode, locationName);
                city.setChildren(new ArrayList<Location>());
                province.getChildren().add(city);
            }else{    //縣或區
                Location county = new Location(locationCode, locationName);
                city.getChildren().add(county);
            }
        }

        Location root = new Location("0", "root");
        root.setChildren(provinceList);
        JSONObject jsonObj = JSONObject.fromObject(root);
        System.out.println(jsonObj.toString());
    }

    public static String getLocationCode(String html){
        String regEx="[^0-9]";
        Pattern p = Pattern.compile(regEx);
        Matcher m = p.matcher(html);
        return m.replaceAll("").trim();
    }
}

結果（數據下載）

技術分享

歡迎轉載，轉載必須標明出處

網頁內容解析簡單實現

return end pro spa del crawl 測試節點 nod 概述　　在日常開發工作中，有時候我們需要去一些網站上抓取數據，要想抓取數據，就必須先了解網頁結構，根據具體的網頁結構，編寫對應的程序對數據進行采集。最近剛好有一個需求，需要更新收貨地址

網頁內容解析簡單實現

網頁內容解析簡單實現

網頁對聯效果簡單實現

JS禁止檢視網頁原始碼的簡單實現

網頁列印的簡單實現 + window.print

c#關於網頁內容抓取，簡單爬蟲的實現。（包括動態，靜態的）

python3 簡單實現從csv文件中讀取內容，並對內容進行分類統計

nginx自定義站點目錄及簡單編寫開發網頁內容講解

jQuery簡單實現點擊文本框復制內容到剪貼板上的方法

簡單實現iframe的高度根據頁面內容自適應（jQuery方式）

Spring 容器IOC解析及簡單實現

JSP簡單實現統計網頁訪問次數

jQuery簡單實現iframe的高度根據頁面內容自適應的方法(轉)

菜鳥學SSH——Spring容器IOC解析及簡單實現

網站開發之MyEclipse簡單實現JSP網頁表單提交及傳遞值

基於Tags的簡單內容推薦的實現

.NetCore實踐爬蟲系統（一）解析網頁內容

利用Clipboard.js解決ios手機瀏覽器無法實現網頁內容複製

Python之簡單爬取網頁內容

使用Flexible.js實現手機端網頁內容適配(rem適配法)

奔五的人學iOS:swift獲取網頁並解析需要的內容(1)

網頁內容解析簡單實現

相關推薦