Using HtmlParser to fetch HTML content and extract data from it

阿新 • Published: 2019-01-26

I. Fetching HTML content from the web

1. Fetch the HTML content from a URL:


// Requires: java.io.BufferedReader, java.io.InputStreamReader, java.net.URL
public static String getHtmlContent(String urlstr) {
    /* Approach:
     * 1. Read the page: URL -> openStream -> InputStreamReader -> BufferedReader -> readLine.
     * 2. Detect the character encoding automatically with cpdetector:
     *    http://sourceforge.jp/projects/sfnet_cpdetector/
     */
    if (StringUtil.isEmpty(urlstr)) return null;
    StringBuilder result = new StringBuilder();
    try {
        String charset = getCode(urlstr);
        URL url = new URL(urlstr);
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream(), charset))) {
            String temp;
            // Test for null rather than StringUtil.isNotEmpty(temp):
            // a line can be "" even though the page has not ended yet.
            while ((temp = br.readLine()) != null) {
                result.append(temp).append("\n");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result.toString();
}
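The point about blank lines in the method above can be checked without any network access. The following is only a minimal sketch: the class name StreamTextReader and the method readAll are made up for this illustration, and it reads from an in-memory stream instead of url.openStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original HtmlUtil class.
final class StreamTextReader {

    /** Reads the whole stream as text, keeps blank lines, and closes the stream afterwards. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            // readLine() returns null only at end of stream; "" is a legitimate blank line.
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] page = "<html>\n\n<body>hello</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

If the loop stopped at the first empty line instead of the first null, everything after the blank second line of the sample page would be lost.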

2. Automatically detect the character encoding of the page source

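// Requires: java.net.URL, info.monitorenter.cpdetector.io.CodepageDetectorProxy,
//           info.monitorenter.cpdetector.io.JChardetFacade
public static String getCode(String url) {
    // Uses the cpdetector library: register JChardetFacade with the CodepageDetectorProxy,
    // then call detectCodepage. See the library documentation for the details.
    String result = "";
    if (StringUtil.isEmpty(url)) return null;

    CodepageDetectorProxy cdp = CodepageDetectorProxy.getInstance();
    cdp.add(JChardetFacade.getInstance());
    try {
        result = cdp.detectCodepage(new URL(url)).toString();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}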
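For comparison, here is a library-free fallback. This is not what cpdetector does internally: it only looks for a charset declaration in a meta tag near the top of the page and returns a caller-supplied default when none is found. The class name MetaCharsetSniffer and its method are assumptions made for this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; a much cruder approach than cpdetector.
final class MetaCharsetSniffer {

    // Matches charset=... inside <meta charset="..."> or inside a Content-Type meta tag.
    private static final Pattern CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the first bytes of a page, or the fallback. */
    static Charset sniff(byte[] head, Charset fallback) {
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail
        // and the ASCII markup we search for stays intact.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: use the fallback below.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] head = "<html><head><meta charset=\"utf-8\"></head>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(head, StandardCharsets.ISO_8859_1));
    }
}
```

A declaration in the markup can be missing or wrong, which is one reason to prefer a statistical detector such as the one used in getCode.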

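3. Summary: how to add the library and how to quickly work out which classes you need

The package contains a documentation file, binary-release.txt. Read it carefully.

CodepageDetectorProxy is a proxy class implemented as a singleton. Its API documentation (the excerpt is missing here) shows that the proxy needs detectors registered with it to do the actual work, so we look up ICodepageDetector and find its implementing classes (the list is also missing here).

Each implementing class provides a corresponding function, so from the names one can guess that JChardetFacade is the most likely candidate.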

II. Extracting content with regular expressions

Add this method to StringUtil:

// Requires: java.util.regex.Matcher, java.util.regex.Pattern
public static String getContentUseRegex(String regexString, String content, int index) {
    String result = "";
    if (isEmpty(regexString) || isEmpty(content)) return result;

    Pattern pattern = Pattern.compile(regexString);
    Matcher matcher = pattern.matcher(content);
    // Return capture group `index` of the first match; index must not exceed the group count
    if (matcher.find()) {
        result = matcher.group(index);
    }
    return result;
}
Test:
@Test
public void getContentUseRegexTest() {
    // <h1 itemprop="headline">習近平在中非合作論壇約翰內斯堡峰會上總結講話</h1>
    String source = "<h1 itemprop=\"headline\">習近平在中非合作論壇約翰內斯堡峰會上總結講話</h1>";
    String regex = "<h1(.*)itemprop=\"headline(.*)\">(.*)</h1>";
    String str = StringUtil.getContentUseRegex(regex, source, 3);
    System.out.println(str);

    // <div class="time" id="pubtime_baidu" itemprop="datePublished" content="2015-12-06T08:35:00+08:00">2015-12-06 08:35:00</div>
    source = "<div class=\"time\" id=\"pubtime_baidu\" itemprop=\"datePublished\" content=\"2015-12-06T08:35:00+08:00\">2015-12-06 08:35:00</div>";
    regex = "<div(.*)itemprop=\"datePublished\"(.*)>(.*)</div>";
    str = StringUtil.getContentUseRegex(regex, source, 3);
    System.out.println(str);
}
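One thing to watch with the patterns above: (.*) is greedy. On a single element it behaves as expected, but on a full page with several h1 elements it can swallow everything up to the last closing tag. A small sketch of the difference between greedy and reluctant quantifiers follows; the class name HeadlineRegexDemo and the sample HTML are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample markup is invented for this comparison.
final class HeadlineRegexDemo {

    static final String HTML =
            "<h1 itemprop=\"headline\">First</h1><p>text</p><h1>Second</h1>";

    /** Returns capture group 1 of the first match, or "" when nothing matches. */
    static String firstGroup(String regex, String content) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(content);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Greedy: runs to the LAST </h1> in the input.
        System.out.println(firstGroup("<h1[^>]*>(.*)</h1>", HTML));
        // Reluctant: stops at the FIRST </h1> after the opening tag.
        System.out.println(firstGroup("<h1[^>]*>(.*?)</h1>", HTML));
    }
}
```

The greedy pattern yields `First</h1><p>text</p><h1>Second`, while the reluctant one yields `First`. When the input is a whole page rather than one element, `(.*?)` or a negated class such as `[^<]*` is usually the safer choice.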

III. Extracting content with htmlparser

1. Add htmlparser.jar and htmllexer.jar to the project

2. Wrap a method that returns the text of a node

// Requires: org.htmlparser.NodeFilter, org.htmlparser.Parser,
//           org.htmlparser.filters.AndFilter, org.htmlparser.filters.HasAttributeFilter,
//           org.htmlparser.filters.TagNameFilter, org.htmlparser.util.NodeList
public static String getContentUseParse(String urlstr, String encoding, String tag, String attrName, String attrVal) {
    /* Approach: use the htmlparser library and call Parser.parse(filter).
     * NodeFilter is an interface; AndFilter, TagNameFilter and HasAttributeFilter all implement it.
     * AndFilter can be nested layer by layer, for example:
     *   new AndFilter(new TagNameFilter("h1"), new HasAttributeFilter("itemprop", "headline"));
     * Parsing returns a NodeList, from which the text is read.
     */
    String result = "";
    if (StringUtil.isEmpty(urlstr)) return result;
    if (StringUtil.isEmpty(encoding)) encoding = "utf-8";
    try {
        Parser parser = new Parser(urlstr);
        parser.setEncoding(encoding);
        NodeFilter filter;
        if (StringUtil.isNotEmpty(attrName) && StringUtil.isNotEmpty(attrVal)) {
            // Tag name plus attribute with a specific value
            filter = new AndFilter(new TagNameFilter(tag), new HasAttributeFilter(attrName, attrVal));
        } else if (StringUtil.isNotEmpty(attrName)) {
            // Tag name plus attribute with any value
            filter = new AndFilter(new TagNameFilter(tag), new HasAttributeFilter(attrName));
        } else {
            // No attribute constraint: the tag-name filter alone is enough
            filter = new TagNameFilter(tag);
        }
        NodeList nodeList = parser.parse(filter);
        // Guard against pages that contain no matching node
        if (nodeList.size() > 0) {
            result = nodeList.elementAt(0).toPlainTextString();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}

3. Test:

@Test
public void getHtmlContentUseParseTest() {
    // Target elements on the page:
    // <div class="time" id="pubtime_baidu" itemprop="datePublished" content="2015-12-06T08:35:00+08:00">2015-12-06 08:35:00</div>
    // <h1 itemprop="headline">習近平在中非合作論壇約翰內斯堡峰會上總結講話</h1>
    String url = "http://news.sohu.com/20151206/n429917146.shtml";
    String encoding = HtmlUtil.getCode(url);
    String str = HtmlUtil.getContentUseParse(url, encoding, "h1", "itemprop", "headline");

    System.out.println(str);
}
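The way AndFilter combines a TagNameFilter with a HasAttributeFilter is ordinary predicate conjunction. The sketch below mirrors that idea with java.util.function.Predicate so that it runs without htmlparser on the classpath. The Tag class and the helper names are stand-ins invented for this illustration and are not htmlparser types.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Library-free analogue of AndFilter(TagNameFilter, HasAttributeFilter).
final class FilterCompositionDemo {

    /** Hypothetical stand-in for a parsed tag; not an htmlparser class. */
    static final class Tag {
        final String name;
        final Map<String, String> attrs;
        final String text;

        Tag(String name, Map<String, String> attrs, String text) {
            this.name = name;
            this.attrs = attrs;
            this.text = text;
        }
    }

    static Predicate<Tag> tagName(String name) {
        return t -> t.name.equalsIgnoreCase(name);
    }

    static Predicate<Tag> hasAttribute(String attr, String value) {
        return t -> value.equals(t.attrs.get(attr));
    }

    /** Text of the first tag accepted by the filter, or "" when none matches. */
    static String firstText(List<Tag> tags, Predicate<Tag> filter) {
        for (Tag t : tags) {
            if (filter.test(t)) return t.text;
        }
        return "";
    }

    static List<Tag> sample() {
        return Arrays.asList(
                new Tag("h1", Collections.<String, String>emptyMap(), "Site name"),
                new Tag("h1", Collections.singletonMap("itemprop", "headline"), "Headline"),
                new Tag("div", Collections.singletonMap("itemprop", "datePublished"), "2015-12-06 08:35:00"));
    }

    public static void main(String[] args) {
        // Both conditions must hold, as with AndFilter.
        Predicate<Tag> headline = tagName("h1").and(hasAttribute("itemprop", "headline"));
        System.out.println(firstText(sample(), headline));
    }
}
```

With the tag-name condition alone the first h1 ("Site name") would be returned, which is why the attribute condition is added when a page contains more than one element of the same tag.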
