[Building a Vertical Search Engine 11] Using htmlparser to Get a Page's Character Encoding

阿新 • Published: 2019-01-02

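1. Define the goal. An HTML page usually contains a statement that declares its encoding:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

The features of this line can be used to extract the page's encoding.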

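2. Pick out the features.

      1) It is a meta tag.
      2) It has an http-equiv attribute whose value is Content-Type.
      3) Take the value of the content attribute, split it on ";" and keep the second element, then split that on "=" and keep the second element.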
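Feature 3 can be tried out on its own, without htmlparser. The following is a minimal sketch using only the standard library; the class name ContentTypeCharset and the method charsetFromContent are hypothetical names chosen for illustration. Rather than assuming the charset is always the second element, it checks every ";"-separated part, which is a small deviation from the steps above, and it returns an empty string when no charset is declared.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of htmlparser.
class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type value such as
     * "text/html; charset=gb2312", or "" when none is present.
     */
    static String charsetFromContent(String content) {
        if (content == null) {
            return "";
        }
        // Split on ";" and inspect every parameter, not only index 1.
        for (String part : content.split(";")) {
            // Split on the first "=" only: name on the left, value on the right.
            String[] pair = part.split("=", 2);
            if (pair.length == 2
                    && pair[0].trim().toLowerCase(Locale.ROOT).equals("charset")) {
                // Drop optional quotes around the value, e.g. charset="utf-8".
                return pair[1].trim().replace("\"", "").replace("'", "");
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContent("text/html; charset=gb2312"));
        System.out.println(charsetFromContent("text/html").isEmpty());
    }
}
```

Returning an empty string instead of throwing keeps the caller simple: a page without a charset declaration is a normal case, not an error.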

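3. With everything in place, write the code. Once the target is chosen and its features are outlined, the solution follows. As in the previous post on using htmlparser filters in practice, it combines AndFilter, TagNameFilter, and HasAttributeFilter through the NodeFilter interface. The code is as follows: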

package org.algorithm;

import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class getEncoding {
    // HTML encoding declaration: <meta http-equiv="Content-Type" content="text/html;charset=gb2312" />
    public static String getContentEncoding(String url) throws ParserException, IOException, UnsupportedEncodingException {
        String encoding = "";
        try {
            Parser parser = new Parser(url); // fetch and parse the URL

            // Match meta tags whose http-equiv attribute equals "Content-Type"
            NodeFilter filter = new AndFilter(new TagNameFilter("meta"),
                    new HasAttributeFilter("http-equiv", "Content-Type"));
            NodeList nodelist = parser.extractAllNodesThatMatch(filter);
            // The list can be non-null but empty when no tag matches
            if (nodelist != null && nodelist.size() > 0) {
                TagNode tag = (TagNode) nodelist.elementAt(0);
                String content = tag.getAttribute("content");
                // Split on ";": element 0 is "text/html", element 1 is "charset=gb2312"
                String[] parts = content == null ? new String[0] : content.split(";");
                if (parts.length > 1) {
                    // Split on "=": element 0 is "charset", element 1 is "gb2312"
                    String[] pair = parts[1].trim().split("=");
                    if (pair.length > 1) {
                        encoding = pair[1].trim();
                    }
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return encoding;
    }

    public static void main(String[] args) throws ParserException, IOException, UnsupportedEncodingException {
        String url="http://news.baidu.com/";
        String encoding = getContentEncoding(url);
        System.out.print(encoding);
    }

}

Output:

gb2312
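
The value printed above is only a label taken from the page, so it may be misspelled or name a charset the running JVM does not provide. Before using it to decode bytes, it can help to resolve it through java.nio.charset.Charset with a fallback. The sketch below assumes that approach; the class name CharsetResolver and the method resolve are hypothetical, and the choice of UTF-8 as the fallback is an assumption, not something the original code does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of htmlparser.
class CharsetResolver {

    /** Returns the named charset, or the fallback when the name is empty, illegal, or unsupported. */
    static Charset resolve(String name, Charset fallback) {
        if (name == null || name.trim().isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(name.trim());
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Whether gb2312 resolves depends on the charsets shipped with the JVM.
        System.out.println(resolve("gb2312", StandardCharsets.UTF_8));
        System.out.println(resolve("", StandardCharsets.UTF_8));
    }
}
```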
