Parsing HTML pages with jsoup on Android

阿新 • Published: 2018-12-27

Usage on Android:

Add the dependency

    implementation 'org.jsoup:jsoup:1.10.1'

The code:

package com.loaderman.jsoupdemo;

import android.os.Bundle;
import android.support.v7.app.AppCompatActivity;
import android.view.View;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
 
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Main2Activity extends AppCompatActivity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main2);
        findViewById(R.id.btn).setOnClickListener(new View.OnClickListener() {
            @Override
            public void onClick(View view) {
                new Thread(new Runnable() {
                    @Override
                    public void run() {
                        try {
                            // Fetch the page and parse it into a Document
                            Document doc = Jsoup.connect("http://192.168.0.195:8088/news.html").get();
                            // Select ul elements whose class attribute is exactly w_newslistpage_list
                            Elements links = doc.select("ul[class=w_newslistpage_list]");
                            for (Element link : links) {
                                Elements li = link.select("li"); // find the li elements inside this ul
                                for (Element element : li) { // iterate over each list item
                                    Elements select = element.select("a[title]"); // a elements that carry a title attribute
                                    if (!select.isEmpty()) {
                                        String linkHref = select.get(0).attr("href"); // href value
                                        String linkText = select.get(0).text(); // link text
                                        System.out.println("Crawl result 1 --> " + linkHref + linkText);
                                    }
                                    // Query the current li rather than the whole ul, so each item yields its own date
                                    Elements select1 = element.select("span[class=date]"); // span elements whose class is date
                                    if (!select1.isEmpty()) {
                                        String date = select1.get(0).text();
                                        System.out.println("Crawl result 2 --> " + date);
                                    }
                                }
                            }
                        } catch (IOException e) {
                            e.printStackTrace();
                        }

                    }
                }).start();
            }
        });
    }
}

Summary:

Parsing and traversing an HTML document

How to parse an HTML document:

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

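The parser makes every attempt to create a clean parse result from the HTML you provide, whether or not that HTML is well-formed. For example, it handles:

unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
implicit tags (e.g. a bare <td>Table data</td> is wrapped in <table><tr><td>...)
a reliable document structure (the html element contains a head and a body, and only appropriate elements appear in the head)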
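
As a minimal sketch of this clean-up behaviour, the snippet below feeds jsoup a fragment with unclosed p tags and no html, head, or body wrapper. The class name ParseCleanupDemo and the sample markup are illustrative assumptions, and the jsoup dependency shown earlier is assumed to be on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ParseCleanupDemo {
    // Parses a fragment with two unclosed <p> tags and no html/head/body wrapper.
    static Document parseFragment() {
        return Jsoup.parse("<title>First parse</title><p>Lorem <p>Ipsum");
    }

    public static void main(String[] args) {
        Document doc = parseFragment();
        System.out.println(doc.title());             // title taken from the generated head
        System.out.println(doc.select("p").size());  // number of p elements after parsing
        System.out.println(doc.body().html());       // normalized body markup
    }
}
```

Printing the body markup is a quick way to see how jsoup closed the tags before you write selectors against it.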

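The object model of a document

A document consists of Elements and TextNodes, plus a few other node types.
The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
An Element contains a list of child nodes and has one parent Element. It also provides a filtered list containing only its child Elements.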
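
The difference between the full child node list and the filtered element-only list can be seen in a small sketch. The class name NodeModelDemo and the sample markup are assumptions for illustration; childNodes(), children(), and parent() are the jsoup accessors being compared.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class NodeModelDemo {
    // Returns the single div from a small sample document.
    static Element sampleDiv() {
        Document doc = Jsoup.parse("<div>Hello <b>world</b>!</div>");
        return doc.select("div").first();
    }

    public static void main(String[] args) {
        Element div = sampleDiv();
        // childNodes() returns every child Node: text "Hello ", element <b>, text "!"
        System.out.println(div.childNodes().size());
        // children() returns only the child Elements: <b>
        System.out.println(div.children().size());
        // the div was placed inside the body that jsoup generated
        System.out.println(div.parent().tagName());
    }
}
```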

Loading a document from a file

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

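Description

The parse(File in, String charsetName, String baseUri) method loads and parses an HTML file. If an error occurs while loading the file it throws an IOException, which you should handle appropriately.

The baseUri parameter is used to resolve relative URLs found in the file. If you do not need that, pass an empty string.

There is also a sister method, parse(File in, String charsetName), which uses the file's location as the baseUri. It is useful when the site being parsed lives on the local file system and the links it contains also point into that file system.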
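
To make the role of baseUri concrete, here is a minimal sketch of relative URL resolution. It uses the JDK's java.net.URI rather than jsoup, so it only illustrates the idea. The class name BaseUriSketch and the sample paths are assumptions. In jsoup, the resolved form of an attribute is read with absUrl("href") or the abs:href attribute key.

```java
import java.net.URI;

class BaseUriSketch {
    // Resolves a possibly relative link against a base URI.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/news/index.html";
        // relative to the directory of the base document
        System.out.println(resolve(base, "item1.html"));
        // relative to the site root
        System.out.println(resolve(base, "/about.html"));
        // already absolute, so the base is not used
        System.out.println(resolve(base, "http://other.example/x.html"));
    }
}
```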

Using selector syntax to find elements

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a elements with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // img elements whose src ends with .png

Element masthead = doc.select("div.masthead").first(); // first div with class masthead

Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of an h3 with class r

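Description

jsoup elements support a CSS-like (or jQuery-like) selector syntax that makes for powerful and flexible queries.

The select method is available on Document, Element, and Elements objects. It is contextual, so you can filter by selecting from a specific element or by chaining select calls.

select returns an Elements collection, which provides a range of methods for extracting and manipulating the results.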
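
The contextual behaviour of select can be tried without a device or a server. The sketch below repeats the traversal from the Android example against an inline string. The markup, the class name NewsListDemo, and the helper extractRows are hypothetical and only mirror the selectors used there.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class NewsListDemo {
    // Hypothetical markup shaped like the list the Android example queries.
    static final String HTML =
        "<ul class=\"w_newslistpage_list\">"
        + "<li><a href=\"/news/1.html\" title=\"First\">First</a><span class=\"date\">2018-12-01</span></li>"
        + "<li><a href=\"/news/2.html\" title=\"Second\">Second</a><span class=\"date\">2018-12-02</span></li>"
        + "</ul>";

    // Collects "href | text | date" for every list item.
    static List<String> extractRows(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        // select on the Document first, then select again on each matched Element
        for (Element li : doc.select("ul.w_newslistpage_list > li")) {
            Element link = li.select("a[title]").first();
            Element date = li.select("span.date").first();
            if (link != null && date != null) {
                rows.add(link.attr("href") + " | " + link.text() + " | " + date.text());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : extractRows(HTML)) {
            System.out.println(row);
        }
    }
}
```

Because each inner select runs on the current li, every row pairs a link with the date from the same list item.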

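Selector overview

tagname: find elements by tag, e.g. a
ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: elements with the attribute, e.g. [href]
[^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: elements with the attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
[attr~=regex]: elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: matches all elements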
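
The attribute operators boil down to simple string tests on the attribute value. The sketch below models them with plain JDK string and regex calls. It illustrates the matching rules and is not jsoup's implementation; the class and method names are made up for this example, and the case-insensitive value matching that jsoup applies is left out to keep it short.

```java
import java.util.regex.Pattern;

class AttrOperatorSketch {
    // [attr^=v]: the value starts with v
    static boolean prefix(String value, String v) { return value.startsWith(v); }

    // [attr$=v]: the value ends with v
    static boolean suffix(String value, String v) { return value.endsWith(v); }

    // [attr*=v]: the value contains v
    static boolean contains(String value, String v) { return value.contains(v); }

    // [attr~=regex]: the pattern may match anywhere in the value
    static boolean regex(String value, String re) {
        return Pattern.compile(re).matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(prefix("http://example.com/a", "http:"));
        System.out.println(suffix("logo.png", ".png"));
        System.out.println(contains("/site/path/page.html", "/path/"));
        System.out.println(regex("photo.JPEG", "(?i)\\.(png|jpe?g)"));
        System.out.println(regex("report.pdf", "(?i)\\.(png|jpe?g)"));
    }
}
```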

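Selector combinations

el#id: element with ID, e.g. div#logo
el.class: element with class, e.g. div.masthead
el[attr]: element with attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under an element with class "body"
parent > child: child elements that are direct children of parent, e.g. div.content > p finds p elements, and body > * finds the direct children of the body tag
siblingA + siblingB: sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: sibling X elements preceded by sibling A, e.g. h1 ~ p
el, el, el: group multiple selectors; finds the unique elements that match any of them, e.g. div.masthead, div.logo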
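
The combinators are easiest to tell apart by counting matches on one small document. This is a sketch under assumptions: the markup and the class name CombinatorDemo are invented for the example, and the counts in the comments follow from that markup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class CombinatorDemo {
    // One heading, two sibling paragraphs, and one paragraph nested a level deeper.
    static final String HTML =
        "<div class=\"content\">"
        + "<h1>Title</h1><p>one</p><p>two</p>"
        + "<section><p>nested</p></section>"
        + "</div>";

    // Returns how many elements the selector matches in the sample markup.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(count("div.content p"));    // 3: descendants at any depth
        System.out.println(count("div.content > p"));  // 2: direct children only
        System.out.println(count("h1 + p"));           // 1: the p right after the h1
        System.out.println(count("h1 ~ p"));           // 2: every later sibling p of the h1
        System.out.println(count("h1, section"));      // 2: union of both selectors
    }
}
```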

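Pseudo selectors

:lt(n): elements whose sibling index (their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3) finds the first three cells of each row
:gt(n): elements whose sibling index is greater than n, e.g. div p:gt(2) finds p elements inside a div whose sibling index is greater than 2
:eq(n): elements whose sibling index equals n, e.g. form input:eq(1) finds input elements inside a form whose sibling index is 1
:has(selector): elements that contain an element matching the selector, e.g. div:has(p) finds divs that contain p elements
:not(selector): elements that do not match the selector, e.g. div:not(.logo) finds all divs that do not have the class "logo"
:contains(text): elements that contain the given text; the search is case-insensitive, e.g. p:contains(jsoup)
:containsOwn(text): elements that directly contain the given text
:matches(regex): elements whose text matches the given regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): elements whose own text matches the given regular expression
Note: the pseudo selector indexes above are 0-based, so the first element has index 0, the second has index 1, and so on.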
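
A short sketch can confirm the 0-based indexes and the difference between :contains and :containsOwn. The markup, the class name PseudoSelectorDemo, and the helper texts are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class PseudoSelectorDemo {
    // One table row with four cells, plus two divs for the :has/:not/:contains checks.
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
        + "<div class=\"logo\"><p>jsoup <b>parser</b></p></div>"
        + "<div><span>other</span></div>";

    // Joins the text of every element matched by the selector with commas.
    static String texts(String selector) {
        StringBuilder sb = new StringBuilder();
        for (Element e : Jsoup.parse(HTML).select(selector)) {
            if (sb.length() > 0) sb.append(",");
            sb.append(e.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(texts("td:lt(3)"));               // cells with index 0, 1, 2
        System.out.println(texts("td:gt(1)"));               // cells with index 2, 3
        System.out.println(texts("td:eq(0)"));               // the first cell
        System.out.println(texts("div:has(p)"));             // the div that contains a p
        System.out.println(texts("div:not(.logo)"));         // the div without class logo
        System.out.println(texts("p:contains(parser)"));     // text found in a descendant counts
        System.out.println(texts("p:containsOwn(parser)"));  // empty: "parser" belongs to the b, not the p
    }
}
```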

The detailed API is as follows:

CSS-like element selector, that finds elements matching a query.

Selector syntax

A selector is a chain of simple selectors, separated by combinators. Selectors are case insensitive (including against elements, attributes, and attribute values).

The universal selector (*) is implicit when no element selector is supplied (i.e. *.header and .header are equivalent).

Pattern	Matches	Example
`*`	any element	`*`
`tag`	elements with the given tag name	`div`
`*|E`	elements of type E in any namespace ns	`*|name` finds `<fb:name>` elements
`ns|E`	elements of type E in the namespace ns	`fb|name` finds `<fb:name>` elements
`#id`	elements with attribute ID of "id"	`div#wrap`, `#logo`
`.class`	elements with a class name of "class"	`div.left`, `.result`
`[attr]`	elements with an attribute named "attr" (with any value)	`a[href]`, `[title]`
`[^attrPrefix]`	elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets	`[^data-]`, `div[^data-]`
`[attr=val]`	elements with an attribute named "attr", and value equal to "val"	`img[width=500]`, `a[rel=nofollow]`
`[attr="val"]`	elements with an attribute named "attr", and value equal to "val"	`span[hello="Cleveland"][goodbye="Columbus"]`, `a[rel="nofollow"]`
`[attr^=valPrefix]`	elements with an attribute named "attr", and value starting with "valPrefix"	`a[href^=http:]`
`[attr$=valSuffix]`	elements with an attribute named "attr", and value ending with "valSuffix"	`img[src$=.png]`
`[attr*=valContaining]`	elements with an attribute named "attr", and value containing "valContaining"	`a[href*=/search/]`
`[attr~=regex]`	elements with an attribute named "attr", and value matching the regular expression	`img[src~=(?i)\\.(png|jpe?g)]`
	The above may be combined in any order	`div.header[title]`
	Combinators
`E F`	an F element descended from an E element	`div a`, `.logo h1`
`E > F`	an F direct child of E	`ol > li`
`E + F`	an F element immediately preceded by sibling E	`li + li`, `div.head + div`
`E ~ F`	an F element preceded by sibling E	`h1 ~ p`
`E, F, G`	all matching elements E, F, or G	`a[href], div, h3`
	Pseudo selectors
`:lt(n)`	elements whose sibling index is less than n	`td:lt(3)` finds the first 3 cells of each row
`:gt(n)`	elements whose sibling index is greater than n	`td:gt(1)` finds cells after skipping the first two
`:eq(n)`	elements whose sibling index is equal to n	`td:eq(0)` finds the first cell of each row
`:has(selector)`	elements that contains at least one element matching the selector	`div:has(p)` finds divs that contain p elements
`:not(selector)`	elements that do not match the selector. See also `Elements.not(String)`	`div:not(.logo)` finds all divs that do not have the "logo" class. `div:not(:has(div))` finds divs that do not contain divs.
`:contains(text)`	elements that contains the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants.	`p:contains(jsoup)` finds p elements containing the text "jsoup".
`:matches(regex)`	elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants.	`td:matches(\\d+)` finds table cells containing digits. `div:matches((?i)login)` finds divs containing the text, case insensitively.
`:containsOwn(text)`	elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants.	`p:containsOwn(jsoup)` finds p elements with own text "jsoup".
`:matchesOwn(regex)`	elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants.	`td:matchesOwn(\\d+)` finds table cells directly containing digits. `div:matchesOwn((?i)login)` finds divs containing the text, case insensitively.
`:containsData(data)`	elements that contains the specified data. The contents of `script` and `style` elements, and `comment` nodes (etc) are considered data nodes, not text nodes. The search is case insensitive. The data may appear in the found element, or any of its descendants.	`script:containsData(jsoup)` finds script elements containing the data "jsoup".
	The above may be combined in any order and with other selectors	`.light:contains(name):eq(0)`
`:matchText`	treats text nodes as elements, and so allows you to match against and select text nodes. Note that using this selector will modify the DOM, so you may want to `clone` your document before using.	`p:matchText:firstChild` with input `<p>One<br />Two</p>` will return one `PseudoTextElement` with text "`One`".
Structural pseudo selectors
`:root`	The element that is the root of the document. In HTML, this is the `html` element	`:root`
`:nth-child(an+b)`	elements that have `an+b-1` siblings before it in the document tree, for any positive integer or zero value of `n`, and has a parent element. For values of `a` and `b` greater than zero, this effectively divides the element's children into groups of a elements (the last group taking the remainder), and selecting the bth element of each group. For example, this allows the selectors to address every other row in a table, and could be used to alternate the color of paragraph text in a cycle of four. The `a` and `b` values must be integers (positive, negative, or zero). The index of the first child of an element is 1. In addition to this, `:nth-child()` can take `odd` and `even` as arguments instead. `odd` has the same signification as `2n+1`, and `even` has the same signification as `2n`.	`tr:nth-child(2n+1)` finds every odd row of a table. `:nth-child(10n-1)` the 9th, 19th, 29th, etc, element. `li:nth-child(5)` the 5th li
`:nth-last-child(an+b)`	elements that have `an+b-1` siblings after it in the document tree. Otherwise like `:nth-child()`	`tr:nth-last-child(-n+2)` the last two rows of a table
`:nth-of-type(an+b)`	pseudo-class notation represents an element that has `an+b-1` siblings with the same expanded element name before it in the document tree, for any zero or positive integer value of n, and has a parent element	`img:nth-of-type(2n+1)`
`:nth-last-of-type(an+b)`	pseudo-class notation represents an element that has `an+b-1` siblings with the same expanded element name after it in the document tree, for any zero or positive integer value of n, and has a parent element	`img:nth-last-of-type(2n+1)`
`:first-child`	elements that are the first child of some other element.	`div > p:first-child`
`:last-child`	elements that are the last child of some other element.	`ol > li:last-child`
`:first-of-type`	elements that are the first sibling of its type in the list of children of its parent element	`dl dt:first-of-type`
`:last-of-type`	elements that are the last sibling of its type in the list of children of its parent element	`tr > td:last-of-type`
`:only-child`	elements that have a parent element and whose parent element has no other element children
`:only-of-type`	an element that has a parent element and whose parent element has no other element children with the same expanded element name
`:empty`	elements that have no children at all
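
The an+b arithmetic behind :nth-child and its relatives can be worked through with plain integers. The sketch below is an illustration of the formula only, with no jsoup involved; the class name NthChildSketch and its methods are hypothetical. Positions are 1-based, as in the table, and a position matches when it equals a*n + b for some integer n that is zero or greater.

```java
import java.util.ArrayList;
import java.util.List;

class NthChildSketch {
    // True when the 1-based position equals a*n + b for some integer n >= 0.
    static boolean matches(int a, int b, int position) {
        if (a == 0) return position == b;
        int diff = position - b;
        return diff % a == 0 && diff / a >= 0;
    }

    // Lists the matching 1-based positions among `count` siblings.
    static List<Integer> positions(int a, int b, int count) {
        List<Integer> out = new ArrayList<>();
        for (int pos = 1; pos <= count; pos++) {
            if (matches(a, b, pos)) out.add(pos);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(positions(2, 1, 6));    // 2n+1 (odd): [1, 3, 5]
        System.out.println(positions(2, 0, 6));    // 2n (even): [2, 4, 6]
        System.out.println(positions(0, 5, 6));    // 5: [5]
        System.out.println(positions(-1, 2, 5));   // -n+2: [1, 2]
        System.out.println(positions(10, -1, 30)); // 10n-1: [9, 19, 29]
    }
}
```

For :nth-last-child and :nth-last-of-type the same test applies, with positions counted from the last sibling instead of the first, which is why -n+2 selects the last two rows.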
