[Building a Vertical Search Engine, Part 10] Using Filters in HtmlParser
阿新 • Published: 2019-01-03
Filter types:
Predicate filters:
TagNameFilter
HasAttributeFilter
HasChildFilter
HasParentFilter
HasSiblingFilter
IsEqualFilter
Logical filters:
AndFilter
NotFilter
OrFilter
XorFilter
Other filters:
NodeClassFilter
StringFilter
LinkStringFilter
LinkRegexFilter
RegexFilter
CssSelectorNodeFilter
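This post covers TagNameFilter, HasChildFilter, and HasAttributeFilter, and shows how to combine them. The first example uses a single TagNameFilter to grab the first <p> tag on a page: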
package org.algorithm;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterImg {
    public static void main(String[] args) throws ParserException {
        // Fetch and parse the page at this URL.
        Parser parser = new Parser("http://smart.huanqiu.com/roll/2016-08/9351546.html");
        // Accept every <p> tag.
        NodeFilter filter = new TagNameFilter("p");
        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        // Print the first match, or an empty string if nothing matched.
        String sou = "";
        if (nodes.size() > 0) {
            Node source = nodes.elementAt(0);
            sou = source.toString();
        }
        System.out.println(sou);
    }
}
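Scenario 1:
Suppose you want to grab the links on a page that contain an image. The approach is simple: use a TagNameFilter for the link tag, a HasChildFilter that requires an image child, and an AndFilter to combine the two. The code is as follows: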
package org.algorithm;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterImg {
    public static void main(String[] args) throws ParserException {
        // Fetch and parse the page at this URL.
        Parser parser = new Parser("http://smart.huanqiu.com/roll/2016-08/9351546.html");
        // Accept <a> tags that have an <img> tag as a child.
        NodeFilter filter = new AndFilter(
                new TagNameFilter("a"),
                new HasChildFilter(new TagNameFilter("img")));
        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        // Print the first match, or an empty string if nothing matched.
        String sou = "";
        if (nodes.size() > 0) {
            Node source = nodes.elementAt(0);
            sou = source.toString();
        }
        System.out.println(sou);
    }
}
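One detail about this combination is worth knowing. As far as I can tell from the HtmlParser Javadoc, HasChildFilter only looks at direct children by default, and the two-argument form new HasChildFilter(filter, true) searches all descendants. The sketch below is a library-free model of that behaviour, not HtmlParser's own code. The Tag class and the tagName, hasChild, and collect helpers are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class FilterModel {
    /** Hypothetical stand-in for a parsed tag: a name plus its child tags. */
    public static final class Tag {
        final String name;
        final List<Tag> children;

        public Tag(String name, Tag... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Models TagNameFilter: accepts tags with the given name, ignoring case. */
    public static Predicate<Tag> tagName(String name) {
        return tag -> tag.name.equalsIgnoreCase(name);
    }

    /** Models HasChildFilter: direct children only, or all descendants if recursive. */
    public static Predicate<Tag> hasChild(Predicate<Tag> inner, boolean recursive) {
        return new Predicate<Tag>() {
            @Override
            public boolean test(Tag tag) {
                for (Tag child : tag.children) {
                    if (inner.test(child) || (recursive && this.test(child))) {
                        return true;
                    }
                }
                return false;
            }
        };
    }

    /** Depth-first walk that collects every tag the filter accepts. */
    public static List<Tag> collect(Tag root, Predicate<Tag> filter) {
        List<Tag> matches = new ArrayList<>();
        if (filter.test(root)) {
            matches.add(root);
        }
        for (Tag child : root.children) {
            matches.addAll(collect(child, filter));
        }
        return matches;
    }

    public static void main(String[] args) {
        Tag page = new Tag("body",
                new Tag("a", new Tag("img")),                   // image directly inside the link
                new Tag("a", new Tag("span", new Tag("img"))),  // image one level deeper
                new Tag("a"));                                  // link without an image

        // Predicate.and plays the role of AndFilter here.
        Predicate<Tag> direct = tagName("a").and(hasChild(tagName("img"), false));
        Predicate<Tag> deep = tagName("a").and(hasChild(tagName("img"), true));

        System.out.println(collect(page, direct).size()); // prints 1
        System.out.println(collect(page, deep).size());   // prints 2
    }
}
```

If the page wraps its images in an extra <span> or <div> inside the link, the non-recursive form would miss them, which is a likely cause when the Scenario 1 filter returns fewer links than expected.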
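Scenario 2:
For markup such as <div class="f"> or <li class="m">, how do you extract the content? This is not hard either: again use three filters, TagNameFilter, HasAttributeFilter, and AndFilter. The code below applies the same pattern to <p> tags that carry a title attribute: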
package org.algorithm;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterImg {
    public static void main(String[] args) throws ParserException {
        // Fetch and parse the page at this URL.
        Parser parser = new Parser("http://smart.huanqiu.com/roll/2016-08/9351546.html");
        // Accept <p> tags that carry a title attribute. For <div class="f">, the same
        // pattern would use TagNameFilter("div") with HasAttributeFilter("class", "f").
        NodeFilter filter = new AndFilter(
                new TagNameFilter("p"),
                new HasAttributeFilter("title"));
        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        // Print the first match, or an empty string if nothing matched.
        String sou = "";
        if (nodes.size() > 0) {
            Node source = nodes.elementAt(0);
            sou = source.toString();
        }
        System.out.println(sou);
    }
}
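To cover both <div class="f"> and <li class="m"> in one pass, the two tag-plus-attribute pairs can be joined with an OR. With HtmlParser that would presumably be an OrFilter wrapping two AndFilters, each built from a TagNameFilter and the two-argument HasAttributeFilter(attribute, value). The sketch below models that logic with plain java.util.function.Predicate so it runs without the library. The Element class and the helper methods are hypothetical, and the model compares attribute names exactly, which is a simplification.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class AttributeFilterModel {
    /** Hypothetical stand-in for a start tag: a name, its text, and its attributes. */
    public static final class Element {
        final String name;
        final String text;
        final Map<String, String> attributes = new HashMap<>();

        /** attributePairs is a flat list: name1, value1, name2, value2, ... */
        public Element(String name, String text, String... attributePairs) {
            this.name = name;
            this.text = text;
            for (int i = 0; i + 1 < attributePairs.length; i += 2) {
                attributes.put(attributePairs[i], attributePairs[i + 1]);
            }
        }
    }

    /** Models TagNameFilter: accepts elements with the given name, ignoring case. */
    public static Predicate<Element> tagName(String name) {
        return e -> e.name.equalsIgnoreCase(name);
    }

    /** Models HasAttributeFilter(attribute): the attribute only has to be present. */
    public static Predicate<Element> hasAttribute(String attribute) {
        return e -> e.attributes.containsKey(attribute);
    }

    /** Models HasAttributeFilter(attribute, value): the attribute must equal value. */
    public static Predicate<Element> hasAttribute(String attribute, String value) {
        return e -> value.equals(e.attributes.get(attribute));
    }

    /** Returns the text of every element the filter accepts, in document order. */
    public static List<String> extractText(List<Element> elements, Predicate<Element> filter) {
        return elements.stream()
                .filter(filter)
                .map(e -> e.text)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Element> page = Arrays.asList(
                new Element("div", "headline", "class", "f"),
                new Element("div", "sidebar", "class", "x"),
                new Element("li", "menu item", "class", "m"),
                new Element("p", "caption", "title", "photo"));

        // and() stands in for AndFilter, or() for OrFilter.
        Predicate<Element> divF = tagName("div").and(hasAttribute("class", "f"));
        Predicate<Element> liM = tagName("li").and(hasAttribute("class", "m"));

        System.out.println(extractText(page, divF.or(liM)));
        // prints [headline, menu item]
        System.out.println(extractText(page, tagName("p").and(hasAttribute("title"))));
        // prints [caption]
    }
}
```

Matching on the attribute value rather than on its mere presence keeps unrelated <div> blocks, such as the sidebar in this sample data, out of the result.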