Gecco crawler: multiple HtmlBeans matching the same matchUrl
阿新 • Published: 2019-01-02
The two crawler HtmlBeans are as follows.
The first HtmlBean fetches the novel content:
import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.Html;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.RequestParameter;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

/**
 * Fetches the novel content (one chapter per page).
 */
@Gecco(
    matchUrl = "http://www.xs2345.com/read/18/18914/([^0{1}]|{index}).html",
    pipelines = "xybwPipeline"
)
public class XYBW implements HtmlBean {

    private static final long serialVersionUID = 2833184596055251729L;

    // Chapter index taken from the URL.
    @RequestParameter
    private Long index;

    @Text
    @HtmlField(cssPath = ".read_m > h1:nth-child(2) > a:nth-child(1)")
    private String bookName;

    @Text
    @HtmlField(cssPath = ".ydleft > h2:nth-child(2)")
    private String chapterName;

    @Html
    @HtmlField(cssPath = ".yd_text2")
    private String content;

    public Long getIndex() {
        return index;
    }

    public void setIndex(Long index) {
        this.index = index;
    }

    public String getBookName() {
        return bookName;
    }

    public void setBookName(String bookName) {
        this.bookName = bookName;
    }

    public String getChapterName() {
        return chapterName;
    }

    public void setChapterName(String chapterName) {
        this.chapterName = chapterName;
    }

    public String getContent() {
        return content;
    }

    // Strips spaces and <br> tags, then collapses double newlines.
    public void setContent(String content) {
        if (content != null && !content.isEmpty()) {
            // The two space patterns look identical here; the original
            // presumably targeted different whitespace characters.
            content = content.replaceAll(" ", "");
            content = content.replaceAll(" ", "");
            content = content.replaceAll("<br/>", "");
            content = content.replaceAll("<br>", "");
            content = content.replaceAll("\\n{2}", "\n");
            this.content = content;
        } else {
            this.content = "";
        }
    }
}
The second HtmlBean fetches the novel's table of contents:
import java.util.List;

import com.geccocrawler.gecco.annotation.Gecco;
import com.geccocrawler.gecco.annotation.Href;
import com.geccocrawler.gecco.annotation.HtmlField;
import com.geccocrawler.gecco.annotation.Text;
import com.geccocrawler.gecco.spider.HtmlBean;

@Gecco(
    matchUrl = "http://www.xs2345.com/read/18/18914/0.html",
    pipelines = "xybwIndexPipeline"
)
public class XYBWIndex implements HtmlBean {

    private static final long serialVersionUID = 6065963771104230481L;

    @Text
    @HtmlField(cssPath = ".ml_title > h1:nth-child(1)")
    private String bookName;

    @Text
    @HtmlField(cssPath = ".ml_main > dl > dd > a")
    private List<String> chapterNameList;

    // click=true queues each chapter link for crawling.
    @Href(click = true)
    @HtmlField(cssPath = ".ml_main > dl > dd > a")
    private List<String> chapterList;

    public String getBookName() {
        return bookName;
    }

    public void setBookName(String bookName) {
        this.bookName = bookName;
    }

    public List<String> getChapterNameList() {
        return chapterNameList;
    }

    public void setChapterNameList(List<String> chapterNameList) {
        this.chapterNameList = chapterNameList;
    }

    public List<String> getChapterList() {
        return chapterList;
    }

    public void setChapterList(List<String> chapterList) {
        this.chapterList = chapterList;
    }
}
Note that each bean has a corresponding Pipeline to process it; those are omitted here.
Starting the crawl:
HttpRequest request_xybw = new HttpGetRequest();
request_xybw.setUrl("http://www.xs2345.com/read/18/18914/0.html");
request_xybw.setCharset("gbk");

GeccoEngine.create()
    .classpath("com.xfire")
    .start(request_xybw)
    .thread(1)
    .interval(1000)
    .mobile(false)
    .start();
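Analysis:
The problem first appeared when the two beans were declared like this:
XYBW: matchUrl="http://www.xs2345.com/read/18/18914/{index}.html"
XYBWIndex: matchUrl="http://www.xs2345.com/read/18/18914/0.html"
At run time the start URL http://www.xs2345.com/read/18/18914/0.html was matched first by http://www.xs2345.com/read/18/18914/{index}.html, that is, by the first HtmlBean, and the spider finished right there.
So the HtmlBean that was supposed to fetch the table of contents was never processed.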
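The shadowing can be reproduced outside Gecco with a small sketch. This is not Gecco's matcher: `toRegex`, `firstMatch`, `originalBeans`, and the registration order (XYBW first) are assumptions made for illustration, and `{name}` is simply treated as "one path segment".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class FirstMatchDemo {

    // Hypothetical template translation: literal text is quoted,
    // each "{name}" placeholder becomes "one or more non-slash characters".
    static Pattern toRegex(String template) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < template.length()) {
            int open = template.indexOf('{', i);
            int close = open < 0 ? -1 : template.indexOf('}', open);
            if (open < 0 || close < 0) {
                sb.append(Pattern.quote(template.substring(i)));
                break;
            }
            sb.append(Pattern.quote(template.substring(i, open)));
            sb.append("[^/]+");
            i = close + 1;
        }
        return Pattern.compile(sb.toString());
    }

    // Returns the name of the first registered bean whose template matches.
    static String firstMatch(Map<String, String> beans, String url) {
        for (Map.Entry<String, String> e : beans.entrySet()) {
            if (toRegex(e.getValue()).matcher(url).matches()) {
                return e.getKey();
            }
        }
        return null;
    }

    // The two original matchUrl values, in an assumed registration order.
    static Map<String, String> originalBeans() {
        Map<String, String> beans = new LinkedHashMap<>();
        beans.put("XYBW", "http://www.xs2345.com/read/18/18914/{index}.html");
        beans.put("XYBWIndex", "http://www.xs2345.com/read/18/18914/0.html");
        return beans;
    }

    public static void main(String[] args) {
        String indexUrl = "http://www.xs2345.com/read/18/18914/0.html";
        // Both templates match the index page, but only the first one wins.
        System.out.println(firstMatch(originalBeans(), indexUrl)); // prints "XYBW"
    }
}
```

With first-match-wins lookup, the more general `{index}` template hides the index page bean whenever it happens to be checked first.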
Changing XYBW's matchUrl to the following solved the problem:
matchUrl="http://www.xs2345.com/read/18/18914/([^0{1}]|{index}).html"
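Still, I think a better solution would be to process every matching HtmlBean: change Spider so that, instead of fetching a single match, it fetches an array of all matches. This is the line in Spider that currently fetches a single match:
// match the SpiderBean
currSpiderBeanClass = engine.getSpiderBeanFactory().matchSpider(request);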
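A rough sketch of the "collect every match" idea, again outside Gecco. `MatchAllDemo`, `BEANS`, and `matchAll` are hypothetical names, and the hand-written regular expressions only stand in for the two original matchUrl values. How Spider would then render and dispatch each matched bean is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class MatchAllDemo {

    // Hypothetical registry: bean name -> URL pattern.
    static final Map<String, Pattern> BEANS = new LinkedHashMap<>();
    static {
        BEANS.put("XYBW",
            Pattern.compile("http://www\\.xs2345\\.com/read/18/18914/[^/]+\\.html"));
        BEANS.put("XYBWIndex",
            Pattern.compile("http://www\\.xs2345\\.com/read/18/18914/0\\.html"));
    }

    // Returns every bean whose pattern matches, instead of stopping at the first.
    static List<String> matchAll(String url) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : BEANS.entrySet()) {
            if (e.getValue().matcher(url).matches()) {
                matched.add(e.getKey());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Index page: both beans match, so both would be processed.
        System.out.println(matchAll("http://www.xs2345.com/read/18/18914/0.html"));
        // Chapter page: only the content bean matches.
        System.out.println(matchAll("http://www.xs2345.com/read/18/18914/5.html"));
    }
}
```

With this approach the original `{index}` matchUrl could stay unchanged, at the cost of the content bean also being run against the index page.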