
An Introduction to Abot, an Open-Source .NET Web Crawler

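There are many open-source crawler tools for .NET, and Abot is one of them. Abot is an open-source .NET crawler that is fast, easy to use, and easy to extend. The project is hosted at https://code.google.com/p/abot/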

Crawled HTML is parsed with CsQuery. CsQuery is essentially a .NET implementation of jQuery, so you can process HTML pages with methods similar to jQuery's. The CsQuery project is hosted at https://github.com/afeiship/CsQuery

I. Configuring the Abot crawler

1. Configuring through properties

First create a config object, then set its properties:

CrawlConfiguration crawlConfig = new CrawlConfiguration();
crawlConfig.CrawlTimeoutSeconds = 100;
crawlConfig.MaxConcurrentThreads = 10;
crawlConfig.MaxPagesToCrawl = 1000;
crawlConfig.UserAgentString = "abot v1.0 http://code.google.com/p/abot";
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue1", "1111");
crawlConfig.ConfigurationExtensions.Add("SomeCustomConfigValue2", "2222");

2. Configuring through App.config

Load the configuration directly from the config file. You can still override individual properties afterwards:

CrawlConfiguration crawlConfig = AbotConfigurationSectionHandler.LoadFromXml().Convert(); 
crawlConfig.CrawlTimeoutSeconds = 100; 
crawlConfig.MaxConcurrentThreads = 10;
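
For reference, here is a minimal sketch of what the matching App.config section might look like. The section and attribute names below follow my recollection of the Abot 1.x README and are assumptions to verify against the Abot version you actually use. The values simply mirror the property-based example earlier in this article.

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <configSections>
    <!-- Assumed handler registration; check the type name against your Abot assembly -->
    <section name="abot" type="Abot.Core.AbotConfigurationSectionHandler, Abot"/>
  </configSections>
  <abot>
    <!-- Assumed attribute names, mirroring the CrawlConfiguration properties -->
    <crawlBehavior
      maxConcurrentThreads="10"
      maxPagesToCrawl="1000"
      crawlTimeoutSeconds="100"
      userAgentString="abot v1.0 http://code.google.com/p/abot"
      />
    <!-- Custom values, corresponding to ConfigurationExtensions in code -->
    <extensionValues>
      <add key="SomeCustomConfigValue1" value="1111"/>
      <add key="SomeCustomConfigValue2" value="2222"/>
    </extensionValues>
  </abot>
</configuration>
```

Some Abot versions may require additional attributes or sections (for example a politeness section), so treat this as a starting point rather than a complete file.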

3. Applying the configuration to the crawler object

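// Option 1: parameterless constructor (reads its settings from App.config)
PoliteWebCrawler crawler = new PoliteWebCrawler();

// Option 2 (use instead of Option 1): pass the config object explicitly;
// the null arguments leave the remaining dependencies at their defaults
PoliteWebCrawler crawler = new PoliteWebCrawler(crawlConfig, null, null, null, null, null, null, null);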

II. Using the crawler: registering events

The crawler exposes four main events: page crawl starting, page crawl completed, page crawl disallowed, and page links crawl disallowed.

Sample code:

crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting; // a single page is about to be crawled
crawler.PageCrawlCompletedAsync += crawler_ProcessPageCrawlCompleted; // a single page has been crawled
crawler.PageCrawlDisallowedAsync += crawler_PageCrawlDisallowed; // the page was not allowed to be crawled
crawler.PageLinksCrawlDisallowedAsync += crawler_PageLinksCrawlDisallowed; // the links on the page were not allowed to be crawled

void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
        PageToCrawl pageToCrawl = e.PageToCrawl;
        Console.WriteLine("About to crawl link {0} which was found on page {1}", pageToCrawl.Uri.AbsoluteUri, pageToCrawl.ParentUri.AbsoluteUri);
}

void crawler_ProcessPageCrawlCompleted(object sender, PageCrawlCompletedArgs e)
{
        CrawledPage crawledPage = e.CrawledPage;
        if (crawledPage.WebException != null || crawledPage.HttpWebResponse.StatusCode != HttpStatusCode.OK)
                Console.WriteLine("Crawl of page failed {0}", crawledPage.Uri.AbsoluteUri);
        else
                Console.WriteLine("Crawl of page succeeded {0}", crawledPage.Uri.AbsoluteUri);
        if (string.IsNullOrEmpty(crawledPage.Content.Text))
                Console.WriteLine("Page had no content {0}", crawledPage.Uri.AbsoluteUri);

}

void crawler_PageLinksCrawlDisallowed(object sender, PageLinksCrawlDisallowedArgs e)
{
        CrawledPage crawledPage = e.CrawledPage;
        Console.WriteLine("Did not crawl the links on page {0} due to {1}", crawledPage.Uri.AbsoluteUri, e.DisallowedReason);
}

void crawler_PageCrawlDisallowed(object sender, PageCrawlDisallowedArgs e)
{
        PageToCrawl pageToCrawl = e.PageToCrawl;
        Console.WriteLine("Did not crawl page {0} due to {1}", pageToCrawl.Uri.AbsoluteUri, e.DisallowedReason);
}

III. Attaching custom objects to the crawler

Abot appears to have borrowed the idea of ViewBag from ASP.NET MVC: it provides a crawler-level CrawlBag and a page-level PageBag.

PoliteWebCrawler crawler = new PoliteWebCrawler();
crawler.CrawlBag.MyFoo1 = new Foo(); // crawler-level CrawlBag
crawler.CrawlBag.MyFoo2 = new Foo();
crawler.PageCrawlStartingAsync += crawler_ProcessPageCrawlStarting;
...
void crawler_ProcessPageCrawlStarting(object sender, PageCrawlStartingArgs e)
{
        // Retrieve the objects stored in the CrawlBag
        CrawlContext context = e.CrawlContext;
        context.CrawlBag.MyFoo1.Bar(); // use the CrawlBag
        context.CrawlBag.MyFoo2.Bar();

        // Use the page-level PageBag
        e.PageToCrawl.PageBag.Bar = new Bar();
}

IV. Starting the crawler

Starting the crawler is simple: call the Crawl method and pass it the start page.
CrawlResult result = crawler.Crawl(new Uri("http://localhost:1111/"));

if (result.ErrorOccurred)
        Console.WriteLine("Crawl of {0} completed with error: {1}", result.RootUri.AbsoluteUri, result.ErrorException.Message);
else
        Console.WriteLine("Crawl of {0} completed without error.", result.RootUri.AbsoluteUri);

V. Introducing CsQuery

In the PageCrawlCompletedAsync event handler, e.CrawledPage.CsQueryDocument is a CsQuery object.

Here is a brief look at what makes CsQuery convenient for parsing HTML:

cqDocument.Select(".bigtitle > h1")
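The selector syntax is the same as jQuery's. This one selects the h1 elements that are direct children of elements with the class bigtitle. If you are comfortable with jQuery, picking up CsQuery will be quick and easy.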
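
To try the selector outside of a crawl, the following is a minimal sketch that parses an HTML string with CsQuery directly. It assumes the CsQuery NuGet package is referenced. The TitleExtractor class, the ExtractTitle method, and the sample HTML are hypothetical names invented for illustration, not part of Abot or CsQuery.

```csharp
using System;
using CsQuery;

public static class TitleExtractor
{
    // Hypothetical helper: returns the text of every <h1> that is a
    // direct child of an element with the class "bigtitle".
    public static string ExtractTitle(string html)
    {
        CQ dom = CQ.Create(html);                    // parse the HTML string
        CQ headings = dom.Select(".bigtitle > h1");  // jQuery-style selector
        return headings.Text();                      // combined text of the matches
    }

    public static void Main()
    {
        // Sample markup made up for this sketch
        string html = "<div class=\"bigtitle\"><h1>Hello Abot</h1></div>";
        Console.WriteLine(ExtractTitle(html));
    }
}
```

Inside a PageCrawlCompletedAsync handler, the same Select call can be made on e.CrawledPage.CsQueryDocument instead of parsing a string yourself.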