
Scrapy's LinkExtractor for Python web crawling


Background:

  When we scrape a site, we usually crawl certain content under each section, and a site's home page typically links out to detail pages for many items or pieces of information. Extracting only the content under one big section is inefficient. Most sites follow a fixed pattern (that is, a fixed template used to present the various pieces of information to users), which makes LinkExtractor a very good fit for whole-site crawling. Why? Because through xpath, css, and a series of other parameter settings, you can collect all the links you want from the entire site, rather than just the links under some fixed tag.

import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ["http://www.gaosiedu.com/gsschool/"]

    def parse(self, response):
        link = LinkExtractor(restrict_xpaths='//ul[@class="cont_xiaoqu"]/li')
        links = link.extract_links(response)
        print(links)

links is a list.


Let's iterate over this list:

for link in links:
    print(link)

links contains the URLs we want to extract, so how do we actually get a URL out of it?


Inside the for loop, link.url and link.text give us the URL and the link text we want:

for link in links:
    print(link.url, link.text)


And that's not all: LinkExtractor offers more than the xpath-based extraction; it accepts many other parameters.

>allow: accepts a regular expression or a list of regular expressions, and extracts links whose absolute URL matches. If this parameter is empty, all links are extracted by default.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ["http://www.gaosiedu.com/gsschool/"]

    def parse(self, response):
        pattern = r'/gsschool/.+\.shtml'
        link = LinkExtractor(allow=pattern)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>deny: accepts a regular expression or a list of regular expressions; the opposite of allow: links whose absolute URL matches are excluded. In other words, anything that matches the regular expression is not extracted.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ["http://www.gaosiedu.com/gsschool/"]

    def parse(self, response):
        pattern = r'/gsschool/.+\.shtml'
        link = LinkExtractor(deny=pattern)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>allow_domains: accepts a domain or a list of domains; only links pointing to the specified domains are extracted.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ["http://www.gaosiedu.com/gsschool/"]

    def parse(self, response):
        domain = ['gaosivip.com', 'gaosiedu.com']
        link = LinkExtractor(allow_domains=domain)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)

>deny_domains: the opposite of allow_domains: rejects a domain or a list of domains; all matching URLs are extracted except those pointing to the denied domains.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ["http://www.gaosiedu.com/gsschool/"]

    def parse(self, response):
        domain = ['gaosivip.com', 'gaosiedu.com']
        link = LinkExtractor(deny_domains=domain)
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)


>restrict_xpaths: what we used in the very first example; accepts an XPath expression or a list of XPath expressions, and extracts the links inside the region the expression selects.

>restrict_css: this parameter is used as often as restrict_xpaths, so it's worth mastering both; personally I prefer xpath.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ["http://www.gaosiedu.com/gsschool/"]

    def parse(self, response):
        link = LinkExtractor(restrict_css='ul.cont_xiaoqu > li')
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)


>tags: accepts a tag (string) or a list of tags, and extracts links inside the specified tags; the default is tags=('a', 'area').

>attrs: accepts an attribute (string) or a list of attributes, and extracts links from the specified attributes; the default is attrs=('href',). With this extraction method, the matching attributes of those tags on the page are collected; in the example below, the href attribute value of every a tag on the page is extracted.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor

class WeidsSpider(scrapy.Spider):
    name = "weids"
    allowed_domains = ["wds.modian.com"]
    start_urls = ["http://www.gaosiedu.com/gsschool/"]

    def parse(self, response):
        link = LinkExtractor(tags='a', attrs='href')
        links = link.extract_links(response)
        print(type(links))
        for link in links:
            print(link)
