
[Scrapy] Shell Usage & Selector Methods


Scrapy Shell


For example, entering the following command drops the shell into an interactive Python (or IPython) session:

scrapy shell "http://www.itcast.cn/channel/teacher.shtml"

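Note that the URL must be wrapped in double quotes; single quotes cause an error. (This applies to the Windows command prompt, where single quotes are not treated as quoting characters; Unix-like shells accept either.)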

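The shell then lists the objects it currently holds so that they can be inspected. These are exactly the same data structures we work with when writing a .py spider script, so their usual methods can be called directly.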

Scrapy Selectors

A response can be queried with either XPath or CSS expressions, as shown below:

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

Both approaches return node-type data (Selector objects), so either result can be passed through .extract() or .extract_first() to pull out the data part.


The extraction examples below use the following source:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

Extracting a tag attribute:

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

Filtering the tags on the target path: contains(@href, "image") requires the href attribute to contain the string image. The CSS selector a[href*=image] does the same thing:

response.xpath('//a[contains(@href, "image")]/@href').extract()
Out[1]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

response.xpath('//a[contains(@href, "image1")]/@href').extract()
Out[2]: ['image1.html']

response.css('a[href*=image]::attr(href)').extract()
Out[3]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

response.css('a[href*=image2]::attr(href)').extract()
Out[4]: ['image2.html']

Combining the two (filter the links, then extract an attribute of a child tag):

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

Selectors also have built-in regular-expression methods, re and re_first:

response.xpath('//a[contains(@href, "image")]/text()')
Out[8]:
[<Selector xpath='//a[contains(@href, "image")]/text()' data='Name: My image 1 '>,
<Selector xpath='//a[contains(@href, "image")]/text()' data='Name: My image 2 '>,
<Selector xpath='//a[contains(@href, "image")]/text()' data='Name: My image 3 '>,
<Selector xpath='//a[contains(@href, "image")]/text()' data='Name: My image 4 '>,
<Selector xpath='//a[contains(@href, "image")]/text()' data='Name: My image 5 '>]


response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
Out[7]: ['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
Out[9]: 'My image 1 '
