Commonly Used Python Crawler Frameworks

阿新 • Published: 2018-12-05

https://www.imooc.com/article/44716

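Handy crawler frameworks in Python

For fairly small crawling jobs, I just use the requests library plus bs4; when things get a bit trickier, I bring in selenium to handle content that JavaScript loads asynchronously. I only reach for a framework on relatively large jobs, mainly because it makes the project easier to manage and extend.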
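As a rough illustration of the requests + bs4 approach for small jobs, the sketch below parses a hard-coded HTML snippet with bs4 and pulls out its links. The sample HTML and the `extract_links` helper are made up for this example; in a real script the HTML would normally come from something like `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Made-up page content; a real script would download this with requests.
SAMPLE_HTML = """
<html><body>
  <h1>Example list</h1>
  <ul>
    <li><a href="/a">First</a></li>
    <li><a href="/b">Second</a></li>
  </ul>
</body></html>
"""


def extract_links(html):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [(a.get_text(strip=True), a["href"])
            for a in soup.find_all("a", href=True)]


links = extract_links(SAMPLE_HTML)
print(links)
```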

1.Scrapy

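Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler.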

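Features:

Built-in support for selecting and extracting data from HTML and XML sources
Built-in support for cleaning and processing scraped data through a set of reusable filters shared between spiders (called Item Loaders)
Built-in support for exporting feeds in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)
A media pipeline that automatically downloads images (or other resources) referenced by the scraped data
Highly extensible: you can plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines)
Built-in middlewares and extensions that handle:
- cookies and session handling
- HTTP compression
- HTTP authentication
- HTTP cache
- user-agent spoofing
- robots.txt
- crawl depth restriction
- and more
Automatic detection and robust handling of non-standard or broken encoding declarations, which are common on non-English sites
Support for generating spiders from templates, which speeds up spider creation and keeps code more consistent in large projects; see the genspider command for details
An extensible stats collection tool for performance evaluation and failure detection across multiple spiders
An interactive shell console for trying out XPath expressions, which is very convenient when writing and debugging spiders
A System service that simplifies deployment and running in production
A built-in Web service for monitoring and controlling your bot
A built-in Telnet console that hooks a Python console into the Scrapy process, so you can inspect and debug your crawler
Logging that makes it easy to catch errors during the crawl
Support for crawling Sitemaps
A DNS resolver with caching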
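Several of the built-in features listed above are switched on or tuned through a project's settings.py. The fragment below is only a sketch: the setting names are Scrapy's, but the values (depth limit, user-agent string, output file name) are arbitrary examples, and `FEEDS` assumes a reasonably recent Scrapy release (older versions used `FEED_URI` and `FEED_FORMAT` instead).

```python
# settings.py (fragment): example values only, adjust for your project

# Respect the robots.txt rules of the sites being crawled
ROBOTSTXT_OBEY = True

# Stop following links more than 3 hops away from the start URLs
DEPTH_LIMIT = 3

# Keep cookies between requests (session handling)
COOKIES_ENABLED = True

# Cache HTTP responses on disk to avoid re-downloading while developing
HTTPCACHE_ENABLED = True

# Identify the crawler with a custom user-agent string (hypothetical value)
USER_AGENT = "tutorial-bot (+http://www.example.com)"

# Export scraped items as JSON to a local file (hypothetical file name)
FEEDS = {
    "items.json": {"format": "json"},
}
```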

Quick start

Installation

pip install scrapy

Create a project

scrapy startproject tutorial

ls 
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

Write a spider

import scrapy


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Name the output file after the last path segment of the URL,
        # then save the raw response body to it.
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
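To see what file names the spider's `parse` method ends up using, the small sketch below applies the same `split("/")[-2]` expression to the two start URLs without running Scrapy. The `filename_for` helper is introduced here only for illustration.

```python
def filename_for(url):
    # Same expression as in parse(): because the URL ends with "/", the
    # last element of the split is "", so [-2] is the final path segment.
    return url.split("/")[-2]


urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]
names = [filename_for(u) for u in urls]
print(names)  # ['Books', 'Resources']
```

Note that a URL without the trailing slash would yield the parent segment instead, so this naming scheme relies on the shape of these particular URLs.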

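Run

scrapy crawl dmoz

This is only a brief introduction; when I have time I will write some more detailed articles about scrapy. Much of the data my crawlers collect is gathered with spiders built on top of scrapy.

Project page: https://scrapy.org/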

2.PySpider

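PySpider: a powerful web crawler system written by a Chinese developer, with a powerful WebUI. It is written in Python, has a distributed architecture, and supports multiple database backends; the WebUI provides a script editor, task monitor, project manager, and result viewer.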

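Scripts are written in Python, and you can use any HTML parsing package you like (pyquery is built in)
A web interface for writing and debugging scripts, starting and stopping them, monitoring their status, viewing activity history, and fetching results
Data storage supports MySQL, MongoDB, Redis, SQLite, and Elasticsearch, as well as PostgreSQL with SQLAlchemy
Message queues support RabbitMQ, Beanstalk, Redis, and Kombu
Supports crawling JavaScript pages
Components are replaceable; supports single-machine and distributed deployment, as well as Docker deployment
Powerful scheduling control, with support for re-crawling on timeout and priority settings
Supports Python 2 and 3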

Example

Just open the editor in the web interface and enter the code

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
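The two decorators in the handler above carry time values that are easy to misread: `@every(minutes=24 * 60)` re-runs `on_start` once a day, and `@config(age=10 * 24 * 60 * 60)` treats a crawled index page as still valid for ten days (the value is in seconds). The sketch below just spells out that arithmetic; `needs_recrawl` is a hypothetical helper that mirrors the idea of `age`, not pyspider's actual scheduler code.

```python
# Values taken from the decorators in the handler
EVERY_MINUTES = 24 * 60          # @every(minutes=24 * 60)
AGE_SECONDS = 10 * 24 * 60 * 60  # @config(age=10 * 24 * 60 * 60)


def needs_recrawl(last_crawled, now, age=AGE_SECONDS):
    """Hypothetical helper: True once a stored result is older than `age`.

    Both timestamps are in seconds since the epoch.
    """
    return (now - last_crawled) > age


print(EVERY_MINUTES / 60)             # on_start interval, in hours
print(AGE_SECONDS / (24 * 60 * 60))   # index_page validity, in days
```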

Project page: https://github.com/binux/pyspider

3.Crawley

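Crawley crawls website content at high speed, supports both relational and non-relational databases, and can export data as JSON, XML, and other formats.

Create a project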

~$ crawley startproject [project_name]
~$ cd [project_name]

Define models

""" models.py """

from crawley.persistance import Entity, UrlEntity, Field, Unicode


class Package(Entity):

    # add your table fields here
    updated = Field(Unicode(255))
    package = Field(Unicode(255))
    description = Field(Unicode(255))

Write the crawler logic

""" crawlers.py """

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *


class pypiScraper(BaseScraper):

    # specify the urls that can be scraped by this class
    matching_urls = ["%"]

    def scrape(self, response):

        # getting the current document's url.
        current_url = response.url
        # getting the html table.
        table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]

        # for rows 1 to n-1
        for tr in table[1:-1]:
            # obtaining the searched html inside the rows
            td_updated = tr[0]
            td_package = tr[1]
            package_link = td_package[0]
            td_description = tr[2]

            # storing data in Packages table
            Package(updated=td_updated.text, package=package_link.text, description=td_description.text)


class pypiCrawler(BaseCrawler):

    # add your starting urls here
    start_urls = ["http://pypi.python.org/pypi"]

    # add your scraper classes here
    scrapers = [pypiScraper]

    # specify your maximum crawling depth level
    max_depth = 0

    # select your favourite HTML parsing tool
    extractor = XPathExtractor
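The row handling inside `scrape` (slice off the first and last rows, then index into the cells) can be tried out on its own. The sketch below uses the standard library's ElementTree on a made-up table rather than Crawley's response object; the sample markup, including the assumption that the first row is a header and the last row a footer, is purely illustrative.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the table that the XPath in the scraper selects
SAMPLE_TABLE = """
<table>
  <tr><th>Updated</th><th>Package</th><th>Description</th></tr>
  <tr><td>2018-12-01</td><td><a href="/pypi/foo">foo</a></td><td>Foo tools</td></tr>
  <tr><td>2018-12-02</td><td><a href="/pypi/bar">bar</a></td><td>Bar tools</td></tr>
  <tr><td></td><td></td><td>footer</td></tr>
</table>
""".strip()

table = ET.fromstring(SAMPLE_TABLE)

rows = []
for tr in table[1:-1]:  # skip the first and the last row
    td_updated, td_package, td_description = tr[0], tr[1], tr[2]
    package_link = td_package[0]  # the <a> element inside the package cell
    rows.append({
        "updated": td_updated.text,
        "package": package_link.text,
        "description": td_description.text,
    })

print(rows)
```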

Configuration

""" settings.py """

import os

PATH = os.path.dirname(os.path.abspath(__file__))

# Don't change this if you haven't renamed the project
PROJECT_NAME = "pypi"
PROJECT_ROOT = os.path.join(PATH, PROJECT_NAME)

DATABASE_ENGINE = 'sqlite'
DATABASE_NAME = 'pypi'
DATABASE_USER = ''
DATABASE_PASSWORD = ''
DATABASE_HOST = ''
DATABASE_PORT = ''

SHOW_DEBUG_INFO = True

Run

~$ crawley run

Project page: http://project.crawley-cloud.com/

4.Portia

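Portia is an open-source visual scraping tool that lets you scrape websites without any programming knowledge. Simply annotate the pages you are interested in, and Portia will create a spider to extract data from similar pages.
It is extremely easy to use; take a look at the documentation: http://portia.readthedocs.io/en/latest/index.html

Built on the scrapy core
Visual scraping, with no development expertise required
Dynamically matches content that shares the same template

Project page: https://github.com/scrapinghub/portia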

5.Newspaper

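Newspaper extracts news and articles and performs content analysis. It uses multithreading and supports more than 10 languages. The author was inspired by the simplicity and power of the requests library and wrote it in Python as a program for extracting article content.
Everything is handled as Unicode across all of the supported languages.

Example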

>>> from newspaper import Article
>>> url = 'http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/'
>>> article = Article(url)
>>> article.download()
>>> article.html
'<!DOCTYPE HTML><html itemscope itemtype="http://...'
>>> article.parse()
>>> article.authors
['Leigh Ann Caldwell', 'John Honway']
>>> article.publish_date
datetime.datetime(2013, 12, 30, 0, 0)
>>> article.text
'Washington (CNN) -- Not everyone subscribes to a New Year's resolution...'

>>> article.top_image
'http://someCDN.com/blah/blah/blah/file.png'

>>> article.movies
['http://youtube.com/path/to/link.com', ...]
>>> article.nlp()

>>> article.keywords
['New Years', 'resolution', ...]

>>> article.summary
'The study shows that 93% of people ...'

Project page: https://github.com/codelucas/newspaper

6.Beautiful Soup

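Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the document, and it can save you hours or even days of work. I use it all the time; whenever I need to pull elements out of HTML, bs4 does the job.

Example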

# -*- coding: utf-8 -*-
import urllib.request
from urllib.parse import urljoin

import scrapy
from bs4 import BeautifulSoup

DOMAIN = 'http://flagpedia.asia'


class FlagSpider(scrapy.Spider):
    name = 'flag'
    allowed_domains = ['flagpedia.asia', 'flags.fmcdn.net']
    start_urls = ['http://flagpedia.asia/index']

    def parse(self, response):
        html_doc = response.body
        soup = BeautifulSoup(html_doc, 'html.parser')

        # each flag cell links to that flag's detail page
        a = soup.find_all('td', class_="td-flag")
        for i in a:
            url = i.a.attrs.get("href")
            full_url = urljoin(DOMAIN, url)
            yield scrapy.Request(full_url, callback=self.parse_news)

    def parse_news(self, response):
        html_doc = response.body
        soup = BeautifulSoup(html_doc, 'html.parser')
        p = soup.find("p", id="flag-detail")
        # keep the URL that precedes the " 2x" descriptor in srcset
        img_url = p.img.attrs.get("srcset").split(" 2x")[0]
        url = "http:" + img_url
        img_name = img_url.split("/")[-1]

        urllib.request.urlretrieve(url, "/Users/youdi/Project/python/Rino_nakasone_backend/RinoNakasone/flag/{}".format(img_name))
        print(url)
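The URL handling in this spider can be checked without crawling anything. In the sketch below, the `href` and `srcset` values are hypothetical examples of what the spider appears to expect (a site-relative link and a protocol-relative image URL followed by a `2x` descriptor); only the string operations mirror the code above.

```python
from urllib.parse import urljoin

DOMAIN = "http://flagpedia.asia"

# Hypothetical attribute values of the kind the spider expects
href = "/japan"
srcset = "//flags.fmcdn.net/data/flags/h160/jp.png 2x"

# parse(): turn the relative link into an absolute URL
full_url = urljoin(DOMAIN, href)

# parse_news(): drop the " 2x" descriptor, then prepend a scheme
img_url = srcset.split(" 2x")[0]
url = "http:" + img_url
img_name = img_url.split("/")[-1]

# urljoin can also resolve the protocol-relative URL directly
same_url = urljoin(DOMAIN, img_url)

print(full_url)
print(url)
print(img_name)
```

Using `urljoin` for the image URL as well would avoid hard-coding the `http:` prefix, though the spider as written works for this input.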

Project page: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

7.Grab

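Grab is a Python framework for building web scrapers. With Grab you can build scrapers of varying complexity, from simple 5-line scripts to complex asynchronous website crawlers that process millions of pages. Grab provides an API for performing network requests and processing the received content, for example interacting with the DOM tree of an HTML document.

Project page: http://docs.grablib.org/en/latest/#grab-spider-user-manual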

8.Cola

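Cola is a distributed crawler framework. Users only need to write a few specific functions and do not have to care about the details of distributed execution. Tasks are automatically distributed across multiple machines, and the whole process is transparent to the user.

Project page: https://github.com/chineking/cola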

9.selenium

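Selenium is an automated testing tool. It supports all kinds of browsers, including mainstream GUI browsers such as Chrome, Safari, and Firefox; with a Selenium plugin installed in these browsers, web interfaces can be tested conveniently. Selenium works through browser drivers and has bindings for multiple languages, such as Java, C#, and Ruby. In a crawling setup, PhantomJS renders and parses the JS, Selenium drives the browser and connects it to Python, and Python does the post-processing.

Example: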

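from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

browser = webdriver.Firefox()

browser.get('http://www.yahoo.com')
assert 'Yahoo' in browser.title

# find_element_by_name was removed in recent Selenium 4 releases;
# find_element(By.NAME, ...) is the current equivalent.
elem = browser.find_element(By.NAME, 'p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)

browser.quit()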

Project page: http://seleniumhq.github.io/selenium/docs/api/py/

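10.Python-goose

Python-goose can extract the following information:

The main body of the article
The main image of the article
Any YouTube/Vimeo videos embedded in the article
Meta description
Meta tags

Usage example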

>>> from goose import Goose
>>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'
>>> g = Goose()
>>> article = g.extract(url=url)
>>> article.title
u'Occupy London loses eviction fight'
>>> article.meta_description
"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal."
>>> article.cleaned_text[:150]
(CNN) -- Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi
>>> article.top_image.src
http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg

Project page: https://github.com/grangier/python-goose

Author: 若與
Link: https://www.jianshu.com/p/59ead750099a

Author: 絕地無雙
Link: https://www.imooc.com/article/44716
Source: 慕課網 (imooc)