day26 - Advanced Web Scraping

阿新 • Published: 2018-11-28

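5. Sending requests in code: crawling every page
    Example 4: crawl all pages (project choutiAll), sending the follow-up requests manually. start_urls = ['https://dig.chouti.com/r/pic/hot/1']
    Goal: parse every link under the Chouti picture section.
    # A URL template shared by all pages (pageNum is the page number)
    url = 'https://dig.chouti.com/r/pic/hot/%d'
    The key point is re-invoking parse through yield scrapy.Request(url=url,callback=self.parse)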

# -*- coding: utf-8 -*-
import scrapy

from choutiAllPro.items import ChoutiallproItem


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    # allowed_domains = ['www.ddd.com']
    start_urls = ['https://dig.chouti.com/r/pic/hot/1']

    # URL template shared by all pages (pageNum is the page number)
    url = 'https://dig.chouti.com/r/pic/hot/%d'
    pageNum = 1

    def parse(self, response):
        div_list = response.xpath('//div[@class="content-list"]/div')
        for div in div_list:
            title = div.xpath('./div[3]/div[1]/a/text()').extract_first()
            item = ChoutiallproItem()
            item['title'] = title

            yield item

        # Request the URLs of the remaining pages
        if self.pageNum < 120:  # assume there are only 120 pages
            self.pageNum += 1
            url = self.url % self.pageNum
            # print(url)
            # Send the request manually. With yield, one request is sent per page
            # and parse is called again for each response; without yield, only
            # the start URL is requested.
            yield scrapy.Request(url=url, callback=self.parse)

chouti.py
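
The paging logic can be checked without a network connection. The sketch below is a plain-Python stand-in, not Scrapy code: page_urls is a hypothetical helper that lists the URLs the spider above would end up requesting, assuming (as the comment in the spider does) that the last page is 120.

```python
URL_TEMPLATE = 'https://dig.chouti.com/r/pic/hot/%d'


def page_urls(template, last_page):
    """Yield the URL of every page from 1 to last_page inclusive."""
    page = 1
    while page <= last_page:
        yield template % page
        page += 1


if __name__ == '__main__':
    urls = list(page_urls(URL_TEMPLATE, 120))
    print(len(urls), urls[0], urls[-1])
```

In the real spider the same sequence comes from parse yielding one new Request per response, so the pages are requested one after another rather than generated up front.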


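    //text() returns the text of every descendant node; /text() returns only the element's own direct text.
    Scrapy handles the cookies of GET requests automatically.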
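
The difference between /text() and //text() can be tried out with the standard library alone. This is only an approximation: ElementTree does not support text() in its XPath subset, so the sketch rebuilds the two results by hand from .text, .tail and itertext(); the sample markup is made up.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring('<div>hello <b>bold</b> world</div>')

# Equivalent of ./text(): text nodes that are direct children of <div>
direct_text = [div.text] + [child.tail for child in div]
direct_text = [t for t in direct_text if t]

# Equivalent of .//text(): text nodes of <div> and all of its descendants
all_text = list(div.itertext())

print(direct_text)
print(all_text)
```

With a Scrapy selector the same two lists would come from div.xpath('./text()').extract() and div.xpath('.//text()').extract().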
    
    Example 5: Baidu Translate, sending a POST request and handling cookies (project postPro)
    Override the parent-class method:
    def start_requests(self):
        for url in self.start_urls:
            # FormRequest sends a POST request
            yield scrapy.FormRequest(url=url,callback=self.parse,formdata={'kw':'dog'})

# -*- coding: utf-8 -*-
import scrapy

# Goal: send a POST request to each URL in start_urls
class PostSpider(scrapy.Spider):
    name = 'post'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    # A method of the Spider parent class: sends a request for each URL in start_urls in turn
    def start_requests(self):
        for url in self.start_urls:
            # yield scrapy.Request(url=url, callback=self.parse)  # sends a GET request by default
            # FormRequest sends a POST request
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata={'kw': 'dog'})  # formdata carries the POST parameters

    def parse(self, response):
        print(response.text)  # the result is a JSON string

post.py
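
To see what formdata turns into on the wire, the sketch below builds the equivalent form-encoded POST with the standard library and inspects it without sending anything. build_post is a hypothetical helper for illustration, not part of Scrapy.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post(url, formdata):
    """Build (but do not send) a form-encoded POST request."""
    body = urlencode(formdata).encode('utf-8')
    headers = {'Content-Type': 'application/x-www-form-urlencoded'}
    return Request(url, data=body, headers=headers, method='POST')


req = build_post('https://fanyi.baidu.com/sug', {'kw': 'dog'})
print(req.get_method(), req.data)
```

FormRequest does this encoding for you, which is why the spider only has to pass the dictionary.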

      
    Example 6: logging in (Douban Movies) with a POST request (project loginPro)
    A successful login yields the session cookie.

# -*- coding: utf-8 -*-
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://accounts.douban.com/login']
    
    def start_requests(self):
        data = {
            'source':    'movie',
            'redir':    'https://movie.douban.com/',
            'form_email':    '15027900535',
            'form_password':    '[email protected]',
            'login':    '登入',
        }
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url,callback=self.parse,formdata=data)
    
    def getPageText(self,response):
        page_text = response.text
        with open('./douban.html','w',encoding='utf-8') as fp:
            fp.write(page_text)
            print('over')
    
    def parse(self, response):
        # Fetch the current user's profile page (if it shows user info, the cookie was sent; otherwise the login page comes back)
        url = 'https://www.douban.com/people/185687620/'
        yield scrapy.Request(url=url,callback=self.getPageText)


 
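6. Scrapy core components: the five core components
    Summary of the flow:
    The engine calls start_requests in the spider file, which wraps each URL in the list into a request object (both those from start_urls and those yielded later), producing a series of request objects. The engine hands them to the scheduler, which removes duplicates and puts them in its queue.
The scheduler dispatches request objects to the downloader, which fetches the pages from the internet. Once a page is downloaded, the downloader passes the response back to the spider file.
The spider file parses it (calling parse), packs the parsed data into item objects, and submits them to the pipeline, which persists them.
    Note: the scheduler holds a queue and deduplicates request objects.
    1. Engine: drives the calls between all the other components
    2. Scheduler: receives requests from the engine, pushes them onto a queue, and removes duplicate URLs
    3. Downloader: downloads page content and returns it to the spider (i.e. the spider file)
    4. Spider file (spiders): does the real work, parsing the fetched page data
    5. Pipeline: persists the data
    The internet: where the downloader fetches pages from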
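
The flow above can be imitated in a few lines of plain Python. This is a toy model under loud assumptions, not how Scrapy is implemented: FAKE_WEB, Scheduler, download, parse and run are all made-up names, the "web" is a dictionary, and everything runs synchronously.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page title, links found on the page)
FAKE_WEB = {
    'page/1': ('first', ['page/2', 'page/1']),
    'page/2': ('second', ['page/1']),
}


class Scheduler:
    """Queues requests and drops URLs it has already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)


def download(url):
    """Downloader: returns the 'response' for a URL."""
    return FAKE_WEB[url]


def parse(response):
    """Spider: yields one item, then the follow-up URLs."""
    title, links = response
    yield {'title': title}
    for link in links:
        yield link


def run(start_urls):
    """Engine: moves data between scheduler, downloader, spider, pipeline."""
    scheduler, stored = Scheduler(), []
    for url in start_urls:
        scheduler.enqueue(url)
    while scheduler.queue:
        response = download(scheduler.queue.popleft())
        for result in parse(response):
            if isinstance(result, dict):
                stored.append(result)      # pipeline: persist the item
            else:
                scheduler.enqueue(result)  # new request: back to the scheduler
    return stored


print(run(['page/1']))
```

Each page links back to the other, yet each is downloaded only once because the scheduler remembers the URLs it has already queued.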

    Downloader middleware (sits between the engine and the downloader): can be used to swap the proxy IP
    Example 7: using a proxy middleware (project dailiPro)
    Write daili.py, then edit process_request in DailiproDownloaderMiddleware in middlewares.py:
        def process_request(self, request, spider):
            # request is the intercepted request object
            request.meta['proxy'] = "https://151.106.15.3:1080"
            return None
     Enable it in DOWNLOADER_MIDDLEWARES in settings.py (lines 55-57)

# -*- coding: utf-8 -*-
import scrapy


class DailiSpider(scrapy.Spider):
    name = 'daili'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.baidu.com/s?wd=ip']

    def parse(self, response):
       page_text = response.text
       with open('daili.html','w',encoding='utf-8') as fp:
           fp.write(page_text)

daili.py

# -*- coding: utf-8 -*-
from scrapy import signals


class DailiproDownloaderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # request is the intercepted request object
        request.meta['proxy'] = "https://151.106.15.3:1080"
        # request.meta={"https":"151.106.15.3:1080"}  # not recommended
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

middlewares.py
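
Since the point of the middleware is to swap proxies, a natural next step is to pick one per request from a pool. The sketch below is an assumption-laden variant of the class above: RandomProxyMiddleware, PROXY_POOL and the second address are invented for illustration, and FakeRequest only exists so the class can be tried without Scrapy installed.

```python
import random

# Hypothetical proxy pool; replace with proxies you are allowed to use
PROXY_POOL = [
    'https://151.106.15.3:1080',
    'https://203.0.113.10:8080',   # placeholder address (documentation range)
]


class RandomProxyMiddleware:
    """Downloader middleware that picks a proxy for each outgoing request."""

    def __init__(self, proxies=PROXY_POOL):
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request


class FakeRequest:
    """Stand-in for scrapy.Request, only for trying the class without Scrapy."""

    def __init__(self):
        self.meta = {}


request = FakeRequest()
RandomProxyMiddleware().process_request(request, spider=None)
print(request.meta['proxy'])
```

To use such a class in a project it would be registered in DOWNLOADER_MIDDLEWARES in the same way as DailiproDownloaderMiddleware.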

 1 # -*- coding: utf-8 -*-
 2 
 3 # Scrapy settings for dailiPro project
 4 #
 5 # For simplicity, this file contains only settings considered important or
 6 # commonly used. You can find more settings consulting the documentation:
 7 #
 8 #     https://doc.scrapy.org/en/latest/topics/settings.html
 9 #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
10 #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
11 
12 BOT_NAME = 'dailiPro'
13 
14 SPIDER_MODULES = ['dailiPro.spiders']
15 NEWSPIDER_MODULE = 'dailiPro.spiders'
16 
17 
18 # Crawl responsibly by identifying yourself (and your website) on the user-agent
19 #USER_AGENT = 'dailiPro (+http://www.yourdomain.com)'
20 USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
21 # Obey robots.txt rules
22 ROBOTSTXT_OBEY = False
23 
24 # Configure maximum concurrent requests performed by Scrapy (default: 16)
25 #CONCURRENT_REQUESTS = 32
26 
27 # Configure a delay for requests for the same website (default: 0)
28 # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
29 # See also autothrottle settings and docs
30 #DOWNLOAD_DELAY = 3
31 # The download delay setting will honor only one of:
32 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
33 #CONCURRENT_REQUESTS_PER_IP = 16
34 
35 # Disable cookies (enabled by default)
36 #COOKIES_ENABLED = False
37 
38 # Disable Telnet Console (enabled by default)
39 #TELNETCONSOLE_ENABLED = False
40 
41 # Override the default request headers:
42 #DEFAULT_REQUEST_HEADERS = {
43 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
44 #   'Accept-Language': 'en',
45 #}
46 
47 # Enable or disable spider middlewares
48 # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
49 #SPIDER_MIDDLEWARES = {
50 #    'dailiPro.middlewares.DailiproSpiderMiddleware': 543,
51 #}
52 
53 # Enable or disable downloader middlewares
54 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
55 DOWNLOADER_MIDDLEWARES = {
56     'dailiPro.middlewares.DailiproDownloaderMiddleware': 543,
57 }
58 
59 # Enable or disable extensions
60 # See https://doc.scrapy.org/en/latest/topics/extensions.html
61 #EXTENSIONS = {
62 #    'scrapy.extensions.telnet.TelnetConsole': None,
63 #}
64 
65 # Configure item pipelines
66 # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
67 #ITEM_PIPELINES = {
68 #    'dailiPro.pipelines.DailiproPipeline': 300,
69 #}
70 
71 # Enable and configure the AutoThrottle extension (disabled by default)
72 # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
73 #AUTOTHROTTLE_ENABLED = True
74 # The initial download delay
75 #AUTOTHROTTLE_START_DELAY = 5
76 # The maximum download delay to be set in case of high latencies
77 #AUTOTHROTTLE_MAX_DELAY = 60
78 # The average number of requests Scrapy should be sending in parallel to
79 # each remote server
80 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
81 # Enable showing throttling stats for every response received:
82 #AUTOTHROTTLE_DEBUG = False
83 
84 # Enable and configure HTTP caching (disabled by default)
85 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
86 #HTTPCACHE_ENABLED = True
87 #HTTPCACHE_EXPIRATION_SECS = 0
88 #HTTPCACHE_DIR = 'httpcache'
89 #HTTPCACHE_IGNORE_HTTP_CODES = []
90 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
91 
92 #DEBUG  INFO  ERROR  WARNING
93 #LOG_LEVEL = 'ERROR'
94 
95 LOG_FILE = 'log.txt'

settings.py


     
7. Configuring log output
Log levels: DEBUG  INFO  WARNING  ERROR
Setting LOG_LEVEL = 'ERROR' in settings.py outputs only error-level logs.
LOG_FILE = 'log.txt' writes the log to a file; see the settings.py listing in section 6 above.
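
Scrapy's logging is built on Python's standard logging module, so the effect of a level threshold can be shown with the standard library alone. The logger name 'demo' and the ListHandler class are made up for this sketch.

```python
import logging

records = []


class ListHandler(logging.Handler):
    """Collects formatted messages in a list instead of printing them."""

    def emit(self, record):
        records.append(record.levelname + ': ' + record.getMessage())


logger = logging.getLogger('demo')
logger.propagate = False          # keep the demo isolated from the root logger
logger.addHandler(ListHandler())
logger.setLevel(logging.ERROR)    # same threshold as LOG_LEVEL = 'ERROR'

logger.debug('request sent')
logger.warning('retrying')
logger.error('download failed')

print(records)
```

Only the error message survives; the debug and warning calls fall below the threshold and are dropped.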


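8. Passing data between requests: the data to scrape is spread over more than one page
  Open question: the regex did not take effect?
Example 8: passing data between requests, scraping movie detail data (project moviePro)
  Put values from different pages into the same item (name and actor)
  Send the request manually with yield
  Passing data: use the meta argument of Request to hand a specific value to the callback named in that request; the callback reads it back from the response:
item = response.meta['item']. The first page supplies name, the second-level page supplies actor.
  yield scrapy.Request(url=url,callback=self.getSencodPageText,meta={'item':item})
  
  def getSencodPageText(self,response):
    # 2. Receive the item object passed along by Request
    item = response.meta['item']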

# -*- coding: utf-8 -*-
import scrapy
from moviePro.items import MovieproItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.dy2018.com/html/gndy/dyzz/']
    # Parses the data on the movie detail page
    def getSencodPageText(self, response):
        # 2. Receive the item object passed along by Request
        item = response.meta['item']
        actor = response.xpath('//*[@id="Zoom"]/p[16]/text()').extract_first()
        item['actor'] = actor

        yield item

    def parse(self, response):
        print(response.text)
        table_list = response.xpath('//div[@class="co_content8"]/ul/table')
        for table in table_list:
            # The href is site-relative, so prepend the scheme and host
            url = "https://www.dy2018.com" + table.xpath('./tbody/tr[2]/td[2]/b/a/@href').extract_first()
            name = table.xpath('./tbody/tr[2]/td[2]/b/a/text()').extract_first()
            print(url)
            item = MovieproItem()  # instantiate the item object
            item['name'] = name

            # 1. Pass the item to getSencodPageText through the meta argument of Request
            yield scrapy.Request(url=url, callback=self.getSencodPageText, meta={'item': item})  # manual request

movie.py
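
The hand-off through meta is easier to follow with the network taken out. The sketch below is a toy model, not Scrapy: Request, Response, PAGES and crawl are simplified stand-ins written for this note, and the page contents are invented. It shows the one idea that matters, namely that the dictionary attached to the request is the same object the second callback receives.

```python
class Request:
    """Stand-in for scrapy.Request: a URL, a callback, and a meta dict."""

    def __init__(self, url, callback, meta=None):
        self.url, self.callback, self.meta = url, callback, meta or {}


class Response:
    """Stand-in for a Scrapy response; meta is taken from its request."""

    def __init__(self, request, body):
        self.meta, self.body = request.meta, body


# Hypothetical page contents keyed by URL
PAGES = {'/list': 'Movie A', '/detail/a': 'Actor X'}


def parse_list(response):
    item = {'name': response.body}
    # Attach the half-filled item to the request for the detail page
    yield Request('/detail/a', callback=parse_detail, meta={'item': item})


def parse_detail(response):
    item = response.meta['item']      # the same dict created in parse_list
    item['actor'] = response.body
    yield item


def crawl(first_request):
    """Tiny engine: follow requests until callbacks yield plain items."""
    pending, items = [first_request], []
    while pending:
        request = pending.pop()
        response = Response(request, PAGES[request.url])
        for result in request.callback(response):
            (pending if isinstance(result, Request) else items).append(result)
    return items


print(crawl(Request('/list', callback=parse_list)))
```

The item is only yielded from the second callback, once both fields are filled in, which is the same ordering movie.py relies on.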



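9. Using CrawlSpider: link extractors & rules
CrawlSpider can crawl an entire site. Key point!
Example 9: using CrawlSpider to crawl all picture pages of Qiushibaike (project crawlPro)
Note: create the spider with scrapy genspider -t crawl qiubai www.xxx.com
    How to pick up the link to the first page? allow keeps the links that match the regex: link1 = LinkExtractor(allow=r'/pic/$')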

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.qiushibaike.com/pic/']
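    # Link extractor (for the pagination links): extracts the matching links from the page source of the start URL
    # allow: a regular expression. Every link in the page source that matches it is extracted
    link = LinkExtractor(allow=r'/pic/page/\d+\?s=\d+')
    # href="/pic/page/5?s=5144132"

    link1 = LinkExtractor(allow=r'/pic/$')  # the regex is matched against each link in full
    # href="/pic/"
    rules = (
        # Rule: sends the pages behind the extracted links to the named callback for parsing
        # follow=True: keep applying the link extractor to the pages reached through extracted links; with follow=False it is applied only to the start_urls pages, so only a few results appear.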
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)

qiubai.py
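
Before running the spider it can help to test the two allow patterns on their own. The sketch below applies them with re.search to a few sample URLs; the third URL is made up as a non-matching case, and the assumption is that LinkExtractor searches each absolute URL with the pattern rather than requiring a match from the start.

```python
import re

PAGE_LINK = re.compile(r'/pic/page/\d+\?s=\d+')
FIRST_PAGE_LINK = re.compile(r'/pic/$')

links = [
    'https://www.qiushibaike.com/pic/page/5?s=5144132',
    'https://www.qiushibaike.com/pic/',
    'https://www.qiushibaike.com/text/',
]

paged = [u for u in links if PAGE_LINK.search(u)]
first = [u for u in links if FIRST_PAGE_LINK.search(u)]
print(paged)
print(first)
```

The trailing $ is what keeps the first-page pattern from also matching the numbered pages, which is why two separate extractors are used.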


    
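10. Distributed crawling: several machines crawl the same site together. Key point!
Install redis from within PyCharm.

Example 10: distributed crawling of Chouti section 42 (project redisPro)
# Crawl the URLs of all images in Chouti section 42
Submit the items to the Redis-backed pipeline
In settings.py: ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}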

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from redisPro.items import RedisproItem
# 0. Import the RedisCrawlSpider class
# 1. Change the spider's parent class to RedisCrawlSpider
# 2. Replace start_urls with the redis_key attribute
# 3. Write the parsing code
# 4. Submit items to the pipeline packaged in scrapy-redis (in settings.py: ITEM_PIPELINES = {
#     'scrapy_redis.pipelines.RedisPipeline': 400
# })
# 5. Submit all request objects produced by the spider to the scrapy-redis scheduler (settings.py, lines 95-100)
# 6. State in the settings which Redis database stores the scraped data (settings.py, lines 105-108)
# 7. Edit the Redis config file (redis.windows.conf): protected-mode no   #bind 127.0.0.1
# 8. Run the spider file: scrapy runspider xxx.py
# 9. Push a start URL into the scheduler queue
class ChoutiSpider(RedisCrawlSpider):
    name = 'chouti'
    # allowed_domains = ['www.xxx.com']
    # start_urls = ['http://www.xxx.com/']
    # Name of the scheduler queue: the start URL is pushed into the queue with this name
    redis_key = "chouti"

    rules = (
        Rule(LinkExtractor(allow=r'/r/news/hot/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        imgUrl_list = response.xpath('//div[@class="news-pic"]/img/@src').extract()
        for url in imgUrl_list:
            item = RedisproItem()
            item['url'] = url

            yield item

chouti.py

  1 # -*- coding: utf-8 -*-
  2 
  3 # Scrapy settings for redisPro project
  4 #
  5 # For simplicity, this file contains only settings considered important or
  6 # commonly used. You can find more settings consulting the documentation:
  7 #
  8 #     https://doc.scrapy.org/en/latest/topics/settings.html
  9 #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
 10 #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
 11 
 12 BOT_NAME = 'redisPro'
 13 
 14 SPIDER_MODULES = ['redisPro.spiders']
 15 NEWSPIDER_MODULE = 'redisPro.spiders'
 16 
 17 
 18 # Crawl responsibly by identifying yourself (and your website) on the user-agent
 19 #USER_AGENT = 'redisPro (+http://www.yourdomain.com)'
 20 USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
 21 # Obey robots.txt rules
 22 ROBOTSTXT_OBEY = False
 23 
 24 # Configure maximum concurrent requests performed by Scrapy (default: 16)
 25 #CONCURRENT_REQUESTS = 32
 26 
 27 # Configure a delay for requests for the same website (default: 0)
 28 # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
 29 # See also autothrottle settings and docs
 30 #DOWNLOAD_DELAY = 3
 31 # The download delay setting will honor only one of:
 32 #CONCURRENT_REQUESTS_PER_DOMAIN = 16
 33 #CONCURRENT_REQUESTS_PER_IP = 16
 34 
 35 # Disable cookies (enabled by default)
 36 #COOKIES_ENABLED = False
 37 
 38 # Disable Telnet Console (enabled by default)
 39 #TELNETCONSOLE_ENABLED = False
 40 
 41 # Override the default request headers:
 42 #DEFAULT_REQUEST_HEADERS = {
 43 #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 44 #   'Accept-Language': 'en',
 45 #}
 46 
 47 # Enable or disable spider middlewares
 48 # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
 49 #SPIDER_MIDDLEWARES = {
 50 #    'redisPro.middlewares.RedisproSpiderMiddleware': 543,
 51 #}
 52 
 53 # Enable or disable downloader middlewares
 54 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
 55 #DOWNLOADER_MIDDLEWARES = {
 56 #    'redisPro.middlewares.RedisproDownloaderMiddleware': 543,
 57 #}
 58 
 59 # Enable or disable extensions
 60 # See https://doc.scrapy.org/en/latest/topics/extensions.html
 61 #EXTENSIONS = {
 62 #    'scrapy.extensions.telnet.TelnetConsole': None,
 63 #}
 64 
 65 # Configure item pipelines
 66 # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
 67 ITEM_PIPELINES = {
 68     'scrapy_redis.pipelines.RedisPipeline': 400
 69 
 70 #    'redisPro.pipelines.RedisproPipeline': 300,
 71 
 72 }
 73 
 74 # Enable and configure the AutoThrottle extension (disabled by default)
 75 # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
 76 #AUTOTHROTTLE_ENABLED = True
 77 # The initial download delay
 78 #AUTOTHROTTLE_START_DELAY = 5
 79 # The maximum download delay to be set in case of high latencies
 80 #AUTOTHROTTLE_MAX_DELAY = 60
 81 # The average number of requests Scrapy should be sending in parallel to
 82 # each remote server
 83 #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
 84 # Enable showing throttling stats for every response received:
 85 #AUTOTHROTTLE_DEBUG = False
 86 
 87 # Enable and configure HTTP caching (disabled by default)
 88 # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
 89 #HTTPCACHE_ENABLED = True
 90 #HTTPCACHE_EXPIRATION_SECS = 0
 91 #HTTPCACHE_DIR = 'httpcache'
 92 #HTTPCACHE_IGNORE_HTTP_CODES = []
 93 #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
 94 
 95 # Use the scrapy-redis deduplication filter
 96 DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
 97 # Use the scheduler provided by scrapy-redis
 98 SCHEDULER = "scrapy_redis.scheduler.Scheduler"
 99 # Whether pausing is allowed (the queue is kept in Redis between runs)
100 SCHEDULER_PERSIST = True
101 
102 
103 
104 
105 REDIS_HOST = '192.168.12.65'
106 REDIS_PORT = 6379
107 #REDIS_ENCODING = 'utf-8'
108 #REDIS_PARAMS = {'password':'123456'}

settings.py


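In the Redis config file, comment out line 56 (bind 127.0.0.1) and on line 75 change protected-mode to no
Running:
1. Start the Redis server: go to the Redis directory and, in cmd, run redis-server ./redis.windows.conf
2. Start the Redis client: redis-cli

3. Run the spider file: in cmd, go to the directory containing chouti.py, F:\Python自動化21期\3.Django&專案\day26 爬蟲1104\課上程式碼及筆記\scrapy專案\redisPro\redisPro\spiders, then run
scrapy runspider chouti.py  (it stops and waits, listening for a start URL)

4. In Redis (redis-cli):
lpush chouti https://dig.chouti.com/r/news/hot/1  (once this runs, the spider's cmd window starts crawling)

5. Inspect the scraped data in Redis:
keys *  (chouti:items should exist)
lrange chouti:items 0 -1

To delete the data, in redis-cli:
flushall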
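
Steps 4 and 5 can also be done from Python instead of redis-cli. The sketch below is hedged in two ways: seed and read_items are hypothetical helpers, and it assumes the pipeline stores each item as a JSON string in the chouti:items list. So that it runs without a server, it uses InMemoryRedis, a tiny stand-in that implements only lpush and lrange; with the redis-py package installed, a real client would be passed in its place.

```python
import json


class InMemoryRedis:
    """Stand-in offering the two list commands used here (lpush, lrange)."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        self.lists.setdefault(key, []).insert(0, value)
        return len(self.lists[key])

    def lrange(self, key, start, end):
        # Simplified: only end == -1 ("to the last element") is special-cased
        values = self.lists.get(key, [])
        return values[start:] if end == -1 else values[start:end + 1]


def seed(client, url, key='chouti'):
    """Push the start URL onto the list the spider is listening on."""
    return client.lpush(key, url)


def read_items(client, key='chouti:items'):
    """Decode every stored item, assuming each entry is a JSON string."""
    return [json.loads(raw) for raw in client.lrange(key, 0, -1)]


client = InMemoryRedis()   # with redis-py: redis.Redis(host='192.168.12.65', port=6379)
seed(client, 'https://dig.chouti.com/r/news/hot/1')
# Simulate what the pipeline would store; the image URL is a placeholder
client.lpush('chouti:items', json.dumps({'url': 'https://example.com/a.jpg'}))
print(read_items(client))
```

Passing the client in as an argument keeps the helpers independent of where Redis runs, which matters once several machines share one server.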

  
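Summary (18:40-50), answers:
1. Two scraping modules: requests and urllib
2. Purpose of the robots protocol: it stops the honest but not the dishonest; a common anti-scraping measure
3. Use a cloud captcha-solving service (雲打碼) or manual recognition. Note: captchas are also an anti-scraping measure used by portal sites
4. Three parsing approaches: xpath, BeautifulSoup, regular expressions
5. selenium (can execute JS code) with PhantomJS or headless Chrome
6. Important! Data encryption (downloading ciphertext) and dynamic data scraping (Pear Video)
token: the value of rkey at login
7. Five: spider file, engine, scheduler, downloader, pipeline
8. Spider / CrawlSpider / RedisCrawlSpider
9. The 10 steps summarized above; try them yourself (distributed storage example)
10. Not covered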




Wrap the content you want to capture in parentheses.
