1. 程式人生 > >爬蟲框架Scrapy入門——爬取acg12某頁面

爬蟲框架Scrapy入門——爬取acg12某頁面

ima 需要 random 代碼 定義 ons tps 框架 resp

1.安裝
1.1自行安裝python3環境
1.2ide使用pycharm
1.3安裝scrapy框架
2.入門案例
2.1新建項目工程
2.2配置settings文件
2.3新建爬蟲app
新建app
將start_urls的值修改為需要爬取的第一個url
修改parse()方法
然後運行一下看看,在mySpider目錄下執行:

1.安裝

1.1自行安裝python3環境

1.2ide使用pycharm

1.3安裝scrapy框架

pip install twisted
pip install lxml
pip install scrapy

2.入門案例

2.1新建項目工程

scrapy startproject mySpider

2.2配置settings文件

ROBOTSTXT_OBEY = True

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"     "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",     "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",     "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",     "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",     "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",     "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",     "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",     "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",     "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",     "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",     "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",     "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",     "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",     "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",     "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
ua = random.choice(USER_AGENT_LIST)
if ua:
    USER_AGENT = ua
    print(ua)
else:
    USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"

#是否遵守robots規則
ROBOTSTXT_OBEY = False
#線程數量
CONCURRENT_REQUESTS = 32
#下載延遲單位秒
DOWNLOAD_DELAY = 3
#cookies開關,建議禁用
COOKIES_ENABLED = False

2.3新建爬蟲app

新建app

在spiders目錄下

scrapy genspider acg12 "acg12.com"

技術分享圖片
技術分享圖片

要建立一個Spider, 你必須用scrapy.Spider類創建一個子類,並確定了三個強制的屬性 和 一個方法。

  • name = "" :這個爬蟲的識別名稱,必須是唯一的,在不同的爬蟲必須定義不同的名字。
  • allow_domains = [] 是搜索的域名範圍,也就是爬蟲的約束區域,規定爬蟲只爬取這個域名下的網頁,不存在的URL會被忽略。
  • start_urls = () :爬取的URL元祖/列表。爬蟲從這裏開始抓取數據,所以,第一次下載的數據將會從這些urls開始。其他子URL將會從這些起始URL中繼承性生成。
  • parse(self, response) :解析的方法,每個初始URL完成下載後將被調用,調用的時候傳入從每一個URL傳回的Response對象來作為唯一參數,主要作用如下:
    1. 負責解析返回的網頁數據(response.body),提取結構化數據(生成item)
    2. 生成需要下一頁的URL請求。

將start_urls的值修改為需要爬取的第一個url

start_urls = ("https://acg12.com/274004/",) #這個網頁可能壞掉換一個新的

修改parse()方法

def parse(self, response):
    with open("acg12.html", "w",encoding=‘utf-8‘) as f:
        f.write(response.text)

然後運行一下看看,在mySpider目錄下執行:

scrapy crawl acg12

是的,就是 acg12,看上面代碼,它是 Acg12Spider 類的 name 屬性,也就是使用 scrapy genspider命令的爬蟲名。技術分享圖片技術分享圖片

爬蟲框架Scrapy入門——爬取acg12某頁面