爬蟲框架Scrapy入門——爬取acg12某頁面
阿新 • • 發佈:2018-09-17
ima 需要 random 代碼 定義 ons tps 框架 resp 1.安裝
1.1自行安裝python3環境
1.2ide使用pycharm
1.3安裝scrapy框架
2.入門案例
2.1新建項目工程
2.2配置settings文件
2.3新建爬蟲app
新建app
將start_urls的值修改為需要爬取的第一個url
修改parse()方法
然後運行一下看看,在mySpider目錄下執行:
1.1自行安裝python3環境
1.2ide使用pycharm
1.3安裝scrapy框架
2.入門案例
2.1新建項目工程
2.2配置settings文件
2.3新建爬蟲app
新建app
將start_urls的值修改為需要爬取的第一個url
修改parse()方法
然後運行一下看看,在mySpider目錄下執行:
1.安裝
1.1自行安裝python3環境
1.2ide使用pycharm
1.3安裝scrapy框架
pip install twisted
pip install lxml
pip install scrapy
2.入門案例
2.1新建項目工程
scrapy startproject mySpider
2.2配置settings文件
ROBOTSTXT_OBEY = True USER_AGENT_LIST = [ "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1" "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24" ] ua = random.choice(USER_AGENT_LIST) if ua: USER_AGENT = ua print(ua) else: USER_AGENT = "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24" #是否遵守robots規則 ROBOTSTXT_OBEY = False #線程數量 CONCURRENT_REQUESTS = 32 #下載延遲單位秒 DOWNLOAD_DELAY = 3 #cookies開關,建議禁用 COOKIES_ENABLED = False
2.3新建爬蟲app
新建app
在spiders目錄下
scrapy genspider acg12 "acg12.com"
要建立一個Spider, 你必須用scrapy.Spider類創建一個子類,並確定了三個強制的屬性 和 一個方法。
name = ""
:這個爬蟲的識別名稱,必須是唯一的,在不同的爬蟲必須定義不同的名字。allow_domains = []
是搜索的域名範圍,也就是爬蟲的約束區域,規定爬蟲只爬取這個域名下的網頁,不存在的URL會被忽略。start_urls = ()
:爬取的URL元祖/列表。爬蟲從這裏開始抓取數據,所以,第一次下載的數據將會從這些urls開始。其他子URL將會從這些起始URL中繼承性生成。parse(self, response)
:解析的方法,每個初始URL完成下載後將被調用,調用的時候傳入從每一個URL傳回的Response對象來作為唯一參數,主要作用如下:- 負責解析返回的網頁數據(response.body),提取結構化數據(生成item)
- 生成需要下一頁的URL請求。
將start_urls的值修改為需要爬取的第一個url
start_urls = ("https://acg12.com/274004/",) #這個網頁可能壞掉換一個新的
修改parse()方法
def parse(self, response):
with open("acg12.html", "w",encoding=‘utf-8‘) as f:
f.write(response.text)
然後運行一下看看,在mySpider目錄下執行:
scrapy crawl acg12
是的,就是 acg12,看上面代碼,它是 Acg12Spider 類的 name 屬性,也就是使用 scrapy genspider
命令的爬蟲名。
爬蟲框架Scrapy入門——爬取acg12某頁面