Scrapy: crawling Zhihu
The main goal:
· Start from the questions under a "How do you evaluate X" style topic, then crawl on to related questions in a loop.
1 Creating the project
$ scrapy startproject zhihu
New Scrapy project 'zhihu', using template directory '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-1.3.2-py2.7.egg/scrapy/templates/project', created in:
    /Users/huilinwang/zhihu
You can start your first spider with:
    cd zhihu
    scrapy genspider example example.com
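The generated project skeleton (here under Scrapy 1.3.2 on Python 2.7) typically looks like the sketch below; the exact set of template files varies slightly between Scrapy versions:

zhihu/
    scrapy.cfg            # deploy configuration
    zhihu/                # the project's Python module
        __init__.py
        items.py          # item definitions (section 4)
        pipelines.py      # item pipelines (section 6)
        settings.py       # project settings (section 3.3)
        spiders/          # spider code goes here (section 2)
            __init__.py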
2 Writing the spider
In the zhihu/zhihu/spiders directory, create a file named zhihuspider.py; the full source is listed in section 5.
2.1 The start_requests(self) method
This method must return an iterable with the first Requests to crawl for this spider. It is called only once, and only when no start URLs are specified (i.e. the start_urls attribute is empty); if start URLs are given, make_requests_from_url() is called for each of them instead.
To log in by POSTing a request, you can use code like the following:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
In this project, start_requests looks like:
def start_requests(self):
    yield scrapy.Request(
        url=self.zhihu_url,
        headers=self.headers_dict,
        callback=self.request_captcha
    )
zhihu_url and headers_dict are defined as shared class attributes of the spider, and request_captcha is the callback that is invoked once the response has been downloaded.
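A minimal sketch of those shared attributes (the full set appears in section 5; the name attribute is not in the original listing but is what lets `scrapy crawl zhihu` find the spider):

class zhihutopicSpider(scrapy.Spider):
    name = "zhihu"                        # spider name used by `scrapy crawl zhihu`
    zhihu_url = "https://www.zhihu.com"
    headers_dict = {
        "Host": "www.zhihu.com",
        "User-Agent": "Mozilla/5.0 ...",  # abbreviated; see section 5 for the full dict
    }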
2.1.1 The scrapy.Request class
Defined in scrapy/http/request/__init__.py; the constructor is:
def __init__(self, url, callback=None, method='GET', headers=None, body=None,
             cookies=None, meta=None, encoding='utf-8', priority=0,
             dont_filter=False, errback=None):
This module implements the Request class, which represents an HTTP request. See the official documentation (docs/topics/request-response.rst).
Scrapy uses Request and Response objects for crawling web sites: Request objects are generated in spiders and passed across the system until they reach the downloader, which executes the request and returns a Response object.
Both Request and Response have subclasses that add functionality not required in the base classes.
The Request class:
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
A Request object represents one HTTP request: it is generated in a spider, executed by the downloader, and thereby produces a Response.
2.1.2 Request parameters
url (string) – the URL of this request. This attribute is read-only; to change the URL, use replace().
callback (callable) – the function that will be called with the response of this request once it is downloaded. If no callback is given, parse() is used by default. Note that if an exception is raised during processing, errback is called instead.
method (string) – the HTTP method of this request; defaults to 'GET'.
meta (dict) – the initial values for the request.meta attribute. If given, the dict is shallow-copied.
body (str or unicode) – the request body. If a unicode is passed, it is encoded to utf-8. If the body is not given, an empty string is stored. Regardless of the type of this argument, the stored value is a str.
headers (dict) – the headers of this request. Values can be single values or multi-valued headers. If None is passed, the HTTP headers are not sent at all.
cookies (dict or list) – the request cookies, which can be sent in two forms.
Using a dict:
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})
Using a list of dicts:
request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])
Stored cookies only come into play on subsequent requests: some sites return cookies in their responses, and these are sent back with later requests, which is the standard web-browser behavior. If, for some reason, you want to avoid merging with existing cookies, set dont_merge_cookies to True in request.meta.
For example, a request that does not merge cookies:
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'},
                               meta={'dont_merge_cookies': True})
encoding (string) – the encoding of this request (defaults to 'utf-8'), used to convert strings to the given encoding.
priority (int) – the priority of this request (defaults to 0). The scheduler uses the priority to decide the order in which requests are processed; higher priorities run earlier, and negative values denote relatively low priority.
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. Useful when you need to issue the same request multiple times and want to ignore the duplicates filter; use it with care or you can end up in crawling loops. Defaults to False.
errback (callable) – the function called when an exception is raised while processing the request, including pages that return 404 HTTP errors.
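The project's spider does not use errback; as a hedged illustration of what an errback handler can look like (handle_error and parse_page are illustrative names, not part of this project):

from scrapy.spidermiddlewares.httperror import HttpError

class ErrbackDemoSpider(scrapy.Spider):
    name = "errback-demo"

    def start_requests(self):
        yield scrapy.Request("http://www.example.com/maybe-missing",
                             callback=self.parse_page,
                             errback=self.handle_error)

    def parse_page(self, response):
        pass

    def handle_error(self, failure):
        # failure is a Twisted Failure wrapping the original exception
        if failure.check(HttpError):
            # non-200 responses that are not otherwise handled end up here
            self.logger.error("HTTP error on %s", failure.value.response.url)
        else:
            self.logger.error(repr(failure))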
2.2 The request_captcha(self, response) method
This is the callback of start_requests: once the downloader has processed the request and returned a response, this method processes it.
def request_captcha(self, response):
    _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
    captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
    yield scrapy.Request(
        url=captcha_url,
        headers=self.headers_dict,
        callback=self.download_captcha,
        meta={"_xsrf": _xsrf}  # forward _xsrf; download_captcha reads it from response.meta
    )
Here time.time() * 1000 builds a timestamp: time.time() returns the current time as floating-point seconds since the epoch (1970), so multiplying by 1000 gives milliseconds.
The line
response.css('input[name="_xsrf"]::attr(value)').extract()[0]
extracts the value of the _xsrf field from the HTML source, i.e. from an element like:
<input type="hidden" name="_xsrf" value="fb57ee37dc9bd70821e6ed878bdfe24f"/>
The method then hands off to download_captcha; note that _xsrf is passed along in meta so the next callback can use it.
2.3 The download_captcha(self, response) method
The method looks like this:
def download_captcha(self, response):
    with open("captcha.gif", "wb") as fp:
        fp.write(response.body)
    os.system('open captcha.gif')  # 'open' launches the image viewer on macOS
    print "請輸入驗證碼:\n"         # prompt: "please enter the captcha"
    captcha = raw_input()
    yield scrapy.FormRequest(
        url=self.login_url,  # login_url, email and password are assumed class attributes (see section 5)
        headers=self.headers_dict,
        formdata={
            "email": self.email,
            "password": self.password,
            "_xsrf": response.meta["_xsrf"],  # forwarded from request_captcha via meta
            "remember_me": "true",
            "captcha": captcha
        },
        callback=self.request_zhihu
    )
This method is the callback of request_captcha. The new element here is scrapy.FormRequest, used to submit the login form.
2.3.1 FormRequest
The FormRequest class extends the base Request class and uses lxml.html forms to pre-populate form fields with data from Response objects.
It adds one new argument to the constructor; the remaining arguments are the same as for Request.
classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])
Returns a new FormRequest whose form field values are pre-populated from the HTML <form> element found in the given response.
Its key parameter:
· formdata (dict) – fields to override in the form data. If a field is already present in the response <form> element, its value is overridden by the one passed in this parameter. This is mainly used to simulate an HTML form POST and send a set of key-value pairs.
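A hedged sketch of from_response in use (the field and callback names are illustrative, not this project's code). Hidden fields such as _xsrf are picked up from the form automatically, so only the fields being overridden need to be listed:

def parse_login_page(self, response):
    # from_response reads the <form> out of the page and fills in its fields;
    # formdata overrides just the fields we supply
    return scrapy.FormRequest.from_response(
        response,
        formdata={'email': 'user@example.com', 'password': 'secret'},
        callback=self.after_login  # illustrative callback name
    )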
Afterwards, the request_zhihu callback runs.
2.4 The request_zhihu(self, response) method
The code is:
def request_zhihu(self, response):
    yield scrapy.Request(url=self.topic + '/19760570',
                         headers=self.headers_dict,
                         callback=self.get_topic_question,
                         dont_filter=True)
Crawling starts from https://www.zhihu.com/topic/19760570/hot (self.topic evidently holds the topic base URL, i.e. "https://www.zhihu.com/topic"). Because this request needs to be re-issued as the crawl loops, dont_filter=True is set. The callback is get_topic_question.
2.5 The get_topic_question(self, response) method
The code is:
def get_topic_question(self, response):
    # with open("topic.html", "wb") as fp:
    #     fp.write(response.body)
    question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
    length = len(question_urls)
    k = -1
    j = 0
    temp = []
    for j in range(length / 3):
        temp.append(question_urls[k + 3])
        j += 1
        k += 3
    for url in temp:
        yield scrapy.Request(url=self.zhihu_url + url,
                             headers=self.headers_dict,
                             callback=self.parse_question_data)
This collects the relative links (the href attributes), which are the relative URLs of the questions under the Zhihu topic, into question_urls. The loop then keeps only every third URL fragment, and finally a new Request is issued for each one, with the fragment joined onto zhihu_url and parse_question_data as the callback.
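The j/k loop selects the URLs at indices 2, 5, 8, and so on; the same selection can be written as a single slice (a behavior-preserving alternative, not what the original code uses):

temp = question_urls[2::3]  # every third URL starting at index 2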
2.6 The parse_question_data(self, response) method
This is the spider's final method:
def parse_question_data(self, response):
    item = ZhihuItem()
    item["qid"] = re.search('\d+', response.url).group()
    item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
    item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
    question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
    item["followers_num"] = question_nums[0]
    item["visitsCount"] = question_nums[1]
    item["topic_views"] = question_nums[2]
    topic_tags = response.css(".zm-item-tag::text").extract()
    if len(topic_tags) >= 3:
        item["topic_tag0"] = topic_tags[0].strip()
        item["topic_tag1"] = topic_tags[1].strip()
        item["topic_tag2"] = topic_tags[2].strip()
        print item
    elif len(topic_tags) == 2:
        item["topic_tag0"] = topic_tags[0].strip()
        item["topic_tag1"] = topic_tags[1].strip()
        item["topic_tag2"] = '-'
    elif len(topic_tags) == 1:
        item["topic_tag0"] = topic_tags[0].strip()
        item["topic_tag1"] = '-'
        item["topic_tag2"] = '-'
    # print type(item["title"])
    question_links = response.css(".question_link::attr(href)").extract()
    yield item
    for url in question_links:
        yield scrapy.Request(url=self.zhihu_url + url,
                             headers=self.headers_dict,
                             callback=self.parse_question_data)
The spider keeps yielding requests for the questions it finds, crawling in a loop until nothing new is left.
3 Editing pipelines.py
The pipeline stores the scraped items in a database.
3.1 The open_spider(self, spider) method
Called when the spider is opened.
3.2 The process_item(self, item, spider) method
This method is called for every item, by every pipeline component. It must either return an Item (or any subclass) object or raise a DropItem exception; dropped items are not processed by any further pipeline components.
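For illustration, a minimal pipeline that validates items and drops incomplete ones (a sketch, not this project's pipeline, which is shown in section 6):

from scrapy.exceptions import DropItem

class ValidationPipeline(object):
    def process_item(self, item, spider):
        if not item.get("title"):
            raise DropItem("missing title in %s" % item)  # later pipelines never see it
        return item  # passed on to the next pipeline component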
3.3 Editing settings.py
Set ROBOTSTXT_OBEY = False (see section 8 for why this is needed).
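Note that the pipeline also has to be registered in ITEM_PIPELINES or it will never run; a sketch of the relevant settings.py lines (the dotted path assumes the project layout from section 1):

# settings.py
ROBOTSTXT_OBEY = False  # do not honor robots.txt; see section 8

ITEM_PIPELINES = {
    'zhihu.pipelines.ZhihuPipeline': 300,  # order value 0-1000; lower runs earlier
}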
4 Contents of items.py
import scrapy

class ZhihuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    qid = scrapy.Field()
    title = scrapy.Field()
    followers_num = scrapy.Field()
    answers_num = scrapy.Field()
    visitsCount = scrapy.Field()
    topic_views = scrapy.Field()
    topic_tag0 = scrapy.Field()
    topic_tag1 = scrapy.Field()
    topic_tag2 = scrapy.Field()
5 Contents of the spider (zhihuspider.py)
#coding=utf-8
import scrapy
import os
import time
import re
import json
from ..items import ZhihuItem
class zhihutopicSpider(scrapy.Spider):
    name = "zhihu"  # missing from the original listing; required for `scrapy crawl zhihu`
    zhihu_url = "https://www.zhihu.com"
    topic = "https://www.zhihu.com/topic"            # topic base URL used by request_zhihu (assumed)
    login_url = "https://www.zhihu.com/login/email"  # email-login endpoint (assumed)
    email = "you@example.com"      # placeholder credentials: fill in your own
    password = "your_password"
    headers_dict = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=self.headers_dict,
            callback=self.request_captcha
        )

    def request_captcha(self, response):
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers_dict,
            callback=self.download_captcha,
            meta={"_xsrf": _xsrf}  # forwarded so download_captcha can read it
        )

    def download_captcha(self, response):
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        os.system('open captcha.gif')  # 'open' launches the image viewer on macOS
        print "請輸入驗證碼:\n"
        captcha = raw_input()
        yield scrapy.FormRequest(
            url=self.login_url,
            headers=self.headers_dict,
            formdata={
                "email": self.email,
                "password": self.password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            callback=self.request_zhihu
        )

    def request_zhihu(self, response):
        yield scrapy.Request(url=self.topic + '/19760570',
                             headers=self.headers_dict,
                             callback=self.get_topic_question,
                             dont_filter=True)

    def get_topic_question(self, response):
        # with open("topic.html", "wb") as fp:
        #     fp.write(response.body)
        question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
        length = len(question_urls)
        k = -1
        j = 0
        temp = []
        for j in range(length / 3):
            temp.append(question_urls[k + 3])
            j += 1
            k += 3
        for url in temp:
            yield scrapy.Request(url=self.zhihu_url + url,
                                 headers=self.headers_dict,
                                 callback=self.parse_question_data)

    def parse_question_data(self, response):
        item = ZhihuItem()  # the original listing said zhihuQuestionItem; items.py defines ZhihuItem
        item["qid"] = re.search('\d+', response.url).group()
        item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
        item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
        question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
        item["followers_num"] = question_nums[0]
        item["visitsCount"] = question_nums[1]
        item["topic_views"] = question_nums[2]
        topic_tags = response.css(".zm-item-tag::text").extract()
        if len(topic_tags) >= 3:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = topic_tags[2].strip()
        elif len(topic_tags) == 2:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = '-'
        elif len(topic_tags) == 1:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = '-'
            item["topic_tag2"] = '-'
        # print type(item["title"])
        question_links = response.css(".question_link::attr(href)").extract()
        yield item
        for url in question_links:
            yield scrapy.Request(url=self.zhihu_url + url,
                                 headers=self.headers_dict,
                                 callback=self.parse_question_data)
6 Contents of the pipeline (pipelines.py)
import MySQLdb

class ZhihuPipeline(object):
    sql_questions = (
        "INSERT INTO questions("
        "qid, title, answers_num, followers_num, visitsCount, topic_views, "
        "topic_tag0, topic_tag1, topic_tag2) "
        "VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')")
    count = 0

    def open_spider(self, spider):
        host = "localhost"
        user = "root"
        password = "wangqi"
        dbname = "zh"
        self.conn = MySQLdb.connect(host, user, password, dbname)
        self.cursor = self.conn.cursor()
        self.conn.set_character_set('utf8')
        self.cursor.execute('SET NAMES utf8;')
        self.cursor.execute('SET CHARACTER SET utf8;')
        self.cursor.execute('SET character_set_connection=utf8;')
        print "\n\nMYSQL DB CURSOR INIT SUCCESS!!\n\n"
        sql = (
            "CREATE TABLE IF NOT EXISTS questions ("
            "qid VARCHAR(100) NOT NULL,"
            "title VARCHAR(100),"
            "answers_num INT(11),"
            "followers_num INT(11) NOT NULL,"
            "visitsCount INT(11),"
            "topic_views INT(11),"
            "topic_tag0 VARCHAR(600),"
            "topic_tag1 VARCHAR(600),"
            "topic_tag2 VARCHAR(600),"
            "PRIMARY KEY (qid)"
            ")")
        self.cursor.execute(sql)
        print "\n\nTABLES ARE READY!\n\n"

    def process_item(self, item, spider):
        sql = self.sql_questions % (item["qid"], item["title"], item["answers_num"],
                                    item["followers_num"], item["visitsCount"],
                                    item["topic_views"], item["topic_tag0"],
                                    item["topic_tag1"], item["topic_tag2"])
        self.cursor.execute(sql)
        if self.count % 10 == 0:
            self.conn.commit()
        self.count += 1
        print item["qid"] + " DATA COLLECTED!"
        return item  # pipelines must return the item (the original listing omitted this)

    def close_spider(self, spider):
        # commit whatever is left and clean up (not in the original listing)
        self.conn.commit()
        self.conn.close()
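One caveat with building the INSERT via % string interpolation: any title containing a quote character breaks the statement, and the pattern is open to SQL injection. A hedged alternative passes the values separately and lets MySQLdb do the escaping:

sql = ("INSERT INTO questions(qid, title, answers_num, followers_num, "
       "visitsCount, topic_views, topic_tag0, topic_tag1, topic_tag2) "
       "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)")
self.cursor.execute(sql, (item["qid"], item["title"], item["answers_num"],
                          item["followers_num"], item["visitsCount"],
                          item["topic_views"], item["topic_tag0"],
                          item["topic_tag1"], item["topic_tag2"]))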
7 Running the spider
scrapy crawl zhihu
8 Anti-crawling and robots.txt
robots.txt (always lowercase) is an ASCII-encoded text file placed in a site's root directory. It tells web spiders which parts of the site should not be fetched by crawlers and which parts may be. robots.txt is a gentlemen's agreement rather than an enforced standard: some crawlers honor it, others do not. Scrapy honors the robots protocol by default, which is why ROBOTSTXT_OBEY must be set to False in settings.py (section 3.3) before you can crawl a site whose robots.txt excludes your spider.
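For reference, a minimal robots.txt that forbids every crawler from a /login path while allowing the rest of the site looks like this:

User-agent: *
Disallow: /login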