
A Super-Simple Scrapy Crawler for Zhihu Questions and Tags

A class assignment, backed up here in case it comes in handy later.

Every Zhihu question page lives at https://www.zhihu.com/question/ followed by an eight-digit number, so we can simply iterate through the IDs in order and skip any that return 404. The Scrapy framework makes this quick to build.
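The enumeration idea can be sketched on its own: a candidate URL is just the base path plus a numeric ID (the IDs below are arbitrary examples):

```python
# Minimal sketch of the ID-enumeration idea: build candidate question
# URLs lazily from a numeric range (IDs here are arbitrary examples).
BASE = "https://www.zhihu.com/question/"

def candidate_urls(start, stop):
    for qid in range(start, stop):
        yield BASE + str(qid)

urls = list(candidate_urls(20000000, 20000003))
```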

Code to extract the question title:

title = response.selector.xpath("/html/head/title[1]/text()").extract_first()[0:-5]
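The `[0:-5]` slice drops the five-character ` - 知乎` suffix that Zhihu appends to every page `<title>`. A quick illustration with a made-up title string:

```python
# Sample <title> text (hypothetical question); Zhihu appends " - 知乎".
raw_title = "有哪些值得一看的科幻电影? - 知乎"
title = raw_title[0:-5]  # strip the 5-character " - 知乎" suffix
```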

Code to extract the question tags:

head_list = response.css("#root > div > main > div > meta:nth-child(3)").xpath("@content").extract_first().split()
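The tags live in a `<meta>` element whose `content` attribute holds a whitespace-separated string, so a plain `.split()` turns it into a list (the tag values below are made up):

```python
meta_content = "科幻 电影 电影推荐"  # hypothetical @content value
head_list = meta_content.split()    # whitespace-separated -> list of tags
```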

Code to get the upvote count of the first answer:

praise_num = response.css("#QuestionAnswers-answers > div > div > div:nth-child(2) > div > div:nth-child(1) > div > meta:nth-child(3)").xpath("@content").extract_first()
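Note that `extract_first()` always returns a string, so the upvote count needs an `int()` conversion before any numeric filtering, as in the commented-out threshold in the spider:

```python
praise_num = "1024"  # extract_first() returns a string (sample value)
passes_filter = int(praise_num) > 100  # numeric comparison needs int()
```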

The Scrapy spider code:

# -*- coding: utf-8 -*-
import scrapy


class ZhihuqSpider(scrapy.Spider):

    name = 'zhihuq'
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['https://www.zhihu.com/question/22913650']

    def parse(self, response):
        # Extract the title; extract_first() returns None on non-question pages,
        # so check before slicing
        title = response.selector.xpath("/html/head/title[1]/text()").extract_first()
        if title:
            title = title[0:-5]  # strip the trailing " - 知乎" suffix
            # Extract the tags
            head_list = response.css("#root > div > main > div > meta:nth-child(3)").xpath("@content").extract_first().split()
            # Get the upvote count of the first answer
            praise_num = response.css("#QuestionAnswers-answers > div > div > div:nth-child(2) > div > div:nth-child(1) > div > meta:nth-child(3)").xpath("@content").extract_first()
            # if int(praise_num) > 100:
            yield {
                'title': title,
                'head_list': head_list,
                'praise_num': praise_num
            }

    def start_requests(self):
        url_base = "https://www.zhihu.com/question/"
        for i in range(20000000, 99999999):
            url = url_base + str(i)
            yield scrapy.Request(url, callback=self.parse)
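Nothing in the spider handles 404s explicitly because Scrapy's default `HttpErrorMiddleware` drops non-2xx responses before they ever reach `parse`. The filtering rule amounts to the following (a simplified illustration, not Scrapy's actual code):

```python
def should_process(status):
    # By default Scrapy only hands responses in the 200 range to the
    # spider; 404s and other error statuses are silently dropped.
    return 200 <= status < 300

statuses = [200, 404, 500, 204]
processed = [s for s in statuses if should_process(s)]
```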

Changes to settings.py:

Request headers:

DEFAULT_REQUEST_HEADERS = {
    "Host": "www.zhihu.com",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Referer": "http://www.zhihu.com/people/raymond-wang",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4,zh-TW;q=0.2",
}

Ignore the robots.txt protocol and disable cookies:

ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
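For a crawl that sweeps tens of millions of IDs, some throttling is usually worth adding as well. `DOWNLOAD_DELAY` and `AUTOTHROTTLE_ENABLED` are standard Scrapy settings; the values below are only a suggested starting point:

```python
# Optional politeness settings for a very broad crawl (suggested values):
DOWNLOAD_DELAY = 0.5         # seconds between requests to the same domain
AUTOTHROTTLE_ENABLED = True  # adapt the request rate to server responsiveness
```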