A Super-Simple Scrapy Spider for Zhihu Questions and Tags
阿新 • Published: 2018-12-19
A class assignment, backed up here in case it comes in handy later.
Every Zhihu question page lives at https://www.zhihu.com/question/ followed by an 8-digit ID, so we can simply iterate through the ID range and skip any ID that returns a 404. The Scrapy framework makes this quick to build.
Code to get the question title:
title = response.selector.xpath("/html/head/title[1]/text()").extract_first()[0:-5]
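The slice `[0:-5]` drops the 5-character suffix " - 知乎" that Zhihu appends to every page title. A minimal sketch with a made-up title string (on Python 3.9+, `str.removesuffix` is a safer alternative, since it only strips the suffix when it is actually present):

```python
raw_title = "Python好學嗎? - 知乎"  # hypothetical value from extract_first()

# Positional slice: drop the last 5 characters (" - 知乎")
title = raw_title[0:-5]

# Safer on Python 3.9+: no-op if the suffix is missing
title_safe = raw_title.removesuffix(" - 知乎")

print(title)  # Python好學嗎?
```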
Code to extract the question's tags:
head_list = response.css("#root > div > main > div > meta:nth-child(3)").xpath("@content").extract_first().split()
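The keywords `<meta>` tag stores all of the question's tags in one space-separated string, so a plain `.split()` turns it into a list. A sketch with a hypothetical `content` value:

```python
# Hypothetical @content value of the keywords meta tag
content = "程式設計 Python 爬蟲"

head_list = content.split()  # splits on any run of whitespace
print(head_list)  # ['程式設計', 'Python', '爬蟲']
```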
Code to get the upvote count of the first answer:
praise_num = response.css("#QuestionAnswers-answers > div > div > div:nth-child(2) > div > div:nth-child(1) > div > meta:nth-child(3)").xpath("@content").extract_first()
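Note that `extract_first()` returns the upvote count as a string (or `None` when the selector matches nothing), so it must be converted before any numeric comparison, as the commented-out `int(praise_num) > 100` filter in the spider below does. A sketch of that guard:

```python
praise_num = "1024"  # hypothetical value; extract_first() yields str or None

# Guard against None before converting, then filter numerically
if praise_num is not None and int(praise_num) > 100:
    print("keep this question")
```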
The Scrapy spider code:
```python
# -*- coding: utf-8 -*-
import scrapy


class ZhihuqSpider(scrapy.Spider):
    name = 'zhihuq'
    allowed_domains = ["www.zhihu.com"]
    start_urls = ['https://www.zhihu.com/question/22913650']

    def parse(self, response):
        # Extract the title; guard against a missing <title> before slicing
        raw_title = response.selector.xpath("/html/head/title[1]/text()").extract_first()
        if raw_title:
            title = raw_title[0:-5]  # strip the trailing " - 知乎" suffix
            # Extract the tags
            head_list = response.css("#root > div > main > div > meta:nth-child(3)").xpath("@content").extract_first().split()
            # Get the upvote count
            praise_num = response.css("#QuestionAnswers-answers > div > div > div:nth-child(2) > div > div:nth-child(1) > div > meta:nth-child(3)").xpath("@content").extract_first()
            # if int(praise_num) > 100:
            yield {
                'title': title,
                'head_list': head_list,
                'praise_num': praise_num,
            }

    def start_requests(self):
        url_base = "https://www.zhihu.com/question/"
        for i in range(20000000, 99999999):
            url = url_base + str(i)
            yield scrapy.Request(url, callback=self.parse)
```
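`start_requests` sweeps the 8-digit ID space lazily, yielding one `scrapy.Request` at a time, so the roughly 80 million candidate URLs are never held in memory at once; Scrapy's default `HttpErrorMiddleware` drops non-2xx responses, which is how the 404 pages get skipped without any explicit handling. The URL generation itself can be sketched without Scrapy:

```python
def question_urls(start=20000000, stop=99999999):
    """Lazily yield candidate Zhihu question URLs, one 8-digit ID at a time."""
    url_base = "https://www.zhihu.com/question/"
    for i in range(start, stop):
        yield url_base + str(i)

urls = question_urls()
print(next(urls))  # https://www.zhihu.com/question/20000000
```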
Changes to make in settings.py:
The request headers:
```python
DEFAULT_REQUEST_HEADERS = {
    "Host": "www.zhihu.com",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36",
    "Referer": "http://www.zhihu.com/people/raymond-wang",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4,zh-TW;q=0.2",
}
```
Stop obeying robots.txt and disable cookies:

```python
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False
```