爬取愛問知識人,問題及回答
主要原始碼:
aiwen_spider.py:
import scrapy
from aiwen.items import AiwenItem
class aiwenSpider(scrapy.Spider):
    """Crawl iask.sina.com.cn "good answer" listings and scrape Q&A pairs.

    ``parse`` walks the listing pages (following pagination back into
    itself) and dispatches every question-detail link to ``content``,
    which yields one AiwenItem per answer found on the page.
    """

    name = "aiwen"
    # Must be a list of bare domain names — a string with slashes/scheme
    # makes the offsite middleware drop every request.
    allowed_domains = ["iask.sina.com.cn"]
    start_urls = [
        "https://iask.sina.com.cn/c/80-goodAnswer-1-new.html",
    ]

    # Single shared UA string; the site rejects the default Scrapy agent.
    user_agent = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134"
    )

    def parse(self, response):
        """Extract question links on a listing page and follow pagination."""
        headers = {"User-Agent": self.user_agent}
        # Question detail links are site-relative; prefix the host.
        links = response.xpath(
            '//div[@class="list-body-con current"]/ul/li/div/'
            'div[@class="question-title"]/a/@href'
        ).extract()
        for href in links:
            url = "https://iask.sina.com.cn" + href
            yield scrapy.Request(
                url, callback=self.content, dont_filter=True, headers=headers
            )
        # Pagination links feed back into parse to crawl every page.
        page_links = response.xpath('//div[@class="page mt30"]/a/@href').extract()
        for href in page_links:
            url = "https://iask.sina.com.cn" + href
            yield scrapy.Request(
                url, callback=self.parse, dont_filter=True, headers=headers
            )

    def content(self, response):
        """Parse one question-detail page into AiwenItem objects.

        Yields one item per answer; the original loop overwrote
        item['answer'] on each iteration and yielded only the last one.
        """
        question = response.xpath('//p[@class="title-text"]/text()').extract_first()
        answers = response.xpath(
            '//div[@class="new-answer-text new-answer-cut new-pre-answer-text"]'
            "/pre/text()"
        ).extract()
        for answer in answers:
            item = AiwenItem()
            item["question"] = question
            item["answer"] = answer
            yield item
# -*- coding: utf-8 -*-
# Define here the models for your scraped items.
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class AiwenItem(scrapy.Item):
    """Container for one scraped question/answer pair."""

    question = scrapy.Field()  # question title text
    answer = scrapy.Field()    # one answer's body text
main.py:
# coding=utf-8
from scrapy import cmdline

if __name__ == "__main__":
    # Equivalent to typing "scrapy crawl aiwen" on the command line.
    cmdline.execute("scrapy crawl aiwen".split())
pipeline.py:
# -*- coding: utf-8 -*-
# Define your item pipelines here.
# Don't forget to add your pipeline to the ITEM_PIPELINES setting.
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
class AiwenPipeline(object):
    """Persist each scraped answer into a local MySQL table."""

    def __init__(self):
        # Fixed: the original defined `init` (missing dunder underscores),
        # so Scrapy never ran this and self.conn was never created.
        self.conn = pymysql.connect(
            host="localhost",
            user="root",
            password="123",
            db="test",
            charset="utf8",
        )
        # Recreate the table from scratch at the start of every crawl.
        with self.conn.cursor() as cursor:
            cursor.execute("DROP TABLE IF EXISTS aiwen")
            cursor.execute("""CREATE TABLE aiwen(aiwen text(1000) )""")
        self.conn.commit()

    def process_item(self, item, spider):
        """Insert one answer row; returns the item for downstream pipelines."""
        # Parameterized query instead of %-formatting + manual escaping:
        # the driver handles quoting, which also prevents SQL injection
        # from scraped (untrusted) text.
        with self.conn.cursor() as cursor:
            cursor.execute(
                "INSERT INTO aiwen(aiwen) VALUES (%s);", (item["answer"],)
            )
        self.conn.commit()
        return item
學習總結:
1.在這個小任務中,鞏固了scrapy框架的使用,同時也掌握了xpath的使用