Crawling two-level web pages
阿新 • Published: 2017-10-06
1. Create a new sun0769 project with Scrapy
scrapy startproject sun0769
2. Define the fields to scrape in items.py
import scrapy

class Sun0769Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    problem_type = scrapy.Field()
    title = scrapy.Field()
    number = scrapy.Field()
    content = scrapy.Field()
    Processing_status = scrapy.Field()
    url = scrapy.Field()
3. Quickly generate a CrawlSpider template
scrapy genspider -t crawl dongguan wz.sun0769.com
Note: the spider name here must not be the same as the project name.
4. Open dongguan.py and write the code
# -*- coding: utf-8 -*-
# import the scrapy module
import scrapy
# import the link-extractor class, used to pull out links that match a rule
from scrapy.linkextractors import LinkExtractor
# import the CrawlSpider class and Rule
from scrapy.spiders import CrawlSpider, Rule
# import the item class defined in items.py
from sun0769.items import Sun0769Item

class DongguanSpider(CrawlSpider):
    name = 'dongguan'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://d.wz.sun0769.com/index.php/question/huiyin?page=30']
    # extractors for pagination links and for question detail pages
    pagelink = LinkExtractor(allow=r"page=\d+")
    pagelink2 = LinkExtractor(allow=r"/question/\d+/\d+\.shtml")

    rules = (
        Rule(pagelink, follow=True),
        Rule(pagelink2, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # print response.url
        item = Sun0769Item()
        # xpath() returns a list
        # item['problem_type'] = response.xpath('//a[@class="red14"]').extract()
        item['title'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(" ")[-1].split(":")[-1]
        item['number'] = response.xpath('//div[@class="pagecenter p3"]//strong[@class="tgray14"]/text()').extract()[0].split(":")[1].split(" ")[0]
        # item['content'] = response.xpath().extract()
        # item['Processing_status'] = response.xpath('//div/span[@class="qgrn"]/text()').extract()[0]
        # hand the item over to the pipeline
        yield item
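The title/number extraction above depends entirely on the layout of the header string that the XPath pulls out. A quick illustration with a made-up header (the exact wording on the real page is an assumption here; it appears to be "編號:<number> 提問:<title>" with fullwidth colons) shows how the chained splits pull the two fields apart:

```python
# Hypothetical header text in the layout the XPath seems to return:
# "編號:<number> 提問:<title>" (fullwidth colons, ASCII space between fields).
header = "編號:191166 提問:關於道路維修的諮詢"  # made-up example values

# last space-separated chunk, then everything after the fullwidth colon
title = header.split(" ")[-1].split(":")[-1]
# second colon-separated chunk, then the part before the space
number = header.split(":")[1].split(" ")[0]

print(title)   # 關於道路維修的諮詢
print(number)  # 191166
```

Note how fragile this is: if the title itself ever contains an ASCII space or a fullwidth colon, both splits break, which is probably why the "content matching is still tricky" problem is listed at the end.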
5. Write the code in pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class TencentPipeline(object):
    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.filename = open("dongguan.json", "w")

    def process_item(self, item, spider):
        # one JSON object per line; ensure_ascii=False keeps Chinese readable
        text = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()
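The pipeline writes one JSON object per line, and `ensure_ascii=False` is what keeps the Chinese text readable instead of `\uXXXX` escapes. A minimal demonstration (with a hypothetical item; note the pipeline above is Python 2 style, and under Python 3 you would open the file with `encoding="utf-8"` and drop the `.encode` call):

```python
import json

item = {"title": "測試標題", "number": "191166"}  # hypothetical scraped item

# without ensure_ascii=False, non-ASCII characters are escaped
escaped = json.dumps(item)
# with ensure_ascii=False, the Chinese text is written as-is
line = json.dumps(item, ensure_ascii=False) + "\n"

print(escaped)  # {"title": "\u6e2c\u8a66\u6a19\u984c", "number": "191166"}
print(line)     # {"title": "測試標題", "number": "191166"}
```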
6. Configure the relevant settings in settings.py
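The post does not show the settings themselves. Presumably the essential part is registering the pipeline written above; a sketch of what settings.py would need (the priority value 300 is the conventional default, and `DOWNLOAD_DELAY` is just a polite-crawling assumption, not something the original requires):

```python
# settings.py (sketch): register the pipeline defined in pipelines.py.
BOT_NAME = "sun0769"
SPIDER_MODULES = ["sun0769.spiders"]
NEWSPIDER_MODULE = "sun0769.spiders"

# enable the JSON-writing pipeline; lower numbers run first (0-1000)
ITEM_PIPELINES = {
    "sun0769.pipelines.TencentPipeline": 300,
}

# optional: slow the crawler down to be gentle with the site
DOWNLOAD_DELAY = 1
```

Without the `ITEM_PIPELINES` entry, `process_item` is never called and dongguan.json stays empty.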
Open problems:
1. How to merge content from different pages into a single item
2. Content matching is still somewhat tricky (XPath, re)