House Price Analysis with Big Data -- 1. Data Crawling
Posted by 阿新 on 2019-02-03
The data was scraped with Python 2.6 and the Scrapy crawler framework. My first version was a whole-site crawler that could start from a single seed URL and collect house price listings for every city on 58.com (58同城). The problem with that approach is that it needs proxy IPs, otherwise the crawler gets banned very quickly. My workaround was to run the crawler on a Linux box once every five minutes, scraping only one city's listings per run. The code is as follows.
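Both the spider and the pipeline read a number of project-specific keys from the Scrapy settings (MONGO_IP, MONGO_PORT, CITY_DB, CITY_COL, CITY, PROVIENCE, MONGO_DB, PROXY_DB, POOL_NAME, ALL). The original settings.py is not part of the post, so the sketch below is only a guess at what it might look like: the key names come from the code that follows, but every concrete value is an assumption. The "one city every five minutes" scheduling itself lives outside Scrapy, for example as a cron job on the Linux machine that runs scrapy crawl ershoufang every five minutes.

# settings.py -- a minimal sketch; only the key names are taken from the
# spider/pipeline code below, every concrete value here is an assumption.
BOT_NAME = 'ershoufang'

MONGO_IP = '127.0.0.1'     # MongoDB host
MONGO_PORT = 27017         # MongoDB port

CITY_DB = 'cities'         # database holding the list of 58.com cities
CITY_COL = 'city_list'     # documents: {_id, city, cityhost, provience, status}

CITY = ''                  # set to a city name to crawl only that city
PROVIENCE = ''             # optionally narrow the city lookup by province

MONGO_DB = 'house'         # database the scraped items are written to
ALL = 'ershoufang'         # collection that receives all house items
PROXY_DB = 'proxy'         # proxy-pool database
POOL_NAME = 'pool'         # proxy-pool collection

ITEM_PIPELINES = {
    'ershoufang.pipelines.ErshoufangPipeline': 300,
}

DOWNLOAD_DELAY = 1         # be polite when crawling without a proxy pool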
1.spiders
#encoding=utf-8
import sys
sys.path.append("..")
import re
import time

import pymongo
import scrapy
from lxml import html
from scrapy.spiders import Spider
from scrapy.utils.project import get_project_settings

from plug.utils import StringUtil, NumberUtil
from ershoufang.items import HouseItem


class erShouSpider(Spider):
    name = "ershoufang"
    allowed_domains = ["58.com"]

    def __init__(self):
        super(erShouSpider, self).__init__()
        self.settings = get_project_settings()
        self.client = pymongo.MongoClient(
            self.settings['MONGO_IP'],
            self.settings['MONGO_PORT'])
        self.cities_db = self.client[self.settings['CITY_DB']]
        self.cities_Col = self.cities_db[self.settings['CITY_COL']]
        self.fillUrl = ""
        self.cityhost = ""
        self.city = ""

    def get_specify_request(self):
        # Build the start request for the city named in the settings.
        condition = {"city": self.settings['CITY']}
        if self.settings['PROVIENCE'] and self.settings['PROVIENCE'] != "":
            condition = {"city": self.settings['CITY'],
                         "provience": self.settings['PROVIENCE']}
        content = self.cities_Col.find_one(condition)
        self.cityhost = content['cityhost']
        self.fillUrl = "http://%s.58.com/ershoufang/" % self.cityhost
        self.city = content["_id"]
        return [scrapy.Request(self.fillUrl)]

    def get_sequence_request(self):
        # Crawl the cities in sequence: pick the next city whose status is
        # still False, mark it as done, and start from its listing page.
        requests = []
        if self.cities_Col.count({"status": False}) <= 0:
            print("all cities done, resetting status to False")
            self.cities_Col.update({}, {"$set": {"status": False}}, multi=True)
        content = self.cities_Col.find_one({"status": False})
        self.cities_Col.update({"_id": content["_id"]}, {"$set": {"status": True}})
        self.client.close()
        self.cityhost = content['cityhost']
        self.fillUrl = "http://%s.58.com/ershoufang/" % self.cityhost
        self.city = content["_id"]
        requests.append(scrapy.Request(self.fillUrl))
        return requests

    def start_requests(self):
        if self.settings['CITY'] and self.settings['CITY'] != '':
            return self.get_specify_request()
        else:
            return self.get_sequence_request()

    def parseUrls(self, doc):
        # Keep only pagination links of the current city's listing pages.
        links = doc.xpath(".//a/@href")
        urls = []
        for link in links:
            if StringUtil.filtString(self.fillUrl + r"pn\d+?/", link):
                urls.append(link)
        return urls

    def parseItems(self, doc, url):
        houselist = doc.xpath(".//ul[@class='house-list-wrap']//div[@class='list-info']")
        items = []
        for houseinfo in houselist:
            detailurl = houseinfo.xpath(".//h2[1]/a/@href")
            title = "".join(houseinfo.xpath(".//h2[1]/a/text()"))
            roomNum = "".join(houseinfo.xpath(".//p[1]/span[1]/text()")[0].split())
            size = "".join(houseinfo.xpath(".//p[1]/span[2]/text()"))
            orient = "".join(houseinfo.xpath(".//p[1]/span[3]/text()"))
            floor = "".join(houseinfo.xpath(".//p[1]/span[4]/text()"))
            address = "".join(("".join(houseinfo.xpath(".//p[2]/span[1]//a/text()"))).split())
            sumprice = "".join(houseinfo.xpath("./following-sibling::div[1]//p[@class='sum']/b/text()"))
            unitprice = "".join(houseinfo.xpath("./following-sibling::div[@class='price']//p[@class='unit']/text()"))
            items.append(HouseItem(
                _id="".join(detailurl),
                title=title,
                roomNum=roomNum,
                size=NumberUtil.fromString(size),
                orient=orient,
                floor=floor,
                address=address,
                sumPrice=NumberUtil.fromString(sumprice),
                unitPrice=NumberUtil.fromString(unitprice),
                city=self.city,
                fromUrl=url,
                nowTime=time.time(),
                status="SUBSPENDING"))
        return items

    def parse(self, response):
        if not response.body:
            return
        doc = html.fromstring(response.body.decode("utf-8"))
        urls = self.parseUrls(doc)
        items = self.parseItems(doc, response.url)
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)
        for item in items:
            yield item
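StringUtil.filtString and NumberUtil.fromString are imported from a separate plug package that the post never shows. Judging only from the call sites, the first tests a URL against a regular expression and the second pulls a numeric value out of text such as "89.5㎡" or "120萬". A minimal stand-in, which is an assumption and not the author's actual implementation, could look like this:

# plug/utils.py -- minimal stand-ins for the helpers imported by the spider.
# Inferred from the call sites above; the original implementation is not shown.
import re


class StringUtil(object):
    @staticmethod
    def filtString(pattern, text):
        # True if the text matches the given regular expression from the start.
        return re.match(pattern, text) is not None


class NumberUtil(object):
    @staticmethod
    def fromString(text):
        # Extract the first number found in the text; 0 if there is none.
        match = re.search(r"\d+(?:\.\d+)?", text)
        return float(match.group()) if match else 0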
2.items
import scrapy


class HouseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    roomNum = scrapy.Field()
    size = scrapy.Field()
    orient = scrapy.Field()
    floor = scrapy.Field()
    address = scrapy.Field()
    sumPrice = scrapy.Field()      # total price of the listing
    unitPrice = scrapy.Field()     # price per square metre
    _id = scrapy.Field()           # detail-page URL, doubles as the MongoDB primary key
    imageurl = scrapy.Field()
    fromUrl = scrapy.Field()       # listing page the item was parsed from
    city = scrapy.Field()
    nowTime = scrapy.Field()       # crawl timestamp
    status = scrapy.Field()
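The pipeline in the next section also imports a ProxyItem from ershoufang.items, which the post does not show. It presumably describes an entry for the proxy pool; a minimal placeholder, whose field names are purely an assumption, would be:

class ProxyItem(scrapy.Item):
    # Placeholder for the proxy-pool item imported by the pipeline below.
    # Field names are an assumption; the original definition is not shown.
    ip = scrapy.Field()
    port = scrapy.Field()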
3.pipelines
#coding: utf-8
import codecs
import json

import pymongo
from scrapy.utils.project import get_project_settings

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from ershoufang.items import ProxyItem


class ErshoufangPipeline(object):
    def __init__(self):
        self.settings = get_project_settings()
        self.client = pymongo.MongoClient(
            host=self.settings['MONGO_IP'],
            port=self.settings['MONGO_PORT'])
        self.db = self.client[self.settings['MONGO_DB']]
        self.proxyclient = self.client[self.settings['PROXY_DB']][self.settings['POOL_NAME']]
        self.itemNumber = 0

    def process_proxy(self, item):
        self.proxyclient.insert(dict(item))

    def process_item(self, item, spider):
        # Proxy items go into the proxy pool; everything else is a house record.
        if isinstance(item, ProxyItem):
            self.process_proxy(item)
            return item
        try:
            if not item['address']:
                print(item["fromUrl"] + " page looks broken, skipping")
                return item
            # (originally commented out) remove stale records for this listing:
            # if self.db.ershoufang.count({"_id": item["_id"], "city": item['city']}) <= 0:
            #     print("delete")
            #     self.db.ershoufang.remove({"_id": item["_id"]})
            coll = self.db[self.settings['ALL']]
            coll.insert(dict(item))
            self.itemNumber += 1
            print("scraped house #%s, address: %s" % (self.itemNumber, item['address']))
        except Exception as e:
            # _id is the detail URL, so inserting a duplicate raises here.
            print("house already exists: " + item['address'])
        return item

    def close_spider(self, spider):
        self.client.close()
        print("this run collected %s house records in total" % self.itemNumber)
The crawler ran for three days and collected more than two million records; the results are shown below.