Using Scrapy to scrape Lianjia housing prices and save them locally

Axin • Published: 2018-12-13

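Because I am planning to rent a place in Beijing, I browsed the housing prices on the Lianjia website and decided to scrape them and save them locally.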

First, look at the page source of the Lianjia site: the housing price information is all stored in the li elements under a ul.
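To make the "li under ul" layout concrete, here is a minimal sketch of pulling fields out of such a list. The markup, field names and example.com URLs are made up for illustration (the real Lianjia page is nested more deeply), and the standard library's ElementTree stands in for Scrapy's selectors so the snippet runs without extra installs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; not the real Lianjia page structure.
SAMPLE = """
<div>
  <ul>
    <li><a href="https://example.com/house/1">Listing A</a><span>500</span></li>
    <li><a href="https://example.com/house/2">Listing B</a><span>620</span></li>
  </ul>
</div>
"""


def extract_listings(markup):
    """Return one dict per <li> found under a <ul>."""
    root = ET.fromstring(markup.strip())
    listings = []
    for li in root.findall(".//ul/li"):
        link = li.find("a")
        listings.append({
            "title": link.text,
            "url": link.get("href"),
            "price": li.find("span").text,
        })
    return listings


listings = extract_listings(SAMPLE)
for listing in listings:
    print(listing)
```

The spider does the same thing with response.xpath: select every li once, then run relative XPath queries against each one.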

Project structure of the spider:

The project also wraps a database-handling module and a user-agent pool.
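The user-agent pool itself is not listed in this post. As a rough sketch of how such a pool is usually written as a Scrapy downloader middleware: the class name below matches the commented-out entry in settings.py, but the body, the user-agent strings and the FakeRequest stand-in are assumptions made for illustration.

```python
import random

# Hypothetical pool; a real one would list current browser user-agent strings.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:63.0) Gecko/20100101 Firefox/63.0',
]


class RotateUserAgentMiddleware:
    """Pick a random user agent for every outgoing request."""

    def __init__(self, user_agents=None):
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None  # returning None lets Scrapy keep processing the request


class FakeRequest:
    """Stand-in for scrapy.Request so the sketch runs without Scrapy."""

    def __init__(self):
        self.headers = {}


request = FakeRequest()
RotateUserAgentMiddleware().process_request(request, spider=None)
print(request.headers['User-Agent'])
```

In a real project a class like this would be registered under DOWNLOADER_MIDDLEWARES, since it is the downloader middleware chain that sees outgoing requests.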

First, mylianjia.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import LianjiaItem
from scrapy.http import Request
from parsel import Selector
import requests
import os


class MylianjiaSpider(scrapy.Spider):
    name = 'mylianjia'
    #allowed_domains = ['lianjia.com']
    start_urls = ['https://bj.lianjia.com/ershoufang/chaoyang/pg']

    def start_requests(self):
        # Request all 100 result pages: .../pg1 through .../pg100
        for page in range(1, 101):
            url = self.start_urls[0] + str(page)
            yield Request(url, callback=self.parse)

    def parse(self, response):
        print(response.url)

        '''
        response1 = requests.get(response.url, params={'search_text': '粉墨', 'cat': 1001})
        if response1.status_code == 200:
            print(response1.text)
        dirPath = os.path.join(os.getcwd(), 'data')
        if not os.path.exists(dirPath):
            os.makedirs(dirPath)
        with open(os.path.join(dirPath, 'lianjia.html'), 'w', encoding='utf-8')as fp:
            fp.write(response1.text)
            print('Page source written')
        '''

        infoall=response.xpath("//div[4]/div[1]/ul/li")
        #infos = response.xpath('//div[@class="info clear"]')
        #print(infos)
        #info1 = infoall.xpath('div/div[1]/a/text()').extract_first()
        #print(infoall)
        for info in infoall:
            item =LianjiaItem()
            #print(info)
            info1 = info.xpath('div/div[1]/a/text()').extract_first()
            info1_url = info.xpath('div/div[1]/a/@href').extract_first()
            #info2 = info.xpath('div/div[2]/div/text()').extract_first()
            info2_dizhi = info.xpath('div/div[2]/div/a/text()').extract_first()
            info2_xiangxi= info.xpath('div/div[2]/div/text()').extract()
            #info3 = info.xpath('div/div[3]/div/a/text()').extract_first()
            #info4 = info.xpath('div/div[4]/text()').extract_first()
            price = info.xpath('div/div[4]/div[2]/div/span/text()').extract_first()
            perprice = info.xpath('div/div[4]/div[2]/div[2]/span/text()').extract_first()
            #print(info1,'--',info1_url,'--',info2_dizhi,'--',info2_xiangxi,'--',info4,'--',price,perprice)
            # Join the extracted text fragments into a single string
            info2_xiangxi1 = ''.join(info2_xiangxi)

            item['houseinfo']=info1
            item['houseurl']=info1_url
            item['housedizhi']=info2_dizhi
            item['housexiangxi']=info2_xiangxi1
            item['houseprice']=price
            item['houseperprice']=perprice

            yield item

Next, items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    houseinfo = scrapy.Field()
    houseurl = scrapy.Field()
    housedizhi = scrapy.Field()
    housexiangxi = scrapy.Field()
    houseprice = scrapy.Field()
    houseperprice = scrapy.Field()

Next, pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class LianjiaPipeline(object):
    def process_item(self, item, spider):
        print('Listing:', item['houseinfo'])
        print('Link:', item['houseurl'])
        print('Location:', item['housedizhi'])
        print('Details:', item['housexiangxi'])
        print('Total price:', item['houseprice'], 'wan (x10,000 CNY)')
        print('Price per square metre:', item['houseperprice'])
        print('====' * 10)
        return item

Next, csvpipelines.py

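import os

print(os.getcwd())  # Show the working directory when the module is imported


class LianjiaPipeline(object):
    def process_item(self, item, spider):
        # Raw string, so the backslashes in the Windows path are kept literally
        with open(r'G:\pythonAI\爬蟲大全\lianjia\data\house.txt', 'a+', encoding='utf-8') as fp:
            name = str(item['houseinfo'])
            dizhi = str(item['housedizhi'])
            info = str(item['housexiangxi'])
            price = str(item['houseprice'])
            perprice = str(item['houseperprice'])
            # One line per listing; the with block flushes and closes the file
            fp.write(name + dizhi + info + price + perprice + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the crawl finishes
        print('File written successfully')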
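The pipeline above concatenates the fields with no separator, which makes the text file hard to split again later. One possible alternative, given here only as a sketch and not as the author's code, is to write delimited rows with the standard csv module. The class name CsvWriterPipeline, the stream parameter and the sample item are assumptions for illustration; a real Scrapy pipeline would open its file in open_spider and close it in close_spider.

```python
import csv
import io

FIELDS = ['houseinfo', 'housedizhi', 'housexiangxi', 'houseprice', 'houseperprice']


class CsvWriterPipeline:
    """Sketch of a pipeline that writes one delimited row per item."""

    def __init__(self, stream):
        # stream: any text file-like object opened by the caller (hypothetical parameter)
        self.writer = csv.DictWriter(stream, fieldnames=FIELDS, extrasaction='ignore')
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Fields not listed in FIELDS (such as houseurl) are skipped
        self.writer.writerow(dict(item))
        return item


# Usage with an in-memory buffer and a made-up item
buffer = io.StringIO()
pipeline = CsvWriterPipeline(buffer)
pipeline.process_item({
    'houseinfo': 'Sample listing',
    'houseurl': 'https://example.com/house/1',
    'housedizhi': 'Sample community',
    'housexiangxi': '2 rooms, 89 sqm',
    'houseprice': '500',
    'houseperprice': '56000',
}, spider=None)
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
print(rows[0]['houseinfo'], rows[0]['houseprice'])
```

Because the csv module quotes any field that contains a comma, detail strings with punctuation still round-trip cleanly.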

Next, settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for lianjia project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'lianjia'

SPIDER_MODULES = ['lianjia.spiders']
NEWSPIDER_MODULE = 'lianjia.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'lianjia (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'lianjia.middlewares.LianjiaSpiderMiddleware': 543,
    #'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware' : None,
    #'lianjia.rotate_useragent.RotateUserAgentMiddleware' :400
}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'lianjia.middlewares.LianjiaDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'lianjia.pipelines.LianjiaPipeline': 300,
    #'lianjia.iopipelines.LianjiaPipeline': 301,
    'lianjia.csvpipelines.LianjiaPipeline':302,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


LOG_LEVEL='INFO'
LOG_FILE='lianjia.log'

Finally, starthouse.py

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'mylianjia'])

Result of running the code

What the data looks like once saved locally:

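Done. Afterwards, the total prices and the prices per square metre can be analysed. Because these listings are in Haidian District, the unit price runs to several tens of thousands of yuan per square metre and the total price to several million, and these are second-hand homes. Buying a home in Beijing is clearly very hard.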
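As a starting point for that analysis, here is a minimal sketch that pulls the number out of each per-square-metre price string and averages them. The sample strings and their format are assumptions made for illustration, not values taken from an actual crawl.

```python
import re
from statistics import mean


def parse_unit_price(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r'\d+', text or '')
    return int(match.group()) if match else None


# Hypothetical values shaped like a scraped houseperprice field
samples = ['单价85000元/平米', '单价92000元/平米', None]

prices = [p for p in map(parse_unit_price, samples) if p is not None]
average = mean(prices)
print('Listings with a price:', len(prices))
print('Average price per square metre:', average)
```

Items whose XPath query matched nothing come back as None, so the helper skips them instead of raising.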

Source code: tyutltf/lianjia (scrapes Lianjia Beijing housing prices and saves them to a txt file) https://github.com/tyutltf/lianjia
