python 爬蟲例項(四)
阿新 • • 發佈:2018-12-19
環境:
OS:Window10
python:3.7
爬取鏈家地產上的資料(跨兩個頁面的資料爬取)
效果:將下面兩個網頁中的資料取出來
程式碼
import datetime import threading import requests from bs4 import BeautifulSoup class LianjiaHouseInfo: ''' 初期化變數的值 ''' def __init__(self): # 定義自己要爬取的URL self.url = "https://dl.lianjia.com/ershoufang/pg{0}" self.path = r"C:\pythonProject\Lianjia_House" self.headers = { "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"} ''' 訪問URL ''' def request(self, param): # 如果不加的話可能會出現403的錯誤,所以儘量的都加上header,模仿網頁來訪問 req = requests.get(param, headers=self.headers) # req.raise_for_status() # req.encoding = req.apparent_encoding return req.text ''' page設定 ''' def all_pages(self, pageCn): dataListA = [] for i in range(1, pageCn+1): if pageCn == 1: dataListA = dataListA + self.getData(self.url[0:self.url.find("pg")]) else: url = self.url.format(i) dataListA = dataListA + self.getData(url) # self.dataOrganize(dataListA) ''' 資料取得 ''' def getData(self, url): dataList = [] thread_lock.acquire() req = self.request(url) # driver = webdriver.Chrome() # driver.get(self.url) # iframe_html = driver.page_source # driver.close() # print(iframe_html) soup = BeautifulSoup(req, 'lxml') countHouse = soup.find(class_="total fl").find("span") print("共找到 ", countHouse.string, " 套大連二手房") sell_all = soup.find(class_="sellListContent").find_all("li") for sell in sell_all: title = sell.find(class_="title") if title is not None: print("------------------------概要--------------------------------------------") title = title.find("a") print("title:", title.string) housInfo = sell.find(class_="houseInfo").get_text() print("houseInfo:", housInfo) positionInfo = sell.find(class_="positionInfo").get_text() print("positionInfo:", positionInfo) followInfo = sell.find(class_="followInfo").get_text() print("followInfo:", followInfo) print("------------------------詳細資訊--------------------------------------------") url_detail = title["href"] req_detail = self.request(url_detail) soup_detail = BeautifulSoup(req_detail, "lxml") total = 
soup_detail.find(class_="total") unit = soup_detail.find(class_="unit").get_text() dataList.append(total.string+unit) print("總價:", total.string, unit) unitPriceValue = soup_detail.find(class_="unitPriceValue").get_text() dataList.append(unitPriceValue) print("單價:", unitPriceValue) room_mainInfo = soup_detail.find(class_="room").find(class_="mainInfo").get_text() dataList.append(room_mainInfo) print("戶型:", room_mainInfo) type_mainInfo = soup_detail.find(class_="type").find(class_="mainInfo").get_text() dataList.append(type_mainInfo) print("朝向:", type_mainInfo) area_mainInfo = soup_detail.find(class_="area").find(class_="mainInfo").get_text() dataList.append(area_mainInfo) print("面積:", area_mainInfo) else: print("広告です") thread_lock.release() return dataList # # def dataOrganize(self, dataList): # # data2 = pd.DataFrame(dataList) # data2.to_csv(r'C:\Users\peiqiang\Desktop\lagoujob.csv', header=False, index=False, mode='a+') # data3 = pd.read_csv(r'C:\Users\peiqiang\Desktop\lagoujob.csv', encoding='gbk') thread_lock = threading.BoundedSemaphore(value=100) house_Info = LianjiaHouseInfo() startTime = datetime.datetime.now() house_Info.all_pages(1) endTime = datetime.datetime.now() print("実行時間:", (endTime - startTime).seconds)
執行之後的效果