Scraping rental listings from a Shenzhen rental website with requests, pyquery, and BeautifulSoup

阿新 • Published: 2020-12-02

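import requests
from requests.exceptions import RequestException
from pyquery import PyQuery as pq
from bs4 import BeautifulSoup
import pymongo
# config.py must define MONGO_URL, MONGO_DB and MONGO_TABLE
from config import MONGO_URL, MONGO_DB, MONGO_TABLE
from multiprocessing import Pool
import time

# Note: PyMongo recommends creating the client inside each worker process
# rather than before forking; a module-level client is kept here for brevity.
client = pymongo.MongoClient(MONGO_URL)    # MongoDB client connection
db = client[MONGO_DB]    # database handle
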
def get_one_page_html(url):    # fetch the HTML of one page; returns None on failure
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/85.0.4183.121 Safari/537.36"
    }
    try:
        # a timeout keeps a stalled connection from blocking the worker forever
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def get_room_url(html):    # collect the detail-page links and address blocks of every listing on the page
    soup = BeautifulSoup(html, 'lxml')
    addresses = soup.find_all('p', {'class': 'content__list--item--des'})
    doc = pq(html)
    room_urls = doc('.content__list--item--main .content__list--item--title a').items()
    return room_urls, addresses


def parser_room_page(room_html, address_queue1, address_queue2, address_queue3):    # parse a listing detail page and extract selected fields
    soup = BeautifulSoup(room_html, 'lxml')
    pinpai = soup.find('p', {'class': 'content__aside__list--subtitle oneline'}).text.strip().split(' ')[0]    # brand
    price = soup.find_all('li', {'class': 'table_col'})
    zujin = price[6].text    # rent
    yajin = price[7].text    # deposit
    fuwufei = price[8].text    # service fee
    zhongjiefei = price[9].text    # agency fee
    house_type = soup.find('ul', {'class': 'content__aside__list'}).find_all('li')[1].text[5:11]    # layout
    # The fixed indices below depend on the page layout at the time of writing
    x = soup.find_all('li', {'class': 'fl oneline'})
    area = x[1].text[3:]  # floor area
    floor = x[7].text[3:]    # floor
    direction = x[2].text[3:]    # orientation
    elevator = x[8].text[3:]    # whether there is an elevator
    carport = x[10].text[3:]    # whether there is a parking space
    tenancy = x[18].text[3:]    # lease term
    maintenance = x[4].text[3:]    # last maintenance date
    kanfang = x[21].text[3:]   # whether viewings require an appointment
    tags = soup.find('p', {'class': 'content__aside--tags'}).get_text().replace('\n', '')    # tags

    yield {
        'pinpai': pinpai,
        'zujin': zujin,
        'yajin': yajin,
        'fuwufei': fuwufei,
        'zhongjiefei': zhongjiefei,
        'house_type': house_type,
        'area': area,
        'floor': floor,
        'direction': direction,
        'elevator': elevator,
        'carport': carport,
        'tenancy': tenancy,
        'maintenance': maintenance,
        'kanfang': kanfang,
        'location1': address_queue1.pop(),
        'location2': address_queue2.pop(),
        'location3': address_queue3.pop(),
        'tags': tags,
    }


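def save_to_mongo(result):
    # insert_one() raises on failure instead of returning a falsy value,
    # so failures are detected by catching the exception
    try:
        db[MONGO_TABLE].insert_one(result)
    except pymongo.errors.PyMongoError:
        return False
    print('Saved to MongoDB successfully', result)
    return True

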
def main(page):
    url = 'http://sz.xxxxx.com/zufang/pg' + str(page) + 'rt200600000002/#contentList'
    html = get_one_page_html(url)
    if html is None:    # skip this page if the request failed
        return
    room_urls, addresses = get_room_url(html)
    # Lists used as FIFO queues: insert(0, ...) enqueues, pop() dequeues the oldest entry
    address_queue1 = []    # district of the listing (Nanshan, Futian, etc.)
    address_queue2 = []
    address_queue3 = []    # residential complex of the listing
    for address in addresses:
        temp = address.find_all('a')
        address_queue1.insert(0, temp[0].text)
        address_queue2.insert(0, temp[1].text)
        address_queue3.insert(0, temp[2].text)
    for room_url in room_urls:
        room_url_href = room_url.attr('href')
        room_url_href = 'http://sz.xxxxx.com/' + room_url_href
        room_html = get_one_page_html(room_url_href)
        if room_html is None:    # important: parsing None would raise an error
            # discard this listing's addresses so later listings stay aligned
            address_queue1.pop()
            address_queue2.pop()
            address_queue3.pop()
            continue
        results = parser_room_page(room_html, address_queue1, address_queue2, address_queue3)
        for result in results:
            save_to_mongo(result)

if __name__ == '__main__':
    time1 = time.time()
    with Pool() as pool:    # use multiple processes to speed up scraping
        pool.map(main, range(1, 101))
    time2 = time.time()
    print(time2 - time1)    # elapsed time in seconds
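
The script imports MONGO_URL, MONGO_DB and MONGO_TABLE from a config module that the post does not show. A minimal sketch of what such a config.py might look like follows; the connection string, database name, and collection name are all assumptions chosen for illustration, not values from the original post.

```python
# config.py (hypothetical values; adjust to your own MongoDB deployment)
import os

# Connection string; the local default is an assumption, not from the post
MONGO_URL = os.environ.get("MONGO_URL", "mongodb://localhost:27017")

# Database and collection names are placeholders chosen for illustration
MONGO_DB = os.environ.get("MONGO_DB", "rental")
MONGO_TABLE = os.environ.get("MONGO_TABLE", "shenzhen_rooms")
```

Reading the values from environment variables with local defaults keeps deployment-specific settings out of the scraper itself.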
