Web Scraping Practice -- Lianjia
Note: please do not scrape large amounts of data; this is for learning purposes only.
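Analysis:
- Analyze the business requirements ... (in this example, housing information ...)
- Find web pages that carry the relevant information (Lianjia is used as the example)
- Analyze the URL, find the content we need, and establish a connection
- Locate the data
- Save the data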
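The steps above can be sketched as a small pipeline. This is only an illustrative skeleton: the names `crawl`, `fetch`, `parse`, `store`, and the `delay` parameter are assumptions that do not appear in the original code, and the pause between pages is there only to honor the note about not scraping too much.

```python
import time
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, ...]


def crawl(
    pages: Iterable[int],
    fetch: Callable[[str], str],
    parse: Callable[[str], List[Row]],
    store: Callable[[Row], None],
    delay: float = 1.0,
) -> int:
    """Fetch each page, parse it into rows, store the rows; return the row count."""
    count = 0
    for i, page in enumerate(pages):
        if i > 0:
            time.sleep(delay)  # pause between requests to keep the load light
        url = 'https://bj.lianjia.com/zufang/pg{}'.format(page)
        html = fetch(url)        # establish the connection and download the page
        for row in parse(html):  # locate the data
            store(row)           # save the data
            count += 1
    return count


# Usage with stand-in callables, so no network access is needed.
seen_urls = []
saved = []


def fake_fetch(url):
    seen_urls.append(url)
    return '<html></html>'


def fake_parse(html):
    return [('title', 'region')]


total = crawl(range(1, 3), fake_fetch, fake_parse, saved.append, delay=0)
```

Passing the fetch, parse, and store steps in as callables keeps each stage easy to replace or test on its own; in a real run they would wrap `requests`, the XPath extraction, and the database insert.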
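First, open the Lianjia home page, click 租房 (Rentals), and press F12 to inspect the page and find the information we need, as shown in the figure:
Then locate the information we need, as shown in the figure below:
Next we turn the analysis into code: fetch the data, then locate the fields within it.
Key code: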
# Build the URL by appending the page number
url = 'https://bj.lianjia.com/zufang/pg{}'.format(page)

# Locate the data with XPath
...
html_pipei = html_ele.xpath('//ul[@id="house-lst"]/li')
for pipei_one in html_pipei:
    title = pipei_one.xpath('./div[2]/h2/a')[0].text
    region = pipei_one.xpath('./div[2]/div[1]/div[1]/a/span')[0].text
    ...
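To see how these relative paths resolve, here is a minimal sketch that runs the same expressions against a tiny hand-written document. The markup in `SAMPLE` is hypothetical and only mimics the nesting that the expressions expect; the real page is larger and may differ. To stay runnable without extra installs, the sketch uses the standard library's `xml.etree.ElementTree` instead of `lxml`. ElementTree supports only a subset of XPath, so the search starts with `.//` rather than `//`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real page source).
SAMPLE = """
<html><body>
<ul id="house-lst">
  <li>
    <div>image</div>
    <div>
      <h2><a>Sample listing A</a></h2>
      <div><div><a><span>Sample community A</span></a></div></div>
    </div>
  </li>
  <li>
    <div>image</div>
    <div>
      <h2><a>Sample listing B</a></h2>
      <div><div><a><span>Sample community B</span></a></div></div>
    </div>
  </li>
</ul>
</body></html>
""".strip()

# Page URLs are built by substituting the page number into the template.
urls = ['https://bj.lianjia.com/zufang/pg{}'.format(page) for page in range(1, 4)]

root = ET.fromstring(SAMPLE)

# Select every <li> under the list, then read fields relative to each item.
rows = []
for item in root.findall('.//ul[@id="house-lst"]/li'):
    title = item.findall('./div[2]/h2/a')[0].text
    region = item.findall('./div[2]/div[1]/div[1]/a/span')[0].text
    rows.append((title, region))

print(urls)
print(rows)
```

Each `[n]` counts positions among siblings with the same tag, so `./div[2]` is the second `div` child of the `li`, whatever other elements sit beside it.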
The complete code is as follows:

import requests
from lxml import etree
import pymysql


class Mysql(object):
    '''Wrapper class for database operations.'''

    def __init__(self):
        '''Connect to the database and create a cursor.'''
        self.db = pymysql.connect(host="localhost", user="root", password="8888", database="test")
        self.cursor = self.db.cursor()

    def mysql_op(self, sql, data):
        '''Execute a parameterized SQL statement and commit it.'''
        self.cursor.execute(sql, data)
        self.db.commit()

    def __del__(self):
        '''Close the cursor and the database connection.'''
        self.cursor.close()
        self.db.close()


# Database helper instance
Insert = Mysql()

# SQL statement to execute
sql = '''INSERT INTO lianjia (title, region, zone, meters, location, price) VALUES(%s, %s, %s, %s, %s, %s)'''

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
}


def download_msg():
    # range(1, 2) fetches only the first page
    for page in range(1, 2):
        url = 'https://bj.lianjia.com/zufang/pg{}'.format(page)
        responses = requests.get(url, headers=headers)
        html = responses.text

        # Parse the page and select each listing with XPath
        html_ele = etree.HTML(html)
        html_pipei = html_ele.xpath('//ul[@id="house-lst"]/li')

        for pipei_one in html_pipei:
            # All paths below are relative to the current <li>
            title = pipei_one.xpath('./div[2]/h2/a')[0].text
            region = pipei_one.xpath('./div[2]/div[1]/div[1]/a/span')[0].text
            zone = pipei_one.xpath('./div[2]/div[1]/div[1]/span[1]/span')[0].text
            meters = pipei_one.xpath('./div[2]/div[1]/div[1]/span[2]')[0].text
            location = pipei_one.xpath('./div[2]/div[1]/div[1]/span[3]')[0].text
            price = pipei_one.xpath('.//div[@class="price"]/span')[0].text

            data = (title, region, zone, meters, location, price)
            Insert.mysql_op(sql, data)


if __name__ == '__main__':
    download_msg()
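The script assumes that a `lianjia` table with the six inserted columns already exists; the original does not show its schema. The sketch below illustrates the storage step end to end, using the standard library's `sqlite3` with an in-memory database as a stand-in for MySQL so that it runs without a server. The column types and the sample row are assumptions. Note that `sqlite3` uses `?` placeholders where `pymysql` uses `%s`.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL "test" database.
conn = sqlite3.connect(':memory:')

# Assumed schema: the original only shows the column names, not their types.
conn.execute(
    '''CREATE TABLE lianjia (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT, region TEXT, zone TEXT,
        meters TEXT, location TEXT, price TEXT
    )'''
)

# Parameterized insert; the driver handles quoting of the values.
sql = '''INSERT INTO lianjia (title, region, zone, meters, location, price)
         VALUES (?, ?, ?, ?, ?, ?)'''
data = ('Sample listing', 'Sample community', '2 rooms', '60 sqm', 'Sample area', '5000')
conn.execute(sql, data)
conn.commit()

stored = conn.execute('SELECT title, price FROM lianjia').fetchall()
conn.close()
print(stored)
```

Passing the values separately from the SQL text, as the original `mysql_op` also does, avoids building statements by string concatenation from scraped text.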