Python Web Scraping: Setting Sail 2.0

阿新 • Published: 2018-11-30


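Parsing web pages:

1. Parsing with regular expressions

2. BeautifulSoup. BeautifulSoup turns a page into a complex tree structure in which every node is a Python object, so getting content out of a web page comes down to extracting content from those objects. Its extraction methods fall into three groups: 1) traversing the document tree, 2) searching the document tree, 3) CSS selectors.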
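To make the three BeautifulSoup extraction styles concrete, here is a minimal offline sketch. The HTML fragment and the variable names are made up for illustration (they only loosely resemble the blog markup scraped in this post), so treat it as a sketch under those assumptions rather than as the exact structure of the real page.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical page fragment, invented for this example.
HTML = """
<html><body>
<header>
<h1 class="post-title"><a href="/a">First post</a></h1>
<h1 class="post-title"><a href="/b">Second post</a></h1>
</header>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# 1) Traverse the tree: attribute access returns the first matching tag.
first_title = soup.header.h1.a.text

# 2) Search the tree: find_all returns every matching tag.
searched = [h1.a.text for h1 in soup.find_all("h1", class_="post-title")]

# 3) CSS selector: select takes a selector string and returns a list of tags.
selected = [a.text for a in soup.select("header > h1 > a")]

# The newlines between tags are parsed as text nodes, so in markup laid out
# one tag per line the tags end up at the odd indices of .contents.
tag_indices = [
    i for i, node in enumerate(soup.header.contents) if isinstance(node, Tag)
]

print(first_title)
print(searched)
print(selected)
print(tag_indices)
```

Searching and CSS selectors address tags by name and class, so they tend to survive layout changes better than positional access through `.contents`.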

# -*- coding: utf-8 -*-
# @Time    : 2018/11/28 17:23
# @Author  : Bo
# @Email   : [email protected]
# @File    : re_spider.py
# @Software: PyCharm
import requests
import re
from bs4 import BeautifulSoup
from lxml import etree
def get_title_re_spider():
    url = "http://www.santostang.com/"
    headers = {
        "user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0",
    }

    r = requests.get(url,headers= headers,timeout =10)
    html = r.text
    title_list = re.findall(r'<h1 class="post-title"><a href=.*?>(.*?)</a></h1>', html)
    print(title_list)

def beautifulsoup_spider():
    url = "http://www.santostang.com/"
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0",
    }

    r = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(r.text,"html.parser")
    title_list = soup.find_all("h1",class_="post-title")
    for i, heading in enumerate(title_list, start=1):
        title = heading.a.text.strip()
        print("Title of article %s: %s" % (i, title))

def beautiful_methods():
    url = "http://www.santostang.com/"
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0",
    }

    r = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(r.text, "html.parser")
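    # print(soup.prettify())  # pretty-print the parsed HTML

    # Traverse the document tree
    one_element = soup.header.h1  # get a specific tag node
    elements = soup.header.contents  # get the child nodes of the header tag
    first_element = soup.header.contents[1]  # tags sit at odd indices; whitespace between tags becomes text nodes
    # print(one_element)
    # print(elements)
    # print(first_element)
    # Get child nodes (one level down only)
    # for child in soup.body.children:
    #     print(child)
    # Get all descendant nodes
    # for child in soup.body.descendants:
    #     print(child)
    # a_tag = soup.header.div.a
    # a_parent = a_tag.parent
    # print(a_parent)
    # Search for nodes with find() and find_all()

    # CSS selectors
    print(soup.select("header h1"))  # every h1 nested anywhere inside header
    print(soup.select("header > h1"))  # only h1 tags that are direct children of header

    # CSS selectors can also be used to search the document

# Parse the page with lxml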
def lxml_spider():
    url = "http://www.santostang.com/"
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0",
    }

    r = requests.get(url, headers=headers, timeout=10)

    html = etree.HTML(r.text)
    title_list = html.xpath("/html/body/div[1]/div/div[1]/article/header/h1/a/text()")
    print(title_list)

# Hands-on project: scrape second-hand housing listings from Anjuke
def second_house_spider():
    url = "https://weihai.anjuke.com/sale/gaoqu/?from=SearchBar"

    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0",
    }
    r = requests.get(url,headers = headers,timeout = 10)

    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(r.text, "lxml")

    house_list = soup.find_all("li",class_="list-item")

    for house in house_list:
        name = house.find("div",class_="house-title").a.text.strip()
        price = house.find("span",class_="price-det").text.strip()
        price_area = house.find("span",class_="unit-price").text.strip()

        no_room = house.find("div", class_="details-item").span.text.strip()
        area = house.find("div",class_="details-item").contents[3].text
        floor = house.find("div",class_="details-item").contents[5].text

        address = house.find("span",class_="comm-address").text.strip()

        address = address.replace('\xa0\xa0\n           ', ' ')
        tag_list = house.find_all("span",class_="item-tags")
        tag = [i.text for i in tag_list]

        with open('b.txt', "a+", encoding="utf-8") as f:
            f.write(address)
        print(name)
        print(price)
        print(price_area)
        print(no_room)
        print(area)
        print(floor)
        print(address)
        print(tag)


if __name__ == "__main__":
    # get_title_re_spider()
    # beautifulsoup_spider()
    # beautiful_methods()
    lxml_spider()
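
The pattern in `get_title_re_spider` depends on non-greedy matching (`.*?`). The short sketch below, run on a made-up two-title string rather than the live page, shows what would go wrong with the greedy form; the sample HTML and the `lazy`/`greedy` names are assumptions for illustration only.

```python
import re

# Hypothetical markup: two post titles on one line, as minified HTML might be.
HTML = (
    '<h1 class="post-title"><a href="/a">First post</a></h1>'
    '<h1 class="post-title"><a href="/b">Second post</a></h1>'
)

# Non-greedy: each .*? stops at the earliest position that lets the rest match,
# so every title is captured separately.
lazy = re.findall(r'<h1 class="post-title"><a href=.*?>(.*?)</a></h1>', HTML)

# Greedy: the first .* swallows as much as it can, so the match runs from the
# first <h1> to the last </h1> and only the final title is captured.
greedy = re.findall(r'<h1 class="post-title"><a href=.*>(.*)</a></h1>', HTML)

print(lazy)    # ['First post', 'Second post']
print(greedy)  # ['Second post']
```

By default `.` does not match a newline, so on pages that put each title on its own line the greedy form may appear to work; the non-greedy form is the safer habit either way.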


Reference:

https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id27
