爬蟲入門 02 xpath庫初步使用
阿新 • • 發佈:2022-04-05
xpath概述:
1.xpath:
最常用、最高效的一種解析方式【資料解析首選】
2.如何使用?
1.例項化 etree => 載入資料
2.呼叫 api => 1.標籤定位 2.資料解析
xxx.xpath('')
=> xpath表示式:
1.標籤定位
2.資料解析
1.標籤的文字
2.標籤的屬性
3.api使用
1.標籤定位
2.資料解析
pip install lxml
xpath爬蟲案例
1.離線方式使用xpath
import requests
from lxml import etree

if __name__ == '__main__':
    # 1. Instantiate an element tree from a local (offline) HTML file.
    # Raw string fix: "\p" and "\l" in the original plain string were
    # invalid escape sequences (DeprecationWarning, SyntaxWarning on 3.12+);
    # the raw string keeps exactly the same characters.
    root = etree.parse(r"D:\ \python-sk\data\lol.html")

    # 2. Call the xpath API. XPath expressions do two jobs:
    #    1. tag location: absolute path / relative path / attribute / index
    #    2. data extraction: a tag's text and a tag's attributes

    # api 1: absolute path => returns a list of Elements
    t_info = root.xpath('/html/head/title')
    print(t_info)

    # api 2: relative path => can locate a tag starting from anywhere
    t_info1 = root.xpath('//title')
    print(t_info1)
    t_info1 = root.xpath('/html//title')
    print(t_info1)

    # api 3: attribute location => tag[@attr="value"]
    div = root.xpath('//div')
    print(div)
    div_adc = root.xpath('//div[@class="adc"]')
    print(div_adc)
    # (the original ran this exact query and print twice; once is enough)
    li_list = root.xpath('//div[@class="adc"]//ul//li')
    print(li_list)

    # api 4: index location => tag[index]; XPath indexes start at 1.
    # xpath() always returns a list; subscripting the *result* is what
    # turns it into a single Element.
    li_list1 = root.xpath('//div[@class="adc"]//ul//li[1]')
    print(li_list1)
    li_list1 = root.xpath('//div[@class="adc"]/ul/li[1]/a')
    print(li_list1)
    # li_list2 = root.xpath('//div[@class="adc"]//ul//li')[0]
    # print(li_list2)

    # Data extraction:
    # 1. tag text — text directly under the first <a>
    a_text = root.xpath('//div[@class="adc"]//li[1]/a/text()')
    print(a_text)
    # all text anywhere under the first <li>
    li1_text = root.xpath('//div[@class="adc"]//li[1]//text()')
    print(li1_text)
    # 2. tag attribute => tag/@attr
    img_info = root.xpath('//div[@class="top"]//img')
    print(img_info)
    img = root.xpath('//div[@class="top"]//img/@src')
    print(img)
2.xpath爬取二手房資訊
import requests
from lxml import etree

if __name__ == '__main__':
    # UA spoofing => simulate a browser.
    headers = {
        "User-Agent": "你的ua"
    }
    url = "https://dl.58.com/ershoufang"
    # 1. Generic crawl: fetch the second-hand-housing listing page.
    page_info = requests.get(url, headers=headers)
    # 2. Parse the response body into an element tree.
    root = etree.HTML(page_info.text)
    # 3. Locate one <div> per listing.
    div_list = root.xpath('//section[@class="list"]/div')
    print(div_list)
    # Fix: the original opened this file and never closed it (resource
    # leak); "with" guarantees the handle is flushed and closed.
    # Raw string fix: "\p" in the plain-string path was an invalid escape.
    with open(r"D:\ \python-sk\data\二手房.txt", "w", encoding="utf-8") as fp:
        for div in div_list:
            # xpath() returns a list; take [0] to get the text node itself.
            title = div.xpath('./a/div[@class="property-content"]/div[@class="property-content-detail"]/div[@class="property-content-title"]/h3/text()')[0]
            fp.write(title + "\n")
            print(title, "=>爬蟲ok")
3.爬取58同城租房資訊
'''
Exercise 1: scrape 58.com rental (chuzu) listings.
'''
import requests
from lxml import etree

if __name__ == '__main__':
    # UA spoofing => simulate a browser.
    headers = {
        "User-Agent": "你的ua"
    }
    url = "https://dl.58.com/chuzu"
    # 1. Generic crawl.
    page_info = requests.get(url=url, headers=headers)
    # 2. Parse the response body into an element tree.
    root = etree.HTML(page_info.text)
    # 3. Locate one <li> per rental listing.
    house_list = root.xpath('//div[@class="list-wrap"]/div[@class="list-box"]/ul/li')
    print(house_list)
    # Fix: "with" closes the file on every path (original leaked the
    # handle); raw string avoids the invalid "\p" escape in the path.
    with open(r"D:\ \python-sk\data\租房.txt", "w", encoding="utf-8") as fp:
        # 4. Extract each title. Some <li> carry no title, so the [0]
        # lookup can raise IndexError — catch exactly that, not
        # BaseException (which would also swallow KeyboardInterrupt
        # and SystemExit).
        for el in house_list:
            try:
                message = el.xpath('./div[@class="des"]/h2/a/text()')[0]
                # print(message + "\n")
                fp.write(message)
                print(message, "爬取ok")
            except IndexError as e:
                print(e)
4.爬取58同城二手車【坑,慎入,別爬】
'''
Exercise 2: scrape 58.com used cars (ershouche).

Pitfall — do not actually run this: the listing price is rendered by a
UI/font component rather than plain text, and 58 blocks the crawler after
roughly one request, returning empty data afterwards. Kept for reading only.
'''
import requests
from lxml import etree

if __name__ == '__main__':
    # UA spoofing => simulate a browser.
    headers = {
        "User-Agent": "你的ua"
    }
    url = "https://dl.58.com/ershouche"
    # 1. Generic crawl.
    page_info = requests.get(url=url, headers=headers)
    # 2. Parse the response body into an element tree.
    root = etree.HTML(page_info.text)
    # 3. Locate one <li> per car listing (empty once 58 starts blocking).
    car_list = root.xpath('//div[@class="list-wrap"]/ul/li')
    print(car_list)
    # 4. Per-item extraction intentionally left out: the price markup is
    # not plain text, so there is nothing reliable to parse here.
    # (The original's duplicated query and dead commented-out loop were
    # removed.)
5.爬取圖片
import requests
from lxml import etree

if __name__ == '__main__':
    # UA spoofing => simulate a browser.
    headers = {
        "User-Agent": "你的ua"
    }
    url = "https://pic.netbian.com/4kmeinv/"
    # url = "https://pic.netbian.com/4kdongman"
    # 1. Generic crawl of the gallery page.
    page_info = requests.get(url=url, headers=headers)
    # The site serves GBK, not the encoding requests guesses by default.
    page_info.encoding = "gbk"
    print(page_info)
    # 2. Parse the response body into an element tree.
    root = etree.HTML(page_info.text)
    # 3. Locate one <li> per thumbnail.
    li_list = root.xpath('//div[@class="slist"]/ul/li')
    for el in li_list:
        # @src is site-relative, so prefix the host.
        img_url = "https://pic.netbian.com" + el.xpath('./a/img/@src')[0]
        img_title = el.xpath('./a/img/@alt')[0]
        print(img_url)
        # 1. Generic crawl again: fetch the binary image payload.
        # (typo fix: local name was "img_reponse")
        img_response = requests.get(url=img_url, headers=headers)
        img_data = img_response.content
        # Raw f-string fix: "\p" in the plain f-string path was an
        # invalid escape sequence; rf"..." keeps the same characters.
        with open(rf"D:\ \python-sk\data\img\{img_title}.jpg", "wb") as fp:
            fp.write(img_data)
            print(img_title, "爬取ok")
6.爬取城市資訊並去重
import requests
from lxml import etree

if __name__ == '__main__':
    # UA spoofing => simulate a browser.
    headers = {
        "User-Agent": "你的ua"
    }
    url = "http://www.air-level.com/"
    # 1. Generic crawl.
    page_info = requests.get(url=url, headers=headers)
    # 2. Parse the response body into an element tree.
    root = etree.HTML(page_info.text)
    # 3. Locate one <a> per city.
    a_list = root.xpath('//div[@id="citylist"]/div[@class="citynames"]/a')
    # print(a_list)
    # Fix: "with" closes the file (original leaked the handle); raw string
    # avoids the invalid "\c" escape in the plain-string path.
    with open(r"D:\ \python-sk\data\cityname.txt", "w", encoding="utf-8") as fp:
        # De-dup method 1: a set. (Original built an empty set with the
        # {""} + pop() trick; set() expresses the same thing directly —
        # {} would create a dict.)
        s1 = set()
        for a in a_list:
            a_text = a.xpath('./text()')[0]
            s1.add(a_text)
        fp.write(str(s1))
        # De-dup method 2: one accumulator string plus membership tests.
        # NOTE(review): "in" on a string is a *substring* test, so a city
        # whose name is contained inside an already-written name would be
        # wrongly skipped — the set approach above is the reliable one.
        st = ""
        for a in a_list:
            a_text = a.xpath('./text()')[0]
            print(a_text)
            if a_text not in st:  # "(...) == True" was redundant
                st = st + a_text
                fp.write(a_text)
                print("寫入一個字串")
            else:
                print("該字串已存在")
    print("successful")

# Homework: cityname de-dup
#   1. drop the "key cities" section first, or
#   2. scrape everything, then de-dup [recommended]