XPath scraping examples and the selenium module

阿新 • Published: 2021-09-28

Scraping city names

Complete code

import requests
from lxml import etree

res = requests.get(
    "https://www.aqistudy.cn/historydata/",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
    },
)
tree = etree.HTML(res.text)

# First get the hot cities
hot_city_names = tree.xpath('//div[@class="hot"]/div[2]/ul/li/a/text()')

# Then get the other cities
other_city_names = tree.xpath('//div[@class="all"]/div[2]/ul/div[2]/li/a/text()')

# Query all cities in one go by combining both paths with the union operator |
all_city_names = tree.xpath(
    '//div[@class="hot"]/div[2]/ul/li/a/text() | //div[@class="all"]/div[2]/ul/div[2]/li/a/text()')
print(all_city_names)


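Summary

1. With XPath selectors you can start by descending one level at a time; there is no need to skip levels.
2. For path queries like the ones above, the parsed result is always a list (an empty one if nothing matches).
3. The pattern for writing an XPath is: first identify the tag you need, then look a few levels up for a parent tag with a distinctive feature (such as a class), and query downward from that parent level by level.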
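The three points above can be tried on a small inline HTML snippet. This is only a minimal sketch: the markup and the city names are made up for illustration and merely mimic the hot/all layout that the queries above assume.

```python
from lxml import etree

# Hypothetical markup that only mimics the hot/all layout queried above
html = """
<div class="hot"><ul><li><a>Beijing</a></li><li><a>Shanghai</a></li></ul></div>
<div class="all"><ul><li><a>Anshan</a></li></ul></div>
"""
tree = etree.HTML(html)

# A path query returns a list, even when there is a single match
hot = tree.xpath('//div[@class="hot"]/ul/li/a/text()')

# A query that matches nothing returns an empty list rather than raising
missing = tree.xpath('//div[@class="no-such-class"]/ul/li/a/text()')

# The union operator | merges the results of two paths
both = tree.xpath(
    '//div[@class="hot"]/ul/li/a/text() | //div[@class="all"]/ul/li/a/text()')

# Start from a distinctive parent, then continue with a relative path (./)
hot_div = tree.xpath('//div[@class="hot"]')[0]
relative = hot_div.xpath('./ul/li/a/text()')

print(hot, missing, both, relative)
```

Because every result is a list, check that it is non-empty before indexing into it.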

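Scraping Zhubajie (豬八戒, zbj.com) data and writing it to a spreadsheet

Approach

1. Send a GET request to the page
2. Build an lxml tree object to run XPath queries against
3. Study the tag structure and filter with XPath
4. First find all the divs that contain data, then loop over them
5. Use XPath on each div to extract the required fields
6. Import the openpyxl module and create a workbook
7. Define the header row, write the data, and finally save the file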

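Detailed data-extraction steps

The fragments below run inside the loop over the divs (for div in div_list).

# Company name
company_name = div.xpath('./div/div/a/div[1]/p/text()')

# Workaround: skip any div in which no company name was found
if not company_name:
    continue

# Company address
address_info = div.xpath('./div/div/a/div[1]/div/span/text()')

# Order price
order_price = div.xpath('./div/div/a[2]/div[2]/div[1]/span[1]/text()')

# Past transactions
order_num = div.xpath('./div/div/a[2]/div[2]/div[1]/span[2]/text()')

# Order description
order_desc = div.xpath('./div/div/a[2]/div[2]/div[2]/p/text()')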

Complete code

import requests
from lxml import etree
from openpyxl import Workbook

wb = Workbook()
wb1 = wb.create_sheet('Order data', 0)
# Header row: company name, company address, order price, past transactions, order description
wb1.append(['Company name', 'Company address', 'Order price', 'Past transactions', 'Order description'])

# 1. Send the request and get the page data
res = requests.get('https://shanghai.zbj.com/search/f/',
                   params={'kw': 'app'}
                   )
# 2. Build the tree object for XPath queries
tree = etree.HTML(res.text)
# 3. Study the tag structure and write the XPath expressions
div_list = tree.xpath('//div[@class="new-service-wrap"]/div')
for div in div_list:
    # Company name; skip divs in which none was found
    company_name = div.xpath('./div/div/a/div[1]/p/text()')
    if not company_name:
        continue
    # Company address
    address_info = div.xpath('./div/div/a/div[1]/div/span/text()')
    # Order price
    order_price = div.xpath('./div/div/a[2]/div[2]/div[1]/span[1]/text()')
    # Past transactions
    order_num = div.xpath('./div/div/a[2]/div[2]/div[1]/span[2]/text()')
    # Order description, returned as several text fragments
    order_desc = div.xpath('./div/div/a[2]/div[2]/div[2]/p/text()')

    # Rejoin the description fragments with the search keyword 'app'
    wb1.append([company_name[-1].strip('\n'), address_info[0], order_price[0], order_num[0], 'app'.join(order_desc)])

wb.save(r'order_data.xlsx')
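The complete code indexes each result with [0], which raises IndexError whenever a listing lacks one of the fields. Below is a minimal sketch of a guard, assuming a placeholder value is acceptable in the spreadsheet cell. first_text is a hypothetical helper name, not part of lxml.

```python
def first_text(nodes, default=""):
    """Return the first XPath text result, stripped, or default if the list is empty."""
    return nodes[0].strip() if nodes else default


# xpath() text results behave like lists of strings, so plain lists stand in here
found = first_text(["\n Shanghai \n"])
absent = first_text([], default="N/A")
print(found, absent)
```

With such a helper, a row could be built as first_text(address_info), first_text(order_price) and so on, without the loop stopping on an incomplete listing.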


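Scraping images from Baidu Tieba

Requirements

The Tieba (forum) name is chosen by the user rather than hard-coded
While writing the program you can start with a fixed name and switch to user input afterwards
All images must be saved automatically into a folder named after the Tieba
Image scraping should support multiple pages

Approach

1. Send a GET request to the page
2. Build an lxml tree object for XPath queries
3. Find the link addresses of all threads
4. Loop over the thread links and join each one into a full URL
5. Request each thread's detail page and get its page data
6. Filter out the image URLs
7. Loop over the image URLs, request each one, and save the image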

How to search for the user's own keyword

Comparing the URLs shows that the key is the kw parameter, which carries the URL-encoded Tieba name:
# https://tieba.baidu.com/f?ie=utf-8&kw=%E4%B8%83%E9%BE%99%E7%8F%A0&fr=search
# https://tieba.baidu.com/f?ie=utf-8&kw=%E7%BE%8E%E5%A5%B3&fr=search
# https://tieba.baidu.com/f?ie=utf-8&kw=%E8%BE%A3%E5%A6%B9&fr=search
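The kw value in those URLs is just the Tieba name percent-encoded as UTF-8. A short sketch with the standard library reproduces the first URL above from the name 七龙珠; passing params= to requests.get performs this encoding for you, so this is only to show where the %E4... sequences come from.

```python
from urllib.parse import quote, urlencode

tieba_name = "七龙珠"

# Percent-encode the name on its own: each UTF-8 byte becomes %XX
encoded = quote(tieba_name)

# Or build the whole query string from a dict of parameters
query = urlencode({"ie": "utf-8", "kw": tieba_name, "fr": "search"})
url = "https://tieba.baidu.com/f?" + query

print(encoded)
print(url)
```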

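Detailed steps

1. Find the links of all threads

# @class is compared as an exact string, so the trailing space in "j_th_tit " is deliberate
a_link_list = tree.xpath('//a[@class="j_th_tit "]/@href')

2. Loop over the thread links, join each into a full URL, then send a request

base_url = 'https://tieba.baidu.com'
for link in a_link_list:
    full_link = base_url + link
    res1 = requests.get(full_link)
    tree1 = etree.HTML(res1.text)

3. Filter out the image URLs

img_src_list = tree1.xpath('//img[@class="BDE_Image"]/@src')

4. Loop over the image URLs, request each one, and save the image. The complete code is as follows: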

import requests
from lxml import etree
import os
import time

tieba_name = input('Enter the name of the Tieba you want to scrape>>>:').strip()

# Create the folder named after the Tieba if it does not exist yet
if not os.path.exists(tieba_name):
    os.mkdir(tieba_name)

res = requests.get('https://tieba.baidu.com/f',
                   params={'kw': tieba_name}
                   )
tree = etree.HTML(res.text)

# Collect the relative links of all threads on the list page
a_link_list = tree.xpath('//a[@class="j_th_tit "]/@href')
base_url = 'https://tieba.baidu.com'
for link in a_link_list:
    full_link = base_url + link
    # Request the thread's detail page and parse it
    res1 = requests.get(full_link)
    tree1 = etree.HTML(res1.text)

    # Collect the image URLs in the thread
    img_src_list = tree1.xpath('//img[@class="BDE_Image"]/@src')
    for img_src in img_src_list:
        res2 = requests.get(img_src)
        # Use the last 15 characters of the URL as the file name
        file_path = os.path.join(tieba_name, img_src[-15:])
        with open(file_path, 'wb') as f:
            f.write(res2.content)
    # Pause between threads to avoid hammering the server
    time.sleep(1)
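Two details of the code above are a little fragile: concatenating base_url + link assumes every href is relative, and slicing the last 15 characters assumes the URL ends with the file name. Below is a sketch of a sturdier alternative using only the standard library. The thread path and the image URL are hypothetical examples, and image_filename is an illustrative helper name.

```python
import os
from urllib.parse import urljoin, urlparse

base_url = "https://tieba.baidu.com"

# urljoin copes with relative hrefs and with hrefs that are already absolute
full_link = urljoin(base_url, "/p/123456")  # hypothetical thread path


def image_filename(img_src):
    """Use the last path segment of the URL, ignoring any query string."""
    return os.path.basename(urlparse(img_src).path)


# Hypothetical image URL with a query string after the file name
name = image_filename("https://example.com/forum/pic/item/abc123.jpg?size=large")
print(full_link, name)
```

Taking the whole last path segment also avoids cutting a long file name in half, which a fixed-length slice can do.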


5. Multi-page scraping

Comparing the URLs shows that the page is controlled by the pn parameter, which grows by 50 for each page:
# https://tieba.baidu.com/f?kw=%E4%B8%83%E9%BE%99%E7%8F%A0&ie=utf-8&pn=50
# https://tieba.baidu.com/f?kw=%E4%B8%83%E9%BE%99%E7%8F%A0&ie=utf-8&pn=100
# https://tieba.baidu.com/f?kw=%E4%B8%83%E9%BE%99%E7%8F%A0&ie=utf-8&pn=150
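The mapping from a page number to pn can be sketched as follows, assuming page 1 corresponds to pn=0 so that the URLs above are pages 2 to 4. The function names are illustrative only.

```python
POSTS_PER_PAGE = 50  # step between the pn values in the URLs above


def page_to_pn(page):
    """Map a 1-based page number to the pn offset."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * POSTS_PER_PAGE


def pn_values(first_page, last_page):
    """Return the pn offsets for an inclusive range of pages."""
    return [page_to_pn(p) for p in range(first_page, last_page + 1)]


offsets = pn_values(1, 4)
print(offsets)
```

Looping over such a list and passing each value as 'pn' in params would fetch several pages in one run instead of a single page.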

Complete code

import requests
from lxml import etree
import os
import time

name = input('Enter the name of the Tieba to scrape>>>:').strip()
if not os.path.exists(name):
    os.mkdir(name)
page_str = input('Enter the page number to scrape>>>:').strip()
page = (int(page_str) - 1) * 50  # page 1 -> pn=0, page 2 -> pn=50, ...
res = requests.get('https://tieba.baidu.com/f',
                   params={'kw': name,
                           'pn': page}
                   )  # the extra parameters select the Tieba and the page
# print(res.text)  # uncomment to inspect the raw HTML
tree = etree.HTML(res.text)
second_link_list = tree.xpath('//a[@class="j_th_tit "]/@href')  # relative links of all threads
base_url = 'https://tieba.baidu.com'
for link in second_link_list:
    url = base_url + link  # join each thread link into a full URL
    res1 = requests.get(url)  # request the thread page
    tree1 = etree.HTML(res1.text)
    img_link_list = tree1.xpath('//img[@class="BDE_Image"]/@src')  # image URLs in the thread
    for img_link in img_link_list:
        res2 = requests.get(img_link)  # request the image
        # Name the file from the end of the URL so the .jpg extension is included
        file_path = os.path.join(name, img_link[-10:])
        with open(file_path, 'wb') as f:
            f.write(res2.content)  # write the image bytes to disk
        time.sleep(1)  # pause between downloads, after the file has been closed


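The selenium module

It was originally just a tool from the software testing field.
Because it can drive a browser, it has gradually been adopted for web scraping as well.
It can get around many anti-scraping measures, but because it has to control a browser it is comparatively slow.

Installing the module

pip3 install selenium

Downloading the driver

This module drives a browser, so it needs the matching driver program

Notes

# 1. Sometimes the browser still cannot be controlled after the driver is downloaded
        Cause: the driver version probably does not match the browser
        Fix: download a different version of the driver
# 2. Different browsers need different driver files

Where to put the driver file

Placing it in the Scripts folder of the Python interpreter is enough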
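The Scripts folder works because it is normally on the PATH, which is where the driver executable is looked up. That is an assumption about a typical Windows installation with Python added to PATH. Below is a quick sketch for checking whether a driver can be found, assuming Chrome's driver with its usual executable name chromedriver.

```python
import shutil

# "chromedriver" is the usual executable name for Chrome's driver;
# on Windows, shutil.which also tries the PATHEXT extensions such as .exe
driver_path = shutil.which("chromedriver")

if driver_path is None:
    print("chromedriver was not found on PATH")
else:
    print("chromedriver found at", driver_path)
```

If the driver is reported as missing, either move it into a folder that is on PATH or add its folder to PATH.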

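Verification

Complete code

from selenium import webdriver
import time

# Create the browser driver to control (Chrome here)
bro = webdriver.Chrome()
# Have the browser open the Bilibili site
bro.get("https://www.bilibili.com/")

# Wait a few seconds, then close the browser window
time.sleep(3)
bro.close()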


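Basic operations

1. find_element_by_id   find by id
# div_tag = bro.find_element_by_id('s-top-left')
2. find_element_by_link_text     find by link text (the exact text of an a tag)
# a_tag = bro.find_element_by_link_text("新聞")
3. find_element_by_partial_link_text   find by link text (the text of an a tag), partial match
4. find_element_by_tag_name       find by tag name
5. find_element_by_class_name     find by class name
6. find_element_by_name           find by the name attribute
7. find_element_by_css_selector   find by CSS selector
8. find_element_by_xpath          find by XPath

Conclusion

The difference between find_element and find_elements is that the former returns only the first match, while the latter returns all matches.
The relationship between the two is like that between find and find_all in the bs4 module.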

A small example

from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # keyboard key constants
import time

bro = webdriver.Chrome()
# Open Taobao
bro.get('https://www.taobao.com/')
# Find the search box element
input_tag = bro.find_element_by_id('q')
# Type the search text
input_tag.send_keys('iphone手機')
time.sleep(1)
# Press Enter to run the search
input_tag.send_keys(Keys.ENTER)
time.sleep(5)
# Close the browser
bro.close()
