Python Crawler: Parsing HTML with bs4 and XPath

阿新 • Published: 2018-12-20

I felt the original author's XPath parsing was not very well done, so below is my reworked version.

1. Parsing with the lxml module

#!/usr/bin/env python
#-*- coding:utf-8 -*-

import requests
import time,os
from lxml import etree

def get_Page(url,headers):
    response = requests.get(url,headers=headers)
    if response.status_code == 200:
        return response.text
    return None

def parse_Page(html,headers):
    html_lxml = etree.HTML(html)

    # In XPath, "|" unions several paths into a single result set
    datas = html_lxml.xpath('.//div[@class="captcha_images_left"]|.//div[@class="captcha_images_right"]')
    item = {}
    # Create the folder that stores the captcha images, then switch into it
    file = 'qcode'
    os.makedirs(file, exist_ok=True)
    os.chdir(file)

    for data in datas:
        name = data.xpath('.//h3')           # captcha names, returned as a list of elements
        src = data.xpath('.//div/img/@src')  # captcha image links, returned as a list of strings

        # Pair each name with its link; zip() stops at the shorter list
        for h3, link in zip(name, src):
            filename = h3.text.strip() + '.jpg'    # captcha image file name
            img_url = 'https://captcha.com/' + link
            item[filename] = img_url

    # Download only after every block has been collected, so each image is fetched once
    count = 0
    for imgname, imgurl in item.items():
        response = requests.get(imgurl, headers=headers)
        if response.status_code == 200:
            image = response.content     # raw image bytes
            with open(imgname,'wb') as f:
                f.write(image)
            count += 1
            print('Saved captcha #{}'.format(count))
            time.sleep(1)


def main():
    url = 'https://captcha.com/captcha-examples.html?cst=corg'
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
    html = get_Page(url,headers)
    parse_Page(html,headers)


if __name__ == '__main__':
    main()
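
The script above builds each image URL by prepending 'https://captcha.com/' to the src attribute. That works when src is a plain relative path, but it may produce a doubled slash or a broken URL if src is root-relative or already absolute. A minimal sketch of a more tolerant approach, using urljoin from the standard library; the helper name resolve_image_url and the sample src values are hypothetical and not taken from the page:

```python
from urllib.parse import urljoin

# The page the crawler starts from (same URL as in the script above).
PAGE_URL = 'https://captcha.com/captcha-examples.html?cst=corg'

def resolve_image_url(src, page_url=PAGE_URL):
    """Resolve an img src against the page URL (hypothetical helper)."""
    return urljoin(page_url, src)

# Hypothetical src values covering the three common forms.
samples = [
    'images/help/a.jpg',              # relative to the page
    '/images/help/b.jpg',             # relative to the site root
    'https://cdn.example.com/c.jpg',  # already absolute
]
for src in samples:
    print(resolve_image_url(src))
# -> https://captcha.com/images/help/a.jpg
# -> https://captcha.com/images/help/b.jpg
# -> https://cdn.example.com/c.jpg
```

In parse_Page, the concatenation could then be swapped for resolve_image_url(link), assuming the page URL is passed in or kept as a constant.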

2. Parsing with the bs4 module

#!/usr/bin/env python
#-*- coding:utf-8 -*-

import requests
import time,os
from bs4 import BeautifulSoup

def get_Page(url,headers):
    response = requests.get(url,headers=headers)
    if response.status_code == 200:
        return response.text
    return None

def parse_Page(html,headers):
    soup = BeautifulSoup(html, 'html.parser')
    data_left = soup.select('#main .captcha_images_left')
    data_right = soup.select('#main .captcha_images_right')
    data = {}

    # Create the folder that stores the captcha images, then switch into it
    file = 'qcode'
    os.makedirs(file, exist_ok=True)
    os.chdir(file)

    # Walk every left-hand block, then every right-hand block
    for block in list(data_left) + list(data_right):
        names = block.select('h3')    # all h3 tags in this block, returned as a list
        images = block.select('img')  # all img tags in this block, returned as a list
        for tag_h,tag_img in zip(names,images):
            # tag_h is one h3 tag; tag_img is the img tag paired with it
            data[str(tag_h.text).strip() + '.jpg'] = 'https://captcha.com/' + tag_img['src']

    for imgname,imgurl in data.items():
        response = requests.get(imgurl, headers=headers)
        if response.status_code == 200:
            # Save the bytes from this response rather than requesting the image a second time
            with open(imgname,'wb') as f:
                f.write(response.content)
            time.sleep(1)

if __name__ == '__main__':
    url = 'https://captcha.com/captcha-examples.html?cst=corg'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
    html = get_Page(url, headers)
    parse_Page(html, headers)
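
Both scripts above switch the working directory into qcode with os.chdir and use the h3 text directly as a file name. A minimal sketch of an alternative that leaves the working directory alone and guards against characters that are not allowed in file names; safe_filename and save_bytes are hypothetical helpers introduced here for illustration, and the set of replaced characters is an assumption based on common Windows restrictions:

```python
import os
import re

def safe_filename(title, ext='.jpg'):
    """Turn a heading text into a file name (hypothetical helper)."""
    # Replace characters commonly rejected by file systems with "_".
    name = re.sub(r'[\\/:*?"<>|]+', '_', title.strip())
    return (name or 'unnamed') + ext

def save_bytes(folder, title, content):
    """Write content to folder/<title>.jpg and return the path written."""
    # exist_ok=True makes this a no-op when the folder already exists,
    # so no separate os.path.exists() check is needed.
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, safe_filename(title))
    with open(path, 'wb') as f:
        f.write(content)
    return path
```

With these helpers, the download loop could call save_bytes('qcode', tag_h.text, response.content) in place of the open() call, and the os.chdir step would no longer be needed.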

An improved multithreaded crawler; the code below is for Python 3.

#!/usr/bin/env python
#-*- coding:utf-8 -*-
# author:Mr Yang

import requests
import time,os
from bs4 import BeautifulSoup
import threading, queue

def get_Page(url,headers):
    response = requests.get(url,headers=headers)
    if response.status_code == 200:
        return response.text
    return None

def parse_Page(html,urlQueue):
    soup = BeautifulSoup(html, 'html.parser')
    data_left = soup.select('#main .captcha_images_left')
    data_right = soup.select('#main .captcha_images_right')

    # Create the folder that stores the captcha images, then switch into it
    file = 'qcode'
    os.makedirs(file, exist_ok=True)
    os.chdir(file)

    # Walk every left-hand block, then every right-hand block
    for block in list(data_left) + list(data_right):
        names = block.select('h3')    # all h3 tags in this block, returned as a list
        images = block.select('img')  # all img tags in this block, returned as a list
        for tag_h,tag_img in zip(names,images):
            # tag_h is one h3 tag; tag_img is the img tag paired with it
            urlQueue.put({str(tag_h.text).strip() + '.jpg':'https://captcha.com/' + tag_img['src']})

def download_img(urlQueue,headers):
    while True:
        try:
            data = urlQueue.get_nowait()  # non-blocking read from the queue
        except queue.Empty:
            break  # queue drained, so this worker is done

        for imgname,imgurl in data.items():
            response = requests.get(imgurl, headers=headers)
            if response.status_code == 200:
                # Save the bytes from this response rather than requesting the image a second time
                with open(imgname,'wb') as f:
                    f.write(response.content)
                time.sleep(1)

if __name__ == '__main__':
    url = 'https://captcha.com/captcha-examples.html?cst=corg'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}

    urlQueue = queue.Queue()
    html = get_Page(url,headers)
    parse_Page(html,urlQueue)

    threadNum = 7
    threads = []
    for i in range(threadNum):
        t = threading.Thread(target=download_img,args=(urlQueue,headers))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()  # wait for every download thread to finish
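
The multithreaded script drains a queue.Queue with get_nowait() from several threads. A minimal sketch of that same worker pattern with the network call swapped for a stand-in, so it can be run and checked offline; run_pool, worker, and fake_download are hypothetical names, and fake_download only imitates a download by returning the URL length:

```python
import queue
import threading

def worker(task_queue, handle, results, lock):
    """Pull (name, url) tasks until the queue is empty."""
    while True:
        try:
            name, url = task_queue.get_nowait()  # non-blocking read
        except queue.Empty:
            return  # nothing left, so this worker exits
        outcome = handle(name, url)
        with lock:  # protect the shared results list
            results.append(outcome)

def run_pool(tasks, handle, thread_count=7):
    """Fill a queue with tasks, run the workers, and wait for all of them."""
    task_queue = queue.Queue()
    for task in tasks:
        task_queue.put(task)

    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(task_queue, handle, results, lock))
        for _ in range(thread_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until every worker has finished
    return results

def fake_download(name, url):
    """Stand-in for the real request (assumption): no network access."""
    return name, len(url)
```

Catching queue.Empty specifically, rather than a bare Exception, keeps real errors inside the handler from being mistaken for an empty queue, and joining the threads lets the caller know when all downloads are complete. The queue must be fully populated before the workers start, as it is here, because a worker exits as soon as it sees the queue empty.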
