Several Ways to Parse HTML in Python

阿新 • Published: 2018-12-28

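Parsing HTML is an important data-processing step that follows crawling. Below are several ways to parse HTML.
First, a basic helper function that fetches the HTML and returns the parsed result.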

import gzip
import StringIO
import urllib2


# The parser function is passed in as an argument so it is easy to swap below
def get_html(url, paraser=None):
    # Resolve the default at call time, because bs4_paraser is defined further down
    if paraser is None:
        paraser = bs4_paraser
    headers = {
        'Accept': '*/*',
        # Ask only for gzip, the one encoding this helper can decode
        'Accept-Encoding': 'gzip',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Host': 'www.360kan.com',
        'Proxy-Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
    }
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)

    if response.code == 200:
        data = response.read()
        # Decompress only when the server actually sent a gzip body
        if response.info().get('Content-Encoding') == 'gzip':
            data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
        return paraser(data)
    # Return an empty list so the caller's loop still works
    return []


value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
for row in value:
    print row
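
The decompression step in the helper can be isolated and checked offline. The following is a minimal sketch for Python 3, where `urllib2` and `StringIO` became `urllib.request` and `io`. The helper name `decode_body` and the sample payload are made up for illustration and are not part of the original code.

```python
import gzip
import io


def decode_body(raw, content_encoding=None):
    """Return the response body as bytes, gunzipping only when needed."""
    if content_encoding == 'gzip':
        # GzipFile reads from any file-like object, so wrap the bytes
        return gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    return raw


# Simulate a gzip-compressed response body without touching the network
sample = b'<div class="item">review</div>'
compressed = gzip.compress(sample)

print(decode_body(compressed, 'gzip') == sample)
print(decode_body(sample) == sample)
```

Checking the `Content-Encoding` value before decompressing matters because a server is free to ignore `Accept-Encoding` and send plain bytes, in which case `GzipFile` would fail on the read.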

1. Parsing with lxml (here via etree.HTML and XPath)
The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.5. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ.
[Official site](http://lxml.de/)

import re

from lxml import etree


def lxml_parser(page):
    data = []
    doc = etree.HTML(page)
    all_div = doc.xpath('//div[@class="yingping-list-wrap"]')
    for row in all_div:
        # Get every review, i.e. each review item
        all_div_item = row.xpath('.//div[@class="item"]')  # find_all('div', attrs={'class': 'item'})
        for r in all_div_item:
            value = {}
            # Title block of the review
            title = r.xpath('.//div[@class="g-clear title-wrap"][1]')
            if not title:
                continue  # skip items that have no title block
            value['title'] = title[0].xpath('./a/text()')[0]
            value['title_href'] = title[0].xpath('./a/@href')[0]
            score_text = title[0].xpath('./div/span/span/@style')[0]
            score_text = re.search(r'\d+', score_text).group()
            value['score'] = int(score_text) / 20
            # Time
            value['time'] = title[0].xpath('./div/span[@class="time"]/text()')[0]
            # Number of people who liked it
            value['people'] = int(
                    re.search(r'\d+', title[0].xpath('./div[@class="num"]/span/text()')[0]).group())
            data.append(value)
    return data
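
The score is not stored as text on the page; the parser pulls a number out of a `style` attribute and divides it by 20. As a small illustration, assuming the attribute looks like `width:80%` and that 100% stands for five stars (both are assumptions inferred from the code, not confirmed by the page), the conversion can be sketched as below. The function name `style_to_score` is hypothetical.

```python
import re


def style_to_score(style):
    """Convert a style string such as 'width:80%' to a 0-5 star score."""
    match = re.search(r'\d+', style)
    if match is None:
        return None  # no number found in the style attribute
    # Assumed scale: 100% width is 5 stars, so each star is 20%
    return int(match.group()) // 20


print(style_to_score('width:80%'))   # 4
print(style_to_score('width:90%'))   # 4
print(style_to_score('color:red'))   # None
```

The sketch uses `//` so that it truncates on Python 3 the same way `/` between two integers does on Python 2, which is what the parser code in this article relies on.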

2. Using BeautifulSoup. I won't go into detail here; instead I recommend an article that explains it very well:
[Usage walkthrough](http://www.bkjia.com/Pythonjc/992499.html)

import re

from bs4 import BeautifulSoup


def bs4_paraser(html):
    all_value = []
    soup = BeautifulSoup(html, 'html.parser')
    # Get the review section
    all_div = soup.find_all('div', attrs={'class': 'yingping-list-wrap'}, limit=1)
    for row in all_div:
        # Get every review, i.e. each review item
        all_div_item = row.find_all('div', attrs={'class': 'item'})
        for r in all_div_item:
            value = {}  # a fresh dict for every review
            # Title block of the review
            title = r.find_all('div', attrs={'class': 'g-clear title-wrap'}, limit=1)
            if not title:
                continue  # skip items that have no title block
            value['title'] = title[0].a.string
            value['title_href'] = title[0].a['href']
            score_text = title[0].div.span.span['style']
            score_text = re.search(r'\d+', score_text).group()
            value['score'] = int(score_text) / 20
            # Time
            value['time'] = title[0].div.find_all('span', attrs={'class': 'time'})[0].string
            # Number of people who liked it
            value['people'] = int(
                    re.search(r'\d+', title[0].find_all('div', attrs={'class': 'num'})[0].span.string).group())
            all_value.append(value)
    return all_value

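3. Using SGMLParser. It works through start-tag and end-tag callbacks, so the parsing process is easy to follow, but it is somewhat tedious, and this particular scenario is not a great fit for the approach (haha).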

import re
from sgmllib import SGMLParser  # Python 2 only; sgmllib was removed in Python 3


class CommentParaser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.__start_div_yingping = False
        self.__start_div_item = False
        self.__start_div_gclear = False
        self.__start_div_ratingwrap = False
        self.__start_div_num = False
        # a
        self.__start_a = False
        # span state (0 means not inside a tracked span)
        self.__span_state = 0
        # data
        self.__value = {}
        self.data = []

    def start_div(self, attrs):
        for k, v in attrs:
            if k == 'class' and v == 'yingping-list-wrap':
                self.__start_div_yingping = True
            elif k == 'class' and v == 'item':
                self.__start_div_item = True
            elif k == 'class' and v == 'g-clear title-wrap':
                self.__start_div_gclear = True
            elif k == 'class' and v == 'rating-wrap g-clear':
                self.__start_div_ratingwrap = True
            elif k == 'class' and v == 'num':
                self.__start_div_num = True

    def end_div(self):
        if self.__start_div_yingping:
            if self.__start_div_item:
                if self.__start_div_gclear:
                    if self.__start_div_num or self.__start_div_ratingwrap:
                        if self.__start_div_num:
                            self.__start_div_num = False
                        if self.__start_div_ratingwrap:
                            self.__start_div_ratingwrap = False
                    else:
                        self.__start_div_gclear = False
                else:
                    self.data.append(self.__value)
                    self.__value = {}
                    self.__start_div_item = False
            else:
                self.__start_div_yingping = False

    def start_a(self, attrs):
        if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
            self.__start_a = True
            for k, v in attrs:
                if k == 'href':
                    self.__value['href'] = v

    def end_a(self):
        if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
            self.__start_a = False

    def start_span(self, attrs):
        if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
            if self.__start_div_ratingwrap:
                if self.__span_state != 1:
                    for k, v in attrs:
                        if k == 'class' and v == 'rating':
                            self.__span_state = 1
                        elif k == 'class' and v == 'time':
                            self.__span_state = 2
                else:
                    for k, v in attrs:
                        if k == 'style':
                            # Set the score only when a style attribute is present
                            score_text = re.search(r'\d+', v).group()
                            self.__value['score'] = int(score_text) / 20
                    self.__span_state = 3
            elif self.__start_div_num:
                self.__span_state = 4

    def end_span(self):
        self.__span_state = 0

    def handle_data(self, data):
        if self.__start_a:
            self.__value['title'] = data
        elif self.__span_state == 2:
            self.__value['time'] = data
        elif self.__span_state == 4:
            score_text = re.search(r'\d+', data).group()
            self.__value['people'] = int(score_text)


def sgl_parser(html):
    parser = CommentParaser()
    parser.feed(html)
    return parser.data
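
To see what start/end tag parsing means in practice, it helps to log the callbacks in the order they fire. `sgmllib` is not available on Python 3, so this sketch swaps in the standard library's `html.parser.HTMLParser`, whose `handle_starttag`, `handle_endtag` and `handle_data` callbacks play the same role as `start_div`, `end_div` and `handle_data` above. The class name `EventLogger` and the HTML fragment are invented for the example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every parser callback as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text nodes
            self.events.append(('data', data.strip()))


logger = EventLogger()
logger.feed('<div class="item"><a href="/r/1">Great film</a></div>')
logger.close()
for event in logger.events:
    print(event)
```

Because the parser reports only one event at a time and keeps no tree, the handler class has to remember where it is in the document, which is what flags such as `__start_div_item` are for.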

4. HTMLParser. The principle is similar to approach 3; only the methods you override are different, and most of the logic can be shared.

import re
import HTMLParser  # Python 2 module name; it is html.parser in Python 3


class CommentHTMLParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self.__start_div_yingping = False
        self.__start_div_item = False
        self.__start_div_gclear = False
        self.__start_div_ratingwrap = False
        self.__start_div_num = False
        # a
        self.__start_a = False
        # span state (0 means not inside a tracked span)
        self.__span_state = 0
        # data
        self.__value = {}
        self.data = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            for k, v in attrs:
                if k == 'class' and v == 'yingping-list-wrap':
                    self.__start_div_yingping = True
                elif k == 'class' and v == 'item':
                    self.__start_div_item = True
                elif k == 'class' and v == 'g-clear title-wrap':
                    self.__start_div_gclear = True
                elif k == 'class' and v == 'rating-wrap g-clear':
                    self.__start_div_ratingwrap = True
                elif k == 'class' and v == 'num':
                    self.__start_div_num = True
        elif tag == 'a':
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                self.__start_a = True
                for k, v in attrs:
                    if k == 'href':
                        self.__value['href'] = v
        elif tag == 'span':
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
                if self.__start_div_ratingwrap:
                    if self.__span_state != 1:
                        for k, v in attrs:
                            if k == 'class' and v == 'rating':
                                self.__span_state = 1
                            elif k == 'class' and v == 'time':
                                self.__span_state = 2
                    else:
                        for k, v in attrs:
                            if k == 'style':
                                # Set the score only when a style attribute is present
                                score_text = re.search(r'\d+', v).group()
                                self.__value['score'] = int(score_text) / 20
                        self.__span_state = 3
                elif self.__start_div_num:
                    self.__span_state = 4

    def handle_endtag(self, tag):
        if tag == 'div':
            if self.__start_div_yingping:
                if self.__start_div_item:
                    if self.__start_div_gclear:
                        if self.__start_div_num or self.__start_div_ratingwrap:
                            if self.__start_div_num:
                                self.__start_div_num = False
                            if self.__start_div_ratingwrap:
                                self.__start_div_ratingwrap = False
                        else:
                            self.__start_div_gclear = False
                    else:
                        self.data.append(self.__value)
                        self.__value = {}
                        self.__start_div_item = False
                else:
                    self.__start_div_yingping = False
        elif tag == 'a':
            if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
                self.__start_a = False
        elif tag == 'span':
            self.__span_state = 0

    def handle_data(self, data):
        if self.__start_a:
            self.__value['title'] = data
        elif self.__span_state == 2:
            self.__value['time'] = data
        elif self.__span_state == 4:
            score_text = re.search(r'\d+', data).group()
            self.__value['people'] = int(score_text)


def html_parser(html):
    parser = CommentHTMLParser()
    parser.feed(html)
    return parser.data
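
On Python 3 the same class lives in `html.parser`. The sketch below shows the flag-based idea in a reduced form, pulling only the title and link out of each review. The HTML fragment is a guess at the page structure inferred from the class names used above, and `TitleParser` is a hypothetical name, so treat this as an illustration rather than a drop-in replacement.

```python
from html.parser import HTMLParser

# Assumed structure, reconstructed from the class names in this article
SAMPLE = '''
<div class="yingping-list-wrap">
  <div class="item">
    <div class="g-clear title-wrap"><a href="/review/1">First review</a></div>
  </div>
  <div class="item">
    <div class="g-clear title-wrap"><a href="/review/2">Second review</a></div>
  </div>
</div>
'''


class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title_wrap = False  # inside the title-wrap div
        self.in_link = False        # inside the <a> of a title
        self.current = {}
        self.data = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'g-clear title-wrap':
            self.in_title_wrap = True
        elif tag == 'a' and self.in_title_wrap:
            self.in_link = True
            self.current = {'title_href': attrs.get('href')}

    def handle_data(self, data):
        if self.in_link:
            self.current['title'] = data.strip()

    def handle_endtag(self, tag):
        if tag == 'a' and self.in_link:
            self.in_link = False
            self.data.append(self.current)
        elif tag == 'div' and self.in_title_wrap:
            # Assumes no nested div inside the title block
            self.in_title_wrap = False


parser = TitleParser()
parser.feed(SAMPLE)
parser.close()
for row in parser.data:
    print(row)
```

A single boolean per container only works while that container holds no nested `div`; the full parsers above need several flags precisely because the real title block contains the rating and like-count `div`s.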

Approaches 3 and 4 really are not a good fit for this case; I am writing them down while I have the time, for learning purposes.
