Parsing HTML with Python

阿新 • Published: 2019-02-09

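Python 2's standard library has three modules that can parse HTML text: HTMLParser, sgmllib, and htmllib. Their implementations differ, but their functionality is much the same. The parser classes these modules provide are all base classes that do no concrete work themselves. When they find an element (such as a tag, a comment, or a declaration), they call the corresponding method. You must override these methods, because the base class does nothing in them.

For example: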

"""<html><head><title>Advice</title></head><body>
<p>The <a href="http://ietf.org">IETF admonishes:
<i>Be strict in what you <b>send</b>.</i></a></p>
<form>
<input type=submit > <input type=text name=start size=4></form>
</body></html>
"""

When this data is processed and the <html> tag is detected, HTMLParser calls the handle_starttag method.
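The callback behaviour can be seen with a minimal sketch. It assumes Python 3, where the HTMLParser module is named html.parser. The class name TagLogger and its events list are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class TagLogger(HTMLParser):
    """Hypothetical helper: records each callback the base class triggers."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, e.g. <html>
        self.events.append(('start', tag, attrs))

    def handle_endtag(self, tag):
        # Called once for every closing tag, e.g. </html>
        self.events.append(('end', tag))


logger = TagLogger()
logger.feed('<html><head><title>Advice</title></head></html>')
logger.close()
print(logger.events[0])  # ('start', 'html', [])
```

Without the two overridden methods the parser would still run, but nothing would be recorded, because the base-class handlers are empty.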

The libraries are described in more detail below.

1. HTMLParser

#------------------ HTMLParser_stack.py ------------------#
# -*- coding: GBK -*-
# Python 2 code: the HTMLParser module is named html.parser in Python 3.
import HTMLParser, sys

html = """<html><head><title>Advice</title></head><body>
<p>The <a href="http://ietf.org">IETF admonishes:
<i>Be strict in what you <b>send</b>.</i></a></p>
<form>
<input type=submit > <input type=text name=start size=4></form>
</body></html>
"""

tagstack = []  # tags currently open, outermost first

class ShowStructure(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        tagstack.append(tag)
    def handle_endtag(self, tag):
        tagstack.pop()
    def handle_data(self, data):
        # Print the path of open tags in front of each non-blank text run
        if data.strip():
            for tag in tagstack:
                sys.stdout.write('/' + tag)
            sys.stdout.write(' >> %s\n' % data[:40].strip())

ShowStructure().feed(html)

Output of this program:

/html/head/title >> Advice
/html/body/p >> The
/html/body/p/a >> IETF admonishes:
/html/body/p/a/i >> Be strict in what you
/html/body/p/a/i/b >> send
/html/body/p/a/i >> .

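Some web pages do not have strictly matched pairs of start and end tags. In that case we can choose to ignore some tags, and write our own stack to handle them.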

#*--------------- TagStack class example -----------------#
class TagStack:
    def __init__(self, lst=None):
        # Avoid sharing one mutable default list between instances
        self.lst = lst if lst is not None else []
    def __getitem__(self, pos):
        return self.lst[pos]
    def append(self, tag):
        # Remove every paragraph-level tag if this is one
        if tag.lower() in ('p', 'blockquote'):
            self.lst = [t for t in self.lst if t not in ('p', 'blockquote')]
        self.lst.append(tag)
    def pop(self, tag):
        # "Pop" by tag from nearest pos, not only last item
        self.lst.reverse()
        try:
            pos = self.lst.index(tag)
        except ValueError:
            self.lst.reverse()  # restore the order before reporting the error
            raise HTMLParser.HTMLParseError("Tag not on stack")
        del self.lst[pos]
        self.lst.reverse()

# With this class, handle_endtag must call tagstack.pop(tag), not tagstack.pop()
tagstack = TagStack()
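The same stack idea can be sketched for Python 3, under a few assumptions: the parser comes from html.parser, the stack lives on the parser instance rather than in a module-level variable, and a stray end tag is silently ignored instead of raising an error (HTMLParseError is not available in current Python 3). The lines attribute is a hypothetical addition that collects the output so it can be inspected.

```python
from html.parser import HTMLParser


class TagStack:
    """Tolerant tag stack: a new paragraph-level tag closes earlier ones."""

    def __init__(self, tags=None):
        self.tags = list(tags) if tags else []

    def __iter__(self):
        return iter(self.tags)

    def append(self, tag):
        # Remove every paragraph-level tag if this is one
        if tag in ('p', 'blockquote'):
            self.tags = [t for t in self.tags if t not in ('p', 'blockquote')]
        self.tags.append(tag)

    def pop(self, tag):
        # Remove the nearest matching tag; ignore end tags that were never opened
        for pos in range(len(self.tags) - 1, -1, -1):
            if self.tags[pos] == tag:
                del self.tags[pos]
                return


class ShowStructure(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tagstack = TagStack()
        self.lines = []  # collected "path >> text" lines

    def handle_starttag(self, tag, attrs):
        self.tagstack.append(tag)

    def handle_endtag(self, tag):
        self.tagstack.pop(tag)

    def handle_data(self, data):
        if data.strip():
            path = ''.join('/' + t for t in self.tagstack)
            self.lines.append('%s >> %s' % (path, data[:40].strip()))


parser = ShowStructure()
parser.feed('<p>one<p>two <b>bold</b></p>')  # the first <p> is never closed
parser.close()
for line in parser.lines:
    print(line)
```

With a plain list as the stack, the second paragraph would be reported under /p/p. Here the unclosed first <p> is dropped when the second one opens, so both paragraphs are reported under /p.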

HTMLParser has a bug: it cannot handle Chinese attribute values. For example, suppose a page contains the following:

<input type=submit value=跳轉到>

Parsing then fails when it reaches this line.

The cause is, once again, a regular expression.

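attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

The character class that attrfind uses for unquoted attribute values contains only ASCII characters, so it does not match Chinese characters.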

You can change this pattern to fix the bug. sgmllib does not have this problem.
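The effect of the pattern can be checked outside the parser. The sketch below assumes Python 3 and the re module only. attrfind_old is the pattern quoted above, while attrfind_new and the helper parse_attrs are hypothetical: they show one possible way to widen the unquoted-value class and are not necessarily the fix the author had in mind.

```python
import re

# The pattern quoted in the text: unquoted values are limited to ASCII characters.
attrfind_old = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical fix: an unquoted value is any run of characters other than
# whitespace, quotes, '=', '<' and '>'.
attrfind_new = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^\s"\'=<>]*))?')


def parse_attrs(pattern, text):
    """Match attributes one after another and return (name, value) pairs."""
    attrs, pos = [], 0
    while True:
        m = pattern.match(text, pos)
        if not m or m.end() == pos:
            break  # nothing more that looks like an attribute
        attrs.append((m.group(1), m.group(3)))
        pos = m.end()
    return attrs


text = ' type=submit value=跳轉到'
old = parse_attrs(attrfind_old, text)  # [('type', 'submit'), ('value', '')]
new = parse_attrs(attrfind_new, text)  # [('type', 'submit'), ('value', '跳轉到')]
```

With the old pattern the value of the value attribute comes back empty and the Chinese text is left unconsumed, which is the point at which the old parser gives up.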

2. sgmllib

HTML is a subset of SGML, so sgmllib can handle more than just HTML. The following code illustrates how sgmllib is used.

#------------------ HTMLParser_stack.py ------------------#
# -*- coding: GBK -*-
# Python 2 code: sgmllib was removed in Python 3.
import sgmllib, sys

html = """<lala><head><title>Advice</title></head><body>
<p>The <a href="http://ietf.org">IETF admonishes:
<i>Be strict in what you <b>send</b>.</i></a></p>
<form>
<input type=submit name='我'> 我 <input type=text name=start size=4></form>
</body></lala>
"""

tagstack = []

class ShowStructure(sgmllib.SGMLParser):
    # Called only for tags that have a start_<tag> method (here: <lala>).
    # `method` is that start_<tag> method; it is deliberately not invoked.
    def handle_starttag(self, tag, method, attrs):
        tagstack.append(tag)
    # Called only for tags that have an end_<tag> method (none here)
    def handle_endtag(self, tag, method):
        tagstack.pop()
    def handle_data(self, data):
        if data.strip():
            for tag in tagstack:
                sys.stdout.write('/' + tag)
            sys.stdout.write(' >> %s\n' % data[:40].strip())
    def unknown_starttag(self, tag, attrs):
        print 'start tag:<' + tag + '>'
    def unknown_endtag(self, tag):
        print 'end tag:</' + tag + '>'
    def start_lala(self, attr):
        print 'lala tag found'

ShowStructure().feed(html)

Output:

start tag:<head>
start tag:<title>
/lala >> Advice
end tag:</title>
end tag:</head>
start tag:<body>
start tag:<p>
/lala >> The
start tag:<a>
/lala >> IETF admonishes:
start tag:<i>
/lala >> Be strict in what you
start tag:<b>
/lala >> send
end tag:</b>
/lala >> .
end tag:</i>
end tag:</a>
end tag:</p>
start tag:<form>
start tag:<input>
/lala >> 我
start tag:<input>
end tag:</form>
end tag:</body>
end tag:</lala>

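As with HTMLParser, to parse HTML with sgmllib you subclass sgmllib.SGMLParser. The handler methods of this class are all empty, and the user must override them. What the class provides is a call to the corresponding method in each particular situation.

For example, when an <html> tag is found and no start_html(self, attr) method is defined, unknown_starttag is called. How the tag is then handled is up to the user.

SGML tags can be user-defined. For example, if you define a start_lala method, the <lala> tag will be handled.

One point needs explaining. If you define both start_tagname and handle_starttag, only handle_starttag runs when <tagname> is encountered: it receives start_tagname as its method argument, and start_tagname runs only if handle_starttag calls it, so start_tagname can even be an empty method. If you do not define handle_starttag, the default implementation calls start_tagname when <tagname> is encountered. If no start method is defined for tagname, the tag is treated as unknown and unknown_starttag is called.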
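Since sgmllib itself is not available in Python 3, the dispatch rule described above can be illustrated with a small stand-in. MiniDispatcher, WithStart, WithBoth, and run are hypothetical names; this is a simplified model of the lookup order, not sgmllib's actual code, and it leaves out details such as do_tagname methods and end tags.

```python
class MiniDispatcher:
    """Simplified model of the start-tag dispatch described above (not sgmllib)."""

    def __init__(self):
        self.log = []

    def finish_starttag(self, tag, attrs):
        # Look for a start_<tag> method on the instance
        method = getattr(self, 'start_' + tag, None)
        if method is None:
            self.unknown_starttag(tag, attrs)
        else:
            self.handle_starttag(tag, method, attrs)

    def handle_starttag(self, tag, method, attrs):
        # Default behaviour: simply run the start_<tag> method
        method(attrs)

    def unknown_starttag(self, tag, attrs):
        self.log.append('unknown:' + tag)


class WithStart(MiniDispatcher):
    def start_lala(self, attrs):
        self.log.append('start_lala')


class WithBoth(WithStart):
    def handle_starttag(self, tag, method, attrs):
        # Overrides the default, so start_lala is found but never called
        self.log.append('handle:' + tag)


def run(cls, tags):
    parser = cls()
    for tag in tags:
        parser.finish_starttag(tag, [])
    return parser.log


print(run(WithStart, ['lala', 'head']))  # ['start_lala', 'unknown:head']
print(run(WithBoth, ['lala', 'head']))   # ['handle:lala', 'unknown:head']
```

The second result mirrors the sgmllib output shown earlier: <lala> never reaches unknown_starttag or start_lala, because the overridden handle_starttag takes it over.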
