Getting Started with Parsing HTML Using BeautifulSoup

阿新 • Published: 2019-01-20

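Lately I have been working with web crawlers, and the page data a crawler fetches needs further processing before it becomes usable. Today I studied BeautifulSoup, a third-party Python library that handles HTML documents very well. There is no shortage of material about it online; today's experiments mainly follow a very detailed introductory tutorial I found. That tutorial covers a great deal, more than I can absorb in one sitting, so here I record what I learned tonight and will continue when I have time.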

Below is my own code, a small script that works on the source of the Baidu home page:

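#!/usr/bin/env python3
# -*- coding: utf-8 -*-

from urllib.request import urlopen
from bs4 import BeautifulSoup

def bs_test():
    url = 'http://www.baidu.com'
    html = urlopen(url).read()
    print('Length of the Baidu page HTML:', len(html))
    # Create the BeautifulSoup object, naming the parser explicitly
    soup = BeautifulSoup(html, 'html.parser')
    # prettify() returns the document as an indented string
    print(soup.prettify())
    '''
    Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object.
    All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, Comment
    '''
    # Tag: each tag in the fetched HTML source becomes a Tag; its two key attributes are name and attrs.
    # Note that soup.<tag> is None when the page contains no such tag.
    print(soup.title)    # the first <title> tag
    print(len(soup.title))   # len() of a tag counts its direct children
    print(soup.head)     # the <head> tag
    print(len(soup.head))
    print(soup.a)        # the first <a> tag
    print(len(soup.a))
    print(type(soup.a))  # the type of the <a> tag object
    print(soup.p)        # the first <p> tag
    print(soup.p.attrs)  # all attributes of the <p> tag, as a dict
    print(soup.name)
    print(soup.head.name)
    # NavigableString
    print(soup.p.string)        # .string returns the text inside the tag
    print(type(soup.p.string))  # show its type
    # The BeautifulSoup object represents the whole document
    print(type(soup.name))   # the type of the soup object's name
    print(soup.name)
    print(soup.attrs)
    # A Comment is a special kind of NavigableString; its printed value still omits the comment markers,
    # so check the type to tell a comment apart from ordinary text
    print(soup.a)
    print(soup.a.string)
    print(type(soup.a.string))
    # .contents and .children: .contents returns a tag's child nodes as a list, .children is an iterator over them
    print(soup.head.contents)
    print(soup.head.children)
    # .contents and .children cover only a tag's direct children; .descendants walks all nested nodes recursively
    for child in soup.descendants:
        print(child)
    # .strings yields every string in the document; iterate over it to read them
    for string in soup.strings:
        print(repr(string))
    # .strings may include many blank lines and spaces; .stripped_strings removes the extra whitespace
    for string in soup.stripped_strings:
        print(repr(string))
    # .parent attribute
    print(soup.p.parent.name)
    # .next_sibling and .previous_sibling give one neighbor; .next_siblings iterates over all following siblings
    for sibling in soup.a.next_siblings:
        print(repr(sibling))
    # .next_element and .previous_element: the node parsed immediately after or before the current one
    print(soup.head.next_element)
    # Searching the tree: find_all(name, attrs, recursive, string, limit, **kwargs)
    # looks through the current tag's descendants and returns those that match the filter
    print(soup.find_all(['a', 'b']))      # all <a> and <b> tags
    print(soup.find_all("a", limit=2))    # limit caps the number of results


bs_test()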
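
The four object types named in the script are easier to see on a tiny inline document than on a live page. The following is only a minimal sketch: the `SAMPLE` markup is invented for illustration and is not taken from the Baidu page, and it assumes Python 3 with `bs4` installed and the built-in `html.parser`.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample markup, made up for this experiment.
SAMPLE = (
    '<html><head><title>Demo</title></head><body>'
    '<p class="intro" id="first">Hello</p>'
    '<a href="http://example.com/"><!-- hidden note --></a>'
    '</body></html>'
)

soup = BeautifulSoup(SAMPLE, 'html.parser')

# Tag: has a name and a dict of attributes.
p = soup.p
print(type(p).__name__, p.name, p.attrs)

# NavigableString: the text inside a tag.
print(type(p.string).__name__, p.string)

# BeautifulSoup: the whole document; its name is the special value '[document]'.
print(type(soup).__name__, soup.name)

# Comment: a NavigableString subclass whose value excludes the <!-- --> markers.
note = soup.a.string
print(type(note).__name__, repr(note))
print(isinstance(note, Comment), isinstance(note, NavigableString))
```

Because a Comment prints just like ordinary text, checking `isinstance(note, Comment)` is one way to avoid treating hidden comments as page content.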

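The output is too long to paste here. For practice, you can prepare a small HTML document of your own to experiment with. That is all for today's first look at bs4. I am still at the beginner stage and there is a lot more to learn; anyone interested is welcome to exchange ideas and learn together.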
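
A small hand-written document keeps the output short enough to read in full. The sketch below tries the traversal and search calls from the script on such a document; the `SAMPLE` markup and its URLs are invented for illustration, and it assumes Python 3 with `bs4` installed.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup for experimenting offline.
SAMPLE = """<html><head><title>Demo</title></head>
<body>
<p class="story">Links:
<a href="http://example.com/one" id="link1">One</a>,
<a href="http://example.com/two" id="link2">Two</a> and
<b>bold</b>
</p>
</body></html>"""

soup = BeautifulSoup(SAMPLE, 'html.parser')

# .contents is a list of direct children; .children iterates over the same nodes.
head_children = soup.head.contents
print(head_children)

# .stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that contain only whitespace.
texts = list(soup.stripped_strings)
print(texts)

# .parent moves one level up the tree.
print(soup.a.parent.name)

# .next_siblings walks the nodes that follow within the same parent;
# text nodes are filtered out here so only tags remain.
sibling_tags = [s.name for s in soup.a.next_siblings if isinstance(s, Tag)]
print(sibling_tags)

# find_all accepts a single name or a list of names, plus an optional limit.
names = [tag.name for tag in soup.find_all(['a', 'b'])]
first_link = soup.find_all('a', limit=1)
print(names, first_link[0]['href'])
```

Swapping in your own markup for `SAMPLE` is enough to repeat the experiment without any network access.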
