
Beautiful_Soup Self-Study Notes 001 -- Creating a BeautifulSoup Object, Features Argument, TreeBuilder & Parsers

Tags: python


1. Creating a BeautifulSoup object

Method 1: create it from a string

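# Method 1 -- Create from a string
from bs4 import BeautifulSoup

hello = "<p>Hello</p>"
# No parser is named here, so BeautifulSoup picks the best installed HTML parser and emits a warning
soup_str = BeautifulSoup(hello)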
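To check that the object really holds the parsed tree, here is a minimal sketch that reads the tag back out. It names "html.parser" explicitly (my choice, so the example behaves the same on any machine); the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

hello = "<p>Hello</p>"
soup = BeautifulSoup(hello, "html.parser")

# soup.p returns the first <p> tag in the tree
first_p = soup.p
tag_name = first_p.name      # the tag's name
tag_text = first_p.string    # the single text node inside the tag

print(tag_name, tag_text)
```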

Method 2: create it from a URL

# Method 2 -- Create from a URL
import requests
from bs4 import BeautifulSoup

url = 'https://mcc.osu.edu/events.aspx'
page = requests.get(url)  # Fetch the webpage with a GET request
soup_url = BeautifulSoup(page.text, "html.parser")

Note: the "html.parser" passed here is called the features argument. It is explained in detail in the TreeBuilder Class section below.

Method 3: create it from a file

# Method 3 -- Create from a file
with open("foo.html", "r") as foo_file:
    soup_file = BeautifulSoup(foo_file)  # an open file handle can be passed directly
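If you do not have a foo.html at hand, the file-based method can be tried with a throwaway file. This is only a sketch: the temporary file and its contents are made up for the example, and "html.parser" is named explicitly so the result does not depend on which parsers are installed.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Write a small throwaway HTML file (hypothetical content) so the example runs anywhere
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as tmp:
    tmp.write("<html><body><p>Hello from a file</p></body></html>")
    path = tmp.name

# BeautifulSoup reads the whole file while the handle is still open
with open(path, "r", encoding="utf-8") as html_file:
    soup_from_file = BeautifulSoup(html_file, "html.parser")

os.remove(path)  # clean up the temporary file

file_text = soup_from_file.p.get_text()
print(file_text)
```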

2. About the TreeBuilder class

The TreeBuilder class is used for creating the HTML/XML tree from the input document.

  • Specify the "features argument" (e.g. html, xml) when creating the object. If it is omitted, BeautifulSoup falls back to the best available HTML parser.
  • BeautifulSoup picks the most suitable TreeBuilder for the given argument, following the parser priority order.
  • For example:
    features argument: html
    Parser priority for this argument: lxml > html5lib > html.parser
    So, following the parser priority, the TreeBuilder priority is:
    • LXMLTreeBuilder > HTML5TreeBuilder > HTMLParserTreeBuilder
# Example Code -- features specified as xml (the "xml" feature requires lxml to be installed)
soup_xml = BeautifulSoup(hello, features="xml")  # keyword form
soup_xml = BeautifulSoup(hello, "xml")           # equivalent positional form
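To see which TreeBuilder was actually picked, one option is to inspect the soup object. The sketch below relies on the `builder` attribute and on `builder_registry.lookup` from `bs4.builder`; both exist in the bs4 versions I have used, but treat them as implementation details rather than a stable API.

```python
from bs4 import BeautifulSoup
from bs4.builder import builder_registry

# Naming a specific parser pins the TreeBuilder
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
builder_name = type(soup.builder).__name__
print(builder_name)

# Asking for the generic "html" feature returns the highest-priority
# TreeBuilder class among the parsers installed on this machine
best_html_builder = builder_registry.lookup("html")
print(best_html_builder.__name__)
```

On a machine with lxml installed the second line should name the lxml builder; with nothing extra installed it falls back to the html.parser builder.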


  • A Better Practice – specify parser

Different parsers can build different trees from the same input, so naming the parser explicitly makes the result more predictable.

“It is good to specify the parser by giving the features argument because this helps to ensure that the input is processed in the same manner across different machines”
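A quick way to see the difference for yourself is to feed the same invalid markup to each parser. This is a rough sketch: the sample string is deliberately broken, and lxml and html5lib are optional installs, so the loop records None for any parser that is missing instead of failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # invalid on purpose: a stray closing tag and an unclosed <a>

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        results[parser] = None  # this parser is not installed here

for parser, output in results.items():
    print(parser, "->", output)
```

Each parser repairs the broken markup in its own way, which is why the same script can behave differently on two machines if the parser is left unspecified.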