Beautiful_Soup Self-Study Notes 001 -- Creating a Beautiful Soup Object, Features Argument, TreeBuilder & Parsers
阿新 • Published: 2020-12-20
Tags: python
1. Creating a Beautiful Soup object
Method 1: Create from a string
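# Method 1 -- Create from a string
from bs4 import BeautifulSoup

hello = "<p>Hello</p>"
# No parser is named here, so Beautiful Soup picks the best one installed (recent versions warn about this)
soup_str = BeautifulSoup(hello)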
Method 2: Create from a URL
# Method 2 -- Create from a URL
import requests
from bs4 import BeautifulSoup

url = 'https://mcc.osu.edu/events.aspx'
page = requests.get(url)  # Fetch the webpage with a GET request
soup_url = BeautifulSoup(page.text, "html.parser")
Note: the "html.parser" passed here is the features argument; it is explained in detail in the TreeBuilder class section below.
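Method 2 can be wrapped in a small helper that also checks the HTTP status before parsing. This is only a sketch: the function names soup_from_response and soup_from_url, and the 10-second timeout, are assumptions made for illustration and are not part of requests or Beautiful Soup.

```python
import requests
from bs4 import BeautifulSoup


def soup_from_response(response, parser="html.parser"):
    """Hypothetical helper: parse an already-fetched response."""
    response.raise_for_status()  # raise an error on 4xx/5xx status codes
    return BeautifulSoup(response.text, parser)


def soup_from_url(url, parser="html.parser", timeout=10):
    """Hypothetical helper: fetch a page with GET, then parse it."""
    response = requests.get(url, timeout=timeout)
    return soup_from_response(response, parser)
```

Calling soup_from_url('https://mcc.osu.edu/events.aspx') would then give the same kind of object as soup_url above, provided the request succeeds.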
Method 3: Create from a file
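# Method 3 -- Create from a file (BeautifulSoup also accepts an open file handle)
from bs4 import BeautifulSoup
with open("foo.html", "r") as foo_file:
    soup_file = BeautifulSoup(foo_file)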
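To try Method 3 without an existing foo.html, the sketch below first writes a small file into a temporary directory and then parses it. The temporary directory and the file contents are assumptions made only so the example runs on its own.

```python
import os
import tempfile

from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "foo.html")

    # Create a tiny HTML file to parse.
    with open(path, "w", encoding="utf-8") as out_file:
        out_file.write("<p>Hello</p>")

    # Pass the open file handle straight to BeautifulSoup.
    with open(path, "r", encoding="utf-8") as foo_file:
        soup_file = BeautifulSoup(foo_file, "html.parser")

print(soup_file.p.string)  # Hello
```

The whole file is read while the object is being built, so soup_file remains usable after the file is closed.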
2. About the TreeBuilder class
The TreeBuilder class is used for creating the HTML/XML tree from the input document.
- Specify the "features argument" (e.g. html, xml, etc.) when creating the object (Default: HTML parser)
- BeautifulSoup chooses the most suitable TreeBuilder for the argument given (following the parser priority order)
- For example:
features argument: html
BeautifulSoup's parser priority is: lxml > html5lib > html.parser
Following that parser priority, BeautifulSoup's TreeBuilder priority is: LXMLTreeBuilder > HTML5TreeBuilder > HTMLParserTreeBuilder
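# Example code -- features specified as xml (the "xml" feature requires lxml to be installed)
soup_xml = BeautifulSoup(hello, features="xml")
# Equivalent: features is the second positional argument
soup_xml = BeautifulSoup(hello, "xml")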
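To see which TreeBuilder is actually chosen on a given machine, the builder can be inspected directly. The sketch below assumes only that bs4 is installed; the class printed by the first lookup depends on whether lxml or html5lib are also present.

```python
from bs4 import BeautifulSoup
from bs4.builder import builder_registry

# Ask the registry which TreeBuilder class it would pick for the "html" feature.
# The answer depends on which parsers are installed on this machine.
best_html_builder = builder_registry.lookup("html")
print(best_html_builder.__name__)

# Naming a parser explicitly removes the guesswork.
soup = BeautifulSoup("<p>Hello</p>", "html.parser")
print(type(soup.builder).__name__)  # HTMLParserTreeBuilder
```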
- A better practice: specify the parser
Different parsers can produce different parse results, so naming the parser explicitly makes the outcome more predictable.
“It is good to specify the parser by giving the features argument because this helps to ensure that the input is processed in the same manner across different machines”
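The difference between parsers is easiest to see on invalid markup. The sketch below feeds the same broken snippet to each parser and skips any that are not installed; the sample string "<a></p>" and the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken_html = "<a></p>"  # invalid markup: a closing </p> with no opening <p>

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # This parser library is not installed on this machine.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "(not installed)")
```

With html.parser the stray </p> is simply dropped, giving <a></a>; the other parsers, where available, may repair the markup differently, which is why the same script can behave differently on two machines unless the parser is named.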