Scraping information from the web with Python
阿新 • Published: 2020-12-15
Tags: python, web scraping, beautifulsoup, requests
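webbrowser module: the open() function opens the specified URL in a browser. That is about the only thing the webbrowser module can do.
import webbrowser
# Include the scheme (https://) so the address is treated as a web URL
webbrowser.open("https://www.baidu.com")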
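Since webbrowser.open() passes the string on more or less as given, an address without a scheme may not be interpreted as a web address on every platform. The following is a minimal sketch, assuming two hypothetical helpers named with_scheme and open_site (neither is part of the webbrowser module), that adds https:// when no scheme is present:

```python
import webbrowser


def with_scheme(url, default="https"):
    """Return url unchanged if it already has a scheme, else prefix one.

    Hypothetical helper for illustration; not part of the webbrowser module.
    """
    if "://" in url:
        return url
    return f"{default}://{url}"


def open_site(url):
    """Open url in the default browser.

    webbrowser.open() returns True if a browser was launched, False otherwise.
    """
    return webbrowser.open(with_scheme(url))
```

Calling open_site("www.baidu.com") would then ask the browser to open https://www.baidu.com.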
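requests module:
1> It is not part of the Python standard library and has to be installed: pip install requests
2> requests was written because Python's urllib2 module is too complicated to use. When you need to download something from the web, just use requests.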
import requests
res = requests.get("https://jingyan.baidu.com/article/2a138328efdb44074a134fc5.html")
print(type(res))
print(res.status_code == requests.codes.ok)
print(len(res.text))
print(res.text[:250])
=======================================================================================
result:
<class 'requests.models.Response'>
True
160814
<!DOCTYPE html><html><!--STATUS OK--><head><meta http-equiv="X-UA-Compatible" content="IE=Edge" /><meta charset="utf-8" /><meta name="referrer"
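Checking for errors: call the raise_for_status() method on the Response object. If the download failed, it raises an exception; if the download succeeded, it does nothing.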
import requests
res = requests.get("http://inventwithpython.com/page_that_does_not_exist")
res.raise_for_status()
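The raise_for_status() call above can be wrapped in try/except so that a failed download does not crash the program. This is a minimal sketch, assuming two hypothetical helpers named describe_status and fetch and an arbitrary 10-second timeout; none of these appear in the original example:

```python
import requests


def describe_status(res):
    """Return "ok" if the response succeeded, else an error message.

    raise_for_status() raises requests.exceptions.HTTPError for 4xx/5xx codes.
    Hypothetical helper for illustration.
    """
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return f"There was a problem: {exc}"
    return "ok"


def fetch(url):
    """Download url and report the outcome (hypothetical helper)."""
    try:
        # The 10-second timeout is an assumed value, not from the original text.
        res = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as exc:
        # Failures before any response arrives: DNS errors, refused
        # connections, timeouts.
        return f"Request failed: {exc}"
    return describe_status(res)
```

Note that raise_for_status() only helps once a response has been received; a host that cannot be reached at all makes requests.get() itself raise, which is why fetch catches RequestException separately.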
BeautifulSoup module:
1> Used to extract information from HTML pages
2> The module is named bs4; to use it, write import bs4
example.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>The Website Title</title>
</head>
<body>
<p>Download my<strong>Python</strong></p>
<p><span id="author">Super Yang</span></p>
</body>
</html>
main.py:
import bs4
exampleFile = open("example.html")
# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), "html.parser")
elems = exampleSoup.select("#author")
print(len(elems))
print(elems[0].getText())
print(str(elems[0]))
print(elems[0].attrs)
print(elems[0].get('id'))
===================================================
result:
1
Super Yang
<span id="author">Super Yang</span>
{'id': 'author'}
author
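select() accepts CSS selectors, so besides an id it can also match by tag name or by nesting. A minimal sketch, assuming the page is held in a string (the HTML constant below is an illustrative copy of the example page, with a space added after "my") instead of being read from a file:

```python
import bs4

# Illustrative copy of the example page, kept in a string so the
# snippet runs without any file on disk.
HTML = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong></p>
<p><span id="author">Super Yang</span></p>
</body>
</html>"""

soup = bs4.BeautifulSoup(HTML, "html.parser")

author = soup.select("#author")[0]     # element with id="author"
paragraphs = soup.select("p")          # every <p> element
strong_in_p = soup.select("p strong")  # <strong> elements nested inside a <p>

print(author.getText())
print(len(paragraphs))
print(strong_in_p[0].getText())
```

select() always returns a list, so an empty list means nothing matched; checking its length before indexing avoids an IndexError on pages that lack the element.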