
Scraping Information from the Web with Python


webbrowser module: the open() function launches a browser at the given URL. That is more or less the only thing the webbrowser module can do.

import webbrowser
# Include the scheme so the browser treats the argument as a web address
webbrowser.open("https://www.baidu.com")
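
A bare host name such as "www.baidu.com" may not be handled the same way on every platform, so it can help to add a scheme before calling open(). The following is only a minimal sketch: with_scheme and open_in_browser are hypothetical helper names, not part of the webbrowser module, and the check does not cover inputs like "localhost:8080".

```python
import webbrowser
from urllib.parse import urlparse


def with_scheme(url, default="https"):
    """Return url unchanged if it already has a scheme, else prepend one."""
    if urlparse(url).scheme:
        return url
    return f"{default}://{url}"


def open_in_browser(url):
    """Open url in the default browser; returns whatever webbrowser.open returns."""
    return webbrowser.open(with_scheme(url))


print(with_scheme("www.baidu.com"))
```

Calling open_in_browser("www.baidu.com") would then hand "https://www.baidu.com" to the browser.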

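requests module:

1> It does not ship with Python and has to be installed: pip install requests

2> requests was written because Python's urllib2 module (urllib.request in Python 3) is too complicated to use. When you need to download something from the Web, just use requests.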

import requests
res = requests.get("https://jingyan.baidu.com/article/2a138328efdb44074a134fc5.html")
print(type(res))
print(res.status_code == requests.codes.ok)
print(len(res.text))
print(res.text[:250])
=======================================================================================
result:
<class 'requests.models.Response'>
True
160814
<!DOCTYPE html><html><!--STATUS OK--><head><meta http-equiv="X-UA-Compatible" content="IE=Edge" /><meta charset="utf-8" /><meta name="referrer"
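
Once the page has been fetched, a common next step is to save it to disk. The sketch below writes the response body in binary chunks using the response's iter_content() method. save_response and download are hypothetical helper names, the chunk size and timeout are arbitrary choices, and download() needs network access plus an installed requests package.

```python
def save_response(res, path, chunk_size=100_000):
    """Write the body of a response to path in chunks; return the bytes written."""
    written = 0
    with open(path, "wb") as out_file:
        # iter_content yields the body as byte strings of at most chunk_size bytes
        for chunk in res.iter_content(chunk_size):
            written += out_file.write(chunk)
    return written


def download(url, path):
    """Fetch url and store it at path; returns 0 if the status is not OK."""
    import requests  # third-party: pip install requests

    res = requests.get(url, timeout=10)
    if res.status_code != requests.codes.ok:
        return 0
    return save_response(res, path)
```

Writing in binary mode ("wb") keeps the bytes exactly as the server sent them, so the page's own encoding is preserved.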

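Checking for errors: call the raise_for_status() method on the Response object. If the download failed, it raises an exception; if the download succeeded, it does nothing.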

import requests
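# The host must exist for the server to answer with 404; a misspelled domain fails inside get() instead
res = requests.get("http://inventwithpython.com/page_that_does_not_exist")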
res.raise_for_status()
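
An unhandled exception from raise_for_status() stops the program. If a failed download should not be fatal, the call can be wrapped in try/except. This is a minimal sketch: check_download is a hypothetical helper name, and it catches the broad Exception only to stay short. requests raises requests.exceptions.HTTPError here, which is the narrower class to catch in real code.

```python
def check_download(res):
    """Return (True, "") if the response is OK, else (False, error message)."""
    try:
        res.raise_for_status()
    except Exception as exc:  # with requests this is requests.exceptions.HTTPError
        return False, str(exc)
    return True, ""
```

Usage would look like ok, problem = check_download(requests.get(url)), followed by printing problem when ok is False.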

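BeautifulSoup module:

1> Used to extract information from HTML pages

2> The module's name is bs4 (installed with pip install beautifulsoup4); to use it, write import bs4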

example.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <title>The Website Title</title>
</head>
<body>
<p>Download my<strong>Python</strong></p>
<p><span id="author">Super Yang</span></p>
</body>
</html>

main.py:

import bs4
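# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
with open("example.html", encoding="utf-8") as exampleFile:
    exampleSoup = bs4.BeautifulSoup(exampleFile.read(), "html.parser")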
elems = exampleSoup.select("#author")
print(len(elems))
print(elems[0].getText())
print(str(elems[0]))
print(elems[0].attrs)
print(elems[0].get('id'))
===================================================
result:
1
Super Yang
<span id="author">Super Yang</span>
{'id': 'author'}
author
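
select() accepts other CSS selectors besides an id. The sketch below runs a few of them against the same markup, embedded as a string so that it can be run without example.html. The variable names are illustrative only.

```python
import bs4

html = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"><title>The Website Title</title></head>
<body>
<p>Download my<strong>Python</strong></p>
<p><span id="author">Super Yang</span></p>
</body>
</html>"""

soup = bs4.BeautifulSoup(html, "html.parser")

paragraphs = soup.select("p")        # every <p> element
nested = soup.select("p #author")    # the element with id="author" inside a <p>
strong = soup.select("p > strong")   # <strong> that is a direct child of a <p>

print(len(paragraphs))
print(nested[0].getText())
print(strong[0].getText())
print(soup.title.getText())
```

select() always returns a list, so an empty result shows up as len(...) == 0 rather than as an exception.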