
requests and BeautifulSoup

requests

The Python standard library ships modules such as urllib, urllib2, and httplib for making HTTP requests, but their APIs are clumsy. They were built for a different time and a different web, and even the simplest tasks demand a huge amount of work, including overriding various methods.

Requests is an Apache2-licensed HTTP library written in Python. It is a high-level wrapper over the built-in modules, which makes issuing network requests far more pleasant: with Requests you can easily perform practically anything a browser can do.

1. GET requests

# 1. GET without parameters
import requests

ret = requests.get('https://github.com/timeline.json')
print(ret.url)
print(ret.text)


# 2. GET with parameters
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.get("http://httpbin.org/get", params=payload)
print(ret.url)
print(ret.text)
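How `params` is merged into the URL can also be checked offline, without a reachable server, by building a `PreparedRequest` instead of sending the request. This is a sketch; `httpbin.org/get` is just the placeholder URL from the example above:

```python
import requests

# Build and prepare the request without sending it; params are
# URL-encoded into the query string at prepare time.
req = requests.Request('GET', 'http://httpbin.org/get',
                       params={'key1': 'value1', 'key2': 'value2'})
prepared = req.prepare()
print(prepared.url)  # http://httpbin.org/get?key1=value1&key2=value2
```

This is the same machinery `requests.get(..., params=...)` uses internally before the request goes on the wire.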

2. POST requests

# 1. Basic POST
import requests

payload = {'key1': 'value1', 'key2': 'value2'}
ret = requests.post("http://httpbin.org/post", data=payload)
print(ret.text)


# 2. POST with request headers and a JSON body
import requests
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}
headers = {'content-type': 'application/json'}

ret = requests.post(url, data=json.dumps(payload), headers=headers)
print(ret.text)
print(ret.cookies)

3. Other request methods

requests.get(url, params=None, **kwargs)
requests.post(url, data=None, json=None, **kwargs)
requests.put(url, data=None, **kwargs)
requests.head(url, **kwargs)
requests.delete(url, **kwargs)
requests.patch(url, data=None, **kwargs)
requests.options(url, **kwargs)

# All of the above are built on top of this method
requests.request(method, url, **kwargs)

4. More parameters

def request(method, url, **kwargs):
    """Constructs and sends a :class:`Request <Request>`.

    :param method: method for the new :class:`Request` object.
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param data: (optional) Dictionary, bytes, or file-like object to send in the body of the :class:`Request`.
    :param json: (optional) json data to send in the body of the :class:`Request`.
    :param headers: (optional) Dictionary of HTTP Headers to send with the :class:`Request`.
    :param cookies: (optional) Dict or CookieJar object to send with the :class:`Request`.
    :param files: (optional) Dictionary of ``'name': file-like-objects`` (or ``{'name': file-tuple}``) for multipart encoding upload.
        ``file-tuple`` can be a 2-tuple ``('filename', fileobj)``, 3-tuple ``('filename', fileobj, 'content_type')``
        or a 4-tuple ``('filename', fileobj, 'content_type', custom_headers)``, where ``'content-type'`` is a string
        defining the content type of the given file and ``custom_headers`` a dict-like object containing additional headers
        to add for the file.
    :param auth: (optional) Auth tuple to enable Basic/Digest/Custom HTTP Auth.
    :param timeout: (optional) How long to wait for the server to send data
        before giving up, as a float, or a :ref:`(connect timeout, read
        timeout) <timeouts>` tuple.
    :type timeout: float or tuple
    :param allow_redirects: (optional) Boolean. Set to True if POST/PUT/DELETE redirect following is allowed.
    :type allow_redirects: bool
    :param proxies: (optional) Dictionary mapping protocol to the URL of the proxy.
    :param verify: (optional) whether the SSL cert will be verified. A CA_BUNDLE path can also be provided. Defaults to ``True``.
    :param stream: (optional) if ``False``, the response content will be immediately downloaded.
    :param cert: (optional) if String, path to ssl client cert file (.pem). If Tuple, ('cert', 'key') pair.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response

    Usage::

      >>> import requests
      >>> req = requests.request('GET', 'http://httpbin.org/get')
      >>> req
      <Response [200]>
    """

Parameter list
def param_method_url():
    # requests.request(method='get', url='http://127.0.0.1:8000/test/')
    # requests.request(method='post', url='http://127.0.0.1:8000/test/')
    pass


def param_param():
    # params can be:
    # - a dict
    # - a string
    # - bytes (ASCII only)

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params={'k1': 'v1', 'k2': '水電費'})

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params="k1=v1&k2=水電費&k3=v3&k3=vv3")

    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=k2&k3=v3&k3=vv3", encoding='utf8'))

    # Error: bytes params must stay within ASCII
    # requests.request(method='get',
    #                  url='http://127.0.0.1:8000/test/',
    #                  params=bytes("k1=v1&k2=水電費&k3=v3&k3=vv3", encoding='utf8'))
    pass


def param_data():
    # data can be:
    # - a dict
    # - a string
    # - bytes
    # - a file object

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data={'k1': 'v1', 'k2': '水電費'})

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1; k2=v2; k3=v3; k3=v4"
    #                  )

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data="k1=v1;k2=v2;k3=v3;k3=v4",
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'}
    #                  )

    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  data=open('data_file.py', mode='r', encoding='utf-8'),  # file contains: k1=v1;k2=v2;k3=v3;k3=v4
    #                  headers={'Content-Type': 'application/x-www-form-urlencoded'}
    #                  )
    pass


def param_json():
    # The dict is serialized into a string via json.dumps(...),
    # sent as the request body, with Content-Type set to 'application/json'.
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水電費'})


def param_headers():
    # Send custom request headers to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     json={'k1': 'v1', 'k2': '水電費'},
                     headers={'Content-Type': 'application/x-www-form-urlencoded'}
                     )


def param_cookies():
    # Send cookies to the server
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies={'cook1': 'value1'},
                     )
    # A CookieJar also works (the dict form is a wrapper around it)
    from http.cookiejar import CookieJar
    from http.cookiejar import Cookie

    obj = CookieJar()
    obj.set_cookie(Cookie(version=0, name='c1', value='v1', port=None, domain='', path='/', secure=False, expires=None,
                          discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False,
                          port_specified=False, domain_specified=False, domain_initial_dot=False, path_specified=False)
                   )
    requests.request(method='POST',
                     url='http://127.0.0.1:8000/test/',
                     data={'k1': 'v1', 'k2': 'v2'},
                     cookies=obj)


def param_files():
    # Upload a file
    # file_dict = {
    #     'f1': open('readme', 'rb')
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload a file under a custom filename
    # file_dict = {
    #     'f1': ('test.txt', open('readme', 'rb'))
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload string content under a custom filename
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf")
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    # Upload with a custom filename, content type, and extra headers
    # file_dict = {
    #     'f1': ('test.txt', "hahsfaksfa9kasdjflaksdjf", 'application/text', {'k1': '0'})
    # }
    # requests.request(method='POST',
    #                  url='http://127.0.0.1:8000/test/',
    #                  files=file_dict)

    pass


def param_auth():
    from requests.auth import HTTPBasicAuth, HTTPDigestAuth

    ret = requests.get('https://api.github.com/user', auth=HTTPBasicAuth('wupeiqi', 'sdfasdfasdf'))
    print(ret.text)

    # ret = requests.get('http://192.168.1.1',
    #                    auth=HTTPBasicAuth('admin', 'admin'))
    # ret.encoding = 'gbk'
    # print(ret.text)

    # ret = requests.get('http://httpbin.org/digest-auth/auth/user/pass', auth=HTTPDigestAuth('user', 'pass'))
    # print(ret)


def param_timeout():
    # ret = requests.get('http://google.com/', timeout=1)
    # print(ret)

    # ret = requests.get('http://google.com/', timeout=(5, 1))
    # print(ret)
    pass


def param_allow_redirects():
    ret = requests.get('http://127.0.0.1:8000/test/', allow_redirects=False)
    print(ret.text)


def param_proxies():
    # proxies = {
    #     "http": "61.172.249.96:80",
    #     "https": "http://61.185.219.126:3128",
    # }

    # proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

    # ret = requests.get("http://www.proxy360.cn/Proxy", proxies=proxies)
    # print(ret.headers)

    # Proxy with authentication:
    # from requests.auth import HTTPProxyAuth
    #
    # proxyDict = {
    #     'http': '77.75.105.165',
    #     'https': '77.75.105.165'
    # }
    # auth = HTTPProxyAuth('username', 'mypassword')
    #
    # r = requests.get("http://www.google.com", proxies=proxyDict, auth=auth)
    # print(r.text)

    pass


def param_stream():
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    print(ret.content)
    ret.close()

    # from contextlib import closing
    # with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    #     # Process the response here
    #     for i in r.iter_content():
    #         print(i)


def requests_session():
    import requests

    session = requests.Session()

    # 1. Visit any page first to obtain a cookie
    i1 = session.get(url="http://dig.chouti.com/help/service")

    # 2. Log in, carrying the previous cookie; the server authorizes the 'gpsd' value inside it
    i2 = session.post(
        url="http://dig.chouti.com/login",
        data={
            'phone': "8615131255089",
            'password': "xxxxxx",
            'oneMonth': ""
        }
    )

    i3 = session.post(
        url="http://dig.chouti.com/link/vote?linksId=8589623",
    )
    print(i3.text)

Parameter examples
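That a `Session` really does attach its stored cookies to later requests can be checked without any network access, by preparing a request through the session instead of sending it. A sketch: the cookie name `gpsd` and the URL come from the login example above, and the cookie value here is a made-up stand-in for what the server would set:

```python
import requests

session = requests.Session()
# Pretend this cookie was set by an earlier login response
session.cookies.set('gpsd', 'abc123')

# prepare_request merges session state (cookies, headers) into the request
prepared = session.prepare_request(requests.Request('GET', 'http://dig.chouti.com/'))
print(prepared.headers.get('Cookie'))  # gpsd=abc123
```

This is why step 2 in `requests_session` works: the cookie obtained in step 1 rides along automatically on every subsequent request made through the same session.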

 

Official documentation: http://cn.python-requests.org/zh_CN/latest/user/quickstart.html#id4

BeautifulSoup

BeautifulSoup is a module that takes an HTML or XML string and parses it into a tree; you can then use the methods it provides to quickly locate specific elements, which makes searching HTML or XML documents simple.

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
asdf
    <div class="title">
        <b>The Dormouse's story總共</b>
        <h1>f</h1>
    </div>
<div class="story">Once upon a time there were three little sisters; and their names were
    <a  class="sister0" id="link1">Els<span>f</span>ie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</div>
ad<br/>sf
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="lxml")
# Find the first a tag
tag1 = soup.find(name='a')
# Find all a tags
tag2 = soup.find_all(name='a')
# Find the tag with id=link2
tag3 = soup.select('#link2')

Installation:

pip3 install beautifulsoup4
pip3 install lxml  # parser backend required by features="lxml" in the examples

Usage example:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
    ...
</body>
</html>
"""

soup = BeautifulSoup(html_doc, features="lxml")

1. name, the tag name

# tag = soup.find('a')
# name = tag.name    # get
# print(name)
# tag.name = 'span'  # set
# print(soup)

2. attrs, the tag attributes

# tag = soup.find('a')
# attrs = tag.attrs          # get
# print(attrs)
# tag.attrs = {'ik': 123}    # set (replace all attributes)
# tag.attrs['id'] = 'iiiii'  # set a single attribute
# print(soup)
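A runnable version of the above, on a one-line document (using the stdlib `html.parser` backend so no extra parser is needed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><a id="link1" class="sister">Elsie</a></body>',
                     features='html.parser')

tag = soup.find('a')
print(tag.attrs['class'])  # ['sister'] -- multi-valued attributes come back as lists
tag.attrs['id'] = 'iiiii'  # set a single attribute
print(tag.attrs['id'])     # iiiii
```

Note that `class` is treated as multi-valued, so it is returned as a list even when there is only one class.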

3. children, all direct child nodes

# body = soup.find('body')
# v = body.children

4. descendants, all descendant nodes (children, grandchildren, and so on)

# body = soup.find('body')
# v = body.descendants
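The difference between the two generators can be seen on a tiny document: `children` yields only the direct children, while `descendants` walks the whole subtree, including text nodes. A sketch using `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><div><a>x</a></div></body>', features='html.parser')
body = soup.find('body')

children = list(body.children)        # direct children only: just the div
descendants = list(body.descendants)  # div, a, and the text node 'x'
print(len(children), len(descendants))  # 1 3
```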

5. clear, removes all of a tag's children (the tag itself is kept)

# tag = soup.find('body')
# tag.clear()
# print(soup)

6. decompose, recursively removes the tag and all its children, destroying them

# body = soup.find('body')
# body.decompose()
# print(soup)

7. extract, recursively removes the tag and all its children, and returns the removed tag

# body = soup.find('body')
# v = body.extract()
# print(soup)
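The practical difference between `decompose` and `extract`: both remove the tag from the tree, but `extract` hands the removed tag back so you can keep working with it, while `decompose` destroys it in place. A sketch using `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a>one</a><b>two</b></div>', features='html.parser')

a = soup.find('a').extract()  # removed from the tree, but still usable
soup.find('b').decompose()    # removed and destroyed

print(a.text)  # one
print(soup)    # <div></div>
```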

8. decode, serialize to a string (including the current tag); decode_contents (excluding the current tag)

# body = soup.find('body')
# v = body.decode()
# v = body.decode_contents()
# print(v)
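Side by side on a small document, the including/excluding distinction looks like this (a sketch using `html.parser`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><a>x</a></body>', features='html.parser')
body = soup.find('body')

print(body.decode())           # <body><a>x</a></body>  (tag included)
print(body.decode_contents())  # <a>x</a>               (tag excluded)
```

`encode`/`encode_contents` behave the same way but return bytes instead of a string.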

9. encode, serialize to bytes (including the current tag); encode_contents (excluding the current tag)

# body = soup.find('body')
# v = body.encode()
# v = body.encode_contents()
# print(v)

10. find, get the first matching tag

# tag = soup.find('a')
# print(tag)
# tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
# tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
# print(tag)

11. find_all, get all matching tags

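A minimal sketch of the contrast with `find`, run against a trimmed fragment of the sample document from the start of this section (using `html.parser`): `find` returns only the first match, while `find_all` returns every match as a list.

```python
from bs4 import BeautifulSoup

html_doc = '''
<a class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
'''
soup = BeautifulSoup(html_doc, features='html.parser')

first = soup.find(name='a', attrs={'class': 'sister'})
print(first.text)  # Lacie

all_tags = soup.find_all(name='a')
print([t.attrs['id'] for t in all_tags])  # ['link2', 'link3']
```

`find_all` accepts the same filters as `find` (`name`, `attrs`, `class_`, `text`, `recursive`), plus a `limit` argument to cap the number of results.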