Repost: The difference between requests.get() and urlopen() for fetching web pages in Python

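When writing a crawler, we inevitably use either urlopen() from urllib or requests.get() to request a page and fetch its content. The difference is as follows. urlopen() opens a URL; its url argument can be either a URL string or a Request object, and for HTTP and HTTPS URLs it returns an http.client.HTTPResponse object. That object provides the methods read(), readinto(), getheader(), getheaders() and fileno(), plus the attributes msg, version, status, reason, debuglevel and closed. In practice, read() is usually followed by decode(). One big advantage here is that the returned page content has not been decoded: read() yields raw bytes, and you choose the decoding by passing the encoding name to decode().
requests.get(), by contrast, requests the URL and returns a Response object, from which you can read the type of the result, the status code, the encoding, the cookies and so on.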
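The read-then-decode step described above can be sketched as follows. This is a minimal sketch under some assumptions: `fetch_text` and its `fallback` parameter are hypothetical names, and a `data:` URL stands in for a real site so the example runs offline. For a `data:` URL the object returned by urlopen() is not an http.client.HTTPResponse, but read() and the headers behave the same way for this purpose.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_text(url, fallback="utf-8"):
    """Return (raw bytes, decoded text) for the resource at url."""
    with urlopen(url) as resp:
        raw = resp.read()  # undecoded bytes
        # Use the charset declared in Content-Type, else the fallback
        # (the fallback default is an assumption of this sketch).
        charset = resp.headers.get_content_charset() or fallback
    return raw, raw.decode(charset)


# A data: URL carrying a small UTF-8 page, percent-encoded so it is a valid URL.
page = "<title>百度一下，你就知道</title>"
demo_url = "data:text/html;charset=utf-8," + quote(page)

raw, text = fetch_text(demo_url)
print(type(raw).__name__)  # the body arrives as bytes
print(text)                # the decoded str
```

With a real site the call would be `fetch_text("https://www.baidu.com")`; if the server declares no charset, the fallback decides how the bytes are decoded.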

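from urllib.request import urlopen
import requests

# requests: .content is the raw body as bytes; decode it explicitly as UTF-8
data_get = requests.get("https://www.baidu.com").content.decode("utf-8")

# urllib: read() returns the raw, undecoded bytes
html_url = urlopen("https://www.baidu.com")
data_url = html_url.read()

# data_get is a str, so write it in text mode with an explicit encoding
with open("data_get.html", "w", encoding="utf-8") as f:
    f.write(data_get)

print(data_get)
print("------------------------\n")
print(data_url)

# data_url is bytes, so write it in binary mode
with open("data_url.html", "wb") as f:
    f.write(data_url)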
   

The contents of data_get.html are as follows:

<!DOCTYPE html>
<!--STATUS OK--><html>
<head>
    <meta http-equiv=content-type content=text/html;charset=utf-8>
    <meta http-equiv=X-UA-Compatible content=IE=Edge>
    <meta content=always name=referrer>
      <link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css>
        <title>百度一下，你就知道</title>
      </head>
      <body link=#0000cc> 
          <div id=wrapper>
            <div id=head>
              <div class=head_wrapper>
                <div class=s_form>
                  <div class=s_form_wrapper>
                    <div id=lg>
                      <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129>
                    </div>
                    <form id=form name=f action=//www.baidu.com/s class=fm>
                      <input type=hidden name=bdorz_come value=1>
                      <input type=hidden name=ie value=utf-8>
                      <input type=hidden name=f value=8>
                      <input type=hidden name=rsv_bp value=1>
                      <input type=hidden name=rsv_idx value=1>
                      <input type=hidden name=tn value=baidu>
                      <span class="bg s_ipt_wr">
                        <input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus>
                      </span><span class="bg s_btn_wr">
                        <input type=submit id=su value=百度一下 class="bg s_btn" autofocus>
                      </span>
                    </form>
                  </div>
                </div>
                <div id=u1>
                  <a href=http://news.baidu.com name=tj_trnews class=mnav>新聞</a>
                   <a href=https://www.hao123.com name=tj_trhao123 class=mnav>hao123</a>
                   <a href=http://map.baidu.com name=tj_trmap class=mnav>地圖</a>
                   <a href=http://v.baidu.com name=tj_trvideo class=mnav>視訊</a>
                   <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>貼吧</a>
                   <noscript>
                     <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登入</a>
                   </noscript>
                   <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登入</a>');
                </script>
                <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多產品</a>
              </div>

            </div>
          </div>
          <div id=ftCon>
            <div id=ftConw>
              <p id=lh>
                <a href=http://home.baidu.com>關於百度</a>
                <a href=http://ir.baidu.com>About Baidu</a> </p>
                <p id=cp>&copy;2017&nbsp;Baidu&nbsp;
                  <a href=http://www.baidu.com/duty/>使用百度前必讀</a>&nbsp;
                  <a href=http://jianyi.baidu.com/ class=cp-feedback>意見反饋</a>&nbsp;京ICP證030173號&nbsp;
                  <img src=//www.baidu.com/img/gs.gif>
                </p>
              </div>
            </div>
          </div>
        </body>
        </html>
   

data_url is as follows:

<html>
<head>
    <script>
        location.replace(location.href.replace("https://","http://"));
    </script>
</head>
<body>
    <noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
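As noted at the top, the url argument of urlopen() can also be a Request object, which is how extra request headers are attached. The following is a minimal sketch: `build_request` is a hypothetical helper and the User-Agent value is made up for illustration. Whether sending such a header changes what the server returns for the page above is an assumption that has not been verified here.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Build a GET Request carrying a custom User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# The User-Agent string below is an illustrative placeholder.
req = build_request("https://www.baidu.com", "Mozilla/5.0 (example)")

print(req.full_url)
print(req.get_method())
# Request stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))

# The object can then be passed in place of the URL string:
# data_url = urlopen(req).read()
```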
   