Python Series: Fetching Web Page Source with Requests (Crawler Basics)

阿新 • Published: 2019-01-03

1 Download and Installation

See other tutorials.

2 Introduction to Requests

Requests is an Apache2 Licensed HTTP library, written in Python, for human beings.

Python's standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time, and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

Requests takes all of the work out of Python HTTP/1.1, making your integration with web services seamless. There's no need to manually add query strings to your URLs, or to form-encode your POST data. Keep-alive and HTTP connection pooling are 100% automatic, powered by urllib3, which is embedded within Requests.
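
To make the claim about query strings and form encoding concrete, here is a minimal sketch that builds two requests without sending them. The host `example.com` and the parameter names are placeholders chosen for illustration, not something from the original material.

```python
import requests

# Build (but do not send) a GET request; Requests appends the query string.
# "http://example.com/search" and the parameters are illustrative placeholders.
get_req = requests.Request(
    "GET",
    "http://example.com/search",
    params={"q": "python", "page": 2},
).prepare()
print(get_req.url)

# Build (but do not send) a POST request; Requests form-encodes the dict.
post_req = requests.Request(
    "POST",
    "http://example.com/submit",
    data={"entities_only": "true", "page": "2"},
).prepare()
print(post_req.headers["Content-Type"])
print(post_req.body)
```

Preparing a request this way is handy for checking what would go over the wire before pointing a crawler at a real site.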

3 Fetching Page Source (GET Method)

Fetching the source directly
Fetching the source with a modified HTTP header

Fetching directly:

import requests

html = requests.get('http://www.baidu.com')
print(html.text)

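With a modified HTTP header:

import requests

# Python 3 strings are Unicode, so the Python 2 workaround
# (reload(sys) plus sys.setdefaultencoding("utf-8")) is no longer needed.

# hea is a dictionary we build ourselves; it holds the User-Agent.
# It makes the target site treat this program as a browser rather than a crawler.
# Copy the value from the Request Headers shown by the browser's "Inspect Element" tool.
hea = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}

html = requests.get('http://jp.tingroom.com/yuedu/yd300p/', headers=hea)

# Decode the response body as UTF-8; otherwise Chinese text comes out garbled.
html.encoding = 'utf-8'
print(html.text)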
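
Why the `encoding` line matters: as far as I understand, Requests picks the text encoding from the HTTP headers, and when a `text/*` response declares no charset it falls back to ISO-8859-1. The sketch below reproduces the resulting garbling with plain byte strings; the sample text is made up and no network access is involved.

```python
# UTF-8 bytes, as a server might send them for a page containing CJK text.
# The sample string is an arbitrary illustration.
raw = "日本語".encode("utf-8")

# Decoding with the wrong codec does not fail; it silently yields mojibake.
garbled = raw.decode("iso-8859-1")

# Decoding with the right codec restores the original characters.
correct = raw.decode("utf-8")

print(repr(garbled))
print(correct)
```

Setting `html.encoding = 'utf-8'` before reading `html.text` corresponds to choosing the second decode. When the right codec is unknown, `html.apparent_encoding` offers a guess based on the body content.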

4 Extraction with Regular Expressions

import re

import requests

# Python 3 strings are Unicode, so the Python 2 workaround
# (reload(sys) plus sys.setdefaultencoding("utf-8")) is no longer needed.

# hea is a dictionary we build ourselves; it holds the User-Agent.
# It makes the target site treat this program as a browser rather than a crawler.
# Copy the value from the Request Headers shown by the browser's "Inspect Element" tool.
hea = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36'}

html = requests.get('http://jp.tingroom.com/yuedu/yd300p/', headers=hea)

# Decode the response body as UTF-8; otherwise Chinese text comes out garbled.
html.encoding = 'utf-8'

# Regular-expression part: once you spot the pattern in the markup,
# a regex pulls the content out.
title = re.findall('color:#666666;">(.*?)</span>', html.text, re.S)
for each in title:
    print(each)

chinese = re.findall('color: #039;">(.*?)</a>', html.text, re.S)
for each in chinese:
    print(each)
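
The two details that make these patterns work are the non-greedy group `(.*?)` and the `re.S` flag. The sketch below shows both on a small HTML fragment that I made up to resemble the markup targeted above; the real page may be structured differently.

```python
import re

# Hypothetical fragment imitating the target markup (not copied from the site).
SAMPLE_HTML = """
<span style="color:#666666;">Lesson 1
Morning greetings</span>
<a href="/yuedu/1.html" style="color: #039;">Translation one</a>
<span style="color:#666666;">Lesson 2</span>
"""

PATTERN = r'color:#666666;">(.*?)</span>'

# re.S lets "." match newlines, so a title spanning two lines is captured.
with_dotall = re.findall(PATTERN, SAMPLE_HTML, re.S)

# Without re.S the two-line title cannot match and is skipped.
without_dotall = re.findall(PATTERN, SAMPLE_HTML)

# A greedy group runs to the LAST </span>, merging everything into one match.
greedy = re.findall(r'color:#666666;">(.*)</span>', SAMPLE_HTML, re.S)

# Collapse internal whitespace so multi-line titles print on one line.
titles = [" ".join(t.split()) for t in with_dotall]
print(titles)
```

For anything beyond quick experiments an HTML parser is usually more robust than regular expressions, but the approach above matches what this tutorial does.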

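5 Submitting Data to a Web Page (POST Method)

[Second figure: image missing]

The form is built as the data dictionary in the code below. Why change the page number in that dictionary? Because the target site loads its content asynchronously: it does not deliver everything you want to crawl in one response, so you have to crawl one page at a time, changing the number for each request.

The code crawls the company names (the titles) on the target site.

Code (with an explanation of how it works):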

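# -*- coding: utf-8 -*-
import re

import requests

# Use Chrome's Inspect Element -> Network panel.
# Much of the information (URL, page, request method, ...) has to be read from there.

# The original target address, which cannot be used as the request URL:
# url = 'https://www.crowdfunder.com/browse/deals'

# The POST form is submitted to this link instead.
url = 'https://www.crowdfunder.com/browse/deals&template=false'

# GET, for comparison:
# html = requests.get(url).text
# print(html)

# The page number is written as a string here, matching the form data
# shown in the Network panel.
# Change the page value to fetch a different page.
data = {
    'entities_only': 'true',
    'page': '2'
}

html_post = requests.post(url, data=data)
title = re.findall('"card-title">(.*?)</div>', html_post.text, re.S)
for each in title:
    print(each)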
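
Crawling "one page at a time" amounts to repeating the POST with a different page value. Here is a rough sketch under several assumptions: the helper names `build_form`, `extract_titles` and `crawl` are mine, the timeout is arbitrary, and I have not checked whether the site still answers at this URL or still uses this markup.

```python
import re

import requests

# URL and form fields are taken from the example above.
URL = 'https://www.crowdfunder.com/browse/deals&template=false'
CARD_TITLE = re.compile(r'"card-title">(.*?)</div>', re.S)


def build_form(page):
    """Return the form data for one page (hypothetical helper)."""
    return {'entities_only': 'true', 'page': str(page)}


def extract_titles(html):
    """Pull the card titles out of one response body, trimming whitespace."""
    return [t.strip() for t in CARD_TITLE.findall(html)]


def crawl(first_page, last_page):
    """POST once per page and collect every title (performs network requests)."""
    titles = []
    # A Session reuses the underlying connection across requests.
    with requests.Session() as session:
        for page in range(first_page, last_page + 1):
            resp = session.post(URL, data=build_form(page), timeout=10)
            resp.raise_for_status()  # stop on HTTP errors instead of parsing them
            titles.extend(extract_titles(resp.text))
    return titles
```

Calling `crawl(1, 3)` would request pages 1 through 3 in turn. Nothing is sent until that call is made, so the helpers can be tried on saved HTML first.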

Material taken from Jikexueyuan (極客學院).
