Python Web Scraping Basics (Part 1)
I have been learning Python recently and looked into web scraping along the way. Here are some scraping basics (based on Python 2.7):
Three ways to fetch page data:
# encoding=utf-8
import urllib2

def download1(url):
    # read() fetches the entire response body by default;
    # read(100) fetches only the first 100 bytes
    return urllib2.urlopen(url).read()

def download2(url):
    # readlines() returns the body as a list of lines
    return urllib2.urlopen(url).readlines()

def download3(url):
    # read the response line by line
    response = urllib2.urlopen(url)
    while True:
        line = response.readline()
        if not line:
            break
        print line

url = "http://www.baidu.com"
download3(url)
These are all based on the urllib2 module and are fairly straightforward.
Impersonating a Browser
Many sites now deploy anti-scraping measures to keep their data from being crawled. To keep a scraper working in that situation, there are two approaches I have learned so far: one is to send a randomly chosen header, the other is to use a framework that simulates a real browser; the underlying idea is much the same.
Adding a random header:
import urllib2

def download(url):
    # header = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)"}
    header = {"User-Agent": "UCWEB7.0.2.37/28/999"}
    request = urllib2.Request(url=url, headers=header)
    # add another (custom) header
    request.add_header("name", "zhangsan")
    # send the request
    response = urllib2.urlopen(request)
    print "result: " + str(response.code)
    print response.read()

download("http://www.baidu.com")
In practice we can pick the header at random; the code above simulates IE and the mobile UC Browser respectively. There are plenty of User-Agent strings online to draw from, so you can select one at random for each request. Below is a list collected from the web that you can use:
pcUserAgent = {
    "safari 5.1 – MAC":"User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "safari 5.1 – Windows":"User-Agent:Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "IE 9.0":"User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
    "IE 8.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "IE 7.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "IE 6.0":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Firefox 4.0.1 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Firefox 4.0.1 – Windows":"User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera 11.11 – MAC":"User-Agent:Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera 11.11 – Windows":"User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Chrome 17.0 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Maxthon":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Tencent TT":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "The World 2.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "The World 3.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "sogou 1.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "360":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Avant":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Green Browser":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
}
mobileUserAgent = {
    "iOS 4.33 – iPhone":"User-Agent:Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "iOS 4.33 – iPod Touch":"User-Agent:Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "iOS 4.33 – iPad":"User-Agent:Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Android N1":"User-Agent: Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Android QQ":"User-Agent: MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Android Opera ":"User-Agent: Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Android Pad Moto Xoom":"User-Agent: Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "BlackBerry":"User-Agent: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "WebOS HP Touchpad":"User-Agent: Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Nokia N97":"User-Agent: Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Windows Phone Mango":"User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UC":"User-Agent: UCWEB7.0.2.37/28/999",
    "UC standard":"User-Agent: NOKIA5700/ UCWEB7.0.2.37/28/999",
    "UCOpenwave":"User-Agent: Openwave/ UCWEB7.0.2.37/28/999",
    "UC Opera":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
}
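The random selection mentioned above can be sketched like this. Note that the strings in these lists embed a leading "User-Agent:" label, which should be stripped before the value is sent as a header. The helper below and its two-entry sample dict (excerpted from the lists above) are just an illustrative sketch:

```python
import random

def pick_user_agent(ua_dict):
    # choose a random entry and strip the embedded "User-Agent:" label if present
    value = random.choice(list(ua_dict.values()))
    if value.lower().startswith("user-agent:"):
        value = value.split(":", 1)[1].strip()
    return value

# two entries excerpted from the lists above
sample = {
    "IE 8.0": "User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "UC": "User-Agent: UCWEB7.0.2.37/28/999",
}

# build a clean header dict ready to pass to urllib2.Request
header = {"User-Agent": pick_user_agent(sample)}
```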
The second approach uses selenium, a web testing framework, to drive a real browser. A simple example:
import selenium            # web testing framework
import selenium.webdriver  # simulates a browser

def getJobNumberByName(name):
    target_url = "http://www.baidu.com"
    driver = selenium.webdriver.Chrome()  # launch a simulated browser
    driver.get(target_url)                # visit the link
    page_source = driver.page_source      # grab the page source
    print page_source
Selenium invokes a browser driver installed on the OS; without the corresponding setup, it raises an error saying the driver executable cannot be found. If you are using webdriver.Chrome(), you need chromedriver: download it, unzip it, note its path, and change the code to:
driver = selenium.webdriver.Chrome(chrome_driver_path)  # pass the path to the chromedriver executable
Consistent Encoding
This mainly concerns transmitting Chinese text: if Chinese characters are sent without encoding, the server will receive garbled content. Encode like this:
import urllib

words = {"name": "zhangsan", "address": "上海"}
print urllib.urlencode(words)                  # URL-encode
print urllib.unquote(urllib.urlencode(words))  # URL-decode
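Concretely, the Chinese value is turned into percent-encoded UTF-8, so the server receives plain ASCII on the wire. A small sketch of what the encoding produces (the try/except import is my addition; it simply lets the snippet also run on Python 3, where urlencode moved to urllib.parse):

```python
try:
    from urllib import urlencode, unquote  # Python 2
except ImportError:
    from urllib.parse import urlencode, unquote  # Python 3

words = {"name": "zhangsan", "address": u"上海".encode("utf-8")}
encoded = urlencode(words)
print(encoded)           # the Chinese address becomes percent-encoded UTF-8 bytes
print(unquote(encoded))  # decodes the percent-escapes again
```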
GET/POST Requests
GET and POST differ mainly in how parameters are passed: GET appends the parameters to the URL, while POST wraps them in the request body.
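That difference can be seen without sending anything: the same encoded dictionary either rides in the query string or becomes the request body. A minimal sketch (the local server URL matches the one used in this post; the try/except import just makes it run on both Python 2 and 3):

```python
try:
    from urllib import urlencode  # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

params = urlencode({"age": "23"})

get_url = "http://127.0.0.1:8090/query?" + params  # GET: parameters ride in the URL
post_body = params                                 # POST: parameters go in the request body
print(get_url)
```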
Use Python's Flask framework to build a simple server:
from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello World!'

@app.route("/login", methods=["POST"])
def login():
    name = request.form.to_dict().get("name", "")
    age = request.form.to_dict().get("age", "")
    return name + "-------" + age

@app.route("/query", methods=["GET"])
def query():
    age = request.args.get("age", "")
    return "this age is " + age

if __name__ == '__main__':
    app.run(
        "127.0.0.1",
        port=8090
    )
The GET request is then:
import urllib
import urllib2

words = {"age": "23"}
request = urllib2.Request(url="http://127.0.0.1:8090/query?" + urllib.urlencode(words))
response = urllib2.urlopen(request)
print response.read()
And the POST request:
import urllib
import urllib2

info = {"name": "Tom張", "age": "20"}
info = urllib.urlencode(info)  # the body also needs URL encoding
request = urllib2.Request("http://127.0.0.1:8090/login")
request.add_data(info)  # attaching data makes this a POST
response = urllib2.urlopen(request)
print response.read()
Image Download
import urllib
urllib.urlretrieve(image_url, save_path)  # (source image URL, local save path)
Proxies
When multiple scrapers share a single IP, a ban on that IP stops them all. This has spawned quite an ecosystem: search for keywords like "vps" on Taobao and you will find all kinds of commercial proxy services.
Of course, free proxies are also available: [proxy lists last checked 2018-04-21 16:14]
https://www.kuaidaili.com/free/ ## Kuaidaili
http://www.xicidaili.com/ ## Xicidaili
Using a proxy from Python:
import urllib2

http_proxy = urllib2.ProxyHandler({"http": "117.90.3.126:9000"})  # proxy IP and port
opener = urllib2.build_opener(http_proxy)
request = urllib2.Request("http://www.baidu.com")
response = opener.open(request)
print response.read()
Redirects
1. Check whether a URL was redirected:

import urllib2

# compare the final URL with the requested one
def url_is_redirect(url):
    response = urllib2.urlopen(url)
    return response.geturl() != url

print url_is_redirect("http://www.baidu.cn")
2. If it was redirected, get the new address:

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        res = urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
        res.status = code          # status code of the redirect
        res.newurl = res.geturl()  # the URL we ended up at
        print res.newurl, res.status
        return res

opener = urllib2.build_opener(RedirectHandler)
opener.open("http://www.baidu.cn")
Cookies
Fetching a series of related pages as one session requires cookies.
1. Getting cookies:
# encoding=utf-8
import urllib2
import cookielib

# create a cookie jar
cookie = cookielib.CookieJar()
# build a handler that captures cookies
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener around that handler
opener = urllib2.build_opener(handler)
response = opener.open("http://www.baidu.com")
for data in cookie:
    print data.name + "--" + data.value + "\r"
The output looks like this:
BAIDUID--2643F48FC95482FF4ECAD2EBC7DBE11E:FG=1
BIDUPSID--2643F48FC95482FF4ECAD2EBC7DBE11E
H_PS_PSSID--1466_21088_18560_22158
PSTM--1524360190
BDSVRTM--0
BD_HOME--0
2. Saving cookies to a file:
# encoding=utf-8
import urllib2
import cookielib

file_path = "cookie.txt"
cookie = cookielib.LWPCookieJar(file_path)     # jar bound to a file path
handler = urllib2.HTTPCookieProcessor(cookie)  # handler that captures the site's cookies
opener = urllib2.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_expires=True, ignore_discard=True)
After running this, our cookies are written to cookie.txt.
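To reuse those cookies in a later run, the jar can be loaded back from the same file. A sketch (the jar here is saved empty, since no request is made; after opener.open() it would hold the site's cookies, and the try/except import lets the snippet also run on Python 3, where cookielib became http.cookiejar):

```python
try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3

file_path = "cookie.txt"

# save the jar to disk (empty here; after opener.open() it would contain cookies)
jar = cookielib.LWPCookieJar(file_path)
jar.save(ignore_discard=True, ignore_expires=True)

# later: load the cookies back; the restored jar can be attached to a new opener
loaded = cookielib.LWPCookieJar()
loaded.load(file_path, ignore_discard=True, ignore_expires=True)
print(len(loaded))  # number of cookies restored from the file
```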
That covers most of the basics; I will come back and update the rest over time.