Python 3 Web Scraping in Practice (the requests Module)
阿新 • Published: 2018-12-27
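Last time I walked through two hands-on examples showing how to build a crawler with the urllib module (http://blog.csdn.net/mr_blued/article/details/79180017). This time I want to introduce a better module for writing crawlers: requests.
Before building a crawler with requests, it helps to understand the HTTP protocol and the common request methods (such as GET and POST).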
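To make the difference between the two most common request methods concrete, here is a minimal sketch that builds a GET and a POST request with requests without sending either. The example.com URLs and the parameter names are placeholders of mine, not part of the original examples.

```python
import requests

# Hypothetical placeholder URLs; nothing is sent over the network here.
LIST_URL = "http://example.com/list"
SEARCH_URL = "http://example.com/search"

# GET: parameters travel in the URL's query string.
get_req = requests.Request("GET", LIST_URL, params={"page": "2"}).prepare()

# POST: form data travels in the request body.
post_req = requests.Request("POST", SEARCH_URL, data={"q": "python"}).prepare()

print(get_req.method, get_req.url)
print(post_req.method, post_req.url, post_req.body)
print(post_req.headers["Content-Type"])
```

In everyday use you would call requests.get(url, params=...) or requests.post(url, data=...) directly; preparing the request first is just a convenient way to inspect what would be sent.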
Enough preamble; let's get to the point.
Rewriting last time's examples with the requests module
1. Scraping Meizitu (target URL: http://www.meizitu.com/)
import os
import re

import requests


def url_open(url):
    # Add the request headers as a dictionary
    header = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"
    }
    # Send a GET request to fetch the page source
    response = requests.get(url, headers=header, timeout=10)
    return response


def find_imgs(url):
    # Collect every .jpg address that appears in an <img src="..."> tag
    html = url_open(url).text
    p = r'<img src="([^"]+\.jpg)"'
    img_addrs = re.findall(p, html)
    return img_addrs


def download_mm(url, folder='OOXX'):
    # Create the output folder if needed, then work inside it
    os.makedirs(folder, exist_ok=True)
    os.chdir(folder)

    page_num = 1    # Start from the first page; change as you like
    x = 0           # Counter used to name the images
    img_addrs = []  # Addresses seen so far, to avoid duplicate images

    # Only crawl the first two pages (adjustable)
    while page_num <= 2:
        page_url = url + 'a/more_' + str(page_num) + '.html'
        addrs = find_imgs(page_url)
        print(len(addrs))
        for i in addrs:
            if i not in img_addrs:
                img_addrs.append(i)
        print(len(img_addrs))
        page_num += 1

    # Download each unique image once and rename it 0.jpg, 1.jpg, ...
    for each in img_addrs:
        print(each)
        filename = str(x) + '.' + each.split('.')[-1]
        x += 1
        with open(filename, 'wb') as f:
            img = url_open(each).content
            f.write(img)


if __name__ == '__main__':
    url = 'http://www.meizitu.com/'
    download_mm(url)
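The regular expression, the duplicate check, and the numbered file names in the code above can be tried offline. The following is a small sketch of those three steps run against a made-up HTML snippet; the snippet, the example.com addresses, and the helper names extract_unique_jpgs and numbered_filenames are my own illustrations, not the real page or part of the original script.

```python
import re

# Hypothetical HTML standing in for a downloaded page.
SAMPLE_HTML = '''
<div><img src="http://example.com/pics/a.jpg" alt="a"></div>
<div><img src="http://example.com/pics/b.jpg"></div>
<div><img src="http://example.com/pics/a.jpg"></div>
<div><img src="http://example.com/icons/logo.png"></div>
'''

# Same pattern as in the crawler: capture the src value when it ends in .jpg.
IMG_PATTERN = r'<img src="([^"]+\.jpg)"'


def extract_unique_jpgs(html):
    """Return the .jpg addresses in first-seen order, without duplicates."""
    seen = []
    for addr in re.findall(IMG_PATTERN, html):
        if addr not in seen:
            seen.append(addr)
    return seen


def numbered_filenames(addrs):
    """Name files 0.jpg, 1.jpg, ... using each address's extension."""
    return [str(i) + '.' + addr.split('.')[-1] for i, addr in enumerate(addrs)]


unique = extract_unique_jpgs(SAMPLE_HTML)
print(unique)
print(numbered_filenames(unique))
```

Because `[^"]+` cannot cross a closing quote, the .png image is skipped, and the repeated a.jpg address is kept only once.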
2. Scraping images from Baidu Tieba (target URL: https://tieba.baidu.com/p/5085123197)
import os
import re

import requests


def open_url(url):
    # Send a GET request with a browser-like User-Agent header
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0"
    }
    response = requests.get(url, headers=headers, timeout=10)
    return response


def find_img(url):
    html = open_url(url).text
    # Images posted in a thread carry the class "BDE_Image"
    p = r'<img class="BDE_Image" src="([^"]+\.jpg)"'
    img_addrs = re.findall(p, html)

    for each in img_addrs:
        print(each)
        # Use the last path segment of the address as the file name
        file = each.split("/")[-1]
        with open(file, "wb") as f:
            img = open_url(each).content
            f.write(img)


def get_img(url):
    # Create the output folder if needed, then download into it
    os.makedirs("TieBaTu", exist_ok=True)
    os.chdir("TieBaTu")
    find_img(url)


if __name__ == "__main__":
    url = 'https://tieba.baidu.com/p/5085123197'
    get_img(url)
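Taking the last "/"-separated segment of an address as the file name works when the address ends in a plain file name. If an image address ever carries a query string, that segment would include it. A hedged alternative, using only the standard library, is sketched below; the helper name filename_from_url and the sample address are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(img_url):
    """Return the last path segment of a URL, ignoring any query string."""
    path = urlparse(img_url).path
    return posixpath.basename(path)


# Hypothetical address with a query string, for illustration only.
sample = "https://example.com/forum/pic/item/abc123.jpg?size=large"

print(sample.split("/")[-1])      # naive split keeps the query string
print(filename_from_url(sample))  # path-based version drops it
```

For the thread used above, both approaches should give the same result as long as the image addresses end in the file name.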
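Summary:
1. Get familiar with the methods of the requests module, and learn the HTTP protocol and the common request methods.
2. Understand a site's anti-scraping strategies and prepare matching countermeasures.
3. Know what the other modules used here (os, re) do.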
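As one illustration of point 2, two simple countermeasures are sending a browser-like User-Agent (as both examples already do) and retrying politely with growing pauses when a request fails. The sketch below is only one possible approach under my own assumptions: the names fetch_with_retry and requests_fetch, the retry count, and the delay values are all illustrative choices, not something the original examples use.

```python
import random
import time

# Browser-like header, shortened from the one used in the examples above.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36",
}


def fetch_with_retry(url, fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times, pausing longer after each failure."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < retries - 1:
                # Exponential backoff plus a little random jitter
                sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise last_error


def requests_fetch(url):
    """Fetch a page with requests, raising on HTTP error status codes."""
    import requests  # imported here so fetch_with_retry has no hard dependency
    response = requests.get(url, headers=DEFAULT_HEADERS, timeout=10)
    response.raise_for_status()
    return response
```

A call such as fetch_with_retry(page_url, requests_fetch) could then stand in for a direct requests.get in either crawler. Always check a site's terms and robots.txt before crawling it.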