
My First Web Crawler Example



1. This is the first crawler I've ever written, so please forgive the rough code. In the end I found that after too many requests the site blacklisted my IP, so a proxy is really needed. I don't yet know how to set one up, so I'm leaving it as is for now.

I'd appreciate it if someone more experienced could show how to change this to crawl through a proxy.
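On the proxy question: requests does support routing traffic through a proxy via the proxies mapping. Below is a minimal sketch; the proxy address 127.0.0.1:8888 is a placeholder, not a working proxy, and the Session setup is my own suggestion rather than anything from the original code.

```python
import requests

# Hypothetical proxy address -- replace with a real HTTP proxy you control.
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}

# A Session applies the proxy (and headers) to every request made through it.
session = requests.Session()
session.proxies.update(proxies)
session.headers["User-Agent"] = "Mozilla/5.0"

# Every call through the session would then go via the proxy, e.g.:
# html = session.get("http://www.btrenren.com/index.php/Index/index/p/1.html")
```

Rotating through a pool of such proxies between requests is the usual way to avoid the blacklisting described above.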

This is my first attempt at scraping all the movie information from a site (for reference only).

The basic idea is to first fetch the information on the first page,

then read the page count from the pagination bar to decide how many pages to crawl.
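That "read the page count from the pagination bar" step can be sketched in isolation. The HTML snippet below is my guess at the site's pagelist markup (the real markup may differ); the parsing calls mirror the ones used in num_video() in the full code.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup, modeled on the selectors the crawler uses;
# the real page's HTML may differ.
snippet = """
<ul class="pagelist">
  <span class="rows">共 1509 頁</span>
</ul>
"""

soup = BeautifulSoup(snippet, "html.parser")
rows = soup.find("ul", class_="pagelist").find_all("span", class_="rows")
# get_text() yields the span text; splitting on the space picks out the number.
num = rows[0].get_text().split(' ')[1]
print(num)
```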

# The next step is to scrape each movie's torrent and download it locally; I'll write that part when I have time.

Here is the full code:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor


root = "http://www.btrenren.com"

def bt_video(num):
    url = "http://www.btrenren.com/index.php/Index/index/p/%s.html" % num
    head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"}
    html = requests.get(url=url, headers=head)
    soup = BeautifulSoup(html.content, "lxml")
    data = soup.find("div", class_="ml")
    data = data.find_all(class_="title")
    with open("video.txt", "a", encoding="utf-8") as f:
        for i in data:
            # movie title
            name = i.find_all("font")[1].get_text().split("/")[0]
            # release date and country
            vlue = i.find_all(class_="des")[0].get_text().split("\n")[0]
            # actors
            vlue1 = i.find_all(class_="des")[0].get_text().split("\n")[1]
            # Douban rating
            douban = i.find(class_="rt").get_text()
            # detail-page URL
            href = root + i.find_all("a")[1].get("href")
            f.write("%s,%s,%s,%s,%s\n" % (name, vlue, vlue1, douban, href))
    print(num)

def num_video():
    url = "http://www.btrenren.com/index.php/Index/index/p/1.html"
    head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36 LBBROWSER"}
    html = requests.get(url=url, headers=head)
    soup = BeautifulSoup(html.content, "lxml")
    number = soup.find("ul", class_="pagelist")
    number = number.find_all("span", class_="rows")
    # the total page count sits in the first "rows" span of the pagination bar
    num = number[0].get_text().split(' ')[1]
    return num

if __name__ == "__main__":
    executor = ThreadPoolExecutor(max_workers=30)
    num = int(num_video())
    for i in range(1, num):
        executor.submit(bt_video, i)
    executor.shutdown()
    exit()
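One caveat with the code above: 30 worker threads all append to video.txt concurrently, so lines from different pages can interleave mid-write. A minimal sketch of serializing the writes with a threading.Lock follows; the file name video_demo.txt and the fake_page worker are stand-ins for illustration, not part of the original crawler.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

write_lock = threading.Lock()

def write_lines(path, lines):
    # Only one thread at a time may append, so whole lines never interleave.
    with write_lock:
        with open(path, "a", encoding="utf-8") as f:
            for line in lines:
                f.write(line + "\n")

# Stand-in for a per-page scrape: each "page" produces a few records.
def fake_page(num):
    write_lines("video_demo.txt", ["page-%s-row-%s" % (num, i) for i in range(3)])

open("video_demo.txt", "w", encoding="utf-8").close()  # start fresh

with ThreadPoolExecutor(max_workers=5) as executor:
    for i in range(1, 11):
        executor.submit(fake_page, i)

with open("video_demo.txt", encoding="utf-8") as f:
    print(sum(1 for _ in f))  # 10 pages x 3 rows = 30 lines
```

The same pattern drops into bt_video() directly: collect the page's rows into a list, then hand the whole list to a lock-protected writer.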
