Large-Scale Data Scraping, Stored in MongoDB
阿新 • Published 2019-02-07
I'm new to web scraping; this project crawls all the second-hand product listings on Ganji (趕集網).
It uses a multiprocessing pool and stores everything in MongoDB; I don't know yet how long the full crawl will take.
One thing to watch out for: when a crawl has two stages, one pass that collects each listing's link and one pass that scrapes each listing's detail page, it is best to collect all the links first and persist them in the database, then read each link back out of the database and scrape its details. Don't dive into a listing's details the moment you find its link: that is not only slower, but detail scraping often hits network timeouts and similar errors that kill the crawler, and if links and details are fetched in a single pass, every crash means starting over from scratch. With the links persisted, a second pass can simply resume from the stored URLs (a sketch of that second pass follows the entry script below).
# Entry script: pass one, collect every listing URL in each channel
from multiprocessing import Pool

from channel_extract import channel
from page_parsing import get_links


def get_all_link_from(channel):
    # Walk the first 100 list pages of one channel
    for i in range(1, 100):
        get_links(channel, i)


if __name__ == '__main__':
    # One worker process per channel URL
    pool = Pool()
    pool.map(get_all_link_from, channel.split())
    pool.close()
    pool.join()
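The script above only runs pass one. For pass two, here is a minimal sketch (my addition, not from the original post) of a runner that reads the stored URLs back out of MongoDB and scrapes their details. The set difference against item_info is what lets an interrupted crawl resume where it left off; it assumes get_item_info saves a 'url' field with each item, as the version of page_parsing.py below does.

# Second-pass runner (sketch): scrape details for every stored URL
from multiprocessing import Pool

from page_parsing import get_item_info, url_list, item_info

# URLs collected in pass one, minus URLs whose details are already saved,
# so a crashed crawl resumes instead of starting over
db_urls = {item['url'] for item in url_list.find()}
index_urls = {item['url'] for item in item_info.find()}
rest_of_urls = db_urls - index_urls

if __name__ == '__main__':
    pool = Pool()
    pool.map(get_item_info, rest_of_urls)
    pool.close()
    pool.join()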
# Monitor script: print how many URLs have been collected so far
import time

from page_parsing import url_list

while True:
    # Cursor.count() works on pymongo 3.x; see the pymongo 4+ variant below
    print(url_list.find().count())
    time.sleep(5)
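On pymongo 4 and later, Cursor.count() has been removed, so the monitor above fails there. A variant using count_documents, the supported replacement introduced in pymongo 3.7, would look like this:

# Monitor variant for pymongo 4+ (Cursor.count() was removed)
import time

from page_parsing import url_list

while True:
    print(url_list.count_documents({}))
    time.sleep(5)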
# channel_extract.py: discover the URL of every second-hand-goods channel
from bs4 import BeautifulSoup
import requests

url = 'http://bj.ganji.com/wu/'
host_url = 'http://bj.ganji.com'


def get_channel(url):
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('div > div > dl > dt > a')
    for link in links:
        new_url = host_url + link.get('href')
        print(new_url)


# Run once to print the channel URLs, then paste the output into the string
# below; left commented out so importing `channel` has no side effect
# get_channel(url)

channel = '''
    http://bj.ganji.com/jiaju/
    http://bj.ganji.com/rirongbaihuo/
    http://bj.ganji.com/shouji/
    http://bj.ganji.com/shoujihaoma/
    http://bj.ganji.com/bangong/
    http://bj.ganji.com/nongyongpin/
    http://bj.ganji.com/jiadian/
    http://bj.ganji.com/ershoubijibendiannao/
    http://bj.ganji.com/ruanjiantushu/
    http://bj.ganji.com/yingyouyunfu/
    http://bj.ganji.com/diannao/
    http://bj.ganji.com/xianzhilipin/
    http://bj.ganji.com/fushixiaobaxuemao/
    http://bj.ganji.com/meironghuazhuang/
    http://bj.ganji.com/shuma/
    http://bj.ganji.com/laonianyongpin/
    http://bj.ganji.com/xuniwupin/
    http://bj.ganji.com/qitawupin/
    http://bj.ganji.com/ershoufree/
    http://bj.ganji.com/wupinjiaohuan/
'''
# page_parsing.py: store listing URLs (pass one) and scrape details (pass two)
from bs4 import BeautifulSoup
import requests
import pymongo
import random

client = pymongo.MongoClient('localhost', 27017)
ganji = client['ganji']  # original had the typo 'gangji'
url_list = ganji['url_list']    # listing URLs collected in pass one
item_info = ganji['item_info']  # listing details collected in pass two

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.87 Safari/537.36',
    'Connection': 'keep-alive'
}

# Note: the proxy is picked once at import time, so every request in a
# process goes through the same proxy
proxy_list = [
    'http://101.4.136.34:81',
    'http://182.96.194.184:8118'
]
proxy_ip = random.choice(proxy_list)
proxies = {'http': proxy_ip}


def get_links(channel, page):
    # List pages are paginated as <channel>o<page>
    list_view = '{}o{}'.format(channel, page)
    wb_data = requests.get(list_view)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    links = soup.select('td.t > a.t')
    if soup.find('td', 't'):
        for link in links:
            # Drop the query string so each URL is stored in canonical form
            new_url = str(link.get('href').split('?')[0])
            url_list.insert_one({'url': new_url})


def get_item_info(url):
    wb_data = requests.get(url, headers=headers, proxies=proxies)
    if wb_data.status_code == 404:
        # Listing has been taken down
        return
    soup = BeautifulSoup(wb_data.text, 'lxml')
    try:
        title = soup.title.text
        price = soup.select('span.price_now > i')[0].get_text()
        area = soup.select('div.palce_li > span > i')[0].get_text()
        data = {
            'title': title,
            'price': price,
            'area': area,
            'url': url  # kept so the second pass can skip finished listings
        }
        print(data)
        item_info.insert_one(data)  # the original printed but never stored
    except (AttributeError, IndexError):
        # Unexpected page layout; skip this listing
        pass
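One more hardening step, again my addition rather than part of the post: get_links inserts whatever it finds, so re-running pass one stores duplicate URLs. A unique index on the url field makes MongoDB reject duplicates (create it before the first crawl; building it over a collection that already contains duplicates will fail). With the index in place, insert_one raises DuplicateKeyError on a repeat, which a small wrapper can swallow:

# Optional: deduplicate listing URLs with a unique index (not in the post)
import pymongo

from page_parsing import url_list

# MongoDB rejects a second insert of the same URL once this index exists
url_list.create_index([('url', pymongo.ASCENDING)], unique=True)


def save_url(new_url):
    # Drop-in replacement for the bare insert_one call in get_links
    try:
        url_list.insert_one({'url': new_url})
    except pymongo.errors.DuplicateKeyError:
        pass  # already stored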