Distributed crawling with Celery
阿新 • Published 2019-01-24
We'll use crawling Douban's fiction listings as an example.
First start Redis, then create a new file crawl_douban.py:
import requests
from bs4 import BeautifulSoup
import time
from celery import Celery
import redis
from configparser import ConfigParser
cp = ConfigParser()
cp.read('config')
# Read the Redis connection settings from the config file
db_host = cp.get(section='redis', option='db_host')
db_port = cp.getint('redis', 'db_port')
db_pwd = cp['redis']['db_pwd']
# Redis connection used to store the scraped titles
pool = redis.ConnectionPool(host=db_host, port=db_port, db=15, password=db_pwd)
r = redis.StrictRedis(connection_pool=pool)
set_name = 'crawl:douban'
# The crawl task below lives in this module (crawl_douban)
app = Celery('crawl', include=['crawl_douban'],
             broker='redis://:{}@{}:{}/12'.format(db_pwd, db_host, db_port),
             backend='redis://:{}@{}:{}/13'.format(db_pwd, db_host, db_port))
# The official docs recommend JSON as the message serialization format
app.conf.update(
    CELERY_TIMEZONE='Asia/Shanghai',
    CELERY_ENABLE_UTC=True,
    CELERY_ACCEPT_CONTENT=['json'],
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
)
headers = {
    'User-Agent': '',
}
@app.task
def crawl(url):
    res = requests.get(url, headers=headers)
    # Wait 2 seconds to avoid hitting the site too fast
    time.sleep(2)
    soup = BeautifulSoup(res.text, 'lxml')
    items = soup.select('.subject-list .subject-item .info h2 a')
    titles = [item['title'] for item in items]
    # Store the URL, the scraped titles and a timestamp in the Redis set
    # (serialized to a string; redis-py only accepts strings/bytes/numbers as set members)
    r.sadd(set_name, str((url, titles, time.time())))
    print(titles)
    return (url, titles)
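The script above reads its Redis credentials from a plain-text file named config in the same directory. The original post doesn't show that file; a minimal sketch of what ConfigParser expects here, with placeholder values, would be:
[redis]
db_host = 127.0.0.1
db_port = 6379
db_pwd = yourpassword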
Deploy the script above to two hosts, A and B, then run the following command on each:
celery -A crawl_douban worker -l info
On a third machine, C, create a file task_dispatcher.py to dispatch the tasks asynchronously:
from crawl_douban import app
from crawl_douban import crawl

def manage_crawl(urls):
    for url in urls:
        app.send_task('crawl_douban.crawl', args=(url,))
        # The line above could also be written as crawl.apply_async(args=(url,)) or crawl.delay(url)
if __name__ == '__main__':
    start_url = 'https://book.douban.com/tag/小說'
    # Crawl 10 pages, 20 books per page
    url_list = ['{}?start={}&type=T'.format(start_url, page * 20) for page in range(10)]
    manage_crawl(url_list)
Run task_dispatcher.py; it finishes in 2.8 s. Since send_task only enqueues the jobs on the broker, the dispatcher returns almost immediately while the actual crawling happens on the workers.
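Because the app is configured with a Redis result backend, the dispatcher could also keep the task handles and wait for the scraped titles. The helper below is a sketch that isn't in the original post; crawl_and_wait is a hypothetical name, built on the same crawl task:
from crawl_douban import crawl

def crawl_and_wait(urls, timeout=120):
    # delay() enqueues one task per URL and returns an AsyncResult handle
    results = [crawl.delay(url) for url in urls]
    # get() blocks until the worker has written the (url, titles) result to the backend
    return [res.get(timeout=timeout) for res in results]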
A few notes on the worker start-up command:
celery worker -A tasks --loglevel=info --concurrency=5
- The "-A" option tells Celery where the application instance is defined.
- The "--loglevel" option sets the log level; it can be omitted and defaults to WARNING.
- The "--concurrency" option sets the maximum number of concurrent worker processes; it defaults to the number of CPU cores.
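To keep the worker running as a background service, the same command can be managed with Supervisor; the program section below uses the directory and log paths from the author's environment: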
[program:celery]
command=celery worker -A tasks --loglevel=info --concurrency=5
directory=/home/user_00/learn
stdout_logfile=/home/user_00/learn/logs/celery.log
autorestart=true
redirect_stderr=true
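Assuming this section is saved as a .conf file under Supervisor's include directory (e.g. /etc/supervisor/conf.d/), the worker can then be picked up and started with supervisorctl:
supervisorctl reread
supervisorctl update
supervisorctl start celery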