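Building your own crawler search engine on a Mac (Nutch + Elasticsearch was a failed attempt; switching to Scrapy + Elasticsearch)
1. Introduction
The project needs a crawler that can also provide personalized information retrieval and push notifications. I looked at various crawler frameworks, and the one that appealed most was this:
Building a search engine with Nutch + MongoDB + ElasticSearch + Kibana
The original English article is at: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
I considered using Docker to stand the system up for testing.
The Docker sources are: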
https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
https://store.docker.com/community/images/pure/nutch-mongo
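However, downloading the images with Docker was far too slow, so I gave up on Docker.
Setting JAVA_HOME on the Mac (add the export lines to ~/.bash_profile):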
vi ~/.bash_profile
export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$JAVA_HOME/bin:$PATH
export CLASS_PATH=$JAVA_HOME/lib
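After reloading the profile (source ~/.bash_profile), it can help to confirm that the variable points at a real JDK before building anything. A minimal sketch; the function name check_java_home is hypothetical and not from the original article:

```shell
# Report whether JAVA_HOME is set and contains an executable java binary.
# check_java_home is an illustrative helper name, not a standard tool.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME OK: $JAVA_HOME"
  else
    echo "no java binary under $JAVA_HOME/bin"
    return 1
  fi
}
```

Running check_java_home in a new terminal should print the resolved JDK path if the profile was picked up.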
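2. Installing Mongo
On the Mac, install it directly with brew; the latest version at the time of writing is 3.4.7.
After installation, create the /data/db directory and start the service with mongod.
To test, connect with the mongo command and enter show dbs to list the databases.
brew install mongodb
sudo mkdir -p /data/db
sudo chown -R <your username> /data
mongod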
3. Installing es + kibana
Download es; the latest version is 5.5.1. URL: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.1.tar.gz
Edit the configuration:
$ vim config/elasticsearch.yml
cluster.name: my-application
node.name: "node-1"
node.master: true
node.data: true
path.data: /opt/elasticsearch/data
network.bind_host: 127.0.0.1
network.publish_host: 127.0.0.1
Download kibana; the latest version is 5.5.1 (the Mac build).
Run it with: bin/kibana
Then open http://localhost:5601 in a browser.
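Since the rest of this setup depends on matching versions, it is worth checking which version the running node actually reports. A minimal sketch, assuming Elasticsearch is listening on its default HTTP port 9200 (the configuration above does not change it); es_version is a hypothetical helper name:

```shell
# Read the JSON that Elasticsearch serves at its root URL on stdin
# and print the value of the "number" field (the node's version).
# es_version is an illustrative helper name, not part of Elasticsearch.
es_version() {
  sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

Usage: curl -s http://localhost:9200 | es_version. The sed pattern only looks for the "number" key, so it is a rough check rather than a real JSON parser.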
4. Installing Apache Nutch
Download Apache Nutch 2.3.1 (src.tar.gz): http://nutch.apache.org/downloads.html
Set the environment variable from inside the extracted source directory: export NUTCH_HOME=$(pwd)
Edit the configuration:
$ cat conf/nutch-site.xml
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
Uncomment the MongoDB-related entries in the following two files.
$NUTCH_HOME/ivy/ivy.xml:
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.5" conf="*->default" />
$NUTCH_HOME/conf/gora.properties:
############################
# MongoDBStore properties  #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
Important: the elastic plugin has to be updated! The bundled plugin targets version 1.4.1, while the latest release is now 5.5.1. Go to the plugin directory and edit its ivy.xml:
cd src/plugin/indexer-elastic/
vi ivy.xml
...
<dependencies>
<dependency org="org.elasticsearch" name="elasticsearch"
rev="5.5.1" conf="*->default" />
</dependencies>
...
ant -f ./build-ivy.xml
Run ls lib to check the resolved versions, then update the version numbers in plugin.xml:
<library name="HdrHistogram-2.1.9.jar"/>
<library name="elasticsearch-5.5.1.jar"/>
<library name="hppc-0.7.1.jar"/>
<library name="jackson-core-2.8.6.jar"/>
<library name="jackson-dataformat-cbor-2.8.6.jar"/>
<library name="jackson-dataformat-smile-2.8.6.jar"/>
<library name="jackson-dataformat-yaml-2.8.6.jar"/>
<library name="jna-4.4.0.jar"/>
<library name="joda-time-2.9.5.jar"/>
<library name="jopt-simple-5.0.2.jar"/>
<library name="log4j-api-2.8.2.jar"/>
<library name="lucene-analyzers-common-6.6.0.jar"/>
<library name="lucene-backward-codecs-6.6.0.jar"/>
<library name="lucene-core-6.6.0.jar"/>
<library name="lucene-grouping-6.6.0.jar"/>
<library name="lucene-highlighter-6.6.0.jar"/>
<library name="lucene-join-6.6.0.jar"/>
<library name="lucene-memory-6.6.0.jar"/>
<library name="lucene-misc-6.6.0.jar"/>
<library name="lucene-queries-6.6.0.jar"/>
<library name="lucene-queryparser-6.6.0.jar"/>
<library name="lucene-sandbox-6.6.0.jar"/>
<library name="lucene-spatial-6.6.0.jar"/>
<library name="lucene-spatial-extras-6.6.0.jar"/>
<library name="lucene-spatial3d-6.6.0.jar"/>
<library name="lucene-suggest-6.6.0.jar"/>
<library name="securesm-1.1.jar"/>
<library name="snakeyaml-1.15.jar"/>
<library name="t-digest-3.0.jar"/>
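The library entries above have to mirror the jar files that ant -f ./build-ivy.xml places in lib, so they can be generated instead of typed by hand. A minimal sketch, assuming the jars sit directly in the directory passed as the first argument; gen_library_entries is a hypothetical helper name and not part of Nutch:

```shell
# Print one <library .../> element per jar found in the given directory.
# gen_library_entries is an illustrative helper name, not a Nutch tool.
gen_library_entries() {
  lib_dir=$1
  for jar in "$lib_dir"/*.jar; do
    # Skip the unexpanded pattern when the directory has no jars.
    [ -e "$jar" ] || continue
    printf '<library name="%s"/>\n' "$(basename "$jar")"
  done
}
```

Running gen_library_entries lib from src/plugin/indexer-elastic/ prints one line per jar, ready to paste into plugin.xml.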
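However, the bigger pitfall is that the plugin's own code now errors out! I'm done tinkering with it and giving up.
Started the build: ant runtime (it ran for 33 minutes!)
Conclusion
1. nutch 2.x and elasticsearch 5.x are not well compatible for the time being; I don't want to keep fighting it, so I'm dropping this approach.
2. Next time I will try a new architecture: scrapy + scrapy-redis + mongodb + elasticsearch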