Building a Search Engine with Nutch + MongoDB + ElasticSearch + Kibana

阿新 • Published: 2019-01-15

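Preface:

This article describes how to build a web crawler and search engine with Nutch, MongoDB, ElasticSearch, and Kibana. Nutch crawls the web pages, MongoDB stores the crawled data, ElasticSearch builds the index, and Kibana visualizes the indexed results. The steps are as follows:

Environment setup:

Operating system: Ubuntu 14.04

JDK version: jdk1.8.0_45
Download the installation package with wget and extract it: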

gannyee@ubuntu:~/download$ wget https://www.reucon.com/cdn/java/jdk-8u45-linux-x64.tar.gz
gannyee@ubuntu:~/download$ tar zxvf jdk-8u45-linux-x64.tar.gz

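Extracting the archive produces a directory named jdk1.8.0_45. Check whether /usr/lib/ already contains a jvm directory; if not, create one:

gannyee@ubuntu:~/download$ sudo mkdir /usr/lib/jvm

Move the extracted jdk1.8.0_45 directory into /usr/lib/jvm:

gannyee@ubuntu:~/download$ sudo mv jdk1.8.0_45 /usr/lib/jvm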

Open /etc/profile to set the environment variables:

gannyee@ubuntu:~/download$ sudo vim /etc/profile

Append the following to the end of the file:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_45
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin

Then run the following command so the environment variables take effect:

gannyee@ubuntu:~/download$ source /etc/profile

The JDK is now installed. Check its version:

gannyee@ubuntu:~/download$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)

If the command does not print the version information, an earlier step probably went wrong; review the steps above carefully.
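
The version check above can also be scripted. The following is a minimal sketch, assuming the first line of the java -version output has the form shown above; parse_java_version is a hypothetical helper name, not part of the JDK.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the quoted version string from the
# first line of `java -version` output passed as the first argument.
parse_java_version() {
  # Example input line: java version "1.8.0_45"
  printf '%s\n' "$1" | sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Usage (java -version writes to stderr, so redirect it when capturing):
#   installed=$(parse_java_version "$(java -version 2>&1)")
#   [ "$installed" = "1.8.0_45" ] || echo "unexpected JDK: $installed"
```

If the printed version differs from 1.8.0_45, the PATH probably still points at another Java installation.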

Next, download Apache Ant 1.9.4:

gannyee@ubuntu:~/download$ wget https://archive.apache.org/dist/ant/binaries/apache-ant-1.9.4-bin.tar.gz

Extracting the archive produces a directory named apache-ant-1.9.4. Move it into /usr/local/ant:

gannyee@ubuntu:~/download$ sudo tar -zvxf apache-ant-1.9.4-bin.tar.gz
gannyee@ubuntu:~/download$ sudo mkdir /usr/local/ant
gannyee@ubuntu:~/download$ sudo mv apache-ant-1.9.4 /usr/local/ant

Open /etc/profile to set the environment variables:

gannyee@ubuntu:~/download$ sudo vim /etc/profile

Append the following to the end of the file:

export ANT_HOME=/usr/local/ant/apache-ant-1.9.4
export PATH=$PATH:$ANT_HOME/bin

Run the following command so the environment variables take effect:

gannyee@ubuntu:~/download$ source /etc/profile

Check the Ant version:

gannyee@ubuntu:~/download$ ant -version
Apache Ant(TM) version 1.9.4 compiled on April 29 2014

The environment required by the engine is now in place.
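
Before moving on, it can help to confirm that every tool used in this guide is on the PATH. This is a small sketch; missing_cmds is a hypothetical helper, and the list of commands in the usage line is my assumption based on the commands that appear in this article.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of the given commands
# that cannot be found on PATH (space separated, empty if none).
missing_cmds() {
  local missing=""
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
  done
  # Strip the leading space before printing
  printf '%s\n' "${missing# }"
}

# Usage:
#   missing_cmds java ant wget tar curl
```

An empty result means all listed commands were found.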

Data flow of the engine:

[Figure: data flow of the engine; image not available]

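Downloading, installing, and starting MongoDB

MongoDB is an open-source document database and a typical representative of NoSQL databases.
Version: MongoDB-2.6.11

gannyee@ubuntu:~/download$ wget https://fastdl.mongodb.org/src/mongodb-src-r2.6.11.tar.gz
gannyee@ubuntu:~/download$ sudo tar -zxvf mongodb-src-r2.6.11.tar.gz
gannyee@ubuntu:~/download$ mv mongodb-src-r2.6.11/ ../mongodb/
gannyee@ubuntu:~$ cd mongodb/
gannyee@ubuntu:~/mongodb$ sudo mkdir log/ conf/ data/

Note that this archive is the source distribution; the ./bin/mongod and ./bin/mongo commands used below assume that the binaries are present under bin/.

Since version 2.6, MongoDB uses a YAML-based configuration file format. The options used below are described in the MongoDB configuration file reference.

Create se.yml. The file below uses paths under /opt/mongodb; adjust them if your log/ and data/ directories live elsewhere, for example under ~/mongodb as created above: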

gannyee@ubuntu:~/mongodb$ vim conf/se.yml
net:
  port: 27017
  bindIp: 127.0.0.1
systemLog:
  destination: file
  path: "/opt/mongodb/log/mongodb.log"
  logAppend: true
processManagement:
  fork: true
  pidFilePath: "/opt/mongodb/log/mongodb.pid"
storage:
  dbPath: "/opt/mongodb/data"
  directoryPerDB: true
  smallFiles: true
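
mongod does not start if the directories named in the configuration do not exist. The sketch below lists the directories the file refers to so they can be created up front. config_dirs is a hypothetical helper, and it assumes the simple layout used in se.yml above (one quoted path per line), not arbitrary YAML.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the directories that a mongod YAML config
# refers to: the parent directory of systemLog.path and pidFilePath,
# plus the dbPath itself. Assumes quoted values, one per line.
config_dirs() {
  {
    sed -E -n 's/^ *(path|pidFilePath): *"(.*)" *$/\2/p' "$1" |
      while read -r f; do dirname "$f"; done
    sed -E -n 's/^ *dbPath: *"(.*)" *$/\1/p' "$1"
  } | sort -u
}

# Usage:
#   config_dirs conf/se.yml | xargs sudo mkdir -p
```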

Start MongoDB:

gannyee@ubuntu:~/mongodb$ ./bin/mongod -f conf/se.yml

Open the mongo shell to check that MongoDB started successfully:

gannyee@ubuntu:~/mongodb$ ./bin/mongo
MongoDB shell version: 2.6.11
connecting to: test
> show dbs
admin (empty)
local 0.031GB
> exit
bye

Shut down MongoDB from the mongo shell:

> use admin
> db.shutdownServer()

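Optionally, install Robomongo, a graphical client for MongoDB:

gannyee@ubuntu:~/mongodb$ sudo wget http://app.robomongo.org/files/linux/robomongo-0.8.5-x86_64.deb
gannyee@ubuntu:~/mongodb$ sudo dpkg -i robomongo-0.8.5-x86_64.deb

Run robomongo to open the client:

gannyee@ubuntu:~$ robomongo

To create a new connection, only the host and port are required.
Note: after my first installation the connection succeeded, but no data was visible.
Fix: reinstall with root privileges.
The user interface is shown in a screenshot (image not available).

To allow access from other machines, change bindIp: 127.0.0.1 to bindIp: 0.0.0.0 in the configuration file.

Then open http://localhost:27017 in a browser (use the server's address instead of localhost when testing from another machine). If the following message appears, MongoDB is reachable:
It looks like you are trying to access MongoDB over HTTP on the native driver port.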

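If ./mongod fails to start
In most cases the cause is that MongoDB was not shut down cleanly and mongod was left locked, for example after an unexpected crash.
To resolve this:

Delete the mongod.lock file and the log file mongodb.log.2016-1-26T06-55-20; if necessary, delete all log files. Then run a repair:
mongod --repair --dbpath /home/gannyee/mongodb/data/db --repairpath /home/gannyee/mongodb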
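
Deleting the lock file while a mongod process is still running can damage the data files, so it is safer to check first. This is a minimal sketch under that assumption; remove_stale_lock is a hypothetical helper and the process check relies on pgrep being available.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove mongod.lock from the given data directory,
# but only when no process named exactly "mongod" is running.
remove_stale_lock() {
  local lock="$1/mongod.lock"
  if pgrep -x mongod >/dev/null 2>&1; then
    echo "mongod is still running; not touching $lock" >&2
    return 1
  fi
  rm -f "$lock"
}

# Usage, followed by the repair command from the text:
#   remove_stale_lock /home/gannyee/mongodb/data/db &&
#     mongod --repair --dbpath /home/gannyee/mongodb/data/db
```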

Downloading and installing ElasticSearch

ElasticSearch is a high-performance distributed search engine built on Apache Lucene.
Version: ElasticSearch-1.4.4

gannyee@ubuntu:~/download$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-1.4.4.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf elasticsearch-1.4.4.tar.gz
gannyee@ubuntu:~/download$ mv elasticsearch-1.4.4 ../elasticsearch
gannyee@ubuntu:~$ cd elasticsearch

Edit elasticsearch.yml in the config directory:

gannyee@ubuntu:~/elasticsearch$ vim config/elasticsearch.yml
......
cluster.name: gannyee
node.name: "gannyee"
node.master: true
node.data: true
path.conf: /home/gannyee/elasticsearch/config
path.data: /home/gannyee/elasticsearch/data
http.enabled: true
network.bind_host: 127.0.0.1
network.publish_host: 127.0.0.1
network.host: 127.0.0.1
.......

Start ElasticSearch in the background:

gannyee@ubuntu:~/elasticsearch$ ./bin/elasticsearch -d

Stopping ElasticSearch
Shut down a single-node setup:

gannyee@ubuntu:~/elasticsearch$ curl -XPOST http://localhost:9200/_cluster/nodes/_shutdown

Shut down the node with ID BlrmMvBdSKiCeYGsiHijdg:

gannyee@ubuntu:~/elasticsearch$ curl -XPOST http://localhost:9200/_cluster/nodes/BlrmMvBdSKiCeYGsiHijdg/_shutdown

Check that ElasticSearch is running:

gannyee@ubuntu:~/elasticsearch$ curl -XGET 'http://localhost:9200'
{
  "status" : 200,
  "name" : "gannyee",
  "cluster_name" : "gannyee",
  "version" : {
    "number" : "1.4.4",
    "build_hash" : "c88f77ffc81301dfa9dfd81ca2232f09588bd512",
    "build_timestamp" : "2015-02-19T13:05:36Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.3"
  },
  "tagline" : "You Know, for Search"
}
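
The same check can be scripted by pulling the version number out of the response. This is a rough sketch that assumes the pretty-printed layout shown above; es_version is a hypothetical helper, and a real JSON parser would be more robust than sed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the JSON returned by the ElasticSearch root
# endpoint on stdin and print the value of the "number" field.
es_version() {
  sed -n 's/.*"number" *: *"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage:
#   curl -s -XGET 'http://localhost:9200' | es_version
```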

elasticsearch-head is a cluster management tool for ElasticSearch. It is a standalone web application written entirely in HTML5, and it can be integrated into ES as a plugin.
Install the elasticsearch-head plugin:

gannyee@ubuntu:~/elasticsearch$ ./bin/plugin -install mobz/elasticsearch-head

Restart ElasticSearch, then open http://localhost:9200/_plugin/head/ in a browser.
The buttons on the right side of the page, such as node stats and cluster nodes, call the corresponding ES status APIs directly and return JSON (screenshot not available).

Downloading and installing Kibana

Kibana is an open-source, browser-based dashboard for analyzing and searching Elasticsearch data.
Version: kibana-4.0.1

gannyee@ubuntu:~/download$ wget https://download.elasticsearch.org/kibana/kibana/kibana-4.0.1-linux-x64.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf kibana-4.0.1-linux-x64.tar.gz
gannyee@ubuntu:~/download$ mv kibana-4.0.1-linux-x64/ ../kibana/
gannyee@ubuntu:~/download$ cd ../kibana/
gannyee@ubuntu:~/kibana$ ./bin/kibana
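
At this point three services should be listening locally. The sketch below probes their ports with bash's /dev/tcp feature. Ports 27017 and 9200 come from the configuration in this article; 5601 is Kibana's default port and is my assumption here, since the article does not state it. port_open and check_services are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if a TCP connection to host $1, port $2
# can be opened (uses bash's built-in /dev/tcp pseudo-device).
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Report the state of each service on the local machine.
check_services() {
  local spec
  for spec in mongodb:27017 elasticsearch:9200 kibana:5601; do
    if port_open 127.0.0.1 "${spec#*:}"; then
      echo "${spec%%:*} up"
    else
      echo "${spec%%:*} DOWN"
    fi
  done
}

# Usage:
#   check_services
```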

Installing, building, and configuring Apache Nutch:

Nutch is an open-source web crawler that grew out of Lucene. This setup requires the Nutch 2.x series; the 1.x series does not support storage backends such as MongoDB, MySQL, or HBase.
Version: apache-nutch-2.3.1

Downloading, building, and configuring Nutch 2.3.1

gannyee@ubuntu:~/download$ wget https://archive.apache.org/dist/nutch/2.3.1/apache-nutch-2.3.1-src.tar.gz
gannyee@ubuntu:~/download$ tar -zxvf apache-nutch-2.3.1-src.tar.gz
gannyee@ubuntu:~/download$ mv apache-nutch-2.3.1 ../nutch
gannyee@ubuntu:~/download$ cd ../nutch
gannyee@ubuntu:~/nutch$ export NUTCH_HOME=$(pwd)

Edit conf/nutch-site.xml so that MongoDB is used as the Gora data store:

gannyee@ubuntu:~/nutch/conf$ vim nutch-site.xml
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>

Uncomment the following section in ivy/ivy.xml:

gannyee@ubuntu:~/nutch/conf$ vim $NUTCH_HOME/ivy/ivy.xml
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.5" conf="*->default" />
...
</dependency>

Make sure MongoStore is set as the default data store:

gannyee@ubuntu:~/nutch$ vim conf/gora.properties
###########################
# MongoDBStore properties #
###########################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
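
A quick way to confirm that the edit took effect is to read the values back from the file. This is a small sketch; get_prop is a hypothetical helper that handles only plain key=value lines, not the full Java properties syntax.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of key $2 from the key=value
# file $1. If the key appears more than once, the last value wins.
get_prop() {
  awk -F= -v k="$2" '
    $1 == k { sub(/^[^=]*=/, ""); v = $0; seen = 1 }
    END     { if (seen) print v }
  ' "$1"
}

# Usage:
#   get_prop conf/gora.properties gora.datastore.default
#   get_prop conf/gora.properties gora.mongodb.servers
```

The second value should match the port and bindIp configured for MongoDB earlier.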

Build Nutch:

gannyee@ubuntu:~/nutch$ ant runtime

If the build prints messages like the following:

Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

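These messages appear because a library is missing. They can be ignored, but the fix is as follows:
Download sonar-ant-task-2.1.jar and copy it to $NUTCH_HOME/lib.

Edit $NUTCH_HOME/build.xml to reference the jar added above: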

<!-- Define the Sonar task if this hasn't been done in a common script -->
 <taskdef uri="antlib:org.sonar.ant" resource="org/sonar/ant/antlib.xml">
  <classpath path="${ant.library.dir}" />
  <classpath path="${mysql.library.dir}" />
  <classpath><fileset dir="lib/" includes="sonar*.jar" /></classpath>
 </taskdef>

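The build output is placed in the newly created directory nutch/runtime.

Finally, confirm that Nutch was built correctly and runs. The output should look like this:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch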
 Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 index          run the plugin-based indexer on parsed batches
 elasticindex   run the elasticsearch indexer - DEPRECATED use the index command instead
 solrindex      run the solr indexer on parsed batches - DEPRECATED use the index command instead
 solrdedup      remove duplicates from solr
 solrclean      remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
 clean          remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 webapp         run a local Nutch web application
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Customizing the crawl settings

gannyee@ubuntu:~$ sudo vim ~/nutch/runtime/local/conf/nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Hist Crawler</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>

  <property>
    <name>elastic.cluster</name>
    <!-- must match cluster.name in elasticsearch.yml -->
    <value>gannyee</value>
  </property>

  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>

</configuration>
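
The elastic.cluster property presumably has to match cluster.name in elasticsearch.yml, otherwise the indexer will not find the cluster. The sketch below compares the two values. xml_prop and yml_cluster_name are hypothetical helpers that rely on the one-tag-per-line layout used in this article; they are not general XML or YAML parsers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the <value> that follows <name>$2</name>
# in the Hadoop-style configuration file $1.
xml_prop() {
  awk -v k="$2" '
    index($0, "<name>" k "</name>") { found = 1; next }
    found && /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); print; exit }
  ' "$1"
}

# Hypothetical helper: print cluster.name from elasticsearch.yml ($1),
# with any surrounding double quotes removed.
yml_cluster_name() {
  sed -n 's/^cluster\.name: *//p' "$1" | tr -d '"'
}

# Usage:
#   a=$(xml_prop ~/nutch/runtime/local/conf/nutch-site.xml elastic.cluster)
#   b=$(yml_cluster_name ~/elasticsearch/config/elasticsearch.yml)
#   [ "$a" = "$b" ] || echo "cluster name mismatch: $a vs $b"
```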

Crawling your first website
Create a URL seed list:

gannyee@ubuntu:~$ mkdir -p ~/nutch/runtime/local/urls
gannyee@ubuntu:~$ echo 'http://www.aossama.com/' > ~/nutch/runtime/local/urls/seed.txt

Edit conf/regex-urlfilter.txt and replace the following lines:

# accept anything else
+.

with a regular expression that matches the domain you want to crawl:

+^http://([a-z0-9]*\.)*aossama\.com/
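
To see which URLs such a rule lets through, the pattern can be tried out with grep. This is only an approximation, since Nutch evaluates the rule with Java regular expressions; for a simple pattern like this one, grep -E should behave the same way. The dot in the domain is escaped here, and url_accepted is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the URL matches the accept rule
# (the leading "+" in regex-urlfilter.txt means "accept" and is not
# part of the regular expression itself).
url_accepted() {
  printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*aossama\.com/'
}

# Usage:
#   url_accepted 'http://www.aossama.com/' && echo accepted
#   url_accepted 'https://www.aossama.com/' || echo rejected
```

Note that the rule only accepts http:// URLs, so https:// links on the same domain are filtered out.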

Initialize the crawldb:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch inject urls/

Generate URLs from the crawldb:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch generate -topN 80

Fetch all generated URLs:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch fetch -all

Parse the fetched URLs:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch parse -all

Update the database:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch updatedb -all

Index the parsed URLs:

gannyee@ubuntu:~/nutch/runtime/local$ ./bin/nutch index -all
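
The six steps above can be wrapped in a small script so that several generate/fetch/parse/updatedb rounds run back to back. This is a sketch, not an official Nutch script: the crawl function, the NUTCH variable, and the idea of repeating the middle four steps are my assumptions; the commands themselves are the ones shown above.

```shell
#!/usr/bin/env bash
# NUTCH is the launcher to call. Set NUTCH=echo for a dry run that
# only prints the arguments of each step.
NUTCH="${NUTCH:-./bin/nutch}"

# crawl SEED_DIR ROUNDS [TOP_N]
# Inject the seeds once, run ROUNDS crawl rounds, then index.
crawl() {
  local seed_dir="$1" rounds="$2" top_n="${3:-80}" i
  "$NUTCH" inject "$seed_dir" || return 1
  for i in $(seq 1 "$rounds"); do
    "$NUTCH" generate -topN "$top_n" || return 1
    "$NUTCH" fetch -all || return 1
    "$NUTCH" parse -all || return 1
    "$NUTCH" updatedb -all || return 1
  done
  "$NUTCH" index -all
}

# Usage (from ~/nutch/runtime/local):
#   crawl urls/ 2
```

Each additional round follows the links discovered in the previous one, so the number of rounds controls the crawl depth.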

After the crawl finishes, MongoDB contains a new database, nutch_1:

gannyee@ubuntu:~/mongodb$ ./bin/mongo
MongoDB shell version: 2.6.11
connecting to: test
> show dbs
admin    (empty)
local    0.031GB
nutch_1  0.031GB
test     (empty)
> use nutch_1
switched to db nutch_1
> show tables
system.indexes
webpage

The data itself can be inspected with commands in the terminal or by browsing it in the graphical client.
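
To confirm that the index step also reached ElasticSearch, the document count of the index can be queried. This is a sketch that assumes the index name nutch from the elastic.index property above and the _count endpoint of ElasticSearch; parse_count is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a _count response on stdin, for example
# {"count":42,"_shards":{...}}, and print the numeric count.
parse_count() {
  sed -n 's/.*"count" *: *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Usage:
#   curl -s 'http://localhost:9200/nutch/_count' | parse_count
```

A count of zero after a successful crawl suggests that the indexer could not reach the cluster.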
