
Deploying Nutch 2.3.1 on Linux

1. Download Nutch 2.3.1 and build an Eclipse project from it with Ant.
2. Import the Nutch project into IDEA.
3. Configure nutch-default.xml:

<property>
  <name>plugin.folders</name>
  <value>/usr/local/nutch/nutch-2.3.1/src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
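
To confirm that the edited configuration is really the one picked up when running from the IDE, a quick check along the following lines can help. This is only a minimal sketch: the class name ConfCheck is a placeholder, and it assumes the conf directory holding nutch-default.xml and nutch-site.xml is on the runtime classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class ConfCheck {
  public static void main(String[] args) {
    // NutchConfiguration.create() loads nutch-default.xml and then nutch-site.xml
    // from the classpath, so the value printed here should match the edit above.
    Configuration conf = NutchConfiguration.create();
    System.out.println("plugin.folders = " + conf.get("plugin.folders"));
  }
}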
4. Configure nutch-site.xml. Add the following properties inside the <configuration> tag; the environment-specific values (agent name, storage class, plugin path, MongoDB address, and so on) need to be adjusted to your own setup.

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
  <description>Default class for storing data</description>
</property>
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization</value>
  <description>A list of serialization classes that can be used for
  obtaining serializers and deserializers.</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
  <name>plugin.folders</name>
  <value>/usr/local/nutch/nutch-2.3.1/src/plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
<!-- utf-8 -->
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national group.
  </description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>
<!-- utf-8 -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
<property>
  <name>file.content.limit</name>
  <value>6553600</value>
  <description>The length limit for downloaded content using the file protocol,
  in bytes. If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all. Do not confuse this setting with
  the http.content.limit setting.
  </description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
<!--
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
-->

5. Configure gora.properties:

gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=114.115.158.90:27017
# name of the MongoDB database
gora.mongodb.db=crawl
#gora.mongodb.login=login
#gora.mongodb.secret=secret

6. Create a Crawl test class:

package com.xxy.main;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.*;
import org.apache.nutch.fetcher.FetcherJob;
import org.apache.nutch.indexer.IndexingJob;
import org.apache.nutch.indexer.solr.SolrDeleteDuplicates;
import org.apache.nutch.metadata.Nutch;
import org.apache.nutch.parse.ParserJob;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.StringUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Random;

public class Crawl extends Configured implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(Crawl.class);

  /* Perform complete crawling and indexing (to Solr) given a set of root urls
     and the -solr parameter respectively. More information and Usage parameters
     can be found below. */
  public static void main(String args[]) throws Exception {
    Configuration conf = NutchConfiguration.create();
    String[] parameter = new String[3];
    parameter[0] = "/usr/local/nutch/nutch-2.3.1/urls";
    parameter[1] = "TestCrawl";
    parameter[2] = "1";
    int res = ToolRunner.run(conf, new Crawl(), parameter);
    System.exit(res);
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>");
      return -1;
    }
    String seedDir = args[0];
    String crawlId = args[1];
    String limit = "", solrUrl = "";
    if (args.length == 3) {
      limit = args[2];
    } else if (args.length == 4) {
      solrUrl = args[2];
      limit = args[3];
    } else {
      System.out.println("Unexpected number of arguments; check the input parameters.");
    }
    if (StringUtil.isEmpty(seedDir)) {
      System.out.println("Missing seedDir : crawl <seedDir> <crawlID> [<solrURL>] <numberOfRounds>");
    }
    if (StringUtil.isEmpty(crawlId)) {
      System.out.println("Missing crawlID : crawl <seedDir> <crawlID> [<solrURL>] <numberOfRounds>");
    }
    if (StringUtil.isEmpty(solrUrl)) {
      System.out.println("No SOLRURL specified. Skipping indexing.");
    }
    if (StringUtil.isEmpty(limit)) {
      System.out.println("Missing numberOfRounds : crawl <seedDir> <crawlID> [<solrURL>] <numberOfRounds>");
    }

    // MODIFY THE PARAMETERS BELOW TO YOUR NEEDS

    // set the number of slave nodes
    int numSlaves = 1;
    // and the total number of available tasks
    // sets Hadoop parameter "mapred.reduce.tasks"
    int numTasks = numSlaves << 1;
    // number of urls to fetch in one iteration
    // 250K per task?
    // int sizeFetchlist = numSlaves * 5;
    int sizeFetchlist = 10;
    // time limit for fetching
    String timeLimitFetch = "180";
    // Adds <days> to the current time to facilitate
    // crawling urls already fetched sooner than
    // db.default.fetch.interval.
    int addDays = 0;

    getConf().set("mapred.reduce.tasks", String.valueOf(numTasks));
    getConf().set("mapred.child.java.opts", "-Xmx1000m");
    getConf().set("mapred.reduce.tasks.speculative.execution", "false");
    getConf().set("mapred.map.tasks.speculative.execution", "false");
    getConf().set("mapred.compress.map.output", "true");

    InjectorJob injector = new InjectorJob(getConf());
    GeneratorJob generator = new GeneratorJob(getConf());
    FetcherJob fetcher = new FetcherJob(getConf());
    ParserJob parse = new ParserJob(getConf());
    DbUpdaterJob dbUpdaterJob = new DbUpdaterJob(getConf());
    IndexingJob indexingJob = new IndexingJob();
    SolrDeleteDuplicates solrDeleteDuplicates = new SolrDeleteDuplicates();

    // initialize crawlDb
    getConf().set(Nutch.CRAWL_ID_KEY, crawlId);

    int res;
    String[] injectParameter = new String[3];
    injectParameter[0] = seedDir;
    injectParameter[1] = "-crawlId";
    injectParameter[2] = crawlId;
    System.out.println("initial injection");
    res = ToolRunner.run(getConf(), injector, injectParameter);
    print(res, "inject");

    for (int i = 0; i < Integer.parseInt(limit); i++) {
      System.out.println("Begin Generate");
      String batchId = System.currentTimeMillis() + "-" + new Random().nextInt(32767);
      String[] generateParameter = new String[10];
      // generate a new segment
      generateParameter[0] = "-topN";
      generateParameter[1] = String.valueOf(sizeFetchlist);
      generateParameter[2] = "-noNorm";
      generateParameter[3] = "-noFilter";
      generateParameter[4] = "-adddays";
      generateParameter[5] = String.valueOf(addDays);
      generateParameter[6] = "-crawlId";
      generateParameter[7] = crawlId;
      generateParameter[8] = "-batchId";
      generateParameter[9] = batchId;
      res = ToolRunner.run(getConf(), generator, generateParameter);
      print(res, "generate");

      System.out.println("Begin Fetch");
      String[] fetchParameter = new String[5];
      fetchParameter[0] = batchId;
      fetchParameter[1] = "-crawlId";
      fetchParameter[2] = crawlId;
      fetchParameter[3] = "-threads";
      // number of fetcher threads
      fetchParameter[4] = "10";
      getConf().set("fetcher.timelimit.mins", timeLimitFetch);
      res = ToolRunner.run(getConf(), fetcher, fetchParameter);
      print(res, "fetch");

      /*
       * The configuration already runs parsing as part of the fetch step,
       * so this separate parse call is not strictly necessary.
       */
      System.out.println("parse begin");
      String[] parseParameter = new String[3];
      parseParameter[0] = batchId;
      parseParameter[1] = "-crawlId";
      parseParameter[2] = crawlId;
      getConf().set("mapred.skip.attempts.to.start.skipping", "2");
      getConf().set("mapred.skip.map.max.skip.records", "1");
      res = ToolRunner.run(getConf(), parse, parseParameter);
      if (res == 0) {
        System.out.println("parse finish");
      } else {
        System.out.println("parse failed");
      }

      // updatedb with this batch
      System.out.println("begin updatedb");
      String[] updatedbParameter = new String[3];
      updatedbParameter[0] = batchId;
      updatedbParameter[1] = "-crawlId";
      updatedbParameter[2] = crawlId;
      res = ToolRunner.run(getConf(), dbUpdaterJob, updatedbParameter);
      print(res, "updatedb");

      if (StringUtil.isEmpty(solrUrl)) {
        System.out.println("Skipping indexing tasks: no SOLR url provided.");
      } else {
        System.out.println("begin Indexing");
        getConf().set("solr.server.url", solrUrl);
        String[] indexingParameter = new String[3];
        indexingParameter[0] = "-all";
        indexingParameter[1] = "-crawlId";
        indexingParameter[2] = crawlId;
        res = ToolRunner.run(getConf(), indexingJob, indexingParameter);
        print(res, "indexing");

        System.out.println("begin SOLR dedup");
        String[] solrdedupParameter = new String[1];
        solrdedupParameter[0] = solrUrl;
        res = ToolRunner.run(getConf(), solrDeleteDuplicates, solrdedupParameter);
        print(res, "solr Delete Duplicates");
      }
    }
    return 0;
  }

  // Helper referenced above but not shown in the original post; this is a minimal
  // assumed implementation that just reports whether a stage succeeded.
  private void print(int res, String stage) {
    if (res == 0) {
      System.out.println(stage + " finish");
    } else {
      System.out.println(stage + " failed");
    }
  }
}
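
Before running the Crawl class above, the seed directory passed as its first argument (/usr/local/nutch/nutch-2.3.1/urls in main()) must exist and contain at least one plain-text seed file with one URL per line, e.g. a file named seed.txt; the file name and the URL below are placeholders only:

http://nutch.apache.org/

With a seed file in place and MongoDB reachable at the address configured in gora.properties, running Crawl.main() injects the seeds and then loops generate → fetch → parse → updatedb (plus Solr indexing and deduplication when a Solr URL is supplied) for the requested number of rounds; the crawled pages are stored in the MongoDB database named by gora.mongodb.db.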
