
Deploying Nutch 2.3.1 on Linux

1. Download Nutch 2.3.1 and build an Eclipse project from it with Ant.
2. Import the Nutch project into IDEA.
3. Configure nutch-default.xml:

<property>
  <name>plugin.folders</name>
  <value>/usr/local/nutch/nutch-2.3.1/src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
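
To confirm that the edited configuration is really the one picked up when running from the IDE, a quick check along the following lines can help. This is only a minimal sketch: the class name ConfCheck is a placeholder, and it assumes the conf directory holding nutch-default.xml and nutch-site.xml is on the runtime classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class ConfCheck {
  public static void main(String[] args) {
    // NutchConfiguration.create() loads nutch-default.xml and then nutch-site.xml
    // from the classpath, so the value printed here should match the edit above.
    Configuration conf = NutchConfiguration.create();
    System.out.println("plugin.folders = " + conf.get("plugin.folders"));
  }
}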
4. Configure nutch-site.xml. Add the following properties inside the <configuration> tag; the environment-specific values (agent name, storage class, plugin path, MongoDB address, and so on) need to be adjusted to your own setup.

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
  <description>Default class for storing data</description>
</property>
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization</value>
  <description>A list of serialization classes that can be used for
  obtaining serializers and deserializers.</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
  <name>plugin.folders</name>
  <value>/usr/local/nutch/nutch-2.3.1/src/plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
<!-- utf-8 -->
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines build for certain national group.
  </description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information
  is available</description>
</property>
<!-- utf-8 -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
<property>
  <name>file.content.limit</name>
  <value>6553600</value>
  <description>The length limit for downloaded content using the file protocol,
  in bytes. If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all. Do not confuse this setting with
  the http.content.limit setting.
  </description>
</property>
<property>
  <name>ftp.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  </description>
</property>
<!--
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
-->

5. Configure gora.properties:

gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=114.115.158.90:27017
# name of the MongoDB database
gora.mongodb.db=crawl
#gora.mongodb.login=login
#gora.mongodb.secret=secret

6. Create a Crawl test class:

package com.xxy.main;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.*;
import org.apache.nutch.fetcher.FetcherJob;
import org.apache.nutch.indexer.IndexingJob;
import org.apache.nutch.indexer.solr.SolrDeleteDuplicates;
import org.apache.nutch.metadata.Nutch;
import org.apache.nutch.parse.ParserJob;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.StringUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Random;

public class Crawl extends Configured implements Tool {

  public static final Logger LOG = LoggerFactory.getLogger(Crawl.class);

  /* Perform complete crawling and indexing (to Solr) given a set of root urls
     and the -solr parameter respectively. More information and Usage parameters
     can be found below. */
  public static void main(String args[]) throws Exception {
    Configuration conf = NutchConfiguration.create();
    String[] parameter = new String[3];
    parameter[0] = "/usr/local/nutch/nutch-2.3.1/urls";
    parameter[1] = "TestCrawl";
    parameter[2] = "1";
    int res = ToolRunner.run(conf, new Crawl(), parameter);
    System.exit(res);
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>");
      return -1;
    }
    String seedDir = args[0];
    String crawlId = args[1];
    String limit = "", solrUrl = "";
    if (args.length == 3) {
      limit = args[2];
    } else if (args.length == 4) {
      solrUrl = args[2];
      limit = args[3];
    } else {
      System.out.println("Unexpected number of arguments; check the input parameters.");
    }
    if (StringUtil.isEmpty(seedDir)) {
      System.out.println("Missing seedDir : crawl <seedDir> <crawlID> [<solrURL>] <numberOfRounds>");
    }
    if (StringUtil.isEmpty(crawlId)) {
      System.out.println("Missing crawlID : crawl <seedDir> <crawlID> [<solrURL>] <numberOfRounds>");
    }
    if (StringUtil.isEmpty(solrUrl)) {
      System.out.println("No SOLRURL specified. Skipping indexing.");
    }
    if (StringUtil.isEmpty(limit)) {
      System.out.println("Missing numberOfRounds : crawl <seedDir> <crawlID> [<solrURL>] <numberOfRounds>");
    }

    // MODIFY THE PARAMETERS BELOW TO YOUR NEEDS

    // set the number of slave nodes
    int numSlaves = 1;
    // and the total number of available tasks
    // sets Hadoop parameter "mapred.reduce.tasks"
    int numTasks = numSlaves << 1;
    // number of urls to fetch in one iteration
    // 250K per task?
    // int sizeFetchlist = numSlaves * 5;
    int sizeFetchlist = 10;
    // time limit for fetching
    String timeLimitFetch = "180";
    // Adds <days> to the current time to facilitate
    // crawling urls already fetched sooner than
    // db.default.fetch.interval.
    int addDays = 0;

    getConf().set("mapred.reduce.tasks", String.valueOf(numTasks));
    getConf().set("mapred.child.java.opts", "-Xmx1000m");
    getConf().set("mapred.reduce.tasks.speculative.execution", "false");
    getConf().set("mapred.map.tasks.speculative.execution", "false");
    getConf().set("mapred.compress.map.output", "true");

    InjectorJob injector = new InjectorJob(getConf());
    GeneratorJob generator = new GeneratorJob(getConf());
    FetcherJob fetcher = new FetcherJob(getConf());
    ParserJob parse = new ParserJob(getConf());
    DbUpdaterJob dbUpdaterJob = new DbUpdaterJob(getConf());
    IndexingJob indexingJob = new IndexingJob();
    SolrDeleteDuplicates solrDeleteDuplicates = new SolrDeleteDuplicates();

    // initialize crawlDb
    getConf().set(Nutch.CRAWL_ID_KEY, crawlId);

    int res;
    String[] injectParameter = new String[3];
    injectParameter[0] = seedDir;
    injectParameter[1] = "-crawlId";
    injectParameter[2] = crawlId;
    System.out.println("initial injection");
    res = ToolRunner.run(getConf(), injector, injectParameter);
    print(res, "inject");

    for (int i = 0; i < Integer.parseInt(limit); i++) {
      System.out.println("Begin Generate");
      String batchId = System.currentTimeMillis() + "-" + new Random().nextInt(32767);
      String[] generateParameter = new String[10];
      // generate a new segment
      generateParameter[0] = "-topN";
      generateParameter[1] = String.valueOf(sizeFetchlist);
      generateParameter[2] = "-noNorm";
      generateParameter[3] = "-noFilter";
      generateParameter[4] = "-adddays";
      generateParameter[5] = String.valueOf(addDays);
      generateParameter[6] = "-crawlId";
      generateParameter[7] = crawlId;
      generateParameter[8] = "-batchId";
      generateParameter[9] = batchId;
      res = ToolRunner.run(getConf(), generator, generateParameter);
      print(res, "generate");

      System.out.println("Begin Fetch");
      String[] fetchParameter = new String[5];
      fetchParameter[0] = batchId;
      fetchParameter[1] = "-crawlId";
      fetchParameter[2] = crawlId;
      fetchParameter[3] = "-threads";
      // number of fetcher threads
      fetchParameter[4] = "10";
      getConf().set("fetcher.timelimit.mins", timeLimitFetch);
      res = ToolRunner.run(getConf(), fetcher, fetchParameter);
      print(res, "fetch");

      /*
       * The configuration already runs parsing as part of the fetch step,
       * so this separate parse call is not strictly necessary.
       */
      System.out.println("parse begin");
      String[] parseParameter = new String[3];
      parseParameter[0] = batchId;
      parseParameter[1] = "-crawlId";
      parseParameter[2] = crawlId;
      getConf().set("mapred.skip.attempts.to.start.skipping", "2");
      getConf().set("mapred.skip.map.max.skip.records", "1");
      res = ToolRunner.run(getConf(), parse, parseParameter);
      if (res == 0) {
        System.out.println("parse finish");
      } else {
        System.out.println("parse failed");
      }

      // updatedb with this batch
      System.out.println("begin updatedb");
      String[] updatedbParameter = new String[3];
      updatedbParameter[0] = batchId;
      updatedbParameter[1] = "-crawlId";
      updatedbParameter[2] = crawlId;
      res = ToolRunner.run(getConf(), dbUpdaterJob, updatedbParameter);
      print(res, "updatedb");

      if (StringUtil.isEmpty(solrUrl)) {
        System.out.println("Skipping indexing tasks: no SOLR url provided.");
      } else {
        System.out.println("begin Indexing");
        getConf().set("solr.server.url", solrUrl);
        String[] indexingParameter = new String[3];
        indexingParameter[0] = "-all";
        indexingParameter[1] = "-crawlId";
        indexingParameter[2] = crawlId;
        res = ToolRunner.run(getConf(), indexingJob, indexingParameter);
        print(res, "indexing");

        System.out.println("begin SOLR dedup");
        String[] solrdedupParameter = new String[1];
        solrdedupParameter[0] = solrUrl;
        res = ToolRunner.run(getConf(), solrDeleteDuplicates, solrdedupParameter);
        print(res, "solr Delete Duplicates");
      }
    }
    return 0;
  }

  // Helper referenced above but not shown in the original post; this is a minimal
  // assumed implementation that just reports whether a stage succeeded.
  private void print(int res, String stage) {
    if (res == 0) {
      System.out.println(stage + " finish");
    } else {
      System.out.println(stage + " failed");
    }
  }
}
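
Before running the Crawl class above, the seed directory passed as its first argument (/usr/local/nutch/nutch-2.3.1/urls in main()) must exist and contain at least one plain-text seed file with one URL per line, e.g. a file named seed.txt; the file name and the URL below are placeholders only:

http://nutch.apache.org/

With a seed file in place and MongoDB reachable at the address configured in gora.properties, running Crawl.main() injects the seeds and then loops generate → fetch → parse → updatedb (plus Solr indexing and deduplication when a Solr URL is supplied) for the requested number of rounds; the crawled pages are stored in the MongoDB database named by gora.mongodb.db.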
