【Apache Hadoop】MapReduce Programming Notes: Running Multiple MapReduce Jobs
When learning Hadoop, writing MapReduce programs is unavoidable. For simple analyses a single MapReduce job is enough; I will not cover the single-job case here, since there are plenty of examples online that you can find via Baidu or Google. For more complex analyses, however, we may need multiple Jobs, or multiple Map or Reduce stages, to carry out the computation.
There are several common ways to structure multi-Job or multi-MapReduce programs:
1. Iterative MapReduce
In iterative MapReduce, the output of the previous MapReduce job becomes the input of the next. In the end only the final result needs to be kept; whether the intermediate data is deleted or retained is up to your business requirements.
Sample code:
Configuration conf = new Configuration();

// first Job
Job job1 = new Job(conf, "job1");
// ... other job1 settings
FileInputFormat.addInputPath(job1, inputPath1);
FileOutputFormat.setOutputPath(job1, outputPath1);
job1.waitForCompletion(true);

// second Job: reads the output of job1
Job job2 = new Job(conf, "job2");
// ... other job2 settings
FileInputFormat.addInputPath(job2, outputPath1);
FileOutputFormat.setOutputPath(job2, outputPath2);
job2.waitForCompletion(true);

// third Job: reads the output of job2
Job job3 = new Job(conf, "job3");
// ... other job3 settings
FileInputFormat.addInputPath(job3, outputPath2);
FileOutputFormat.setOutputPath(job3, outputPath3);
job3.waitForCompletion(true);
...
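As noted above, the intermediate directories are only hand-off points between consecutive jobs. A minimal sketch of cleaning them up once the last job has succeeded (assuming the conf, outputPath1 and outputPath2 variables from the snippet above; this is not part of the original example):

// Once job3 has completed successfully, the intermediate outputs of job1 and
// job2 are no longer needed and can be removed with a recursive delete.
FileSystem fs = FileSystem.get(conf);
fs.delete(outputPath1, true);
fs.delete(outputPath2, true);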
The following shows how Mahout applies MapReduce iteratively. The code below is taken from Mahout's k-means implementation: the main function drives the MapReduce iterations with a while loop, and runIteration() performs one MapReduce pass.
That said, there are two things I find unsatisfying about the current way of iterating with MapReduce:
1. In every iteration, all Jobs (tasks) are created from scratch again, which is very expensive.
2. In every iteration, the data is written out to and read back from storage, so the I/O and network-transfer costs are high.
Models such as Twister and HaLoop seem to address these problems better, but their level of abstraction is not high enough and the computations they support are limited.
I hope a future version of Hadoop will support iterative algorithms better.
// main function
while (!converged && iteration <= maxIterations) {
  log.info("K-Means Iteration {}", iteration);
  // point the output to a new directory per iteration
  Path clustersOut = new Path(output, AbstractCluster.CLUSTERS_DIR + iteration);
  converged = runIteration(conf, input, clustersIn, clustersOut, measure.getClass().getName(), delta);
  // now point the input to the old output directory
  clustersIn = clustersOut;
  iteration++;
}
private static boolean runIteration(Configuration conf,
                                    Path input,
                                    Path clustersIn,
                                    Path clustersOut,
                                    String measureClass,
                                    String convergenceDelta)
    throws IOException, InterruptedException, ClassNotFoundException {
  conf.set(KMeansConfigKeys.CLUSTER_PATH_KEY, clustersIn.toString());
  conf.set(KMeansConfigKeys.DISTANCE_MEASURE_KEY, measureClass);
  conf.set(KMeansConfigKeys.CLUSTER_CONVERGENCE_KEY, convergenceDelta);
  Job job = new Job(conf, "KMeans Driver running runIteration over clustersIn: " + clustersIn);
  job.setMapOutputKeyClass(Text.class);
  job.setMapOutputValueClass(ClusterObservations.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Cluster.class);
  job.setInputFormatClass(SequenceFileInputFormat.class);
  job.setOutputFormatClass(SequenceFileOutputFormat.class);
  job.setMapperClass(KMeansMapper.class);
  job.setCombinerClass(KMeansCombiner.class);
  job.setReducerClass(KMeansReducer.class);
  FileInputFormat.addInputPath(job, input);
  FileOutputFormat.setOutputPath(job, clustersOut);
  job.setJarByClass(KMeansDriver.class);
  HadoopUtil.delete(conf, clustersOut);
  if (!job.waitForCompletion(true)) {
    throw new InterruptedException("K-Means Iteration failed processing " + clustersIn);
  }
  FileSystem fs = FileSystem.get(clustersOut.toUri(), conf);
  return isConverged(clustersOut, conf, fs);
}
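Notice how this illustrates the two complaints above: every call to runIteration() configures and submits a brand-new Job, and each pass writes its clusters out to clustersOut only for the next pass to read them back in as clustersIn.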
2. Dependency-based MapReduce - JobControl
The dependency-based approach is implemented with JobControl, which is built from two classes: JobControl itself and a job wrapper class (Job in the old mapred API, ControlledJob in the new mapreduce API). The wrapper encapsulates a MapReduce job together with its dependencies; it monitors the state of the jobs it depends on and updates its own state accordingly.
JobControl contains a thread that periodically monitors and updates the state of each job, schedules jobs whose prerequisite jobs have all completed, submits jobs in the READY state, and so on. It also provides APIs for suspending, resuming, and stopping that thread.
Sample code:
Configuration job1conf = new Configuration();
Job job1 = new Job(job1conf, "Job1");
... // other job1 settings
Configuration job2conf = new Configuration();
Job job2 = new Job(job2conf, "Job2");
... // other job2 settings
Configuration job3conf = new Configuration();
Job job3 = new Job(job3conf, "Job3");
... // other job3 settings

// wrap each Job in a ControlledJob so that dependencies can be declared
ControlledJob cJob1 = new ControlledJob(job1, null);
ControlledJob cJob2 = new ControlledJob(job2, null);
ControlledJob cJob3 = new ControlledJob(job3, null);
cJob3.addDependingJob(cJob1); // job3 depends on job1
cJob3.addDependingJob(cJob2); // job3 depends on job2

JobControl jc = new JobControl("123");
jc.addJob(cJob1); // add all three jobs to the JobControl
jc.addJob(cJob2);
jc.addJob(cJob3);
jc.run();
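One practical detail: JobControl.run() loops until it is told to stop, so it is usually started in a separate thread while the caller polls for completion. A minimal sketch, assuming the jc variable from the snippet above:

// JobControl implements Runnable, so drive it from its own thread
Thread controller = new Thread(jc);
controller.setDaemon(true);
controller.start();

// wait until every job has either succeeded or failed
while (!jc.allFinished()) {
  try {
    Thread.sleep(500);
  } catch (InterruptedException e) {
    Thread.currentThread().interrupt();
    break;
  }
}
if (!jc.getFailedJobList().isEmpty()) {
  System.err.println("Failed jobs: " + jc.getFailedJobList());
}
jc.stop();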
3. Linear Chained MapReduce - ChainMapper/ChainReducer
ChainMapper/ChainReducer were introduced to support linear chains of Mappers. Several Mappers can run within the Map or the Reduce phase; like a Linux pipe, the output of one Mapper is fed directly into the input of the next, forming a pipeline.
Note that in such a job the Map and Reduce phases may contain any number of Mappers, but there can be only one Reducer. A job that needs more than one Reducer therefore cannot be built with ChainMapper/ChainReducer.
The code is as follows:
...
conf.setJobName("chain");
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

JobConf mapper1Conf = new JobConf(false);
JobConf mapper2Conf = new JobConf(false);
JobConf reduce1Conf = new JobConf(false);
JobConf mapper3Conf = new JobConf(false);
...

ChainMapper.addMapper(conf, Mapper1.class, LongWritable.class, Text.class, Text.class, Text.class, true, mapper1Conf);
ChainMapper.addMapper(conf, Mapper2.class, Text.class, Text.class, LongWritable.class, Text.class, false, mapper2Conf);
ChainReducer.setReducer(conf, Reducer.class, LongWritable.class, Text.class, Text.class, Text.class, true, reduce1Conf);
ChainReducer.addMapper(conf, Mapper3.class, Text.class, Text.class, LongWritable.class, Text.class, true, mapper3Conf);
JobClient.runJob(conf);
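The Mapper classes themselves are ordinary old-API Mappers whose key/value types must line up with the addMapper/setReducer calls above. As a purely hypothetical sketch (not from the original example), Mapper1, which takes LongWritable/Text and emits Text/Text, could look like this:

// Hypothetical Mapper1: splits each input line into (first token, rest of line).
// Its input types LongWritable/Text and output types Text/Text match the
// ChainMapper.addMapper(conf, Mapper1.class, ...) call above.
public static class Mapper1 extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String[] parts = value.toString().split("\\s+", 2);
    output.collect(new Text(parts[0]), new Text(parts.length > 1 ? parts[1] : ""));
  }
}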
4. Sub-Job MapReduce
The sub-Job style is really a special case of the iterative style; I list it here separately. Put simply, one parent Job contains several child Jobs.
In Nutch, the Crawler is such a parent Job: its run method calls the runTool helper to launch each child Job, and runTool uses reflection to instantiate and run the child Job.
Let's look at how Nutch implements this:
....
private NutchTool currentTool = null;
....

private Map<String, Object> runTool(Class<? extends NutchTool> toolClass,
    Map<String, Object> args) throws Exception {
  currentTool = (NutchTool) ReflectionUtils.newInstance(toolClass, getConf());
  return currentTool.run(args);
}

...

@Override
public Map<String, Object> run(Map<String, Object> args) throws Exception {
  results.clear();
  status.clear();
  String crawlId = (String) args.get(Nutch.ARG_CRAWL);
  if (crawlId != null) {
    getConf().set(Nutch.CRAWL_ID_KEY, crawlId);
  }
  String seedDir = null;
  String seedList = (String) args.get(Nutch.ARG_SEEDLIST);
  if (seedList != null) { // takes precedence
    String[] seeds = seedList.split("\\s+");
    // create tmp. dir (tmpSeedDir and cleanSeedDir are fields of the class,
    // their declarations are elided above)
    tmpSeedDir = getConf().get("hadoop.tmp.dir") + "/seed-"
        + System.currentTimeMillis();
    FileSystem fs = FileSystem.get(getConf());
    Path p = new Path(tmpSeedDir);
    fs.mkdirs(p);
    Path seedOut = new Path(p, "urls");
    OutputStream os = fs.create(seedOut);
    for (String s : seeds) {
      os.write(s.getBytes());
      os.write('\n');
    }
    os.flush();
    os.close();
    cleanSeedDir = true;
    seedDir = tmpSeedDir;
  } else {
    seedDir = (String) args.get(Nutch.ARG_SEEDDIR);
  }
  Integer depth = (Integer) args.get(Nutch.ARG_DEPTH);
  if (depth == null)
    depth = 1;
  boolean parse = getConf().getBoolean(FetcherJob.PARSE_KEY, false);
  String solrUrl = (String) args.get(Nutch.ARG_SOLR);
  int onePhase = 3;
  if (!parse)
    onePhase++;
  float totalPhases = depth * onePhase;
  if (seedDir != null)
    totalPhases++;
  float phase = 0;
  Map<String, Object> jobRes = null;
  LinkedHashMap<String, Object> subTools = new LinkedHashMap<String, Object>();
  status.put(Nutch.STAT_JOBS, subTools);
  results.put(Nutch.STAT_JOBS, subTools);
  // inject phase
  if (seedDir != null) {
    status.put(Nutch.STAT_PHASE, "inject");
    jobRes = runTool(InjectorJob.class, args);
    if (jobRes != null) {
      subTools.put("inject", jobRes);
    }
    status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
    if (cleanSeedDir && tmpSeedDir != null) {
      LOG.info(" - cleaning tmp seed list in " + tmpSeedDir);
      FileSystem.get(getConf()).delete(new Path(tmpSeedDir), true);
    }
  }
  if (shouldStop) {
    return results;
  }
  // run "depth" cycles
  for (int i = 0; i < depth; i++) {
    status.put(Nutch.STAT_PHASE, "generate " + i);
    jobRes = runTool(GeneratorJob.class, args);
    if (jobRes != null) {
      subTools.put("generate " + i, jobRes);
    }
    status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
    if (shouldStop) {
      return results;
    }
    status.put(Nutch.STAT_PHASE, "fetch " + i);
    jobRes = runTool(FetcherJob.class, args);
    if (jobRes != null) {
      subTools.put("fetch " + i, jobRes);
    }
    status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
    if (shouldStop) {
      return results;
    }
    if (!parse) {
      status.put(Nutch.STAT_PHASE, "parse " + i);
      jobRes = runTool(ParserJob.class, args);
      if (jobRes != null) {
        subTools.put("parse " + i, jobRes);
      }
      status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
      if (shouldStop) {
        return results;
      }
    }
    status.put(Nutch.STAT_PHASE, "updatedb " + i);
    jobRes = runTool(DbUpdaterJob.class, args);
    if (jobRes != null) {
      subTools.put("updatedb " + i, jobRes);
    }
    status.put(Nutch.STAT_PROGRESS, ++phase / totalPhases);
    if (shouldStop) {
      return results;
    }
  }
  if (solrUrl != null) {
    status.put(Nutch.STAT_PHASE, "index");
    jobRes = runTool(SolrIndexerJob.class, args);
    if (jobRes != null) {
      subTools.put("index", jobRes);
    }
  }
  return results;
}
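The pattern is straightforward: each call to runTool() blocks until the corresponding child job (inject, generate, fetch, parse, updatedb, index) has finished, the parent records the child's result in subTools, and overall progress is reported by bumping phase against the precomputed totalPhases after every child job.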