Handling Files from Different Input Directories in Hadoop

阿新 • Published: 2019-01-30

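When writing MapReduce jobs you inevitably have to handle joins.
The simplest kind of join is the one-to-one join.
The small example below shows how to implement a one-to-one join in MapReduce.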

name.txt
100 tom
101 mary
102 kate

score.txt
100 90
101 85
102 80

We want the following join result:
100 tom 90
101 mary 85
102 kate 80

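Approach:
The job takes two input files, name.txt and score.txt. Each record needs a marker identifying which file it came from, so the map output adds a file tag to every value.
The map output looks like this: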
100 name+tom
100 score+90
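The reduce phase then uses the prefix to tell the records apart. Because the join is one-to-one, it only has to concatenate the differently tagged values that share a key and emit the result.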
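
Before looking at the Hadoop job itself, the tagging idea can be tried out in plain Java. The following is only a minimal sketch under stated assumptions: the class name TaggedJoinSketch and its helper methods are hypothetical, a TreeMap stands in for the shuffle that groups values by key, and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the tagged reduce-side join; hypothetical names, no Hadoop classes.
class TaggedJoinSketch {

    // "Map" step: prefix the value with a tag naming the source file.
    static String tag(String source, String value) {
        return source + "+" + value;
    }

    // "Reduce" step for one key: pick out each side by its prefix and join them.
    // Returns null when either side is missing (inner join).
    static String reduce(String key, List<String> taggedValues) {
        String name = "";
        String score = "";
        for (String v : taggedValues) {
            if (v.startsWith("name+")) {
                name = v.substring("name+".length());
            } else if (v.startsWith("score+")) {
                score = v.substring("score+".length());
            }
        }
        if (name.isEmpty() || score.isEmpty()) {
            return null;
        }
        return key + "\t" + name + "\t" + score;
    }

    // Stand-in for the shuffle: group tagged values by key, then reduce each group.
    static List<String> join(String[][] names, String[][] scores) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String[] r : names) {
            groups.computeIfAbsent(r[0], k -> new ArrayList<>()).add(tag("name", r[1]));
        }
        for (String[] r : scores) {
            groups.computeIfAbsent(r[0], k -> new ArrayList<>()).add(tag("score", r[1]));
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            String line = reduce(e.getKey(), e.getValue());
            if (line != null) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] names = {{"100", "tom"}, {"101", "mary"}, {"102", "kate"}};
        String[][] scores = {{"100", "90"}, {"101", "85"}, {"102", "80"}};
        for (String line : join(names, scores)) {
            System.out.println(line);
        }
    }
}
```

With the sample data from name.txt and score.txt, main prints the three joined rows, each with the key, name and score separated by tabs.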

The code is as follows:

package org.myorg;

import java.io.IOException;
import java.util.*;
import java.lang.String;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class OneOnOneJoin {

    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

      public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

        // The path of the file this record came from tells us which tag to use.
        String path = ((FileSplit) reporter.getInputSplit()).getPath().toString();
        Text kv = new Text();

        // contains() rather than indexOf(...) > 0, which would miss a match at index 0.
        if (path.contains("name.txt")) {
          kv.set("name" + "+" + value);
        } else if (path.contains("score.txt")) {
          kv.set("score" + "+" + value);
        } else {
          return; // Skip records from any other file.
        }

        output.collect(key, kv);
      }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
      public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {

        String name = "";
        String score = "";

        // Strip the tag from each value and remember which side it belongs to.
        while (values.hasNext()) {
          String value = values.next().toString();

          if (value.startsWith("name+")) {
            name = value.substring(5);
          } else if (value.startsWith("score+")) {
            score = value.substring(6);
          }
        }

        // Inner join: emit only when both sides are present.
        if (!name.equals("") && !score.equals("")) {
          output.collect(key, new Text(name + "\t" + score));
        }

      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(OneOnOneJoin.class);
      conf.setJobName("oneononejoin");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(Text.class);

      conf.setMapperClass(Map.class);
      conf.setReducerClass(Reduce.class);

      // KeyValueTextInputFormat splits each line at the first tab into key and value,
      // so the input files are expected to be tab-separated.
      conf.setInputFormat(KeyValueTextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      // args[0] is the input directory holding name.txt and score.txt; args[1] is the output directory.
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }
}

Compile and run:
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d OneOnOneJoin OneOnOneJoin.java
jar -cvf OneOnOneJoin.jar -C OneOnOneJoin/ .

hadoop fs -rmr /sunwg/output
hadoop jar OneOnOneJoin.jar org.myorg.OneOnOneJoin /sunwg/input /sunwg/output

View the result:
$ hadoop fs -cat /sunwg/output/part-00000
100     tom     90
101     mary    85
102     kate    80

Next, consider an outer join, where score.txt may have no matching record for a name.
For example:

name.txt
100 tom
101 mary
102 kate

score.txt
100 90
102 80

We want the following join result:
100 tom 90
101 mary
102 kate 80

To get this, just change the final output check in reduce to:

if (!name.equals("")) {
  output.collect(key, new Text(name + "\t" + score));
}

Result:
$ hadoop fs -cat /sunwg/output/part-00000
100     tom     90
101     mary
102     kate    80

Going one step further, consider a full outer join, where name.txt may also have no matching record for a score.
For example:
name.txt
100 tom
101 mary

score.txt
100 90
102 80

We want the following join result:
100 tom 90
101 mary
102 80

To get this, just change the final output check in reduce to:

// Full outer join: emit when at least one side is present.
if (!name.equals("") || !score.equals("")) {
  output.collect(key, new Text(name + "\t" + score));
}

Result:
$ hadoop fs -cat /sunwg/output/part-00000
100     tom     90
101     mary
102             80
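
The three variants differ only in the condition that guards the final output.collect call. As a rough summary, the sketch below puts the three conditions side by side in plain Java. The JoinModeSketch class, the JoinType enum and the shouldEmit helper are hypothetical names introduced here for illustration and are not part of the job above.

```java
// Plain-Java sketch of the emit condition for each join type; hypothetical names, not Hadoop code.
class JoinModeSketch {

    enum JoinType { INNER, LEFT_OUTER, FULL_OUTER }

    // Decide whether the reducer should emit a row, given what it found for the key.
    // An empty string means "no record from that file", as in the reducer above.
    static boolean shouldEmit(JoinType type, String name, String score) {
        boolean hasName = !name.isEmpty();
        boolean hasScore = !score.isEmpty();
        switch (type) {
            case INNER:
                return hasName && hasScore;
            case LEFT_OUTER:
                return hasName;
            case FULL_OUTER:
                return hasName || hasScore;
            default:
                throw new IllegalArgumentException("unknown join type: " + type);
        }
    }

    public static void main(String[] args) {
        // Key 102 has a score but no name: only the full outer join keeps it.
        for (JoinType t : JoinType.values()) {
            System.out.println(t + " -> " + shouldEmit(t, "", "80"));
        }
    }
}
```

Keeping the condition in one place like this would make it possible to switch join types through a job parameter instead of editing the reducer, though that is a design choice beyond what the original code does.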

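The above implements a join of two files. With the same strategy you can tag the files of each input directory (every file in a directory gets the same tag) and apply different map and reduce logic per tag, so that each directory is processed differently.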
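
One way to tag by directory rather than by file name is to derive the tag from the parent directory of the input path. The sketch below is plain Java and only illustrates the string handling. The class DirectoryTagSketch, its methods, and the directory names names and scores are hypothetical; in the mapper above, the path string would come from the same reporter.getInputSplit() call.

```java
// Plain-Java sketch: derive a record tag from the directory containing the input file.
// Hypothetical helper names; the example paths are assumptions, not taken from the job above.
class DirectoryTagSketch {

    // Return the name of the directory that directly contains the file,
    // or an empty string when the path has no parent directory.
    static String parentDirName(String path) {
        int lastSlash = path.lastIndexOf('/');
        if (lastSlash <= 0) {
            return "";
        }
        String parent = path.substring(0, lastSlash);
        return parent.substring(parent.lastIndexOf('/') + 1);
    }

    // Prefix the value with the directory name, mirroring the "name+tom" format.
    static String tagByDirectory(String path, String value) {
        return parentDirName(path) + "+" + value;
    }

    public static void main(String[] args) {
        System.out.println(tagByDirectory("/sunwg/input/names/part-00000", "tom"));
        System.out.println(tagByDirectory("/sunwg/input/scores/part-00000", "90"));
    }
}
```

The reducer can then branch on the directory tag in the same way it branches on the name+ and score+ prefixes.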
