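MapReduce Types and Formats (Input and Output)
I. Input Formats
(1) Input splits and records
① The JobClient generates the data splits (InputSplit) according to the input format specified for the job;
② A split is not the data itself but a reference to the data it covers;
③ The InputFormat is responsible for generating the splits (it is an interface in the old API and an abstract class in the new API);
Source location: package org.apache.hadoop.mapreduce.lib.input (new API)
packages org.apache.hadoop.mapred and org.apache.hadoop.mapred.lib (old API)
See the getSplits() method of the FileInputFormat class;
The computeSplitSize() function determines the split size;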
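To make the role of computeSplitSize() concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes the rule used by the new-API FileInputFormat, which clamps the block size between the configured minimum and maximum split sizes. The class name SplitSizeSketch and the sample sizes are hypothetical and chosen only for illustration.

```java
// Illustrative sketch only: plain Java, no Hadoop classes involved.
class SplitSizeSketch {

    // Clamp the block size between the minimum and maximum split size.
    // (Assumed to mirror the new-API FileInputFormat rule.)
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // With a tiny minimum and a huge maximum, the split size is the block size.
        System.out.println(computeSplitSize(64 * mb, 1L, Long.MAX_VALUE) / mb);       // prints 64
        // A minimum larger than the block size makes splits span several blocks.
        System.out.println(computeSplitSize(64 * mb, 128 * mb, Long.MAX_VALUE) / mb); // prints 128
        // A maximum smaller than the block size makes splits smaller than a block.
        System.out.println(computeSplitSize(64 * mb, 1L, 32 * mb) / mb);              // prints 32
    }
}
```

In other words, the block size is only the default; the minimum and maximum settings win whenever they conflict with it.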
Class hierarchy of the input format classes:
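(2) File input
Abstract class: FileInputFormat
① FileInputFormat is the base class of every InputFormat implementation that uses files as its data source;
② By default, the split size used by FileInputFormat is determined by the HDFS block size (the minimum and maximum split size settings can override it);
Abstract class: CombineFileInputFormat
① CombineFileInputFormat can be used to pack many small files into fewer splits;
② Because CombineFileInputFormat is abstract, you must write a concrete subclass and implement getRecordReader() (createRecordReader() in the new API);
③ Ways to prevent a file from being split:
A. Make the block size large enough that the file is smaller than one block, so it is never split (raising the minimum split size above the file size has the same effect);
B. Subclass FileInputFormat and override isSplitable() so that it returns false;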
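The effect of isSplitable() on split generation can be sketched in plain Java. This is a simplified model, not Hadoop's code: it assumes a single file of positive length, ignores block locations, and assumes the 10% "slop" that FileInputFormat allows on the last split. The class FileSplitSketch and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: models how one file is cut into {offset, length} splits.
class FileSplitSketch {

    // Assumed slop factor: the final split may be up to 10% larger than splitSize.
    private static final double SPLIT_SLOP = 1.1;

    // Assumes fileLength > 0 and splitSize > 0.
    static List<long[]> getSplits(long fileLength, long splitSize, boolean splitable) {
        List<long[]> splits = new ArrayList<>();
        if (!splitable) {
            // A non-splittable file always becomes exactly one split.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        long remaining = fileLength;
        // Emit full-size splits while clearly more than one split's worth remains.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        // Whatever is left (possibly slightly more than splitSize) is the last split.
        if (remaining != 0) {
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }
}
```

For example, a 200-byte file with a 64-byte split size yields four splits when splittable and a single 200-byte split when isSplitable() would return false.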
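(3) Text input
Class: TextInputFormat
① TextInputFormat is the default InputFormat; each line of input is one record;
② The key is a LongWritable holding the byte offset of the start of the line within the file; the value is the content of the line, as a Text;
③ Relationship between input splits and HDFS blocks: each TextInputFormat record is a line, and a line may well straddle a block boundary, so the record reader may have to read a little data from the next block to finish the last line of its split;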
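The offset keys and the handling of lines that cross a split boundary can be illustrated with a small plain-Java model. It is a simplified sketch of the convention as I understand it, not Hadoop's LineRecordReader: a reader whose split does not start at offset 0 discards its first (possibly partial) line, and every reader keeps going as long as a line starts at or before the end of its split. The class LineSplitSketch is a hypothetical name, and only '\n' line endings are handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: reads the line records owned by one byte range of a file.
class LineSplitSketch {

    // Returns "offset<TAB>line" for each record owned by the split [start, start + length).
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: skip up to and including the next newline.
            // The reader of the previous split is responsible for that line.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        List<String> records = new ArrayList<>();
        // Read every line that STARTS at or before the split end, even if it ends beyond it.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            String line = new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8);
            records.add(lineStart + "\t" + line);
            pos++; // step over the newline
        }
        return records;
    }
}
```

With the three 8-byte lines "aaa bbb", "ccc aaa" and "ddd eee" and a boundary at byte 10, the first split yields the records at offsets 0 and 8 (finishing the line that crosses the boundary), and the second split yields only the record at offset 16, so every line is read exactly once.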
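Class: KeyValueTextInputFormat
Each line is split into a key and a value at the first separator (a tab by default), so if every line begins with its line number, the key tells you the line number of the record; the separator is set with the key.value.separator.in.input.line property (mapreduce.input.keyvaluelinerecordreader.key.value.separator in the new API);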
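The key/value separation described above amounts to cutting each line at the first occurrence of the separator. The sketch below shows that rule in plain Java; KeyValueLineSketch is a hypothetical helper, and the behaviour for a line without any separator (whole line becomes the key, value is empty) is stated as an assumption about how the real record reader behaves.

```java
// Illustrative sketch only: splits one line into {key, value} at the first separator.
class KeyValueLineSketch {

    static String[] split(String line, char separator) {
        int idx = line.indexOf(separator);
        if (idx < 0) {
            // Assumption: with no separator, the whole line is the key and the value is empty.
            return new String[] {line, ""};
        }
        // Only the FIRST separator matters; later ones stay inside the value.
        return new String[] {line.substring(0, idx), line.substring(idx + 1)};
    }

    public static void main(String[] args) {
        String[] kv = split("1\taaa bbb", '\t');
        System.out.println(kv[0] + " -> " + kv[1]); // prints 1 -> aaa bbb
    }
}
```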
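Class: NLineInputFormat
Lets you set how many lines of input each mapper processes, via the mapred.line.input.format.linespermap property (mapreduce.input.lineinputformat.linespermap in newer releases);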
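What "N lines per mapper" means can be shown with a plain-Java grouping of lines. This is only a conceptual sketch: the real NLineInputFormat produces byte-range splits rather than lists of strings, and NLineSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: groups input lines so that each group would feed one mapper.
class NLineSketch {

    static List<List<String>> group(List<String> lines, int linesPerMap) {
        if (linesPerMap <= 0) {
            throw new IllegalArgumentException("linesPerMap must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerMap) {
            int to = Math.min(i + linesPerMap, lines.size());
            // Copy the sub-list so each group is independent of the input list.
            groups.add(new ArrayList<>(lines.subList(i, to)));
        }
        return groups;
    }
}
```

Five input lines with two lines per mapper would give three groups, the last holding a single line.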
(4) Binary input
Classes: SequenceFileInputFormat
SequenceFileAsTextInputFormat
SequenceFileAsBinaryInputFormat
Because SequenceFile is splittable, it works well as a MapReduce input format, and each split conveniently yields records that are already <key, value> pairs;
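(5) Multiple inputs
Class: MultipleInputs
① MultipleInputs lets a single job read inputs of different formats, each with its own InputFormat and mapper;
② Call addInputPath() once for each input path;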
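(6) Database input
Class: DBInputFormat
① DBInputFormat is an input format that reads data from a relational database over JDBC;
② Take care not to open too many database connections (each map task connects to the database);
③ TableInputFormat, from HBase, lets a MapReduce program read the data in an HBase table;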
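Experiments:
Create a project named TestMRInputFormat, add the package com.mr, and import the required dependency JARs.
Experiment ①: a SequenceFile is used as input, so run SequenceFileWriter.java beforehand to generate the file b.seq;
Create the class TestInputFormat1.java (adapted from WordCount.java):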
package com.mr;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// The public class name must match the file name TestInputFormat1.java.
public class TestInputFormat1 {

    // The input key/value types must match the types stored in b.seq
    // (here: IntWritable keys and Text values).
    public static class TokenizerMapper
            extends Mapper<IntWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(IntWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every whitespace-separated token in the value.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts collected for this word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // On Hadoop 2.x and later, Job.getInstance(conf, "word count") is the non-deprecated form.
        Job job = new Job(conf, "word count");
        job.setJarByClass(TestInputFormat1.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(SequenceFileInputFormat.class); // set the input format
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
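Run it in Eclipse, with the program arguments configured as shown in the figure below:
The resulting word counts are as follows:
Experiment ②: input from multiple sources:
TestInputFormat2.java: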
package com.mr;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TestInputFormat2 {

    // First mapper: reads the text file (key = byte offset, value = line).
    public static class Mapper1
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Second mapper: reads the SequenceFile (IntWritable keys, Text values).
    public static class Mapper2
            extends Mapper<IntWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(IntWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(TestInputFormat2.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        Path path1 = new Path("/a.txt");
        Path path2 = new Path("/b.seq");
        // Multiple inputs: each path gets its own input format and mapper.
        MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Mapper1.class);
        MultipleInputs.addInputPath(job, path2, SequenceFileInputFormat.class, Mapper2.class);
        FileOutputFormat.setOutputPath(job, new Path("/output2"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Create the input text file a.txt:
aaa bbb
ccc aaa
ddd eee
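Package the project as a JAR (for some reason it would not run from inside Eclipse, and I have not found the cause yet; running it with the hadoop jar command works):
File -> Export -> Runnable JAR file, and name the JAR file testMR.jar.
Run it from the command line:
$ hadoop jar testMR.jar com.mr.TestInputFormat2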
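The resulting word counts are as follows:
II. Output Formats
Class hierarchy of the output format classes:
(1) Text output
Class: TextOutputFormat
① The default output format; keys and values may be of any type, because TextOutputFormat converts them to strings by calling toString();
② Each record is written as one line in the form "key \t value" (the tab separator can be changed with mapred.textoutputformat.separator, or mapreduce.output.textoutputformat.separator in the new API);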
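The "key \t value" line format can be sketched in plain Java as follows. This is a simplified model rather than Hadoop's record writer: TextOutputLineSketch is a hypothetical name, null stands in for a missing key or value, and the trailing newline is left out.

```java
// Illustrative sketch only: builds the text line for one output record.
class TextOutputLineSketch {

    static String formatLine(Object key, Object value, String separator) {
        boolean hasKey = key != null;
        boolean hasValue = value != null;
        StringBuilder sb = new StringBuilder();
        if (hasKey) {
            sb.append(key.toString());   // any type works: only toString() is needed
        }
        if (hasKey && hasValue) {
            sb.append(separator);        // separator only when both parts are present
        }
        if (hasValue) {
            sb.append(value.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatLine("aaa", 2, "\t")); // prints aaa<TAB>2
    }
}
```

Because only toString() is involved, a word-count job can emit Text keys and IntWritable values and still get readable lines such as a word followed by its count.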
(2) Binary output
Classes: SequenceFileOutputFormat
SequenceFileAsBinaryOutputFormat
MapFileOutputFormat
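(3) Multiple-file output
Classes: MultipleOutputFormat
MultipleOutputs
Difference: MultipleOutputs can emit outputs of different types (each named output can have its own key and value types);
(4) Database output
Class: DBOutputFormat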
http://blog.sina.com.cn/s/blog_4438ac090101qfuh.html