Learning Hadoop: Developing a WordCount Example in IDEA
To build the WordCount example in IDEA, first create a Maven project and add the following dependencies:
    <repositories>
        <repository>
            <id>apache</id>
            <url>http://maven.apache.org</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <configuration>
                    <excludeTransitive>false</excludeTransitive>
                    <stripVersion>true</stripVersion>
                    <outputDirectory>./lib</outputDirectory>
                </configuration>
            </plugin>
        </plugins>
    </build>
After adding the dependencies, click the project you created and choose Open Module Settings, as shown below.
Next, add the Hadoop JARs to the project, as shown in the figure below:
Select your Hadoop installation directory, then choose the folders shown below and add them.
Next, open the run configuration and set the job's input and output paths. In Program arguments, the first argument is the input path and the second is the output path. Note that both are HDFS cluster paths: you must upload the input folder to the HDFS cluster first and pass that HDFS path, otherwise the job fails with a file-not-found error (see my previous post for how to resolve that error). A sketch of these steps follows.
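As a minimal sketch, assuming the NameNode listens at hdfs://localhost:9000 and the local input folder is named input (both the address and the paths here are assumptions, not from the original setup), uploading the data and filling in Program arguments might look like this:

    # Create the input directory on HDFS and upload the local files (paths are assumed)
    hdfs dfs -mkdir -p /user/hadoop/input
    hdfs dfs -put ./input/*.txt /user/hadoop/input

    # Program arguments in the IDEA run configuration: <input path> <output path>
    hdfs://localhost:9000/user/hadoop/input hdfs://localhost:9000/user/hadoop/output

Note that the output directory must not already exist; MapReduce refuses to overwrite an existing output path.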
Once the paths are configured, add the core-site.xml configuration file to the project, as shown in the figure below:
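For reference, a minimal core-site.xml typically looks like the following; the fs.defaultFS value is an assumption and must match your own cluster's NameNode address:

    <?xml version="1.0"?>
    <configuration>
        <!-- Address of the NameNode; replace with your cluster's host and port (assumed value) -->
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>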
With everything in place, you can start coding. Create a Java class named WordCount:
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import java.io.IOException;
    import java.util.StringTokenizer;

    public class WordCount {

        // TokenizerMapper extends Mapper: splits each input line into words
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

            // Constant 1, the value emitted for every word occurrence
            public static final IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // map: the input value is a line of text; emit (word, 1) for each token
            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer str = new StringTokenizer(value.toString());
                while (str.hasMoreTokens()) {
                    word.set(str.nextToken());
                    context.write(word, one);
                }
            }
        }

        // IntSumReducer extends Reducer: sums the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

            private IntWritable result = new IntWritable();

            // reduce: iterate over the counts for one word and total them
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Main: configure and submit the job
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "wordCount");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input path for the job (args[0])
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // Output path for the job (args[1])
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
Once everything is ready, click Run and the job will produce the results. Of course, start the Hadoop cluster before running; a sketch of these final steps follows.
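As a final sketch (the start scripts are the standard Hadoop 2.x ones; the output path is the assumed one from the earlier example), starting the cluster and inspecting the result might look like:

    # Start HDFS and YARN (run once before launching the job)
    start-dfs.sh
    start-yarn.sh

    # After the job finishes, list and print the result files
    hdfs dfs -ls /user/hadoop/output
    hdfs dfs -cat /user/hadoop/output/part-r-00000

Each line of part-r-00000 contains a word and its total count separated by a tab, sorted by word.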