Hadoop: Counting the Occurrences of a Specific Word in a File

阿新 • Published: 2018-12-12

Reposted from: https://www.xuebuyuan.com/1270346.html

Originally posted: 2013-10-24

Suppose the file word.txt contains the following:

what is you name?

my name is zhang san。

Task: count how many times the word "is" appears in word.txt (twice in this sample).

The code is as follows:

PerWordMapper

package com.hadoop.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PerWordMapper extends Mapper<Object, Text, Text, IntWritable> {

	// Reused across map() calls to avoid allocating a new object per record.
	private final Text keyText = new Text();
	private final IntWritable one = new IntWritable(1);

	@Override
	protected void map(Object key, Text value, Context context)
			throws IOException, InterruptedException {
		// Split the input line on whitespace.
		StringTokenizer tokens = new StringTokenizer(value.toString());
		while (tokens.hasMoreTokens()) {
			String token = tokens.nextToken();
			// This is where the word to count is checked: emit (word, 1) only for "is".
			if (token.equals("is")) {
				keyText.set(token);
				context.write(keyText, one);
			}
		}
	}
}
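To see what the token check in the mapper does without a cluster, here is a minimal plain-Java sketch. The class name TokenFilterSketch and the method countMatches are hypothetical and not part of the original job; the sketch only assumes the same whitespace split and exact equals comparison as the mapper.

```java
import java.util.StringTokenizer;

// Hypothetical helper, not part of the original job: mirrors the mapper's token check.
class TokenFilterSketch {

    // Counts whitespace-separated tokens in one line that equal the target exactly.
    static int countMatches(String line, String target) {
        int matches = 0;
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            if (tokens.nextToken().equals(target)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("what is you name?", "is"));      // prints 1
        System.out.println(countMatches("my name is zhang san。", "is")); // prints 1
        // Punctuation stays attached to its token, so the trailing "is." is not counted.
        System.out.println(countMatches("that is what it is.", "is"));    // prints 1
    }
}
```

Because the comparison is exact, a token such as "is." or "Is" would not be counted. If those should match, the tokens would need to be normalized first (for example by stripping punctuation or lowercasing), which the original code does not do.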

PerWordReducer

package com.hadoop.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PerWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

	// Reused holder for the output value.
	private final IntWritable result = new IntWritable();

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		// Add up all counts emitted for this key in a single pass.
		int sum = 0;
		for (IntWritable value : values) {
			sum += value.get();
		}
		result.set(sum);
		context.write(key, result);
	}
}
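The summing step of the reducer can be tried in isolation. The following is a minimal sketch in plain Java, assuming a List<Integer> as a stand-in for Hadoop's Iterable<IntWritable>; the class name SumValuesSketch and the method sum are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the reduce step; a List<Integer> replaces Iterable<IntWritable>.
class SumValuesSketch {

    // Sums all values in a single pass; for-each obtains the iterator exactly once.
    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two 1s, as the mapper would emit for the two "is" tokens in word.txt.
        List<Integer> ones = Arrays.asList(1, 1);
        System.out.println(sum(ones)); // prints 2
    }
}
```

Obtaining the iterator once matters for ordinary Java collections, where every call to iterator() returns a fresh iterator that starts again at the first element.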

PerWordCount

package com.hadoop.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class PerWordCount {
	public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
		Configuration conf = new Configuration();
		// Strip generic Hadoop options and keep only the job's own arguments.
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
		System.out.println("otherArgs.length:" + otherArgs.length);
		if (otherArgs.length != 2) {
			System.err.println("Usage: wordcount <in> <out>");
			System.exit(2);
		}
		// Job.getInstance replaces the deprecated Job constructor (Hadoop 2.x and later).
		Job job = Job.getInstance(conf, "word count");
		job.setJarByClass(PerWordCount.class);
		job.setMapperClass(PerWordMapper.class);
		// Summing is associative, so the reducer can also serve as the combiner.
		job.setCombinerClass(PerWordReducer.class);
		job.setReducerClass(PerWordReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}
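The driver registers PerWordReducer as both the combiner and the reducer. The sketch below is a local simulation in plain Java, with no Hadoop classes, of why that is safe: summing partial sums gives the same total as summing every emitted 1 directly. The class CombinerSketch and its methods are hypothetical, and treating each input line as its own map task is an assumption made only to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical local simulation; no Hadoop classes are involved.
class CombinerSketch {

    // Map step: emit a 1 for every token equal to the target.
    static List<Integer> map(String line, String target) {
        List<Integer> emitted = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            if (tokens.nextToken().equals(target)) {
                emitted.add(1);
            }
        }
        return emitted;
    }

    // Reduce step, also used as the combine step: add up the values.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // No combiner: every emitted 1 goes straight to the reduce step.
    static int countDirect(List<String> lines, String target) {
        List<Integer> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line, target));
        }
        return reduce(all);
    }

    // With combiner: each line (assumed to be one map task) is pre-summed first.
    static int countWithCombiner(List<String> lines, String target) {
        List<Integer> partialSums = new ArrayList<>();
        for (String line : lines) {
            partialSums.add(reduce(map(line, target)));
        }
        return reduce(partialSums);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("what is you name?", "my name is zhang san。");
        System.out.println(countDirect(lines, "is"));       // prints 2
        System.out.println(countWithCombiner(lines, "is")); // prints 2
    }
}
```

For the sample word.txt, the real job should therefore write a single record with the key is and the value 2. One difference from the simulation: when the word never occurs, the mapper emits nothing, so the real job produces no record at all rather than a count of 0.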
