Implementing a Full-Text Document Search Engine with MapReduce
By indexing documents by their content, we can count how many times each word occurs in each document, which is the core of a full-text search (inverted index) feature.
Files to prepare: a few plain-text files uploaded to HDFS to serve as the job's input.
The Hadoop job is split into three steps:
1. The mapper makes a first pass over each document and emits every word together with the URI of the file it came from, with the count of each occurrence initially set to 1.
   Output format: word||document-uri	1
2. The combiner does a first aggregation: for each word it sums the 1s coming from the same document, i.e. the term frequency of that word in that file, and passes the result on to the reducer.
   Output format: word	uri-------frequency
3. After the shuffle, the reducer collects all per-document entries of a word and concatenates them into the final index file.
   Output format: word	uri-------frequency;uri-------frequency;
A small worked example of these three phases is given below.
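For instance, suppose there are two input files (the file names and URIs are made up purely for illustration): a.txt containing "hello world hello" and b.txt containing "hello hadoop". The data would then flow through the three phases roughly like this:

Mapper output (one pair per word occurrence):
hello||hdfs://ns/a.txt	1
hello||hdfs://ns/a.txt	1
world||hdfs://ns/a.txt	1
hello||hdfs://ns/b.txt	1
hadoop||hdfs://ns/b.txt	1

Combiner output (per-file term frequency):
hello	hdfs://ns/a.txt-------2
world	hdfs://ns/a.txt-------1
hello	hdfs://ns/b.txt-------1
hadoop	hdfs://ns/b.txt-------1

Reducer output (final inverted index, sorted by word):
hadoop	hdfs://ns/b.txt-------1;
hello	hdfs://ns/a.txt-------2;hdfs://ns/b.txt-------1;
world	hdfs://ns/a.txt-------1;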
The code:
package demo01.hadoop;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class InvertedIndex extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("args error!");
            return -1;
        }
        Path src = new Path(args[0]);
        Path desc = new Path(args[1]);
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        // delete the output directory if it already exists
        if (fs.exists(desc)) {
            fs.delete(desc, true);
        }
        Job job = Job.getInstance(conf, "inverted index");
        job.setJarByClass(getClass());

        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setCombinerClass(MyCombiner.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, src);
        FileOutputFormat.setOutputPath(job, desc);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int code = ToolRunner.run(new InvertedIndex(), args);
        System.exit(code);
    }

    /**
     * Mapper: for every word, emit key = word||document-uri, value = "1".
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        String uri;
        Text key2 = new Text();
        Text value2 = new Text();

        @Override
        public void setup(Context context) throws IOException, InterruptedException {
            // remember the URI of the file this input split belongs to
            FileSplit split = (FileSplit) context.getInputSplit();
            this.uri = split.getPath().toString();
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // split each line on whitespace and emit (word||uri, 1) per occurrence
            String[] strs = value.toString().split("\\s+");
            for (String str : strs) {
                key2.set(str + "||" + uri);
                value2.set("1");
                context.write(key2, value2);
            }
        }
    }

    /**
     * Combiner: sum the counts of each word within one document (the term
     * frequency per file). Output: key = word, value = uri-------frequency.
     * Note that this design relies on the combiner actually running, which
     * Hadoop treats as an optimization rather than a guarantee.
     */
    public static class MyCombiner extends Reducer<Text, Text, Text, Text> {
        Text key4 = new Text();
        Text value4 = new Text();

        @Override
        public void reduce(Text key3, Iterable<Text> value3, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text v3 : value3) {
                sum += Integer.parseInt(v3.toString());
            }
            // split the composite key "word||uri" back into word and uri
            String word = key3.toString().substring(0, key3.toString().indexOf("||"));
            key4.set(word);
            int pos = key3.toString().length();
            String uri = key3.toString().substring(key3.toString().indexOf("||") + 2, pos);
            value4.set(uri + "-------" + sum);
            context.write(key4, value4);
        }
    }

    /**
     * Reducer: after the shuffle, concatenate all per-document entries of a word.
     * Output: key = word, value = uri-------frequency;uri-------frequency;
     */
    public static class MyReducer extends Reducer<Text, Text, Text, Text> {
        Text key6 = new Text();
        Text value6 = new Text();

        @Override
        public void reduce(Text key5, Iterable<Text> value5, Context context)
                throws IOException, InterruptedException {
            StringBuffer sb = new StringBuffer();
            for (Text v5 : value5) {
                sb.append(v5 + ";");
            }
            key6.set(key5.toString());
            value6.set(sb.toString());
            context.write(key6, value6);
        }
    }
}
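To run it, package the class into a jar and submit it with the standard hadoop jar command. The jar name and HDFS paths below are only placeholders; args[0] is the input directory and args[1] the output directory, which run() deletes first if it already exists:

hadoop jar inverted-index.jar demo01.hadoop.InvertedIndex /input/docs /output/index

The final index is then written to the part-r-00000 file under /output/index, one word per line followed by its list of uri-------frequency entries. Because the driver goes through ToolRunner, generic Hadoop options such as -D properties can also be passed on the command line before the input and output paths.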
Comments and suggestions are welcome.