MapReduce: A Map Join Example

阿新 • Published: 2020-08-17

Contents

When to Use It
Advantages
Approach: Use DistributedCache
Example
Requirements Analysis
Code Implementation

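When to Use It

A Map Join suits the case where one table is very small and the other is very large. The small table has to fit in the memory of each MapTask, because it is loaded in full before the join starts.

Advantages

Question: joining many tables on the Reduce side easily leads to data skew. What can be done about it?
Cache the small tables on the Map side and run the join logic there. This moves work into the Map phase, takes pressure off the Reduce side, and reduces data skew as far as possible.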
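To make the skew problem concrete, the sketch below counts how many records each reducer would receive when records are routed by join key. It applies the formula used by Hadoop's default `HashPartitioner`, `(hashCode & Integer.MAX_VALUE) % numReducers`, to plain Java strings rather than Hadoop `Text` keys, so the exact partition numbers are only illustrative. The class name `SkewSketch`, the pid values, and the record counts are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class SkewSketch {

    // Counts records per reducer when each record is routed by its join key,
    // using the same formula as Hadoop's default HashPartitioner.
    static int[] reducerLoad(List<String> joinKeys, int numReducers) {
        int[] load = new int[numReducers];
        for (String key : joinKeys) {
            int partition = (key.hashCode() & Integer.MAX_VALUE) % numReducers;
            load[partition]++;
        }
        return load;
    }

    public static void main(String[] args) {
        // Hypothetical order table: product "01" is far more popular than the others.
        List<String> pids = new ArrayList<>();
        for (int i = 0; i < 9_000; i++) pids.add("01");
        for (int i = 0; i < 500; i++) pids.add("02");
        for (int i = 0; i < 500; i++) pids.add("03");

        int[] load = reducerLoad(pids, 3);
        for (int r = 0; r < load.length; r++) {
            System.out.println("reducer " + r + ": " + load[r] + " records");
        }
    }
}
```

All records that share a key go to the same reducer, so a hot key overloads one reducer no matter how many reducers are configured. A map-side join has no such routing step: each MapTask joins the records of its own split.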

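Approach: Use `DistributedCache`

(1) In the Mapper's setup() phase, read the file into an in-memory collection.
(2) In the driver, register the file with the distributed cache.

// Cache a regular file on the nodes where the tasks run.
job.addCacheFile(new URI("file:///e:/cache/pd.txt"));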
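The setup-phase step can be tried out without a Hadoop cluster. The following is a minimal sketch, assuming the cached file holds tab-separated `pid<TAB>pname` lines; the class `LookupLoader`, its `load` method, and the sample content are hypothetical names and data introduced only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class LookupLoader {

    // Reads tab-separated "pid<TAB>pname" lines into a map,
    // skipping blank lines and lines with fewer than two fields.
    static Map<String, String> load(Reader source) {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length < 2 || fields[0].isEmpty()) {
                    continue;
                }
                lookup.put(fields[0], fields[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lookup;
    }

    public static void main(String[] args) {
        // Hypothetical file content, including a blank and a malformed line.
        String content = "01\tXiaomi\n\n02\tHuawei\nbroken-line\n";
        Map<String, String> lookup = load(new StringReader(content));
        System.out.println(lookup.size() + " products loaded");
    }
}
```

Taking a `Reader` instead of a file path keeps the parsing logic independent of where the cached file ends up on the task node, which makes it easy to test with an in-memory string.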

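Example

Each MapTask completes the join inside map().
Notes:

Only order.txt, the data to be joined, is divided into splits and read by the MapTasks.
pd.txt is not read as splits. Instead, each MapTask fetches the file from HDFS and then reads its contents manually through an input stream.
As a rule, the large file is read as splits, and the small file is read manually before map() runs.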

order.txt ----> splits (orderId, pid, amount) ----> MapJoinMapper.map()
pd.txt ----> distributed cache (pid, pname) ----> MapJoinMapper.setup()
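The data flow above can be walked through in plain Java. This is a sketch under stated assumptions, not the article's mapper: the class `MapSideJoinSketch`, the method `joinLine`, and every sample record are hypothetical. It also drops order lines whose pid has no match (inner-join behaviour), which is a design choice of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class MapSideJoinSketch {

    // Joins one order line "orderId<TAB>pid<TAB>amount" against the in-memory product table.
    // Returns empty when the line is malformed or the pid has no match.
    static Optional<String> joinLine(String orderLine, Map<String, String> products) {
        String[] fields = orderLine.split("\t");
        if (fields.length < 3) {
            return Optional.empty();
        }
        String pname = products.get(fields[1]);
        if (pname == null) {
            return Optional.empty();
        }
        return Optional.of(fields[0] + "\t" + pname + "\t" + fields[2]);
    }

    public static void main(String[] args) {
        // Small table (pid -> pname), held fully in memory. Sample data only.
        Map<String, String> products = new HashMap<>();
        products.put("01", "Xiaomi");
        products.put("02", "Huawei");

        // Large table, streamed record by record. Sample data only.
        List<String> orders = new ArrayList<>();
        orders.add("1001\t01\t1");
        orders.add("1002\t02\t2");
        orders.add("1003\t09\t5"); // no matching product, so it is dropped

        for (String order : orders) {
            joinLine(order, products).ifPresent(System.out::println);
        }
    }
}
```

Each order record needs only one hash lookup, and no record has to be shuffled to a reducer to find its partner.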

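Requirements Analysis

A Map Join applies when one of the tables being joined is small. Here, each order record (orderId, pid, amount) is joined with the product table (pid, pname) on pid, and the output contains orderId, pname, and amount.

Code Implementation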

JoinBean.java

public class JoinBean {
	
	private String orderId;
	private String pid;
	private String pname;
	private String amount;
	
	@Override
	public String toString() {
		return  orderId + "\t" +  pname + "\t" + amount ;
	}

	public String getOrderId() {
		return orderId;
	}

	public void setOrderId(String orderId) {
		this.orderId = orderId;
	}

	public String getPid() {
		return pid;
	}

	public void setPid(String pid) {
		this.pid = pid;
	}

	public String getPname() {
		return pname;
	}

	public void setPname(String pname) {
		this.pname = pname;
	}

	public String getAmount() {
		return amount;
	}

	public void setAmount(String amount) {
		this.amount = amount;
	}


}

MapJoinMapper.java

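import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * Hadoop provides a distributed cache for MapReduce:
 * 			① It caches files that a Job needs while it runs (regular files, jars, archives such as har).
 * 			② In the Job's Configuration, each cached file is referred to by a URI.
 * 			③ The distributed cache assumes the file has already been uploaded to HDFS and that every machine in the cluster can reach the file behind the URI.
 * 			④ Before the tasks on a node start, the distributed cache ships the file to that node.
 * 			⑤ It is efficient because each Job copies a file only once, and archives are unpacked automatically on the worker nodes.
 */
public class MapJoinMapper extends Mapper<LongWritable, Text, JoinBean, NullWritable>{

	private JoinBean out_key=new JoinBean();
	// pid -> pname, filled from pd.txt before the first map() call
	private Map<String, String> pdDatas=new HashMap<String, String>();
	
	@Override
	protected void setup(Mapper<LongWritable, Text, JoinBean, NullWritable>.Context context)
			throws IOException, InterruptedException {
		
		// Read the data from the distributed cache
		URI[] files = context.getCacheFiles();
		
		for (URI uri : files) {
			
			// new File(uri) works for the file:// URI registered by the driver (local run)
			try (BufferedReader reader = new BufferedReader(new FileReader(new File(uri)))) {
				
				String line;
				
				// Read pd.txt line by line
				while ((line = reader.readLine()) != null) {
					
					String[] words = line.split("\t");
					
					// Skip blank or incomplete lines
					if (words.length < 2) {
						continue;
					}
					
					pdDatas.put(words[0], words[1]);
				}
			}
		}
	}
	
	// Join each order.txt record in the split and write the result
	@Override
	protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, JoinBean, NullWritable>.Context context)
			throws IOException, InterruptedException {
		
		// words: orderId, pid, amount
		String[] words = value.toString().split("\t");
		
		out_key.setOrderId(words[0]);
		// pname is null when the pid has no match in pd.txt
		out_key.setPname(pdDatas.get(words[1]));
		out_key.setAmount(words[2]);
		
		context.write(out_key, NullWritable.get());
			
	}
	
}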

MapJoinDriver.java

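import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapJoinDriver {
	
	public static void main(String[] args) throws Exception {
		
		Path inputPath=new Path("e:/mrinput/mapjoin");
		Path outputPath=new Path("e:/mroutput/mapjoin");
		
		// Configuration for the whole Job
		Configuration conf = new Configuration();
		// Make sure the output directory does not exist
		FileSystem fs=FileSystem.get(conf);
		
		if (fs.exists(outputPath)) {
			
			fs.delete(outputPath, true);
			
		}
		
		// ① Create the Job
		Job job = Job.getInstance(conf);
		
		job.setJarByClass(MapJoinDriver.class);
		
		// Give the Job a name
		job.setJobName("mapjoin");
		
		// ② Configure the Job
		// Set the Mapper class; the job is map-only, so no Reducer is configured
		job.setMapperClass(MapJoinMapper.class);
		
		// Set the input and output directories
		FileInputFormat.setInputPaths(job, inputPath);
		FileOutputFormat.setOutputPath(job, outputPath);
		
		// Register pd.txt with the distributed cache
		job.addCacheFile(new URI("file:///e:/pd.txt"));
		
		// Disable the reduce phase
		job.setNumReduceTasks(0);

		// ③ Run the Job and exit with a non-zero code if it fails
		System.exit(job.waitForCompletion(true) ? 0 : 1);
		
	}

}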
