An Analysis of Several MapReduce Join Implementations in Hadoop, with Examples

Source, with thanks: http://database.51cto.com/art/201410/454277.htm

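1. Joining on the reduce side.

Joining on the reduce side is the most common pattern for joining tables in the MapReduce framework. It works as follows:

Main work on the map side: tag each key/value pair with its source so that records from different tables (files) can be told apart. Then emit the record with the join field as the key and the remaining fields plus the new tag as the value.

Main work on the reduce side: by the time reduce runs, the records are already grouped by join field. Within each group we only need to separate the records by source file (using the tag added in the map phase) and then take the Cartesian product of the two sides. The principle is simple; here is an example:

(1) Define a custom value type: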

package com.mr.reduceSizeJoin;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
public class CombineValues implements WritableComparable<CombineValues> {
    private Text joinKey;    // join key
    private Text flag;       // tag identifying the source file
    private Text secondPart; // all fields other than the join key

    public void setJoinKey(Text joinKey) {
        this.joinKey = joinKey;
    }

    public void setFlag(Text flag) {
        this.flag = flag;
    }

    public void setSecondPart(Text secondPart) {
        this.secondPart = secondPart;
    }

    public Text getFlag() {
        return flag;
    }

    public Text getSecondPart() {
        return secondPart;
    }

    public Text getJoinKey() {
        return joinKey;
    }

    public CombineValues() {
        this.joinKey = new Text();
        this.flag = new Text();
        this.secondPart = new Text();
    }

    // Serialize the three fields in a fixed order
    @Override
    public void write(DataOutput out) throws IOException {
        this.joinKey.write(out);
        this.flag.write(out);
        this.secondPart.write(out);
    }

    // Deserialize in the same order as write()
    @Override
    public void readFields(DataInput in) throws IOException {
        this.joinKey.readFields(in);
        this.flag.readFields(in);
        this.secondPart.readFields(in);
    }

    // Records are ordered by join key only
    @Override
    public int compareTo(CombineValues o) {
        return this.joinKey.compareTo(o.getJoinKey());
    }

    @Override
    public String toString() {
        return "[flag=" + this.flag.toString() + ",joinKey=" + this.joinKey.toString()
                + ",secondPart=" + this.secondPart.toString() + "]";
    }
}
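The point of write and readFields is that the fields come back in the order they went out. The sketch below illustrates that contract with plain JDK streams instead of Hadoop; TaggedRecord and roundTrip are hypothetical names introduced only for this illustration, and String with writeUTF stands in for Hadoop's Text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for CombineValues that uses String instead of Hadoop's Text.
final class TaggedRecord {
    String joinKey = "";
    String flag = "";
    String secondPart = "";

    // Fields are written in a fixed order: joinKey, flag, secondPart.
    void write(DataOutputStream out) throws IOException {
        out.writeUTF(joinKey);
        out.writeUTF(flag);
        out.writeUTF(secondPart);
    }

    // Fields must be read back in exactly the order they were written.
    void readFields(DataInputStream in) throws IOException {
        joinKey = in.readUTF();
        flag = in.readUTF();
        secondPart = in.readUTF();
    }

    // Serializes a record to bytes and deserializes it into a fresh instance.
    static TaggedRecord roundTrip(TaggedRecord src) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                src.write(out);
            }
            TaggedRecord copy = new TaggedRecord();
            try (DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TaggedRecord r = new TaggedRecord();
        r.joinKey = "1";
        r.flag = "0";
        r.secondPart = "長春\t1\t901\t1";
        TaggedRecord copy = roundTrip(r);
        System.out.println(copy.joinKey + " / " + copy.flag + " / " + copy.secondPart);
    }
}
```

If readFields read the fields in a different order than write emitted them, the flag and the join key would silently swap, which is why the two methods in CombineValues mirror each other.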

(2) The map and reduce code

package com.mr.reduceSizeJoin;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
 * @author zengzhaozheng
 * Purpose:
 * A reduce-side join, described as a left outer join. Two files represent
 * two tables; the join fields are table1.id and table2.cityID.
 * Note that the reducer emits only matched pairs, so cities without users
 * (ids 0 and 5 to 9 in this data) do not appear in the result.
 * table1 (left table): tb_dim_city(id int, name string, orderid int, city_code, is_show)
 * Contents of tb_dim_city.dat, fields separated by "|":
 * id name orderid city_code is_show
 * 0 其他 9999 9999 0
 * 1 長春 1 901 1
 * 2 吉林 2 902 1
 * 3 四平 3 903 1
 * 4 松原 4 904 1
 * 5 通化 5 905 1
 * 6 遼源 6 906 1
 * 7 白城 7 907 1
 * 8 白山 8 908 1
 * 9 延吉 9 909 1
 * ------------------------------------------------------------
 * table2 (right table): tb_user_profiles(userID int, network string, flow double, cityID int)
 * Contents of tb_user_profiles.dat, fields separated by "|":
 * userID network flow cityID
 * 1 2G 123 1
 * 2 3G 333 2
 * 3 3G 555 1
 * 4 2G 777 3
 * 5 3G 666 4
 * ------------------------------------------------------------
 * Result:
 * 1 長春 1 901 1 1 2G 123
 * 1 長春 1 901 1 3 3G 555
 * 2 吉林 2 902 1 2 3G 333
 * 3 四平 3 903 1 4 2G 777
 * 4 松原 4 904 1 5 3G 666
 */
public class ReduceSideJoin_LeftOuterJoin extends Configured implements Tool {
    private static final Logger logger = LoggerFactory.getLogger(ReduceSideJoin_LeftOuterJoin.class);

    public static class LeftOutJoinMapper extends Mapper<Object, Text, Text, CombineValues> {
        private CombineValues combineValues = new CombineValues();
        private Text flag = new Text();
        private Text joinKey = new Text();
        private Text secondPart = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Path of the file this input split belongs to
            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
            // Records from tb_dim_city.dat are tagged "0"
            if (pathName.endsWith("tb_dim_city.dat")) {
                String[] valueItems = value.toString().split("\\|");
                // Skip malformed records
                if (valueItems.length != 5) {
                    return;
                }
                flag.set("0");
                joinKey.set(valueItems[0]);
                secondPart.set(valueItems[1] + "\t" + valueItems[2] + "\t" + valueItems[3] + "\t" + valueItems[4]);
                combineValues.setFlag(flag);
                combineValues.setJoinKey(joinKey);
                combineValues.setSecondPart(secondPart);
                context.write(combineValues.getJoinKey(), combineValues);
            }
            // Records from tb_user_profiles.dat are tagged "1"
            else if (pathName.endsWith("tb_user_profiles.dat")) {
                String[] valueItems = value.toString().split("\\|");
                // Skip malformed records
                if (valueItems.length != 4) {
                    return;
                }
                flag.set("1");
                joinKey.set(valueItems[3]);
                secondPart.set(valueItems[0] + "\t" + valueItems[1] + "\t" + valueItems[2]);
                combineValues.setFlag(flag);
                combineValues.setJoinKey(joinKey);
                combineValues.setSecondPart(secondPart);
                context.write(combineValues.getJoinKey(), combineValues);
            }
        }
    }
    public static class LeftOutJoinReducer extends Reducer<Text, CombineValues, Text, Text> {
        // Left-table records of the current group
        private ArrayList<Text> leftTable = new ArrayList<Text>();
        // Right-table records of the current group
        private ArrayList<Text> rightTable = new ArrayList<Text>();
        private Text secondPar = null;
        private Text output = new Text();

        /**
         * reduce() is called once per group.
         */
        @Override
        protected void reduce(Text key, Iterable<CombineValues> value, Context context)
                throws IOException, InterruptedException {
            leftTable.clear();
            rightTable.clear();
            /**
             * Split the records of the group by source file.
             * Caveat: if a single group holds too many records, the reduce
             * phase may run out of memory (OOM). Before tackling a distributed
             * problem it is best to understand how the data is distributed and
             * pick the approach that suits that distribution; this helps avoid
             * both OOM and severe data skew.
             */
            for (CombineValues cv : value) {
                // Copy the value: Hadoop reuses the CombineValues instance across iterations
                secondPar = new Text(cv.getSecondPart().toString());
                // Left table: tb_dim_city
                if ("0".equals(cv.getFlag().toString().trim())) {
                    leftTable.add(secondPar);
                }
                // Right table: tb_user_profiles
                else if ("1".equals(cv.getFlag().toString().trim())) {
                    rightTable.add(secondPar);
                }
            }
            logger.info("tb_dim_city:" + leftTable.toString());
            logger.info("tb_user_profiles:" + rightTable.toString());
            // Cartesian product of the two sides; nothing is emitted when either side is empty
            for (Text leftPart : leftTable) {
                for (Text rightPart : rightTable) {
                    output.set(leftPart + "\t" + rightPart);
                    context.write(key, output);
                }
            }
        }
    }
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // job configuration
        Job job = new Job(conf, "LeftOutJoinMR");
        job.setJarByClass(ReduceSideJoin_LeftOuterJoin.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // map input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // reduce output path
        job.setMapperClass(LeftOutJoinMapper.class);
        job.setReducerClass(LeftOutJoinReducer.class);
        job.setInputFormatClass(TextInputFormat.class);   // input format
        job.setOutputFormatClass(TextOutputFormat.class); // the default output format
        // Map output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(CombineValues.class);
        // Reduce output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        try {
            int returnCode = ToolRunner.run(new ReduceSideJoin_LeftOuterJoin(), args);
            System.exit(returnCode);
        } catch (Exception e) {
            // Log the full stack trace and signal failure to the caller
            logger.error(e.getMessage(), e);
            System.exit(1);
        }
    }
}

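The comments in the code already explain the details and the input and output data, so here we look mainly at the shortcomings of the reduce join. The reason the reduce join exists is clear: the data set is split, and each map task processes only part of it and cannot see every record that carries a given join key. The join key therefore has to serve as the grouping key on the reduce side, which brings all records with the same join key together for processing. The drawback is just as clear: a large amount of data moves between map and reduce, that is, in the shuffle phase, which is inefficient.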
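The reduce-side flow described in this section (tag, group by join key, cross the two sides within each group) can be sketched in plain Java without Hadoop. This is only an illustration under stated assumptions: ReduceSideJoinSketch, emit and join are hypothetical names, and a TreeMap stands in for the shuffle's grouping by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of a reduce-side join; no Hadoop classes involved.
final class ReduceSideJoinSketch {
    // "Map" step: tag one record with its source flag and file it under its join key.
    static void emit(Map<String, List<String[]>> groups, String flag, String line, int keyIndex) {
        String[] fields = line.split("\\|");
        StringBuilder rest = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i == keyIndex) {
                continue; // the join key becomes the map key, not part of the value
            }
            if (rest.length() > 0) {
                rest.append('\t');
            }
            rest.append(fields[i]);
        }
        groups.computeIfAbsent(fields[keyIndex], k -> new ArrayList<>())
              .add(new String[] {flag, rest.toString()});
    }

    static List<String> join(List<String> left, int leftKey, List<String> right, int rightKey) {
        // The TreeMap plays the role of the shuffle: records are grouped by join key.
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String l : left) {
            emit(groups, "0", l, leftKey);
        }
        for (String r : right) {
            emit(groups, "1", r, rightKey);
        }
        // "Reduce" step: split each group by tag, then take the Cartesian product.
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> group : groups.entrySet()) {
            List<String> leftParts = new ArrayList<>();
            List<String> rightParts = new ArrayList<>();
            for (String[] tagged : group.getValue()) {
                if ("0".equals(tagged[0])) {
                    leftParts.add(tagged[1]);
                } else {
                    rightParts.add(tagged[1]);
                }
            }
            for (String lp : leftParts) {
                for (String rp : rightParts) {
                    out.add(group.getKey() + "\t" + lp + "\t" + rp);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> cities = Arrays.asList("1|長春|1|901|1", "2|吉林|2|902|1", "5|通化|5|905|1");
        List<String> users = Arrays.asList("1|2G|123|1", "2|3G|333|2", "3|3G|555|1");
        for (String row : join(cities, 0, users, 3)) {
            System.out.println(row);
        }
    }
}
```

Every input record passes through the grouping map here, matched or not, which is the in-memory analogue of the shuffle cost discussed above.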

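2. Joining on the map side.

Use case: one table is very small and the other is very large.

Usage: when submitting the job, first put the small table's file into the job's DistributedCache. Each task then reads the small table from the DistributedCache, parses each line into join key and value, and holds the result in memory (in a HashMap or a similar container). It then scans the large table and checks whether each record's join key has a matching record in memory; if so, it outputs the joined result directly.

Straight to the code, which is fairly simple: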
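The lookup at the heart of this approach can be sketched in plain Java before bringing in Hadoop. The names MapSideJoinSketch, loadSmallTable and join are hypothetical, and the sketch assumes the small table's join key is its first field and the large table's join key is its last field, as in the sample data used throughout this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of a map-side (hash) join; no Hadoop classes involved.
final class MapSideJoinSketch {
    // Loads the small table into memory, keyed by its first field (the join key).
    static Map<String, String> loadSmallTable(List<String> lines) {
        Map<String, String> byKey = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\\|", 2);
            if (parts.length == 2) {
                byKey.put(parts[0], parts[1].replace('|', '\t'));
            }
        }
        return byKey;
    }

    // Scans the large table once; the join key is assumed to be the last field.
    static List<String> join(Map<String, String> small, List<String> bigLines) {
        List<String> out = new ArrayList<>();
        for (String line : bigLines) {
            int cut = line.lastIndexOf('|');
            if (cut < 0) {
                continue; // malformed record
            }
            String key = line.substring(cut + 1);
            String match = small.get(key);
            if (match != null) {
                out.add(key + "\t" + match + "\t" + line.substring(0, cut).replace('|', '\t'));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> cities =
                loadSmallTable(Arrays.asList("1|長春|1|901|1", "2|吉林|2|902|1"));
        List<String> users = Arrays.asList("1|2G|123|1", "2|3G|333|2", "3|3G|555|1", "6|3G|1|7");
        for (String row : join(cities, users)) {
            System.out.println(row);
        }
    }
}
```

Each large-table record is resolved with one hash lookup and nothing is grouped or sorted, which is why this pattern needs no reduce phase.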

package com.mr.mapSideJoin;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
 * @author zengzhaozheng
 *
 * Purpose:
 * A map-side join, described as a left outer join. Two files represent
 * two tables; the join fields are table1.id and table2.cityID.
 * Note that the mapper emits a row only when a match is found, so cities
 * without users do not appear in the result.
 * table1 (left table): tb_dim_city(id int, name string, orderid int, city_code, is_show).
 * tb_dim_city is assumed to hold very few records.
 * Contents of tb_dim_city.dat, fields separated by "|":
 * id name orderid city_code is_show
 * 0 其他 9999 9999 0
 * 1 長春 1 901 1
 * 2 吉林 2 902 1
 * 3 四平 3 903 1
 * 4 松原 4 904 1
 * 5 通化 5 905 1
 * 6 遼源 6 906 1
 * 7 白城 7 907 1
 * 8 白山 8 908 1
 * 9 延吉 9 909 1
 * ------------------------------------------------------------
 * table2 (right table): tb_user_profiles(userID int, network string, flow double, cityID int)
 * Contents of tb_user_profiles.dat, fields separated by "|":
 * userID network flow cityID
 * 1 2G 123 1
 * 2 3G 333 2
 * 3 3G 555 1
 * 4 2G 777 3
 * 5 3G 666 4
 * ------------------------------------------------------------
 * Result:
 * 1 長春 1 901 1 1 2G 123
 * 1 長春 1 901 1 3 3G 555
 * 2 吉林 2 902 1 2 3G 333
 * 3 四平 3 903 1 4 2G 777
 * 4 松原 4 904 1 5 3G 666
 */
public class MapSideJoinMain extends Configured implements Tool {
    private static final Logger logger = LoggerFactory.getLogger(MapSideJoinMain.class);

    public static class LeftOutJoinMapper extends Mapper<Object, Text, Text, Text> {
        private HashMap<String, String> city_info = new HashMap<String, String>();
        private Text outPutKey = new Text();
        private Text outPutValue = new Text();
        private String mapInputStr = null;
        private String[] mapInputSpit = null;
        private String city_secondPart = null;

        /**
         * Runs once before each task starts. Here it fetches the tb_dim_city
         * file from the DistributedCache and loads its records into memory.
         */
        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Local copies of the files registered in this job's DistributedCache
            Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (distributePaths == null) {
                return; // no cache files were registered
            }
            for (Path p : distributePaths) {
                if (p.toString().endsWith("tb_dim_city.dat")) {
                    // Read the cached file into memory; try-with-resources closes the reader
                    try (BufferedReader br = new BufferedReader(new FileReader(p.toString()))) {
                        String cityInfo;
                        while ((cityInfo = br.readLine()) != null) {
                            String[] cityPart = cityInfo.split("\\|", 5);
                            if (cityPart.length == 5) {
                                city_info.put(cityPart[0], cityPart[1] + "\t" + cityPart[2] + "\t" + cityPart[3] + "\t" + cityPart[4]);
                            }
                        }
                    }
                }
            }
        }
        /**
         * The map side is straightforward: check whether the cityID of each
         * tb_user_profiles.dat record exists in the in-memory map. That is
         * all a map join needs.
         */
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Skip empty lines
            if (value == null || value.toString().equals("")) {
                return;
            }
            mapInputStr = value.toString();
            mapInputSpit = mapInputStr.split("\\|", 4);
            // Skip malformed records
            if (mapInputSpit.length != 4) {
                return;
            }
            // Look up the join field in the in-memory map
            city_secondPart = city_info.get(mapInputSpit[3]);
            if (city_secondPart != null) {
                this.outPutKey.set(mapInputSpit[3]);
                this.outPutValue.set(city_secondPart + "\t" + mapInputSpit[0] + "\t" + mapInputSpit[1] + "\t" + mapInputSpit[2]);
                context.write(outPutKey, outPutValue);
            }
        }
    }
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // job configuration
        // Register the small table (args[1]) as a cache file for this job
        DistributedCache.addCacheFile(new Path(args[1]).toUri(), conf);
        Job job = new Job(conf, "MapJoinMR");
        job.setNumReduceTasks(0); // map-only job
        FileInputFormat.addInputPath(job, new Path(args[0]));   // map input path (the large table)
        FileOutputFormat.setOutputPath(job, new Path(args[2])); // job output path
        job.setJarByClass(MapSideJoinMain.class);
        job.setMapperClass(LeftOutJoinMapper.class);
        job.setInputFormatClass(TextInputFormat.class);   // input format
        job.setOutputFormatClass(TextOutputFormat.class); // the default output format
        // Map output key type
        job.setMapOutputKeyClass(Text.class);
        // Job output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.waitForCompletion(true);
        return job.isSuccessful() ? 0 : 1;
    }

    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        try {
            int returnCode = ToolRunner.run(new MapSideJoinMain(), args);
            System.exit(returnCode);
        } catch (Exception e) {
            // Log the full stack trace and signal failure to the caller
            logger.error(e.getMessage(), e);
            System.exit(1);
        }
    }
}

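A few words on DistributedCache. DistributedCache is an implementation of a distributed cache and plays an important role in the MapReduce framework; it supports writing fairly complex yet efficient distributed programs. Back to the topic: before the job starts, the JobTracker obtains the list of resource URIs registered in the DistributedCache, and the corresponding files are distributed to every TaskTracker that runs tasks of the job. The relationship between DistributedCache and a job (permissions, how storage paths are separated, public and private attributes, and so on) deserves a blog post of its own, so it is not covered in detail here.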

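There is also a rather extreme form of map join: combining it with HBase. This approach removes the memory limit entirely, so the map join can be used without restraint, and its efficiency is quite good as well.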

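3. SemiJoin.

SemiJoin, the so-called semi join, is on closer inspection a variant of the reduce join: some data is filtered out on the map side so that only records taking part in the join travel over the network, while records that do not take part are never transferred. This reduces the shuffle's network traffic and improves overall efficiency; everything else works exactly as in the reduce join. More concretely: extract the join keys of the small table into a separate file and distribute it to the relevant nodes through the DistributedCache, load it into memory (a HashSet works), and in the map phase scan the tables being joined and drop every record whose join key is not in the in-memory HashSet. The remaining records go through the shuffle to the reduce side for the join, the same as in the reduce join. The code: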

package com.mr.SemiJoin;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
 * @author zengzhaozheng
 *
 * Purpose:
 * A semi join: a reduce-side join in which the mapper first drops every
 * record whose join key is not listed in joinKey.dat.
 * Two files represent two tables; the join fields are table1.id and table2.cityID.
 * table1 (left table): tb_dim_city(id int, name string, orderid int, city_code, is_show)
 * Contents of tb_dim_city.dat, fields separated by "|":
 * id name orderid city_code is_show
 * 0 其他 9999 9999 0
 * 1 長春 1 901 1
 * 2 吉林 2 902 1
 * 3 四平 3 903 1
 * 4 松原 4 904 1
 * 5 通化 5 905 1
 * 6 遼源 6 906 1
 * 7 白城 7 907 1
 * 8 白山 8 908 1
 * 9 延吉 9 909 1
 * ------------------------------------------------------------
 * table2 (right table): tb_user_profiles(userID int, network string, flow double, cityID int)
 * Contents of tb_user_profiles.dat, fields separated by "|":
 * userID network flow cityID
 * 1 2G 123 1
 * 2 3G 333 2
 * 3 3G 555 1
 * 4 2G 777 3
 * 5 3G 666 4
 * ------------------------------------------------------------
 * Contents of joinKey.dat:
 * city_code
 * 1
 * 2
 * 3
 * 4
 * ------------------------------------------------------------
 * Result:
 * 1 長春 1 901 1 1 2G 123
 * 1 長春 1 901 1 3 3G 555
 * 2 吉林 2 902 1 2 3G 333
 * 3 四平 3 903 1 4 2G 777
 * 4 松原 4 904 1 5 3G 666
 */
public class SemiJoin extends Configured implements Tool {
    private static final Logger logger = LoggerFactory.getLogger(SemiJoin.class);

    public static class SemiJoinMapper extends Mapper<Object, Text, Text, CombineValues> {
        private CombineValues combineValues = new CombineValues();
        private HashSet<String> joinKeySet = new HashSet<String>();
        private Text flag = new Text();
        private Text joinKey = new Text();
        private Text secondPart = new Text();

        /**
         * Loads the keys that take part in the join from the DistributedCache
         * into memory, so that the map side can filter records by join key.
         */
        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Local copies of the files registered in this job's DistributedCache
            Path[] distributePaths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            if (distributePaths == null) {
                return; // no cache files were registered
            }
            for (Path p : distributePaths) {
                if (p.toString().endsWith("joinKey.dat")) {
                    // Read the cached file into memory; try-with-resources closes the reader
                    try (BufferedReader br = new BufferedReader(new FileReader(p.toString()))) {
                        String joinKeyStr;
                        while ((joinKeyStr = br.readLine()) != null) {
                            joinKeySet.add(joinKeyStr);
                        }
                    }
                }
            }
        }
        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Path of the file this input split belongs to
            String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();
            // Records from tb_dim_city.dat are tagged "0"
            if (pathName.endsWith("tb_dim_city.dat")) {
                String[] valueItems = value.toString().split("\\|");
                // Skip malformed records
                if (valueItems.length != 5) {
                    return;
                }
                // Drop records that do not take part in the join
                if (joinKeySet.contains(valueItems[0])) {
                    flag.set("0");
                    joinKey.set(valueItems[0]);
                    secondPart.set(valueItems[1] + "\t" + valueItems[2] + "\t" + valueItems[3] + "\t" + valueItems[4]);
                    combineValues.setFlag(flag);
                    combineValues.setJoinKey(joinKey);
                    combineValues.setSecondPart(secondPart);
                    context.write(combineValues.getJoinKey(), combineValues);
                } else {
                    return;
                }
            }
            // Records from tb_user_profiles.dat are tagged "1"
            else if (pathName.endsWith("tb_user_profiles.dat")) {
                String[] valueItems = value.toString().split("\\|");
                // Skip malformed records
                if (valueItems.length != 4) {
                    return;
                }
                // Drop records that do not take part in the join
                if (joinKeySet.contains(valueItems[3])) {
                    flag.set("1");
                    joinKey.set(valueItems[3]);
                    secondPart.set(valueItems[0] + "\t" + valueItems[1] + "\t" + valueItems[2]);
                    combineValues.setFlag(flag);
                    combineValues.setJoinKey(joinKey);
                    combineValues.setSecondPart(secondPart);
                    context.write(combineValues.getJoinKey(), combineValues);
                } else {
                    return;
                }
            }
        }
    }

    public static class SemiJoinReducer extends Reducer<Text, CombineValues, Text, Text> {
        // Left-table records of the current group
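
The filtering step that distinguishes the semi join from the plain reduce join can be sketched in plain Java without Hadoop. SemiJoinFilterSketch and filter are hypothetical names used only for this illustration; the key set plays the role of joinKey.dat loaded into a HashSet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model of the semi join's map-side filter.
final class SemiJoinFilterSketch {
    // Keeps only the records whose join key (the field at keyIndex) is in the key set.
    // Everything dropped here would never reach the shuffle.
    static List<String> filter(Set<String> joinKeys, List<String> lines, int keyIndex) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\\|");
            if (keyIndex < fields.length && joinKeys.contains(fields[keyIndex])) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> joinKeys = new HashSet<>(Arrays.asList("1", "2", "3", "4"));
        List<String> cities = Arrays.asList(
                "0|其他|9999|9999|0", "1|長春|1|901|1", "2|吉林|2|902|1", "5|通化|5|905|1");
        List<String> kept = filter(joinKeys, cities, 0);
        System.out.println(kept.size() + " of " + cities.size() + " records go on to the shuffle");
    }
}
```

The records that survive the filter are then tagged and joined on the reduce side as in section 1; the saving comes purely from what is never sent.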