MapReduce Source Code Analysis: OutputFormat

阿新 • Published: 2019-01-11

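The OutputFormat stage defines how key-value data is written out: once the data has been processed, in what form should it be output so that whoever picks up the file next can extract its contents accurately? Setting MapReduce aside for a moment, here are a few ways of defining a data format that I know of. Redis, for example, has a design like this: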

[key-length][key][value-length][value][key-length][key][value-length][value]...

Or, if saving space is not a must, simply use delimiters:

[key] [value]\n

[key] [value]\n

[key] [value]\n

.....

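A reader then goes line by line and splits each line on the space to recover the key-value pair. This is as simple as it gets, but it is not compact and wastes a fair amount of space. One of the formats among MapReduce's OutputFormat implementations works this way.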
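The two layouts above can be compared side by side in a small sketch. This is an illustration only, not code from Hadoop or Redis: the class name `KeyValueFraming`, the 4-byte big-endian length field, and the UTF-8 encoding are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: two ways to frame key-value records (hypothetical helper class). */
class KeyValueFraming {

  /** Length-prefixed layout: [key-length][key][value-length][value]... */
  static byte[] encodeLengthPrefixed(List<String[]> records) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (String[] kv : records) {
        for (String field : kv) {
          byte[] raw = field.getBytes(StandardCharsets.UTF_8);
          out.writeInt(raw.length); // assumed: 4-byte big-endian length
          out.write(raw);
        }
      }
    }
    return bytes.toByteArray();
  }

  static List<String[]> decodeLengthPrefixed(byte[] data) throws IOException {
    List<String[]> records = new ArrayList<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      while (in.available() > 0) {
        String key = readField(in);
        String value = readField(in);
        records.add(new String[] {key, value});
      }
    }
    return records;
  }

  private static String readField(DataInputStream in) throws IOException {
    byte[] raw = new byte[in.readInt()];
    in.readFully(raw);
    return new String(raw, StandardCharsets.UTF_8);
  }

  /** Delimited layout: "[key] [value]\n". Keys must not contain a space or newline. */
  static String encodeDelimited(List<String[]> records) {
    StringBuilder sb = new StringBuilder();
    for (String[] kv : records) {
      sb.append(kv[0]).append(' ').append(kv[1]).append('\n');
    }
    return sb.toString();
  }

  static List<String[]> decodeDelimited(String text) {
    List<String[]> records = new ArrayList<>();
    for (String line : text.split("\n")) {
      if (line.isEmpty()) {
        continue;
      }
      int space = line.indexOf(' '); // split on the first space only
      records.add(new String[] {line.substring(0, space), line.substring(space + 1)});
    }
    return records;
  }
}
```

The length-prefixed form can carry any bytes in keys and values, while the delimited form stays readable in a text editor but needs the delimiter to be absent from the key.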

First, we need to see what OutputFormat actually contains:


public interface OutputFormat<K, V> {

  /**
   * Get the {@link RecordWriter} for the given job.
   * (Returns the writer used to output key-value records.)
   *
   * @param ignored
   * @param job configuration for the job whose output is being written.
   * @param name the unique name for this part of the output.
   * @param progress mechanism for reporting progress while writing to file.
   * @return a {@link RecordWriter} to write the output for the job.
   * @throws IOException
   */
  RecordWriter<K, V> getRecordWriter(FileSystem ignored, JobConf job,
                                     String name, Progressable progress)
  throws IOException;

  /**
   * Check for validity of the output-specification for the job.
   *
   * <p>This is to validate the output specification for the job when it is
   * a job is submitted. Typically checks that it does not already exist,
   * throwing an exception when it already exists, so that output is not
   * overwritten.</p>
   * (Checks performed before the job runs, for example whether the
   * configured output directory already exists.)
   *
   * @param ignored
   * @param job job configuration.
   * @throws IOException when output should not be attempted
   */
  void checkOutputSpecs(FileSystem ignored, JobConf job) throws IOException;
}
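To make the role of the writer returned by getRecordWriter() more concrete, here is a minimal sketch of a line-oriented record writer. It deliberately avoids Hadoop types so that it runs on its own: `SimpleRecordWriter`, `SimpleLineRecordWriter`, and `SimpleRecordWriterDemo` are hypothetical names, and the interface is a simplified stand-in for Hadoop's RecordWriter (whose close method in the old API also takes a Reporter). The tab separator is an assumption chosen for the example.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical, simplified stand-in for Hadoop's RecordWriter (no Hadoop types). */
interface SimpleRecordWriter<K, V> {
  void write(K key, V value) throws IOException;

  void close() throws IOException;
}

/** Writes one "key<separator>value" line per record to the given Writer. */
class SimpleLineRecordWriter<K, V> implements SimpleRecordWriter<K, V> {
  private final Writer out;
  private final String separator;

  SimpleLineRecordWriter(Writer out, String separator) {
    this.out = out;
    this.separator = separator;
  }

  @Override
  public void write(K key, V value) throws IOException {
    out.write(String.valueOf(key));
    out.write(separator);
    out.write(String.valueOf(value));
    out.write('\n');
  }

  @Override
  public void close() throws IOException {
    out.close();
  }
}

class SimpleRecordWriterDemo {
  /** Writes two records into memory and returns the resulting text. */
  static String render() throws IOException {
    StringWriter sink = new StringWriter();
    SimpleRecordWriter<String, Integer> writer = new SimpleLineRecordWriter<>(sink, "\t");
    writer.write("hadoop", 3);
    writer.write("mapreduce", 5);
    writer.close();
    return sink.toString();
  }
}
```

In a real job the output format would hand back a writer bound to a file on the job's file system rather than to an in-memory StringWriter.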

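The interface has just two simple methods. RecordWriter is the important part, because all later key-value writes go through it. OutputFormat itself is only an interface, though; in MapReduce the implementation we use most often is FileOutputFormat: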

/** A base class for {@link OutputFormat}. */
public abstract class FileOutputFormat<K, V> implements OutputFormat<K, V> {

FileOutputFormat is an abstract class, but it does implement the interface's second method, checkOutputSpecs():

public void checkOutputSpecs(FileSystem ignored, JobConf job)
    throws FileAlreadyExistsException,
           InvalidJobConfException, IOException {
  // Ensure that the output directory is set and not already there
  Path outDir = getOutputPath(job);
  if (outDir == null && job.getNumReduceTasks() != 0) {
    throw new InvalidJobConfException("Output directory not set in JobConf.");
  }
  if (outDir != null) {
    FileSystem fs = outDir.getFileSystem(job);
    // normalize the output directory
    outDir = fs.makeQualified(outDir);
    setOutputPath(job, outDir);
    // get delegation token for the outDir's file system
    TokenCache.obtainTokensForNamenodes(job.getCredentials(),
                                        new Path[] {outDir}, job);
    // check its existence
    if (fs.exists(outDir)) {
      // If the output directory already exists, throw an exception
      throw new FileAlreadyExistsException("Output directory " + outDir +
                                           " already exists");
    }
  }
}
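The same fail-fast check can be tried out on a local file system. The sketch below is an analogue, not Hadoop code: `OutputSpecCheck` is a hypothetical helper, `java.nio.file` stands in for Hadoop's FileSystem, and IllegalStateException stands in for InvalidJobConfException. Token handling is left out entirely.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Local-filesystem analogue of checkOutputSpecs(); hypothetical helper, not Hadoop code. */
class OutputSpecCheck {

  /**
   * Returns the normalized output directory, or null when no directory is
   * configured and the job has no reduce tasks.
   */
  static Path check(Path outDir, int numReduceTasks) throws IOException {
    if (outDir == null) {
      if (numReduceTasks != 0) {
        throw new IllegalStateException("Output directory not set.");
      }
      return null; // no reducers and no output directory: nothing to check
    }
    // Rough counterpart of makeQualified(): make the path absolute and normalized.
    Path qualified = outDir.toAbsolutePath().normalize();
    if (Files.exists(qualified)) {
      // Refuse to run rather than overwrite results from an earlier job.
      throw new FileAlreadyExistsException(
          "Output directory " + qualified + " already exists");
    }
    return qualified;
  }
}
```

Failing before any task starts is the point of the design: a job that would overwrite earlier output is rejected at submission time rather than after hours of computation.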

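checkOutputSpecs() simply verifies that the output directory is set and does not already exist. FileOutputFormat also relies on a helper class, which shows up in getTaskOutputPath():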

public static Path getTaskOutputPath(JobConf conf, String name)
    throws IOException {
  // ${mapred.out.dir}
  Path outputPath = getOutputPath(conf);
  if (outputPath == null) {
    throw new IOException("Undefined job output-path");
  }
  // Resolve the working path through the OutputCommitter
  OutputCommitter committer = conf.getOutputCommitter();
  Path workPath = outputPath;
  TaskAttemptContext context = new TaskAttemptContext(conf,
      TaskAttemptID.forName(conf.get("mapred.task.id")));
  if (committer instanceof FileOutputCommitter) {
    workPath = ((FileOutputCommitter)committer).getWorkPath(context,
                                                            outputPath);
  }
  // ${mapred.out.dir}/_temporary/_${taskid}/${name}
  return new Path(workPath, name);
}
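The path layout named in the last comment of getTaskOutputPath(), `${mapred.out.dir}/_temporary/_${taskid}/${name}`, can be spelled out with plain `java.nio.file` paths. This is a sketch of the layout only: `TaskPaths` and its method names are hypothetical, and the task attempt id is passed in directly instead of being read from the job configuration.

```java
import java.nio.file.Path;

/** Hypothetical helper that mirrors the layout in the comment above. */
class TaskPaths {
  static final String TEMP_DIR_NAME = "_temporary";

  /** ${outputDir}/_temporary/_${taskAttemptId} */
  static Path workPath(Path outputDir, String taskAttemptId) {
    return outputDir.resolve(TEMP_DIR_NAME).resolve("_" + taskAttemptId);
  }

  /** ${outputDir}/_temporary/_${taskAttemptId}/${name} */
  static Path taskOutputPath(Path outputDir, String taskAttemptId, String name) {
    if (outputDir == null) {
      throw new IllegalArgumentException("Undefined job output-path");
    }
    return workPath(outputDir, taskAttemptId).resolve(name);
  }
}
```

Because every task attempt writes under its own directory, two attempts of the same task never write to the same file.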

The helper class is the OutputCommitter used above, which defines many methods related to tasks and jobs. It frequently works hand in hand with OutputFormat. It also has a subclass of its own, FileOutputCommitter:

public class FileOutputCommitter extends OutputCommitter {

  public static final Log LOG = LogFactory.getLog(
      "org.apache.hadoop.mapred.FileOutputCommitter");

  /**
   * Temporary directory name
   */
  public static final String TEMP_DIR_NAME = "_temporary";
  public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";
  static final String SUCCESSFUL_JOB_OUTPUT_DIR_MARKER =
      "mapreduce.fileoutputcommitter.marksuccessfuljobs";

  public void setupJob(JobContext context) throws IOException {
    // ... (the excerpt breaks off here)
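The excerpt stops before the method bodies, so the following is not a reconstruction of FileOutputCommitter. It is a minimal sketch of the general pattern that the constants `_temporary` and `_SUCCESS` point to: tasks write into a temporary directory, finished task output is moved into the final directory, and the job is marked successful at the end. `LocalCommitSketch` and all of its methods are hypothetical and use the local file system via `java.nio.file`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical local-filesystem sketch of a write-to-temp-then-promote commit. */
class LocalCommitSketch {
  static final String TEMP_DIR_NAME = "_temporary";
  static final String SUCCEEDED_FILE_NAME = "_SUCCESS";

  private final Path outputDir;

  LocalCommitSketch(Path outputDir) {
    this.outputDir = outputDir;
  }

  /** Job setup: create ${outputDir}/_temporary. */
  void setupJob() throws IOException {
    Files.createDirectories(outputDir.resolve(TEMP_DIR_NAME));
  }

  /** Directory a task attempt writes into while it is running. */
  Path taskWorkDir(String taskAttemptId) throws IOException {
    return Files.createDirectories(
        outputDir.resolve(TEMP_DIR_NAME).resolve("_" + taskAttemptId));
  }

  /** Task commit: move the attempt's files into the final output directory. */
  void commitTask(String taskAttemptId) throws IOException {
    Path workDir = outputDir.resolve(TEMP_DIR_NAME).resolve("_" + taskAttemptId);
    List<Path> files;
    try (Stream<Path> listing = Files.list(workDir)) {
      files = listing.collect(Collectors.toList());
    }
    for (Path file : files) {
      Files.move(file, outputDir.resolve(file.getFileName()),
          StandardCopyOption.REPLACE_EXISTING);
    }
  }

  /** Job commit: delete _temporary, then drop the _SUCCESS marker file. */
  void commitJob() throws IOException {
    Path temp = outputDir.resolve(TEMP_DIR_NAME);
    if (Files.exists(temp)) {
      List<Path> tree;
      try (Stream<Path> walk = Files.walk(temp)) {
        // Reverse order puts children before their parent directories.
        tree = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
      }
      for (Path p : tree) {
        Files.delete(p);
      }
    }
    Files.createFile(outputDir.resolve(SUCCEEDED_FILE_NAME));
  }
}
```

With this arrangement a reader of the output directory sees either no marker file or a complete result, and output from failed task attempts is discarded together with the temporary directory.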
