
Google MapReduce Paper



  • 1. MapReduce: Simplified Data Processing on Large Clusters
    • 1.1. Abstract
    • 1.2. 1 Introduction
    • 1.3. Programming Model
      • 1.3.1. 2.3 More Examples
    • 1.4. 3 Implementation
      • 1.4.1. 3.1 Execution Overview
      • 1.4.2. 3.2 Master Data Structures
      • 1.4.3. 3.3 Fault Tolerance
  • 1.5. 4 Refinements
    • 1.5.1. 4.3 Combiner Function
    • 1.6. 5 Performance
      • 1.6.1. 5.2 Grep
    • 1.7. 6 Experience
    • 1.7.1. 6.1 Large-Scale Indexing

MapReduce: Simplified Data Processing on Large Clusters

Original article: http://3gods.com/2017/07/31/Google-Paper-MapReduce.html

Abstract

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key
Users specify a Map function that processes a key/value pair to produce a set of intermediate key/value pairs, and a Reduce function that merges the values associated with the same intermediate key.

The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication
The runtime system takes care of the details: partitioning the input data, scheduling task execution across the machines, handling machine failures in the cluster, and managing inter-machine communication.

1 Introduction

The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.
Parallelizing the computation, distributing the data, and handling failures would otherwise bury the original simple computation under large amounts of complex code.

hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library
The MapReduce library hides these messy details: parallelization, fault tolerance, data distribution, and load balancing.

Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages
This abstraction is inspired by the map and reduce primitives found in Lisp and many other functional languages.

We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately
We realized that most of our computations could be expressed as a Map operation applied to each logical record of the input, producing a set of intermediate key/value pairs, followed by a Reduce operation that aggregates all the values sharing the same key.

Our use of a functional model with user specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
A functional model with user-specified Map and Reduce operations makes it easy to parallelize large computations, and to use re-execution as the primary fault-tolerance mechanism.

Programming Model

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.
The Map function, written by the user, takes an input key/value pair and produces intermediate key/value pairs. The library then groups all values with the same intermediate key and passes them to the Reduce function.

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation
The Reduce function, also written by the user, accepts an intermediate key I and the set of values for that key, and merges them into a possibly smaller set of values. Typically each Reduce invocation produces zero or one output value.
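The paper's canonical word-count example, with the library's group-by-key step simulated in memory, can be sketched as follows (all names are illustrative; this is a sequential sketch, not the distributed implementation):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, values):
    # Reduce: sum all counts emitted for the same word.
    yield (word, sum(values))

def run(documents):
    # Simulate the grouping the MapReduce library performs between the
    # two phases: collect intermediate values by key, then reduce each group.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return dict(kv for key in groups for kv in reduce_fn(key, groups[key]))

counts = run({"doc1": "the quick fox", "doc2": "the lazy dog"})
```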

2.3 More Examples

Distributed Grep:
This can be used for distributed log search.
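A minimal sketch of the distributed-grep pattern: the Map function emits a line whenever it matches the supplied pattern, and the Reduce function is just the identity (the function names and sample log lines are made up):

```python
import re

def grep_map(filename, lines, pattern):
    # Map: emit (filename, line) whenever the line matches the pattern.
    for line in lines:
        if re.search(pattern, line):
            yield (filename, line)

def grep_reduce(key, values):
    # Reduce: identity -- just copy the intermediate matches to the output.
    yield (key, values)

matches = list(grep_map("app.log", ["ok", "ERROR: disk full", "ok"], r"ERROR"))
```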

Reverse Web-Link Graph:
This is equivalent to Google's link: search operator, which finds the pages referencing a given URL.
Applied to our current tagging project, the analogue would be: given a user, aggregate all of that user's related attributes.
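The pattern can be sketched as: Map emits a (target, source) pair for each outgoing link, and Reduce concatenates the sources pointing at each target (a sequential sketch over hypothetical page data):

```python
from collections import defaultdict

def map_links(source_url, target_urls):
    # Map: for each outgoing link, emit (target, source).
    for target in target_urls:
        yield (target, source_url)

def reduce_links(target, sources):
    # Reduce: gather every source that links to this target.
    return (target, sorted(sources))

pages = {
    "a.html": ["c.html"],
    "b.html": ["c.html", "a.html"],
}
groups = defaultdict(list)
for src, targets in pages.items():
    for target, source in map_links(src, targets):
        groups[target].append(source)
backlinks = dict(reduce_links(t, s) for t, s in groups.items())
```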

Term-Vector per Host: 主機檢索詞向量
Summarizes the most frequent, i.e. most significant, words occurring in a set of documents, as hostname : term-vector pairs, where the hostname is extracted from the document's URL.
This pattern can serve scenarios such as personalized recommendation.
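A rough sketch of that pattern: Map emits one (hostname, term-vector) pair per document, and Reduce adds the vectors and keeps only the most frequent terms (the keep parameter and sample URLs are illustrative assumptions):

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def map_doc(url, contents):
    # Map: emit (hostname, term-vector) for one document; the hostname
    # is extracted from the document's URL.
    yield (urlparse(url).hostname, Counter(contents.lower().split()))

def reduce_host(hostname, vectors, keep=3):
    # Reduce: add the per-document vectors together and keep only the
    # most frequent terms, discarding the infrequent ones.
    total = Counter()
    for v in vectors:
        total.update(v)
    return hostname, dict(total.most_common(keep))

groups = defaultdict(list)
for url, text in [("http://example.com/a", "big data big systems"),
                  ("http://example.com/b", "big ideas")]:
    for host, vec in map_doc(url, text):
        groups[host].append(vec)
host, top_terms = reduce_host("example.com", groups["example.com"])
```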

Inverted Index:倒排索引
The Map function emits word : documentId pairs;
the Reduce function emits word : list(documentId) pairs.
This yields a very simple inverted index, usable for search.
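That two-line description can be sketched directly (sequential and in-memory; document ids are made up):

```python
from collections import defaultdict

def map_index(doc_id, contents):
    # Map: emit (word, doc_id) for every distinct word in the document.
    for word in set(contents.split()):
        yield (word, doc_id)

def reduce_index(word, doc_ids):
    # Reduce: sort the document ids to form the posting list.
    return (word, sorted(doc_ids))

postings = defaultdict(list)
for doc_id, text in {"d1": "apple banana", "d2": "banana"}.items():
    for word, d in map_index(doc_id, text):
        postings[word].append(d)
index = dict(reduce_index(w, ids) for w, ids in postings.items())
```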

Distributed Sort:
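In the paper's sort benchmark, Map extracts the key and emits (key, record), and Reduce emits the pairs unchanged; sorted output falls out of the library's guarantee that intermediate keys are processed in increasing order within each partition. A sequential sketch (partitioning omitted, record shape assumed):

```python
def map_sort(record):
    # Map: extract the sort key from the record and emit (key, record).
    yield (record["id"], record)

def reduce_sort(key, records):
    # Reduce: emit the pairs unchanged; the framework's per-partition
    # key ordering is what makes the final output sorted.
    for r in records:
        yield (key, r)

records = [{"id": 3}, {"id": 1}, {"id": 2}]
# The library sorts intermediate pairs by key before Reduce runs.
intermediate = sorted((kv for r in records for kv in map_sort(r)),
                      key=lambda kv: kv[0])
output = []
for key, rec in intermediate:
    output.extend(r for _, r in reduce_sort(key, [rec]))
```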

3 Implementation

The file system uses replication to provide availability and reliability on top of unreliable hardware.
The file system uses replication to provide availability and reliability on top of unreliable hardware.

Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.
Users submit jobs to a scheduling system. Each job consists of a set of tasks, which the scheduler maps onto a set of available machines within a cluster.

3.1 Execution Overview

Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function (e.g., hash(key) mod R)
Reduce invocations are distributed by partitioning the intermediate key space into R pieces with a partitioning function, e.g. hash(key) mod R.
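The quoted hash(key) mod R scheme can be sketched directly. A stable hash is needed for the mapping to be deterministic across processes; CRC32 is an illustrative choice here, since Python's built-in hash() of a string is salted per process:

```python
import zlib

def partition(key, R):
    # Route an intermediate key to one of R reduce partitions.
    # CRC32 is a stable stand-in for the paper's hash(key) mod R;
    # Python's built-in hash() of a str varies between processes.
    return zlib.crc32(key.encode()) % R

# Every occurrence of the same key lands in the same partition,
# so a single Reduce task sees all of that key's values.
p1 = partition("apple", 4)
p2 = partition("apple", 4)
```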

Execution flow overview:

  1. The MapReduce library in the user program first splits the input files into M pieces, typically 16 to 64 MB each, and then starts copies of the program on the cluster machines.
  2. The master picks idle machines and assigns a Map or Reduce task to each worker.
  3. A worker assigned a Map task reads the corresponding input split, parses key/value pairs out of it, and passes each record to the user-defined Map function. The intermediate key/value pairs produced are buffered in memory.
  4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. Their locations are passed back to the master, which forwards them to the workers assigned Reduce tasks.
  5. When a Reduce worker is notified of these locations by the master, it reads the intermediate data remotely. After reading all of it, it sorts the data by intermediate key so that all occurrences of the same key are grouped together. The sort is necessary; if the intermediate data is too large to fit in memory, an external sort is used.
  6. The Reduce worker iterates over the sorted intermediate data and, for each unique key encountered, passes the key and the corresponding set of values to the Reduce function, writing the result to the output file for its partition.
  7. When all Map and Reduce tasks have completed, the master wakes up the user program and returns the results.
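The seven steps above can be condensed into a sequential, in-memory sketch (M splits, R partitions via a stable hash; all names are illustrative and nothing is actually distributed):

```python
import zlib
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, M=2, R=2):
    """Sequential sketch of the seven execution steps, in one process."""
    # Step 1: split the input into M pieces.
    splits = [inputs[i::M] for i in range(M)]
    # Steps 3-4: run Map on each split and partition the intermediate
    # pairs into R regions with a stable hash(key) mod R.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:
        for record in split:
            for key, value in map_fn(record):
                regions[zlib.crc32(key.encode()) % R][key].append(value)
    # Steps 5-6: each Reduce task sorts its region by key and applies
    # the Reduce function, producing one output partition per task.
    outputs = []
    for region in regions:
        outputs.append({k: reduce_fn(k, region[k]) for k in sorted(region)})
    return outputs  # step 7: R output "files"

# Word count over four input records, with M=2 splits and R=2 partitions.
out = run_mapreduce(
    ["a b", "b c", "a", "c c"],
    map_fn=lambda rec: [(w, 1) for w in rec.split()],
    reduce_fn=lambda k, vs: sum(vs),
)
```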

Typically, users do not need to combine these R output files into one file – they often pass these files as input to another MapReduce call, or use them from another distributed application that is able to deal with input that is partitioned into multiple files.
Typically, users do not need to combine these R output files into one: they are often passed as input to another MapReduce call, or consumed by another distributed application that can handle input partitioned into multiple files.

3.2 Master Data Structures

The master keeps several data structures. For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks)
The master keeps several data structures. For each Map and Reduce task it stores the state (idle, in-progress, or completed) and, for non-idle tasks, the identity of the worker machine.
The master also propagates and stores the locations of intermediate results.

3.3 Fault Tolerance

  1. Worker Failure

    The master pings every worker periodically. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers.
    The master detects failures by pinging every worker periodically.
    Map tasks on a failed machine are reset to idle so that the scheduler can reassign them to other workers.

    Completed map tasks are re-executed on a failure because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible.
    Completed Map tasks on a failed machine are re-executed, because their output is stored on that machine's local disk and is therefore inaccessible.
    Every re-executed Map task is announced to the Reduce workers; any that have not yet read the intermediate results switch to the new location.
    (I find this design questionable; the output could instead be kept in a remote, centralized store.)

    Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
    Completed Reduce tasks do not need to be re-executed, since their output is stored in the global file system.

  2. Master Failure

    It is easy to make the master write periodic checkpoints of the master data structures described above.
    It is easy to have the master write periodic checkpoints of the data structures described above.

    If the master task dies, a new copy can be started from the last checkpointed state.
    If the master task dies, a new copy can be started from the last checkpointed state.

4 Refinements

4.3 Combiner Function

Combiner function that does partial merging of this data before it is sent over the network.
A combiner function partially merges intermediate data before it is sent over the network.

When a MapReduce operation is close to completion, the master schedules backup executions
of the remaining in-progress tasks. The task is marked as completed whenever either the primary or the backup execution completes.
When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks, and a task is marked as completed whenever either the primary or the backup execution completes.
This guards against "straggler" tasks in the final phase.

The only difference between a reduce function and a combiner function is how the MapReduce library handles the output of the function
The only difference between a Reduce function and a combiner function is how the MapReduce library handles the function's output.
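A sketch of that relationship: for word count, the combiner applies the same summation as the reducer, but locally to a single map task's buffered output, shrinking what crosses the network (names are illustrative):

```python
from collections import defaultdict

def combine(intermediate, combiner_fn):
    # Apply the combiner to one map task's buffered output before it is
    # "sent over the network": values for the same key collapse locally.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return [(k, combiner_fn(k, vs)) for k, vs in groups.items()]

# For word count, the combiner is the same summation as the reducer.
mapper_output = [("the", 1), ("the", 1), ("fox", 1), ("the", 1)]
combined = combine(mapper_output, lambda k, vs: sum(vs))
# Only 2 pairs now need to cross the network instead of 4.
```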

5 Performance

5.2 Grep

The overhead is due to the propagation of the program to all worker machines, and delays interacting with
GFS to open the set of 1000 input files and to get the information needed for the locality optimization.
The startup overhead comes from propagating the program to all worker machines, and from delays interacting with GFS to open the 1,000 input files and obtain the information needed for the locality optimization.

6 Experience

Domains where MapReduce has been applied:

  • large-scale machine learning problems,
  • clustering problems for the Google News and Froogle products,
  • extraction of data used to produce reports of popular queries (e.g. Google Zeitgeist),
  • extraction of properties of web pages for new experiments and products
    (e.g. extraction of geographical locations from a large corpus of web pages for localized search), and
  • large-scale graph computations

6.1 Large-Scale Indexing

The performance of the MapReduce library is good enough that we can keep conceptually unrelated
computations separate, instead of mixing them together to avoid extra passes over the data.
This makes it easy to change the indexing process.
The performance of the MapReduce library is good enough that conceptually unrelated computations can be kept separate, instead of being mixed together to avoid extra passes over the data. This makes it easy to change the indexing process.
