scala - zipWithIndex and zipWithUniqueId: usage explained

阿新 • Published: 2018-12-02

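1. What they are

As the names suggest, zipWithIndex pairs each element with its index, and zipWithUniqueId pairs each element with a unique ID. The two methods are defined as follows: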

1. def zipWithIndex(): RDD[(T, Long)]

This function pairs each element of the RDD with its ID (index) in the RDD, producing key/value pairs.

2. def zipWithUniqueId(): RDD[(T, Long)]

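This function pairs each element of the RDD with a unique ID, producing key/value pairs. The unique ID is generated as follows:
The first element of each partition gets the partition index as its unique ID.
The Nth element (N > 1) of each partition gets (the previous element's unique ID) + (the total number of partitions in the RDD).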
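
The ID rule for zipWithUniqueId can be tried out in plain Scala without a Spark cluster. The sketch below is only an illustration under one assumption: an RDD is modelled as a `Seq` of partitions. `UniqueIdSketch` and `uniqueIds` are hypothetical names, not Spark APIs.

```scala
// Minimal sketch of the zipWithUniqueId ID rule, assuming an RDD can be
// modelled as a Seq of partitions (each partition is a Seq of elements).
// UniqueIdSketch and uniqueIds are illustrative names, not Spark APIs.
object UniqueIdSketch {
  def uniqueIds[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] = {
    val n = partitions.length.toLong // total number of partitions
    partitions.zipWithIndex.map { case (part, k) =>
      // The first ID in partition k is k; every following ID adds n.
      val ids = Iterator.iterate(k.toLong)(_ + n)
      part.iterator.zip(ids).toList
    }
  }
}
```

With the two partitions `Seq("A","B","C")` and `Seq("D","E","F")`, this hands out IDs 0, 2, 4 to the first partition and 1, 3, 5 to the second.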

2. How to use them

// 1. zipWithIndex
scala> var rdd2 = sc.makeRDD(Seq("A","B","R","D","F"),2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at makeRDD at :21
scala> rdd2.zipWithIndex().collect
res27: Array[(String, Long)] = Array((A,0), (B,1), (R,2), (D,3), (F,4))

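// 2. zipWithUniqueId
scala> var rdd1 = sc.makeRDD(Seq("A","B","C","D","E","F"),2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[44] at makeRDD at :21
// rdd1 has two partitions
scala> rdd1.zipWithUniqueId().collect
res32: Array[(String, Long)] = Array((A,0), (B,2), (C,4), (D,1), (E,3), (F,5))
// The total number of partitions is 2
// The first element of the first partition gets ID 0; the first element of the second partition gets ID 1
// The second element of the first partition gets ID 0+2=2; the third element of the first partition gets ID 2+2=4
// The second element of the second partition gets ID 1+2=3; the third element of the second partition gets ID 3+2=5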
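
In the example above both partitions hold three elements, so the IDs happen to cover 0 to 5 with nothing missing. When partitions differ in size, the IDs stay unique but are no longer contiguous. The following plain-Scala sketch illustrates this for hypothetical partition sizes 2 and 3; `GapSketch`, `idsFor` and `gaps` are made-up helper names, and partitions are modelled only by their sizes.

```scala
// Sketch only: partitions are modelled by their sizes, and GapSketch,
// idsFor and gaps are hypothetical helper names, not part of Spark.
object GapSketch {
  // Every ID that would be handed out: the i-th element (0-based)
  // of partition k, out of n partitions, gets i * n + k.
  def idsFor(partitionSizes: Seq[Int]): Seq[Long] = {
    val n = partitionSizes.length.toLong
    for {
      (size, k) <- partitionSizes.zipWithIndex
      i <- 0 until size
    } yield i * n + k
  }

  // IDs between 0 and the largest ID that were never handed out.
  def gaps(ids: Seq[Long]): Seq[Long] = {
    val seen: Set[Long] = ids.toSet
    if (ids.isEmpty) Seq.empty else (0L to ids.max).filterNot(seen.contains)
  }
}
```

For sizes `Seq(2, 3)` the IDs are 0, 2 for the first partition and 1, 3, 5 for the second, so ID 4 is never used.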

3. Digging into the source

1.zipWithIndex

  /**
   * Zips this RDD with its element indices. The ordering is first based on the partition index
   * and then the ordering of items within each partition. So the first item in the first
   * partition gets index 0, and the last item in the last partition receives the largest index.
   *
   * This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type.
   * This method needs to trigger a spark job when this RDD contains more than one partitions.
   *
   * @note Some RDDs, such as those returned by groupBy(), do not guarantee order of
   * elements in a partition. The index assigned to each element is therefore not guaranteed,
   * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
   * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
   */
  def zipWithIndex(): RDD[(T, Long)] = withScope {
    new ZippedWithIndexRDD(this)
  }

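As the doc comment says, "The ordering is first based on the partition index" and "the last item in the last partition receives the largest index": the index follows the partition order. The method simply creates a ZippedWithIndexRDD object, so let's look at that class next: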

/**
 * Represents an RDD zipped with its element indices. The ordering is first based on the partition
 * index and then the ordering of items within each partition. So the first item in the first
 * partition gets index 0, and the last item in the last partition receives the largest index.
 *
 * @param prev parent RDD
 * @tparam T parent RDD item type
 */
private[spark]
class ZippedWithIndexRDD[T: ClassTag](prev: RDD[T]) extends RDD[(T, Long)](prev) {

  /** The start index of each partition. */
  @transient private val startIndices: Array[Long] = {
    val n = prev.partitions.length
    if (n == 0) {
      Array.empty
    } else if (n == 1) {
      Array(0L)
    } else {
      prev.context.runJob(
        prev,
        Utils.getIteratorSize _,
        0 until n - 1 // do not need to count the last partition
      ).scanLeft(0L)(_ + _)
    }
  }

  override def getPartitions: Array[Partition] = {
    firstParent[T].partitions.map(x => new ZippedWithIndexRDDPartition(x, startIndices(x.index)))
  }

  override def getPreferredLocations(split: Partition): Seq[String] =
    firstParent[T].preferredLocations(split.asInstanceOf[ZippedWithIndexRDDPartition].prev)

  override def compute(splitIn: Partition, context: TaskContext): Iterator[(T, Long)] = {
    val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
    val parentIter = firstParent[T].iterator(split.prev, context)
    Utils.getIteratorZipWithIndex(parentIter, split.startIndex)
  }
}

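Sure enough, the index is assigned partition by partition: startIndices holds the offset at which each partition starts, and compute numbers the elements of a partition starting from that offset.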
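
The `startIndices` computation is the heart of `ZippedWithIndexRDD`: count every partition except the last, then take a running sum with `scanLeft`. Here is a plain-Scala sketch of the same idea, assuming partitions are modelled as in-memory `Seq`s (so counting is just `.length` rather than a Spark job). `StartIndexSketch` and its two methods are illustrative helpers, not the Spark classes.

```scala
// Sketch of the start-index idea on in-memory collections.
// StartIndexSketch is a hypothetical name; in Spark the counting step
// is a job run on the cluster, here it is simply Seq.length.
object StartIndexSketch {
  // Offset at which each partition's indices begin.
  def startIndices[T](partitions: Seq[Seq[T]]): Seq[Long] =
    if (partitions.isEmpty) Seq.empty
    else partitions.init.map(_.length.toLong).scanLeft(0L)(_ + _) // last partition is not counted

  // Number each partition's elements starting from its offset.
  def zipWithIndex[T](partitions: Seq[Seq[T]]): Seq[Seq[(T, Long)]] =
    partitions.zip(startIndices(partitions)).map { case (part, start) =>
      part.zipWithIndex.map { case (item, i) => (item, start + i) }
    }
}
```

For partitions `Seq("A","B")` and `Seq("R","D","F")` the start indices are 0 and 2, and the flattened result numbers the elements 0 through 4 in order.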

2.zipWithUniqueId

  /**
   * Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k,
   * 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method
   * won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
   *
   * @note Some RDDs, such as those returned by groupBy(), do not guarantee order of
   * elements in a partition. The unique ID assigned to each element is therefore not guaranteed,
   * and may even change if the RDD is reevaluated. If a fixed ordering is required to guarantee
   * the same index assignments, you should sort the RDD with sortByKey() or save it to a file.
   */
  def zipWithUniqueId(): RDD[(T, Long)] = withScope {
    val n = this.partitions.length.toLong
    this.mapPartitionsWithIndex { case (k, iter) =>
      Utils.getIteratorZipWithIndex(iter, 0L).map { case (item, i) =>
        (item, i * n + k)
      }
    }
  }

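The doc comment already spells out the ID rule: "will get ids k, n+k, 2*n+k, ..., where n is the number of partitions."
Even more noteworthy is the remark "won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]]": zipWithUniqueId does not launch a Spark job. Looking back at the zipWithIndex source, we find
prev.context.runJob, so that method does launch a Spark job. Why start a job just to number the elements? Because the start index of a partition depends on how many elements all preceding partitions hold, so every partition except the last has to be counted first. No job is needed when the RDD has at most one partition.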

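4. Summary

Both methods tag the elements of an RDD with an ID, but they differ as follows:

zipWithIndex assigns consecutive indices 0, 1, 2, ... ordered by partition and then by position within the partition, so its IDs are unique and gap-free. zipWithUniqueId generates Long IDs with the rule "k, n+k, 2*n+k", which are also guaranteed to be unique (presumably the reason for the name UniqueId) but may contain gaps when partitions differ in size.
zipWithUniqueId is cheaper, because zipWithIndex has to run a job (runJob) to count elements whenever the RDD has more than one partition.
What the two have in common is also noted in the docs: "Some RDDs, such as those returned by groupBy(), do not guarantee order of elements in a partition", so the assigned IDs may change if the RDD is re-evaluated.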
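
The ordering caveat can be illustrated on plain collections. The Spark doc comment recommends sorting with sortByKey() or saving to a file when stable assignments are needed. The sketch below only mimics that advice with an in-memory sort, assuming distinct elements; `StableIndexSketch`, `index` and `stableIndex` are hypothetical names.

```scala
// Sketch only: shows why element order matters for index assignment.
// Assumes the elements are distinct (they are used as Map keys).
// StableIndexSketch is an illustrative name, not a Spark API.
object StableIndexSketch {
  // Assign 0-based Long indices in the order the elements arrive.
  def index[T](items: Seq[T]): Map[T, Long] =
    items.zipWithIndex.map { case (item, i) => item -> i.toLong }.toMap

  // Sort first so the assignment no longer depends on arrival order.
  def stableIndex[T: Ordering](items: Seq[T]): Map[T, Long] =
    index(items.sorted)
}
```

Two evaluations that deliver the same elements in different orders, such as `Seq("b","a","c")` and `Seq("c","b","a")`, get different results from `index` but identical results from `stableIndex`.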
