
sample in Spark


Source code:

/**
 * Return a sampled subset of this RDD.
 *
 * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
 * @param fraction expected size of the sample as a fraction of this RDD's size
 *  without replacement: probability that each element is chosen; fraction must be [0, 1]
 * with replacement: expected number of times each element is chosen; fraction must be >= 0
 * @param seed seed for the random number generator
 */
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = withScope {
  require(fraction >= 0.0, "Negative fraction value: " + fraction)
  if (withReplacement) {
    new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
  } else {
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
  }
}

Parameters:

withReplacement: whether sampled elements are put back into the pool.

true: elements are replaced, so the returned subset may contain duplicates (an element can be drawn multiple times);
false: elements are not replaced, so the returned subset contains no duplicates.
fraction: the expected sample size as a fraction of the RDD's size. Without replacement it is the probability that each element is chosen and must lie in [0, 1]; with replacement it is the expected number of times each element is chosen and must be >= 0 (see the sketch after this list).

seed: seed for the random number generator.
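Note that fraction only controls the expected sample size; the actual number of returned elements varies from run to run. A quick way to see this, assuming a SparkContext named sc is already available (created the same way as in the example below):

// Sample a 10,000-element RDD at fraction = 0.1 a few times and inspect the sizes.
// Each size should be roughly 1,000, but rarely exactly 1,000 (values differ per run).
val big = sc.parallelize(1 to 10000)
val sizes = (1 to 3).map(_ => big.sample(withReplacement = false, fraction = 0.1).count())
println(sizes.mkString(", "))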

Example:

import org.apache.spark.{SparkConf, SparkContext}

object Sample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName(this.getClass.getName)
    val sc = new SparkContext(conf)
    val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
    // Without replacement: no duplicates, e.g. 2,3,6,8,9 (varies per run)
    println(rdd1.sample(false, 0.6).collect().mkString(","))
    // With replacement: duplicates are possible, e.g. 2.4.4.6 (varies per run)
    println(rdd1.sample(true, 0.6).collect().mkString("."))
    sc.stop()
  }
}
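Since the seed drives all random draws, passing the same seed (on the same RDD with the same partitioning) returns the same sample every time, which is useful for reproducible tests. A small follow-up sketch, reusing rdd1 from the example above:

// Same seed => identical samples; different seeds will generally differ.
val a = rdd1.sample(withReplacement = false, fraction = 0.6, seed = 42L).collect()
val b = rdd1.sample(withReplacement = false, fraction = 0.6, seed = 42L).collect()
println(a.sameElements(b)) // true: both calls return the same elements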