Spark's sample
阿新 • Published: 2018-11-03
Source code:
```scala
/**
 * Return a sampled subset of this RDD.
 *
 * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
 * @param fraction expected size of the sample as a fraction of this RDD's size
 *  without replacement: probability that each element is chosen; fraction must be [0, 1]
 *  with replacement: expected number of times each element is chosen; fraction must be >= 0
 * @param seed seed for the random number generator
 */
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = withScope {
  require(fraction >= 0.0, "Negative fraction value: " + fraction)
  if (withReplacement) {
    new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
  } else {
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
  }
}
```
Parameters:
withReplacement: whether an element is put back after being sampled, i.e. whether it can be drawn more than once.
false: sampling without replacement; the returned subset contains no duplicates.
true: sampling with replacement; an element may appear multiple times in the result.
fraction: the expected size of the sample as a fraction of this RDD's size. Without replacement it is the probability that each element is chosen and must be in [0, 1]; with replacement it is the expected number of times each element is chosen and must be >= 0.
seed: seed for the random number generator.
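To make the two branches in the source concrete, here is a minimal plain-Scala sketch (no Spark) of the two sampling strategies `sample` chooses between. The object `SamplingSketch` and its method names are illustrative only, not Spark APIs; Spark's actual `BernoulliSampler` and `PoissonSampler` work per partition, but the per-element logic is analogous.

```scala
import scala.util.Random

object SamplingSketch {
  // Without replacement (Bernoulli): keep each element independently
  // with probability `fraction`, so the result never has duplicates.
  def bernoulliSample[T](xs: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    xs.filter(_ => rng.nextDouble() < fraction)
  }

  // With replacement (Poisson): emit each element k times, where k is drawn
  // from a Poisson distribution with mean `fraction`
  // (Knuth's multiplication algorithm), so duplicates are possible.
  def poissonSample[T](xs: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    val limit = math.exp(-fraction)
    xs.flatMap { x =>
      var k = 0
      var p = 1.0
      do { k += 1; p *= rng.nextDouble() } while (p > limit)
      Seq.fill(k - 1)(x)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 100000
    val b = bernoulliSample(data, 0.6, seed = 42L)
    val p = poissonSample(data, 0.6, seed = 42L)
    // Both sample sizes are close to fraction * n in expectation,
    // but only the Poisson variant can contain repeated elements.
    println(s"bernoulli size = ${b.size}, poisson size = ${p.size}")
    println(s"bernoulli has duplicates: ${b.distinct.size != b.size}")
  }
}
```

This also shows why `fraction` may exceed 1 only with replacement: a Poisson mean above 1 simply means each element is expected to appear more than once, while a Bernoulli probability cannot exceed 1.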
Example:
```scala
import org.apache.spark.{SparkConf, SparkContext}

object sample {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName(this.getClass.getName)
    val sc = new SparkContext(conf)
    val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9))
    // without replacement: no duplicates; e.g. 2,3,6,8,9 (varies per run, since no seed is given)
    println(rdd1.sample(false, 0.6).collect().mkString(","))
    // with replacement: duplicates possible; e.g. 2.4.4.6 (varies per run)
    println(rdd1.sample(true, 0.6).collect().mkString("."))
    sc.stop()
  }
}
```