Spark学习之Spark RDD算子
個(gè)人主頁zicesun.com
這里,從源碼的角度總結(jié)一下Spark RDD算子的用法。
單值型Transformation算子
map
/*** Return a new RDD by applying a function to all elements of this RDD.*/def map[U: ClassTag](f: T => U): RDD[U] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))} 復(fù)制代碼源碼中有一個(gè) sc.clean() 函數(shù),它的所用是去除閉包中不能序列話的外部引用變量。Scala支持閉包,閉包會(huì)把它對外的引用(閉包里面引用了閉包外面的對像)保存到自己內(nèi)部,這個(gè)閉包就可以被單獨(dú)使用了,而不用擔(dān)心它脫離了當(dāng)前的作用域;但是在spark這種分布式環(huán)境里,這種作法會(huì)帶來問題,如果對外部的引用是不可serializable的,它就不能正確被發(fā)送到worker節(jié)點(diǎn)上去了;還有一些引用,可能根本沒有用到,這些沒有使用到的引用是不需要被發(fā)到worker上的; 實(shí)際上sc.clean函數(shù)調(diào)用的是ClosureCleaner.clean();ClosureCleaner.clean()通過遞歸遍歷閉包里面的引用,檢查不能serializable的, 去除unused的引用;
map函數(shù)是一個(gè)粗粒度的操作,對于一個(gè)RDD來說,會(huì)使用迭代器對分區(qū)進(jìn)行遍歷,然后針對一個(gè)分區(qū)使用你想要執(zhí)行的操作f, 然后返回一個(gè)新的RDD。其實(shí)可以理解為rdd的每一個(gè)元素都會(huì)執(zhí)行同樣的操作。
val array = Array(1,2,3,4,5,6) array: Array[Int] = Array(1, 2, 3, 4, 5, 6) val rdd = sc.app appName applicationAttemptId applicationId val rdd = sc.parallelize(array, 2) rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26 val mapRdd = rdd.map(x => x * 2) mapRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25 mapRdd.collect().foreach(println) 2 4 6 8 10 12 復(fù)制代碼flatMap
flatMap方法與map方法類似,但是允許一次map方法中輸出多個(gè)對象,而不是map中的一個(gè)對象經(jīng)過函數(shù)轉(zhuǎn)換成另一個(gè)對象。
/*** Return a new RDD by first applying a function to all elements of this* RDD, and then flattening the results.*/def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))} 復(fù)制代碼 val a = sc.parallelize(1 to 10, 5) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24 a.flatMap(num => 1 to num).collect res1: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)復(fù)制代碼mapPartitions
mapPartitions是map的另一個(gè)實(shí)現(xiàn),map的輸入函數(shù)應(yīng)用與RDD的每個(gè)元素,但是mapPartitions的輸入函數(shù)作用于每個(gè)分區(qū),也就是每個(gè)分區(qū)的內(nèi)容作為整體。
/*** Return a new RDD by applying a function to each partition of this RDD.** `preservesPartitioning` indicates whether the input function preserves the partitioner, which* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.*/def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U],preservesPartitioning: Boolean = false): RDD[U] = withScope {val cleanedF = sc.clean(f)new MapPartitionsRDD(this,(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),preservesPartitioning)} 復(fù)制代碼 def myfunc[T](iter: Iterator[T]):Iterator[(T,T)]={ | var res = List[(T,T)]()| var pre = iter.next| while(iter.hasNext){| var cur = iter.next| res .::= (pre, cur)| pre = cur| }| res.iterator| } myfunc: [T](iter: Iterator[T])Iterator[(T, T)] val a = sc.parallelize(1 to 9,3) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24 a.mapPartitions mapPartitions mapPartitionsWithIndex a.mapPartitions(myfunc).collect res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8)) 復(fù)制代碼mapPartitionWithIndex
mapPartitionWithIndex方法與mapPartitions方法類似,不同的是mapPartitionWithIndex會(huì)對原始分區(qū)的索引進(jìn)行追蹤,這樣就可以知道分區(qū)所對應(yīng)的元素,方法的參數(shù)為一個(gè)函數(shù),函數(shù)的輸入為整型索引和迭代器。
/*** Return a new RDD by applying a function to each partition of this RDD, while tracking the index* of the original partition.** `preservesPartitioning` indicates whether the input function preserves the partitioner, which* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.*/def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U],preservesPartitioning: Boolean = false): RDD[U] = withScope {val cleanedF = sc.clean(f)new MapPartitionsRDD(this,(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),preservesPartitioning)} 復(fù)制代碼 val x = sc.parallelize(1 to 10, 3) x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24 def myFunc(index:Int, iter:Iterator[Int]):Iterator[String]={| iter.toList.map(x => index + "," + x).iterator| } myFunc: (index: Int, iter: Iterator[Int])Iterator[String] x.mapPartitions mapPartitions mapPartitionsWithIndex x.mapPartitionsWithIndex(myFunc).collect res1: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10) 復(fù)制代碼foreach
foreach主要對每一個(gè)輸入的數(shù)據(jù)對象執(zhí)行循環(huán)操作,可以用來執(zhí)行對RDD元素的輸出操作。
/*** Applies a function f to all elements of this RDD.*/def foreach(f: T => Unit): Unit = withScope {val cleanF = sc.clean(f)sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))} 復(fù)制代碼 var x = sc.parallelize(List(1 to 9), 3) x: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[5] at parallelize at <console>:24 x.foreach(print) Range(1, 2, 3, 4, 5, 6, 7, 8, 9) 復(fù)制代碼foreachPartition
foreachPartition方法和mapPartition的作用一樣,通過迭代器參數(shù)對RDD中每一個(gè)分區(qū)的數(shù)據(jù)對象應(yīng)用函數(shù),區(qū)別在于使用的參數(shù)是否有返回值。
/*** Applies a function f to each partition of this RDD.*/def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {val cleanF = sc.clean(f)sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))} 復(fù)制代碼 val b = sc.parallelize(List(1,2,3,4,5,6), 3) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24 b.foreachPartition(x => println(x.reduce((a,b) => a +b))) 7 3 11 復(fù)制代碼glom
glom的作用與collec類似,collect是將RDD直接轉(zhuǎn)化為數(shù)組的形式,而glom則是將RDD分區(qū)數(shù)據(jù)組裝到數(shù)組類型的RDD中,每一個(gè)返回的數(shù)組包含一個(gè)分區(qū)的所有元素,按分區(qū)轉(zhuǎn)化為數(shù)組,有幾個(gè)分區(qū)就返回幾個(gè)數(shù)組類型的RDD。
/*** Return an RDD created by coalescing all elements within each partition into an array.*/def glom(): RDD[Array[T]] = withScope {new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))} 復(fù)制代碼下面的例子中,RDD a有三個(gè)分區(qū),glom將a轉(zhuǎn)化為由三個(gè)數(shù)組構(gòu)成的RDD。
val a = sc.parallelize(1 to 9, 3) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24 a.glom.collect res5: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9)) a.glom res6: org.apache.spark.rdd.RDD[Array[Int]] = MapPartitionsRDD[4] at glom at <console>:26 復(fù)制代碼union
union方法與++方法是等價(jià)的,將兩個(gè)RDD去并集,取并集的過程中不會(huì)去重。
/*** Return the union of this RDD and another one. Any identical elements will appear multiple* times (use `.distinct()` to eliminate them).*/def union(other: RDD[T]): RDD[T] = withScope {sc.union(this, other)}/*** Return the union of this RDD and another one. Any identical elements will appear multiple* times (use `.distinct()` to eliminate them).*/def ++(other: RDD[T]): RDD[T] = withScope {this.union(other)} 復(fù)制代碼 val a = sc.parallelize(1 to 4,2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24 val b = sc.parallelize(2 to 5,1) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24 a.un union unpersist a.union(b).collect res7: Array[Int] = Array(1, 2, 3, 4, 2, 3, 4, 5)復(fù)制代碼cartesian
計(jì)算兩個(gè)RDD中每個(gè)對象的笛卡爾積
/*** Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of* elements (a, b) where a is in `this` and b is in `other`.*/def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {new CartesianRDD(sc, this, other)}復(fù)制代碼 val a = sc.parallelize(1 to 4,2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24 val b = sc.parallelize(2 to 5,1) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24 a.cartesian(b).collect res8: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5), (4,2), (4,3), (4,4), (4,5))復(fù)制代碼groupBy
groupBy方法有三個(gè)重載方法,功能是講元素通過map函數(shù)生成Key-Value格式,然后使用groupByKey方法對Key-Value進(jìn)行聚合。
/*** Return an RDD of grouped items. Each group consists of a key and a sequence of elements* mapping to that key. The ordering of elements within each group is not guaranteed, and* may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.*/def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {groupBy[K](f, defaultPartitioner(this))}/*** Return an RDD of grouped elements. Each group consists of a key and a sequence of elements* mapping to that key. The ordering of elements within each group is not guaranteed, and* may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.*/def groupBy[K](f: T => K,numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {groupBy(f, new HashPartitioner(numPartitions))}/*** Return an RDD of grouped items. Each group consists of a key and a sequence of elements* mapping to that key. The ordering of elements within each group is not guaranteed, and* may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.*/def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null): RDD[(K, Iterable[T])] = withScope {val cleanF = sc.clean(f)this.map(t => (cleanF(t), t)).groupByKey(p)}復(fù)制代碼 val a = sc.parallelize(1 to 9,2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24 a.groupBy(x => {if(x % 2 == 0) "even" else "odd"}).collect res9: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8)), (odd,CompactBuffer(1, 3, 5, 7, 9))) def myfunc(a: Int):Int={| a % 2| } myfunc: (a: Int)Int a.groupBy(myfunc).collect res10: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8)), (1,CompactBuffer(1, 3, 5, 7, 9))) a.groupBy(myfunc(_), 1).collect res11: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8)), (1,CompactBuffer(1, 3, 5, 7, 9)))復(fù)制代碼filter
filter方法對輸入元素進(jìn)行過濾,參數(shù)是一個(gè)返回值為boolean的函數(shù),如果函數(shù)對元素的運(yùn)算結(jié)果為true,則通過元素,否則就將該元素過濾,不進(jìn)入結(jié)果集。
/*** Return a new RDD containing only the elements that satisfy a predicate.*/def filter(f: T => Boolean): RDD[T] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[T, T](this,(context, pid, iter) => iter.filter(cleanF),preservesPartitioning = true)}復(fù)制代碼 val a = sc.parallelize(List("we", "are", "from", "China", "not", "from", "America")) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>:24 val b = a.filter(x => x.length >= 4) b: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at filter at <console>:25 b.collect.foreach(println) from China from America復(fù)制代碼distinct
distinct方法將RDD中重復(fù)的元素去掉,只留下唯一的RDD元素。
/*** Return a new RDD containing the distinct elements in this RDD.*/def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)}復(fù)制代碼 val a = sc.parallelize(List("we", "are", "from", "China", "not", "from", "America")) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24 val b = a.map(x => x.length) b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at map at <console>:25 val c = b.distinct c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[22] at distinct at <console>:25 c.foreach(println) 5 4 2 3 7復(fù)制代碼subtract
subtract方法就是求集合A-B,即把集合A中包含集合B的元素都刪除,返回剩下的元素。
/*** Return an RDD with the elements from `this` that are not in `other`.** Uses `this` partitioner/partition size, because even if `other` is huge, the resulting* RDD will be <= us.*/def subtract(other: RDD[T]): RDD[T] = withScope {subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))}/*** Return an RDD with the elements from `this` that are not in `other`.*/def subtract(other: RDD[T], numPartitions: Int): RDD[T] = withScope {subtract(other, new HashPartitioner(numPartitions))}/*** Return an RDD with the elements from `this` that are not in `other`.*/def subtract(other: RDD[T],p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {if (partitioner == Some(p)) {// Our partitioner knows how to handle T (which, since we have a partitioner, is// really (K, V)) so make a new Partitioner that will de-tuple our fake tuplesval p2 = new Partitioner() {override def numPartitions: Int = p.numPartitionsoverride def getPartition(k: Any): Int = p.getPartition(k.asInstanceOf[(Any, _)]._1)}// Unfortunately, since we're making a new p2, we'll get ShuffleDependencies// anyway, and when calling .keys, will not have a partitioner set, even though// the SubtractedRDD will, thanks to p2's de-tupled partitioning, already be// partitioned by the right/real keys (e.g. p).this.map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys} else {this.map(x => (x, null)).subtractByKey(other.map((_, null)), p).keys}}復(fù)制代碼 val a = sc.parallelize(1 to 9, 2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24 val b = sc.parallelize(2 to 5, 4) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:24 val c = a.subtract(b) c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at subtract at <console>:27 c.collect res14: Array[Int] = Array(6, 8, 1, 7, 9)復(fù)制代碼persist與cache
cache,緩存數(shù)據(jù),把RDD緩存到內(nèi)存中,以便下次計(jì)算式再次被調(diào)用。persist是把RDD根據(jù)不同的級別進(jìn)行持久化,通過參數(shù)指定持久化級別,如果不帶參數(shù)則為默認(rèn)持久化級別,即只保存到內(nèi)存中,與Cache等價(jià)。
sample
sample方法的作用是隨即對RDD中的元素進(jìn)行采樣,或得一個(gè)新的子RDD。根據(jù)參數(shù)制定是否放回采樣,子集占總數(shù)的百分比和隨機(jī)種子。
/*** Return a sampled subset of this RDD.** @param withReplacement can elements be sampled multiple times (replaced when sampled out)* @param fraction expected size of the sample as a fraction of this RDD's size* without replacement: probability that each element is chosen; fraction must be [0, 1]* with replacement: expected number of times each element is chosen; fraction must be greater* than or equal to 0* @param seed seed for the random number generator** @note This is NOT guaranteed to provide exactly the fraction of the count* of the given [[RDD]].*/def sample(withReplacement: Boolean,fraction: Double,seed: Long = Utils.random.nextLong): RDD[T] = {require(fraction >= 0,s"Fraction must be nonnegative, but got ${fraction}")withScope {require(fraction >= 0.0, "Negative fraction value: " + fraction)if (withReplacement) {new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)} else {new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)}}}復(fù)制代碼 val a = sc.parallelize(1 to 100, 2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24 val b = a.sample(false, 0.2, 0) b: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[32] at sample at <console>:25 b.foreach(println) 5 19 20 26 27 29 30 57 40 61 45 68 73 50 75 79 81 85 89 99復(fù)制代碼鍵值對型transformation算子
groupByKey
類似于groupBy,將每一個(gè)相同的Key的Value聚集起來形成序列,可以使用默認(rèn)的分區(qū)器和自定義的分區(qū)器。
/*** Group the values for each key in the RDD into a single sequence. Allows controlling the* partitioning of the resulting key-value pair RDD by passing a Partitioner.* The ordering of elements within each group is not guaranteed, and may even differ* each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.** @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any* key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.*/def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {// groupByKey shouldn't use map side combine because map side combine does not// reduce the amount of data shuffled and requires all map side data be inserted// into a hash table, leading to more objects in the old gen.val createCombiner = (v: V) => CompactBuffer(v)val mergeValue = (buf: CompactBuffer[V], v: V) => buf += vval mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2val bufs = combineByKeyWithClassTag[CompactBuffer[V]](createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)bufs.asInstanceOf[RDD[(K, Iterable[V])]]}/*** Group the values for each key in the RDD into a single sequence. Hash-partitions the* resulting RDD with into `numPartitions` partitions. The ordering of elements within* each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.** @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any* key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.*/def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {groupByKey(new HashPartitioner(numPartitions))}復(fù)制代碼 val a = sc.parallelize(List("mk", "zq", "xwc", "fig", "dcp", "snn"), 2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[33] at parallelize at <console>:24 val b = a.keyBy(x => x.length) b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[34] at keyBy at <console>:25 b.groupByKey.collect res17: Array[(Int, Iterable[String])] = Array((2,CompactBuffer(mk, zq)), (3,CompactBuffer(xwc, fig, dcp, snn)))復(fù)制代碼combineByKey
comineByKey方法能夠有效地講鍵值對形式的RDD相同的Key的Value合并成序列形式,用戶能自定義RDD的分區(qū)器和是否在Map端進(jìn)行聚合操作。
/*** Generic function to combine the elements for each key using a custom set of aggregation* functions. This method is here for backward compatibility. It does not provide combiner* classtag information to the shuffle.** @see `combineByKeyWithClassTag`*/def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C,mergeCombiners: (C, C) => C,partitioner: Partitioner,mapSideCombine: Boolean = true,serializer: Serializer = null): RDD[(K, C)] = self.withScope {combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,partitioner, mapSideCombine, serializer)(null)}/*** Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.* This method is here for backward compatibility. It does not provide combiner* classtag information to the shuffle.** @see `combineByKeyWithClassTag`*/def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C,mergeCombiners: (C, C) => C,numPartitions: Int): RDD[(K, C)] = self.withScope {combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)}復(fù)制代碼 val a = sc.parallelize(List("xwc", "fig","wc", "dcp", "zq", "znn", "mk", "zl", "hk", "lp"), 2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at parallelize at <console>:24 val b = sc.parallelize(List(1,2,2,3,2,1,2,2,2,3),2) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[37] at parallelize at <console>:24 val c = b.zip(a) c: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[38] at zip at <console>:27 val d = c.combineByKey(List(_), (x:List[String], y:String)=>y::x, (x:List[String], y:List[String])=>x::: y) d: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[39] at combineByKey at <console>:25 d.collect res18: Array[(Int, List[String])] = Array((2,List(zq, wc, fig, hk, zl, mk)), (1,List(xwc, znn)), (3,List(dcp, lp)))復(fù)制代碼上面的例子使用三個(gè)參數(shù)重載的方法,該方法的第一個(gè)參數(shù)createCombiner把元素V轉(zhuǎn)換成另一類元素C,該例子中使用的參數(shù)是List(_),表示將輸入元素放在List集合中;第二個(gè)參數(shù)mergeValue的含義是吧元素V合并到元素C中,該例子中使用的是(x:List[String],y:String)=>y::x,表示將y字符合并到x鏈表集合中;第三個(gè)參數(shù)的含義是講兩個(gè)C元素合并,該例子中使用的是(x:List[String], y:List[String])=>x:::y, 表示把x鏈表集合中的內(nèi)容合并到y(tǒng)鏈表中。
reduceByKey
使用一個(gè)reduce函數(shù)來實(shí)現(xiàn)對想要的Key的value的聚合操作,發(fā)送給reduce前會(huì)在map端本地merge操作,該方法的底層實(shí)現(xiàn)是調(diào)用combineByKey方法的一個(gè)重載方法。
/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce.*/def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)}/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.*/def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {reduceByKey(new HashPartitioner(numPartitions), func)}/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/* parallelism level.*/def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {reduceByKey(defaultPartitioner(self), func)}復(fù)制代碼 val a = sc.parallelize(List("dcp","fjg","snn","wc", "za"), 2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24 val b = a.map(x => (x.length,x)) b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[2] at map at <console>:25 b.reduceByKey((a, b) => a + b ).collect res1: Array[(Int, String)] = Array((2,wcza), (3,dcpfjgsnn))復(fù)制代碼sortByKey
根據(jù)Key值對鍵值對進(jìn)行排序,如果是字符,則按照字典順序排序,如果是數(shù)組則按照數(shù)字大小排序,可通過參數(shù)指定升序還是降序。
val a = sc.parallelize(List("dcp","fjg","snn","wc", "za"), 2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:24 val b = sc.parallelize(1 to a.count.toInt,2) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:26 val c = a.zip(b) c: org.apache.spark.rdd.RDD[(String, Int)] = ZippedPartitionsRDD2[6] at zip at <console>:27 c.sortByKey(true).collect res2: Array[(String, Int)] = Array((dcp,1), (fjg,2), (snn,3), (wc,4), (za,5))復(fù)制代碼cogroup
scala> val a = sc.parallelize(List(1,2,2,3,1,3),2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24scala> val b = a.map(x => (x, "b")) b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[11] at map at <console>:25scala> val c = a.map(x => (x, "c")) c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[12] at map at <console>:25scala> b.cogroup(c).collect res3: Array[(Int, (Iterable[String], Iterable[String]))] = Array((2,(CompactBuffer(b, b),CompactBuffer(c, c))), (1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b, b),CompactBuffer(c, c))))scala> val a = sc.parallelize(List(1,2,2,2,1,3),1) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24scala> val b = a.map(x => (x, "b")) b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[16] at map at <console>:25scala> val c = a.map(x => (x, "c")) c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[17] at map at <console>:25scala> b.cogroup(c).collect res4: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))), (2,(CompactBuffer(b, b, b),CompactBuffer(c, c, c))))復(fù)制代碼join
首先對RDD進(jìn)行cogroup操作,然后對每個(gè)新的RDD下Key的值進(jìn)行笛卡爾積操作,再返回結(jié)果使用flatmapValue方法。
scala> val a= sc.parallelize(List("fjg","wc","xwc"),2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[20] at parallelize at <console>:24 scala> val c = sc.parallelize(List("fig", "wc", "sbb", "zq","xwc","dcp"), 2) c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at <console>:24scala> val d = c.keyBy(_.length) d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[23] at keyBy at <console>:25scala> val b = a.keyBy(_.length) b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[24] at keyBy at <console>:25scala> b.join(d).collect res6: Array[(Int, (String, String))] = Array((2,(wc,wc)), (2,(wc,zq)), (3,(fjg,fig)), (3,(fjg,sbb)), (3,(fjg,xwc)), (3,(fjg,dcp)), (3,(xwc,fig)), (3,(xwc,sbb)), (3,(xwc,xwc)), (3,(xwc,dcp)))復(fù)制代碼Action算子
collect
把RDD中的元素以數(shù)組的形式返回。
/*** Return an array that contains all of the elements in this RDD.** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.*/def collect(): Array[T] = withScope {val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)Array.concat(results: _*)}復(fù)制代碼 val a = sc.parallelize(List("a", "b", "c"),2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at parallelize at <console>:24 a.collect res7: Array[String] = Array(a, b, c)復(fù)制代碼reduce
使用一個(gè)帶兩個(gè)參數(shù)的函數(shù)把元素進(jìn)行聚集,返回一個(gè)元素的結(jié)果。該函數(shù)中的二元操作應(yīng)該滿足交換律和結(jié)合律,這樣才能在并行計(jì)算中得到正確的計(jì)算結(jié)果。
/*** Reduces the elements of this RDD using the specified commutative and* associative binary operator.*/def reduce(f: (T, T) => T): T = withScope {val cleanF = sc.clean(f)val reducePartition: Iterator[T] => Option[T] = iter => {if (iter.hasNext) {Some(iter.reduceLeft(cleanF))} else {None}}var jobResult: Option[T] = Noneval mergeResult = (index: Int, taskResult: Option[T]) => {if (taskResult.isDefined) {jobResult = jobResult match {case Some(value) => Some(f(value, taskResult.get))case None => taskResult}}}sc.runJob(this, reducePartition, mergeResult)// Get the final result out of our Option, or throw an exception if the RDD was emptyjobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))}復(fù)制代碼 val a = sc.parallelize(1 to 10, 2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[29] at parallelize at <console>:24 a.reduce((a, b) => a + b) res8: Int = 55復(fù)制代碼take
take方法會(huì)從RDD中取出前n個(gè)元素。先掃描一個(gè)分區(qū),之后從分區(qū)中得到結(jié)果,然后評估該分區(qū)的元素是否滿足n,若果不滿足則繼續(xù)從其他分區(qū)中掃描獲取。
/*** Take the first num elements of the RDD. It works by first scanning one partition, and use the* results from that partition to estimate the number of additional partitions needed to satisfy* the limit.** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.** @note Due to complications in the internal implementation, this method will raise* an exception if called on an RDD of `Nothing` or `Null`.*/def take(num: Int): Array[T] = withScope {val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)if (num == 0) {new Array[T](0)} else {val buf = new ArrayBuffer[T]val totalParts = this.partitions.lengthvar partsScanned = 0while (buf.size < num && partsScanned < totalParts) {// The number of partitions to try in this iteration. It is ok for this number to be// greater than totalParts because we actually cap it at totalParts in runJob.var numPartsToTry = 1Lval left = num - buf.sizeif (partsScanned > 0) {// If we didn't find any rows after the previous iteration, quadruple and retry.// Otherwise, interpolate the number of partitions we need to try, but overestimate// it by 50%. We also cap the estimation in the end.if (buf.isEmpty) {numPartsToTry = partsScanned * scaleUpFactor} else {// As left > 0, numPartsToTry is always >= 1numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toIntnumPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)}}val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)res.foreach(buf ++= _.take(num - buf.size))partsScanned += p.size}buf.toArray}}復(fù)制代碼 val a = sc.parallelize(1 to 10, 2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[30] at parallelize at <console>:24 a.take(5) res9: Array[Int] = Array(1, 2, 3, 4, 5)復(fù)制代碼top
top會(huì)采用隱式排序轉(zhuǎn)換來獲取最大的前n個(gè)元素。
/*** Returns the top k (largest) elements from this RDD as defined by the specified* implicit Ordering[T] and maintains the ordering. This does the opposite of* [[takeOrdered]]. For example:* {{{* sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)* // returns Array(12)** sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)* // returns Array(6, 5)* }}}** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.** @param num k, the number of top elements to return* @param ord the implicit ordering for T* @return an array of top elements*/def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {takeOrdered(num)(ord.reverse)}/*** Returns the first k (smallest) elements from this RDD as defined by the specified* implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].* For example:* {{{* sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)* // returns Array(2)** sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)* // returns Array(2, 3)* }}}** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.** @param num k, the number of elements to return* @param ord the implicit ordering for T* @return an array of top elements*/def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {if (num == 0) {Array.empty} else {val mapRDDs = mapPartitions { items =>// Priority keeps the largest elements, so let's reverse the ordering.val queue = new BoundedPriorityQueue[T](num)(ord.reverse)queue ++= collectionUtils.takeOrdered(items, num)(ord)Iterator.single(queue)}if (mapRDDs.partitions.length == 0) {Array.empty} else {mapRDDs.reduce { (queue1, queue2) =>queue1 ++= queue2queue1}.toArray.sorted(ord)}}}復(fù)制代碼 val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2) c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24 c.top(3) res10: Array[Int] = Array(97, 32, 8)復(fù)制代碼count
count方法計(jì)算返回RDD中元素的個(gè)數(shù)。
/*** Return the number of elements in the RDD.*/def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum復(fù)制代碼 val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2) c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[33] at parallelize at <console>:24 c.count res11: Long = 9復(fù)制代碼takeSample
返回一個(gè)固定大小的數(shù)組形式的采樣子集,此外還會(huì)把返回元素的順序隨機(jī)打亂。
/*** Return a fixed-size sampled subset of this RDD in an array** @param withReplacement whether sampling is done with replacement* @param num size of the returned sample* @param seed seed for the random number generator* @return sample of specified size in an array** @note this method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.*/def takeSample(withReplacement: Boolean,num: Int,seed: Long = Utils.random.nextLong): Array[T] = withScope {val numStDev = 10.0require(num >= 0, "Negative number of elements requested")require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),"Cannot support a sample size > Int.MaxValue - " +s"$numStDev * math.sqrt(Int.MaxValue)")if (num == 0) {new Array[T](0)} else {val initialCount = this.count()if (initialCount == 0) {new Array[T](0)} else {val rand = new Random(seed)if (!withReplacement && num >= initialCount) {Utils.randomizeInPlace(this.collect(), rand)} else {val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,withReplacement)var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()// If the first sample didn't turn out large enough, keep trying to take samples;// this shouldn't happen often because we use a big multiplier for the initial sizevar numIters = 0while (samples.length < num) {logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()numIters += 1}Utils.randomizeInPlace(samples, rand).take(num)}}}}復(fù)制代碼 val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2) c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at <console>:24 c.takeSample(true,3, 1) res14: Array[Int] = Array(1, 3, 7)復(fù)制代碼saveAsTextFile
將RDD存儲為文本文件,一次存一行
countByKey
類似count,但是countByKey會(huì)根據(jù)Key計(jì)算對應(yīng)的Value個(gè)數(shù),返回Map類型的結(jié)果。
/*** Count the number of elements for each key, collecting the results to a local Map.** @note This method should only be used if the resulting map is expected to be small, as* the whole thing is loaded into the driver's memory.* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which* returns an RDD[T, Long] instead of a map.*/def countByKey(): Map[K, Long] = self.withScope {self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap}復(fù)制代碼 val c = sc.parallelize(List("fig", "wc", "sbb", "zq","xwc","dcp"), 2) c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at parallelize at <console>:24 val d = c.keyBy(_.length) d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[37] at keyBy at <console>:25 d.countByKey res15: scala.collection.Map[Int,Long] = Map(2 -> 2, 3 -> 4)復(fù)制代碼aggregate
/*** Aggregate the elements of each partition, and then the results for all the partitions, using* given combine functions and a neutral "zero value". This function can return a different result* type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U* and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are* allowed to modify and return their first argument instead of creating a new U to avoid memory* allocation.** @param zeroValue the initial value for the accumulated result of each partition for the* `seqOp` operator, and also the initial value for the combine results from* different partitions for the `combOp` operator - this will typically be the* neutral element (e.g. `Nil` for list concatenation or `0` for summation)* @param seqOp an operator used to accumulate results within a partition* @param combOp an associative operator used to combine results from different partitions*/def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {// Clone the zero value since we will also be serializing it as part of tasksvar jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())val cleanSeqOp = sc.clean(seqOp)val cleanCombOp = sc.clean(combOp)val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)sc.runJob(this, aggregatePartition, mergeResult)jobResult}復(fù)制代碼fold
/*** Aggregate the elements of each partition, and then the results for all the partitions, using a* given associative function and a neutral "zero value". The function* op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object* allocation; however, it should not modify t2.** This behaves somewhat differently from fold operations implemented for non-distributed* collections in functional languages like Scala. This fold operation may be applied to* partitions individually, and then fold those results into the final result, rather than* apply the fold to each element sequentially in some defined ordering. For functions* that are not commutative, the result may differ from that of a fold applied to a* non-distributed collection.** @param zeroValue the initial value for the accumulated result of each partition for the `op`* operator, and also the initial value for the combine results from different* partitions for the `op` operator - this will typically be the neutral* element (e.g. `Nil` for list concatenation or `0` for summation)* @param op an operator used to both accumulate results within a partition and combine results* from different partitions*/def fold(zeroValue: T)(op: (T, T) => T): T = withScope {// Clone the zero value since we will also be serializing it as part of tasksvar jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())val cleanOp = sc.clean(op)val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)sc.runJob(this, foldPartition, mergeResult)jobResult}復(fù)制代碼轉(zhuǎn)載于:https://juejin.im/post/5cfdf178e51d454d1d6284e8
總結(jié)
以上是生活随笔為你收集整理的Spark学习之Spark RDD算子的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: JS 设计模式之初识(一)-单例模式
- 下一篇: React入门看这篇就够了