當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

Spark 常用算子详解（转换算子、行动算子、控制算子）

發布時間：2024/3/12 编程问答 30 豆豆

生活随笔收集整理的這篇文章主要介紹了 Spark 常用算子详解（转换算子、行动算子、控制算子）小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

Spark簡介

Spark是專為大規模數據處理而設計的快速通用的計算引擎；
Spark擁有Hadoop MapReduce所具有的優點，但是運行速度卻比MapReduce有很大的提升，特別是在數據挖掘、機器學習等需要迭代的領域可提升100x倍的速度：

Spark是基于內存進行數據處理的，MapReduce是基于磁盤進行數據處理的；

Spark中具有DAG有向無環圖，DAG有向無環圖在此過程中減少了shuffle以及落地磁盤的次數；

Spark流程

Spark Application的運行環境：創建SparkConf對象

- 可以設置Application name；
- 可以設置運行模式及資源需求；

創建SparkContext對象；

- SparkContext向資源管理器申請運行Executor資源，并啟動StandaloneExecutorbackend；
- Executor向SparkContext申請Task；
- SparkContext將程序分發給Executor；
- SparkContext構建成DAG圖，將DAG圖分解成Stage、將Taskset發送給Task Scheduler，最后由Task Scheduler將Task發送給Executor運行；
- Task在Excutor上運行，運行完釋放所有的資源；

基于Spark的上下文創建一個RDD，對RDD進行處理；

應用程序中y有Action累算子來觸發Transformation類算子執行；

關閉Spark上下文對象SparkContext；

value 類型

細類型算子

輸入分區與輸出分區一對一型	map flatMap mapPartitions glom
輸入分區與輸出分區多對一型	union cartesain
輸入分區與輸出分區多對多型	groupBy
輸出分區為輸入分區子集型	filter distinct substract sample takeSample
Cache型	cache persist

key-value類型

細類型算子

輸入分區與輸出分區一對一	mapValues
對單個RDD或兩個RDD聚集	單個RDD聚集: combineByKey reduceByKey partitionBy; 兩個RDD聚集: Cogroup
連接	join leftOutJoin和 rightOutJoin

Action算子

細類型算子

無輸出	foreach
HDFS	saveAsTextFile saveAsObjectFile
Scala集合和數據類型	collect collectAsMap reduceByKeyLocally lookup count top reduce fold aggregate

轉換算子（Transformations）

不觸發提交作業，只是完成作業中間過程處理；Transformation 操作是延遲計算的，也就是說從一個RDD 轉換生成另一個 RDD 的轉換操作不是馬上執行，需要等到有 Action 操作的時候才會真正觸發運算。Transformation參數類型為value或者key-value的形式；

轉換算子是延遲執行的，也叫懶加載執行

map

將原來RDD的每個數據通過map中的用戶自定義函數映射為一個新的元素，源碼中map算子相當于初始化一個RDD --------- f(x)=x -> y

scala源碼 def map[U: ClassTag](f: T => U): RDD[U] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))}

flatMap

將原來 RDD 中的每個元素通過函數 f 轉換為新的元素，并將生成的 RDD 的每個集合中的元素合并為一個集合，內部創建 FlatMappedRDD(this，sc.clean(f))。 ------ f: T => TraversableOnce[U]

scala源碼
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))}
例子
lines.flatMap{lines => {lines.split(" ")

mapPartitions

mapPartitions 函數獲取到每個分區的迭代器，在函數中通過這個分區整體的迭代器對整個分區的元素進行操作。內部實現是生成 -------f (iter)=>iter.f ilter(_>=3)

scala源碼
def filter(f: T => Boolean): RDD[T] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[T, T](this,(context, pid, iter) => iter.filter(cleanF),preservesPartitioning = true)}

glom

glom函數將每個分區形成一個數組，內部實現是返回的GlommedRDD。圖4中的每個方框代表一個RDD分區。圖4中的方框代表一個分區。該圖表示含有V1、 V2、 V3的分區通過函數glom形成一數組Array[（V1），（V2），（V3）]

scala源碼
def glom(): RDD[Array[T]] = withScope {new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))}

union

使用 union 函數時需要保證兩個 RDD 元素的數據類型相同，返回的 RDD 數據類型和被合并的 RDD 元素數據類型相同，并不進行去重操作，保存所有元素。如果想去重可以使用 distinct()(并集)

scala源碼
def union(other: RDD[T]): RDD[T] = withScope {sc.union(this, other)}

cartesian

對兩個 RDD 內的所有元素進行笛卡爾積操作。操作后，內部實現返回CartesianRDD。圖6中左側大方框代表兩個 RDD，大方框內的小方框代表 RDD 的分區。右側大方框代表合并后的 RDD，大方框內的小方框代表分區。圖6中的大方框代表RDD，大方框中的小方框代表RDD分區。

scala源碼
def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {new CartesianRDD(sc, this, other)}

groupBy

將元素通過函數生成相應的 Key，數據就轉化為 Key-Value 格式，之后將 Key 相同的元素分為一組。

函數實現如下:

1）將用戶函數預處理：

val cleanF = sc.clean(f)

2）對數據 map 進行函數操作，最后再進行 groupByKey 分組操作。

this.map(t => (cleanF(t), t)).groupByKey(p)

scala源碼
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {groupBy[K](f, defaultPartitioner(this))}

distinct

對數據進行去重

scala源碼
/*** Return a new RDD containing the distinct elements in this RDD.*/def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)}/*** Return a new RDD containing the distinct elements in this RDD.*/def distinct(): RDD[T] = withScope {distinct(partitions.length)}

subtract

subtract相當于進行集合的差操作，RDD 1去除RDD 1和RDD 2交集中的所有元素；

scala源碼
/*** Return an RDD with the elements from `this` that are not in `other`.** Uses `this` partitioner/partition size, because even if `other` is huge, the resulting* RDD will be <= us.*/def subtract(other: RDD[T]): RDD[T] = withScope {subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))}/*** Return an RDD with the elements from `this` that are not in `other`.*/def subtract(other: RDD[T], numPartitions: Int): RDD[T] = withScope {subtract(other, new HashPartitioner(numPartitions))}/*** Return an RDD with the elements from `this` that are not in `other`.*/def subtract(other: RDD[T],p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {if (partitioner == Some(p)) {// Our partitioner knows how to handle T (which, since we have a partitioner, is// really (K, V)) so make a new Partitioner that will de-tuple our fake tuplesval p2 = new Partitioner() {override def numPartitions: Int = p.numPartitionsoverride def getPartition(k: Any): Int = p.getPartition(k.asInstanceOf[(Any, _)]._1)}

sample

sample 將 RDD 這個集合內的元素進行采樣，獲取所有元素的子集。用戶可以設定是否有放回的抽樣、百分比、隨機種子，進而決定采樣方式。內部實現是生成 SampledRDD(withReplacement， fraction， seed)。

函數參數設置：

withReplacement=true，表示有放回的抽樣。 withReplacement=false，表示無放回的抽樣。

scala源碼

def sample(withReplacement: Boolean,fraction: Double,seed: Long = Utils.random.nextLong): RDD[T] = {require(fraction >= 0,s"Fraction must be nonnegative, but got ${fraction}")withScope {require(fraction >= 0.0, "Negative fraction value: " + fraction)if (withReplacement) {new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)} else {new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)}}}

takeSample

takeSample（）函數和上面的sample函數是一個原理，但是不使用相對比例采樣，而是按設定的采樣個數進行采樣，同時返回結果不再是RDD，而是相當于對采樣后的數據進行Collect(),返回結果集為單機的數組

scala源碼

//返回集為數組def takeSample(withReplacement: Boolean,num: Int,seed: Long = Utils.random.nextLong): Array[T] = withScope {val numStDev = 10.0require(num >= 0, "Negative number of elements requested")require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),"Cannot support a sample size > Int.MaxValue - " +s"$numStDev * math.sqrt(Int.MaxValue)")if (num == 0) {new Array[T](0)} else {val initialCount = this.count()if (initialCount == 0) {new Array[T](0)} else {val rand = new Random(seed)if (!withReplacement && num >= initialCount) {Utils.randomizeInPlace(this.collect(), rand)} else {val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,withReplacement)var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()// If the first sample didn't turn out large enough, keep trying to take samples;// this shouldn't happen often because we use a big multiplier for the initial sizevar numIters = 0while (samples.length < num) {logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()numIters += 1}Utils.randomizeInPlace(samples, rand).take(num)}}}}

mapValues

針對（key, Value)型數據中的value進行map操作，而不對key進行處理

scala源碼

/*** Pass each value in the key-value pair RDD through a map function without changing the keys;* this also retains the original RDD's partitioning.*/def mapValues[U](f: V => U): RDD[(K, U)] = self.withScope {val cleanF = self.context.clean(f)new MapPartitionsRDD[(K, U), (K, V)](self,(context, pid, iter) => iter.map { case (k, v) => (k, cleanF(v)) },preservesPartitioning = true)}

combineByKey

scala源碼

def combineByKey[C](createCombiner: V => C, //C不存在的情況下，比如通過V創建seq CmergeValue: (C, V) => C, //當C已經存在的情況下需要merge，比如把item V加入到seq C中，或者疊加mergeCombiners: (C, C) => C, //合并兩個Cpartitioner: Partitioner, //Partitioner，Shuffle時需要的PartitionermapSideCombine: Boolean = true, //為了減小傳輸量，很多 combine 可以在 map端先做，比如疊加，可以先在一個 partition 中把所有相同的 key 的 value 疊加，再 shuff le。//傳輸需要序列化，用戶可以自定義序列化類serializer: Serializer = null): RDD[(K, C)] = self.withScope {combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,partitioner, mapSideCombine, serializer)(null)}

reduceByKey

reduceByKey是比combineByKey更簡單的一種情況，只是兩個值合并成一個值 -------- *（Int，Int V) >> (Int, IntC)*

scala源碼

/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce.*/def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)}/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.*/def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {reduceByKey(new HashPartitioner(numPartitions), func)}/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/* parallelism level.*/def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {reduceByKey(defaultPartitioner(self), func)}

partitionBy

partitionBy函數對RDD進行分區操作；------ partitionBy（partitioner：Partitioner）

scala源碼

/*** Return a copy of the RDD partitioned using the specified partitioner.*/def partitionBy(partitioner: Partitioner): RDD[(K, V)] = self.withScope {if (keyClass.isArray && partitioner.isInstanceOf[HashPartitioner]) {throw new SparkException("HashPartitioner cannot partition array keys.")}if (self.partitioner == Some(partitioner)) {self} else {new ShuffledRDD[K, V, V](self, partitioner)}}

cogroup

cogroup函數將兩個RDD進行協同劃分，對兩個RDD中的key-valuel類型的元素，每個RDD相同key的元素風別聚合為一個集合，并且返回兩個RDD中對應Key的元素集合的迭代器

scala源碼

def cogroup[W1, W2, W3](other1: RDD[(K, W1)],other2: RDD[(K, W2)],other3: RDD[(K, W3)],partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2], Iterable[W3]))] = self.withScope {if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {throw new SparkException("HashPartitioner cannot partition array keys.")}val cg = new CoGroupedRDD[K](Seq(self, other1, other2, other3), partitioner)cg.mapValues { case Array(vs, w1s, w2s, w3s) =>(vs.asInstanceOf[Iterable[V]],w1s.asInstanceOf[Iterable[W1]],w2s.asInstanceOf[Iterable[W2]],w3s.asInstanceOf[Iterable[W3]])}}

join

join 對兩個需要連接的 RDD 進行 cogroup函數操作，將相同 key 的數據能夠放到一個分區，在 cogroup 操作之后形成的新 RDD 對每個key 下的元素進行笛卡爾積的操作，返回的結果再展平，對應 key 下的所有元組形成一個集合。最后返回 RDD[(K， (V， W))]。

scala源碼

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {this.cogroup(other, partitioner).flatMapValues( pair =>for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w))}

sortyByKey(sortBy)

作用再（Key, Value)格式的數據上，根據Key進行升序或降序排序

scala源碼

def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)] = self.withScope{val part = new RangePartitioner(numPartitions, self, ascending)new ShuffledRDD[K, V, V](self, part).setKeyOrdering(if (ascending) ordering else ordering.reverse)}def sortBy[K](f: (T) => K,ascending: Boolean = true,numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {this.keyBy[K](f).sortByKey(ascending, numPartitions).values}

Action算子（行動算子）

本質上在Action算子中通過SparkContext觸發SparkContext提交job作業。Action 算子會觸發 Spark 提交作業（Job），并將數據輸出 Spark系統。

foreach

foreach對RDD中的每個元素都應用f函數操作，不返回RDD和Array，而返回Uint；（遍歷）

scala源碼

def foreach(f: T => Unit): Unit = withScope {val cleanF = sc.clean(f)sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))}

saveAsTextFile

函數將數據輸出，存儲到HDFS的制定目錄下

saveAsObjectFile

將風趣中的每10個元素組成一個Array，然后將這個Array序列化，映射為Null，BytesWritable（Y）的元素，寫入HDFS為SequenceFile的格式；

map（x=>（NullWritable.get（），new BytesWritable（Utils.serialize（x））））

colloect

相當于toArray，collect將分布式的RDD返回為一個單機的scala Array數組，在這個數組上運用scala的函數式操作；

左側方框代表 RDD 分區，右側方框代表單機內存中的數組。通過函數操作，將結果返回到 Driver 程序所在的節點，以數組形式存儲。

collectAsMap

collectAsMap對（K，V）型的RDD數據返回一個單機HashMap；
對于重復K的RDD元素，后面的元素覆蓋前面的元素

lookup

Lookup函數對（Key，Value）型的RDD操作，返回指定Key對應的元素形成的Seq。這個函數處理優化的部分在于，如果這個RDD包含分區器，則只會對應處理K所在的分區，然后返回由（K，V）形成的Seq。如果RDD不包含分區器，則需要對全RDD元素進行暴力掃描處理，搜索指定K對應的元素。 ------- lookup（key：K）：Seq[V]

count

count(計數器)返回整個RDD的元素個數

defcount():Long=sc.runJob(this,Utils.getIteratorSize_).sum

top

top可返回最大的k個元素

top（num：Int）（implicit ord：Ordering[T]）：Array[T]

top返回最大的k個元素。
take返回最小的k個元素。
takeOrdered返回最小的k個元素，并且在返回的數組中保持元素的順序。
first相當于top（1）返回整個RDD中的前k個元素，可以定義排序的方式Ordering[T]，返回的是一個含前k個元素的數組.

reduce

reduce函數相當于對RDD中的元素進行reduceLeft函數的操作; ---- Some（iter.reduceLeft（cleanF））

reduceLeft先對兩個元素<K，V>進行reduce函數操作，然后將結果和迭代器取出的下一個元素<k，V>進行reduce函數操作，直到迭代器遍歷完所有元素，得到最后結果。在RDD中，先對每個分區中的所有元素<K，V>的集合分別進行reduceLeft。每個分區形成的結果相當于一個元素<K，V>，再對這個結果集合進行reduceleft操作；

fold

fold和reduce的原理相同，但是與reduce不同，相當于每個reduce時，迭代器取的第一個元素是zeroValue;

控制算子

控制算子有三種，cache,persist,checkpoint，以上算子都可以將RDD持久化，持久化的單位是partition。cache和persist都是懶執行的。必須有一個action類算子觸發執行。checkpoint算子不僅能將RDD持久化到磁盤，還能切斷RDD之間的依賴關系；

cache

cache默認將RDD的數據持久化到內存中，相當與persist（MEMORY_ONLY）函數的功能；

chche () = persist()=persist(StorageLevel.Memory_Only)

persist

可以指定持久話的級別，最常用的是MEMORY_ONLTY和MEMORY_AND_DISK；"_2"表示副本數；

持久化有如下級別：

cache和persist的注意事項

cache和persist都是懶執行，必須有一個action類算子觸發執行。
cache和persist算子的返回值可以賦值給一個變量，在其他job中直接使用這個變量就是使用持久化的數據了。持久化的單位是partition
cache和persist算子后不能立即緊跟action算子。

checkpoint

checkpoint將RDD持久化到磁盤，還可以切斷RDD之間的依賴關系

checkpoint 的執行原理：
- 當RDD的job執行完畢后，會從finalRDD從后往前回溯。
- 當回溯到某一個RDD調用了checkpoint方法，會對當前的RDD做一個標記。
- Spark框架會自動啟動一個新的job，重新計算這個RDD的數據，將數據持久化到HDFS上。
優化：對RDD執行checkpoint之前，最好對這個RDD先執行cache，這樣新啟動的job只需要將內存中的數據拷貝導HDFS上就可以，省去重新計算這一步；

例如：

SparkConf conf = new SparkConf();conf.setMaster("local").setAppName("checkpoint");JavaSparkContext sc = new JavaSparkContext(conf);sc.setCheckpointDir("./checkpoint");JavaRDD<Integer> parallelize = sc.parallelize(Arrays.asList(1,2,3));parallelize.checkpoint();parallelize.count();sc.stop();

總結

以上是生活随笔為你收集整理的Spark 常用算子详解（转换算子、行动算子、控制算子）的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：利用java输出一个月的日历表
下一篇：汇编跳转指令: JMP、JCXZ、JE

3atv精品不卡视频,97人人超碰国产精品最新,中文字幕av一区二区三区人妻少妇,久久久精品波多野结衣,日韩一区二区三区精品

编程问答

Spark 常用算子详解（转换算子、行动算子、控制算子）

Spark簡介

Spark流程

轉換算子（Transformations）

map

flatMap

mapPartitions

glom

union

cartesian

groupBy

distinct

subtract

sample

takeSample

mapValues

combineByKey

reduceByKey

partitionBy

cogroup

join

sortyByKey(sortBy)

Action算子（行動算子）

foreach

saveAsTextFile

saveAsObjectFile

colloect

collectAsMap

lookup

count

top

reduce

fold

控制算子

cache

persist

checkpoint

總結