Introduction to Spark Streaming, DStream, and DStream Operations (from learning materials)
I. Introduction to Spark Streaming
1. Spark Streaming Overview
1.1. What is Spark Streaming
Spark Streaming is similar to Apache Storm and is used for processing streaming data. According to the official documentation, Spark Streaming offers high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data has been ingested, it can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be saved to many destinations, such as HDFS or a database. In addition, Spark Streaming integrates seamlessly with MLlib (machine learning) and GraphX.
1.2. Why Learn Spark Streaming
1. Ease of use
2. Fault tolerance
3. Easy integration into the Spark ecosystem
1.3. Spark vs. Storm
| Spark | Storm |
| --- | --- |
| Development language: Scala | Development language: Clojure |
| Programming model: DStream | Programming model: Spout/Bolt |
II. DStream
1. What is a DStream
Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data, either the input stream or the result stream produced by applying Spark primitives. Internally, a DStream is represented as a series of consecutive RDDs, each of which contains the data from one time interval.
Operations on the data are carried out one RDD at a time.
The computation itself is performed by the Spark engine.
2. DStream Operations
The primitives on a DStream are similar to those on an RDD and fall into two categories: Transformations and Output Operations. Among the transformations there are also a few special primitives, such as updateStateByKey(), transform(), and the various window-related primitives.
2.1. Transformations on DStreams
| Transformation | Meaning |
| --- | --- |
| map(func) | Return a new DStream by passing each element of the source DStream through a function func. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
| filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
| repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. |
| union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream. |
| count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream. |
| reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel. |
| countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: by default, this uses Spark's default number of parallel tasks (2 for local mode; in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| cogroup(otherStream, [numTasks]) | When called on DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
| transform(func) | Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
| updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function to the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |
Special Transformations
1. UpdateStateByKey Operation
The updateStateByKey primitive is used to accumulate state across batches; the Word Count example earlier in the material relies on this feature. Without updateStateByKey, the result of each batch is output and then discarded, so nothing carries over to the next batch. See the sketch below.
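A minimal sketch of a stateful word count with updateStateByKey, assuming the `pairs` DStream of (word, 1) tuples and the `ssc` context from the word count sketch above; the checkpoint path is a placeholder. Stateful operations require a checkpoint directory to be set.

```scala
// Checkpointing is mandatory for stateful transformations; the path is illustrative.
ssc.checkpoint("hdfs://namenode:8020/checkpoints")

// newValues: this key's counts in the current batch; runningCount: the accumulated state.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))

// Running total per word across all batches seen so far.
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
runningCounts.print()
```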
2. Transform Operation
The transform primitive allows an arbitrary RDD-to-RDD function to be applied to a DStream, which makes it easy to extend the Spark API. It is also the mechanism through which MLlib (machine learning) and GraphX are combined with Spark Streaming.
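A sketch of transform, again assuming the `pairs` DStream from above. The spam-word list and the filtering logic are hypothetical, but joining each batch's RDD with a precomputed RDD is a typical use of this primitive.

```scala
// A precomputed RDD of words to drop; contents are hypothetical.
val spamRDD = ssc.sparkContext.parallelize(Seq(("spamWord", true)))

val cleaned = pairs.transform { rdd =>
  // Arbitrary RDD-to-RDD code runs here, once per batch.
  rdd.leftOuterJoin(spamRDD)
     .filter { case (_, (_, isSpam)) => isSpam.isEmpty } // keep words not in the spam list
     .map { case (word, (count, _)) => (word, count) }
}
cleaned.print()
```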
3. Window Operations
Window Operations are somewhat similar to state in Storm: by setting a window length and a sliding interval, you can dynamically observe the running state of the stream over that window.
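A sketch of a windowed aggregation with reduceByKeyAndWindow, once more assuming the `pairs` DStream. The 30-second window and 10-second slide are illustrative; both durations must be multiples of the batch interval (5 seconds in the earlier sketch).

```scala
import org.apache.spark.streaming.Seconds

// Word counts over the last 30 seconds of data, recomputed every 10 seconds.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // reduce function
  Seconds(30),               // window length
  Seconds(10)                // sliding interval
)
windowedCounts.print()
```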
2.2. Output Operations on DStreams
Output Operations push a DStream's data to an external system such as a database or a file system. Only when an output operation primitive is invoked (analogous to an RDD action) does the streaming program actually start the computation.
| Output Operation | Meaning |
| --- | --- |
| print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. |
| saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
| saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
| saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
| foreachRDD(func) | The most generic output operator, which applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
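A common foreachRDD sketch, assuming the `wordCounts` DStream from earlier. The helpers `createConnection` and `send` are hypothetical stand-ins for your own client code; the point of the pattern is that connections are created inside foreachPartition, so they are opened on the executors rather than serialized from the driver.

```scala
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // One connection per partition, created on the executor.
    val connection = createConnection() // hypothetical: open a DB/network client
    partition.foreach(record => send(connection, record)) // hypothetical write call
    connection.close()
  }
}
```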