spark partition
Starting Spark in HA mode
# Start Spark in HA mode: when the leader master goes down, the standby master becomes ALIVE
./bin/spark-shell --master spark://xupan001:7070,xupan002:7070
Specifying the number of partitions
# Specify two partitions: two tasks are generated, and two output files are written to HDFS
val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9), 2)
rdd1.partitions.length //2
#saveAsTextFile
rdd1.saveAsTextFile("hdfs://xupan001:8020/user/root/spark/output/partition")
| Permission | Owner | Group | Size | Replication | Block Size | Name |
| --- | --- | --- | --- | --- | --- | --- |
| -rw-r--r-- | root | supergroup | 0 B | 1 | 128 MB | _SUCCESS |
| -rw-r--r-- | root | supergroup | 8 B | 1 | 128 MB | part-00000 |
| -rw-r--r-- | root | supergroup | 10 B | 1 | 128 MB | part-00001 |
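The 8 B and 10 B file sizes line up with how the nine elements are sliced across the two partitions. As a quick check (glom gathers each partition into an array, so only use it on small data):

rdd1.glom().collect() // Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8, 9))
// part-00000 holds "1\n2\n3\n4\n" -> 8 bytes; part-00001 holds "5\n6\n7\n8\n9\n" -> 10 bytes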
Relationship to cores
If no partition count is specified, the number of output files depends on the cores, i.e. the total number of cores available to the application.
val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9))
rdd1.partitions.length //6
rdd1.saveAsTextFile("hdfs://xupan001:8020/user/root/spark/output/partition2")
| Permission | Owner | Group | Size | Replication | Block Size | Name |
| --- | --- | --- | --- | --- | --- | --- |
| -rw-r--r-- | root | supergroup | 0 B | 1 | 128 MB | _SUCCESS |
| -rw-r--r-- | root | supergroup | 2 B | 1 | 128 MB | part-00000 |
| -rw-r--r-- | root | supergroup | 4 B | 1 | 128 MB | part-00001 |
| -rw-r--r-- | root | supergroup | 2 B | 1 | 128 MB | part-00002 |
| -rw-r--r-- | root | supergroup | 4 B | 1 | 128 MB | part-00003 |
| -rw-r--r-- | root | supergroup | 2 B | 1 | 128 MB | part-00004 |
| -rw-r--r-- | root | supergroup | 4 B | 1 | 128 MB | part-00005 |
For reference, the cluster's basic configuration:
- URL: spark://xupan001:7070
- REST URL: spark://xupan001:6066 (cluster mode)
- Alive Workers: 3
- Cores in use: 6 Total, 6 Used
- Memory in use: 6.0 GB Total, 3.0 GB Used
- Applications: 1 Running, 5 Completed
- Drivers: 0 Running, 0 Completed
- Status: ALIVE
Workers
| Worker Id | Address | State | Cores | Memory |
| --- | --- | --- | --- | --- |
| worker-20171211031717-192.168.0.118-7071 | 192.168.0.118:7071 | ALIVE | 2 (2 Used) | 2.0 GB (1024.0 MB Used) |
| worker-20171211031718-192.168.0.119-7071 | 192.168.0.119:7071 | ALIVE | 2 (2 Used) | 2.0 GB (1024.0 MB Used) |
| worker-20171211031718-192.168.0.120-7071 | 192.168.0.120:7071 | ALIVE | 2 (2 Used) | 2.0 GB (1024.0 MB Used) |
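Those 6 partitions match the cluster's 6 total cores. A quick way to check, assuming spark.default.parallelism has not been set explicitly: parallelize without a slice count falls back to sc.defaultParallelism, which in standalone mode defaults to the total cores granted to the application.

sc.defaultParallelism // 6 here (3 workers x 2 cores each)
sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9)).partitions.length // 6, matching the six part-* files above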
======================================================
Relationship to HDFS file size
When a file is read from HDFS without specifying a partition count, the default is 2 partitions.
scala> val rdd = sc.textFile("hdfs://xupan001:8020/user/root/spark/input/zookeeper.out")
scala> rdd.partitions.length
res3: Int = 2
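The 2 comes from textFile's second argument, which defaults to sc.defaultMinPartitions, defined as min(defaultParallelism, 2). A quick check in spark-shell:

sc.defaultMinPartitions // 2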
If the HDFS file is large, the partition count follows the 128 MB block size: the file gets size / 128 MB partitions, plus one more partition if there is a remainder.
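As a rough sketch of that rule (an estimate only; Hadoop's split logic may fold a very small remainder into the last split instead of creating an extra one), assuming the default 128 MB block size:

val blockMB = 128.0
def estimatedPartitions(fileMB: Double): Int = math.max(1, math.ceil(fileMB / blockMB).toInt)
estimatedPartitions(284.35) // 3, e.g. for the 284.35 MB file in the summary below
estimatedPartitions(100.0)  // 1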
Summary: the tests above were run on Spark 2.2.0.
1. If an RDD is created by parallelizing a Scala collection on the driver and no partition count is specified, the number of partitions equals the total number of cores allocated to the application.
2. If the data is read from HDFS:
2.1 A single file smaller than 128 MB:
scala> val rdd = sc.textFile("hdfs://xupan001:8020/user/root/spark/input/zookeeper.out",1)
scala> rdd.partitions.length
res0: Int = 1
2.2 Multiple files, where one file is larger than a block:
| Permission | Owner | Group | Size | Replication | Block Size | Name |
| --- | --- | --- | --- | --- | --- | --- |
| -rw-r--r-- | root | supergroup | 4.9 KB | 1 | 128 MB | userLog.txt |
| -rw-r--r-- | root | supergroup | 284.35 MB | 1 | 128 MB | userLogBig.txt |
| -rw-r--r-- | root | supergroup | 51.83 KB | 1 | 128 MB | zookeeper.out |
scala> val rdd = sc.textFile("hdfs://xupan001:8020/user/root/spark/input")
rdd: org.apache.spark.rdd.RDD[String] = hdfs://xupan001:8020/user/root/spark/input MapPartitionsRDD[3] at textFile at <console>:24
?
scala> rdd.partitions.length
res1: Int = 5
userLogBig.txt spans 3 blocks (284.35 MB / 128 MB rounds up to 3); with one partition each for userLog.txt and zookeeper.out, that gives 3 + 1 + 1 = 5 partitions.
Reposted from: https://my.oschina.net/u/2253438/blog/1590655