Spark SQL Case Study in Practice (Part 4)
1. Obtaining the Data
This article walks through Spark SQL in detail, using the Git log of the Spark project on GitHub as its dataset.
The data-acquisition command is shown below.
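The exact command did not survive in the original post. The following is a plausible reconstruction, assuming the apache/spark repository has been cloned locally; the hyphenated message field in the sample record suggests git's %f (sanitized subject line) placeholder was used:

git clone https://github.com/apache/spark.git
cd spark
git log --pretty=format:'{"commit":"%H","author":"%an","author_email":"%ae","date":"%ad","message":"%f"}' > sparktest.json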
A formatted log record looks like this:
[root@master spark]# head -1 sparktest.json
{"commit":"30b706b7b36482921ec04145a0121ca147984fa8","author":"Josh Rosen","author_email":"joshrosen@databricks.com","date":"Fri Nov 6 18:17:34 2015 -0800","message":"SPARK-11389-CORE-Add-support-for-off-heap-memory-to-MemoryManager"}
Then upload sparktest.json to HDFS:
[root@master spark]# hadoop dfs -put sparktest.json /data/
2. Creating a DataFrame
Create a DataFrame from the data:
scala> val df = sqlContext.read.json("/data/sparktest.json")
16/02/05 09:59:56 INFO json.JSONRelation: Listing hdfs://ns1/data/sparktest.json on driver
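An aside, not part of the original walkthrough: this series targets Spark 1.x, where the SQL entry point is sqlContext. On Spark 2.x and later, the equivalent read goes through a SparkSession:

scala> val df = spark.read.json("/data/sparktest.json")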
Inspect its schema:
scala> df.printSchema()
root
 |-- author: string (nullable = true)
 |-- author_email: string (nullable = true)
 |-- commit: string (nullable = true)
 |-- date: string (nullable = true)
 |-- message: string (nullable = true)

3. DataFrame Methods in Practice
(1) Display the first two rows
scala> df.show(2)
+----------------+--------------------+--------------------+--------------------+--------------------+
|          author|        author_email|              commit|                date|             message|
+----------------+--------------------+--------------------+--------------------+--------------------+
|      Josh Rosen|joshrosen@databri...|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...|
|Michael Armbrust|michael@databrick...|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|
+----------------+--------------------+--------------------+--------------------+--------------------+
(2) Count the total number of commits
scala> df.count
res4: Long = 13507

The original post followed this with a screenshot of the commit count shown on the project's GitHub page; the two figures agree.
(3) Sort authors by commit count in descending order
scala> df.groupBy("author").count.sort($"count".desc).show
+--------------------+-----+
|              author|count|
+--------------------+-----+
|       Matei Zaharia| 1590|
|         Reynold Xin| 1071|
|     Patrick Wendell|  857|
|       Tathagata Das|  416|
|          Josh Rosen|  348|
|  Mosharaf Chowdhury|  290|
|           Andrew Or|  287|
|       Xiangrui Meng|  285|
|          Davies Liu|  281|
|          Ankur Dave|  265|
|          Cheng Lian|  251|
|    Michael Armbrust|  243|
|             zsxwing|  200|
|           Sean Owen|  197|
|     Prashant Sharma|  186|
|  Joseph E. Gonzalez|  185|
|            Yin Huai|  177|
|Shivaram Venkatar...|  173|
|      Aaron Davidson|  164|
|      Marcelo Vanzin|  142|
+--------------------+-----+
only showing top 20 rows
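The same method-chaining style extends naturally to filters. For instance, a hypothetical query (output omitted, not from the original post) counting only the commits whose sanitized message starts with a SPARK ticket reference:

scala> df.filter($"message".startsWith("SPARK-")).count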
4. Using the DataFrame as a Temporary Table
Register the DataFrame as a table with the following statement:
scala> df.registerTempTable("commitlog")
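To confirm the registration, you can list the temporary tables known to the SQLContext (a quick sanity check, not part of the original post); the returned array should include commitlog:

scala> sqlContext.tableNames()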
(1) Display the first two rows
scala> sqlContext.sql("SELECT * FROM commitlog").show(2) +----------------+--------------------+--------------------+--------------------+--------------------+ | author| author_email| commit| date| message| +----------------+--------------------+--------------------+--------------------+--------------------+ | Josh Rosen|joshrosen@databri...|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...| |Michael Armbrust|michael@databrick...|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...| +----------------+--------------------+--------------------+--------------------+--------------------+- 1
(2) Count the total number of commits
scala> sqlContext.sql("SELECT count(*) as TotalCommitNumber FROM commitlog").show +-----------------+ |TotalCommitNumber| +-----------------+ | 13507| +-----------------+- 1
- 2
(3) Sort authors by commit count in descending order
scala> sqlContext.sql("SELECT author,count(*) as CountNumber FROM commitlog GROUP BY author ORDER BY CountNumber DESC").show+--------------------+-----------+ | author|CountNumber| +--------------------+-----------+ | Matei Zaharia| 1590| | Reynold Xin| 1071| | Patrick Wendell| 857| | Tathagata Das| 416| | Josh Rosen| 348| | Mosharaf Chowdhury| 290| | Andrew Or| 287| | Xiangrui Meng| 285| | Davies Liu| 281| | Ankur Dave| 265| | Cheng Lian| 251| | Michael Armbrust| 243| | zsxwing| 200| | Sean Owen| 197| | Prashant Sharma| 186| | Joseph E. Gonzalez| 185| | Yin Huai| 177| |Shivaram Venkatar...| 173| | Aaron Davidson| 164| | Marcelo Vanzin| 142| +--------------------+-----------+更多復雜的玩法,大家可以自己去嘗試,這里給出的只是DataFrame方法與臨時表SQL語句的用法差異,以便于有整體的認知。