spark-sql-perf

Published: 2024/3/13

Contents

Introduction
Testing
- tpcds-kit
- spark-sql-perf
  - Generating data
  - Running queries
  - Query results
TPC-DS
FAQ

Introduction

spark-sql-perf is a performance testing framework for Spark SQL that can be used to run benchmarks.

Test environment:

spark 2.4.0
spark-sql-perf_2.11-0.5.0-SNAPSHOT

Testing

tpcds-kit

Use tpcds-kit to generate the TPC-DS data.

    sudo yum install gcc make flex bison byacc git
    git clone https://github.com/databricks/tpcds-kit.git
    cd tpcds-kit/tools
    make OS=LINUX

spark-sql-perf

Build and package the project, then take the required jar (spark-sql-perf_2.11-0.5.0-SNAPSHOT.jar) from the target/scala-2.11 directory of the spark-sql-perf checkout.

    git clone https://github.com/databricks/spark-sql-perf.git
    cd spark-sql-perf
    sbt package

Start spark-shell:

    spark-shell \
      --conf spark.executor.instances=40 \
      --conf spark.executor.cores=3 \
      --conf spark.executor.memory=8g \
      --conf spark.executor.memoryOverhead=2g \
      --jars scala-logging-slf4j_2.11-2.1.2.jar,scala-logging-api_2.11-2.1.2.jar,spark-sql-perf_2.11-0.5.0-SNAPSHOT.jar

Generating data

tpcds-kit must be distributed to every Spark executor node beforehand, so that dsdgenDir exists on each of them.

    import com.databricks.spark.sql.perf.tpcds.TPCDSTables

    val rootDir = "hdfs://ns/user/admin/tpcds/data"
    val dsdgenDir = "/path/to/tpcds-kit/tools"
    val scaleFactor = "20"
    val format = "parquet"
    val databaseName = "tpcds"

    val sqlContext = spark.sqlContext
    val tables = new TPCDSTables(sqlContext,
      dsdgenDir = dsdgenDir,
      scaleFactor = scaleFactor,
      useDoubleForDecimal = true,
      useStringForDate = true)

    tables.genData(
      location = rootDir,
      format = format,
      overwrite = true,
      partitionTables = true,
      clusterByPartitionColumns = true,
      filterOutNullPartitionValues = false,
      tableFilter = "",
      numPartitions = 120)

    // Create temporary tables
    tables.createTemporaryTables(rootDir, format)

    // Alternatively, register the tables in the Hive metastore
    // sql(s"create database $databaseName")
    // tables.createExternalTables(rootDir, format, databaseName, overwrite = true, discoverPartitions = true)

Running queries

By default, runExperiment runs in a background thread and, when it finishes, saves the results as JSON in a timestamp-named subdirectory of resultLocation, for example $resultLocation/timestamp=1429213883272.

    import com.databricks.spark.sql.perf.tpcds.TPCDS

    val tpcds = new TPCDS(sqlContext)
    val databaseName = "tpcds"
    // Only valid if the database was created in the previous step
    // (the metastore registration is commented out there)
    sql(s"use $databaseName")

    val resultLocation = "hdfs://ns/user/admin/result"
    val iterations = 1
    val queries = tpcds.tpcds2_4Queries
    // Timeout passed to waitForFinish (the original describes it as a per-query timeout)
    val timeout = 300

    val experiment = tpcds.runExperiment(
      queries,
      iterations = iterations,
      resultLocation = resultLocation,
      forkThread = true)
    experiment.waitForFinish(timeout)
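When several runs share one resultLocation, it can help to turn the subdirectory name back into a readable time. The following is a minimal sketch, not part of spark-sql-perf: the helper name ResultDir.runInstant is hypothetical, and it assumes the number after timestamp= is epoch milliseconds, which the example value in the text suggests but does not state. It can be pasted into spark-shell or any Scala REPL.

```scala
import java.time.Instant

// Hypothetical helper (not part of spark-sql-perf): extract the run time
// from a result path, ASSUMING the number is epoch milliseconds.
object ResultDir {
  // Matches any path ending in timestamp=<digits>, with an optional trailing slash.
  private val TimestampDir = """.*timestamp=(\d{1,18})/?""".r

  def runInstant(path: String): Option[Instant] = path match {
    case TimestampDir(millis) => Some(Instant.ofEpochMilli(millis.toLong))
    case _                    => None
  }
}

// Usage with the example directory name from the text.
println(ResultDir.runInstant("hdfs://ns/user/admin/result/timestamp=1429213883272"))
```

Returning an Option keeps paths without a timestamp component from throwing, which is convenient when listing a directory that also holds unrelated files.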

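Query results

There are two ways to get the results. If the experiment is still available in the session, use experiment.getCurrentResults:

    import org.apache.spark.sql.functions.{col, substring}

    // Get the results from the experiment
    // (trailing dots keep the chained call valid when pasted into spark-shell)
    experiment.getCurrentResults.
      withColumn("Name", substring(col("name"), 2, 100)).
      withColumn("Runtime", (col("parsingTime") + col("analysisTime") + col("optimizationTime") + col("planningTime") + col("executionTime")) / 1000.0).
      select("Name", "Runtime")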
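The Runtime expression above adds five phase timings and divides the sum by 1000. The same arithmetic can be sketched in plain Scala. The PhaseTimes case class is only an illustrative stand-in for one result row, the millisecond unit is an assumption inferred from the division by 1000.0, and the sample numbers are made up.

```scala
// Illustrative stand-in for one result row. Field names follow the columns
// used in the snippet above; the unit (milliseconds) is an ASSUMPTION.
final case class PhaseTimes(
  parsingTime: Double,
  analysisTime: Double,
  optimizationTime: Double,
  planningTime: Double,
  executionTime: Double
) {
  // Sum of all phases, converted from milliseconds to seconds.
  def runtimeSeconds: Double =
    (parsingTime + analysisTime + optimizationTime + planningTime + executionTime) / 1000.0
}

// Made-up numbers, for illustration only.
val sample = PhaseTimes(12.0, 30.0, 45.0, 13.0, 2400.0)
println(sample.runtimeSeconds)
```

Including the parsing, analysis, optimization and planning phases means Runtime reflects the full cost of a query, not only its execution stage.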

If it has already been closed, read the result JSON files from resultLocation and parse them:

    // Read the results from files
    val result = spark.read.json(resultLocation)
    result.select("results.name", "results.executionTime").flatMap(r => {
      val name = r.getAs[Seq[String]]("name")
      val executionTime = r.getAs[Seq[Double]]("executionTime")
      name.zip(executionTime)
    }).toDF("name", "executionTime").show()
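The flatMap above works because each row of the result JSON carries two parallel arrays, one of query names and one of execution times, and zip pairs them element by element. Here is a minimal plain-Scala sketch of that step, with a sort added to list the slowest queries first. RunRow and flatten are hypothetical names, and the query names and timings are made up for illustration.

```scala
// Illustrative stand-in for one row of the result JSON: two parallel arrays.
final case class RunRow(name: Seq[String], executionTime: Seq[Double])

// Pair each query name with its execution time, across all rows.
def flatten(rows: Seq[RunRow]): Seq[(String, Double)] =
  rows.flatMap(r => r.name.zip(r.executionTime))

// Made-up values, for illustration only.
val rows = Seq(
  RunRow(Seq("q1", "q2"), Seq(1500.0, 900.0)),
  RunRow(Seq("q3"), Seq(4200.0))
)

// Slowest queries first.
val slowestFirst = flatten(rows).sortBy { case (_, time) => -time }
slowestFirst.foreach { case (name, time) => println(s"$name $time") }
```

On the real DataFrame the equivalent ordering would be an orderBy on the executionTime column before show().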

TPC-DS

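TPC-DS uses multi-dimensional data schemas such as star and snowflake. It contains 7 fact tables and 17 dimension tables, with an average of 18 columns per table. Its workload consists of 99 SQL queries covering the core parts of SQL99 and SQL 2003 as well as OLAP. The test set exercises complex applications over large data sets, including statistics, report generation, online queries and data mining, and the test data and values are skewed, as real data is. TPC-DS is therefore a test set that is very close to real scenarios, and also a fairly demanding one.

These characteristics closely match big data analysis and mining applications. Big data technologies such as Hadoop likewise perform large-scale analysis and deep mining on massive data sets and include interactive online queries and statistical reporting, while big data itself tends to be of lower quality, with a real, uneven distribution. For this reason TPC-DS is commonly regarded as the best test set for objectively comparing different Hadoop versions and SQL-on-Hadoop technologies. The benchmark has the following main characteristics:

- 99 test cases in total, following the SQL'99 and SQL 2003 syntax standards, with fairly complex SQL
- Large volumes of data to analyze, with test cases that answer real business questions
- Test cases covering a range of business models (such as analytical reporting, iterative online analysis, and data mining)
- High I/O load and CPU demand in almost all test cases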

FAQ

If executing new TPCDS fails with either of these errors:

    java.lang.ClassNotFoundException: com.typesafe.scalalogging.slf4j.LazyLogging
    java.lang.ClassNotFoundException: com.typesafe.scalalogging.Logging

add the corresponding jars when starting spark-shell:

    --jars /path/to/scala-logging-slf4j_2.11-2.1.2.jar,/path/to/scala-logging-api_2.11-2.1.2.jar
