當前位置：首頁 > 运维知识 > 数据库 >内容正文

数据库

Spark SQL 函数全集

發(fā)布時間：2025/3/21 数据库 24 豆豆

生活随笔收集整理的這篇文章主要介紹了 Spark SQL 函数全集小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

org.apache.spark.sql.functions是一個Object，提供了約兩百多個函數(shù)。

大部分函數(shù)與Hive的差不多。

除UDF函數(shù)，均可在spark-sql中直接使用。

經(jīng)過import org.apache.spark.sql.functions._ ，也可以用于Dataframe，Dataset。

version
2.3.0

大部分支持Column的函數(shù)也支持String類型的列名。這些函數(shù)的返回類型基本都是Column。

函數(shù)很多，都在下面了。

聚合函數(shù)

approx_count_distinct count_distinct近似值avg 平均值collect_list 聚合指定字段的值到listcollect_set 聚合指定字段的值到setcorr 計算兩列的Pearson相關(guān)系數(shù)count 計數(shù)countDistinct 去重計數(shù) SQL中用法 select count(distinct class)covar_pop 總體協(xié)方差（population covariance）covar_samp 樣本協(xié)方差（sample covariance）first 分組第一個元素last 分組最后一個元素groupinggrouping_idkurtosis計算峰態(tài)(kurtosis)值skewness計算偏度(skewness)max 最大值min 最小值mean 平均值stddev即stddev_sampstddev_samp樣本標準偏差（sample standard deviation）stddev_pop 總體標準偏差（population standard deviation）sum 求和sumDistinct 非重復值求和 SQL中用法 select sum(distinct class)var_pop 總體方差（population variance）var_samp 樣本無偏方差（unbiased variance）variance 即var_samp

集合函數(shù)

array_contains(column,value) 檢查array類型字段是否包含指定元素explode展開array或map為多行explode_outer 同explode，但當array或map為空或null時，會展開為null。posexplode 同explode，帶位置索引。posexplode_outer 同explode_outer，帶位置索引。from_json 解析JSON字符串為StructType or ArrayType，有多種參數(shù)形式，詳見文檔。to_json 轉(zhuǎn)為json字符串，支持StructType, ArrayType of StructTypes, a MapType or ArrayType of MapTypes。get_json_object(column,path) 獲取指定json路徑的json對象字符串。 select get_json_object('{"a"1,"b":2}','$.a'); [JSON Path介紹](http://blog.csdn.net/koflance/article/details/63262484) json_tuple(column,fields) 獲取json中指定字段值。select json_tuple('{"a":1,"b":2}','a','b');map_keys 返回map的鍵組成的arraymap_values 返回map的值組成的arraysize array or map的長度sort_array(e: Column, asc: Boolean) 將array中元素排序（自然排序），默認asc。

時間函數(shù)

add_months(startDate: Column, numMonths: Int) 指定日期添加n月date_add(start: Column, days: Int) 指定日期之后n天 e.g. select date_add('2018-01-01',3)date_sub(start: Column, days: Int) 指定日期之前n天datediff(end: Column, start: Column) 兩日期間隔天數(shù)current_date() 當前日期current_timestamp() 當前時間戳，TimestampType類型date_format(dateExpr: Column, format: String) 日期格式化dayofmonth(e: Column) 日期在一月中的天數(shù)，支持 date/timestamp/stringdayofyear(e: Column) 日期在一年中的天數(shù)，支持 date/timestamp/stringweekofyear(e: Column) 日期在一年中的周數(shù)，支持 date/timestamp/stringfrom_unixtime(ut: Column, f: String) 時間戳轉(zhuǎn)字符串格式from_utc_timestamp(ts: Column, tz: String) 時間戳轉(zhuǎn)指定時區(qū)時間戳to_utc_timestamp(ts: Column, tz: String) 指定時區(qū)時間戳轉(zhuǎn)UTF時間戳hour(e: Column) 提取小時值minute(e: Column) 提取分鐘值month(e: Column) 提取月份值quarter(e: Column) 提取季度second(e: Column) 提取秒year(e: Column):提取年last_day(e: Column) 指定日期的月末日期months_between(date1: Column, date2: Column) 計算兩日期差幾個月next_day(date: Column, dayOfWeek: String) 計算指定日期之后的下一個周一、二...，dayOfWeek區(qū)分大小寫，只接受 "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"。to_date(e: Column) 字段類型轉(zhuǎn)為DateTypetrunc(date: Column, format: String) 日期截斷unix_timestamp(s: Column, p: String) 指定格式的時間字符串轉(zhuǎn)時間戳unix_timestamp(s: Column) 同上，默認格式為 yyyy-MM-dd HH:mm:ssunix_timestamp():當前時間戳(秒),底層實現(xiàn)為unix_timestamp(current_timestamp(), yyyy-MM-dd HH:mm:ss)window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String) 時間窗口函數(shù)，將指定時間(TimestampType)劃分到窗口

數(shù)學函數(shù)

cos,sin,tan 計算角度的余弦，正弦。。。sinh,tanh,cosh 計算雙曲正弦，正切，。。acos,asin,atan,atan2 計算余弦/正弦值對應(yīng)的角度bin 將long類型轉(zhuǎn)為對應(yīng)二進制數(shù)值的字符串For example, bin("12") returns "1100".bround 舍入，使用Decimal的HALF_EVEN模式，v>0.5向上舍入，v< 0.5向下舍入，v0.5向最近的偶數(shù)舍入。round(e: Column, scale: Int) HALF_UP模式舍入到scale為小數(shù)點。v>=0.5向上舍入，v< 0.5向下舍入,即四舍五入。ceil 向上舍入floor 向下舍入cbrt Computes the cube-root of the given value.conv(num:Column, fromBase: Int, toBase: Int)轉(zhuǎn)換數(shù)值（字符串）的進制log(base: Double, a: Column):$log_{base}(a)$log(a: Column):$log_e(a)$log10(a: Column):$log_{10}(a)$log2(a: Column):$log_{2}(a)$log1p(a: Column):$log_{e}(a+1)$pmod(dividend: Column, divisor: Column):Returns the positive value of dividend mod divisor.pow(l: Double, r: Column):$r^l$ 注意r是列pow(l: Column, r: Double):$r^l$ 注意l是列pow(l: Column, r: Column):$r^l$ 注意r,l都是列radians(e: Column):角度轉(zhuǎn)弧度rint(e: Column):Returns the double value that is closest in value to the argument and is equal to a mathematical integer.shiftLeft(e: Column, numBits: Int):向左位移shiftRight(e: Column, numBits: Int):向右位移shiftRightUnsigned(e: Column, numBits: Int):向右位移（無符號位）signum(e: Column):返回數(shù)值正負符號sqrt(e: Column):平方根hex(column: Column):轉(zhuǎn)十六進制unhex(column: Column):逆轉(zhuǎn)十六進制

混雜(misc)函數(shù)

crc32(e: Column):計算CRC32,返回biginthash(cols: Column*):計算 hash code，返回intmd5(e: Column):計算MD5摘要，返回32位，16進制字符串sha1(e: Column):計算SHA-1摘要，返回40位，16進制字符串sha2(e: Column, numBits: Int):計算SHA-1摘要，返回numBits位，16進制字符串。numBits支持224, 256, 384, or 512.

其他非聚合函數(shù)

abs(e: Column) 絕對值array(cols: Column*) 多列合并為array，cols必須為同類型map(cols: Column*): 將多列組織為map，輸入列必須為（key,value)形式，各列的key/value分別為同一類型。bitwiseNOT(e: Column): Computes bitwise NOT.broadcast[T](df: Dataset[T]): Dataset[T]: 將df變量廣播，用于實現(xiàn)broadcast join。如left.join(broadcast(right), "joinKey")coalesce(e: Column*): 返回第一個非空值col(colName: String): 返回colName對應(yīng)的Columncolumn(colName: String): col函數(shù)的別名expr(expr: String): 解析expr表達式，將返回值存于Column，并返回這個Column。greatest(exprs: Column*): 返回多列中的最大值，跳過Nullleast(exprs: Column*): 返回多列中的最小值，跳過Nullinput_file_name():返回當前任務(wù)的文件名？？isnan(e: Column): 檢查是否NaN（非數(shù)值）isnull(e: Column): 檢查是否為Nulllit(literal: Any): 將字面量(literal)創(chuàng)建一個ColumntypedLit[T](literal: T)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): 將字面量(literal)創(chuàng)建一個Column，literal支持 scala types e.g.: List, Seq and Map.monotonically_increasing_id(): 返回單調(diào)遞增唯一ID，但不同分區(qū)的ID不連續(xù)。ID為64位整型。nanvl(col1: Column, col2: Column): col1為NaN則返回col2negate(e: Column): 負數(shù)，同df.select( -df("amount") )not(e: Column): 取反，同df.filter( !df("isActive") )rand(): 隨機數(shù)[0.0, 1.0]rand(seed: Long): 隨機數(shù)[0.0, 1.0]，使用seed種子randn(): 隨機數(shù)，從正態(tài)分布取randn(seed: Long): 同上spark_partition_id(): 返回partition IDstruct(cols: Column*): 多列組合成新的struct column ？？when(condition: Column, value: Any): 當condition為true返回value，如 people.select(when(people("gender") === "male", 0).when(people("gender") === "female", 1).otherwise(2)) 如果沒有otherwise且condition全部沒命中，則返回null.

排序函數(shù)

asc(columnName: String):正序asc_nulls_first(columnName: String):正序，null排最前asc_nulls_last(columnName: String):正序，null排最后e.g. df.sort(asc("dept"), desc("age"))對應(yīng)有desc函數(shù)desc,desc_nulls_first,desc_nulls_last

字符串函數(shù)

ascii(e: Column): 計算第一個字符的ascii碼base64(e: Column): base64轉(zhuǎn)碼unbase64(e: Column): base64解碼concat(exprs: Column*):連接多列字符串concat_ws(sep: String, exprs: Column*):使用sep作為分隔符連接多列字符串decode(value: Column, charset: String): 解碼encode(value: Column, charset: String): 轉(zhuǎn)碼，charset支持 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'。format_number(x: Column, d: Int):格式化'#,###,###.##'形式的字符串format_string(format: String, arguments: Column*): 將arguments按format格式化，格式為printf-style。initcap(e: Column): 單詞首字母大寫lower(e: Column): 轉(zhuǎn)小寫upper(e: Column): 轉(zhuǎn)大寫instr(str: Column, substring: String): substring在str中第一次出現(xiàn)的位置length(e: Column): 字符串長度levenshtein(l: Column, r: Column): 計算兩個字符串之間的編輯距離（Levenshtein distance）locate(substr: String, str: Column): substring在str中第一次出現(xiàn)的位置，位置編號從1開始，0表示未找到。locate(substr: String, str: Column, pos: Int): 同上，但從pos位置后查找。lpad(str: Column, len: Int, pad: String):字符串左填充。用pad字符填充str的字符串至len長度。有對應(yīng)的rpad，右填充。ltrim(e: Column):剪掉左邊的空格、空白字符，對應(yīng)有rtrim.ltrim(e: Column, trimString: String):剪掉左邊的指定字符,對應(yīng)有rtrim.trim(e: Column, trimString: String):剪掉左右兩邊的指定字符trim(e: Column):剪掉左右兩邊的空格、空白字符regexp_extract(e: Column, exp: String, groupIdx: Int): 正則提取匹配的組regexp_replace(e: Column, pattern: Column, replacement: Column): 正則替換匹配的部分，這里參數(shù)為列。regexp_replace(e: Column, pattern: String, replacement: String): 正則替換匹配的部分repeat(str: Column, n: Int):將str重復n次返回reverse(str: Column): 將str反轉(zhuǎn)soundex(e: Column): 計算桑迪克斯代碼（soundex code）PS:用于按英語發(fā)音來索引姓名,發(fā)音相同但拼寫不同的單詞，會映射成同一個碼。split(str: Column, pattern: String): 用pattern分割strsubstring(str: Column, pos: Int, len: Int): 在str上截取從pos位置開始長度為len的子字符串。substring_index(str: Column, delim: String, count: Int):Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything the left of the final delimiter (counting from left) is returned. If count is negative, every to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.translate(src: Column, matchingString: String, replaceString: String):把src中的matchingString全換成replaceString。

UDF函數(shù)

user-defined function.callUDF(udfName: String, cols: Column*): 調(diào)用UDF import org.apache.spark.sql._val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value") val spark = df.sparkSession spark.udf.register("simpleUDF", (v: Int) => v * v) df.select($"id", callUDF("simpleUDF", $"value"))udf: 定義UDF

窗口函數(shù)

cume_dist(): cumulative distribution of values within a window partitioncurrentRow(): returns the special frame boundary that represents the current row in the window partition.rank():排名，返回數(shù)據(jù)項在分組中的排名，排名相等會在名次中留下空位 1,2,2,4。dense_rank(): 排名，返回數(shù)據(jù)項在分組中的排名，排名相等會在名次中不會留下空位 1,2,2,3。row_number():行號，為每條記錄返回一個數(shù)字 1,2,3,4percent_rank():returns the relative rank (i.e. percentile) of rows within a window partition.lag(e: Column, offset: Int, defaultValue: Any): offset rows before the current rowlead(e: Column, offset: Int, defaultValue: Any): returns the value that is offset rows after the current rowntile(n: Int): returns the ntile group id (from 1 to n inclusive) in an ordered window partition.unboundedFollowing():returns the special frame boundary that represents the last row in the window partition.

轉(zhuǎn)載于:https://www.cnblogs.com/itboys/p/9818836.html

總結(jié)

以上是生活随笔為你收集整理的Spark SQL 函数全集的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇： Spring MVC能响应HTTP请求的
下一篇：第二次做HDOJ 1051