當前位置：首頁 > 运维知识 > 数据库 >内容正文

数据库

如何写一个包含多个事件四则运算的留存SQL ——impala hive

發布時間：2024/7/5 数据库 26 豆豆

生活随笔收集整理的這篇文章主要介紹了如何写一个包含多个事件四则运算的留存SQL ——impala hive 小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

在實現一個留存業務需求時，碰到了一個難題，我需要提供展示一個按照如下圖格式的數據，

day 1 ~ day n的第一行是留存用戶數量，第二行是一個由多個事件組合執行四則算術運算得到的復合數值，這里碰到的難點主要是第二行的計算，如果只想查看第二行的解決方法可以點擊這里

由于數據傳輸速率受限，我不能使用先查詢出所有數據然后在代碼里處理數據的方法，因此我需要在sql查詢中盡量完成所有聚合計算以減少查詢返回的行數

留存模型采用的是經典模型（Classic retention）留存用戶的數量都是在各天day n獨立計算的

這里day 1~day n第一行計算新用戶留存數量，第二行的小數計算留存的新用戶中某個混合事件的表現，混合事件可以是由某一事件計算得到的值或者由多個事件進行四則運算得到的組合事件的值，第二行的值計算是這篇文章要講的重點

例如求某個活動事件有兩個入口entrance_a和entrance_b，結束通關標識事件為event_over

假設事件名稱event_name為"activety_1"，用戶表為t_user，事件表為t_event
t_user表的數據是這樣的

t_event表的數據是這樣的

1

計算第一行, 也就是計算經典留存模型day 1~day n的留存用戶量，可以用t_user和t_event的join和case when語句實現：

with temp_user as ( select distinct `uid`, to_date(`firstday`) as `firstday` from `t_user` where firstday >= "2022-02-01" and firstday < "2022-02-04" -- 02-01 ~ 02-03的新用戶 ), temp_event as ( select distinct`uid`,to_date(`event_date`) as `event_date` from `t_event` where event_name = "activity_1" and event_date >= "2022-02-01" and event_date < "2022-02-07" -- 02-03往后推3天-->day 3是02-06 ) select `firstday`, Count(distinct a.uid) as `new_user`, Count(distinct case when event_date = date_add(firstday, 1) then a.uid end) as `day 1`, Count(distinct case when event_date = date_add(firstday, 2) then a.uid end) as `day 2`, Count(distinct case when event_date = date_add(firstday, 3) then a.uid end) as `day 3` from temp_user a left join temp_event b on a.uid = b.uid group by firstday

用以上sql語句查詢得到的結果：

格式跟開頭的圖中的表格保持了一致，數值稍微驗證一下可知沒有問題，求第一行的留存用戶數相對較簡單

2

計算第二行的值，我想計算事件activity_1在day 1 ~ day n的通關表現，具體來說就是要計算出day 1~day n各留存用戶的(entrance_a + entrance_b)/event_over值

現在假設事件表t_event_1的內容是這樣的

用戶表t_user同之前的不變

要計算出day 1~day n各留存用戶的(entrance_a + entrance_b)/event_over的表現，結果展示格式類似下面這張圖

除了要按day 1 ~ day n展示數據外還涉及到屬性entrance_a, entrance_b等的聚合計算，使用分析函數（Analytics Function）并不能降低得到的行數，這里我采用了先把要統計的數據（entrance_a, entrance_b, event_over）先分別計算出來按firstday和event_date分組成行，然后再利用case when和求和語句把算出來的結果聚合到對應的day n，寫出來的sql如下

-- 查詢新用戶 with temp_user as ( select distinct `uid`, to_date(`firstday`) as `firstday` from `t_user` where firstday >= "2022-02-01" and firstday < "2022-02-04" -- 02-01 ~ 02-03的新用戶 ), -- 按firstday和event_date分組統計各個要查詢的值 temp_event_1 as ( select a.uid,to_date(a.firstday) `firstday`,to_date(b.event_date) `event_date`,count(case when b.entrance = "a" then 1 end) as entrance_a,count(case when b.entrance = "b" then 1 end) as entrance_b,count(case when b.event_status = "event_over" then 1 end) as event_over from t_user a left join t_event_1 b on a.uid = b.uidand to_date(b.event_date) between date_add(a.firstday, 1) and date_add(a.firstday, 3)and (b.entrance in ("a", "b") OR b.event_status = "event_over") -- 注意不同的列必須要用OR分開查！因為在同一張表里面的限定了event_over就會有一部分的entrance事件查不到！and b.event_date >= "2022-02-01" and b.event_date < "2022-02-07" -- 02-03往后推3天 --> 02-06 group by firstday, event_date, a.uid ) -- 行轉列減少返回的數據行數 select evt.firstday,count(distinct evt.uid) `new user`,(sum(case when date_add(evt.firstday, 1) = evt.event_date then evt.entrance_a else 0 end ) + sum(case when date_add(evt.firstday, 1) = evt.event_date then evt.entrance_b else 0 end )) / sum(case when date_add(evt.firstday, 1) = evt.event_date then evt.event_over else 0 end ) `day 1`,(sum(case when date_add(evt.firstday, 2) = evt.event_date then evt.entrance_a else 0 end ) + sum(case when date_add(evt.firstday, 2) = evt.event_date then evt.entrance_b else 0 end )) / sum(case when date_add(evt.firstday, 2) = evt.event_date then evt.event_over else 0 end ) `day 2`,(sum(case when date_add(evt.firstday, 3) = evt.event_date then evt.entrance_a else 0 end ) + sum(case when date_add(evt.firstday, 3) = evt.event_date then evt.entrance_b else 0 end )) / sum(case when date_add(evt.firstday, 3) = evt.event_date then evt.event_over else 0 end ) `day 3` from temp_event_1 evt group by firstday

查詢的結果：

這里出現的null如果你想在sql中把它們變成0也可以使用zeroifnull()函數將每一列計算結果包住

驗證正確性
根據用戶表t_user和事件表t_event_1對比查詢結果的正確性：
用戶表

先看firstday在02-01的用戶，只有0和1兩個，對應的day 1就是他們在02-02的(entrance_a+entrance_b)/event_over表現值：

再檢查02-02的用戶2和3在day 1, day 2, day3的表現數值，也就是他們分別再02-03, 02-04和02-05的表現數值：

根據驗證結果查詢是沒有問題的

day 1 ~ day n第二行的解決sql相對來說比較復雜，但是我目前沒有想到更好的sql，如果有更好的方法查出結果，歡迎評論告訴我，感謝~

以上的模擬數據，都可以在這個鏈接(https://demo.gethue.com/hue/home)找到，賬號密碼登錄的時候都是demo，找到Hive下的數據庫選擇default子庫

Editor選擇Hive就可以查詢了

你也可以選擇自己創建新表和插入數據模擬，以下是我的建表和插入數據的sql：

-- 用戶表創建和值插入 CREATE TABLE `t_user`(`uid` int, `firstday` timestamp) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION'hdfs://namenode:8020/user/hive/warehouse/t_user';INSERT INTO t_user (uid, firstday) VALUES(0, '2022-02-01 00:01:00'),(1, '2022-02-01 00:04:30'),(2, '2022-02-02 10:00:00'),(3, '2022-02-02 14:30:00') ; -- 事件表t_event創建和插入值 CREATE TABLE `t_event`(`uid` int,`event_name` STRING,`event_date` TIMESTAMP) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION'hdfs://namenode:8020/user/hive/warehouse/t_event';INSERT INTO t_event (uid, event_name, event_date) VALUES(1, "activity_1", '2022-02-01 00:01:00'),(1, "activity_1", '2022-02-01 00:04:30'),(1, "activity_1", '2022-02-02 10:00:00'),(1, "activity_1", '2022-02-02 14:30:00'),(2, "activity_1", '2022-02-02 00:04:30'),(2, "activity_1", '2022-02-03 10:00:00'),(2, "activity_1", '2022-02-04 14:30:00'),(3, "activity_1", '2022-02-05 09:00:00'),(3, "activity_1", '2022-02-05 12:30:00'),(0, "activity_1", '2022-02-06 14:30:00') ;-- 用戶表t_event_1創建和值插入 CREATE TABLE `t_event_1`(`uid` int,`entrance` string,`event_status` STRING,`event_name` STRING,`event_date` TIMESTAMP) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION'hdfs://namenode:8020/user/hive/warehouse/t_event_1';INSERT INTO `t_event_1` (uid, entrance, event_status, event_name, event_date) VALUES(1, "a", "event_over", "activity_1", '2022-02-01 00:01:00'),(1, "b", "event_over", "activity_1", '2022-02-01 00:04:30'),(1, "a", "event_failed", "activity_1", '2022-02-02 10:00:00'),(1, "b", "event_failed", "activity_1", '2022-02-02 14:30:00'),(1, "a", "event_over", "activity_1", '2022-02-06 14:30:00'),(2, "a", "event_over", "activity_1", '2022-02-02 14:30:00'), (2, "a", "event_failed", "activity_1", '2022-02-02 00:04:30'),(2, "b", "event_over", "activity_1", '2022-02-03 10:00:00'),(2, "a", "event_over", "activity_1", '2022-02-04 14:30:00'),(2, "b", "event_failed", "activity_1", '2022-02-05 16:30:00'),(3, "a", "event_over", "activity_1", '2022-02-05 09:00:00'),(3, "b", "event_over", "activity_1", '2022-02-02 00:04:30'),(3, "b", "event_over", "activity_1", '2022-02-05 12:30:00'),(0, "a", "event_over", "activity_1", '2022-02-06 14:30:00'),(0, "a", "event_over", "activity_1", '2022-02-02 14:30:00'), (0, "b", "event_over", "activity_1", '2022-02-02 00:04:30') ;

注意你登錄的session只會保持一段時間，大概十幾分鐘或更多？，一般你可以通過re-create session來重新打開并繼續執行sql查詢，如果re-create沒有效果就只能重新登錄了

總結

以上是生活随笔為你收集整理的如何写一个包含多个事件四则运算的留存SQL ——impala hive的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇： matlab画线不同颜色_怎样画线框图才
下一篇： .net与mysql,ASP.NET与M

3atv精品不卡视频,97人人超碰国产精品最新,中文字幕av一区二区三区人妻少妇,久久久精品波多野结衣,日韩一区二区三区精品

数据库

如何写一个包含多个事件四则运算的留存SQL ——impala hive

1

2

總結