Taobao Double 11 Big Data Analysis (Hive Analysis, Part 1)
Contents
- Preface
- Current environment
- Use the aggregate function count() to get the number of rows in the table
- Count rows with distinct user_id
- Count fully distinct rows (to rule out customers' fake orders)
- Keyword-conditioned query analysis
- How many users purchased items on Double 11
- Number of purchases of a given brand that day
- Analysis based on user behavior
- Purchase or browse ratio of an item on a given day
- Ratio of male to female buyers on Double 11
- Given a purchase-count range, find the ids of users who bought that many items on a given day
- Real-time user query analysis
Preface
Before reading this article, first read the previous installment: Taobao Double 11 Big Data Analysis (Data Preparation).
Current environment
Use jps to check the currently running services (Master and Worker belong to Spark and are unrelated to this article):
[root@centos2020 dataset]# jps
11408 Master
12707 RunJar
7876 NameNode
8183 ResourceManager
7930 DataNode
8477 NodeManager
11550 Worker
12990 Jps
In addition, the MySQL and Hive services must be started before following along (for the startup steps, see Taobao Double 11 Big Data Analysis: Data Preparation).
Use the aggregate function count() to get the number of rows in the table
hive> select count(*) from user_log;
Execution log and result (for brevity, later queries show only the result, not the full log):
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200226001346_be1efc88-10bf-4030-9509-62de994fb0d2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1582623436135_0001, Tracking URL = http://centos2020:8088/proxy/application_1582623436135_0001/
Kill Command = /usr/hadoop/hadoop-2.7.7/bin/hadoop job -kill job_1582623436135_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-02-26 00:14:53,555 Stage-1 map = 0%, reduce = 0%
2020-02-26 00:15:16,085 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.0 sec
2020-02-26 00:15:39,527 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.35 sec
MapReduce Total cumulative CPU time: 6 seconds 350 msec
Ended Job = job_1582623436135_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.35 sec HDFS Read: 482200 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 350 msec
OK
10000
Time taken: 115.187 seconds, Fetched: 1 row(s)
The result is 10000, which is correct: in the previous article, exactly 10,000 rows were loaded into this table.
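The same kind of count can be sanity-checked outside Hive. The sketch below is illustrative only: it assumes a tab-separated text dump like the one loaded in the data-preparation article, and the three inline rows are invented, not taken from the real 10,000-row dataset.

```python
def count_rows(text: str) -> int:
    """Equivalent of SELECT count(*) FROM user_log; over a tab-separated dump."""
    return len(text.splitlines())

# Invented sample rows in the table's column order
# (user_id, item_id, cat_id, merchant_id, brand_id, month, day, action, gender).
sample = (
    "328862\t323294\t833\t2882\t2661\t08\t29\t0\t1\n"
    "328862\t844400\t1271\t2882\t2661\t08\t29\t0\t1\n"
    "234512\t575153\t821\t3593\t1253\t11\t11\t2\t0"
)

print(count_rows(sample))  # 3 for this toy sample (10000 for the real table)
```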
Count rows with distinct user_id
hive> select count(distinct user_id) from user_log;
Result:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.41 sec HDFS Read: 482582 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 410 msec
OK
358
Time taken: 45.244 seconds, Fetched: 1 row(s)
Count fully distinct rows (to rule out customers' fake orders)
hive> select count(*) from (select user_id,item_id,cat_id,merchant_id,brand_id,month,day,action from user_log group by user_id,item_id,cat_id,merchant_id,brand_id,month,day,action having count(*)=1)a;
Result:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.93 sec HDFS Read: 485070 HDFS Write: 116 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.33 sec HDFS Read: 5058 HDFS Write: 104 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 260 msec
OK
4754
Time taken: 89.299 seconds, Fetched: 1 row(s)
Note that the subquery in the statement above must be given an alias (here, a); otherwise Hive reports an error:
hive> select count(*) from (select user_id,item_id,cat_id,merchant_id,brand_id,month,day,action from user_log group by user_id,item_id,cat_id,merchant_id,brand_id,month,day,action having count(*)=1);
NoViableAltException(256@[])
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.atomjoinSource(HiveParser_FromClauseParser.java:2265)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:2475)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:1690)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1312)
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:208)
	... (remaining stack frames omitted for brevity)
FAILED: ParseException line 1:22 cannot recognize input near '(' 'select' 'user_id' in joinSource
Keyword-conditioned query analysis
How many users purchased items on Double 11
hive> select count(distinct user_id) from user_log where action='2';
Result:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.76 sec HDFS Read: 483409 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 760 msec
OK
358
Time taken: 49.434 seconds, Fetched: 1 row(s)
Number of purchases of a given brand that day
hive> select count(*) from user_log where action='2' and brand_id=2661;
Result:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.46 sec HDFS Read: 483258 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 460 msec
OK
3
Time taken: 52.679 seconds, Fetched: 1 row(s)
Analysis based on user behavior
Purchase or browse ratio of an item on a given day
First count how many users made a purchase that day, which is the same statement as in the keyword-conditioned section above:
select count(distinct user_id) from user_log where action='2';
The result is 358.
Then count how many users clicked on items, i.e. the count of rows with distinct user_id:
select count(distinct user_id) from user_log;
The result is also 358.
In other words, the ratio is 358/358 = 1.
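The two distinct-user counts behind this ratio can be sketched in Python. The four rows below are invented toy data (action '2' marks a purchase, matching the HiveQL filters above); this is a sketch of the logic, not the real dataset.

```python
# Invented toy log; action '2' = purchase, as in the queries above.
toy_log = [
    {"user_id": 1, "action": "2"},
    {"user_id": 1, "action": "0"},
    {"user_id": 2, "action": "0"},
    {"user_id": 3, "action": "2"},
]

# SELECT count(DISTINCT user_id) FROM user_log WHERE action = '2';
buyers = len({r["user_id"] for r in toy_log if r["action"] == "2"})
# SELECT count(DISTINCT user_id) FROM user_log;
users = len({r["user_id"] for r in toy_log})

print(buyers, users, buyers / users)  # buyers=2, users=3 for this toy data
```

On the real table both counts are 358, so the ratio comes out as 1.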
Ratio of male to female buyers on Double 11
First count the purchases made by male buyers:
select count(*) from user_log where gender=1;
Result: 3299
Then count the purchases made by female buyers:
select count(*) from user_log where gender=0;
Result: 3361
So the male-to-female ratio is 3299 / 3361, roughly 0.98.
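The closing arithmetic is just the two counts divided; a short check confirms the rounded value:

```python
male_purchases = 3299    # count(*) WHERE gender = 1
female_purchases = 3361  # count(*) WHERE gender = 0

ratio = male_purchases / female_purchases
print(round(ratio, 2))  # 0.98
```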
Given a purchase-count range, find the ids of users who bought that many items on a given day
Find the ids of users who made more than 5 purchases on the site on a given day:
select user_id from user_log where action='2' group by user_id having count(action='2')>5;
Result (partial, for brevity):
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.77 sec HDFS Read: 483294 HDFS Write: 1304 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 770 msec
OK
1321
6058
16464
...
422917
Time taken: 53.104 seconds, Fetched: 65 row(s)
Real-time user query analysis
Query the number of purchases per brand
This time, create a new table to store the result:
create table scan(brand_id INT,scan INT) COMMENT 'This is the search of bigdatataobao' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
Load the data:
insert overwrite table scan select brand_id,count(action) from user_log where action='2' group by brand_id;
Check the result:
select * from scan;
Result (partial, for brevity):
OK
NULL	8
60	3
69	1
82	11
99	3
104	1
125	1
127	1
133	1
...
8396	3
8410	1
8461	1
Time taken: 0.852 seconds, Fetched: 544 row(s)
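The brand-level aggregation that fills the scan table is a plain group-by count. The sketch below mirrors that logic in Python on invented rows; it is an illustration of the grouping, not a substitute for the Hive job.

```python
from collections import Counter

# Invented (brand_id, action) pairs; action '2' = purchase, as above.
toy_log = [(2661, "2"), (2661, "2"), (1253, "0"), (1253, "2"), (8461, "2")]

# Equivalent of: SELECT brand_id, count(action) FROM user_log
#                WHERE action = '2' GROUP BY brand_id;
scan = Counter(brand for brand, action in toy_log if action == "2")

for brand_id, n in sorted(scan.items()):
    print(brand_id, n)
```

Each printed pair corresponds to one row of the scan table: a brand_id and its purchase count.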