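Hive: Query customer top search queries and product view counts using Apache Hive
This post covers using Apache Hive to query search click data stored in Hadoop. As worked examples, we will generate each customer's top search queries and statistics on total product views.
This continues the previous posts:
- Customer product search clicks analytics using big data,
- Flume: Gathering customer product search clicks data using Apache Flume,
We already have customer search click data collected into Hadoop HDFS using Flume.
Here we analyze it further by using Hive to query the data stored in Hadoop.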
Hive
Hive lets us query big data using HiveQL, a SQL-like language.
Hadoop Data
As shared in the previous post, the search click data is stored in Hadoop under paths of the form "/searchevents/2014/05/15/16/". A separate directory is created for each hour.
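To make the hourly layout concrete, here is a minimal Java sketch that maps an event timestamp to the directory it would fall into. The class name HourlyPath and the time zone handling are assumptions for illustration, not part of the original application. The source does not say which zone the Flume sink uses. An offset of UTC+2 happens to reproduce the hour "16" directory for the sample timestamp 1399386809212, but that is an inference.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: maps an event timestamp to its hourly HDFS directory. */
final class HourlyPath {
    // yyyy/MM/dd/HH mirrors the /searchevents/<year>/<month>/<day>/<hour>/ layout.
    private static final DateTimeFormatter LAYOUT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH");

    /** Returns the hourly directory for the given epoch millis in the given zone. */
    static String forTimestamp(long epochMillis, ZoneId zone) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(zone);
        return "/searchevents/" + LAYOUT.format(t) + "/";
    }

    public static void main(String[] args) {
        // The zone is an assumption; pass the zone your collector actually runs in.
        System.out.println(forTimestamp(1399386809212L, ZoneId.of("UTC")));
    }
}
```

The zone is passed in explicitly so that the same timestamp always resolves to the same directory, regardless of the default zone of the machine running the code.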
Files are created as:

    hdfs://localhost.localdomain:54321/searchevents/2014/05/06/16/searchevents.1399386809864

Data is stored as a DataStream:

    {"eventid":"e8470a00-c869-4a90-89f2-f550522f8f52-1399386809212-72","hostedmachinename":"192.168.182.1334","pageurl":"http://jaibigdata.com/0","customerid":72,"sessionid":"7871a55c-a950-4394-bf5f-d2179a553575","querystring":null,"sortorder":"desc","pagenumber":0,"totalhits":8,"hitsshown":44,"createdtimestampinmillis":1399386809212,"clickeddocid":"23","favourite":null,"eventidsuffix":"e8470a00-c869-4a90-89f2-f550522f8f52","filters":[{"code":"searchfacettype_brand_level_2","value":"Apple"},{"code":"searchfacettype_color_level_2","value":"Blue"}]}
    {"eventid":"2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0-1399386809743-61","hostedmachinename":"192.168.182.1330","pageurl":"http://jaibigdata.com/0","customerid":61,"sessionid":"78286f6d-cc1e-489c-85ce-a7de8419d628","querystring":"queryString59","sortorder":"asc","pagenumber":3,"totalhits":32,"hitsshown":9,"createdtimestampinmillis":1399386809743,"clickeddocid":null,"favourite":null,"eventidsuffix":"2a4c1e1b-d2c9-4fe2-b38d-9b7d32feb4e0","filters":[{"code":"searchfacettype_age_level_2","value":"0-12 years"}]}

Spring Data
We will use Spring for Apache Hadoop to run Hive jobs from Spring. To set up the Hive environment in your application, use the following configuration:

    <hdp:configuration id="hadoopConfiguration" resources="core-site.xml">
        fs.default.name=hdfs://localhost.localdomain:54321
        mapred.job.tracker=localhost.localdomain:54310
    </hdp:configuration>
    <hdp:hive-server auto-startup="true" port="10234" min-threads="3" id="hiveServer" configuration-ref="hadoopConfiguration">
    </hdp:hive-server>
    <hdp:hive-client-factory id="hiveClientFactory" host="localhost" port="10234">
    </hdp:hive-client-factory>
    <hdp:hive-runner id="hiveRunner" run-at-startup="false" hive-client-factory-ref="hiveClientFactory">
    </hdp:hive-runner>

Check the Spring context file applicationContext-elasticsearch.xml for further details. We will use hiveRunner to run the Hive scripts.
All Hive scripts in the application are located under the hive folder in resources.
The service that runs all the Hive scripts can be found in HiveSearchClicksServiceImpl.java.
Set Up the Database
Let's first set up the database we will query.

    DROP DATABASE IF EXISTS search CASCADE;
    CREATE DATABASE search;

Query Search Events Using an External Table
We will create an external table, search_clicks, to read the search event data stored in Hadoop.

    USE search;
    CREATE EXTERNAL TABLE IF NOT EXISTS search_clicks (
        eventid STRING,
        customerid BIGINT,
        hostedmachinename STRING,
        pageurl STRING,
        totalhits INT,
        querystring STRING,
        sessionid STRING,
        sortorder STRING,
        pagenumber INT,
        hitsshown INT,
        clickeddocid STRING,
        filters ARRAY<STRUCT<code:STRING, value:STRING>>,
        createdtimestampinmillis BIGINT)
    PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)
    ROW FORMAT SERDE 'org.jai.hive.serde.JSONSerDe'
    LOCATION 'hdfs:///searchevents/';

JSONSerDe
The custom SerDe "org.jai.hive.serde.JSONSerDe" is used to map the JSON data. See JSONSerDe.java for more details.
If you run the queries from Eclipse itself, the dependencies are resolved automatically. If you run them from the Hive console, make sure you first build a jar file for the class and add the relevant dependency to the Hive console before running Hive queries.

    # create hive json serde jar
    jar cf jaihivejsonserde-1.0.jar org/jai/hive/serde/JSONSerDe.class

    # run on hive console to add jar
    add jar /opt/hive/lib/jaihivejsonserde-1.0.jar;

    # Or add jar path to hive-site.xml file permanently
    <property>
        <name>hive.aux.jars.path</name>
        <value>/opt/hive/lib/jaihivejsonserde-1.0.jar</value>
    </property>

Create Hive Partition
We will use Hive partitioning to read the data stored in Hadoop under the hierarchical locations. For the location above, "/searchevents/2014/05/06/16/", we pass the following parameter values (DBNAME=search, TBNAME=search_clicks, YEAR=2014, MONTH=05, DAY=06, HOUR=16).

    USE ${hiveconf:DBNAME};
    ALTER TABLE ${hiveconf:TBNAME} ADD IF NOT EXISTS
    PARTITION(year='${hiveconf:YEAR}', month='${hiveconf:MONTH}', day='${hiveconf:DAY}', hour='${hiveconf:HOUR}')
    LOCATION "hdfs:///searchevents/${hiveconf:YEAR}/${hiveconf:MONTH}/${hiveconf:DAY}/${hiveconf:HOUR}/";

To run the script,

    // hiveRunner is the bean declared in the Spring configuration above.
    Collection<HiveScript> scripts = new ArrayList<>();
    // Values substituted for the ${hiveconf:...} variables in the script.
    Map<String, String> args = new HashMap<>();
    args.put("DBNAME", dbName);
    args.put("TBNAME", tbName);
    args.put("YEAR", year);
    args.put("MONTH", month);
    args.put("DAY", day);
    args.put("HOUR", hour);
    HiveScript script = new HiveScript(new ClassPathResource("hive/add_partition_searchevents.q"), args);
    scripts.add(script);
    hiveRunner.setScripts(scripts);
    hiveRunner.call();

In a later post, we will cover how to use an Oozie coordinator job to create the Hive partitions for hourly data automatically.
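The parameter values passed to the partition script follow directly from the hourly directory path, so they can be derived rather than typed by hand. The sketch below shows one way to do that in plain Java. The class PartitionArgs and its method are hypothetical names introduced here, and the code assumes the path always has the form /<root>/yyyy/MM/dd/HH/.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds the hiveconf variables from an hourly directory. */
final class PartitionArgs {

    /** Splits a path such as "/searchevents/2014/05/06/16/" into script arguments. */
    static Map<String, String> fromPath(String dbName, String tbName, String path) {
        // Strip leading/trailing slashes, then split: [root, year, month, day, hour].
        String[] parts = path.replaceAll("^/+|/+$", "").split("/");
        if (parts.length != 5) {
            throw new IllegalArgumentException(
                    "Expected /<root>/yyyy/MM/dd/HH/, got: " + path);
        }
        Map<String, String> args = new LinkedHashMap<>();
        args.put("DBNAME", dbName);
        args.put("TBNAME", tbName);
        args.put("YEAR", parts[1]);
        args.put("MONTH", parts[2]);
        args.put("DAY", parts[3]);
        args.put("HOUR", parts[4]);
        return args;
    }

    public static void main(String[] args) {
        System.out.println(
                fromPath("search", "search_clicks", "/searchevents/2014/05/06/16/"));
    }
}
```

The returned map has the same keys as the args map built by hand for add_partition_searchevents.q, so it could be passed to the HiveScript constructor in its place.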
Get All Search Click Events
Fetch the search events stored in the external table search_clicks. Pass the following parameter values (DBNAME=search, TBNAME=search_clicks, YEAR=2014, MONTH=05, DAY=06, HOUR=16).

    USE ${hiveconf:DBNAME};
    SELECT eventid, customerid, querystring, filters
    FROM ${hiveconf:TBNAME}
    WHERE year='${hiveconf:YEAR}' AND month='${hiveconf:MONTH}' AND day='${hiveconf:DAY}' AND hour='${hiveconf:HOUR}';

This returns all the data under the specified location and also helps you test your custom SerDe.
Find Product Views in the Last 30 Days
Count how many times each product was viewed or clicked in the last n days.
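The script for this section hard-codes the window as 2592000000, which is 30 days expressed in milliseconds, and subtracts it from unix_timestamp() * 1000. As a minimal sketch of the same arithmetic for an arbitrary n, the Java below computes the cutoff and applies the same strict "greater than" comparison. The class ViewWindow and its method names are hypothetical.

```java
import java.time.Duration;

/** Hypothetical helper: the "last n days" cutoff used to filter click events. */
final class ViewWindow {

    /** Earliest timestamp (exclusive) that still counts as "within the last n days". */
    static long cutoffMillis(long nowMillis, int days) {
        return nowMillis - Duration.ofDays(days).toMillis();
    }

    /** Mirrors the query predicate: createdtimestampinmillis > now - window. */
    static boolean inWindow(long createdMillis, long nowMillis, int days) {
        return createdMillis > cutoffMillis(nowMillis, days);
    }

    public static void main(String[] args) {
        // 30 days in milliseconds, the constant used in the Hive script.
        System.out.println(Duration.ofDays(30).toMillis());
    }
}
```

To make the Hive script itself configurable, the constant could be passed in as a hiveconf variable in the same way as the partition parameters. That would be a change to the script, which the original does not include.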
    USE search;
    DROP TABLE IF EXISTS search_productviews;
    CREATE TABLE search_productviews(id STRING, productid BIGINT, viewcount INT);

    -- product views count in the last 30 days.
    INSERT INTO TABLE search_productviews
    SELECT clickeddocid AS id, clickeddocid AS productid, count(*) AS viewcount
    FROM search_clicks
    WHERE clickeddocid IS NOT NULL
      AND createdtimestampinmillis > ((unix_timestamp() * 1000) - 2592000000)
    GROUP BY clickeddocid
    ORDER BY productid;

To run the script,

    Collection<HiveScript> scripts = new ArrayList<>();
    HiveScript script = new HiveScript(new ClassPathResource("hive/load-search_productviews-table.q"));
    scripts.add(script);
    hiveRunner.setScripts(scripts);
    hiveRunner.call();

Sample data, selected from the "search_productviews" table:

    # id, productid, viewcount
    61, 61, 15
    48, 48, 8
    16, 16, 40
    85, 85, 7

Find Customer Top Queries in the Last 30 Days
    USE search;
    DROP TABLE IF EXISTS search_customerquery;
    CREATE TABLE search_customerquery(id STRING, customerid BIGINT, querystring STRING, querycount INT);

    -- customer top query string in the last 30 days
    INSERT INTO TABLE search_customerquery
    SELECT concat(customerid, "_", querystring), customerid, querystring, count(*) AS querycount
    FROM search_clicks
    WHERE querystring IS NOT NULL
      AND customerid IS NOT NULL
      AND createdtimestampinmillis > ((unix_timestamp() * 1000) - 2592000000)
    GROUP BY customerid, querystring
    ORDER BY customerid;

Sample data, selected from the "search_customerquery" table:

    # id, querystring, count, customerid
    61_queryString59, queryString59, 5, 61
    298_queryString48, queryString48, 3, 298
    440_queryString16, queryString16, 1, 440
    47_queryString85, queryString85, 1, 47

Analyzing Facets/Filters for Guided Navigation
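You can extend the Hive queries further to generate statistics on how end customers behave over time when they use facets/filters to search for relevant products.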
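As a small in-memory illustration of what such filter statistics mean, the Java sketch below counts how many click events carry a given filter and which distinct customers used it. It is not how Hive evaluates the queries. FilterStats, Event, and the "code=value" string encoding are assumptions made for this example only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory model of filter statistics over click events. */
final class FilterStats {

    /** One click event: the customer and the active filters as "code=value" strings. */
    static final class Event {
        final long customerId;
        final List<String> filters;

        Event(long customerId, String... filters) {
            this.customerId = customerId;
            this.filters = Arrays.asList(filters);
        }
    }

    /** Number of events in which the given filter was active. */
    static long clickCount(List<Event> events, String code, String value) {
        long n = 0;
        for (Event e : events) {
            if (e.filters.contains(code + "=" + value)) {
                n++;
            }
        }
        return n;
    }

    /** Distinct customers who had the given filter active in at least one event. */
    static Set<Long> distinctCustomers(List<Event> events, String code, String value) {
        Set<Long> ids = new HashSet<>();
        for (Event e : events) {
            if (e.filters.contains(code + "=" + value)) {
                ids.add(e.customerId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // The first two events mirror the sample JSON records shown earlier.
        List<Event> events = Arrays.asList(
                new Event(72L, "searchfacettype_brand_level_2=Apple",
                        "searchfacettype_color_level_2=Blue"),
                new Event(61L, "searchfacettype_age_level_2=0-12 years"));
        System.out.println(clickCount(events, "searchfacettype_color_level_2", "Blue"));
    }
}
```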
    USE search;

    -- How many times a particular filter has been clicked.
    SELECT count(*) FROM search_clicks
    WHERE array_contains(filters, struct("searchfacettype_color_level_2", "Blue"));

    -- distinct customers who clicked the filter
    SELECT DISTINCT customerid FROM search_clicks
    WHERE array_contains(filters, struct("searchfacettype_color_level_2", "Blue"));

    -- top query filters by a customer
    SELECT customerid, filters.code, filters.value, count(*) AS filtercount
    FROM search_clicks
    GROUP BY customerid, filters.code, filters.value
    ORDER BY filtercount DESC
    LIMIT 100;

The data extraction Hive queries can be scheduled nightly or hourly, depending on requirements, and executed using a job scheduler such as Oozie. The resulting data can then be used for BI analytics or to improve the customer experience.
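In later posts we will cover further analysis of the generated data:
- Using ElasticSearch Hadoop to index customer top queries and product views data
- Using Oozie to schedule coordinated jobs for Hive partitions and bundling jobs to index data into ElasticSearch
- Using Pig to count the total number of unique customers, etc.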
Translated from: https://www.javacodegeeks.com/2014/05/hive-query-customer-top-search-query-and-product-views-count-using-apache-hive.html