A custom Hive URL parse function
When analyzing URLs in nginx logs with Hive, you frequently need to parse URLs. Hive's built-in parse_url function can do this, but it is strict about input format and cannot be applied directly to the request field of an nginx log:
```
hive -e "select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') from dual"
facebook.com
```
```
hive -e "select parse_url('facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') from dual"
NULL
```
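The NULL above is consistent with how java.net.URL behaves: a string without a scheme is rejected as malformed. A minimal sketch (the `hostOf` helper and class name are mine, for illustration only):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlCheck {
    // Returns the host of a URL string, or null when the string cannot be
    // parsed as a full URL -- e.g. when the "http://" scheme is missing,
    // which is exactly the shape of an nginx request field.
    static String hostOf(String s) {
        try {
            return new URL(s).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://facebook.com/path1/p.php?k1=v1")); // facebook.com
        System.out.println(hostOf("facebook.com/path1/p.php?k1=v1"));        // null
    }
}
```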
You can also get there with regexp_extract, but that means hand-writing a regular expression, and performance is also a concern:
```
hive -e "select regexp_extract('GET /vips-mobile/router.do?api_key=24415b921531551cb2ba756b885ce783&app_version=1.8.6&fields=sku_id HTTP/1.1','.+? +(.+?)app_version=(.+?)&(.+) .+?',2) from dual"
1.8.6
```
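To see what that pattern is doing, here is a plain-Java stand-in for regexp_extract (the helper name is mine; Hive's exact no-match behavior varies by version, so the empty-string fallback is a simplification):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtract {
    // Rough equivalent of Hive's regexp_extract(subject, pattern, index):
    // return the given capture group of the first match, or "" if none.
    static String regexpExtract(String subject, String pattern, int group) {
        Matcher m = Pattern.compile(pattern).matcher(subject);
        return m.find() ? m.group(group) : "";
    }

    public static void main(String[] args) {
        String req = "GET /vips-mobile/router.do?api_key=24415b921531551cb2ba756b885ce783"
                   + "&app_version=1.8.6&fields=sku_id HTTP/1.1";
        // Group 2 is the lazy capture between "app_version=" and the next "&".
        System.out.println(regexpExtract(req, ".+? +(.+?)app_version=(.+?)&(.+) .+?", 2)); // 1.8.6
    }
}
```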
So I wrote my own, borrowing from the parse_url UDF. The code:
```java
package com.hive.myudf;

import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFNginxParseUrl extends UDF {
  private static final String SCHEMA = "http://";
  // nginx request field: "METHOD /path?query HTTP/x.x"
  private static final Pattern REQUEST = Pattern.compile("(.+?) +(.+?) (.+)");

  private Pattern p = null;
  private String lastKey = null;

  public UDFNginxParseUrl() {
  }

  public String evaluate(String host, String urlStr, String partToExtract) {
    if (host == null || urlStr == null || partToExtract == null) {
      return null;
    }
    Matcher m = REQUEST.matcher(urlStr);
    if (!m.matches()) {
      return null;
    }
    URL url;
    try {
      // group(2) is the request path plus query string
      url = new URL(SCHEMA + host + m.group(2));
    } catch (Exception e) {
      return null;
    }
    if (partToExtract.equals("HOST")) {
      return url.getHost();
    }
    if (partToExtract.equals("PATH")) {
      return url.getPath();
    }
    if (partToExtract.equals("QUERY")) {
      return url.getQuery();
    }
    if (partToExtract.equals("REF")) {
      return url.getRef();
    }
    if (partToExtract.equals("PROTOCOL")) {
      return url.getProtocol();
    }
    if (partToExtract.equals("FILE")) {
      return url.getFile();
    }
    if (partToExtract.equals("AUTHORITY")) {
      return url.getAuthority();
    }
    if (partToExtract.equals("USERINFO")) {
      return url.getUserInfo();
    }
    return null;
  }

  public String evaluate(String host, String urlStr, String partToExtract, String key) {
    if (!partToExtract.equals("QUERY")) {
      return null;
    }
    String query = this.evaluate(host, urlStr, partToExtract);
    if (query == null) {
      return null;
    }
    // cache the compiled key pattern across rows that use the same key
    if (!key.equals(lastKey)) {
      p = Pattern.compile("(&|^)" + key + "=([^&]*)");
    }
    lastKey = key;
    Matcher m = p.matcher(query);
    if (m.find()) {
      return m.group(2);
    }
    return null;
  }
}
```
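The core logic can be tried standalone, without the Hadoop dependency, by inlining the two steps (split the request line, then pull one key out of the query). The class and helper names here are mine, not part of the UDF:

```java
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NginxRequestParse {
    // nginx request field: "METHOD /path?query HTTP/x.x"
    private static final Pattern REQUEST = Pattern.compile("(.+?) +(.+?) (.+)");

    // Rebuild a full URL from the host plus the nginx request line,
    // then extract a single query parameter from it.
    static String queryParam(String host, String request, String key) {
        Matcher m = REQUEST.matcher(request);
        if (!m.matches()) {
            return null;
        }
        URL url;
        try {
            url = new URL("http://" + host + m.group(2));
        } catch (Exception e) {
            return null;
        }
        String query = url.getQuery();
        if (query == null) {
            return null;
        }
        Matcher kv = Pattern.compile("(&|^)" + key + "=([^&]*)").matcher(query);
        return kv.find() ? kv.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(queryParam("test.test.com",
            "GET /vips-mobile/router.do?api_key=24415&app_version=1.8.6 HTTP/1.1",
            "app_version")); // 1.8.6
    }
}
```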
After add jar and create function, test it:
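The registration step might look like this (the jar path is a placeholder; the class name comes from the code above, and the function name matches the tests below):

```
ADD JAR /path/to/nginx-udf.jar;
CREATE TEMPORARY FUNCTION nginx_url_parse AS 'com.hive.myudf.UDFNginxParseUrl';
```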
```
hive -e "select nginx_url_parse('test.test.com','GET /vips-mobile/router.do?api_key=24415&app_version=1.8.6&fields=sku_id HTTP/1.1','HOST') FROM dual;"
test.test.com
```
```
hive -e "select nginx_url_parse('test.test.com','GET /vips-mobile/router.do?api_key=24415&app_version=1.8.6&fields=sku_id HTTP/1.1','QUERY','api_key') FROM dual;"
24415
```
With this, the function can be applied directly to nginx logs.
This article was originally published by caiguangguang on the 51CTO blog: http://blog.51cto.com/caiguangguang/1350463. Please contact the original author for reprint permission.