如何指南:Apache Drill入门
Apache Drill是一種引擎,可以連接到許多不同的數據源,并為它們提供SQL接口。 它不僅是想跳入任何復雜事物的SQL接口,而且是一個功能強大的接口, 其中包括對許多內置函數和窗口函數的支持。 盡管它可以連接到您仍然可以使用SQL查詢的標準數據源(例如Oracle或MySQL),但它也可以處理諸如CSV或JSON之類的平面文件以及Avro和Parquet格式。 正是這種針對文件運行SQL的能力首先激發了我對Apache Drill的興趣。 我花了大量時間研究大數據架構和工具,包括大數據發現 。 在此過程中,嘗試數據管道選項是我發現的空白之一,那就是在將它們引入Hive之類的東西之前,可以以原始狀態挖掘文件的功能,這將使他們能夠通過BDD和其他工具。
在本文中,我將逐步介紹Apache Drill的入門知識,并展示一些我認為是其有用性的絕佳示例。
入門
開始使用Apache Drill非常簡單–只需下載并解壓縮并運行即可。 雖然它可以跨機器分布運行以提高性能,但也可以在筆記本電腦上獨立運行。
啟動它
cd /opt/apache-drill-1.7.0/ bin/sqlline -u jdbc:drill:zk=local如果您No current connection或com.fasterxml.jackson.databind.JavaType.isReferenceType()Z則您遇到沖突的JAR問題 (例如,我在Oracle BigDataLite VM上遇到此問題 ),應在干凈的環境下啟動它
env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /opt/apache-drill-1.7.0/bin/drill-embedded有一個內置數據集可用于測試:
USE cp; SELECT employee_id, first_name FROM `employee.json` limit 5;如果您習慣使用SQL * Plus和類似的工具,那么在非常熟悉的環境中,這應該返回五行:
0: jdbc:drill:zk=local> USE cp; +-------+---------------------------------+ | ok | summary | +-------+---------------------------------+ | true | Default schema changed to [cp] | +-------+---------------------------------+ 1 row selected (1.776 seconds) 0: jdbc:drill:zk=local> SELECT employee_id, first_name FROM `employee.json` limit 5; +--------------+-------------+ | employee_id | first_name | +--------------+-------------+ | 1 | Sheri | | 2 | Derrick | | 4 | Michael | | 5 | Maya | | 6 | Roberta | +--------------+-------------+ 5 rows selected (3.624 seconds)到目前為止,是如此的SQL,如此的關系-如此熟悉,真的。 Apache Drill開始偏離明顯之處的是它對存儲處理程序的使用。 在上面的查詢中,cp是我們要對其運行查詢的“數據庫”,但實際上這是默認情況下定義的“類路徑”(因此稱為“ cp”)存儲處理程序。 在“數據庫”中,存在“方案”,它們是存儲處理程序的子配置。 稍后我們將查看和定義這些內容。 現在,知道您還可以列出可用的數據庫很有用:
0: jdbc:drill:zk=local> show databases; +---------------------+ | SCHEMA_NAME | +---------------------+ | INFORMATION_SCHEMA | | cp.default | | dfs.default | | dfs.root | | dfs.tmp | | sys | +---------------------+注意databases命令是schemas的同義詞; 兩者都返回的是<database>.<schema> 。 在Apache Drill中,反引號用于封裝標識符(例如模式名稱,列名稱等),并且它非常具體。 例如,這是有效的:
0: jdbc:drill:zk=local> USE `cp.default`; +-------+-----------------------------------------+ | ok | summary | +-------+-----------------------------------------+ | true | Default schema changed to [cp.default] | +-------+-----------------------------------------+ 1 row selected (0.171 seconds)雖然不是:
0: jdbc:drill:zk=local> USE cp.default; Error: PARSE ERROR: Encountered ". default" at line 1, column 7. Was expecting one of: <EOF> "." <IDENTIFIER> ... "." <QUOTED_IDENTIFIER> ... "." <BACK_QUOTED_IDENTIFIER> ... "." <BRACKET_QUOTED_IDENTIFIER> ... "." <UNICODE_QUOTED_IDENTIFIER> ... "." "*" ...SQL Query USE cp.default這是因為default是保留字,因此必須加引號。 因此,您也可以使用
0: jdbc:drill:zk=local> use cp.`default`;
但不是
0: jdbc:drill:zk=local> use `cp`.default;
查詢JSON數據
在Apache Drill網站上有一些有用的教程 ,其中包括一個使用Yelp提供的數據的教程 。 這是最初使我著眼于Drill的數據集,因為我將其用作大數據發現 (BDD)的輸入,但在兩個方面都遇到了困難。 首先是如何最好地在其上定義合適的Hive表,以便將其提取到BDD中。 從此之后,我們試圖了解數據中可能具有什么價值,這將使我花多長時間來完善我在Hive中公開數據的方式。 以下示例顯示了以表格形式查詢時,復雜JSON可能帶來的復雜性。
首先,查詢JSON文件,并自動推斷出架構。 很酷
0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` limit 5; +---------+------+-------------+-------+------+------+ | user_id | text | business_id | likes | date | type | +---------+------+-------------+-------+------+------+ | -6rEfobYjMxpUWLNxszaxQ | Don't waste your time. | cE27W9VPgO88Qxe4ol6y_g | 0 | 2013-04-18 | tip | | EZ0r9dKKtEGVx2CdnowPCw | Your GPS will not allow you to find this place. Put Rankin police department in instead. They are directly across the street. | mVHrayjG3uZ_RLHkLj-AMg | 1 | 2013-01-06 | tip | | xb6zEQCw9I-Gl0g06e1KsQ | Great drink specials! | KayYbHCt-RkbGcPdGOThNg | 0 | 2013-12-03 | tip | | QawZN4PSW7ng_9SP7pjsVQ | Friendly staff, good food, great beer selection, and relaxing atmosphere | KayYbHCt-RkbGcPdGOThNg | 0 | 2015-07-08 | tip | | MLQre1nvUtW-RqMTc4iC9A | Beautiful restoration. | 1_lU0-eSWJCRvNGk78Zh9Q | 0 | 2015-10-25 | tip | +---------+------+-------------+-------+------+------+ 5 rows selected (2.341 seconds)我們可以使用標準的SQL聚合,例如COUNT :
0: jdbc:drill:zk=local> select count(*) from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json`; +---------+ | EXPR$0 | +---------+ | 591864 | +---------+ 1 row selected (4.495 seconds)以及GROUP BY操作:
0: jdbc:drill:zk=local> select `date`,count(*) as tip_count from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5; +-------------+------------+ | date | tip_count | +-------------+------------+ | 2012-07-21 | 719 | | 2012-05-19 | 718 | | 2012-08-04 | 699 | | 2012-06-23 | 690 | | 2012-07-28 | 682 | +-------------+------------+ 5 rows selected (7.111 seconds)稍微研究一下數據,我們可以看到它并不完全平坦–請注意,例如hours列,它是一個嵌套的JSON對象:
0: jdbc:drill:zk=local> select full_address,city,hours from `/user/oracle/incoming/yelp/business_json` b limit 5; +--------------+------+-------+ | full_address | city | hours | +--------------+------+-------+ | 4734 Lebanon Church Rd Dravosburg, PA 15034 | Dravosburg | {"Friday":{"close":"21:00","open":"11:00"},"Tuesday":{"close":"21:00","open":"11:00"},"Thursday":{"close":"21:00","open":"11:00"},"Wednesday":{"close":"21:00","open":"11:00"},"Monday":{"close":"21:00","open":"11:00"},"Sunday":{},"Saturday":{}} | | 202 McClure St Dravosburg, PA 15034 | Dravosburg | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} | | 1 Ravine St Dravosburg, PA 15034 | Dravosburg | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} | | 1530 Hamilton Rd Bethel Park, PA 15234 | Bethel Park | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} | | 301 South Hills Village Pittsburgh, PA 15241 | Pittsburgh | {"Friday":{"close":"17:00","open":"10:00"},"Tuesday":{"close":"21:00","open":"10:00"},"Thursday":{"close":"17:00","open":"10:00"},"Wednesday":{"close":"21:00","open":"10:00"},"Monday":{"close":"21:00","open":"10:00"},"Sunday":{"close":"18:00","open":"11:00"},"Saturday":{"close":"21:00","open":"10:00"}} | +--------------+------+-------+ 5 rows selected (0.721 seconds) 0: jdbc:drill:zk=local>使用Apache Drill,我們可以簡單地使用點表示法來訪問嵌套值。 在執行此操作時,必須為表加上別名(在本示例中為b ):
0: jdbc:drill:zk=local> select b.hours from `/user/oracle/incoming/yelp/business_json` b limit 1; +-------+ | hours | +-------+ | {"Friday":{"close":"21:00","open":"11:00"},"Tuesday":{"close":"21:00","open":"11:00"},"Thursday":{"close":"21:00","open":"11:00"},"Wednesday":{"close":"21:00","open":"11:00"},"Monday":{"close":"21:00","open":"11:00"},"Sunday":{},"Saturday":{}} | +-------+嵌套對象本身可以嵌套-Apache Drill并不存在問題,我們只是將點符號進一步鏈接起來:
0: jdbc:drill:zk=local> select b.hours.Friday from `/user/oracle/incoming/yelp/business_json` b limit 1; +-----------------------------------+ | EXPR$0 | +-----------------------------------+ | {"close":"21:00","open":"11:00"} | +-----------------------------------+ 1 row selected (0.238 seconds)注意,使用反引號( ` )引用保留的open和close關鍵字:
0: jdbc:drill:zk=local> select b.hours.Friday.`open`,b.hours.Friday.`close` from `/user/oracle/incoming/yelp/business_json` b limit 1; +---------+---------+ | EXPR$0 | EXPR$1 | +---------+---------+ | 11:00 | 21:00 | +---------+---------+ 1 row selected (0.58 seconds)嵌套列本身就是查詢中的適當對象,也可以用作謂詞:
0: jdbc:drill:zk=local> select b.name,b.full_address,b.hours.Friday.`open` from `/user/oracle/incoming/yelp/business_json` b where b.hours.Friday.`open` = '11:00' limit 5; +------------------------+------------------------------------------------+---------+ | name | full_address | EXPR$2 | +------------------------+------------------------------------------------+---------+ | Mr Hoagie | 4734 Lebanon Church Rd Dravosburg, PA 15034 | 11:00 | | Alexion's Bar & Grill | 141 Hawthorne St Greentree Carnegie, PA 15106 | 11:00 | | Rocky's Lounge | 1201 Washington Ave Carnegie, PA 15106 | 11:00 | | Papa J's | 200 E Main St Carnegie Carnegie, PA 15106 | 11:00 | | Italian Village Pizza | 2615 Main St Homestead, PA 15120 | 11:00 | +------------------------+------------------------------------------------+---------+ 5 rows selected (0.404 seconds)您會在上面的輸出中注意到full_address字段中包含換行符—我們可以使用SQL函數用逗號替換換行符:
0: jdbc:drill:zk=local> select b.name,regexp_replace(b.full_address,'\n',','),b.hours.Friday.`open` from `/user/oracle/incoming/yelp/business_json` b where b.hours.Friday.`open` = '11:00' limit 5; +------------------------+------------------------------------------------+---------+ | name | EXPR$1 | EXPR$2 | +------------------------+------------------------------------------------+---------+ | Mr Hoagie | 4734 Lebanon Church Rd,Dravosburg, PA 15034 | 11:00 | | Alexion's Bar & Grill | 141 Hawthorne St,Greentree,Carnegie, PA 15106 | 11:00 | | Rocky's Lounge | 1201 Washington Ave,Carnegie, PA 15106 | 11:00 | | Papa J's | 200 E Main St,Carnegie,Carnegie, PA 15106 | 11:00 | | Italian Village Pizza | 2615 Main St,Homestead, PA 15120 | 11:00 | +------------------------+------------------------------------------------+---------+ 5 rows selected (1.346 seconds)查詢聯合
因此,Apache Drill使您能夠對多種格式和位置的數據運行SQL查詢,這本身就非常有用。 但比這更好的是,它使您可以在單個查詢中聯合這些源。 這是在HDFS和Oracle中的數據之間進行聯接的示例:
0: jdbc:drill:zk=local> select X.text, . . . . . . . . . . . > Y.NAME . . . . . . . . . . . > from hdfs.`/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` X . . . . . . . . . . . > inner join ora.MOVIEDEMO.YELP_BUSINESS Y . . . . . . . . . . . > on X.business_id = Y.BUSINESS_ID . . . . . . . . . . . > where Y.NAME = 'Chick-fil-A' . . . . . . . . . . . > limit 5; +--------------------------------------------------------------------+--------------+ | text | NAME | +--------------------------------------------------------------------+--------------+ | It's daddy daughter date night here and they go ALL OUT! | Chick-fil-A | | Chicken minis! The best part of waking up Saturday mornings. :) | Chick-fil-A | | Nice folks as always unlike those ghetto joints | Chick-fil-A | | Great clean and delicious chicken sandwiches! | Chick-fil-A | | Spicy Chicken with lettuce, tomato, and pepperjack cheese FTW! | Chick-fil-A | +--------------------------------------------------------------------+--------------+ 5 rows selected (3.234 seconds)您可以為此定義一個視圖:
0: jdbc:drill:zk=local> create or replace view dfs.tmp.yelp_tips as select X.text as tip_text, Y.NAME as business_name from hdfs.`/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` X inner join ora.MOVIEDEMO.YELP_BUSINESS Y on X.business_id = Y.BUSINESS_ID ; +-------+-------------------------------------------------------------+ | ok | summary | +-------+-------------------------------------------------------------+ | true | View 'yelp_tips' replaced successfully in 'dfs.tmp' schema | +-------+-------------------------------------------------------------+ 1 row selected (0.574 seconds) 0: jdbc:drill:zk=local> describe dfs.tmp.yelp_tips; +----------------+--------------------+--------------+ | COLUMN_NAME | DATA_TYPE | IS_NULLABLE | +----------------+--------------------+--------------+ | tip_text | ANY | YES | | business_name | CHARACTER VARYING | YES | +----------------+--------------------+--------------+ 2 rows selected (0.756 seconds)然后將其作為任何常規對象進行查詢:
0: jdbc:drill:zk=local> select tip_text,business_name from dfs.tmp.yelp_tips where business_name like '%Grill' limit 5; +------+------+ | text | NAME | +------+------+ | Great drink specials! | Alexion's Bar & Grill | | Friendly staff, good food, great beer selection, and relaxing atmosphere | Alexion's Bar & Grill | | Pretty quiet here... | Uno Pizzeria & Grill | | I recommend this location for quick lunches. 10 min or less lunch menu. Soup bar ( all you can eat) the broccoli cheddar soup is delicious. | Uno Pizzeria & Grill | | Instead of pizza, come here for dessert. The deep dish sundae is really good. | Uno Pizzeria & Grill | +------+------+ 5 rows selected (3.272 seconds)查詢Twitter JSON數據
這是一個使用Drill查詢包含一些Twitter數據的本地文件的示例。 如果您想嘗試自己查詢文件,可以在這里下載文件 。
首先,我切換到使用dfs存儲插件:
0: jdbc:drill:zk=local> use dfs; +-------+----------------------------------+ | ok | summary | +-------+----------------------------------+ | true | Default schema changed to [dfs] | +-------+----------------------------------+然后嘗試對文件進行選擇。 注意limit 5子句–僅在檢查文件結構時非常有用。
0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/twitter/geo_tweets.json` limit 5; Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input within/between OBJECT entriesFile /user/oracle/incoming/twitter/geo_tweets.json Record 2819 Column 3503 Fragment 0:0一個錯誤? 那不應該發生的。 我有一個JSON文件,對不對? 事實證明,JSON文件是每行一個完整的JSON對象。 除了不在最后一個記錄上。 請注意上面的錯誤– 2819中給出的記錄計數:
[oracle@bigdatalite ~]$ wc -l geo_tweets.json 2818 geo_tweets.json因此,該文件只有2818條完整的行。 嗯 讓我們使用頭/尾巴打擊組合查看該記錄:
[oracle@bigdatalite ~]$ head -n 2819 geo_tweets.json |tail -n1 {"created_at":"Sun Jul 24 21:00:44 +0000 2016","id":757319630432067584,"id_str":"757319630432067584","text":"And now @HillaryClinton hires @DWStweets: Honorary Campaign Manager across the USA #corruption #hillarysamerica https://t.co/8jAGUu6w2f","source":"TweetCaster for iOS","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2170786369,"id_str":"2170786369","name":"Patricia Weber","screen_name":"InnieBabyBoomer","location":"Williamsburg, VA","url":"http://lovesrantsandraves.blogspot.com/","description":"Baby Boomer, Swing Voter, Conservative, Spiritual, #Introvert, Wife, Grandma, Italian, ? Books, Cars, Ferrari, F1 Race? #tcot","protected":false,"verified":false,"followers_count":861,"friends_count":918,"listed_count":22,"favourites_count":17,"statuses_count":2363,"created_at":"Sat Nov 02 19:13:06 +0000 2013","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000107659131/3589f這就是文件中的完整數據-所以Drill是正確的-JSON已損壞。 如果我們刪除最后一條記錄并創建一個新文件( geo_tweets.fixed.json )
然后再次查詢,我們得到了一些東西!
0: jdbc:drill:zk=local> select text from `/users/rmoff/data/geo_tweets.fixed.json` limit 5; +------+ | text | +------+ | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim | | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | | Donald Trump accepted the Republican nomination last night. Isis claimed responsibility. | | Obama: "We must stand together and stop terrorism" Trump: "We don't want these people in our country"| | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw | +------+ 5 rows selected (0.246 seconds)此處的text是json字段之一。 我可以select *但不是很清楚:
0: jdbc:drill:zk=local> select * from `/users/rmoff/data/geo_tweets.fixed.json` limit 5; +------------+----+--------+------+--------+-----------+------+-----+-------------+-------+-----------------+---------------+----------------+----------+-----------+-----------+--------------------+--------------+------+--------------+----------+------------+-----------+------------------+----------------------+--------------------+-------------------+-----------------------+---------------------+-----------------+------------+---------------+---------------+------------+-----------+--------------------------------+-----------+----------+----------------+-------------------+---------------------------------+-----------------------+---------------------------+---------------------+-------------------------+-------------------------+------------------+-----------------------+------------------+----------------------+---------------+ | created_at | id | id_str | text | source | truncated | user | geo | coordinates | place | is_quote_status | retweet_count | favorite_count | entities | favorited | retweeted | possibly_sensitive | filter_level | lang | timestamp_ms | @version | @timestamp | user_name | user_screen_name | user_followers_count | user_friends_count | user_listed_count | user_favourites_count | user_statuses_count | user_created_at | place_name | place_country | hashtags_list | urls_array | urls_list | user_mentions_screen_name_list | longitude | latitude | hashtags_array | extended_entities | user_mentions_screen_name_array | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | in_reply_to_screen_name | retweeted_status | retweeted_screen_name | quoted_status_id | quoted_status_id_str | quoted_status | +------------+----+--------+------+--------+-----------+------+-----+-------------+-------+-----------------+---------------+----------------+----------+-----------+-----------+--------------------+--------------+------+--------------+----------+------------+-----------+------------------+----------------------+--------------------+-------------------+-----------------------+---------------------+-----------------+------------+---------------+---------------+------------+-----------+--------------------------------+-----------+----------+----------------+-------------------+---------------------------------+-----------------------+---------------------------+---------------------+-------------------------+-------------------------+------------------+-----------------------+------------------+----------------------+---------------+ | Fri Jul 22 19:37:11 +0000 2016 | 756573827589545984 | 756573827589545984 | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim | dlvr.it | false | {"id":67898674,"id_str":"67898674","name":"Vancouver Press","screen_name":"Vancouver_CP","location":"Vancouver, BC","url":"http://vancouver.cityandpress.com/","description":"Latest news from Vancouver. Updates are frequent.","protected":false,"verified":false,"followers_count":807,"friends_count":13,"listed_count":94,"favourites_count":1,"statuses_count":131010,"created_at":"Sat Aug 22 14:25:37 +0000 2009","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"FFFFFF","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_link_color":"8A1C3B","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"FFFFFF","profile_text_color":"2A2C31","profile_use_background_image":false,"profile_image_url":"http://pbs.twimg.com/profile_images/515841109553983490/_t0QWPco_normal.png","profile_image_url_https":"https://pbs.twimg.com/profile_images/515841109553983490/_t0QWPco_normal.png","profile_banner_url":"https://pbs.twimg.com/profile_banners/67898674/1411821103","default_profile":false,"default_profile_image":false} | {"type":"Point","coordinates":[49.2814375,-123.12109067]} | {"type":"Point","coordinates":[-123.12109067,49.2814375]} | {"id":"1e5cb4d0509db554","url":"https://api.twitter.com/1.1/geo/id/1e5cb4d0509db554.json","place_type":"city","name":"Vancouver","full_name":"Vancouver, British Columbia","country_code":"CA","country":"Canada","bounding_box":{"type":"Polygon","coordinates":[[[-123.224215,49.19854],[-123.224215,49.316738],[-123.022947,49.316738],[-123.022947,49.19854]]]},"attributes":{}} | false | 0 | 0 | {"urls":[{"url":"https://t.co/joI9GMfRim","expanded_url":"http://toplocalnow.com/ca/vancouver?section=trends","display_url":"toplocalnow.com/ca/vancouver?s…","indices":[70,93]}],"hashtags":[],"user_mentions":[],"media":[],"symbols":[]} | false | false | false | low | en | 1469216231616 | 1 | 2016-07-22T19:37:11.000Z | Vancouver Press | Vancouver_CP | 807 | 13 | 94 | 1 | 131010 | Sat Aug 22 14:25:37 +0000 2009 | Vancouver | Canada | | ["toplocalnow.com/ca/vancouver?s…"] | toplocalnow.com/ca/vancouver?s… | | -123.12109067 | 49.2814375 | [] | {"media":[]} | [] | null | null | null | null | null | {"user":{},"entities":{"user_mentions":[],"media":[],"hashtags":[],"urls":[]},"extended_entities":{"media":[]},"quoted_status":{"user":{},"entities":{"hashtags":[],"user_mentions":[],"media":[],"urls":[]},"extended_entities":{"media":[]}}} | null | null | null | {"user":{},"entities":{"user_mentions":[],"media":[],"urls":[],"hashtags":[]},"extended_entities":{"media":[]},"place":{"bounding_box":{"coordinates":[]},"attributes":{}},"geo":{"coordinates":[]},"coordinates":{"coordinates":[]}} |在twitter數據中,存在根級字段(例如text )以及嵌套字段(例如user字段中有關高音揚聲器的信息)。 正如我們在上面看到的,您使用點表示法引用了嵌套字段。 現在是指出您可能遇到的幾個常見錯誤的好時機。 首先是不引用保留字,而是檢查是否Encountered "."諸如“ Encountered "."類的錯誤的第一件事Encountered "." :
0: jdbc:drill:zk=local> select user.screen_name,text from `/users/rmoff/data/geo_tweets.fixed.json` limit 5; Error: PARSE ERROR: Encountered "." at line 1, column 12. [...]其次是在使用點表示法時聲明表別名-如果您不這樣做,則Apache Drill會認為父列實際上是表名( VALIDATION ERROR: [...] Table 'user' not found ):
0: jdbc:drill:zk=local> select `user`.screen_name,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` limit 5; Aug 10, 2016 11:16:45 PM org.apache.calcite.sql.validate.SqlValidatorException SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'user' not found Aug 10, 2016 11:16:45 PM org.apache.calcite.runtime.CalciteException SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 13: Table 'user' not found Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 13: Table 'user' not foundSQL Query null[Error Id: 1427fd23-e180-40be-a751-b6f1f838233a on 192.168.56.1:31010] (state=,code=0)修正了這些錯誤之后,我們可以看到用戶的屏幕名稱:
0: jdbc:drill:zk=local> select tweets.`user`.`screen_name` as user_screen_name,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 2; +------------------+------+ | user_screen_name | text | +------------------+------+ | Vancouver_CP | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim | | tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | +------------------+------+ 2 rows selected (0.256 seconds) 0: jdbc:drill:zk=local>除了嵌套對象,JSON還支持數組。 Twitter數據中的一個示例是#標簽或URL,在給定的tweet中,兩者都可以為零,一個或多個。
0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 5; +--------+ | EXPR$0 | +--------+ | [] | | [{"text":"hiring","indices":[6,13]},{"text":"Job","indices":[98,102]},{"text":"SkilledTrade","indices":[103,116]},{"text":"Tucson","indices":[117,124]},{"text":"Jobs","indices":[129,134]}] | | [] | | [] | | [{"text":"lol","indices":[72,76]},{"text":"nowthatsfunny","indices":[77,91]}] | +--------+ 5 rows selected (0.286 seconds)使用FLATTEN函數,每個數組條目都會變成一個新行,因此:
0: jdbc:drill:zk=local> select flatten(tweets.entities.hashtags) from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 5; +----------------------------------------------+ | EXPR$0 | +----------------------------------------------+ | {"text":"hiring","indices":[6,13]} | | {"text":"Job","indices":[98,102]} | | {"text":"SkilledTrade","indices":[103,116]} | | {"text":"Tucson","indices":[117,124]} | | {"text":"Jobs","indices":[129,134]} | +----------------------------------------------+ 5 rows selected (0.139 seconds)請注意, limit 5子句僅顯示前五個數組實例,實際上這只是上述列表中第一個tweet的主題標簽。
要訪問主題標簽的文本,我們使用子查詢和點表示法來訪問text字段:
0: jdbc:drill:zk=local> select ent_hashtags.hashtags.text from (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) as ent_hashtags limit 5; +---------------+ | EXPR$0 | +---------------+ | hiring | | Job | | SkilledTrade | | Tucson | | Jobs | +---------------+ 5 rows selected (0.168 seconds)對于相同的結果,可以使用通用表表達式 (CTE,也稱為子查詢分解)來提高可讀性:
0: jdbc:drill:zk=local> with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) . . . . . . . . . . . > select ent_hashtags.hashtags.text from ent_hashtags . . . . . . . . . . . > limit 5; +---------------+ | EXPR$0 | +---------------+ | hiring | | Job | | SkilledTrade | | Tucson | | Jobs | +---------------+ 5 rows selected (0.253 seconds)將展平的數組與現有字段結合起來,使我們能夠看到類似推文列表及其相關主題標簽的內容:
0: jdbc:drill:zk=local> with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp limit 10; +------------------+------+---------+ | user_screen_name | text | hashtag | +------------------+------+---------+ | tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | hiring | | tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job | | tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | SkilledTrade | | tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Tucson | | tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Jobs | | johnmayberry | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw | lol | | johnmayberry | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw | nowthatsfunny | | greensboro_nc | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 | WinstonSalem | | greensboro_nc | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 | ws | | trendinaliaSG | 6. Hit The Stage 7. TTTT 8. Demi Lovato 9. Beijing 10. Donald Trump2016/7/23 03:36 SGT #trndnl https://t.co/psP0GzBgZB | trndnl | +------------------+------+---------+ 10 rows selected (0.166 seconds)我們還可以根據主題標簽進行過濾:
0: jdbc:drill:zk=local> with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp where tmp.hashtags.text = 'Job' limit 5; +------------------+------+---------+ | user_screen_name | text | hashtag | +------------------+------+---------+ | tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job | | tmj_VAL_health | Want to work at Genesis Rehab Services? We're #hiring in #Clinton, MD! Click for details: https://t.co/4lt7I4gMZk #Job #Healthcare #Jobs | Job | | tmj_in_retail | Want to work in #Clinton, IN? View our latest opening: https://t.co/UiimnlubYs #Job #Retail #Jobs #Hiring #CareerArc | Job | | tmj_la_hrta | Want to work at SONIC Drive-In? We're #hiring in #Clinton, LA! Click for details: https://t.co/aQ1FrWc7iR #Job #SONIC #Hospitality #Jobs | Job | | tmj_ia_hrta | We're #hiring! Click to apply: Department Manager - https://t.co/SnoKcwwHFk #Job #Hospitality #Clinton, IA #Jobs #CareerArc | Job | +------------------+------+---------+ 5 rows selected (0.207 seconds)以及總結標簽計數:
0: jdbc:drill:zk=local> with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) . . . . . . . . . . . > select ent_hashtags.hashtags.text,count(ent_hashtags.hashtags.text) from ent_hashtags . . . . . . . . . . . > group by ent_hashtags.hashtags.text . . . . . . . . . . . > order by 2 desc; +-----------------------------+---------+ | EXPR$0 | EXPR$1 | +-----------------------------+---------+ | Trump | 365 | | trndnl | 176 | | job | 170 | | Hiring | 127 | | Clinton | 108 | | Yorkshire | 100 | | CareerArc | 100 | [...]要過濾掉可能沒有數組值的記錄(例如,不是每個推特都具有的哈希標簽)并且沒有查詢的記錄可能會失敗,請對數組的第一個索引的屬性使用IS NOT NULL :
0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5; +--------+ | EXPR$0 | +--------+ | [{"text":"hiring","indices":[6,13]},{"text":"Job","indices":[98,102]},{"text":"SkilledTrade","indices":[103,116]},{"text":"Tucson","indices":[117,124]},{"text":"Jobs","indices":[129,134]}] | | [{"text":"lol","indices":[72,76]},{"text":"nowthatsfunny","indices":[77,91]}] | | [{"text":"WinstonSalem","indices":[0,13]},{"text":"ws","indices":[92,95]}] | | [{"text":"trndnl","indices":[89,96]}] | | [{"text":"trndnl","indices":[92,99]}] | +--------+ 5 rows selected (0.187 seconds)如果您嘗試比較數組本身,則無法使用:
0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags is not null limit 5; Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors: Error in expression at index -1. Error: Missing function implementation: [isnotnull(MAP-REPEATED)]. Full expression: --UNKNOWN EXPRESSION--.. Fragment 0:0 [Error Id: 99ac12aa-f6b4-4692-b815-8f483da682c4 on 192.168.56.1:31010] (state=,code=0)
上面的示例演示了如何使用數組索引,這是FLATTEN的替代選擇,如果您知道它們將存在,則可以訪問數組中的單個對象:
0: jdbc:drill:zk=local> select tweets.entities.hashtags[0].text as first_hashtag,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5; +---------------+------+ | first_hashtag | text | +---------------+------+ | hiring | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | | lol | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw | | WinstonSalem | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 |查詢CSV文件
JSON文件相對易于解釋,因為其中包含半定義的架構,包括列名。 另一方面,在可靠地推斷列名時,CSV(通常是字符分隔文件)更像是“狂野的西部”。 如果愿意,可以配置Apache Drill忽略CSV文件的第一行(假定它是標題),也可以將它們用作列名。 如果您不這樣做,則查詢如下所示的CSV文件:
[oracle@bigdatalite ~]$ head nyc_parking_violations.csv Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation 1360858775,PHW9801,OH,PAS,07/01/2015,20,SUBN,HONDA,P,61490,26160,26190,0,0044,44,44,929822,0044,0000,0653P,,BX,O,651,RIVER AVE,,0,408,D,,BBBBBBB,ALL,ALL,,0,0,-,0,,,,,您將獲得兩個記錄,每個記錄為一列寬,它們是一個數組:
0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv` LIMIT 5; +---------+ | columns | +---------+ | ["Summons Number","Plate ID","Registration State","Plate Type","Issue Date","Violation Code","Vehicle Body Type","Vehicle Make","Issuing Agency","Street Code1","Street Code2","Street Code3","Vehicle Expiration Date","Violation Location","Violation Precinct","Issuer Precinct","Issuer Code","Issuer Command","Issuer Squad","Violation Time","Time First Observed","Violation County","Violation In Front Of Or Opposite","House Number","Street Name","Intersecting Street","Date First Observed","Law Section","Sub Division","Violation Legal Code","Days Parking In Effect ","From Hours In Effect","To Hours In Effect","Vehicle Color","Unregistered Vehicle?","Vehicle Year","Meter Number","Feet From Curb","Violation Post Code","Violation Description","No Standing or Stopping Violation","Hydrant Violation","Double Parking Violation"] | | ["1360858775","PHW9801","OH","PAS","07/01/2015","20","SUBN","HONDA","P","61490","26160","26190","0","0044","44","44","929822","0044","0000","0653P","","BX","O","651","RIVER AVE","","0","408","D","","BBBBBBB","ALL","ALL","","0","0","-","0","","","","",""] |要訪問CSV文件中的實際列,您需要使用columns[x]語法來引用它們。 請注意,各columns區分大小寫,并且編號從零開始:
0: jdbc:drill:zk=local> select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv` limit 5; +----------+--------------------+ | PlateID | RegistrationState | +----------+--------------------+ | AR877A | NJ | | 73268ME | NY | | 2050240 | IN | | 2250017 | IN | | AH524C | NJ | +----------+--------------------+ 5 rows selected (0.247 seconds)為了使重復處理數據更加容易,您可以定義數據視圖:
0: jdbc:drill:zk=local> create view dfs.tmp.NYC_Parking_01 as select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`; +-------+-----------------------------------------------------------------+ | ok | summary | +-------+-----------------------------------------------------------------+ | true | View 'NYC_Parking_01' created successfully in 'dfs.tmp' schema | +-------+-----------------------------------------------------------------+ 1 row selected (0.304 seconds)這使用的是dfs存儲插件及其內部的tmp模式,它具有以下存儲配置–請注意, writeable為true
"tmp": {"location": "/tmp","writable": true,"defaultInputFormat": null }(如果您使用了錯誤的數據庫[存儲插件]或架構,您將獲得Schema [hdfs] is immutable. )
查詢新視圖
0: jdbc:drill:zk=local> select * from dfs.tmp.NYC_Parking_01 limit 5; +-----------+---------------------+ | PlateID | RegistrationState | +-----------+---------------------+ | Plate ID | Registration State | | PHW9801 | OH | | K8010F | TN | | GFG6211 | NY | | GHL1805 | NY | +-----------+---------------------+ 5 rows selected (0.191 seconds)通過視圖,或直接針對CSV路徑,您還可以運行聚合:
0: jdbc:drill:zk=local> select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) > 1 limit 1; +----------+---------+ | PlateID | EXPR$1 | +----------+---------+ | 2050240 | 4 | +----------+---------+ 1 row selected (15.983 seconds)盡管對于相同的結果,這不能重新運行-可能是由于limit子句
0: jdbc:drill:zk=local> select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) > 1 limit 1; +----------+---------+ | PlateID | EXPR$1 | +----------+---------+ | AR877A | 3 | +----------+---------+ 1 row selected (12.881 seconds)在視圖下,視圖定義被寫入/tmp –如果要在重新啟動后保留此數據,則需要移動此路徑:
[oracle@bigdatalite parking]$ cat /tmp/NYC_Parking_01.view.drill {"name" : "NYC_Parking_01","sql" : "SELECT `columns`[1] AS `PlateID`, `columns`[2] AS `RegistrationState`\nFROM `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`","fields" : [ {"name" : "PlateID","type" : "ANY","isNullable" : true}, {"name" : "RegistrationState","type" : "ANY","isNullable" : true} ],"workspaceSchemaPath" : [ "hdfs" ]您也可以使用CTAS(選擇時創建表)創建實際表:
0: jdbc:drill:zk=local> create table dfs.tmp.parking as select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`; +-----------+----------------------------+ | Fragment | Number of records written | +-----------+----------------------------+ | 1_1 | 4471875 | | 1_0 | 4788421 | +-----------+----------------------------+ 2 rows selected (42.913 seconds)它存儲在磁盤上(按dfs配置),默認情況下以Parquet格式存儲:
[oracle@bigdatalite parking]$ ls -l /tmp/parking/ total 76508 -rw-r--r--. 1 oracle oinstall 40623288 Aug 10 22:53 1_0_0.parquet -rw-r--r--. 1 oracle oinstall 37717804 Aug 10 22:53 1_1_0.parquetDrill的Web界面
Drill帶有Web界面,您可以從http://訪問
:8047 /對
- 發出查詢
- 配置其他存儲插件(例如, 數據庫 , hdfs等)
- 指標和調試
定義存儲插件
在Drill Web界面中,您可以查看現有的存儲插件或定義新的存儲插件。 要創建一個新文件,請在“存儲”頁面上的“ 新存儲插件”下輸入其名稱(例如hdfs ,但可能會引起fred -只是一個標簽),然后單擊“創建”。 在“配置”框中粘貼必要的JSON定義,然后單擊“創建”。 如果您不想使用GUI,那么還有REST API。
存儲插件配置存儲在Zookeeper中(運行分布式Drill時),或者獨立運行時在sys.store.provider.local.path路徑中本地sys.store.provider.local.path 。 默認情況下,該/tmp位于/tmp下,該/tmp在服務器重新啟動時被清除。 要保留自定義存儲配置,請修改drill-override.conf的sys.store.provider.local.path ,例如:
drill.exec: {cluster-id: "drillbits1",zk.connect: "localhost:2181"sys.store.provider.local.path="/home/oracle/drill/" }處理文件系統數據
這是一個使Drill能夠訪問CDH群集的HDFS的存儲配置示例:
{"type": "file","enabled": true,"connection": "hdfs://cdh57-01-node-01:8020/","config": null,"workspaces": {"root": {"location": "/","writable": true,"defaultInputFormat": null}},"formats": {"csv": {"type": "text","extensions": ["csv"],"delimiter": ","},"json": {"type": "json","extensions": ["json"]}} }除了HDFS的connection參數本身外,此配置中的重要部分是formats部分。 這就告訴Drill如何處理找到的文件,而最終用戶不必顯式聲明其類型。
對于基于文件系統的插件dfs (可以包括本地文件,HDFS甚至Amazon S3),您可以瀏覽可用的“表”:
列出HDFS中的文件(以前use hdfs;選擇)
0: jdbc:drill:zk=local> show files; +--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+ | name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime | +--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+ | hbase | true | false | 0 | hbase | supergroup | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-07-25 14:46:08.212 | | share | true | false | 0 | hdfs | supergroup | rwxrwxrwx | 1969-12-31 19:00:00.0 | 2016-05-15 12:28:08.152 | | solr | true | false | 0 | solr | solr | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-06-01 18:34:50.716 | | tmp | true | false | 0 | hdfs | supergroup | rwxrwxrwt | 1969-12-31 19:00:00.0 | 2016-06-24 04:54:41.491 | | user | true | false | 0 | hdfs | supergroup | rwxrwxrwx | 1969-12-31 19:00:00.0 | 2016-06-21 15:55:59.084 | | var | true | false | 0 | hdfs | supergroup | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-11 17:53:29.804 | +--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+ 6 rows selected (0.145 seconds)顯示給定路徑中的文件:
0: jdbc:drill:zk=local> show files in `/user/oracle`; +----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+ | name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime | +----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+ | .Trash | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-23 20:42:34.815 | | .sparkStaging | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-07-06 03:56:38.863 | | .staging | true | false | 0 | oracle | oracle | rwx------ | 1969-12-31 19:00:00.0 | 2016-06-01 18:37:04.005 | | incoming | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-08-03 05:34:12.38 | | mediademo | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-06-01 18:59:45.653 | | moviedemo | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:02:55.652 | | moviework | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:03:01.497 | | oggdemo | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:03:01.552 | | oozie-oozi | true | false | 0 | oracle | oracle | rwxr-xr-x | 1969-12-31 19:00:00.0 | 2016-05-15 12:03:01.651 | +----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+ 9 rows selected (0.428 seconds)您還可以通過指定通配符匹配來查詢多個文件。 以下是可用文件的截斷列表:
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> show files in `hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/`; +--------------------------+--------------+---------+----------+--------+--------+--------------+--------------------------+--------------------------+ | name | isDirectory | isFile | length | owner | group | permissions | accessTime | modificationTime | +--------------------------+--------------+---------+----------+--------+--------+--------------+--------------------------+--------------------------+ | FlumeData.1466176113171 | false | true | 1055675 | rmoff | rmoff | rw-r--r-- | 2016-08-10 21:28:27.072 | 2016-06-17 16:08:38.023 | | FlumeData.1466176113172 | false | true | 1051411 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.756 | 2016-06-17 16:08:40.597 | | FlumeData.1466176113173 | false | true | 1054734 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.752 | 2016-06-17 16:08:43.33 | | FlumeData.1466176113174 | false | true | 1050991 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.743 | 2016-06-17 16:08:44.361 | | FlumeData.1466176113175 | false | true | 1053577 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.748 | 2016-06-17 16:08:45.162 | | FlumeData.1466176113176 | false | true | 1051965 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.752 | 2016-06-17 16:08:46.261 | | FlumeData.1466176113177 | false | true | 1049555 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:47.425 | | FlumeData.1466176113178 | false | true | 1050566 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:48.23 | | FlumeData.1466176113179 | false | true | 1051751 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.756 | 2016-06-17 16:08:49.381 | | FlumeData.1466176113180 | false | true | 1052249 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.757 | 2016-06-17 16:08:50.042 | | FlumeData.1466176113181 | false | true | 1055002 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:50.896 | | FlumeData.1466176113182 | false | true | 1050812 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.758 | 2016-06-17 16:08:52.191 | | FlumeData.1466176113183 | false | true | 1048954 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.757 | 2016-06-17 16:08:52.994 | | FlumeData.1466176113184 | false | true | 1051559 | rmoff | rmoff | rw-r--r-- | 2016-08-05 20:46:51.773 | 2016-06-17 16:08:54.025 | [...]計算一個文件( FlumeData.1466176113171 )中的記錄數:
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171`(type => 'json')); +---------+ | EXPR$0 | +---------+ | 277 | +---------+ 1 row selected (0.798 seconds)在幾個文件中( FlumeData.146617611317* ):
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.146617611317*`(type => 'json')); +---------+ | EXPR$0 | +---------+ | 2415 | +---------+ 1 row selected (2.466 seconds)在文件夾( * )中的所有文件中:
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/*`(type => 'json')); +---------+ | EXPR$0 | +---------+ | 7414 | +---------+ 1 row selected (3.867 seconds)甚至跨多個文件夾:
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/06/*/*`(type => 'json')); +---------+ | EXPR$0 | +---------+ | 206793 | +---------+ 1 row selected (87.545 seconds)查詢不帶識別擴展名的數據
Drill依賴于orer中存儲擴展配置的format子句來確定如何根據文件的擴展名解釋文件。 您將不會總是擁有可用的或已定義的擴展。 如果嘗試查詢此類數據,您將走不遠。 在此示例中,我正在以JSON格式但不帶.json后綴的HDFS上查詢數據:
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT text FROM `hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171` limit 5; Error: VALIDATION ERROR: From line 1, column 18 to line 1, column 23: Table 'hdfs./user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171' not foundSQL Query null不要擔心–您可以將它們聲明為查詢語法的一部分。
0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT text FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171`(type => 'json')) limit 5; +------+ | text | +------+ | RT @jjkukrl: susu bayi jg lagi mahal nih ugh ayah harus semangat cari duit ^^9 https://t.co/2NvTOShRbI | | Oracle Java 1Z0-808 Web Exam Simulator https://t.co/tZ3gU8EMj3 | | @TribuneSelatan ahaha kudu gaya atuh da arek lebarann ahahaha | | Short impression of yesterday's speech. What a great day it was! #lifeatoracle #team #salesincentive #oracle https://t.co/SVK2ovOe3U | | Want to work at Oracle? We're #hiring in New York! Click for details: https://t.co/NMTo1WMHVw #Sales #Job #Jobs #CareerArc | +------+ 5 rows selected (1.267 seconds)存儲配置– Oracle
根據文檔,可以很容易地查詢RDBMS(例如Oracle)中的數據。 只需將JDBC驅動程序復制到Apache Drill的jar文件夾中:
cp /u01/app/oracle/product/12.1.0.2/dbhome_1/jdbc/lib/ojdbc7.jar /opt/apache-drill-1.7.0/jars/3rdparty/然后添加必要的存儲配置,我稱之為ora:
{"type": "jdbc","driver": "oracle.jdbc.OracleDriver","url": "jdbc:oracle:thin:moviedemo/welcome1@localhost:1521/ORCL","username": null,"password": null,"enabled": true }如果遇到錯誤, Please retry: error (unable to create/ update storage)然后檢查目標Oracle數據庫是否已啟動,密碼是否正確,等等。
然后,您可以在Hive中查詢數據:
0: jdbc:drill:zk=local> use ora.MOVIEDEMO; +-------+--------------------------------------------+ | ok | summary | +-------+--------------------------------------------+ | true | Default schema changed to [ora.MOVIEDEMO] | +-------+--------------------------------------------+ 1 row selected (0.205 seconds)0: jdbc:drill:zk=local> show tables; +----------------+-----------------------------+ | TABLE_SCHEMA | TABLE_NAME | +----------------+-----------------------------+ | ora.MOVIEDEMO | ACTIVITY | | ora.MOVIEDEMO | BDS_CUSTOMER_RFM | | ora.MOVIEDEMO | BUSINESS_REVIEW_SUMMARY | [...]0: jdbc:drill:zk=local> select * from ACTIVITY limit 5; +--------------+---------+ | ACTIVITY_ID | NAME | +--------------+---------+ | 3.0 | Pause | | 6.0 | List | | 7.0 | Search | | 8.0 | Login | | 9.0 | Logout | +--------------+---------+ 5 rows selected (1.644 seconds)如果出現錯誤:DATA_READ錯誤:嘗試設置SQL查詢時JDBC存儲插件失敗。 然后在Apache Drill中啟用詳細錯誤,以查看問題所在:
0: jdbc:drill:zk=local> ALTER SESSION SET `exec.errors.verbose` = true; +-------+-------------------------------+ | ok | summary | +-------+-------------------------------+ | true | exec.errors.verbose updated. | +-------+-------------------------------+ 1 row selected (0.154 seconds)0: jdbc:drill:zk=local> select * from ora.MOVIEDEMO.YELP_BUSINESS limit 1; Error: DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query.sql SELECT * FROM "MOVIEDEMO"."YELP_BUSINESS" plugin ora Fragment 0:0[Error Id: 40343dd5-1354-48ed-90ef-77ae1390411b on bigdatalite.localdomain:31010](java.sql.SQLException) ORA-29913: error in executing ODCIEXTTABLEOPEN callout ORA-29400: data cartridge error KUP-11504: error from external driver: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused這是Oracle正在查詢的外部表的問題( ORA-29913: error in executing ODCIEXTTABLEOPEN )。 它實際上是一個Hive表之上的Oracle外部表,顯然Drill可以直接查詢-但是,我們只是在這里進行沙箱測試…
查詢執行
就像Oracle有其基于成本的優化程序(CBO)來幫助它確定如何執行查詢以及最有效地執行該查詢一樣,Apache Drill的執行引擎也可以確定如何實際執行您提供給它的查詢。 這還包括如何將其拆分為多個節點(“鉆頭”)(如果可用),以及在某些情況下進行優化(例如分區修剪) 。 您可以在此處詳細了解查詢執行的工作原理 ,并在此處觀看有關它的詳細說明 。 要查看查詢的解釋計劃,請使用explain plan :
0: jdbc:drill:zk=local> !set maxwidth 10000 0: jdbc:drill:zk=local> explain plan for select `date`,count(*) as tip_count from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5; +------+------+ | text | json | +------+------+ | 00-00 Screen 00-01 Project(date=[$0], tip_count=[$1]) 00-02 SelectionVectorRemover 00-03 Limit(fetch=[5]) 00-04 SelectionVectorRemover 00-05 TopN(limit=[5]) 00-06 HashAgg(group=[{0}], tip_count=[$SUM0($1)]) 00-07 HashAgg(group=[{0}], tip_count=[COUNT()]) 00-08 Scan(groupscan=[EasyGroupScan [selectionRoot=hdfs://localhost:8020/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json, numFiles=1, columns=[`date`], files=[hdfs://localhost:8020/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json]]]) | {"head" : {"version" : 1,"generator" : {"type" : "ExplainHandler","info" : ""[...]您還可以使用Drill Web界面查看有關查詢執行方式的信息:
鉆探器
MapR Drill ODBC驅動程序附帶一個名為Drill Explorer的工具。 這是一個GUI,使您可以通過導航數據庫(==存儲插件)和其中的文件夾/文件,預覽數據甚至創建視圖來瀏覽數據。
鉆探客戶
在Drill客戶端中,有多種可用設置:
0: jdbc:drill:zk=local> !set autocommit true autosave false color true fastconnect true force false headerinterval 100 historyfile /home/oracle/.sqlline/history incremental true isolation TRANSACTION_REPEATABLE_READ maxcolumnwidth 15 maxheight 56 maxwidth 1000000 numberformat default outputformat table propertiesfile /home/oracle/.sqlline/sqlline.properties rowlimit 0 showelapsedtime true showheader true shownestederrs false showwarnings true silent false timeout -1 trimscripts true verbose false要更改一個,例如顯示的輸出寬度:
0: jdbc:drill:zk=local> !set maxwidth 10000要連接到遠程Drill,請指定存儲Drillbit連接信息的Zookeeper節點:
rmoff@asgard-3:apache-drill-1.7.0> bin/sqlline -u jdbc:drill:zk=cdh57-01-node-01.moffatt.me:2181,cdh57-01-node-02.moffatt.me:2181,cdh57-01-node-03.moffatt.me:2181結論
Apache Drill是一個功能強大的工具,用于針對不同的數據源使用熟悉的查詢語言(SQL)。 在小范圍內,僅能夠通過JSON之類的結構化文件進行切片和切塊是一個巨大的勝利。 在更大范圍內,嘗試在一組計算機集群中查詢更大數量的數據時與Apache Drill進行比較是很有意思的,也許可以與諸如Impala之類的工具進行比較。
有關Apache Drill的更多信息,請參見如何從OBIEE中訪問Drill。
翻譯自: https://www.javacodegeeks.com/2016/09/guide-getting-started-apache-drill.html
總結
以上是生活随笔為你收集整理的如何指南:Apache Drill入门的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 专访腾讯多媒体实验室商世东:咬咬牙也要跟
- 下一篇: 双11电商手机大战盘点,大神荣耀各成势力