Kylin's Cube Model

Published: 2025/4/16

1. Data Warehouse Concepts

OLAP

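Most database systems are mainly used for online transaction and query processing, known as OLTP (Online Transaction Processing). OLTP systems are customer-oriented and are used by clerks, DBAs, and the like. A data warehouse instead serves knowledge workers (managers, supervisors, and so on) with data analysis, known as OLAP (Online Analytical Processing). OLTP manages current data that is too detailed to be easily used for decision making, whereas OLAP manages large amounts of historical data, provides summarization and aggregation, and stores and manages information along different dimensions and at different granularities.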

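Feature	OLTP	OLAP
Orientation	clerks, DBAs	knowledge workers
DB design	ER-based, application-oriented	star/snowflake, subject-oriented
Data	current, guaranteed up to date	historical, maintained over time
View	detailed, flat relational	summarized, multidimensional
Access	read/write	mostly read
Metric	transaction throughput	query throughput, response time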

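A simple example: we would use OLTP to manage the mapping between app names and app categories, and OLAP to analyze the UV (unique visitors) of apps (and app categories) in a given week. OLAP also offers multidimensional views of the data, for example the top 100 apps by user count on Huawei phones in a given week.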

Fact Table

The fact table is the central table. It holds the bulk of the data without redundancy, and its columns fall into two kinds:

columns holding the facts themselves;
foreign keys that correspond to the primary keys of the dimension tables.

Lookup Table

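A lookup table holds fields that elaborate on certain columns of the fact table. The sample cube in Kylin's quick start (kylin_sales_cube) uses the purchase records as its fact table and has two lookup tables: one elaborates on the purchase date PART_DT, and the other on the item's LEAF_CATEG_ID and LSTG_SITE_ID fields.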

Dimension

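A dimension table is a table logically abstracted from the fact table and the lookup tables. It contains several related columns (the dimensions) that provide multidimensional views of the data; the number of distinct values a dimension takes is called its cardinality. In kylin_sales_cube, the fact-table column LSTG_FORMAT_NAME is pulled out as a dimension of its own and can be combined with other dimensions for analysis.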
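As a small illustration of cardinality, the sketch below counts the distinct values of each column in a handful of rows. The rows are made up for the example; only the column names are taken from kylin_sales.

```python
from collections import defaultdict


def cardinalities(rows):
    """Return {column: number of distinct values} for a list of dict rows."""
    seen = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            seen[column].add(value)
    return {column: len(values) for column, values in seen.items()}


# Hypothetical rows; only the column names come from kylin_sales.
rows = [
    {"LSTG_FORMAT_NAME": "Auction", "LSTG_SITE_ID": 0},
    {"LSTG_FORMAT_NAME": "FP-GTC", "LSTG_SITE_ID": 0},
    {"LSTG_FORMAT_NAME": "Auction", "LSTG_SITE_ID": 3},
]

print(cardinalities(rows))  # {'LSTG_FORMAT_NAME': 2, 'LSTG_SITE_ID': 2}
```

A dimension with very high cardinality (for example a user id) makes every cuboid that contains it large, which is one reason to look at cardinality before choosing dimensions.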

Star Schema

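A star schema consists of a central fact table and a set of dimension tables, where the primary key of each dimension table corresponds to a foreign key in the fact table. The layout looks like a starburst, with the dimension tables on rays around the fact table. The figure below is a star schema I built from one data source:

[Figure: star schema]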
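A minimal sketch of a star schema, using Python's built-in sqlite3 module and the app-UV example from the OLAP section. The table names (fact_usage, dim_app, dim_device), columns, and rows are all hypothetical; they only illustrate how the fact table's foreign keys point at the dimension tables' primary keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per app / device, keyed by a primary key.
CREATE TABLE dim_app    (app_id INTEGER PRIMARY KEY, app_name TEXT, category TEXT);
CREATE TABLE dim_device (device_id INTEGER PRIMARY KEY, brand TEXT);

-- Fact table: one row per usage event, with foreign keys into the dimensions.
CREATE TABLE fact_usage (
    week      TEXT,
    user_id   INTEGER,
    app_id    INTEGER REFERENCES dim_app(app_id),
    device_id INTEGER REFERENCES dim_device(device_id)
);

INSERT INTO dim_app VALUES (1, 'ChatA', 'social'), (2, 'ChatB', 'social'), (3, 'MapC', 'travel');
INSERT INTO dim_device VALUES (10, 'Huawei'), (11, 'Other');
INSERT INTO fact_usage VALUES
    ('2016-W01', 100, 1, 10),
    ('2016-W01', 100, 2, 10),
    ('2016-W01', 101, 1, 10),
    ('2016-W01', 102, 3, 11),
    ('2016-W01', 103, 3, 10);
""")


def uv_by_category(conn, week, brand):
    """Unique visitors per app category for one week and one device brand."""
    sql = """
        SELECT a.category, COUNT(DISTINCT f.user_id) AS uv
        FROM fact_usage f
        JOIN dim_app a    ON f.app_id = a.app_id
        JOIN dim_device d ON f.device_id = d.device_id
        WHERE f.week = ? AND d.brand = ?
        GROUP BY a.category
        ORDER BY a.category
    """
    return conn.execute(sql, (week, brand)).fetchall()


print(uv_by_category(conn, "2016-W01", "Huawei"))  # [('social', 2), ('travel', 1)]
```

User 100 uses two social apps but is counted once, which is why UV needs a distinct count and not a plain row count.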

Cube

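A cube is the set of all combinations of dimensions, and any one combination of dimensions is called a cuboid. A cube with \(n\) dimensions therefore has \(2^n\) cuboids, as the figure below shows:

[Figure: the cuboids of a cube]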
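The \(2^n\) count can be checked by enumerating every subset of the dimensions. The sketch below does this for three placeholder dimensions named A, B, and C; the function name is illustrative and is not part of Kylin.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every combination of dimensions, from the empty one to the full set."""
    dims = list(dimensions)
    return [
        combo
        for size in range(len(dims) + 1)
        for combo in combinations(dims, size)
    ]


all_cuboids = cuboids(["A", "B", "C"])
for combo in all_cuboids:
    print(combo)

print(len(all_cuboids))  # 8, i.e. 2**3
```

The empty combination is the fully aggregated cuboid (no GROUP BY), and the full combination is the base cuboid with every dimension present.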

2. Introduction to Kylin

Dimension

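To reduce the number of cuboids, Kylin divides dimensions into four types:

Normal: the most common type; it combines with all other dimensions to form cuboids.
Mandatory: a dimension that appears in every query. For example, if A is a mandatory dimension alongside B and C, the three form only 4 cuboids, half as many as with normal dimensions (\(2^3=8\)).
Hierarchy: a dimension with levels, such as province->city or year->quarter->month->week->day; used for drill-down.
Derived: a dimension whose values are determined by the primary key of a lookup table, which cuts the number of cuboids even more effectively (the original post links to a detailed explanation). A derived dimension can only be built from columns of a lookup table.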
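The effect of the mandatory and hierarchy types on the cuboid count can be sketched by filtering the full set of combinations. This is an illustration, not Kylin's implementation. The hierarchy rule used here (a level may appear only together with all of its ancestors) is an assumption about how such pruning is usually described, and the dimension names are placeholders.

```python
from itertools import combinations


def powerset(dims):
    """All combinations of the given dimensions, including the empty one."""
    dims = list(dims)
    return [c for size in range(len(dims) + 1) for c in combinations(dims, size)]


def with_mandatory(dims, mandatory):
    """Keep only the cuboids that contain the mandatory dimension."""
    return [c for c in powerset(dims) if mandatory in c]


def with_hierarchy(dims, hierarchy):
    """Keep cuboids in which the hierarchy levels present form a prefix of the hierarchy.

    Assumption: a level is only useful together with all of its ancestors,
    so for A->B->C only {}, {A}, {A,B}, {A,B,C} survive.
    """
    def is_prefix(combo):
        present = [level in combo for level in hierarchy]
        # Valid patterns look like True, ..., True, False, ..., False.
        return present == sorted(present, reverse=True)

    return [c for c in powerset(dims) if is_prefix(c)]


print(len(powerset(["A", "B", "C"])))                          # 8
print(len(with_mandatory(["A", "B", "C"], "A")))               # 4
print(len(with_hierarchy(["A", "B", "C"], ["A", "B", "C"])))   # 4
```

Under this assumption a hierarchy of k levels contributes k + 1 combinations where k normal dimensions would contribute \(2^k\), so the saving grows quickly with the depth of the hierarchy.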

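However, Kylin does not enforce set containment on hierarchy dimensions. For example, kylin_sales_cube defines the hierarchy META_CATEG_NAME->CATEG_LVL2_NAME->CATEG_LVL3_NAME, yet the same CATEG_LVL2_NAME can belong to different META_CATEG_NAME values. Hierarchy is therefore of little practical value, so much so that it was slated to be dropped from Kylin's backend processing (see Li Yang's message on the mailing list):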

@Julian, plan to refactor the underlying aggregation group in Q4. Will drop
hierarchy concept in the backend, however in the frontend for ease of
understanding, may still call it hierarchy.

Measure

A measure is an aggregate computed over a fact-table column. Kylin provides functions such as:

Sum
Count
Max
Min
Average
Distinct Count (based on HyperLogLog)

which are generally used together with GROUP BY on dimensions.
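To give a feel for why Distinct Count is listed as HyperLogLog-based, here is a minimal sketch of the textbook HyperLogLog estimator. It is not Kylin's HyperLogLogPlusCounter; the class name, the choice of SHA-1 as the hash, and the register count (2^10) are assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog: approximate distinct count in fixed memory."""

    def __init__(self, p=10):
        self.p = p                  # number of bits used to pick a register
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an arbitrary choice).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        index = h >> (64 - self.p)                  # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias correction for m >= 128
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"seller-{i}")
    hll.add(f"seller-{i}")          # duplicates do not change the registers

print(round(hll.count()))           # an estimate close to 10000, not an exact count
```

With 1024 registers the expected relative error is roughly 3%, which is the trade-off behind an approximate distinct count: a small fixed amount of state per cuboid cell in exchange for an estimate.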

3. Hands-on

The SQL statements below were run after kylin_sales_cube had been built successfully.

Running select * from kylin_sales returns the columns that the cube keeps for the fact table: the dimension keys and the fields that the measures are computed from.

Total sales and number of distinct sellers for each date:

select part_dt, sum(price) as total_selled, count(distinct seller_id) as sellers from kylin_sales group by part_dt order by part_dt

Query the same figures for a single date:

select part_dt, sum(price) as total_selled, count(distinct seller_id) as sellers from kylin_sales where part_dt = '2014-01-01' group by part_dt

This query fails with an error:

Error while compiling generated Java code: public static class Record3_0 implements java.io.Serializable { public java.math.BigDecimal f0; public boolean f1; public org.apache.kylin.common.hll.HyperLogLogPlusCounter f2; public Record3_0(java.math.BigDecimal f0, boolean f1, ...

This happens because part_dt is of type date and the string literal is not converted to a date correctly. Rewrite the statement as:

select part_dt, sum(price) as total_selled, count(distinct seller_id) as sellers from kylin_sales where part_dt between '2014-01-01' and '2014-01-01' group by part_dt
-- or
select part_dt, sum(price) as total_selled, count(distinct seller_id) as sellers from kylin_sales where part_dt = date '2014-01-01' group by part_dt

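The queries above use only the fact table, not the lookup tables. To break the figures down by level-2 product category for each date (here, the number of distinct sellers), the fact table must be inner-joined with the lookup table: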

select fact.part_dt, lookup.CATEG_LVL2_NAME, count(distinct seller_id) as sellers from kylin_sales fact inner join KYLIN_CATEGORY_GROUPINGS lookup on fact.LEAF_CATEG_ID = lookup.LEAF_CATEG_ID and fact.LSTG_SITE_ID = lookup.SITE_ID group by fact.part_dt, lookup.CATEG_LVL2_NAME order by fact.part_dt desc

4. References

[1] Jiawei Han, Data Mining: Concepts and Techniques.
[2] 教练_我要踢球, OLAP Engine: An Introduction to Kylin.
[3] Kylin, Design Cube in Kylin.

Reposted from: https://www.cnblogs.com/en-heng/p/cube-model-of-kylin.html
