Inside VertiPaq in Power BI: Compress for Success!
Have you ever wondered what makes Power BI so fast and powerful when it comes to performance? So powerful, that it performs complex calculations over millions of rows in a blink of an eye.
In this series of articles, we will dig deep to discover what is “under the hood” of Power BI, how your data is being stored, compressed, queried, and finally, brought back to your report. Once you finish reading, I hope that you will get a better understanding of the hard work happening in the background and appreciate the importance of creating an optimal data model in order to get maximum performance from the Power BI engine.
As you might recall, in the previous article we scratched the surface of VertiPaq, a powerful storage engine, which is “responsible” for the blazing-fast performance of most of your Power BI reports (whenever you are using Import mode or Composite model).
3, 2, 1… Fasten your seatbelts!
One of the key characteristics of VertiPaq is that it is a columnar database. We learned that columnar databases store data in a way that is optimized for vertical scanning, which means that every column has its own structure and is physically separated from the other columns.
Photo by Dave Hoefler on Unsplash
That fact enables VertiPaq to apply different types of compression to each of the columns independently, choosing the optimal compression algorithm based on the values in that specific column.
Compression is achieved by encoding the values within the column. But, before we dive deeper into a detailed overview of encoding techniques, keep in mind that this architecture is not exclusive to Power BI: in the background is a Tabular model, which is also “under the hood” of SSAS Tabular and Excel Power Pivot.
Value Encoding
This is the most desirable encoding type since it works exclusively with integers and, therefore, requires less memory than, for example, working with text values.
How does this look in reality? Let’s say we have a column containing the number of phone calls per day, and the values in this column vary from 4,000 to 5,000. VertiPaq would take the minimum value in this range (4,000) as a starting point, then calculate the difference between this value and every other value in the column, and store that difference instead of the original value.
Values up to 5,000 need 13 bits each, whereas differences of at most 1,000 fit into 10 bits. At first glance, a saving of 3 bits per value might not look significant, but multiply it by millions or even billions of rows and you will appreciate the amount of memory saved.
As I already stressed, Value encoding is applied exclusively to columns of integer data type (the currency data type is also stored as an integer).
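To make the arithmetic concrete, here is a minimal Python sketch of the idea. It is not VertiPaq's actual implementation; the function names and the sample data are my own assumptions, used only to show where the 3-bit saving comes from.

```python
def bits_needed(max_value: int) -> int:
    """Bits required to represent any integer in the range 0..max_value."""
    return max(1, max_value.bit_length())


def value_encode(values: list[int]) -> tuple[int, list[int]]:
    """Return (base, offsets), where base is the minimum and offset = value - base."""
    base = min(values)
    return base, [v - base for v in values]


# Made-up sample of phone calls per day, between 4,000 and 5,000.
calls_per_day = [4000, 4350, 4999, 5000, 4120]
base, offsets = value_encode(calls_per_day)

print(base)                              # 4000
print(offsets)                           # [0, 350, 999, 1000, 120]
print(bits_needed(max(calls_per_day)))   # 13 bits for the raw values
print(bits_needed(max(offsets)))         # 10 bits for the stored differences

# Decoding only requires adding the base back to each offset.
decoded = [base + o for o in offsets]
```

The narrower the range between the minimum and the maximum, the fewer bits each stored difference needs.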
Hash Encoding (Dictionary Encoding)
This is probably the compression type VertiPaq uses most often. With Hash encoding, VertiPaq creates a dictionary of the distinct values within one column and afterward replaces the “real” values with index values from the dictionary.
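Here is an example to make things clearer:
[Figure: a Subjects column, the dictionary built from its distinct values, and the column stored as dictionary indexes]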
As you may notice, VertiPaq identified distinct values within the Subjects column, built a dictionary by assigning indexes to those values, and finally stored index values as pointers to “real” values. I assume you are aware that integer values require way less memory space than text, so that’s the logic behind this type of data compression.
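The same idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the subject names are made up, and the way indexes are assigned (in order of first appearance) is a simplification, not a description of VertiPaq internals.

```python
def dictionary_encode(column):
    """Return (dictionary, encoded), where encoded holds an index per row."""
    dictionary = []   # index -> distinct value
    index_of = {}     # distinct value -> index
    encoded = []
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        encoded.append(index_of[value])
    return dictionary, encoded


# Hypothetical Subjects column.
subjects = ["Math", "Physics", "Math", "Chemistry", "Physics", "Math"]
dictionary, encoded = dictionary_encode(subjects)

print(dictionary)  # ['Math', 'Physics', 'Chemistry']
print(encoded)     # [0, 1, 0, 2, 1, 0]

# The indexes act as pointers back to the "real" values.
decoded = [dictionary[i] for i in encoded]
```

Each distinct text value is stored only once, while the rows themselves hold small integers.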
Additionally, by being able to build a dictionary for any data type, VertiPaq is practically data type independent!
This brings us to another key takeaway: no matter if your column is of text, bigint or float data type, from the VertiPaq perspective it is the same. It needs to create a dictionary for each of those columns, which implies that all these columns will provide the same performance, both in terms of speed and memory space allocated! Of course, this assumes that the dictionary sizes of these columns do not differ much.
So, it’s a myth that the data type of the column affects its size within the data model. On the contrary, it is the number of distinct values within the column, known as cardinality, that mostly influences column memory consumption.
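A rough way to see why cardinality matters more than data type is to count how many bits a dictionary index needs. The sketch below is a simplified model of my own (it ignores the size of the dictionary itself and any further compression); the columns are made-up data.

```python
def index_bits(column) -> int:
    """Bits needed per row to point into a dictionary of the column's distinct values."""
    distinct = len(set(column))
    return max(1, (distinct - 1).bit_length())


# Two columns of different data types but the same cardinality (4 distinct values).
text_column = ["red", "green", "blue", "black"] * 250_000
float_column = [0.5, 1.25, 2.75, 9.0] * 250_000
# One integer column where every value is distinct.
id_column = list(range(1_000_000))

print(index_bits(text_column))   # 2
print(index_bits(float_column))  # 2
print(index_bits(id_column))     # 20
```

In this model the text and float columns cost the same per row, while the high-cardinality column costs ten times more, regardless of its data type.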
RLE (Run-Length Encoding)
The third algorithm (RLE) creates a kind of mapping table that contains ranges of repeating values, so that it avoids storing every single (repeated) value separately.
Again, taking a look at an example will help to better understand this concept:
[Figure: a column of repeating subjects and its RLE mapping table with Start and Count values]
In real life, VertiPaq doesn’t store Start values, because it can quickly calculate where the next node begins by summing previous Count values.
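Here is a small Python sketch of run-length encoding that also shows why Start values do not need to be stored. The column contents are made up, and the 0-based start positions are my own convention for the illustration.

```python
from itertools import accumulate, groupby


def rle_encode(column):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def run_starts(runs):
    """Derive each run's start position by summing the counts of the previous runs."""
    counts = [count for _, count in runs]
    return list(accumulate(counts, initial=0))[:-1]


# Hypothetical column with long runs of repeated values.
column = ["Math"] * 4 + ["Physics"] * 3 + ["Chemistry"] * 5
runs = rle_encode(column)

print(runs)              # [('Math', 4), ('Physics', 3), ('Chemistry', 5)]
print(run_starts(runs))  # [0, 4, 7]
```

Twelve rows collapse into three pairs, and the start of every run falls out of a running total of the counts.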
As powerful as it might look at first glance, the RLE algorithm is highly dependent on the ordering within the column. If the data is stored in long runs of identical values, as in the example above, RLE will perform great. However, if the runs are short and the values alternate frequently, RLE will not be an optimal solution.
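The effect of ordering is easy to demonstrate by counting runs, since RLE has to store one entry per run. This is an illustrative sketch with made-up data, not a measurement of VertiPaq itself.

```python
from itertools import groupby


def count_runs(column) -> int:
    """Number of (value, count) pairs RLE would have to store."""
    return sum(1 for _ in groupby(column))


# The same 3,000 values, once sorted and once alternating on every row.
sorted_column = ["A"] * 1000 + ["B"] * 1000 + ["C"] * 1000
interleaved_column = ["A", "B", "C"] * 1000

print(count_runs(sorted_column))       # 3
print(count_runs(interleaved_column))  # 3000
```

Both columns have the same cardinality and the same number of rows, yet the interleaved one gains nothing from RLE.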
One more thing to keep in mind regarding RLE: in reality, VertiPaq doesn’t store data the way it is shown in the illustration above. First, it performs Hash encoding, creating a dictionary of the subjects, and then applies the RLE algorithm to the dictionary indexes, so the final logic, in its most simplified form, would be something like this:
[Figure: the dictionary of subjects, with RLE applied to the dictionary indexes]
So, RLE occurs after Value or Hash encoding, in those scenarios where VertiPaq “thinks” that it makes sense to compress the data additionally (that is, when the data is ordered in such a way that RLE would achieve better compression).
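The two-step logic can be sketched as follows. This is a simplified illustration under my own assumptions (made-up subjects, indexes assigned in order of first appearance), not the engine's real storage layout.

```python
from itertools import groupby


def hash_then_rle(column):
    """Dictionary-encode the column, then run-length encode the indexes."""
    dictionary = list(dict.fromkeys(column))  # distinct values, first-seen order
    index_of = {value: i for i, value in enumerate(dictionary)}
    indexes = [index_of[value] for value in column]
    runs = [(idx, sum(1 for _ in run)) for idx, run in groupby(indexes)]
    return dictionary, runs


def decode(dictionary, runs):
    """Expand (index, count) pairs back into the original values."""
    return [dictionary[idx] for idx, count in runs for _ in range(count)]


column = ["Math"] * 4 + ["Physics"] * 3 + ["Math"] * 2
dictionary, runs = hash_then_rle(column)

print(dictionary)  # ['Math', 'Physics']
print(runs)        # [(0, 4), (1, 3), (0, 2)]
restored = decode(dictionary, runs)
```

The text values live only in the dictionary, and the rows are reduced to a short list of index and count pairs.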
Re-Encoding considerations
No matter how “smart” VertiPaq is, it can also make some bad decisions, based on incorrect assumptions. Before I explain how re-encoding works, let me briefly walk through the process of data compression for a specific column:
- VertiPaq scans a sample of rows from the column
- If the column data type is not an integer, it will look no further and use Hash encoding
- If the column is of integer data type, some additional parameters are evaluated: if the numbers in the sample increase linearly, VertiPaq assumes that the column is probably a primary key and chooses Value encoding
- If the numbers in the column are reasonably close to each other (the number range is not very wide, like in our example above with 4,000 to 5,000 phone calls per day), VertiPaq will use Value encoding. On the contrary, when the values fluctuate significantly within the range (for example between 1,000 and 1,000,000), Value encoding doesn’t make sense and VertiPaq will apply the Hash algorithm
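The steps above can be sketched as a small decision function. Please treat it as a hypothetical illustration only: the function name, the interpretation of “linearly increasing” as a constant positive step, and the `max_range` threshold are all my assumptions, since the real rules and thresholds used by VertiPaq are not described here.

```python
def choose_encoding(sample, is_integer: bool, max_range: int = 100_000) -> str:
    """Pick 'value' or 'hash' for a non-empty sample of rows (hypothetical rules)."""
    # Non-integer columns always get Hash encoding.
    if not is_integer:
        return "hash"
    # A constant positive step between rows looks like a primary key.
    steps = {b - a for a, b in zip(sample, sample[1:])}
    if len(steps) == 1 and next(iter(steps)) > 0:
        return "value"
    # A narrow range of values suits Value encoding (assumed threshold).
    if max(sample) - min(sample) <= max_range:
        return "value"
    return "hash"


print(choose_encoding(["Math", "Physics"], is_integer=False))      # hash
print(choose_encoding([1, 2, 3, 4, 5], is_integer=True))           # value
print(choose_encoding([4000, 4800, 4200, 5000], is_integer=True))  # value
print(choose_encoding([1000, 1_000_000, 52_000], is_integer=True)) # hash
```

Because the decision is based on a sample, it can turn out to be wrong for the rest of the column.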
However, it can sometimes happen that VertiPaq decides which algorithm to use based on the sample data, but then some outlier pops up and it needs to re-encode the column from scratch.
Let’s use our previous example for the number of phone calls: VertiPaq scans the sample and chooses to apply Value encoding. Then, after processing 10 million rows, all of a sudden it finds the value 50,000 (it can be an error, or whatever). Now, VertiPaq re-evaluates the choice and it can decide to re-encode the column using the Hash algorithm instead. Surely, that would impact the whole process in terms of the time needed for reprocessing.
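A toy simulation of this scenario might look like the sketch below. Everything in it is assumed for illustration: the sample size, the range threshold, the rule that triggers the switch, and the smaller row count (100,000 instead of 10 million, just to keep it quick to run).

```python
def process_column(rows, sample_size: int = 1000, max_range: int = 10_000):
    """Choose an encoding from a sample, then switch to hash if an outlier appears.

    Returns (final_encoding, position_of_the_row_that_forced_re_encoding_or_None).
    """
    sample = rows[:sample_size]
    low, high = min(sample), max(sample)
    encoding = "value" if high - low <= max_range else "hash"
    reencoded_at = None
    if encoding == "value":
        for position, value in enumerate(rows):
            low, high = min(low, value), max(high, value)
            if high - low > max_range:
                # The range no longer fits: start over with Hash encoding.
                encoding = "hash"
                reencoded_at = position
                break
    return encoding, reencoded_at


# Phone calls per day between 4,000 and 5,000, with one outlier late in the data.
clean_rows = [4000 + (i % 1001) for i in range(100_000)]
rows_with_outlier = list(clean_rows)
rows_with_outlier[90_000] = 50_000

print(process_column(clean_rows))         # ('value', None)
print(process_column(rows_with_outlier))  # ('hash', 90000)
```

All the work done on the rows before the outlier is wasted, which is why cleaning obvious outliers before loading can save processing time.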
Conclusion
In this part of the series on the “brain & muscles” behind Power BI, we dived deep into different data compression algorithms which VertiPaq performs in order to optimize our data model.
Finally, here is the list of parameters (in order of importance) that VertiPaq considers when choosing which algorithm to use:
- Number of distinct values in the column (cardinality)
- Data distribution in the column: a column with many repeating values can be compressed better than one containing frequently changing values (RLE can be applied)
- Number of rows in the table
- Column data type: impacts only the dictionary size
In the next article, I will introduce some techniques for reducing the data model size and, consequently, getting better overall performance from your Power BI report.
Translated from: https://towardsdatascience.com/inside-vertipaq-in-power-bi-compress-for-success-68b888d9d463