VC: Everything About Scatter Plots
Scatterplots are among the most popular visualization techniques. Their purpose is to reveal clusters and correlations in pairs of variables. There are many variations of scatterplots; we will look at some of them.
Strip Plots
Scatter plots in which one attribute is categorical are called 'strip plots'. When all points of a category are plotted on a single line, the individual points are hard to see, so we spread them slightly. We can also divide the data points based on a given label.
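The slight spreading described above is commonly called jitter. A minimal sketch, assuming each category sits at an integer position and each point is shifted by a small uniform random offset; the function name `jitter_positions` and the `width` parameter are illustrative and not taken from the original post:

```python
import random

def jitter_positions(categories, width=0.2, seed=0):
    """Map each categorical value to an axis position plus a small random offset.

    `width` (an assumed parameter) is the maximum offset on either side
    of the category's integer position.
    """
    rng = random.Random(seed)
    # Assign integer positions in order of first appearance.
    index = {}
    for c in categories:
        index.setdefault(c, len(index))
    return [index[c] + rng.uniform(-width, width) for c in categories]

labels = ["a", "b", "a", "c", "b", "a"]  # hypothetical categorical attribute
xs = jitter_positions(labels)
```

The returned positions can be used as the coordinate on the categorical axis, with the numeric attribute on the other axis.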
Scatterplot Matrices (SPLOM)
Figure: Scatterplot matrix (SPLOM).
A SPLOM produces scatterplots for all pairs of variables and places them into a matrix. For p variables, the total number of unique scatterplots is (p^2 - p)/2. The diagonal is usually filled with a KDE or a histogram. As the figure shows, the scatterplots appear in a certain order. Does the order matter? It cannot change the values, of course, but it can affect what people perceive.
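The count of unique scatterplots can be checked by enumerating the unordered pairs of variables. A small sketch, where the column names are made up for illustration:

```python
from itertools import combinations

def unique_pairs(variables):
    """Return every unordered pair of distinct variables (one scatterplot each)."""
    return list(combinations(variables, 2))

variables = ["age", "height", "weight", "income"]  # hypothetical column names
pairs = unique_pairs(variables)
p = len(variables)
print(len(pairs), (p**2 - p) // 2)  # both counts are 6 for p = 4
```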
Figure: Ordering matters. Image taken from [Peng et al. 2004].
We therefore need to consider the order. [Peng et al. 2004] suggest an ordering in which similar scatterplots are located close to each other. They distinguish between high-cardinality and low-cardinality dimensions (a dimension is high-cardinality when its number of possible values exceeds the number of points) and sort low-cardinality dimensions by their number of values. Orderings of high-cardinality dimensions are rated by their correlation, using the Pearson correlation coefficient.
Figure: Clutter measure.
The clutter measure is applied to all pairs of (x, y) scatterplots. It calculates all correlations and compares them for each pair (x, y) of high-cardinality dimensions. If the result is smaller than a threshold, that scatterplot is chosen as an important one. However, this takes a lot of computing power: the complexity is O(p^2 * p!). The authors therefore suggest random swapping: swap dimensions at random, keep the ordering with the smallest clutter, and repeat.
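The random-swapping idea can be sketched as a simple local search. Note the assumptions: the `clutter` function below is a stand-in (it charges more for neighboring dimensions that correlate weakly) and is not the measure defined by Peng et al.; the function names, the iteration count, and the sample data are illustrative only.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def clutter(order, columns):
    """Stand-in clutter score (an assumption): weakly correlated neighbors cost more."""
    return sum(1 - abs(pearson(columns[a], columns[b]))
               for a, b in zip(order, order[1:]))

def random_swap_order(columns, iterations=200, seed=0):
    """Swap two random dimensions; keep the swap only if clutter does not increase."""
    rng = random.Random(seed)
    order = list(columns)
    best = clutter(order, columns)
    for _ in range(iterations):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        score = clutter(order, columns)
        if score <= best:
            best = score
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
    return order, best

# Hypothetical data: four dimensions with five observations each.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [5, 3, 4, 1, 2],
    "c": [2, 4, 6, 8, 11],
    "d": [4, 4, 3, 2, 1],
}
order, best = random_swap_order(columns)
```

Each iteration costs one clutter evaluation, so the search avoids enumerating all p! orderings at the price of possibly stopping in a local optimum.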
Selecting Good Views
Correlation alone is not enough to choose good scatterplots when we want to find clusters based on given labels (the labels can also be obtained from clustering).
Figure: Histogram and DSC [Sips et al. 2009].
Without the labels in the left graph, you could pick either the x-axis projection or the y-axis projection, because the two hardly differ. With the labels, we can tell that the x-axis projection is the right one. Distance Consistency (DSC) was introduced with this in mind: it checks how good a scatterplot is by how well the classes are separated. The better the separation, the better the scatterplot.
Figures: Cluster centers; the equation to calculate DSC.
First, we calculate the center of each cluster and measure the distance between each data point and each cluster center. If a point is closer to the center of its own cluster than to the center of any other cluster, we count it. We then normalize the count by the number of data points and multiply by 100. This method is similar to k-means clustering. Because it only considers distance, its applicability is limited.
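The procedure above can be sketched in a few lines. This is an illustration under stated assumptions, not the exact formulation from [Sips et al. 2009]: it assumes 2D points, Euclidean distance, and normalization by the number of data points so that the score is a percentage. The function names are hypothetical.

```python
import math

def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dsc(points, labels):
    """Percentage of points whose nearest cluster center is their own cluster's center."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centers = {lab: centroid(ps) for lab, ps in clusters.items()}
    consistent = 0
    for p, lab in zip(points, labels):
        # Find the label of the closest cluster center.
        nearest = min(centers, key=lambda c: math.dist(p, centers[c]))
        if nearest == lab:
            consistent += 1
    return 100 * consistent / len(points)

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
separated = dsc(points, [0, 0, 0, 1, 1, 1])  # labels follow the two groups
mixed = dsc(points, [0, 1, 0, 1, 0, 1])      # labels cut across the groups
```

With labels that follow the two spatial groups every point is consistent, while labels that cut across the groups lower the score.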
Distribution Consistency (DC)
DC can be seen as a refinement of DSC. It computes its score by penalizing local entropy in high-density regions. DSC assumes particular cluster shapes, whereas DC makes no assumption about shape.
Figures: Example of DC [Sips et al. 2009]; the entropy equation.
The entropy equation comes from information theory and measures how much information a specific distribution carries. The data must be estimated with KDE before we apply the entropy function; p(x, y) denotes the KDE. The entropy is 0 when the region we measure contains a single cluster and grows as the region is mixed with other clusters, up to a maximum of log2|C|. Since DC penalizes entropy, more mixing gives a smaller score.
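The bounds 0 and log2|C| can be checked numerically. The sketch below assumes the entropy is taken over the relative per-cluster densities at one location; the function name `class_entropy` is illustrative and the input values are made up.

```python
import math

def class_entropy(densities):
    """Entropy (in bits) of the class mix at one location.

    `densities` holds one non-negative density value per cluster,
    e.g. the per-cluster KDE evaluated at a point (x, y).
    """
    total = sum(densities)
    if total == 0:
        return 0.0  # empty region: treated here as carrying no class information
    h = 0.0
    for d in densities:
        if d > 0:
            q = d / total  # relative share of this cluster at the location
            h -= q * math.log2(q)
    return h

pure = class_entropy([0.8, 0.0, 0.0])    # only one cluster present
uniform = class_entropy([0.3, 0.3, 0.3]) # three clusters equally mixed
```

Here `pure` is 0 and `uniform` equals log2(3) up to floating-point rounding, matching the minimum and maximum for |C| = 3.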
Figures: Normalization function; DC score function.
We calculate the entropy on the KDE, but we do not want every region to count with the same weight, because many regions are empty. The entropy is therefore weighted by density, and the result is normalized. This gives the DC score. We can then select scatterplots whose score passes a threshold of our choosing.
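Putting the pieces together, a DC-style score can be sketched as a density-weighted average of the local entropy, rescaled to 0..100. Everything specific here is an assumption made for illustration: the regular evaluation grid, the unnormalized Gaussian kernel, the fixed `bandwidth`, and the normalization constant (total density times log2|C|). The original equations appear only as figures, so this should not be read as the exact formula from [Sips et al. 2009].

```python
import math

def kernel_density(x, y, points, bandwidth):
    """Unnormalized Gaussian kernel density of `points` at (x, y)."""
    return sum(math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * bandwidth ** 2))
               for px, py in points)

def entropy(densities):
    """Entropy (bits) of the relative per-cluster densities; needs a positive sum."""
    total = sum(densities)
    return -sum((d / total) * math.log2(d / total) for d in densities if d > 0)

def dc_score(points, labels, grid=20, bandwidth=1.0):
    """100 minus the density-weighted, normalized local entropy (sketch)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 100.0  # a single cluster cannot be mixed with anything
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    weighted, weight_sum = 0.0, 0.0
    for i in range(grid):
        for j in range(grid):
            x = x0 + (x1 - x0) * i / (grid - 1)
            y = y0 + (y1 - y0) * j / (grid - 1)
            dens = [kernel_density(x, y, ps, bandwidth) for ps in clusters.values()]
            p_xy = sum(dens)
            if p_xy > 0:  # skip empty regions entirely
                weighted += p_xy * entropy(dens)
                weight_sum += p_xy
    if weight_sum == 0:
        return 100.0
    z = weight_sum * math.log2(len(clusters))  # assumed normalization constant
    return 100 - 100 * weighted / z

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
score_separated = dc_score(points, [0, 0, 0, 1, 1, 1])
score_mixed = dc_score(points, [0, 1, 0, 1, 0, 1])
```

Weighting by density means the nearly empty area between the two groups barely affects the result, so the well-separated labeling scores close to 100 while the mixed labeling scores clearly lower.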
Figure: WHO example of HIV risk groups.
This dataset from the WHO covers 194 countries, 159 attributes, and 6 HIV risk groups. By focusing on plots with DC > 80, the authors can eliminate 97% of the plots. It is a highly efficient method.
The methods above only consider clusters. There are many ways to score other specific patterns, e.g. fraction of outliers, sparsity, and convexity; see [Wilkinson et al. 2006]. PCA can also be used as an alternative way to group similar plots together.
SPLOM Navigation
Figure: The 3D transition between neighboring views [Elmqvist et al. 2008].
Since each plot in a SPLOM shares one axis with its neighboring plots, it is possible to project neighboring plots into 3D space and transition between them.
The limitation of scatterplots: Overdraw
Figures: Overdraw; KDE solution.
Too many data points lead to overdraw. We can solve this with KDE, but then individual points are no longer visible. The second limitation is high-dimensional data, which produces too many scatterplots. We have already discussed solutions to the second problem; now we look at the first.
Splatterplots
Figure: Splatterplots [Mayorga/Gleicher 2013].
Splatterplots combine KDE and scatterplots. High-density regions are represented by color, and low-density regions by individual data points. We need to choose a proper kernel width for the KDE. Splatterplots define the kernel width in screen space, i.e. in terms of how many data points fall within a unit of screen space. However, we still need to choose the density threshold ourselves.
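The split between colored dense regions and individually drawn points can be illustrated with a much simpler stand-in. The sketch below replaces the KDE with plain binning into square screen-space cells, so it is not the Splatterplot algorithm itself; `cell_size` (in pixels) and `threshold` are assumed, user-chosen parameters.

```python
from collections import Counter

def split_dense_sparse(points, cell_size=10, threshold=3):
    """Split screen-space points into dense cells and individually drawn points.

    Cells holding at least `threshold` points would be drawn as a colored
    region; all remaining points would be drawn one by one.
    """
    def cell_of(p):
        return (int(p[0] // cell_size), int(p[1] // cell_size))

    counts = Counter(cell_of(p) for p in points)
    dense_cells = {c for c, n in counts.items() if n >= threshold}
    sparse_points = [p for p in points if cell_of(p) not in dense_cells]
    return dense_cells, sparse_points

# Hypothetical pixel coordinates.
pixels = [(1, 1), (2, 3), (4, 4), (5, 8), (50, 50), (90, 12)]
dense_cells, sparse_points = split_dense_sparse(pixels)
```

Because the cells are defined in pixels, zooming changes which regions count as dense, which mirrors the idea of defining the kernel width in screen space.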
Figure: Zoom of splatterplots [Mayorga/Gleicher 2013].
If clusters are mixed, color matters. High luminance and saturation can cause a misperception: people may read the mixed region as a separate cluster. We therefore reduce the saturation and luminance to indicate that the region contains mixed clusters.
This post was published on 9/2/2020.
Translated from: https://medium.com/@jeheonpark93/vc-everything-about-scatter-plots-467f80aec77c