Classification of Caribbean Coral Reefs Using K-Means
Have you ever seen a Caribbean reef? Well if you haven’t, prepare yourself.
Today, we will be answering a question that, at face value, appears quite simple: “What does a Caribbean reef look like?” However, this question can be decomposed into many complex layers. So to avoid ambiguity, let’s refine the question to: “What are the non-mobile components of a Caribbean reef, and how are they related?”
That seems reasonable; we’ll have to look at fish another day.
Now we’re not going to roll out beautiful images of underwater cities teeming with diversity. Instead, we have bar charts. Without further ado, let’s dive in.
What Makes up a Typical Reef?
To start, we have developed a baseline graph (Figure 1) of the components of all Caribbean reefs. Here we have the median percent cover for nine substrate types. Now, if you haven’t conducted a scuba transect before, it may be helpful to break down that sentence. First, percent cover is how coral reef composition is measured: in other words, from a bird’s-eye view, what percent of the sea floor is hard coral, sponge, rock, etc. Second, substrate types are broad categories of sea floor, such as silt or sand. If you’re curious about the sampling methods or specific substrate definitions, check this out.
Ok, so in Figure 1 we’re looking at the median value for each of the nine substrate types. For example, in the Hard Coral column, we can see that hard coral’s median percent cover is roughly 17%. Good to know.
Diving deeper into the chart, it appears that most Caribbean reefs are primarily composed of four substrate types: rock, hard coral, nutrient indicator algae (NI Algae), and sand. Together, these four categories account for 91% of the sum of the median values. On the other hand, recently killed coral (RK Coral) and silt both have median values of 0. So, they’re relatively rare: at least half of the surveys recorded none of either.
We have learned that Caribbean reefs are rocky and sandy. Lovely.
But here’s an alarming analogy: the average number of children per US family is 1.93. If we take that number to be representative of the data, we might conclude that most families have 1.93 children, which I find hard to believe. Even worse, we have no understanding of the underlying distribution that led to an average of 1.93. Out of 100 families, there could be one family with 184 children, nine families with one child, and 90 with none. Instead, it would be useful to see whether certain numbers of kids per family are especially common.
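To make the point concrete, here is a minimal Python sketch. The 100 families below are invented purely for illustration (they are not survey data); they are chosen so that the mean comes out at 1.93 even though no family looks anything like the average.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical data: 100 families, invented purely for illustration.
children = [184] + [1] * 9 + [0] * 90

avg = mean(children)         # total children divided by number of families
mid = median(children)       # the middle of the sorted list
counts = Counter(children)   # how many families have each child count

print(f"mean = {avg:.2f}, median = {mid}")
for n_children, n_families in sorted(counts.items()):
    print(f"{n_children:>3} children: {n_families} families")
```

The tally of counts is the piece the average hides, and it is the same idea we are about to apply to reefs: look for common groupings rather than a single summary number.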
K-Means Demo
Applying this logic to reef composition, we will explore whether coral reefs fall into groups based on the above substrate categories. This is where unsupervised classification comes into play. Unsupervised algorithms fit data where we don’t know the “correct” answer. And one of the simplest methods of all is the k-means algorithm.
Without getting too technical, k-means attempts to split data into k clusters. The algorithm does this by minimizing the total squared distance from each cluster’s center (the cluster mean) to the points in that cluster. And because of this simple fitting criterion, it’s really easy to interpret. So let’s see an example…
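For readers who like to see the moving parts, here is a minimal sketch of the standard k-means loop (Lloyd's algorithm) in plain Python. It is not the code behind the figures in this post; the function name `kmeans` and the six toy points (pairs of hard coral % and NI algae %) are made up for illustration.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)   # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])   # empty cluster: keep old center
        if new_centers == centers:               # nothing moved, so we are done
            break
        centers = new_centers
    return centers, labels

# Hypothetical (hard coral %, NI algae %) pairs, invented for illustration.
toy = [(30.0, 5.0), (28.0, 8.0), (35.0, 4.0),
       (6.0, 40.0), (9.0, 45.0), (4.0, 38.0)]
centers, labels = kmeans(toy, k=2)
print(centers)
print(labels)
```

The result depends on the random starting centers, which is why real analyses usually run the algorithm several times and keep the best fit.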
[Figure 2. Reef Check.]
In Figure 2 we have created two clusters (k=2 in this case) using two substrate categories: hard coral and nutrient indicator algae. As you can see, there appears to be a clear divide between these two categories. But, let’s not get into interpretation quite yet.
Instead, let’s consider the case where we add another variable. Here, the k-means algorithm would categorize each point using three dimensions instead of two. But as you increase the number of dimensions, you lose the ability to visualize; it’s pretty hard to think in five or eight dimensions. However, we can still see where the cluster centers are numerically located in hyperspace.
Now that we have a basic understanding of what k-means does, let’s move on to the interesting graphs.
Top 4 Substrate Types (k=3)
In Figure 3 (below) we have fit three clusters (k=3) using the four most prevalent substrate types. Each bar represents a substrate category. The height of each bar represents the difference between the cluster mean and the overall mean for that substrate. Blue bars correspond to a cluster mean greater than the entire category’s mean and, conversely, red bars correspond to a cluster mean less than the entire category’s mean.
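The bar heights described here can be computed in a few lines. The sketch below is an illustration only: `cluster_deviations` is a hypothetical helper name, and the three survey rows and their cluster labels are invented. A positive value corresponds to a blue bar and a negative value to a red one.

```python
from statistics import mean

def cluster_deviations(rows, labels, names):
    """Return {cluster: {substrate: cluster mean - overall mean}}."""
    overall = [mean(col) for col in zip(*rows)]
    result = {}
    for cluster in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == cluster]
        cluster_means = [mean(col) for col in zip(*members)]
        result[cluster] = {
            name: c_mean - o_mean
            for name, c_mean, o_mean in zip(names, cluster_means, overall)
        }
    return result

# Hypothetical percent-cover rows and cluster labels, for illustration only.
names = ["hard_coral", "ni_algae"]
rows = [(30.0, 5.0), (20.0, 15.0), (10.0, 40.0)]
labels = [0, 0, 1]

deviations = cluster_deviations(rows, labels, names)
for cluster, bars in deviations.items():
    print(cluster, bars)
```

Because every bar is measured against the same overall mean, bars for the same substrate are directly comparable across clusters.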
[Figure 3. Reef Check.]
When classifying Caribbean reefs into three clusters, there appear to be sensible groupings: sand-dominated, rock-dominated, and algae-dominated. Interestingly, hard coral showed relatively little change even though it was the second most abundant substrate category. Conversely, nutrient indicator algae, which is often found on degraded reefs, showed an extremely strong signal relative to its abundance.
We can also observe that sand-dominated reefs had the most hard coral, at roughly 10 percentage points above the overall average. Rock-dominated reefs were net positive but differed little from the average for hard coral. And finally, as most people would expect, the evil nutrient indicator algae appears to have a fairly strong negative association with all other substrate types.
Ok, we’re starting to get somewhere. Now let’s increase the number of substrate types by including all categories that had a median value greater than zero: only silt and recently killed coral were not included.
Non-Zero-Median Substrate Types (k=3)
[Figure 4. Reef Check.]
In Figure 4 it appears the categories we found above hold steady. Sand/rubble-dominated reefs seem to support the most life, with above-average values in hard coral, soft coral, and sponge. Rocky reefs also exhibit life-supporting ability, although less than their sandy counterparts. And finally, nutrient indicator algae reefs show below-average percent cover in all other substrate types observed.
Now you might be wondering what the deal is with NI Algae. Well, nutrient indicator algae are often found on degraded reefs because they thrive in waters with elevated nutrient levels, such as nitrogen and phosphorus; Reef Check added this category to monitor the infamous algal blooms. Conversely, these high levels of nutrients can be harmful to corals. Thus, we would expect to see an inverse relationship between nutrient indicator algae and the other living substrate types, namely sponges, soft corals, and hard corals.
This stuff is pretty cool.
Fitting Using the Non-Zero Substrate Values (k=4)
In our final chart, we will try increasing the number of clusters to four, because who’s to say there are only three types of Caribbean reefs? Well, technically there are statistical methods for identifying reasonable values of k. In this case, the elbow method was used, and three to five clusters were deemed sensible.
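As a rough illustration of the elbow method, the sketch below fits k-means for several values of k and records the within-cluster sum of squares (WCSS) for each. Everything in it is an assumption made for the example: the helper names `make_blobs` and `wcss_for_k`, the three synthetic blobs, and the number of restarts. It is not the analysis code behind this post.

```python
import math
import random

def make_blobs(seed=1):
    """Three synthetic 2-D blobs of 20 points each (invented for illustration)."""
    rng = random.Random(seed)
    blob_centers = [(10.0, 10.0), (40.0, 15.0), (20.0, 45.0)]
    return [(rng.gauss(cx, 2.0), rng.gauss(cy, 2.0))
            for cx, cy in blob_centers for _ in range(20)]

def wcss_for_k(points, k, n_init=10, n_iter=30, seed=0):
    """Lowest within-cluster sum of squares found over n_init random starts."""
    rng = random.Random(seed)
    best = math.inf
    for _restart in range(n_init):
        centers = rng.sample(points, k)
        for _step in range(n_iter):
            # Assign every point to its nearest center.
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
                groups[nearest].append(p)
            # Move each center to its group's mean (keep it if the group is empty).
            centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
                       for j, g in enumerate(groups)]
        wcss = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
        best = min(best, wcss)
    return best

points = make_blobs()
curve = {k: wcss_for_k(points, k) for k in range(1, 7)}
for k, wcss in curve.items():
    print(f"k={k}: WCSS={wcss:.1f}")
```

WCSS generally keeps shrinking as k grows, so the useful signal is where the drop flattens out (the "elbow"). On real data that bend is often gradual, which is consistent with a range such as three to five being judged sensible rather than a single value.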
[Figure 5. Reef Check.]
As shown in Figure 5, and as expected, a fourth category has emerged. Boasting extremely high values of hard and soft corals, this coral-dominated cluster appears to be the “healthiest” of the four.
Now why did increasing the number of clusters suddenly create this magical healthy reef category? Well, with only three clusters, the high levels of hard and soft corals were lumped into the sand-dominated and rock-dominated classifications. By allowing for a fourth category, the data could be subset more cleanly.
In a similar vein, why can’t we conclude that there are five types of reefs? To answer your outstanding question, k-means with k=5 was plotted; however, the categories created were not intuitive. Moreover, because four central substrate categories compose 91% of the median total, limiting to four clusters is intuitive.
OK, final question: how can we tell whether three or four clusters is better? Another outstanding question, but unfortunately there isn’t a clear answer.
From an ecological perspective, there is no reason why rock- and sand-dominated reefs can’t support corals and sponges, which argues for k=3. It’s also simpler. However, by creating four clusters we can develop clear-cut classifications that appear to correspond to health, which argues for k=4. Those categories are sand-dominated, rock-dominated, algae-dominated, and coral-dominated.
As with many applied statistics problems, humans have to make judgement calls based on subject-matter knowledge. Here, there are good arguments for both k=3 and k=4.
Conclusion
I’m glad you now understand why bar charts are superior to pretty pictures. Even though you have no idea what a Caribbean reef looks like, you have a better understanding of what makes up a Caribbean reef (which is pretty cool).
What else can we conclude?
Got any other ideas?
Sources
- Algae can function as indicators of water pollution. (n.d.). Retrieved August 21, 2020, from http://www.walpa.org/waterline/june-2012/algae-can-function-as-indicators-of-water-pollution/
- Barott, K. L., Rodriguez-Mueller, B., Youle, M., Marhaver, K. L., Vermeij, M. J., Smith, J. E., & Rohwer, F. L. (2011). Microbial to reef scale interactions between the reef-building coral Montastraea annularis and benthic algae. Proceedings of the Royal Society B: Biological Sciences, 279(1733), 1655–1664. doi:10.1098/rspb.2011.2155
- Duffin, P. (2020, January 13). Average number of own children per family U.S. Retrieved August 20, 2020, from https://www.statista.com/statistics/718084/average-number-of-own-children-per-family/
The data were collected by Reef Check, a coral conservation non-profit that trains volunteer divers to collect marine data. There were 1576 unique entries for the Caribbean, ranging from 1997-05-24 to 2019-08-24. The date of the dive was not taken into account; however, in future iterations it would be interesting to see how these cluster centers change over time. The only modification to the traditional k-means algorithm was the inclusion of weights that correspond to the median percent cover of each substrate category.
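The post does not spell out how the weights enter the algorithm, so the sketch below shows just one common possibility and should be read as an assumption: use a weighted squared Euclidean distance, which is the same as multiplying each feature by the square root of its weight and then running ordinary k-means. The function names, the weights, and the two survey rows are all hypothetical.

```python
import math

def weighted_sq_dist(a, b, weights):
    """Weighted squared Euclidean distance: sum of w_j * (a_j - b_j)^2."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))

def scale_by_weights(row, weights):
    """Rescale features so that plain Euclidean distance matches the weighted one."""
    return tuple(math.sqrt(w) * x for w, x in zip(weights, row))

# Hypothetical weights and percent-cover rows (rock, hard coral, NI algae, sand).
weights = (0.40, 0.25, 0.20, 0.15)
reef_a = (45.0, 20.0, 10.0, 25.0)
reef_b = (30.0, 15.0, 35.0, 20.0)

direct = weighted_sq_dist(reef_a, reef_b, weights)
via_scaling = math.dist(scale_by_weights(reef_a, weights),
                        scale_by_weights(reef_b, weights)) ** 2
print(direct, via_scaling)
```

Under this reading, substrates with a larger median cover pull harder on the cluster assignments, while rare substrates contribute less to the distance.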
Here is the code.
Note: These are my findings. If you would like to contact me, leave a message here. All criticisms are welcome.
Original article: https://medium.com/data-diving/classification-of-caribbean-coral-reefs-using-k-means-51a66997a989