Level Up Your Kaplan-Meier Curves with Tableau
In a previous article, I showed how we can create the Kaplan-Meier curves using Python. As much as I love Python and writing code, there might be some alternative approaches with their unique set of benefits. Enter Tableau!
Tableau is a business intelligence tool used for creating elegant and interactive visualizations on top of data coming from a vast number of sources (you would be surprised how many distinct ones there are!). To make the definition even shorter, Tableau is used for building dashboards.
So why would a data scientist be interested in using Tableau instead of Python? When creating a Notebook/report with the results of a survival analysis exercise in Python, the reader will always be limited to:
- what the creator of the visualization had in mind,
- what data was available at the moment of creating the report.
In other words, there is little freedom for the reader to explore alternative angles. What is more, if someone in the company stumbles upon the report a few years later, the only way to bring the analysis up to date is to find the data scientist and have them rerun the Notebook and generate another report. Definitely not the best situation.
This is where a solution based on Tableau (or other business intelligence tools such as PowerBI, Looker, etc.) shines. As the visualizations are built directly on top of a data source, the visualization will be updated together with the data. Less work for the data scientist!
Another benefit is the possibility to include filters, so readers can play around and explore different subsets of the data. From experience, this feature is often used by product owners, who want to dive deep into the details without constantly coming to the data person with a request for a new filter or feature. Another win :)
Lastly, by using such tools, the analysts democratize the access to the data and analyses, as basically anyone in the company can access the dashboard and try to answer their own questions or verify their hypotheses.
After this introduction, let’s jump right into re-creating the very same Kaplan-Meier curves we created in the previous article. Once again, we use the Telco Churn dataset, which requires close to no extra preparation before the analysis. Please refer to that article if you need a refresher on the Kaplan-Meier estimator, as we will not cover theory this time. Also, we assume some basic knowledge of Tableau.
Note: Tableau is commercial software and requires a license. You can get access to a 14-day trial by following the instructions here.
Approach #1: Easy mode
The first approach is dubbed easy, as it will favor speed and simplicity, while at the same time introducing some shortcomings. First, we load the data from a text file (available here).
To carry out the survival analysis in Tableau, we will need the following variables:
- time-to-event: expressed as the number of time periods (for example, days or months) elapsed between joining the sample and the event of interest or censoring.
- event-of-interest: expressed as a binary variable, where 1 indicates that the event happened and 0 that it did not.
- additional categorical variables: used for filtering and/or grouping.
The tenure variable does not require any preparation, as it already expresses the number of months since signing up for the services of the Telco company. The Churn variable, however, is stored as a yes/no string, so we need to encode it as a binary variable using a calculated field.
To create this field, right-click the Churn variable in the variable selector on the left (Data tab) and select Create -> Calculated Field.
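The Tableau formula itself is not reproduced here. As a rough Python analogue of the same encoding (the helper name `encode_churn` and the sample values are made up for illustration), the idea is simply:

```python
# Hypothetical helper, not part of the Tableau workbook:
# it mirrors the yes/no -> 1/0 encoding described above.
def encode_churn(value: str) -> int:
    """Return 1 if the customer churned ("Yes"), 0 otherwise."""
    return 1 if value.strip().lower() == "yes" else 0


# Made-up sample of the Churn column.
churn_column = ["No", "Yes", "No", "Yes"]
churn_binary = [encode_churn(v) for v in churn_column]
print(churn_binary)
```

In Tableau, the same logic would live in a single calculated field rather than in a function.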
As the next step, we create a new calculated field, d_i, which represents the number of events that occur in each time period.
The names we used for the variables correspond to the elements you can find in the formula for the Kaplan-Meier estimator.
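For reference, this is the Kaplan-Meier estimator in its standard textbook form. The symbol n_i for the number of observations at risk is the usual convention and is assumed here:

```latex
\hat{S}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```

Here d_i is the number of events at time t_i and n_i is the number of observations still at risk just before t_i.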
The next variable we create is the denominator used for calculating the hazard function at a given time. It represents the number of observations still at risk at the start of that time period, that is, those that have not yet churned or been censored.
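To make the two quantities concrete, here is a minimal Python sketch that computes them on a made-up toy dataset. The names `events_and_at_risk`, `d_i`, and `n_i` are illustrative assumptions, and the code is only an analogue of what the calculated fields compute, not the Tableau formulas:

```python
from collections import Counter

# Toy data (made up for illustration): (tenure in months, churn_binary).
observations = [(1, 1), (1, 0), (2, 1), (3, 0), (3, 1), (3, 0)]


def events_and_at_risk(obs):
    """Return {tenure: (d_i, n_i)} for each distinct tenure value.

    d_i = number of events observed at that tenure,
    n_i = number of observations with tenure >= that value (still at risk).
    """
    records = Counter(t for t, _ in obs)  # observations per tenure value
    events = Counter()
    for t, e in obs:
        events[t] += e
    total = len(obs)
    seen_before = 0  # observations with a strictly smaller tenure
    table = {}
    for t in sorted(records):
        table[t] = (events[t], total - seen_before)
        seen_before += records[t]
    return table


print(events_and_at_risk(observations))
```

The count of records per tenure plays the role of the Number of Records helper: the at-risk number is the total minus everything that dropped out at earlier tenures.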
The Number of Records variable is a helper variable used for, as you might have guessed, counting the observations. For that purpose, newer versions of Tableau create a variable named after the data source. However, you can easily create this variable manually by adding a calculated field and placing 1 in the field’s definition. Lastly, we define the Kaplan-Meier curve in one more calculated field.
Here, the probability of surviving a given period is defined as 1 - hazard function, and the Kaplan-Meier curve is the running product of these probabilities over time.
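As a sketch of that running product in Python (the function name and the toy table of (d_i, n_i) pairs are assumptions made for illustration, not values from the Telco data):

```python
def kaplan_meier(table):
    """table: {tenure: (d_i, n_i)} -> list of (tenure, survival probability)."""
    survival = 1.0
    curve = []
    for t in sorted(table):
        d_i, n_i = table[t]
        survival *= 1 - d_i / n_i  # multiply by (1 - hazard) for this period
        curve.append((t, survival))
    return curve


# Made-up counts: events and observations at risk per tenure value.
toy_table = {1: (1, 6), 2: (1, 4), 3: (1, 3)}
curve = kaplan_meier(toy_table)
for t, s in curve:
    print(t, round(s, 4))
```

Each point depends on the previous one, which is why the Tableau version needs a table calculation rather than a plain row-level formula.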
All the building blocks are ready. Now, we place tenure on the x-axis and the Kaplan-Meier Curve on the y-axis, format the curve as a percentage, add the title, and place the PaymentMethod variable on Color. This way, we create the following visualization:
The result is very similar to what we obtained last time using lifelines:
Some quick observations:
- the survival curves obtained in Tableau are more or less straight, without the characteristic step structure,
- there are no confidence intervals, as their calculation is not that simple in Tableau.
Using Tableau, we can easily add some additional filters to the visualization, such as the cohort date, age, or any of the available categorical variables.
Approach #2: Normal mode
In this approach, we will focus on recreating the characteristic step-like shape of the Kaplan-Meier curves. This approach is dubbed the normal mode, as it requires a bit more preparation.
For the additional data preprocessing, we need to complete two steps. First, add a column called link to the CSV file with the Telco Customer Churn data and populate it with the string ‘link’. As a matter of fact, both the string and the column name are arbitrary. What matters is consistency, and the reason will become clear in a second. The second step is to create a new CSV file (we called it blending.csv), which contains the following:
link, set
link, 1
link, 2
Yep, that’s pretty much it. For your convenience, I stored both files on my GitHub.
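If you prefer to prepare the files yourself, the first step can be scripted. The snippet below is a minimal sketch using only Python's standard library; the two-column CSV text is a made-up stand-in for the real Telco file, and `add_link_column` is a hypothetical helper name:

```python
import csv
import io

# Stand-in for the Telco churn CSV: only two columns and made-up rows.
telco_csv = "tenure,Churn\n1,Yes\n5,No\n"


def add_link_column(text, column="link", value="link"):
    """Return the CSV text with a constant column appended to every row."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(reader.fieldnames) + [column], lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, column: value})
    return out.getvalue()


# Contents of the second file described above.
blending_csv = "link,set\nlink,1\nlink,2\n"

print(add_link_column(telco_csv))
```

For the real data you would read from and write to files on disk instead of in-memory strings.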
Armed with the two files, we load them into Tableau and left join the tables on the link variable. You can see that in the following image.
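Because every row carries the same link value, the join pairs each customer row with both rows of blending.csv, so the data is duplicated once per value of set. A small Python sketch of that effect (toy rows and a hand-written `left_join` helper, both assumptions for illustration):

```python
# Made-up rows standing in for the two tables.
telco_rows = [
    {"tenure": 1, "Churn": "Yes", "link": "link"},
    {"tenure": 5, "Churn": "No", "link": "link"},
]
blending_rows = [{"link": "link", "set": 1}, {"link": "link", "set": 2}]


def left_join(left, right, key):
    """Left join two lists of dicts on `key` (unmatched left rows are kept)."""
    joined = []
    for l_row in left:
        matches = [r for r in right if r[key] == l_row[key]]
        if not matches:
            joined.append(dict(l_row))
        for r_row in matches:
            joined.append({**l_row, **r_row})
    return joined


joined = left_join(telco_rows, blending_rows, "link")
print(len(telco_rows), "->", len(joined))
```

Having two copies of every observation is what later gives each time point two marks to draw a step with.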
As this is the “normal mode”, we combine a few steps at once and create a single calculated field called Kaplan-Meier Dots.
You can easily recognize the contents of this field from the “easy mode”; this time, we have put everything into one field. Next comes the new part: we define the Kaplan-Meier Curve in a separate calculated field.
This convoluted formula enables us to obtain the step-like shape of the curves. Lastly, we need one more helper variable, Index.
When creating it, click Default Table Calculation and specify that the results are computed along tenure.
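The idea behind the step shape can be illustrated outside Tableau. The sketch below is not the Tableau formula. It assumes that one copy of each time point sits at the previous survival level and the other at the new level, and that an index orders the resulting vertices along the path. The function name and sample points are made up:

```python
def step_path(curve, start=1.0):
    """Turn [(t, S(t)), ...] into the ordered vertices of a step function.

    For every time point we emit two vertices: one at the previous
    survival level and one at the new level, so the line drops vertically.
    """
    path = []
    previous = start
    for t, s in curve:
        path.append((t, previous))  # the horizontal segment ends here
        path.append((t, s))         # vertical drop to the new level
        previous = s
    return path


# Made-up survival points.
points = [(1, 0.9), (2, 0.75), (3, 0.75)]
path = step_path(points)
for index, vertex in enumerate(path, start=1):
    print(index, vertex)
```

Connecting the vertices in index order yields horizontal and vertical segments only, which is the characteristic Kaplan-Meier staircase.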
Finally, we have all the building blocks to create the curves. We approach the setup similarly to the “easy mode”, with the difference that we place Index on Path and set on Detail. To recreate the curves from Python, we once again use PaymentMethod as the Color.
In the picture above, we accurately recreated the curves we previously obtained using the lifelines library in Python. This definitely required a bit more work but can pay off in the end.
We can additionally use the Kaplan-Meier Dots to visualize the events as they happen along the curve. In this case, I believe this would simply clutter the visualization. It would be more suitable for a smaller dataset.
We can further improve the dashboard by adding some filters/splits and then share it with our colleagues via the company’s reporting portal (in this case, an instance of Tableau Server).
Conclusion
In this article, I explained the potential benefits of using business intelligence tools such as Tableau for survival analysis and showed how to create dashboards with the Kaplan-Meier curves.
As is often the case, nothing comes for free and there are also some disadvantages to this approach:
- Calculating the confidence intervals is definitely harder and takes quite some effort.
- In Tableau, there is no simple way to carry out the log-rank test to compare different survival curves (unless we use R from Tableau, but this might be an idea for a future article).
- If new features are added to the data, for example, a new customer segmentation or another category for each observation, an analyst will still need to add them to the existing dashboard. However, this rarely happens and usually requires little extra work.
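To give a sense of what the confidence intervals involve outside Tableau, here is a minimal Python sketch based on Greenwood's variance formula with plain (linear) 95% intervals. This is a simplification chosen for illustration: libraries such as lifelines may use a different interval transformation, so the numbers need not match theirs. The function name and the toy table are assumptions:

```python
import math


def km_with_greenwood(table, z=1.96):
    """table: {t: (d_i, n_i)} -> [(t, S, lower, upper)] with plain Greenwood CIs."""
    survival, var_sum, rows = 1.0, 0.0, []
    for t in sorted(table):
        d_i, n_i = table[t]
        survival *= 1 - d_i / n_i
        if n_i > d_i:  # the Greenwood term is undefined when everyone fails
            var_sum += d_i / (n_i * (n_i - d_i))
        se = survival * math.sqrt(var_sum)
        lower = max(0.0, survival - z * se)  # clip to the [0, 1] range
        upper = min(1.0, survival + z * se)
        rows.append((t, survival, lower, upper))
    return rows


# Made-up counts: events and observations at risk per time point.
rows = km_with_greenwood({1: (1, 10), 2: (2, 8)})
for t, s, lo, hi in rows:
    print(t, round(s, 3), round(lo, 3), round(hi, 3))
```

The cumulative sum inside the square root is what makes this awkward to express as a Tableau table calculation.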
I hope you enjoyed this alternative approach to visualizing the Kaplan-Meier curves. As always, any constructive feedback is welcome. You can reach out to me on Twitter or in the comments.
Translated from: https://towardsdatascience.com/level-up-your-kaplan-meier-curves-with-tableau-bc4a10ec6a15