
5 Essential Business-Oriented Critical Thinking Skills for Data Science

Published: 2023/12/15


As Alexander Pope said, to err is human. By that metric, who is more human than us data scientists? We devise wrong hypotheses constantly and then spend time working on them just to find out how wrong we were.


When looking at mistakes from an experiment, a data scientist needs to be critical, always on the lookout for something that others may have missed. But sometimes, in our day-to-day routine, we can easily get lost in little details. When this happens, we often fail to look at the overall picture, ultimately failing to deliver what the business wants.


Our business partners have hired us to generate value. We won’t be able to generate that value unless we develop business-oriented critical thinking, including having a more holistic perspective of the business at hand. So here is some practical advice for your day-to-day work as a data scientist. These recommendations will help you to be more diligent and more impactful at the same time.


1. Beware of the Clean Data Syndrome

Tell me how many times this has happened to you: You get a data set and start working on it straight away. You create neat visualizations and start building models. Maybe you even present automatically generated descriptive analytics to your business counterparts!


But do you ever ask, “Does this data actually make sense?” Incorrectly assuming that the data is clean could lead you toward very wrong hypotheses. Not only that, but you’re also missing an important analytical opportunity with this assumption.


You can actually discern a lot of important patterns by looking at discrepancies in the data. For example, if you notice that a particular column has more than 50 percent of its values missing, you might think about dropping the column. But what if the values are missing because the data collection instrument has an error? By calling attention to this, you can help the business improve its processes.
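A minimal sketch of that habit, assuming records arrive as a list of dictionaries: rather than silently dropping a sparse column, compute the share of missing values per column and put the worst offenders on a list to raise with the business. The column names, the sample rows, and the 0.5 threshold are all hypothetical.

```python
def missing_fraction(rows, columns):
    """Return the fraction of missing (None or empty-string) values per column."""
    total = len(rows)
    fractions = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        fractions[col] = missing / total if total else 0.0
    return fractions


def columns_to_investigate(rows, columns, threshold=0.5):
    """List columns whose missing share exceeds the threshold.

    These are candidates for a conversation with the data owner,
    not for automatic deletion.
    """
    fractions = missing_fraction(rows, columns)
    return sorted(col for col, frac in fractions.items() if frac > threshold)


# Hypothetical records: the sensor column is mostly empty.
rows = [
    {"customer_id": 1, "age": 34, "sensor_reading": None},
    {"customer_id": 2, "age": None, "sensor_reading": None},
    {"customer_id": 3, "age": 51, "sensor_reading": 0.7},
    {"customer_id": 4, "age": 29, "sensor_reading": None},
]
columns = ["customer_id", "age", "sensor_reading"]

fractions = missing_fraction(rows, columns)
flagged = columns_to_investigate(rows, columns)
print(fractions)
print("Ask the business about:", flagged)
```

The point of returning a list instead of dropping columns is that the follow-up question ("why is this empty?") is where the business value is.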


Or what if you’re given a distribution of customers that shows a ratio of 90 percent men versus 10 percent women, but the business is a cosmetics company that predominantly markets its products to women? You could assume you have clean data and show the results as is, or you can use common sense and ask the business partner if the labels are switched.
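One way to make this kind of common-sense check routine is to compare the observed category shares with what the business would roughly expect. The sketch below is only an illustration: the expected shares, the tolerance, and the labels are made-up assumptions, and a flagged category is a prompt to ask a question, not proof of an error.

```python
from collections import Counter


def category_shares(values):
    """Return the share of each category in a list of labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def surprising_categories(values, expected_shares, tolerance=0.25):
    """Return {category: (observed, expected)} where the gap exceeds tolerance."""
    observed = category_shares(values)
    flags = {}
    for category, expected in expected_shares.items():
        seen = observed.get(category, 0.0)
        if abs(seen - expected) > tolerance:
            flags[category] = (seen, expected)
    return flags


# Hypothetical customer labels for a cosmetics company.
genders = ["M"] * 90 + ["F"] * 10
# Assumed prior from the business: mostly female customers.
expected = {"F": 0.8, "M": 0.2}

flags = surprising_categories(genders, expected)
for category, (seen, exp) in sorted(flags.items()):
    print(f"{category}: observed {seen:.0%}, expected about {exp:.0%}")
```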


Such errors are widespread. Catching them not only improves future data collection but also keeps other teams from using bad data, which prevents the company from making wrong decisions.

2. Be Aware of the Business

Source: Fab.com Beginnings

You probably know fab.com. If you don’t, it’s a website that sells selected health and fitness items. But the site’s origins weren’t in e-commerce. Fab.com started as Fabulis.com, a social networking site for gay men. One of the site’s most popular features was called the “Gay Deal of the Day.”


One day, the deal was for hamburgers. Half of the deal’s buyers were women, despite the fact that they weren’t the site’s target users. This fact caused the data team to realize that they had an untapped market for selling goods to women. So Fabulis.com changed its business model to serve this newfound market.


Be on the lookout for something out of the ordinary. Be ready to ask questions. If you see something in the data, you may have hit gold. Data can help a business to optimize revenue, but sometimes it has the power to change the direction of the company as well.


Source: Flickr Origins as “Game Neverending”

Another famous example of this is Flickr, which started out as a multiplayer game. Only when the founders noticed that people were using it as a photo upload service did the company pivot to the photo-sharing app we know it as today.


Try to see patterns that others would miss. Do you see a discrepancy in some buying patterns or maybe something you can’t seem to explain? That might be an opportunity in disguise when you look through a wider lens.


3. Focus on the Right Metrics

What do we want to optimize for? Most businesses fail to answer this simple question.


Every business problem is a little different and should, therefore, be optimized differently. For example, a website owner might ask you to optimize for daily active users. Daily active users is a metric defined as the number of people who open a product on a given day. But is that the right metric? Probably not! In reality, it’s just a vanity metric, meaning one that makes you look good but doesn’t serve any purpose when it comes to actionability. This metric will always increase if you are spending marketing dollars across various channels to bring more and more customers to your site.


Instead, I would recommend optimizing the percentage of users who are active, which gives a better idea of how the product is performing. A big marketing campaign might bring a lot of users to the site, but if only a few of them become active, the campaign was a failure and the site’s stickiness is very low. You can measure stickiness with the second metric but not with the first. If the percentage of active users is increasing, that suggests people like the website.
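To make the difference between the two metrics concrete, here is a small sketch with made-up numbers. It assumes "percentage of active users" means daily active users divided by total registered users; other definitions of stickiness exist, so treat this as one possible reading.

```python
def active_share(daily_active_users, total_registered_users):
    """Share of registered users who were active on a given day."""
    if total_registered_users == 0:
        return 0.0
    return daily_active_users / total_registered_users


# Hypothetical figures before and after a large marketing campaign.
before = {"dau": 1_000, "registered": 4_000}
after = {"dau": 1_500, "registered": 15_000}

share_before = active_share(before["dau"], before["registered"])
share_after = active_share(after["dau"], after["registered"])

print(f"DAU: {before['dau']} -> {after['dau']}")
print(f"Active share: {share_before:.0%} -> {share_after:.0%}")
```

In this invented scenario the raw daily active user count goes up by half while the active share drops from 25 percent to 10 percent, which is exactly the situation the vanity metric hides.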


Another example of looking at the wrong metric happens when we create classification models. We often try to increase accuracy for such models. But do we really want accuracy as a metric of our model performance?


Image: Pixabay

Imagine that we’re predicting whether an asteroid will hit the Earth. If we want to optimize for accuracy, we can just say no all the time, and we will be 99.99 percent accurate. That 0.01 percent error could be hugely impactful, though. What if that 0.01 percent is a planet-killing asteroid? A model can be highly accurate but not at all valuable. A better metric would be the F score, which would be zero in this case because the recall of such a model is zero: it never predicts an asteroid hitting the Earth.


When it comes to data science, designing the project and choosing the evaluation metrics is much more important than the modeling itself. The metrics need to reflect the business goal, and aiming for the wrong goal effectively destroys the whole purpose of modeling. For example, F1 and PR AUC (the area under the precision-recall curve) are better metrics for asteroid prediction because they take into account both the precision and the recall of the model. If we optimize for accuracy, our whole modeling effort could be in vain.
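The asteroid example can be reproduced in a few lines. This is a plain-Python sketch with hypothetical labels (one impact in 10,000 cases) and a "model" that always predicts no impact. Precision is undefined when nothing is predicted positive, so the code assumes the common convention of treating it as zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    # Convention (an assumption here): undefined precision/recall count as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical data: 1 impact among 10,000 cases.
y_true = [0] * 9_999 + [1]
# A "model" that always says no impact.
y_pred = [0] * 10_000

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"accuracy = {acc:.4f}, F1 = {f1:.4f}")
```

Accuracy rewards the model for the 9,999 easy negatives, while F1 exposes that it never finds the one case that matters.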


4. Statistics Lie Sometimes

Be skeptical of any statistics that get quoted to you. Statistics have been used to lie in advertisements, in workplaces, and in a lot of other arenas in the past. People will do anything to get sales or promotions.


For example, do you remember Colgate’s claim that 80 percent of dentists recommended their brand? This statistic seems pretty good at first. If so many dentists use Colgate, I should too, right? It turns out that during the survey, the dentists could choose multiple brands rather than just one. So other brands could be just as popular as Colgate.
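A toy version of such a survey shows why the headline number says little about preference. The responses and the competing brand names below are invented for illustration; the only point is that when each respondent can pick several brands, the recommendation rates do not have to sum to 100 percent.

```python
from collections import Counter


def recommendation_rates(responses):
    """Share of respondents who named each brand (multiple picks allowed)."""
    counts = Counter(brand for answer in responses for brand in answer)
    n = len(responses)
    return {brand: count / n for brand, count in counts.items()}


# Hypothetical multi-select answers from five dentists.
responses = [
    {"Colgate", "BrandB", "BrandC"},
    {"Colgate", "BrandB"},
    {"Colgate", "BrandB", "BrandC"},
    {"Colgate", "BrandC"},
    {"BrandB"},
]

rates = recommendation_rates(responses)
for brand in sorted(rates):
    print(f"{brand}: recommended by {rates[brand]:.0%} of dentists")
print(f"Sum of rates: {sum(rates.values()):.0%}")
```

Here two brands both reach 80 percent, so "80 percent of dentists recommend us" is true for each of them and distinguishes neither.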


Marketing departments are just myth-creation machines. We often see such examples in our daily lives. Take, for example, this 1992 ad from Chevrolet. Looking at just the graph and not at the axis labels, it seems that Nissan/Datsun must be dreadful truck manufacturers. In fact, the graph indicates that more than 95 percent of the Nissan and Datsun trucks sold in the previous 10 years were still running. The small difference might just be due to sample sizes and the types of trucks sold by each company. As a general rule, never trust a chart that doesn’t label the Y-axis.

As a part of the ongoing pandemic, we’re seeing even more such examples, with a lot of studies promoting cures for COVID-19. This past June in India, a man claimed to have made a medicine for coronavirus that cured 100 percent of patients in seven days. This news predictably caused a big stir, but only after he was asked about the sample size did we understand what was actually happening. With a sample size of 100, the claim was utterly ridiculous on its face. Worse, the way the sample was selected was hugely flawed. His organization selected asymptomatic and mildly symptomatic patients with a mean age between 35 and 45 and no pre-existing conditions. I was dumbfounded: this was not even a random sample. So not only was the study useless, it was actually unethical.
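The effect of hand-picking a sample can be illustrated with a deliberately simple, fully made-up population in which the "treatment" does nothing at all. The group sizes and recovery counts below are hypothetical and are not taken from the study described above.

```python
def make_group(severity, n_patients, n_recovered):
    """Build patient records; the first n_recovered are marked as recovered."""
    return [
        {"severity": severity, "recovered": i < n_recovered}
        for i in range(n_patients)
    ]


def recovery_rate(patients):
    """Fraction of patients who recovered."""
    return sum(p["recovered"] for p in patients) / len(patients)


# Hypothetical population: recovery here happens WITHOUT any treatment.
population = make_group("mild", 800, 760) + make_group("severe", 200, 60)

# A hand-picked sample that only includes mild cases.
hand_picked = [p for p in population if p["severity"] == "mild"]

overall = recovery_rate(population)
picked = recovery_rate(hand_picked)
print(f"Whole population: {overall:.0%} recover")
print(f"Mild-only sample: {picked:.0%} recover")
```

Even with a useless drug, the mild-only sample looks far better than the population as a whole, which is why a cure rate means little without a random sample and a comparison group.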


When you see charts and statistics, remember to evaluate them carefully. Make sure the statistics were sampled correctly and are being used in an ethical, honest way.


5. Don’t Give in to Fallacies

Photo by Jonathan Petersson on Unsplash

During the summer of 1913 in a casino in Monaco, gamblers watched in amazement as the roulette wheel landed on black an astonishing 26 times in a row. Since red and black are equally likely on any spin, they were confident that red was “due.” But each spin is independent of the ones before it, so a streak says nothing about the next result. It was a field day for the casino and a perfect example of the gambler’s fallacy, a.k.a. the Monte Carlo fallacy.
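A quick simulation makes the independence point tangible. The sketch below assumes a single-zero wheel (18 black, 18 red, and 1 green pocket), so black comes up with probability 18/37 on each spin. It measures how often black follows a run of at least three blacks; the streak length, the number of spins, and the seed are arbitrary choices.

```python
import random

# Assumption: single-zero wheel with 18 black pockets out of 37.
P_BLACK = 18 / 37


def black_rate_after_streak(n_spins, streak_length, seed=0):
    """Estimate P(black) on spins that follow >= streak_length blacks in a row."""
    rng = random.Random(seed)
    streak = 0          # current run of consecutive blacks
    followups = 0       # spins that came right after a long enough run
    blacks_after = 0    # how many of those spins were black again
    for _ in range(n_spins):
        is_black = rng.random() < P_BLACK
        if streak >= streak_length:
            followups += 1
            blacks_after += is_black
        streak = streak + 1 if is_black else 0
    return blacks_after / followups if followups else float("nan")


rate = black_rate_after_streak(n_spins=200_000, streak_length=3)
p_26_blacks = P_BLACK ** 26

print(f"P(black) on any spin:         {P_BLACK:.3f}")
print(f"P(black) after 3+ blacks:     {rate:.3f}")
print(f"P(26 blacks in a row) approx: {p_26_blacks:.2e}")
```

The estimate after a streak should land close to the unconditional 18/37, which is the whole fallacy: the wheel has no memory. A run of 26 is extremely unlikely in advance, but once 25 blacks have happened, the 26th is no less likely than any other spin.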


This happens in everyday life outside of casinos too. People tend to avoid long strings of the same answer. Sometimes they do so while sacrificing accuracy of judgment for the sake of getting a pattern of decisions that look fairer or more probable. For example, an admissions office may reject the next application they see if they have approved three applications in a row, even if the application should have been accepted on merit.


The world works on probabilities. There are seven billion of us, each doing something every second of our lives. Because of that sheer volume, rare events are bound to happen. But we shouldn’t put our money on them.

Think also of the spurious correlations we regularly see. This particular graph seems to show that organic food sales cause autism. Or is it the opposite? Just because two variables move in tandem doesn’t necessarily mean that one causes the other. Correlation does not imply causation, and as data scientists, it is our job to be on the lookout for such fallacies, biases, and spurious correlations. We can’t allow oversimplified conclusions to cloud our work.
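Two series that simply both grow over time will correlate strongly even when they have nothing to do with each other. The sketch below uses invented yearly figures (not the data behind the graph mentioned above) and a hand-written Pearson correlation. Comparing year-over-year changes instead of raw levels is one common quick check, assumed here for illustration; it can expose a shared trend, but it does not establish or rule out causation by itself.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def differences(values):
    """Year-over-year changes of a series."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical yearly figures that both trend upward.
organic_sales = [10, 12, 15, 17, 20, 22, 25, 27]
diagnoses = [100, 130, 150, 180, 200, 230, 250, 280]

level_corr = pearson(organic_sales, diagnoses)
diff_corr = pearson(differences(organic_sales), differences(diagnoses))

print(f"Correlation of levels:  {level_corr:.3f}")
print(f"Correlation of changes: {diff_corr:.3f}")
```

In this made-up data the levels are almost perfectly correlated, yet the yearly changes move in opposite directions, so the apparent relationship is just two upward trends.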


Data scientists have a big role to play in any organization. A good data scientist must be both technical and business-driven to do the job well. Thus, we need to make a conscious effort to understand the business’s needs while also polishing our technical skills.

Continue Learning

If you want to learn more about how to apply data science in a business context, I would like to call out the AI for Everyone course by Andrew Ng, which focuses on spotting opportunities to apply AI to problems in your own organization, working with an AI team, and building an AI strategy in your company.

Thanks for the read. I am going to be writing more beginner-friendly posts in the future too. Follow me on Medium or subscribe to my blog to be informed about them. As always, I welcome feedback and constructive criticism and can be reached on Twitter @mlwhiz.

This post was first published here.


Translated from: https://towardsdatascience.com/5-essential-business-oriented-critical-thinking-skills-for-data-science-ac25fa69aafc