The Data Science Interview Blueprint
1. Organisation is Key
I’ve interviewed at Google (and DeepMind), Uber, Facebook and Amazon for roles that lie under the “Data Scientist” umbrella, and this is the typical interview structure I’ve observed:
Now nobody is expecting graduate-level competency in all of these topics, but you need to know enough to convince your interviewer that you’re capable of delivering if they offer you the job. How much you need to know depends on the job spec, but in this increasingly competitive market, no knowledge is wasted.
I recommend using Notion to organise your job prep. It’s extremely versatile, and enables you to utilise the Spaced Repetition and Active Recall principles to nail down learning and deploying key topics that come up time and time again in a Data Scientist interview. Ali Abdaal has a great tutorial on note taking with Notion to maximise your learning potential during the interview process.
I used to run through my Notion notes over and over, particularly right before my interview. This ensured that key topics and definitions were loaded into my working memory, and I didn’t waste precious time “ummmmmm”-ing when hit with a question.
2. Software Engineering
Not all Data Scientist roles will grill you on the time complexity of an algorithm, but all of these roles will expect you to write code. Data Science isn’t one job, but a collection of jobs that attracts talent from a variety of industries, including the software engineering world. As such, you’re competing with people who know the ins and outs of writing efficient code, and I would recommend spending at least 1–2 hours a day in the lead-up to your interview practicing the following concepts:
DO NOT LEARN THE ALGORITHMS OFF BY HEART. This approach is useless, because the interviewer can question you on any variation of the algorithm and you will be lost. Instead, learn the strategy behind how each algorithm works. Learn what time and space complexity are, and learn why they are so fundamental to building efficient code.
LeetCode was my best friend during interview preparation and is well worth the $35 per month in my opinion. Your interviewers only have so many algorithm questions to sample from, and this website covers a host of algorithm concepts, including which companies are likely to have asked (or are known to have asked) these questions in the past. There’s also a great community that discusses each problem in detail, which helped me through the myriad “stuck” moments I encountered. LeetCode has a “lite” version with a smaller question bank if the $35 price tag is too steep, as do HackerRank and geeksforgeeks, which are other great resources.
What you should do is attempt each question, even if your first pass is a brute-force approach that takes ages to run. Then look at the model solution and try to figure out what the optimal strategy is. Then read up on that optimal strategy and try to understand why it is optimal. Ask yourself questions like “why is Quicksort O(n log n) on average but O(n²) in the worst case?”, or “why do two pointers and one for loop make more sense than three for loops?” (see the sketch below).
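To make that last question concrete, here is a minimal sketch (my own addition, not from the original article) of the two-pointer idea applied to the classic pair-sum problem on a sorted list; the function names are my own.

```python
def pair_sum_brute_force(nums, target):
    """O(n^2): check every pair of indices."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None


def pair_sum_two_pointers(nums, target):
    """O(n): nums must be sorted; walk inward from both ends."""
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        s = nums[lo] + nums[hi]
        if s == target:
            return (lo, hi)
        if s < target:
            lo += 1   # sum too small, move the left pointer right
        else:
            hi -= 1   # sum too large, move the right pointer left
    return None


print(pair_sum_two_pointers([1, 3, 4, 6, 8, 11], 10))  # (2, 3)
```

The two-pointer version exploits the sorted order to discard one candidate per step, and being able to articulate exactly that kind of reasoning is what the interviewer is listening for.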
3. Applied Statistics
Data science has an implicit dependence on applied statistics, and how deep that dependence runs depends on the role you’ve applied for. Where do we use applied statistics? It pops up just about anywhere we need to organise, interpret and derive insights from data.
I studied the following topics intensely during my interviews, and you bet your bottom dollar that I was grilled about each topic:
If you think this is a lot of material, you are not alone. I was massively overwhelmed by the volume of knowledge expected in these kinds of interviews and the plethora of information on the internet that could help me. Two invaluable resources come to mind from when I was revising for interviews.
Introduction to Probability and Statistics, an open course on everything listed above including questions and an exam to help you test your knowledge.
Machine Learning: A Bayesian and Optimization Perspective by Sergios Theodoridis. This is more a machine learning text than a specific primer on applied statistics, but the linear algebra approaches outlined here really help drive home the key statistical concepts on regression.
The way you’re going to remember this stuff isn’t through memorisation; you need to solve as many problems as you can get your hands on. Glassdoor is a great repository for the sorts of applied stats questions typically asked in interviews. The most challenging interview I had by far was with G-Research, but I really enjoyed studying for the exam, and their sample exam papers were fantastic resources when it came to testing how far I was getting in my applied statistics revision.
4. Machine Learning
Now we come to the beast, the buzzword of our millennial era, and a topic so broad that it can be easy to get so lost in revision that you want to give up.
The applied statistics part of this study guide will give you a very, very strong foundation to get started with machine learning (which is basically just applied statistics written in fancy linear algebra), but there are certain key concepts that came up over and over again during my interviews. Here is a (by no means exhaustive) set of concepts organised by topic:
Metrics — Classification

Confusion Matrices, Accuracy, Precision, Recall, Sensitivity
F1 Score
TPR, TNR, FPR, FNR
Type I and Type II errors
AUC-ROC Curves
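As a quick refresher for the metrics above, here is a small sketch (my own addition) of how they can be computed with scikit-learn on a tiny hard-coded example:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred  = np.array([0, 1, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.6, 0.8, 0.7, 0.4, 0.2, 0.9, 0.3])  # predicted P(class = 1)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TPR, also called sensitivity
print("f1       :", f1_score(y_true, y_pred))
print("FPR      :", fp / (fp + tn))                    # Type I error rate
print("FNR      :", fn / (fn + tp))                    # Type II error rate
print("AUC-ROC  :", roc_auc_score(y_true, y_score))    # uses scores, not hard labels
```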
Metrics — Regression

Total sum of squares, explained sum of squares, residual sum of squares
Coefficient of determination and its adjusted form
AIC and BIC
Advantages and disadvantages of RMSE, MSE, MAE, MAPE
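For reference, here is a rough sketch (my addition) of these quantities computed by hand with NumPy so the definitions are explicit; the numbers and the assumed predictor count are made up:

```python
import numpy as np

y     = np.array([3.0, 5.0, 7.0, 9.0, 11.0])   # observed values
y_hat = np.array([2.8, 5.3, 6.9, 9.4, 10.6])   # model predictions
n, p  = len(y), 2                               # p = assumed number of predictors

tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
ess = tss - rss                       # explained sum of squares (holds for OLS with intercept)

r2     = 1 - rss / tss                              # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)       # penalises extra predictors
mse    = rss / n
rmse   = np.sqrt(mse)                               # same units as y, punishes big errors
mae    = np.mean(np.abs(y - y_hat))                 # more robust to outliers
mape   = np.mean(np.abs((y - y_hat) / y)) * 100     # undefined when y == 0

print(r2, adj_r2, rmse, mae, mape)
```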
Bias-Variance Tradeoff, Over/Under-Fitting

K Nearest Neighbours algorithm and the choice of k in bias-variance trade-off
Random Forests
The asymptotic property
Curse of dimensionality
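One way to internalise the k-NN point is to watch train and test accuracy as k grows. The sketch below is my own illustration on scikit-learn's breast cancer toy dataset; the split and the values of k are arbitrary choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:3d}  train={knn.score(X_tr, y_tr):.3f}  test={knn.score(X_te, y_te):.3f}")

# Small k fits the training set almost perfectly (low bias, high variance / overfitting);
# very large k smooths the decision boundary too much (high bias, low variance / underfitting).
```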
Model Selection

K-Fold Cross Validation
L1 and L2 Regularisation
Bayesian Optimization
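A minimal sketch (my addition, on scikit-learn's diabetes toy dataset) combining the first two ideas: k-fold cross-validation used to compare L1- and L2-regularised regression.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

for name, model in [("Lasso (L1)", Lasso(alpha=0.1)), ("Ridge (L2)", Ridge(alpha=1.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")  # 5-fold cross-validation
    print(f"{name}: mean R^2 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Being able to explain why the cross-validated score, rather than the training score, drives the choice of model and regularisation strength is exactly the kind of answer interviewers look for here.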
Sampling

Dealing with class imbalance when training classification models
SMOTE for generating pseudo observations for an underrepresented class
Class imbalance in the independent variables
Sampling methods
Sources of sampling bias
Measuring Sampling Error
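To make the SMOTE point concrete, here is a hedged sketch that assumes the imbalanced-learn package is installed (pip install imbalanced-learn); the synthetic dataset is just for illustration, and in practice you would only resample the training split.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a deliberately skewed two-class dataset (roughly 95:5).
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE interpolates between minority-class neighbours to create pseudo observations.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after :", Counter(y_res))
```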
Hypothesis Testing

This really comes under applied statistics, but I cannot stress enough the importance of learning about statistical power. It’s enormously important in A/B testing.
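For example, the standard A/B-testing question “how many users do I need per variant?” is a power calculation. A rough sketch using statsmodels, with made-up conversion rates, might look like this:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Baseline conversion of 10% vs a hoped-for 12% (both numbers are illustrative).
effect = proportion_effectsize(0.10, 0.12)

n_per_group = NormalIndPower().solve_power(effect_size=effect,
                                           alpha=0.05,        # Type I error rate
                                           power=0.80,        # 1 - Type II error rate
                                           alternative="two-sided")
print(round(n_per_group))  # observations needed per variant to detect the uplift
```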
Regression Models

Ordinary Linear Regression, its assumptions, estimator derivation and limitations are covered in significant detail in the sources cited in the applied statistics section. Other regression models you should be familiar with are:

Deep Neural Networks for Regression
Random Forest Regression
XGBoost Regression
Time Series Regression (ARIMA/SARIMA)
Bayesian Linear Regression
Gaussian Process Regression
Clustering Algorithms

K-Means
Hierarchical Clustering
Dirichlet Process Mixture Models
Classification Models

Logistic Regression (Most important one, revise well)
Multiple Regression
XGBoost Classification
Support Vector Machines
It’s a lot, but much of the content will be trivial if your applied statistics foundation is strong enough. I would recommend knowing the ins and outs of at least three different classification/regression/clustering methods, because the interviewer can always ask (and has previously asked) “what other methods could we have used, and what are some advantages/disadvantages?” This is a small subset of the machine learning knowledge in the world, but if you know these important examples, the interviews will flow a lot more smoothly.
5. Data Manipulation and Visualisation
“What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?”
When you are given a new dataset, the first thing you’ll need to prove is that you can perform an exploratory data analysis (EDA). Before you learn anything else, realise that there is one path to success in data wrangling: Pandas. The Pandas library, when used correctly, is the most powerful tool in a data scientist’s toolbox. The best way to learn how to use Pandas for data manipulation is to download many, many datasets and learn how to do the following set of tasks as confidently as you make your morning cup of coffee.
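A typical first pass might look like the sketch below (my own illustration; "data.csv" is a placeholder path, not a file from the article):

```python
import pandas as pd

df = pd.read_csv("data.csv")

print(df.shape)                     # how many observations and features do I have?
print(df.head())                    # what does the raw data actually look like?
df.info()                           # dtypes and non-null counts per column
print(df.describe(include="all"))   # summary of numeric and categorical features
print(df.isnull().sum())            # where are the missing values?
```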
One of my interviews involved downloading a dataset, cleaning it, visualising it, performing feature selection, and building and evaluating a model, all in one hour. It was a crazy hard task, and I felt overwhelmed at times, but I made sure I had practiced building model pipelines for weeks before actually attempting the interview, so I knew I could find my way if I got lost.
Advice: The only way to get good at all this is to practice, and the Kaggle community has an incredible wealth of knowledge on mastering EDAs and model pipeline building. I would check out some of the top ranking notebooks on some of the projects out there. Download some example datasets and build your own notebooks, get familiar with the Pandas syntax.
建議:擅長于這一切的唯一方法就是練習(xí),而Kaggle社區(qū)在掌握EDA和模型管道構(gòu)建方面擁有不可思議的豐富知識。 我會在一些項目中查看一些頂級筆記本。 下載一些示例數(shù)據(jù)集并構(gòu)建自己的筆記本,熟悉Pandas語法。
Data Organisation
There are three sure things in life: death, taxes, and getting asked to merge datasets and perform groupby and apply tasks on said merged datasets. Pandas is INCREDIBLY versatile at this, so please practice, practice, practice.
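Here is a hedged sketch of that merge-then-groupby pattern with two tiny hand-made tables; the column names are my own invention:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2, 3], "amount": [10.0, 25.0, 5.0, 40.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["EU", "US", "EU"]})

# Merge the two datasets on their shared key.
merged = orders.merge(customers, on="customer_id", how="left")

# Aggregate per region, then apply a custom function to each group.
per_region = merged.groupby("region")["amount"].agg(["sum", "mean", "count"])
share = merged.groupby("region")["amount"].apply(lambda s: s.sum() / merged["amount"].sum())

print(per_region)
print(share)   # each region's share of total spend
```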
Data Profiling
This involves getting a feel for the “meta” characteristics of the dataset, such as the shape and description of numerical, categorical and date-time features in the data. You should always be seeking to address a set of questions like “how many observations do I have”, “what does the distribution of each feature look like”, “what do the features mean”. This kind of profiling early on can help you reject non-relevant features from the outset, such as categorical features with thousands of levels (names, unique identifiers), which means less work for you and your machine later on (work smart, not hard, or something woke like that).
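As an illustration of that last point, here is a small sketch (my addition; the 10-level threshold and helper name are arbitrary) that drops categorical features with too many levels:

```python
import pandas as pd

def drop_high_cardinality(df: pd.DataFrame, max_levels: int = 10) -> pd.DataFrame:
    """Drop categorical columns whose number of unique levels exceeds max_levels."""
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    too_many = [c for c in cat_cols if df[c].nunique() > max_levels]
    return df.drop(columns=too_many)

df = pd.DataFrame({"name": [f"user_{i}" for i in range(100)],   # unique identifier-like
                   "plan": ["free", "paid"] * 50,               # sensible categorical
                   "spend": range(100)})

print(drop_high_cardinality(df).columns.tolist())  # 'name' is rejected, 'plan' and 'spend' stay
```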
Data Visualisation
Here you are asking yourself “what does the distribution of my features even look like?”. A word of advice: if you didn’t learn about boxplots in the applied statistics part of the study guide, then here is where I stress that you learn about them, because you need to learn how to identify outliers visually, and we can discuss how to deal with them later on. Histograms and kernel density estimation plots are extremely useful tools when looking at properties of the distributions of each feature.
We can then ask “what does the relationship between my features look like”, in which case Python has a package called seaborn containing very nifty tools like pairplot and a visually satisfying heatmap for correlation plots.
Handling Null Values, Syntax Errors and Duplicate Rows/Columns
Missing values are a sure thing in any dataset, and arise due to a multitude of different factors, each contributing to bias in their own unique way. There is a whole field of study on how best to deal with missing values (and I once had an interview where I was expected to know individual methods for missing value imputation in much detail). Check out this primer on ways of handling null values.
Syntax errors typically arise when our dataset contains information that has been manually input, such as through a form. This could lead us to erroneously conclude that a categorical feature has many more levels than are actually present, because “Hot”, “hOt” and “hot\n” are all considered unique levels. Check out this primer on handling dirty text data.
Finally, duplicate columns are of no use to anyone, and having duplicate rows could lead to overrepresentation bias, so it’s worth dealing with them early on.
Standardisation or Normalisation
Depending on the dataset you’re working with and the machine learning method you decide to use, it may be useful to standardize or normalize your data so that different scales of different variables don’t negatively impact the performance of your model.
There’s a lot here to go through, but honestly it wasn’t so much the “memorise everything” mentality that helped me as the confidence that learning as much as I could instilled in me. I must have failed so many interviews before the formula “clicked”, and I realised that all of these things aren’t esoteric concepts that only the elite can master; they’re just tools that you use to build incredible models and derive insights from data.
Best of luck on your job quest, guys. If you need any help at all, please let me know and I will answer emails/questions when I can.
Translated from: https://towardsdatascience.com/the-data-science-interview-blueprint-75d69c92516c