Capital One TPS: Compiled Notes
Credit Card Fraud Detection (asked 7 times from 2015 to 2017)
What machine learning model would you use to classify fraudulent transactions on credit cards?
feature selection
How do you choose a classification method, and which one works well here? A follow-up asks which method is the least useful.
bias-variance trade-off: What does regularization do?
target missing
false positive/false negative: Which matters more, false positives or false negatives? What is the effect of each?
What is VIF (in regression output)?
potential issues
exploratory analysis and data cleaning
How would you handle missing or garbage data?
How would you use existing features to add new features?
Logistic regression, random forests
Difference between random forest and gradient boosted tree.
Anomaly detection/novelty detection techniques might also be helpful because of the extreme class imbalance that normally exists in such scenarios.
Asked about many possible problems with the model and how you would deal with them when time is limited.
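For the false positive/false negative question above, one way to make the trade-off concrete is to attach a cost to each error type and pick the decision threshold that minimizes total cost. This is a minimal sketch, not a method stated in the original notes: the function names and the cost values `cost_fp` and `cost_fn` are hypothetical and would come from the business in practice.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FP, FN, TN when flagging every score >= threshold as fraud."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def total_cost(labels, scores, threshold, cost_fp, cost_fn):
    """Total cost of errors; cost_fp and cost_fn are assumed per-error costs."""
    _, fp, fn, _ = confusion_counts(labels, scores, threshold)
    return fp * cost_fp + fn * cost_fn


def best_threshold(labels, scores, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total error cost."""
    return min(
        candidates,
        key=lambda t: total_cost(labels, scores, t, cost_fp, cost_fn),
    )


# Toy data: 1 = fraud, 0 = legitimate; scores are hypothetical model outputs.
labels = [0, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8]
print(best_threshold(labels, scores, [0.3, 0.5], cost_fp=1, cost_fn=10))
```

When a missed fraud costs far more than a false alarm, the lower threshold wins; reversing the costs favors the higher one.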
A couple of things to keep in mind regarding fraud:
1) You're dealing with an imbalanced data set (fraud cases may be 3-5% of all your data), so consider either oversampling or giving higher weight to the fraud cases.
2) Your data may not contain all the true fraud cases; in other words, there may be actual fraud that was never captured in your data. So some form of anomaly detection may be needed.
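The two options in point 1 can be sketched in plain Python. This is an illustration under assumptions, not code from the original notes: the function names are made up, and the weighting formula `n_samples / (n_classes * class_count)` is the common "balanced" heuristic (the same one scikit-learn uses for `class_weight="balanced"`).

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate minority rows until every class matches the largest."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, c in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - c)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data: 5% fraud, matching the upper end of the range quoted above.
labels = [0] * 95 + [1] * 5
rows = list(range(100))
print(balanced_class_weights(labels))
new_rows, new_labels = oversample_minority(rows, labels)
print(Counter(new_labels))
```

If oversampling is used, it is usually applied to the training split only, so that duplicated fraud rows do not leak into the validation data.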
Predicting whether a customer will cancel their credit card (asked 3 times in 2018)
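Suppose you are given a collection of datasets, for example a year of credit card transaction records and customers' personal information. The bank wants to predict whether a customer will close their account within one month; if so, it plans to offer those customers some cashback rewards to retain them. You are asked to build a model that predicts account closure. The interviewer's questions were: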
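1. Which features would you choose? (It seemed anything relevant was acceptable. Follow-up: if you have a set of transaction dates and similar fields, how would you rebuild features from them?)
2. How would you do data cleaning?
    a. How do you detect outliers?
    b. How do you fill in missing data? (I said you can fill with a constant such as the mean; he then asked when filling with the mean is inappropriate and what would work better.)
    c. What if the target value is also missing?
3. Which model would you choose? (I said decision tree; he then asked me to name other models, their pros and cons, and what the target is. The target should be a binary value, whether the customer will close the account in one month; if a regression produces a value between 0 and 1, it represents how likely that is.)
4. How do you evaluate the model's performance, and with which package?
5. If the data is very large, say 1 TB, how do you sample it, and with which package?
6. If the model is inaccurate, what losses does that cause the bank?
7. Once the model has predicted a set of target values, how should rewards be allocated based on them? (I said to plot the distribution and give rewards to the top few percent of customers most likely to close. He asked what other approaches there are; I was not sure whether this tested modeling or business sense.)
8. The last one was an open question identical to one seen on the forum: two customers both have a 5000 limit, but one uses 100% of it and the other only 2%. Could both of them close their accounts within a month? The interviewer probably wants to see whether your first reaction is to think about the model or about other aspects.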
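For question 7, the answer given (reward the top few percent by predicted risk) and one possible alternative can be sketched as follows. Everything here is hypothetical: the function names, and especially `customer_value`, `reward_cost`, and `save_rate`, are illustrative business inputs that do not appear in the original notes.

```python
def top_fraction(scores, fraction):
    """Return the customer ids in the top `fraction` by predicted churn probability.

    `scores` maps customer id -> predicted probability of closing the account.
    """
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def worth_offering(p_churn, customer_value, reward_cost, save_rate):
    """Alternative rule: offer only when the expected retained value beats the cost.

    Assumes the reward costs `reward_cost` for everyone who receives it and
    that it retains a would-be churner with probability `save_rate`.
    """
    return p_churn * save_rate * customer_value > reward_cost


# Hypothetical predicted probabilities for five customers.
scores = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.1, "e": 0.4}
print(top_fraction(scores, 0.4))
print(worth_offering(0.5, customer_value=200, reward_cost=20, save_rate=0.3))
```

The first rule is pure ranking; the second ties the decision to money, which may be what the "business sense" follow-up was probing.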
All the steps from feature engineering through final model tuning and validation.
How did you build the model, which parameters did you use, what were the results, and why did you choose this model?
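Credit card churn model
    1. Feature engineering, e.g. computing tenure from the start date
    2. Missing values
    3. Which model would you use, and why?
    4. The data volume has now grown; what do you do? Spark. If you had to choose, would you use RSpark or PySpark, and why?
    5. Now that the model produces output, a customer with 0% credit limit utilization and one with 95% utilization both look high-risk and likely to close the card soon. How would you handle this? I answered that the churn model is a starting point; typically the marketing department designs a retention program based on the churn model's results. These two kinds of at-risk customers need different incentive plans.
        1) Customers at 0% utilization are basically hard to win back.
        2) Customers at 95% utilization can most likely be retained by lowering the interest rate, increasing cashback, and so on.
        3) Based on test results, you can additionally build an uplift model to see which high-churn users can be won back, and focus the treatment on them.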
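Items 1 and 2 of the list above (tenure from the start date, and missing values) can be illustrated with the standard library alone. This is a sketch under assumptions: the function names are hypothetical, tenure is counted in whole months, and the median plus a missing-value flag is shown as one common alternative to mean filling when a column is skewed.

```python
from datetime import date
from statistics import median


def tenure_months(start, as_of):
    """Whole months between the account start date and the reference date."""
    months = (as_of.year - start.year) * 12 + (as_of.month - start.month)
    if as_of.day < start.day:
        months -= 1  # the current month is not complete yet
    return max(months, 0)


def impute_median(values):
    """Fill None with the median and return a 0/1 flag marking imputed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


print(tenure_months(date(2017, 3, 15), date(2018, 3, 14)))
# A single very large balance would pull the mean far more than the median.
print(impute_median([100, None, 300, 10000]))
```

Keeping the flag as its own feature lets the model learn whether missingness itself is informative.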
- Tell me some useful packages you use in R/Python.
- How do you detect multicollinearity?
- How do you join two data sets?
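For the multicollinearity question (and the earlier VIF question), one standard check is the variance inflation factor. The sketch below relies on the identity that the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. It is written in plain Python for illustration; the function names are hypothetical, and in practice a statistics package would normally do this.

```python
from math import sqrt


def correlation_matrix(columns):
    """Pearson correlation matrix of equal-length, non-constant numeric columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(k)
        ]
        for i in range(k)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(k)]
        for i, row in enumerate(matrix)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF per predictor: the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Toy predictors: x3 is almost a copy of x1, while x2 is unrelated to both.
x1 = [1, -1, 1, -1, 1, -1, 1, -1]
x2 = [1, 1, -1, -1, 1, 1, -1, -1]
noise = [1, 1, 1, 1, -1, -1, -1, -1]
x3 = [a + 0.1 * e for a, e in zip(x1, noise)]
print(vif([x1, x2, x3]))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity; here x1 and x3 land far above that while x2 stays at 1.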
Other questions:
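- Our server running cost is xxx, other fixed costs are xxx, and the server can handle xxx TB of traffic. We have roughly xxx customers, each paying us a server usage fee of xxx/month. We allocate xxx GB to each customer, but on average each customer uses only xx% of it, so we can use the remaining space to take on more customers. Question: what is the annual profit? There is also another server, server b, with cost xxx and capacity xxx... Weigh the two and decide whether we should replace the existing server with server b.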
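- The case is a sporting goods retailer that asks you to optimize its online ad bidding system to improve the response rate. Assume your data is visit data for 3,000,000 users, each row has more than 150 columns, and the overall response rate is known to be 1/1000. The questions asked were: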
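1. What do you choose as the target?
Response or not
2. Which metric do you choose?
AUC-ROC
3. How do you handle NA?
It depends. If the NA is meaningful, leave it as is. If the value is missing because of data extraction, fill it with a simple if-else rule, the mean or median, or a regression.
4. How do you do feature engineering?
Encode categorical variables; use 'groupby' with 'mean/median/std' to generate some features.
5. What if the data volume is very large?
MapReduce, but I had not used it, so I gave local parallelization as an example: how to distribute data to the threads, then how to collect the results and merge them.
6. Which model?
GBDT, LightGBM/XGB
7. How do you evaluate model performance?
k-fold CV
8. What do you do about overfitting/underfitting?
Discussed each case separately: try to obtain more data, tune hyper-parameters.
9. If the model's predictions go wrong, what is the impact?
Discussed case by case what changes overall and what the effect is on an individual user.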
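On the metric answer above: with a response rate of 1/1000, a model that predicts "no response" for everyone is 99.9% accurate, which is why a ranking metric such as AUC-ROC is a natural choice. As a minimal sketch (the function name is hypothetical, and this pairwise version is quadratic, so it suits small samples only), AUC can be computed as the share of positive/negative pairs in which the positive example receives the higher score:

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = responded, 0 = did not respond.
print(auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, regardless of how rare the positive class is.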
- Given a dataset, how would you model it to extract a particular piece of information? How would you architect the pipeline?
false positive/false negative, regularization, and potential issues
Reposted from: https://www.cnblogs.com/ffeng0312/p/10275071.html