Customer Churn Prediction Based on Mobile App Behavior Data
In the previous article, we created a logistic regression model to predict user enrollment using app behavior data. Hopefully, it was a good learning experience. This post aims to improve your model-building skills with new techniques and tricks, based on a larger mobile app behavior dataset. It is split into 7 parts.
1. Business challenge
2. Data processing
3. Model building
4. Model validation
5. Feature analysis
6. Feature selection
7. Conclusion
Now let’s begin the journey.
1. Business challenge
We are tasked by a Fintech firm to analyze mobile app behavior data and identify customers at risk of churning. The goal is to predict which users are likely to churn, so the firm can focus on re-engaging these users with better products.
2. Data processing
Exploratory data analysis (EDA) should be performed before data processing. Detailed steps are introduced in this article. The video below shows the final data after EDA.
2.1 One-hot encoding
One-hot encoding is a technique to convert categorical variables into numerical variables. It is needed because the model we are about to build cannot read categorical data. One-hot encoding simply creates one additional binary feature per unique category. Here, specifically,
dataset = pd.get_dummies(dataset)
The line above automatically converts all categorical variables into numerical dummy variables. One drawback of one-hot encoding, however, is the dummy variable trap: the dummy columns created for a category always sum to one, so they are highly correlated with each other. To avoid the trap, one of the dummy variables per category has to be dropped. Specifically,
dataset = dataset.drop(columns = ['housing_na', 'zodiac_sign_na', 'payment_type_na'])
2.2 Data split
This is to split the data into train and test sets. Specifically,
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset.drop(columns = 'churn'), dataset['churn'], test_size = 0.2, random_state = 0)
2.3 Data balancing
Imbalanced classes are a common problem in classification, where the ratio of observations in each class is disproportionate. But what would happen if we trained a model on an imbalanced dataset? The model would cleverly decide that the best thing to do is to always predict class 1: if class 1 takes up, say, 90% of the data, the model achieves 90% accuracy without learning anything about the minority class.
There are many ways to combat imbalanced classes, such as changing performance metrics, collecting more data, over-sampling or down-sampling data, etc. Here we use the down-sampling method.
First, let’s investigate the imbalance level of the dependent variable in y_train.
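A quick way to quantify the imbalance (a minimal sketch; the distribution in Fig.1 below is presumably built from these counts) is to count each class in y_train:
y_train.value_counts()                 # absolute counts of churn (1) vs. non-churn (0)
y_train.value_counts(normalize=True)   # the same counts expressed as proportions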
Fig.1 Imbalance distribution of the dependent variable: churn or not
As shown in Fig.1, the dependent variable is slightly imbalanced. To down-sample the data, we take the indices of each class and randomly sample from the majority-class indices as many observations as there are in the minority class of y_train. Then we concatenate the indices of both classes and use them to down-sample X_train and y_train.
import numpy as np

pos_index = y_train[y_train.values == 1].index
neg_index = y_train[y_train.values == 0].index

# Work out which class is the majority (higher) and which is the minority (lower)
if len(pos_index) > len(neg_index):
    higher = pos_index
    lower = neg_index
else:
    higher = neg_index
    lower = pos_index

# Randomly keep only as many majority-class indices as there are minority-class ones
np.random.seed(0)
higher = np.random.choice(higher, size=len(lower), replace=False)
lower = np.asarray(lower)
new_indexes = np.concatenate((lower, higher))

# Down-sample the training data to the balanced index set
X_train = X_train.loc[new_indexes]
y_train = y_train[new_indexes]
2.4 Feature scaling
Fundamentally, feature scaling normalizes the range of the variables so that no single variable has a dominant impact on the model. For a neural network, feature scaling also helps gradient descent converge faster.
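For reference, standardization rescales each column to zero mean and unit variance; a minimal sketch of what StandardScaler computes under the hood (illustrative only, not part of the pipeline):
import numpy as np
def standardize(column):
    # z-score: subtract the column mean, then divide by the column standard deviation
    return (column - np.mean(column)) / np.std(column)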
Here we use standardization to normalize the variables. Specifically,
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train2 = pd.DataFrame(sc_X.fit_transform(X_train))
X_test2 = pd.DataFrame(sc_X.transform(X_test))
3. Model building
Here we build a logistic regression classifier for churn prediction. Essentially, logistic regression predicts the log-odds of a class as a linear combination of the independent variables. If you would like to dive into more detail on logistic regression, visit this Wikipedia page.
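As a rough illustration (not the article's code; the function and variable names here are made up), the fitted model turns that linear combination into a churn probability through the sigmoid function:
import numpy as np
def churn_probability(x, coef, intercept):
    # log-odds: a linear combination of the independent variables
    log_odds = np.dot(x, coef) + intercept
    # the sigmoid maps the log-odds to a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-log_odds))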
Specifically,
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
Now, let’s test and evaluate the model. Specifically,
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
cm = confusion_matrix(y_test, y_pred)
accuracy_score(y_test, y_pred)
f1_score(y_test, y_pred)
Finally, we got an accuracy of 0.61 and an F1 score of 0.61. Not too bad a performance.
4. Model validation
With the model trained and tested, one question remains: how well does the model generalize to an unseen dataset? We use cross-validation to measure the gap in performance between the data the model has seen and data it has not. Specifically,
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
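The average and spread reported next come directly from the ten fold scores returned above:
accuracies.mean()   # average accuracy across the 10 folds
accuracies.std()    # standard deviation of the fold accuracies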
With the above, we found that 10-fold cross-validation produces an average accuracy of 0.645 with a standard deviation of 0.023. This indicates the model can generalize well to an unknown dataset.
5. Feature analysis
We built a logistic regression model with 41 features. But how do we know which features are more important in predicting the dependent variable? Specifically,
pd.concat([pd.DataFrame(X_train.columns, columns = ["features"]), pd.DataFrame(np.transpose(classifier.coef_), columns = ["coef"])], axis = 1)
As shown in Fig.2, we found two features that are very important: purchase_partners and purchase. This indicates that a user's purchase history plays a large role in whether they churn. It also suggests that not all variables are relevant for prediction.
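To read the table in Fig.2 more easily, the coefficients could be ranked by absolute magnitude (a small sketch, assuming the concatenated dataframe above is stored in a variable such as coef_table):
coef_table = pd.concat([pd.DataFrame(X_train.columns, columns = ["features"]), pd.DataFrame(np.transpose(classifier.coef_), columns = ["coef"])], axis = 1)
# Rank features by the absolute size of their coefficients
coef_table.reindex(coef_table["coef"].abs().sort_values(ascending = False).index)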
Fig.2 Variable importance on prediction
6. Feature selection
Feature selection is a technique to select a subset of the most relevant features for model training.
In this application, X_train contains 41 features, but as seen in Fig.2, not all of them play important roles. Feature selection helps to reduce the number of unimportant features and achieve similar performance with less training data. A more detailed explanation of feature selection can be found here.
Here, we use Recursive Feature Elimination (RFE). It works by fitting the given algorithm, ranking the features by importance, discarding the least important ones, and refitting until the specified number of features is reached. Specifically,
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
rfe = RFE(classifier, n_features_to_select = 20)
rfe = rfe.fit(X_train, y_train)
Note that above we set RFE to keep 20 features. Figure 3 shows all the selected features.
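The selected columns can also be listed straight from the fitted selector; something like the following is presumably how a figure such as Fig.3 gets its content:
# Boolean mask over the original columns -> names of the 20 kept features
selected_features = X_train.columns[rfe.support_]
print(selected_features)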
Fig.3 Features recommended by RFE
Great! With the RFE-selected features, let's retrain and test the model.
classifier.fit(X_train[X_train.columns[rfe.support_]], y_train)
y_pred = classifier.predict(X_test[X_train.columns[rfe.support_]])
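The metrics for the reduced model are computed with the same calls as before:
accuracy_score(y_test, y_pred)   # accuracy of the 20-feature model
f1_score(y_test, y_pred)         # F1 score of the 20-feature model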
In the end, we got an accuracy of 0.61 and F1 of 0.61. The same performance as the model trained on 41 features!
If we apply cross-validation again, we get an average accuracy of 0.647 with a standard deviation of 0.014. Again, very much the same as the previous model.
7. Conclusion
Initially, we trained a logistic regression model with 41 features, achieving a cross-validated accuracy of 0.645. By using feature selection to reduce the feature set, we then created a light version of the model with an accuracy of 0.647, and found that roughly half of the features have no relevance in deciding customer churn. Well done!
Huge congratulations for making it to the end. If you need the source code, feel free to visit my Github page.
Original article: https://towardsdatascience.com/prediction-on-customer-churn-with-mobile-app-behavior-data-bbce8de2802f