Predicting Interest Rate with Classification Models - Part 3
This is the final article of the series “Predicting Interest Rate with Classification Models”. Here are the links to the First and Second articles of the series, where I explain the challenge I faced when I started at M2X Investments. As I mentioned before, I will try my best to make this article understandable on its own, but I will skip the explanation of the assumptions regarding the data for article-length reasons. Nevertheless, you can check them in the previous posts of the series. Let’s do it!
Fast Recap
In previous articles, I applied a couple of classification models to the problem of predicting up movements of the Fed Funds Effective Rate. In short, it is a binary classification problem where 1 represents an up movement and 0 a neutral or negative movement. The models applied were Logistic Regression, Naive Bayes, and Random Forest. Random Forest yielded the best results so far, without hyperparameter optimization, with an F1-score of 0.76.
If you are curious to know more about the data, please refer to Part 1 or Part 2 of the series. I omitted the explanation about them in this article for practical purposes only.
A brief introduction to CatBoost and Support Vector Machines
CatBoost
CatBoost is an open-source library for gradient boosting on decision trees. Ok, so what is Gradient Boosting?
Gradient boosting is a machine learning algorithm that can be used for classification and regression problems. It usually gives great results when tackling problems with heterogeneous data and small data sets. But what, in essence, is this algorithm? Let's start by defining Boosting.
Boosting is an ensemble technique that tries to transform weak learners into strong learners by training them sequentially, with the objective of making each one better than its predecessors. The sequential part means that each learner (usually a tree) is built by taking the previous tree's errors into account (in the case of the AdaBoost algorithm, the trees are called stumps).
As an example, imagine that we train a tree and give each observation equal weight. Next, we evaluate the tree and get its errors. Then, for the next tree, we increase the weights of the observations that the first tree classified incorrectly and lower the weights of the ones it classified correctly. This essentially tells the next tree to give more importance to the mistakenly classified observations and classify them correctly. The process continues until we stop and take the final vote of our trees.
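To make the reweighting idea concrete, here is a minimal sketch of one round of that procedure using scikit-learn decision stumps; the toy data and the doubling/halving update are simplified illustrations, not the exact AdaBoost weight formulas.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# toy data: six observations, two features, binary labels
X = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [2, 1], [1, 2]])
y = np.array([0, 0, 1, 1, 1, 0])

weights = np.full(len(y), 1 / len(y))        # start with equal weights

stump = DecisionTreeClassifier(max_depth=1)  # a "stump": a tree with a single split
stump.fit(X, y, sample_weight=weights)

miss = stump.predict(X) != y                 # which observations were misclassified?

# simplified update: raise the weight of the mistakes, lower the rest, renormalize
weights[miss] *= 2.0
weights[~miss] *= 0.5
weights /= weights.sum()

# the next stump is trained with the new weights, and so on until we stop
next_stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)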
Let's go back to Gradient Boosting now. With the concept of Boosting in mind, we can think of Gradient Boosting as an algorithm that follows the same process described above. The difference is that now we define a loss function to be optimized (minimized). This means that, after calculating the loss, the next tree we create will have to reduce it (follow the gradient by reducing the residual loss).
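As a minimal sketch of that mechanism (not the article's model), the loop below uses squared-error loss, so the negative gradient is simply the residual, and each new shallow tree is fit to what the previous trees still get wrong.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(13)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())             # start from a constant prediction
trees = []

for _ in range(50):
    residuals = y - pred                     # negative gradient of squared-error loss
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                   # the next tree targets the remaining error
    pred += learning_rate * tree.predict(X)  # each step reduces the loss a little
    trees.append(tree)

print("MSE after boosting:", np.mean((y - pred) ** 2))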
What about CatBoost?
CatBoost is a gradient-boosted decision tree library. On its page, the developers say that it performs well with default parameters, supports categorical features, includes built-in model analysis tools, and trains fast on both CPU and GPU.
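As a hedged illustration of those claims, a minimal CatBoost call with default parameters can look like the sketch below; the data frame and column names are made up for the example, and cat_features is how the library is told which columns are categorical.

import pandas as pd
from catboost import CatBoostClassifier

# hypothetical frame with one numeric and one categorical feature
df = pd.DataFrame({
    "rate_change": [0.10, -0.20, 0.05, 0.30, -0.10, 0.00],
    "region": ["US", "EU", "US", "ASIA", "EU", "US"],
    "label": [1, 0, 1, 1, 0, 0],
})

model = CatBoostClassifier(verbose=False)    # default parameters
model.fit(df[["rate_change", "region"]], df["label"], cat_features=["region"])
print(model.predict(df[["rate_change", "region"]]))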
Support Vector Machines — SVC
SVMs are supervised learning algorithms used for classification, regression, and outlier detection. We will use a Support Vector Classifier (SVC) to find a hyperplane in n-dimensional space that accurately classifies the data. This hyperplane has the maximum distance between the data points of the different classes, and this distance is called the maximum margin. Let's take a two-dimensional space as an example: the hyperplane is a line dividing the space into two parts, with maximum distance between the classification labels.
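A small sketch of that two-dimensional case, on made-up linearly separable points with scikit-learn's SVC and a linear kernel; the fitted coefficients describe the separating line, and support_vectors_ holds the points that sit on the margin.

import numpy as np
from sklearn.svm import SVC

# two made-up, linearly separable clusters in two dimensions
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [6, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the points that define the margin
print("hyperplane: {:.2f}*x1 + {:.2f}*x2 + {:.2f} = 0".format(
    clf.coef_[0][0], clf.coef_[0][1], clf.intercept_[0]))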
Image by LAMFO

The data points closest to the line separating the space are called Support Vectors and dictate the hyperplane margin. So we start with our data in a low dimension, and if we can't classify it in that dimension, we move to a higher dimension to find a Support Vector Classifier that best divides our data into two groups, and so on. To transform the plane that the data lies on and find our Support Vector Classifier, we use a function called a Kernel.
The kernel function can have different shapes. For example, it can be a polynomial kernel or a radial kernel. It is important to notice that, for the sake of computational cost, kernel functions calculate the data relationships as if the data were in a higher dimension; in reality, the data is never transformed into that dimension. This trick is called the Kernel Trick.
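To make the kernel choice concrete, the sketch below fits the same classifier with a polynomial and a radial (RBF) kernel on a toy data set that no straight line can separate; scikit-learn applies the kernel trick internally, so the points are never explicitly mapped into the higher dimension.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# concentric circles: impossible to separate with a straight line in 2D
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=13)

for kernel in ("poly", "rbf"):
    clf = SVC(kernel=kernel, degree=3, gamma="scale")
    clf.fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))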
The code
CatBoost
Starting with CatBoost. In previous articles, we talked about the data and the assumptions we made to binarize it and deal with NaNs, so I will skip that explanation and focus on the model's results and their application.
import numpy as np
import pandas as pd
import quandl as qdl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from catboost import CatBoostClassifier
from sklearn import metrics

# get data from Quandl
data = pd.DataFrame()
meta_data = ['RICIA','RICIM','RICIE']
for code in meta_data:
df=qdl.get('RICI/'+code,start_date="2005-01-03", end_date="2020-07-01")
df.columns = [code]
data = pd.concat([data, df], axis=1)

meta_data = ['EMHYY','AAAEY','USEY']
for code in meta_data:
df=qdl.get('ML/'+code,start_date="2005-01-03", end_date="2020-07-01")
df.columns = [code]
data = pd.concat([data, df], axis=1)

# dealing with possible empty values (not much attention to this part, but it is very important)
data.fillna(data.mean(), inplace=True)
print(data.head())
print("\nData shape:\n",data.shape)#histograms
data.hist()
plt.show()

# scaling values to make them vary between 0 and 1
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)

# pulling dependent variable from Quandl (par yield curve)
par_yield = qdl.get('FED/RIFSPFF_N_D',start_date="2005-01-03", end_date="2020-07-01")
par_yield.columns = ['FED/RIFSPFF_N_D']

# create an empty df with same index as the variables and fill it with our dependent var values
par_data = pd.DataFrame(index=data_scaled.index, columns=['FED/RIFSPFF_N_D'])
par_data.update(par_yield['FED/RIFSPFF_N_D'])

# get the variation and binarize it
par_data=par_data.pct_change()
par_data.fillna(0, inplace=True)
par_data = par_data.apply(lambda x: [0 if y <= 0 else 1 for y in x])
print("Number of 0 and 1s:\n",par_data.value_counts())# plot number of 0 and 1s
sns.countplot(x='FED/RIFSPFF_N_D', data=par_data, palette='Blues')
plt.title('0s and 1s')
plt.savefig('0s and 1s')

# Over-sampling with ADASYN method
sampler = ADASYN(random_state=13)
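# note: newer versions of imbalanced-learn renamed fit_sample to fit_resample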
X_os, y_os = sampler.fit_sample(data_scaled, par_data.values.ravel())
columns = data_scaled.columns
data_scaled = pd.DataFrame(data=X_os,columns=columns )
par_data = pd.DataFrame(data=y_os,columns=['FED/RIFSPFF_N_D'])

print("\nProportion of 0s in oversampled data: ",len(par_data[par_data['FED/RIFSPFF_N_D']==0])/len(data_scaled))
print("\nProportion 1s in oversampled data: ",len(par_data[par_data['FED/RIFSPFF_N_D']==1])/len(data_scaled))
After adjusting the proportions of 0s and 1s in our label set, we split the data into training and test sets and create the model.
# split data into test and train set
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# just make it easier to write y
y = y_train['FED/RIFSPFF_N_D']

# Catboost model
clf = CatBoostClassifier(
    iterations=None, learning_rate=None, depth=None, l2_leaf_reg=None,
    model_size_reg=None, rsm=None, loss_function=None, border_count=None,
    feature_border_type=None, per_float_feature_quantization=None,
    input_borders=None, output_borders=None, fold_permutation_block=None,
    od_pval=None, od_wait=None, od_type=None, nan_mode=None,
    counter_calc_method=None, leaf_estimation_iterations=None,
    leaf_estimation_method=None, thread_count=None, random_seed=None,
    use_best_model=None, verbose=None, logging_level=None, metric_period=None,
    ctr_leaf_count_limit=None, store_all_simple_ctr=None, max_ctr_complexity=None,
    has_time=None, allow_const_label=None, classes_count=None, class_weights=None,
    one_hot_max_size=None, random_strength=None, name=None, ignored_features=None,
    train_dir=None, custom_loss=None, custom_metric=None, eval_metric=None,
    bagging_temperature=None, save_snapshot=None, snapshot_file=None,
    snapshot_interval=None, fold_len_multiplier=None, used_ram_limit=None,
    gpu_ram_part=None, allow_writing_files=None, final_ctr_computation_mode=None,
    approx_on_full_history=None, boosting_type=None, simple_ctr=None,
    combinations_ctr=None, per_feature_ctr=None, task_type=None,
    device_config=None, devices=None, bootstrap_type=None, subsample=None,
    sampling_unit=None, dev_score_calc_obj_block_size=None, max_depth=None,
    n_estimators=None, num_boost_round=None, num_trees=None,
    colsample_bylevel=None, random_state=None, reg_lambda=None, objective=None,
    eta=None, max_bin=None, scale_pos_weight=None, gpu_cat_features_storage=None,
    data_partition=None, metadata=None, early_stopping_rounds=None,
    cat_features=None, grow_policy=None, min_data_in_leaf=None,
    min_child_samples=None, max_leaves=None, num_leaves=None, score_function=None,
    leaf_estimation_backtracking=None, ctr_history_unit=None,
    monotone_constraints=None, feature_weights=None, penalties_coefficient=None,
    first_feature_use_penalties=None, model_shrink_rate=None,
    model_shrink_mode=None, langevin=None, diffusion_temperature=None,
    boost_from_average=None, text_features=None, tokenizers=None,
    dictionaries=None, feature_calcers=None, text_processing=None)
As you can see, I made sure to list every parameter the model accepts. Yes… a lot! But don’t worry, it is not yet time to optimize them; it is time to get a glimpse of the model’s performance. So we are not going to change any of them (they will all stay None).
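For reference, since every argument above is left at None (which tells CatBoost to fall back on its defaults), the long call is equivalent to constructing the classifier with no arguments at all:

# equivalent shorthand: all parameters left at their defaults
clf = CatBoostClassifier()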
clf.fit(X_train, y)

y_pred = clf.predict(X_test)
print('\nAccuracy of Catboost Classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n',confusion_matrix)
print('\nClassification report:\n',metrics.classification_report(y_test, y_pred))

# plot confusion matrix
disp = metrics.plot_confusion_matrix(clf, X_test, y_test,cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')

Classification report | Image by Author
CatBoost ROC curve | Image by Author
The results show an F1-score of 0.72, close to the Random Forest model results. It seems that this model will enter our “Potential Good Models” list for further investigation and hyperparameter optimization! Let’s see what the SVC model tells us!
Support Vector Classifier
The first part of the code, up to the oversampling, is pretty much the same as posted above, so we will dive straight into the model code.
# split data into test and train set
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# additional import needed for this model
from sklearn.svm import SVC

# just make it easier to write y
y = y_train['FED/RIFSPFF_N_D']

# Support Vector Classifier model
clf = SVC(C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=True, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=13)

# fit model
clf.fit(X_train, y)

# predict
y_pred = clf.predict(X_test)
print('\nAccuracy of SVC classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

SVC accuracy | Image by Author
The accuracy of the SVC is 0.65. Let’s see what the classification report shows us.
# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n',confusion_matrix)
print('\nClassification report:\n',metrics.classification_report(y_test, y_pred))

# plot confusion matrix
disp = metrics.plot_confusion_matrix(clf, X_test, y_test,cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')

# roc curve
logit_roc_auc = metrics.roc_auc_score(y_test, clf.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, clf.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='SVC (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - Support Vector Classifier')
plt.legend(loc="lower right")
plt.savefig('SVC_ROC')

SVC Confusion Matrix | Image by Author
Classification report | Image by Author
SVC ROC curve | Image by Author
Ok, so it turns out that our F1-score with the SVC model is 0.65. As we saw earlier, the CatBoost model, like the Random Forest, performed better, so we will keep those two on our “Potential Good Models” list for the hyperparameter optimization step. With the results of the five models at hand, we ended up with two promising models. The next step would be to optimize those two models and see which performs best, but that will be another series on how to optimize and compare models.
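As a hedged taste of what that optimization step could look like (the full treatment is left for that future series), one option is scikit-learn's GridSearchCV over a few CatBoost parameters; the grid values below are arbitrary examples, and X_train and y reuse the training split created earlier.

from sklearn.model_selection import GridSearchCV
from catboost import CatBoostClassifier

# illustrative grid only; a real search space would be chosen more carefully
param_grid = {
    "depth": [4, 6, 8],
    "learning_rate": [0.01, 0.1],
    "iterations": [200, 500],
}

search = GridSearchCV(
    CatBoostClassifier(verbose=False, random_seed=13),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y)
print(search.best_params_, search.best_score_)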
This article was written in conjunction with Guilherme Bezerra Pujades Magalhães.
References and great links
[1] Laboratory of Machine Learning in Finance and Organizations. LAMFO.
[2] J. Starmer, StatQuest with Josh Starmer on Support Vector Machines, YouTube.
[3] Catboost Documentation.
[4] M2X Investments.
Translated from: https://towardsdatascience.com/predicting-interest-rate-with-classification-models-part-3-3eef38dd7b32