ML之FE: Five common dataset-splitting methods in feature engineering (splitting special data types, e.g. time-series splits), explained with code
Contents

Splitting special data types
5.1 Time-series splitting: TimeSeriesSplit
Splitting special data types

5.1 Time-series splitting: TimeSeriesSplit
The relevant excerpt from the sklearn source (found at `sklearn.model_selection._split`):

```python
class TimeSeriesSplit(_BaseKFold):
    """Time Series cross-validator

    .. versionadded:: 0.18

    Provides train/test indices to split time series data samples
    that are observed at fixed time intervals, in train/test sets.
    In each split, test indices must be higher than before, and thus
    shuffling in cross validator is inappropriate.

    This cross-validation object is a variation of :class:`KFold`.
    In the kth split, it returns first k folds as train set and the
    (k+1)th fold as test set.

    Note that unlike standard cross-validation methods, successive
    training sets are supersets of those that come before them.

    Read more in the :ref:`User Guide <cross_validation>`.

    Parameters
    ----------
    n_splits : int, default=5
        Number of splits. Must be at least 2.

        .. versionchanged:: 0.22
            ``n_splits`` default value changed from 3 to 5.

    max_train_size : int, default=None
        Maximum size for a single training set.
```

In short: `TimeSeriesSplit` yields train/test indices for samples observed at fixed time intervals. In every split the test indices come strictly after the training indices, so shuffling inside the cross-validator is inappropriate. It is a variant of `KFold`: in the kth split, the first k folds form the training set and the (k+1)th fold forms the test set. Unlike standard cross-validation, each successive training set is a superset of the ones before it. The `n_splits` parameter (int, default 5, minimum 2; the default changed from 3 to 5 in version 0.22) sets the number of splits, and `max_train_size` (int, default None) caps the size of any single training set.
```python
    Examples
    --------
    >>> import numpy as np
    >>> from sklearn.model_selection import TimeSeriesSplit
    >>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
    >>> y = np.array([1, 2, 3, 4, 5, 6])
    >>> tscv = TimeSeriesSplit()
    >>> print(tscv)
    TimeSeriesSplit(max_train_size=None, n_splits=5)
    >>> for train_index, test_index in tscv.split(X):
    ...     print("TRAIN:", train_index, "TEST:", test_index)
    ...     X_train, X_test = X[train_index], X[test_index]
    ...     y_train, y_test = y[train_index], y[test_index]
    TRAIN: [0] TEST: [1]
    TRAIN: [0 1] TEST: [2]
    TRAIN: [0 1 2] TEST: [3]
    TRAIN: [0 1 2 3] TEST: [4]
    TRAIN: [0 1 2 3 4] TEST: [5]

    Notes
    -----
    The training set has size ``i * n_samples // (n_splits + 1)
    + n_samples % (n_splits + 1)`` in the ``i``th split,
    with a test set of size ``n_samples // (n_splits + 1)``,
    where ``n_samples`` is the number of samples.
    """
    @_deprecate_positional_args
    def __init__(self, n_splits=5, *, max_train_size=None):
        super().__init__(n_splits, shuffle=False, random_state=None)
        self.max_train_size = max_train_size

    def split(self, X, y=None, groups=None):
        """Generate indices to split data into training and test set.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training data, where n_samples is the number of samples
            and n_features is the number of features.

        y : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.

        groups : array-like of shape (n_samples,)
            Always ignored, exists for compatibility.

        Yields
        ------
        train : ndarray
            The training set indices for that split.

        test : ndarray
            The testing set indices for that split.
        """
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        if n_folds > n_samples:
            raise ValueError(
                ("Cannot have number of folds ={0} greater"
                 " than the number of samples: {1}.").format(n_folds, n_samples))
        indices = np.arange(n_samples)
        test_size = n_samples // n_folds
        test_starts = range(test_size + n_samples % n_folds, n_samples,
                            test_size)
        for test_start in test_starts:
            if self.max_train_size and self.max_train_size < test_start:
                yield (indices[test_start - self.max_train_size:test_start],
                       indices[test_start:test_start + test_size])
            else:
                yield (indices[:test_start],
                       indices[test_start:test_start + test_size])
```
Summary
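In practice, `TimeSeriesSplit` is most often passed as the `cv` argument to scikit-learn's model-selection helpers, so each fold is scored on data that comes strictly after its training window. A minimal end-to-end sketch (the synthetic linear data here is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic time-ordered data: y is (almost) linear in the time index.
rng = np.random.RandomState(0)
X = np.arange(30, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + rng.normal(scale=0.1, size=30)

# Each of the 5 folds trains on the past and scores on the next block.
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y, cv=tscv, scoring="r2")
print(scores.shape)  # one score per split
```

Because the splitter never shuffles, this setup avoids the look-ahead leakage that an ordinary `KFold` would introduce on time-ordered data.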