Investigating AutoGluon's Data Processing and Tabular-NN
Table of Contents
- Cracking the Shell: Straight to AG's Technical Core
 - Hyperparameter Search and Model Training
 - TabularNN
 - TabularNN's Model-Specific Feature Processing
 - Building a ColumnTransformer for Each Feature Type
 - TabularNN's Network Architecture
 
For background you can refer to this blog post: AutoGluon Tabular 表數據全流程自動機器學習 AutoML. The author summarizes things reasonably well, but never digs down to the code level.
Cracking the Shell: Straight to AG's Technical Core
We start with this hyperparameter search space:

```python
hyperparams = {
    'NN': {'num_epochs': 10, 'activation': 'relu', 'dropout_prob': ag.Real(0.0, 0.5)},
    'GBM': {'num_boost_round': 1000, 'learning_rate': ag.Real(0.01, 0.1, log=True)},
}
```

Step into autogluon.task.tabular_prediction.tabular_prediction.TabularPrediction#fit, and the first thing you see is:
```python
learner = Learner(path_context=output_directory, label=label, problem_type=problem_type,
                  objective_func=eval_metric, stopping_metric=stopping_metric,
                  id_columns=id_columns, feature_generator=feature_generator,
                  trainer_type=trainer_type, label_count_threshold=label_count_threshold,
                  random_seed=random_seed)
learner.fit(X=train_data, X_test=tuning_data, scheduler_options=scheduler_options,
            hyperparameter_tune=hyperparameter_tune, feature_prune=feature_prune,
            holdout_frac=holdout_frac, num_bagging_folds=num_bagging_folds,
            num_bagging_sets=num_bagging_sets, stack_ensemble_levels=stack_ensemble_levels,
            hyperparameters=hyperparameters, ag_args_fit=ag_args_fit,
            excluded_model_types=excluded_model_types, time_limit=time_limits_orig,
            save_data=cache_data, save_bagged_folds=save_bagged_folds, verbosity=verbosity)
```

Here Learner is autogluon.utils.tabular.ml.learner.default_learner.DefaultLearner; we land in its __init__.
Its docstring reads: "Learner encompasses full problem, loading initial data, feature generation, model training, model prediction".
Step into autogluon.utils.tabular.ml.learner.default_learner.DefaultLearner#fit:
```python
X, y, X_test, y_test, holdout_frac, num_bagging_folds = self.general_data_processing(
    X, X_test, holdout_frac, num_bagging_folds)
```

Step into general_data_processing in the same file.
The first thing you see is this line — flagging it for later. Won't this blow up memory?

```python
X = copy.deepcopy(X)
```
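A quick standalone check of that concern (my own sketch, not AG code; the frame below is hypothetical): deepcopy really does duplicate the frame's memory, so a large training table is briefly held twice.

```python
import copy

import numpy as np
import pandas as pd

# ~80 MB of float64s; after deepcopy the process holds two full copies.
X = pd.DataFrame(np.zeros((1_000_000, 10)))
X_copy = copy.deepcopy(X)
print(X.memory_usage(deep=True).sum() / 1e6, "MB per copy")  # ~80 MB
```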
Next, missing values in the label. They are handled by simply dropping those rows:

```python
missinglabel_inds = [index for index, x in X[self.label].isna().iteritems() if x]
X = X.drop(missinglabel_inds, axis=0)
```

Also flagging get_problem_type in the same file. There are three problem types: MULTICLASS, BINARY, and REGRESSION.
Actually there are four: there is also a SOFTCLASS type I'd never heard of. See autogluon/utils/tabular/ml/constants.py:5.
With the labels handled, feature processing begins.
If X_test is provided, it gets stacked on top of X so feature engineering runs over both at once (see the snippet below). This is a standard move, but you have to be careful about data leakage.
```python
X_super = pd.concat([X, X_test], ignore_index=True)
# ... processing ...
X = X_super.head(len(X)).set_index(X.index)
X_test = X_super.tail(len(X_test)).set_index(X_test.index)
```
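Why the leakage warning matters: any statistic fitted on the stacked frame absorbs information from the test distribution. A minimal sketch (my own, not AG code; the data is made up):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({'x': np.arange(100, dtype=float)})
X_test = pd.DataFrame({'x': np.arange(1000, 1100, dtype=float)})  # shifted distribution

train_only = StandardScaler().fit(X)
combined = StandardScaler().fit(pd.concat([X, X_test], ignore_index=True))
print(train_only.mean_, combined.mean_)  # [49.5] vs [549.5]: test statistics leaked into the scaling
```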
Good — on to the data processing itself:

```python
X = self.feature_generator.fit_transform(X, banned_features=self.submission_columns,
                                         drop_duplicates=False)
```

self.feature_generator comes from autogluon.utils.tabular.features.auto_ml_feature_generator.AutoMLFeatureGenerator.
Step into it.
Flagging the get_feature_types function in this file, which detects date and text features — worth borrowing.
In minimize_categorical_memory_usage, ordinal encoding is done in this curious way (object columns were already converted to category before being passed in):
```python
for column in cat_columns:
    new_categories = list(range(len(X_features[column].cat.categories.values)))
    X_features[column].cat.rename_categories(new_categories, inplace=True)
```
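A self-contained sketch of the same trick (not AG code): renaming the category index to 0..n-1 ordinal-encodes the column while it stays a memory-cheap category dtype.

```python
import pandas as pd

s = pd.Series(['red', 'green', 'red', 'blue'], dtype='category')
print(list(s.cat.categories))  # ['blue', 'green', 'red']
s = s.cat.rename_categories(list(range(len(s.cat.categories))))
print(s.tolist())              # [2, 1, 2, 0]
```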
Pop back out to autogluon/utils/tabular/ml/learner/default_learner.py:66. Here self.trainer_type is <class 'autogluon.utils.tabular.ml.trainer.auto_trainer.AutoTrainer'>; step into autogluon.utils.tabular.ml.trainer.auto_trainer.AutoTrainer#train.
The hyperparameters get processed (hard to tell what this is really doing):
```python
self.hyperparameters = self._process_hyperparameters(hyperparameters=hyperparameters,
                                                     ag_args_fit=ag_args_fit,
                                                     excluded_model_types=excluded_model_types)
```

Then the models are fetched:
```python
models = self.get_models(hyperparameters=self.hyperparameters,
                         hyperparameter_tune=hyperparameter_tune, level=0)
```

get_models in turn calls autogluon.utils.tabular.ml.trainer.model_presets.presets.get_preset_models.
Observed: level_key = 'default'.
I suspect model is a kwargs dict.
Indeed:

```
model  {'num_epochs': 10, 'activation': 'relu', 'dropout_prob': Real: lower=0.0, upper=0.5, 'AG_args': {'model_type': 'NN'}}
```

See autogluon/utils/tabular/ml/trainer/model_presets/presets.py:129:
```python
model_names_set.add(name)
model_params = copy.deepcopy(model)
model_params.pop(AG_ARGS)
```

model_init is the actual model instance:
```python
model_init = model_type(path=path, name=name, problem_type=problem_type,
                        objective_func=objective_func, stopping_metric=stopping_metric,
                        num_classes=num_classes, hyperparameters=model_params)
```

Step into autogluon.utils.tabular.ml.models.abstract.abstract_model.AbstractModel#__init__.
Flagging where TabularNN lives: autogluon.utils.tabular.ml.models.tabular_nn.tabular_nn_model.TabularNeuralNetModel.
Step into autogluon.utils.tabular.ml.trainer.abstract_trainer.AbstractTrainer#stack_new_level.
Yet another round of data processing? I didn't fully follow this part.
Step into train_multi in the same file,
 which nests into train_multi_initial,
 which nests into train_multi_fold,
 which nests into train_single_full.
Hyperparameter Search and Model Training
```python
hpo_models, hpo_model_performances, hpo_results = model.hyperparameter_tune(
    X_train=X_train, X_test=X_test, Y_train=y_train, Y_test=y_test,
    scheduler_options=(self.scheduler_func, self.scheduler_options),
    verbosity=self.verbosity)
```

The model (autogluon.utils.tabular.ml.models.lgb.lgb_model.LGBModel) ships with its own hyperparameter_tune method.
```
self.scheduler_func     <class 'autogluon.scheduler.fifo.FIFOScheduler'>
self.scheduler_options  {'resource': {'num_cpus': 12, 'num_gpus': 0}, 'searcher': 'random',
                         'search_options': {}, 'checkpoint': None, 'resume': False,
                         'num_trials': 5, 'time_out': 27.0,
                         'reward_attr': 'validation_performance', 'time_attr': 'epoch',
                         'visualizer': 'none', 'dist_ip_addrs': []}
```

Let's look at LGBM first — it is also the highest-priority model. The paper keeps insisting on how great its TabularNN is, yet even the authors don't give it top priority; actions speak louder than words.
Step into autogluon.utils.tabular.ml.models.lgb.lgb_model.LGBModel#hyperparameter_tune.
This snippet sanity-checks the min_data_in_leaf hyperparameter:
```python
if isinstance(params_copy['min_data_in_leaf'], Int):
    upper_minleaf = params_copy['min_data_in_leaf'].upper
    if upper_minleaf > X_train.shape[0]:  # TODO: this min_data_in_leaf adjustment based on sample size may not be necessary
        upper_minleaf = max(1, int(X_train.shape[0] / 5.0))
        lower_minleaf = params_copy['min_data_in_leaf'].lower
        if lower_minleaf > upper_minleaf:
            lower_minleaf = max(1, int(upper_minleaf / 3.0))
        params_copy['min_data_in_leaf'] = Int(lower=lower_minleaf, upper=upper_minleaf)
```
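Worked through in plain Python (the numbers are my own illustration): with 100 training rows and a search range of Int(20, 10000), the range collapses to a point.

```python
# Plain-Python rerun of the clamp above with made-up numbers.
n_rows = 100
lower, upper = 20, 10000
if upper > n_rows:
    upper = max(1, int(n_rows / 5.0))     # -> 20
    if lower > upper:
        lower = max(1, int(upper / 3.0))
print(lower, upper)  # 20 20: the search range degenerates to a single value
```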
The HPO and training flow is where it gets interesting. First, hyperparameter_tune ends by calling scheduler.run().
Press F7 to step in.
run() itself ends with a loop that keeps calling self.schedule_next().
Press F7 to step in again.
The config here is suggested at random (AG also implements other suggesters, e.g. skopt) — see the sketch below.
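Conceptually, a 'random' searcher just draws one concrete config per trial from the declared ranges. A minimal sketch (my own, not AG's searcher API):

```python
import random

# Hypothetical mirror of the NN search space declared at the top of this post.
space = {'dropout_prob': (0.0, 0.5)}   # ag.Real(0.0, 0.5)
config = {name: random.uniform(lo, hi) for name, (lo, hi) in space.items()}
print(config)  # e.g. {'dropout_prob': 0.2276}
```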
schedule_next ends with this snippet:
```python
task = self._create_new_task(config, resources=resources)
self.add_job(task, **extra_kwargs)
```

Printing the task:

```
task  Task (task_id: 0,
      fn: <function lgb_trial at 0x7f4de2db31e0>,
      args: {args: {'util_args': {'dataset_train_filename': 'dataset_train.bin', 'dataset_val_filename': 'dataset_val.b..,
             config: {'feature_fraction': 1.0, 'learning_rate': 0.0316227766, 'min_data_in_leaf': 20, 'num_leaves': 31}, },
      resource: DistributedResource(Node = Remote REMOTE_ID: 0, <Remote: 'inproc://192.168.1.106/2563/1' processes=1 threads=12, memory=16.68 GB>
                nCPUs = 12, CPU_IDs = {[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}))
type(task)  <class 'autogluon.core.task.Task'>
```

In the end this presumably calls autogluon.utils.tabular.ml.models.lgb.hyperparameters.lgb_trial.lgb_trial.
Go to autogluon/utils/tabular/ml/models/lgb/hyperparameters/lgb_trial.py:19 and set a breakpoint (F9).
```
args.keys()  dict_keys(['util_args', 'num_boost_round', 'num_threads', 'objective', 'verbose',
                        'boosting_type', 'two_round', 'learning_rate', 'feature_fraction',
                        'min_data_in_leaf', 'num_leaves', 'seed_value', 'task_id'])
```

At a quick glance, num_boost_round, learning_rate, feature_fraction, min_data_in_leaf, and num_leaves are all common LGBM hyperparameters, but a few unrelated entries such as task_id are mixed in.
Step into autogluon.utils.tabular.ml.models.abstract.model_trial.prepare_inputs.
Some pretty flashy moves here.
type(args["util_args"]) <class 'autogluon.utils.edict.EasyDict'> args["util_args"].model <autogluon.utils.tabular.ml.models.lgb.lgb_model.LGBModel object at 0x7f4db0840860>最后調用了一個autogluon.utils.tabular.ml.models.abstract.model_trial.fit_and_save_model函數。
This is getting dizzying. Let's set aside LGBM and its baffling control flow and look at TabularNN directly.
TabularNN
Set a breakpoint at autogluon/scheduler/fifo.py:235; once all 5 LGBM trials have finished, Run To Cursor to autogluon/scheduler/fifo.py:300 and print:
```
task.fn  <function tabular_nn_trial at 0x7f616248f730>
```

Double-tap Shift to search for tabular_nn_trial, go to autogluon.utils.tabular.ml.models.tabular_nn.tabular_nn_trial.tabular_nn_trial, and set a breakpoint inside the function. (Run To Cursor won't reach it: AG, much like HpBandSter, runs worker and master in separate processes/threads.)
Re-run the code until execution reaches tabular_nn_trial:
```python
train_dataset = TabularNNDataset.load(util_args.train_path)
```

```
train_dataset.feature_groups
{'vector': ['age', 'fnlwgt', 'education-num', 'hours-per-week', 'capital-gain', 'capital-loss', 'sex'],
 'embed': ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'native-country'],
 'language': []}
```

TabularNN's Model-Specific Feature Processing
First, TabularNN's data processing (I'm particularly curious how it handles data skew).
Set a breakpoint at autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py:452 in TabularNeuralNetModel#generate_datasets and press F7 to step in.
It turns out TabularNN doesn't implement its own preprocess function; it uses the parent class's.
Step into process_train_data in the same file — this function is where the real preprocessing happens.
First it determines the feature types. Step into _get_types_of_features; there are five feature types in total:
```python
types_of_features = {'continuous': [], 'skewed': [], 'onehot': [], 'embed': [], 'language': []}
# continuous = numeric features to rescale
# skewed = features to which we will apply power (i.e. log / box-cox) transform before normalization
# onehot = features to one-hot encode (unknown categories for these features encountered at test-time
#          are encoded as all zeros). We one-hot encode any features encountered that only have two
#          unique values.
for feature in self.features:
    feature_data = df[feature]  # pd.Series
    num_unique_vals = len(feature_data.unique())
    if num_unique_vals == 2:  # will be onehot encoded regardless of proc.embed_min_categories value
        types_of_features['onehot'].append(feature)
    elif feature in continuous_featnames:
        if np.abs(feature_data.skew()) > skew_threshold:
            types_of_features['skewed'].append(feature)
        else:
            types_of_features['continuous'].append(feature)
    elif feature in categorical_featnames:
        if num_unique_vals >= embed_min_categories:  # sufficiently many categories to warrant learned embedding dedicated to this feature
            types_of_features['embed'].append(feature)
        else:
            types_of_features['onehot'].append(feature)
    elif feature in language_featnames:
        types_of_features['language'].append(feature)
return types_of_features
```

Here skew_threshold = 0.99 and embed_min_categories = 4.
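To make the rules concrete, here is a toy rerun of the same bucketing logic on a made-up frame (my own sketch; the column names and data are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'capital-gain': rng.lognormal(0.0, 2.0, 1000),     # heavy right tail
    'age': rng.normal(40.0, 10.0, 1000),               # roughly symmetric
    'sex': rng.choice(['M', 'F'], 1000),               # 2 unique values
    'occupation': rng.choice(list('ABCDEFGH'), 1000),  # 8 categories
})
skew_threshold, embed_min_categories = 0.99, 4
print(abs(df['capital-gain'].skew()) > skew_threshold)     # True  -> 'skewed'
print(abs(df['age'].skew()) > skew_threshold)              # False -> 'continuous'
print(df['sex'].nunique() == 2)                            # True  -> 'onehot'
print(df['occupation'].nunique() >= embed_min_categories)  # True  -> 'embed'
```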
Building a ColumnTransformer for Each Feature Type
Once the feature types are identified, it starts building the ColumnTransformer.
Here's the code verbatim:
In this run, impute_strategy = 'median' and max_category_levels = 100.
```python
def _create_preprocessor(self, impute_strategy, max_category_levels):
    """ Defines data encoders used to preprocess different data types and creates instance variable which is sklearn ColumnTransformer object """
    if self.processor is not None:
        Warning("Attempting to process training data for TabularNeuralNetModel, but previously already did this.")
    continuous_features = self.types_of_features['continuous']
    skewed_features = self.types_of_features['skewed']
    onehot_features = self.types_of_features['onehot']
    embed_features = self.types_of_features['embed']
    language_features = self.types_of_features['language']
    transformers = []  # order of various column transformers in this list is important!
    if len(continuous_features) > 0:
        continuous_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy=impute_strategy)),
            ('scaler', StandardScaler())])
        transformers.append( ('continuous', continuous_transformer, continuous_features) )
    if len(skewed_features) > 0:
        power_transformer = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy=impute_strategy)),
            ('quantile', QuantileTransformer(output_distribution='normal'))])  # Or output_distribution = 'uniform'
        # TODO: remove old code: ('power', PowerTransformer(method=self.params['proc.power_transform_method'])) ])
        transformers.append( ('skewed', power_transformer, skewed_features) )
    if len(onehot_features) > 0:
        onehot_transformer = Pipeline(steps=[
            # TODO: Consider avoiding converting to string for improved memory efficiency
            ('to_str', FunctionTransformer(self.convert_df_dtype_to_str)),
            ('imputer', SimpleImputer(strategy='constant', fill_value=self.unique_category_str)),
            ('onehot', OneHotMergeRaresHandleUnknownEncoder(max_levels=max_category_levels, sparse=False))])  # test-time unknown values will be encoded as all zeros vector
        transformers.append( ('onehot', onehot_transformer, onehot_features) )
    if len(embed_features) > 0:  # Ordinal transformer applied to convert to-be-embedded categorical features to integer levels
        ordinal_transformer = Pipeline(steps=[
            ('to_str', FunctionTransformer(self.convert_df_dtype_to_str)),
            ('imputer', SimpleImputer(strategy='constant', fill_value=self.unique_category_str)),
            ('ordinal', OrdinalMergeRaresHandleUnknownEncoder(max_levels=max_category_levels))])  # returns 0-n when max_category_levels = n-1. category n is reserved for unknown test-time categories.
        transformers.append( ('ordinal', ordinal_transformer, embed_features) )
    if len(language_features) > 0:
        raise NotImplementedError("language_features cannot be used at the moment")
    return ColumnTransformer(transformers=transformers)  # numeric features are processed in the same order as in numeric_features vector, so feature-names remain the same.
```

Note that a QuantileTransformer is used rather than a PowerTransformer, yet the variable is still named power_transformer. The TODO comment gives it away: the old PowerTransformer step was swapped out and the variable name was never updated.
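A quick standalone look at what that 'skewed' pipeline does (my own sketch; the data is made up): QuantileTransformer(output_distribution='normal') maps a heavy-tailed column onto a roughly Gaussian shape.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=2.0, size=(1000, 1))  # skew far above the 0.99 threshold
z = QuantileTransformer(output_distribution='normal', n_quantiles=1000).fit_transform(x)
print(round(float(pd.Series(x.ravel()).skew()), 2))  # large, e.g. > 10
print(round(float(pd.Series(z.ravel()).skew()), 2))  # ~0.0 after the transform
```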
Let's go through the transformers one by one:
The home-grown encoders take max_levels=max_category_levels (100). Let's look into these two custom encoders, OneHotMergeRaresHandleUnknownEncoder and OrdinalMergeRaresHandleUnknownEncoder.
Frankly the code feels poorly written. The max_levels=max_category_levels (100) idea closely resembles auto-sklearn 2.0's Category Coalescence / Minority Coalescer — except that ASKL thinks in ratios/fractions (minimum percentage of samples ∈ [0.0001, 0.5]), whereas AG picks an absolute count, and a hard-coded one at that (max_category_levels = 100). A sketch of the idea follows.
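What "merge rares" amounts to, as a minimal sketch (my own, not AG's actual encoder): keep the max_levels most frequent categories and collapse everything else into a single bucket.

```python
import pandas as pd

def merge_rare_categories(s: pd.Series, max_levels: int = 100,
                          rare_token: str = '__rare__') -> pd.Series:
    # Keep the max_levels most frequent categories; everything else becomes one bucket.
    top = s.value_counts().nlargest(max_levels).index
    return s.where(s.isin(top), rare_token)

s = pd.Series(['a'] * 50 + ['b'] * 30 + ['c'] * 2 + ['d'])
print(merge_rare_categories(s, max_levels=2).value_counts())
# a 50, b 30, __rare__ 3
```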
```python
self.feature_arraycol_map = self._get_feature_arraycol_map(max_category_levels=max_category_levels)
```

This builds an OrderedDict of feature-name -> list of column indices in df corresponding to this feature:
```
{'age': [0], 'fnlwgt': [1], 'education-num': [2], 'hours-per-week': [3], 'capital-gain': [4],
 'capital-loss': [5], 'sex': [6, 7], 'workclass': [8], 'education': [9], 'marital-status': [10],
 'occupation': [11], 'relationship': [12], 'race': [13], 'native-country': [14]}
```

A whole separate function just to compute the one-to-many mapping of the feature processing — needlessly roundabout.
TabularNN's Network Architecture
That's it for TabularNN's data-processing code; on to the training code. The call chain:
autogluon.utils.tabular.ml.models.abstract.model_trial.fit_and_save_model
 autogluon.utils.tabular.ml.models.abstract.abstract_model.AbstractModel#fit
 autogluon.utils.tabular.ml.models.tabular_nn.tabular_nn_model.TabularNeuralNetModel#_fit
Step into get_net.
```python
self.model = EmbedNet(train_dataset=train_dataset, params=params,
                      num_net_outputs=self.num_net_outputs, ctx=self.ctx)
```

```
params  {'num_epochs': 10, 'epochs_wo_improve': 20, 'seed_value': None, 'proc.embed_min_categories': 4,
         'proc.impute_strategy': 'median', 'proc.max_category_levels': 100, 'proc.skew_threshold': 0.99,
         'network_type': 'widedeep', 'layers': None, 'numeric_embed_dim': None, 'activation': 'relu',
         'max_layer_width': 2056, 'embedding_size_factor': 1.0, 'embed_exponent': 0.56,
         'max_embedding_dim': 100, 'y_range': None, 'y_range_extend': 0.05, 'use_batchnorm': True,
         'dropout_prob': 0.25, 'batch_size': 512, 'loss_function': None, 'optimizer': 'adam',
         'learning_rate': 0.0003, 'weight_decay': 1e-06, 'clip_gradient': 100.0, 'momentum': 0.9,
         'lr_scheduler': None, 'base_lr': 3e-05, 'target_lr': 1.0, 'lr_decay': 0.1,
         'warmup_epochs': 10, 'use_ngram_features': False}
```

Step into EmbedNet's constructor.
train_dataset.getNumCategoriesEmbeddings() is there to count the cardinality of each categorical feature.
getEmbedSizes then computes each categorical feature's post-embedding dimension.
Note to self: look into MLBox's EntityCoding.
```python
def getEmbedSizes(train_dataset, params, num_categs_per_feature):
    """ Returns list of embedding sizes for each categorical variable.
        Selects this adaptively based on training_dataset.
        Note: Assumes there is at least one embed feature.
    """
    max_embedding_dim = params['max_embedding_dim']
    embed_exponent = params['embed_exponent']
    size_factor = params['embedding_size_factor']
    embed_dims = [int(size_factor * max(2, min(max_embedding_dim,
                                               1.6 * num_categs_per_feature[i] ** embed_exponent)))
                  for i in range(len(num_categs_per_feature))]
    return embed_dims
```
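Plugging in the defaults from the params dump above (embed_exponent=0.56, embedding_size_factor=1.0, max_embedding_dim=100) reproduces the embedding widths we will see in the printed network below:

```python
# Worked by hand for the cardinalities of this run's embed features.
for card in (6, 7, 14):
    dim = int(1.0 * max(2, min(100, 1.6 * card ** 0.56)))
    print(card, '->', dim)  # 6 -> 4, 7 -> 4, 14 -> 7, matching the EmbedBlocks below
```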
Next stop: autogluon.utils.tabular.ml.models.tabular_nn.tabular_nn_model.TabularNeuralNetModel#set_net_defaults.

```python
vector_dim = train_dataset.dataset._data[train_dataset.vectordata_index].shape[1]  # total dimensionality of vector features
prop_vector_features = train_dataset.num_vector_features() / float(train_dataset.num_features)  # Fraction of features that are numeric
min_numeric_embed_dim = 32
max_numeric_embed_dim = params['max_layer_width']
params['numeric_embed_dim'] = int(min(max_numeric_embed_dim,
                                      max(min_numeric_embed_dim,
                                          params['layers'][0] * prop_vector_features * np.log10(vector_dim + 10))))
```

Here params['layers'] is [256, 128].
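Plugging in this run's numbers (7 of the 14 features are numeric, and the processed vector block is 8 columns wide since 'sex' became two one-hot columns):

```python
import numpy as np

vector_dim = 8
prop_vector_features = 7 / 14
numeric_embed_dim = int(min(2056, max(32, 256 * prop_vector_features * np.log10(vector_dim + 10))))
print(numeric_embed_dim)  # 160 -> matches the NumericBlock's Dense(8 -> 160) below
```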
Now at autogluon/utils/tabular/ml/models/tabular_nn/tabular_nn_model.py:328:

```
self.model
EmbedNet(
  (numeric_block): NumericBlock(
    (body): Dense(8 -> 160, Activation(relu))
  )
  (embed_blocks): HybridSequential(
    (0): EmbedBlock((body): Embedding(7 -> 4, float32))
    (1): EmbedBlock((body): Embedding(14 -> 7, float32))
    (2): EmbedBlock((body): Embedding(6 -> 4, float32))
    (3): EmbedBlock((body): Embedding(14 -> 7, float32))
    (4): EmbedBlock((body): Embedding(7 -> 4, float32))
    (5): EmbedBlock((body): Embedding(6 -> 4, float32))
    (6): EmbedBlock((body): Embedding(6 -> 4, float32))
  )
  (output_block): WideAndDeepBlock(
    (deep): FeedforwardBlock(
      (body): HybridSequential(
        (0): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=194)
        (1): Dropout(p = 0.25, axes=())
        (2): Dense(194 -> 256, Activation(relu))
        (3): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=256)
        (4): Dropout(p = 0.25, axes=())
        (5): Dense(256 -> 128, Activation(relu))
        (6): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=128)
        (7): Dropout(p = 0.25, axes=())
        (8): Dense(128 -> 2, linear)
      )
    )
    (wide): Dense(194 -> 2, linear)
  )
)
```

Sanity check: the deep branch's input width of 194 is exactly the 160-dim numeric embedding plus the seven categorical embeddings (4 + 7 + 4 + 7 + 4 + 4 + 4 = 34).

Summary