Python Implementation of the ReliefF Feature Selection Algorithm
Copyright notice: this is the blogger's original article; when reposting, please include the source link: https://blog.csdn.net/qq_40871363/article/details/86594578
Contents

- 1. Introduction to the ReliefF Algorithm
  - (1) Principle
  - (2) Pseudo-algorithm
- 2. Python Implementation of ReliefF
  - (1) Code
  - (2) Data file: the watermelon dataset
1. Introduction to the ReliefF Algorithm
ReliefF is an extension of the Relief algorithm to multi-class problems.
(1) Principle
Suppose the samples in dataset D belong to N classes. For a sample x_i belonging to class n, ReliefF first searches class n for the k nearest neighbors of x_i, called the near-hits; it then searches each class other than n for k nearest neighbors of x_i, called the near-misses. The component of the relevance statistic corresponding to attribute j is then:
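The formula itself is missing from the original post. Reconstructed from the code below (which follows Kononenko's ReliefF formulation), the contribution of one sampled instance $x_i$ to the weight of attribute $j$ is:

$$
W_j \;=\; \sum_{l \neq \operatorname{class}(x_i)} \frac{p_l}{1 - p_{\operatorname{class}(x_i)}} \cdot \frac{1}{k} \sum_{m \in M_l(x_i)} \operatorname{diff}(j, x_i, x_m) \;-\; \frac{1}{k} \sum_{h \in H(x_i)} \operatorname{diff}(j, x_i, x_h)
$$

where $H(x_i)$ is the set of $k$ near-hits, $M_l(x_i)$ the set of $k$ near-misses from class $l$, $p_l$ the fraction of class-$l$ samples in $D$, and $\operatorname{diff}(j, a, b)$ is the squared min-max-normalized difference $\left(\frac{|a_j - b_j|}{\max_j - \min_j}\right)^2$ for numeric attributes and a 0/1 indicator (0 if equal, 1 otherwise) for discrete ones. The final weight of attribute $j$ is the average of these contributions over all sampled instances.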
(2) Pseudo-algorithm
To be added.
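Pending the author's update, the standard procedure (again Kononenko's ReliefF, which is what the code below implements) runs roughly as follows:

1. Initialize the weight $W_j$ of every attribute $j$ to 0.
2. Draw a sample instance $x_i$ (the code samples a fraction `sample_rate` of the rows).
3. Find the $k$ near-hits $H(x_i)$ within the same class and, for every other class $l$, the $k$ near-misses $M_l(x_i)$.
4. For every attribute $j$, subtract the average diff to the near-hits and add the prior-weighted average diff to the near-misses, as in the formula above.
5. Repeat steps 2-4 for every sampled instance, average the weights, and keep the attributes whose weight exceeds the threshold $t$.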
2. Python Implementation of ReliefF
(1) Code
Code snippet:

```python
"""
Feature selection, method 1: filter-style feature selection (the ReliefF algorithm).
Idea: first "filter" the initial features with a selection pass, then train the
model on the filtered features only.
Date: 2019-1-16
"""
import random

import numpy as np
import numpy.linalg as la
import pandas as pd


class ReliefError(Exception):
    """Reserved for input-validation errors (unused in this demo)."""
    pass


class Relief:
    def __init__(self, data_df, sample_rate, t, k):
        """
        :param data_df: DataFrame (columns are features, rows are samples;
                        the last column is the class label)
        :param sample_rate: fraction of rows to sample when estimating weights
        :param t: threshold on the weight statistic for keeping a feature
        :param k: number of nearest neighbors
        """
        self.__data = data_df
        self.__feature = data_df.columns
        self.__sample_num = int(round(len(data_df) * sample_rate))
        self.__t = t
        self.__k = k

    # Preprocessing: encode discrete (string) features as integers,
    # leave numeric features untouched.
    def get_data(self):
        new_data = pd.DataFrame()
        for one in self.__feature[:-1]:
            col = self.__data[one]
            # Crude numeric test: does the first value look like an int,
            # a float, or a negative number?
            first = str(list(col)[0])
            if first.split(".")[0].isdigit() or first.isdigit() or \
                    first.split('-')[-1].split(".")[-1].isdigit():
                new_data[one] = self.__data[one]
            else:
                # Discrete feature: map each distinct value to an integer code.
                keys = list(set(list(col)))
                values = list(range(len(keys)))
                new = dict(zip(keys, values))
                new_data[one] = self.__data[one].map(new)
        new_data[self.__feature[-1]] = self.__data[self.__feature[-1]]
        return new_data

    # For one sample, return its k near-hits (same class) and, for every
    # other class, its k near-misses.
    def get_neighbors(self, row):
        df = self.get_data()
        row_type = row[df.columns[-1]]
        right_df = df[df[df.columns[-1]] == row_type].drop(columns=[df.columns[-1]])
        aim = row.drop(df.columns[-1])
        # Force float to avoid object-dtype rows after the discrete encoding.
        f = lambda x: eulidSim(np.mat(x, dtype=float), np.mat(aim, dtype=float))
        right_sim = right_df.apply(f, axis=1)
        # Drop the sample itself (distance 0) before picking neighbors.
        right_sim_two = right_sim.drop(right_sim.idxmin())
        right = {row_type: list(right_sim_two.sort_values().index[0:self.__k])}
        types = list(set(df[df.columns[-1]]) - set([row_type]))
        wrong = dict()
        for one in types:
            wrong_df = df[df[df.columns[-1]] == one].drop(columns=[df.columns[-1]])
            wrong_sim = wrong_df.apply(f, axis=1)
            wrong[one] = list(wrong_sim.sort_values().index[0:self.__k])
        print(right, wrong)
        return right, wrong

    # Weight contribution of one feature for one sampled row.
    def get_weight(self, feature, index, NearHit, NearMiss):
        data = self.__data
        row = data.iloc[index]

        # Near-hit term: diff to same-class neighbors (small for a relevant feature).
        right = 0
        for one in list(NearHit.values())[0]:
            nearhit = data.iloc[one]
            if str(row[feature]).split(".")[0].isdigit() or str(row[feature]).isdigit() or \
                    str(row[feature]).split('-')[-1].split(".")[-1].isdigit():
                max_feature = data[feature].max()
                min_feature = data[feature].min()
                right_one = pow(round(abs(row[feature] - nearhit[feature]) / (max_feature - min_feature), 2), 2)
            else:
                right_one = 0 if row[feature] == nearhit[feature] else 1
            right += right_one
        right_w = round(right / self.__k, 2)

        # Near-miss term, weighted by each class's prior probability.
        wrong_w = 0
        # Proportion of the sample's own class in the dataset.
        p_row = round(float(list(data[data.columns[-1]]).count(row[data.columns[-1]])) / len(data), 2)
        for one in NearMiss.keys():
            # Proportion of class `one` in the dataset.
            p_one = round(float(list(data[data.columns[-1]]).count(one)) / len(data), 2)
            wrong_one = 0
            for i in NearMiss[one]:
                nearmiss = data.iloc[i]
                if str(row[feature]).split(".")[0].isdigit() or str(row[feature]).isdigit() or \
                        str(row[feature]).split('-')[-1].split(".")[-1].isdigit():
                    max_feature = data[feature].max()
                    min_feature = data[feature].min()
                    wrong_one_one = pow(round(abs(row[feature] - nearmiss[feature]) / (max_feature - min_feature), 2), 2)
                else:
                    wrong_one_one = 0 if row[feature] == nearmiss[feature] else 1
                wrong_one += wrong_one_one
            wrong = round(p_one / (1 - p_row) * wrong_one / self.__k, 2)
            wrong_w += wrong

        w = wrong_w - right_w
        return w

    # Filter-style feature selection: estimate every feature's weight.
    def reliefF(self):
        sample = self.get_data()
        m, n = np.shape(self.__data)  # m rows, n columns
        score = []
        sample_index = random.sample(range(0, m), self.__sample_num)
        print('Sampled row indices: %s' % sample_index)
        num = 1
        for i in sample_index:
            one_score = dict()
            row = sample.iloc[i]
            NearHit, NearMiss = self.get_neighbors(row)
            print('Round %s, row index %s: NearHit indices %s, NearMiss indices %s' % (num, i, NearHit, NearMiss))
            for f in self.__feature[0:-1]:
                w = self.get_weight(f, i, NearHit, NearMiss)
                one_score[f] = w
                print('Weight of feature %s: %s.' % (f, w))
            score.append(one_score)
            num += 1
        f_w = pd.DataFrame(score)
        print('Per-sample feature weights:')
        print(f_w)
        print('Mean feature weights:')
        print(f_w.mean())
        return f_w.mean()

    # Keep the features whose mean weight exceeds the threshold t.
    def get_final(self):
        # Note: pd.DataFrame(series, columns=['weight']) would reindex to a
        # missing column and yield NaNs, so use to_frame() instead.
        f_w = self.reliefF().to_frame(name='weight')
        final_feature_t = f_w[f_w['weight'] > self.__t]
        print(final_feature_t)
        # final_feature_k = f_w.sort_values('weight').head(self.__k)
        # print(final_feature_k)
        return final_feature_t


# A few distance/similarity measures.
def eulidSim(vecA, vecB):
    return la.norm(vecA - vecB)


def cosSim(vecA, vecB):
    """
    :param vecA: row vector
    :param vecB: row vector
    :return: cosine similarity rescaled to the range 0-1
    """
    num = float(vecA * vecB.T)
    denom = la.norm(vecA) * la.norm(vecB)
    return 0.5 + 0.5 * (num / denom)


def pearsSim(vecA, vecB):
    if len(vecA) < 3:
        return 1.0
    else:
        return 0.5 + 0.5 * np.corrcoef(vecA, vecB, rowvar=0)[0][1]


if __name__ == '__main__':
    data = pd.read_csv('西瓜數據集31.csv')[['色澤', '根蒂', '敲擊', '紋理', '臍部', '觸感', '密度', '含糖率', '類別']]
    print(data)
    f = Relief(data, 1, 0.2, 2)
    f.reliefF()
    # f.get_final()
```
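If the watermelon CSV is not at hand, a minimal smoke test on a made-up toy DataFrame (hypothetical column names and values, not the watermelon data) might look like this:

```python
import pandas as pd

# Hypothetical toy data: two discrete features, one numeric feature,
# and a binary class label in the last column.
toy = pd.DataFrame({
    'color':   ['green', 'black', 'black', 'green', 'white', 'white'],
    'texture': ['clear', 'clear', 'blurry', 'blurry', 'clear', 'blurry'],
    'density': [0.697, 0.774, 0.634, 0.608, 0.556, 0.403],
    'label':   ['good', 'good', 'good', 'bad', 'bad', 'bad'],
})

# sample_rate=1 scores every row; t=0.2 is the selection threshold;
# k=2 nearest neighbors (each class needs at least k+1 rows).
r = Relief(toy, sample_rate=1, t=0.2, k=2)
weights = r.reliefF()          # Series of mean weights, indexed by feature
print(weights.sort_values(ascending=False))
```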
(2) Data file: the watermelon dataset

The data are the watermelon dataset from Zhou Zhihua's book *Machine Learning*; it is widely circulated and easy to find online. (The original post attached a screenshot of the dataset here.)