Python AI - Machine Learning - Classification Algorithms - k-Nearest Neighbors - Kaggle Case Study: Facebook V: Predicting Check Ins
Problem statement
Facebook and Kaggle are launching a machine learning engineering competition for 2016.
Trail blaze your way to the top of the leaderboard to earn an opportunity at interviewing for one of the 10+ open roles as a software engineer, working on world class machine learning problems.
The goal of this competition is to predict which place a person would like to check in to.
For the purposes of this competition, Facebook created an artificial world consisting of more than 100,000 places located in a 10 km by 10 km square.
For a given set of coordinates, your task is to return a ranked list of the most likely places.
Data was fabricated to resemble location signals coming from mobile devices, giving you a flavor of what it takes to work with real data complicated by inaccurate and noisy values.
Inconsistent and erroneous location data can disrupt experience for services like Facebook Check In.
We highly encourage competitors to be active on Kaggle Scripts.
Your work there will be thoughtfully included in the decision making process.
Please note: You must compete as an individual in recruiting competitions.
You may only use the data provided to make your predictions.
Data
In this competition, you are going to predict which business a user is checking into based on their location, accuracy, and timestamp.
The train and test dataset are split based on time, and the public/private leaderboard in the test data are split randomly.
There is no concept of a person in this dataset.
All the row_id’s are events, not people.
Note: Some of the columns, such as time and accuracy, are intentionally left vague in their definitions.
Please consider them as part of the challenge.
File descriptions
train.csv, test.csv
row_id: id of the check-in event
x y: coordinates
accuracy: location accuracy
time: timestamp
place_id: id of the business, this is the target you are predicting
sample_submission.csv - a sample submission file in the correct format with random predictions
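For context on what a submission looks like: each row_id is paired with a ranked list of up to three candidate place_ids, separated by spaces (the competition scores submissions with MAP@3). Below is a minimal sketch of writing such a file with pandas; the place ids are made-up placeholders.

import pandas as pd

# Each row pairs a row_id with up to three space-separated place_id
# candidates, ordered from most to least likely (placeholder ids).
submission = pd.DataFrame({
    "row_id": [0, 1, 2],
    "place_id": [
        "1111111111 2222222222 3333333333",
        "4444444444 5555555555 6666666666",
        "7777777777 8888888888 9999999999",
    ],
})
submission.to_csv("submission.csv", index=False)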
Dataset download: see the 2019-07-17 update at the end of this post for the download link.
Analysis
Features: the x and y coordinates, the location accuracy, and the timestamp.
Target: the id of the check-in place (place_id).
Processing steps:
Read the data
data = pd.read_csv("./facebook-v-predicting-check-ins/train.csv")
Data processing
1. Narrow the dataset with DataFrame.query()
# 1. Narrow the data by filtering it with a query
data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")
2. Process the date data with pd.to_datetime and pd.DatetimeIndex
# Process the time data
time_value = pd.to_datetime(data['time'], unit='s')
print(time_value)
3. Add the split-out date features
4. Drop the now-useless timestamp column
# Convert the datetimes to a DatetimeIndex so date components can be pulled out
time_value = pd.DatetimeIndex(time_value)
# Construct some features
data['day'] = time_value.day
data['hour'] = time_value.hour
data['weekday'] = time_value.weekday
# Drop the raw timestamp feature
data = data.drop(['time'], axis=1)
print(data)
After this processing, the dataset is much smaller.
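A quick way to see the reduction is to compare the shape of the full table with the filtered subset; a small sketch assuming the same file path and query bounds as above:

import pandas as pd

# Compare row counts before and after narrowing the x/y range.
raw = pd.read_csv("./facebook-v-predicting-check-ins/train.csv")
subset = raw.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")
print(raw.shape, subset.shape)  # the subset keeps only a tiny fraction of the rows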
5. Remove places with fewer than n check-ins
# Remove target places with fewer than n check-ins
place_count = data.groupby('place_id').count()
tf = place_count[place_count.row_id > 3].reset_index()
data = data[data['place_id'].isin(tf.place_id)]
6. Standardization
# Feature engineering (standardization)
std = StandardScaler()
# Standardize the feature values of the training and test sets
# (x_train / x_test come from the train_test_split step shown in the full code below)
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)
Prediction
# Run the algorithm; n_neighbors is the hyperparameter
knn = KNeighborsClassifier(n_neighbors=5)
# fit(), predict(), score()
knn.fit(x_train, y_train)
# Get the predictions
y_predict = knn.predict(x_test)
print("Predicted check-in places:", y_predict)
# Get the accuracy
print("Prediction accuracy:", knn.score(x_test, y_test))
The accuracy is only about 40%, which is a bit low, so let's optimize it a little:
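The tuning itself isn't shown here, but a common way to squeeze out a few extra points is to search over n_neighbors with cross-validation (and to revisit which date features are worth keeping). A minimal sketch, assuming the standardized x_train/x_test and y_train/y_test from the steps above:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try a few neighbor counts with 3-fold cross-validation and keep the best model.
param_grid = {"n_neighbors": [1, 3, 5, 7, 10]}
gs = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid, cv=3)
gs.fit(x_train, y_train)
print("Best n_neighbors:", gs.best_params_)
print("Test accuracy with the best model:", gs.score(x_test, y_test))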
Alright, it just about passes now.
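Also note that knn.score only checks the single top prediction, while the competition asks for a ranked list of the most likely places per event. A sketch of building a top-3 list with predict_proba, assuming the fitted knn and the standardized x_test from above:

import numpy as np

# Class probabilities for every candidate place, shape (n_samples, n_classes).
proba = knn.predict_proba(x_test)

# Indices of the three highest-probability classes per row, best first.
top3_idx = np.argsort(proba, axis=1)[:, -3:][:, ::-1]
top3_places = knn.classes_[top3_idx]

# One space-separated string of place_ids per event, matching the submission format.
ranked = [" ".join(str(p) for p in row) for row in top3_places]
print(ranked[:5])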
Full code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd


def knncls():
    """Predict a user's check-in place with k-nearest neighbors.
    :return: None
    """
    # Read the data
    data = pd.read_csv("./facebook-v-predicting-check-ins/train.csv")
    # print(data.head(10))

    # Process the data
    # 1. Narrow the data by filtering it with a query
    data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")

    # Process the time data
    time_value = pd.to_datetime(data['time'], unit='s')
    # print(time_value)

    # Convert the datetimes to a DatetimeIndex so date components can be pulled out
    time_value = pd.DatetimeIndex(time_value)

    # Construct some features
    data['day'] = time_value.day
    data['hour'] = time_value.hour
    data['weekday'] = time_value.weekday

    # Drop the raw timestamp feature
    data = data.drop(['time'], axis=1)
    # print(data)

    # Remove target places with fewer than n check-ins
    place_count = data.groupby('place_id').count()
    tf = place_count[place_count.row_id > 3].reset_index()
    data = data[data['place_id'].isin(tf.place_id)]

    # Take the features and the target out of the data
    y = data['place_id']
    x = data.drop(['place_id', 'row_id'], axis=1)

    # Split the data into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

    # Feature engineering (standardization)
    std = StandardScaler()
    # Standardize the feature values of the training and test sets
    x_train = std.fit_transform(x_train)
    x_test = std.transform(x_test)

    # Run the algorithm; n_neighbors is the hyperparameter
    knn = KNeighborsClassifier(n_neighbors=5)
    # fit(), predict(), score()
    knn.fit(x_train, y_train)

    # Get the predictions
    y_predict = knn.predict(x_test)
    print("Predicted check-in places:", y_predict)

    # Get the accuracy
    print("Prediction accuracy:", knn.score(x_test, y_test))
    return None


if __name__ == "__main__":
    knncls()

Flow analysis
1. Process the dataset
2. Split the dataset into training and test sets
3. Standardize the dataset
4. Run the estimator workflow to make classification predictions
——————————————————————————————————————————
Update, 2019-07-17
A lot of people have been asking for the dataset, so it is posted here now; just grab it.
Link: https://pan.baidu.com/s/1ZT39BIG8LjJ3F6GYfcbfPw
Extraction code: hoxm