
Machine Learning Case Study: eBay Data Analysis with scikit-learn

Published: 2025/3/8


eBay Online Auction Data Analysis

eBay online auction data

The dataset can be downloaded from Ebay Data Set (https://cims.nyu.edu/~munoz/data/).

raw.tar.gz contains four data files: TrainingSet.csv, TestSet.csv, TrainingSubset.csv, and TestSubset.csv. The table below summarizes what each file contains.

Dataset	Description
TrainingSet	All auctions from April 2013
TestSet	All auctions from the first week of May 2013
TrainingSubset	All auctions from April 2013 that ended in a sale
TestSubset	All auctions from the first week of May 2013 that ended in a sale

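Feature names in the data and their descriptions:

Feature	Description
Prices	Final sale amount
StartingBid	Minimum (opening) bid for the auction
BidCount	Number of bids the auction received
Title	Listing title
QuantitySold	Quantity sold (0 or 1)
SellerRating	Seller's rating on eBay
StartDate	Date the auction started
EndDate	Date the auction ended
PositiveFeedbackPercent	Percentage of the seller's feedback that is positive (out of all feedback)
HasPicture	Whether the listing has a photo of the item (0 or 1)
MemberSince	Date the seller created their eBay account
HasStore	Whether the seller has an eBay store (0 or 1)
SellerCountry	Seller's country
BuyitNowPrice	Price to buy the item immediately
HighBidderFeedbackRating	eBay rating of the highest bidder
ReturnsAccepted	Whether returns are accepted (0 or 1)
HasFreeShipping	Whether shipping is free (0 or 1)
IsHOF	Whether the player is in the Hall of Fame (0 or 1)
IsAuthenticated	Whether the item has been authenticated by a certifying organization (0 or 1)
HasInscription	Whether the auction item has an inscription (0 or 1)
AvgPrice	Average price of this item in the inventory
MedianPrice	Median price of this item in the inventory
AuctionCount	Total number of auctions in the inventory
SellerSaleToAveragePriceRatio	Ratio of this auction's price to the average price
StateDayOfWeek	Day of the week on which the auction started
EndDayOfWeek	Day of the week on which the auction ended
AuctionDuration	Number of days the auction lasted
StartingBidPercent	Ratio of the item's starting bid to the average sale price
SellerClosePercent	Share of a seller's online auctions that ended in a sale
ItemAuctionSellPercent	Share of all online auctions that ended in a sale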

Data import and visualization

The experiments were run in Jupyter with Python 3.6.

First, import the required packages:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Read in the data:

test_set = pd.read_csv("Data/TestSet.csv")
train_set = pd.read_csv("Data/TrainingSet.csv")
test_subset = pd.read_csv("Data/TestSubset.csv")
train_subset = pd.read_csv("Data/TrainingSubset.csv")

Print a summary of train_set:

train_set.info()  # Output train_set data

You can also use head() to look at the first 5 rows:

train_set.head()

The first column, EbayID, is the ID of each auction record. It has no bearing on whether an auction succeeds, so it should be removed before training. QuantitySold is 1 for a successful auction and 0 for a failed one. SellerName, the name of the seller, is likewise unrelated to the outcome and should also be removed.

train_data = train_set.drop(['EbayID', 'QuantitySold', 'SellerName'], axis = 1)
train_target = train_set['QuantitySold']
# Get the number of samples and the number of features
n_trainSamples, n_features = train_data.shape

A note on why QuantitySold is dropped as well: the sample data has to be split into two parts, the pure feature data and the corresponding labels. Above, train_data is the feature data and train_target holds the labels (whether the auction ended in a sale).
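The feature/label split described above can be tried on a few made-up rows. This is only a minimal sketch: the toy values are hypothetical and only a handful of the real columns are included.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few of the columns described above
toy = pd.DataFrame({
    "EbayID": [101, 102, 103, 104],
    "SellerName": ["a", "b", "c", "d"],
    "StartingBid": [0.99, 25.0, 5.0, 120.0],
    "SellerClosePercent": [0.8, 0.2, 0.6, 0.1],
    "QuantitySold": [1, 0, 1, 0],
})

# Features: everything except the identifiers and the label itself
X = toy.drop(["EbayID", "QuantitySold", "SellerName"], axis=1)
# Label: 1 if the auction sold, 0 otherwise
y = toy["QuantitySold"]

n_samples, n_features = X.shape
print(n_samples, n_features)
print(list(X.columns))
```

Leaving the label inside the feature matrix would let the model read the answer directly, which is why it is dropped together with the identifier columns.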

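To visualize the data, take a subset of it and plot the features pairwise to see how the data is distributed in each 2-D plane.

# isSold: 1 if the auction succeeded, 0 if it failed
df = pd.DataFrame(np.column_stack((train_data, train_target)),
                  columns = list(range(n_features)) + ['isSold'])
# Pair plot of a few features for the first 50 rows
# (`height` was named `size` in seaborn versions before 0.9)
sns.pairplot(df[:50], vars = [2, 3, 4, 10, 13], hue = 'isSold', height = 1.5)

numpy offers two functions for joining arrays column-wise: hstack() and the column_stack() used here. They differ in how they treat 1-D arrays: column_stack first turns each 1-D array into a column, whereas hstack expects all inputs to have the same number of dimensions. Since train_target is 1-D and train_data is 2-D, column_stack is the convenient choice here.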
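A small sketch of how the two numpy functions behave when a 2-D feature array is joined with a 1-D label array. The arrays are made up for illustration.

```python
import numpy as np

features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])   # shape (3, 2)
labels = np.array([0, 1, 0])        # shape (3,), one-dimensional

# column_stack turns the 1-D labels into a column before joining
stacked = np.column_stack((features, labels))
print(stacked.shape)

# hstack needs matching dimensions, so the 1-D array must be reshaped first
stacked_h = np.hstack((features, labels.reshape(-1, 1)))
print(np.array_equal(stacked, stacked_h))

# Without the reshape, hstack raises an error for 2-D + 1-D inputs
try:
    np.hstack((features, labels))
    hstack_failed = False
except ValueError:
    hstack_failed = True
print(hstack_failed)
```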

The scatter plots and histograms for features 3, 9, 12, and 16 show that these dimensions do not separate the two classes well. In a pair plot, each off-diagonal panel plots one feature against another, and each diagonal panel shows the distribution of a single feature. To look at the correlations among the features, and between each feature and the class isSold, we can use a seaborn heatmap of the pairwise correlations.

train = train_set.drop(['EbayID', 'SellerName'], axis = 1)
plt.figure(figsize = (10, 10))

# Compute the correlation matrix of the data
corr = train.corr()

# The correlation matrix is symmetric, so mask the upper triangle
# and draw only the lower triangle
mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True

# Diverging color map for the heatmap
cmap = sns.diverging_palette(220, 10, as_cmap = True)

# Draw the heatmap with seaborn
sns.heatmap(corr, cmap = cmap, mask = mask, vmax = .3, square = True,
            xticklabels = 5, yticklabels = 2, linewidths = .5,
            cbar_kws = {'shrink': .5})

# Rotate the y tick labels to horizontal for easier reading
plt.yticks(rotation = 0)
plt.show()

The redder a cell, the stronger the positive correlation; the bluer a cell, the stronger the negative correlation; white means the two features have little relationship. The first column shows how each attribute relates to the class isSold. Features 3, 9, 12, and 16 have a strong positive correlation with auction success. They correspond to SellerClosePercent, HitCount, SellerSaleAvgPriceRatio, and BestOffer, meaning that the larger these values are, the more likely the auction is to succeed. Feature 6, StartingBid, has a fairly strong negative correlation with isSold: the higher the starting bid, the less likely the auction is to succeed.
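Reading the first column of a heatmap by eye can be complemented by sorting the correlations numerically. The sketch below uses synthetic data rather than the eBay files; the column names are borrowed from the article, but the values and the strength of the relationships are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
sold = rng.integers(0, 2, size=n)

# Synthetic stand-in for the auction table (values are made up)
toy = pd.DataFrame({
    "QuantitySold": sold,
    "SellerClosePercent": 0.5 * sold + rng.normal(0, 0.2, n),  # built to correlate positively
    "StartingBid": -20.0 * sold + rng.normal(50, 10, n),       # built to correlate negatively
    "Noise": rng.normal(0, 1, n),                              # unrelated to the label
})

# Correlation of every feature with the label, sorted from most negative to most positive
corr_with_target = toy.corr()["QuantitySold"].drop("QuantitySold").sort_values()
print(corr_with_target)
```

The same one-liner applied to the real `train.corr()` would list the features at both ends of the color scale without relying on tick labels.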

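The second column of the heatmap likewise shows how each feature correlates with Price.

Predicting whether an auction will succeed

Because the dataset is fairly large and the number of features is not small, the baseline does not start with models such as SVM (support vector machines): with a lot of data and many dimensions, some of these algorithms are inefficient and may still not have converged by the end of training.

scikit-learn provides a chart for choosing a machine-learning algorithm.

The chart recommends trying SGDClassifier first. SGD stands for Stochastic Gradient Descent: instead of using all training samples for each gradient step, it trains on a randomly chosen part of them. SGD is sensitive to the scale of the feature values, however, and the data preview above shows that the dataset contains features with large values, such as Category. We therefore first preprocess the data with StandardScaler from sklearn.preprocessing so that no attribute varies over too wide a range, which helps the training converge.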
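As a small illustration of what StandardScaler does, the sketch below standardizes two made-up columns on very different scales. The numbers are hypothetical; the point is that the scaler's statistics are learned from the training data and then reused on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales (a large id-like value and a ratio)
X_train = np.array([[10000.0, 0.10],
                    [20000.0, 0.50],
                    [30000.0, 0.90]])
X_test = np.array([[25000.0, 0.30]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled)
```

After scaling, both columns contribute gradients of comparable size, which is what makes SGD better behaved.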

The code below trains a model that predicts auction success with sklearn's SGDClassifier:

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Plot the scores of SGDClassifier during mini-batch learning
def plot_learning(clf, title):
    plt.figure()
    # Score on each batch before the model has trained on it
    validationScore = []
    # Score on each batch after the model has trained on it
    trainScore = []
    # Mini-batch size
    mini_batch = 1000
    for i in range(int(np.ceil(n_trainSamples / mini_batch))):
        x_batch = train_data[i * mini_batch : min((i + 1) * mini_batch, n_trainSamples)]
        y_batch = train_target[i * mini_batch : min((i + 1) * mini_batch, n_trainSamples)]
        if i > 0:
            validationScore.append(clf.score(x_batch, y_batch))
        # QuantitySold only takes the values 0 and 1
        clf.partial_fit(x_batch, y_batch, classes = np.array([0, 1]))
        if i > 0:
            trainScore.append(clf.score(x_batch, y_batch))
    plt.plot(trainScore, label = "train_score")
    plt.plot(validationScore, label = "validation_score")
    plt.xlabel("Mini_batch")
    plt.ylabel("Score")
    plt.legend(loc = 'best')
    plt.grid()
    plt.title(title)
    plt.savefig('test.jpg')

# Standardize the data
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data.drop(['EndDay'], axis = 1))

# Create the SGDClassifier
clf = SGDClassifier(penalty = 'l2', alpha = 0.001)
plot_learning(clf, 'SGDClassifier')

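The training result is shown in the figure below. Because SGDClassifier is trained incrementally on one part of the training samples at a time, cross-validation is not used here; instead, each mini-batch is scored before the model trains on it.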
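The score-then-train loop can be sketched without the plotting on synthetic data. This is an illustration under assumptions, not the article's experiment: `make_classification` stands in for the auction data, and the batch size and parameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic binary classification data standing in for the auction features
X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           n_clusters_per_class=1, class_sep=1.5, random_state=0)

clf = SGDClassifier(penalty="l2", alpha=0.001, random_state=0)
batch_size = 500
before_scores, after_scores = [], []

for start in range(0, len(X), batch_size):
    x_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    if start > 0:
        # The model has not seen this batch yet, so this acts as a validation score
        before_scores.append(clf.score(x_batch, y_batch))
    # All classes must be declared on the first call to partial_fit
    clf.partial_fit(x_batch, y_batch, classes=np.array([0, 1]))
    if start > 0:
        # Score on the same batch after training on it
        after_scores.append(clf.score(x_batch, y_batch))

print(len(before_scores), np.mean(before_scores), np.mean(after_scores))
```

The first batch is skipped for scoring because the classifier cannot predict before its first `partial_fit` call.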

SGDClassifier trains reasonably well, with accuracy reaching almost 92%. We can go further with some of the dimensionality-reduction methods packaged in scikit-learn. Here we use three of them: Random Projection, PCA, and t-SNE embedding.

from sklearn import manifold, decomposition, random_projection
from matplotlib import offsetbox
from time import time

# Two 8x8 thumbnails (the digits 0 and 1) used as markers for the two classes
images = []
images.append([
    [0., 0., 5., 13., 9., 1., 0., 0.],
    [0., 0., 13., 15., 10., 15., 5., 0.],
    [0., 3., 15., 2., 0., 11., 8., 0.],
    [0., 4., 12., 0., 0., 8., 8., 0.],
    [0., 5., 8., 0., 0., 9., 8., 0.],
    [0., 4., 11., 0., 1., 12., 7., 0.],
    [0., 2., 14., 5., 10., 12., 0., 0.],
    [0., 0., 6., 13., 10., 0., 0., 0.]
])
images.append([
    [0., 0., 0., 12., 13., 5., 0., 0.],
    [0., 0., 0., 11., 16., 9., 0., 0.],
    [0., 0., 3., 15., 16., 6., 0., 0.],
    [0., 7., 15., 16., 16., 2., 0., 0.],
    [0., 0., 1., 16., 16., 3., 0., 0.],
    [0., 0., 1., 16., 16., 6., 0., 0.],
    [0., 0., 1., 16., 16., 6., 0., 0.],
    [0., 0., 0., 11., 16., 10., 0., 0.]
])

# Use 1000 samples for the visualization
show_instances = 1000

# Define the plotting function
def plot_embedding(X, title = None):
    # Rescale the embedding to the unit square
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)
    plt.figure()
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(train_target[i]),
                 color = plt.cm.Set1(train_target[i] / 2.),
                 fontdict = {'weight': 'bold', 'size': 9})
    if hasattr(offsetbox, 'AnnotationBbox'):
        shown_images = np.array([[1., 1.]])
        for i in range(show_instances):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            auctionbox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(images[train_target[i]], cmap = plt.cm.gray_r),
                X[i])
            ax.add_artist(auctionbox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)

# Random Projection
start_time = time()
rp = random_projection.SparseRandomProjection(n_components = 2, random_state = 50)
rp.fit(train_data[:show_instances])
train_projected = rp.transform(train_data[:show_instances])
plot_embedding(train_projected, "Random Projection of the auction (time: %.3fs)" % (time() - start_time))

# PCA (computed here with TruncatedSVD on the standardized data)
start_time = time()
train_pca = decomposition.TruncatedSVD(n_components = 2).fit_transform(train_data[:show_instances])
plot_embedding(train_pca, "Principal Components Projection of the auction (time: %.3fs)" % (time() - start_time))

# t-SNE
start_time = time()
tsne = manifold.TSNE(n_components = 2, init = 'pca', random_state = 0)
train_tsne = tsne.fit_transform(train_data[:show_instances])
plot_embedding(train_tsne, "T-SNE embedding of the auction (time: %.3fs)" % (time() - start_time))

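Random projection result (figure)

PCA result (figure)

t-SNE result (figure)

In the three plots above, the labels 0 and 1 overlap heavily. This indicates that the two classes are not very separable, which is why the training result is not especially good.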
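The article's PCA step calls TruncatedSVD. On data that has already been centered (as StandardScaler does), the two decompositions find the same directions. The sketch below checks this on random data; the data and the `arpack` solver choice are assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] *= 5.0                     # give one direction much more variance
X_centered = X - X.mean(axis=0)    # center each column, as standardization would

pca = PCA(n_components=2).fit(X_centered)
# algorithm="arpack" is chosen here to get an exact (non-randomized) decomposition
svd = TruncatedSVD(n_components=2, algorithm="arpack").fit(X_centered)

# Components are only defined up to sign, so compare the absolute cosine similarity
similarity = abs(np.dot(pca.components_[0], svd.components_[0]))
print(similarity)
print(pca.explained_variance_ratio_)
```

The explained variance ratio also gives a quick sense of how much information a 2-D plot can carry: if the first two components explain little of the variance, overlap in the plot is not conclusive on its own.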

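Once the classifier is trained, check how it performs on the test set:

from sklearn.metrics import precision_score, recall_score, f1_score

# Prepare the test set with the same columns as the training data,
# and scale it with the scaler that was fitted on the training data
test_data = test_set.drop(['EbayID', 'QuantitySold', 'SellerName', 'EndDay'], axis = 1)
test_data = scaler.transform(test_data)
test_target = test_set['QuantitySold']
test_pred = clf.predict(test_data)

print("SGDClassifier training performance on testing dataset:")
print("\tPrecision：%1.3f" % precision_score(test_target, test_pred))
print("\tRecall：%1.3f" % recall_score(test_target, test_pred))
print("\tF1：%1.3f \n" % f1_score(test_target, test_pred))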

Test result:

SGDClassifier training performance on testing dataset:
	Precision：0.875
	Recall：0.730
	F1：0.796
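The three reported numbers are consistent with each other: F1 is the harmonic mean of precision and recall, which can be checked directly from the values above.

```python
# Values taken from the reported output above
precision = 0.875
recall = 0.730

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.796
```

Recall is the lower of the two, so the classifier misses more successful auctions than it falsely predicts.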

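Predicting the final sale price

Price is a numerical value, whereas whether an auction succeeds is a categorical value, so the two are handled differently: predicting the price is a regression task, while deciding whether an auction succeeds is a classification task.

Following the algorithm chart again, we use SGDRegressor here. The code is as follows: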

from sklearn.linear_model import SGDRegressor
import random
from sklearn.preprocessing import MinMaxScaler

# Prepare the data
test_subset = pd.read_csv('Data/TestSubset.csv')
train_subset = pd.read_csv('Data/TrainingSubset.csv')

# Training data
train = train_subset.drop(['EbayID', 'Price', 'SellerName', 'EndDay'], axis=1)
train_target = train_subset['Price']

scaler = MinMaxScaler()
train = scaler.fit_transform(train)
n_trainSamples, n_features = train.shape

# Plotting example from scikit-learn
def plot_learning(clf, title):
    plt.figure()
    validationScore = []
    trainScore = []
    mini_batch = 500
    # Shuffle the sample indices
    idx = list(range(n_trainSamples))
    random.shuffle(idx)
    for i in range(int(np.ceil(n_trainSamples / mini_batch))):
        batch_idx = idx[i * mini_batch: min((i + 1) * mini_batch, n_trainSamples)]
        x_batch = train[batch_idx]
        # Select the targets by position so they line up with the rows of x_batch
        y_batch = train_target.iloc[batch_idx]
        if i > 0:
            validationScore.append(clf.score(x_batch, y_batch))
        clf.partial_fit(x_batch, y_batch)
        if i > 0:
            trainScore.append(clf.score(x_batch, y_batch))
    plt.plot(trainScore, label="train score")
    plt.plot(validationScore, label="validation score")
    plt.xlabel("Mini_batch")
    plt.ylabel("Score")
    plt.legend(loc='best')
    plt.title(title)

sgd_regressor = SGDRegressor(penalty='l2', alpha=0.001)
plot_learning(sgd_regressor, "SGDRegressor")

# Prepare the test set and check the test performance
test = test_subset.drop(['EbayID', 'Price', 'SellerName', 'EndDay'], axis=1)
# Reuse the scaler fitted on the training data rather than refitting it on the test data
test = scaler.transform(test)
test_target = test_subset['Price']

print("SGD regressor prediction result on testing data: %.3f" % sgd_regressor.score(test, test_target))
plt.show()

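Result on the test set: SGD regressor prediction result on testing data: 0.936. The value returned by score for a regressor is the coefficient of determination, R². Since SGDRegressor already performs well, there is little need to try other models.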
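For reference, R² compares the model's squared error with that of always predicting the mean. The sketch below computes it by hand on made-up prices and checks it against sklearn's `r2_score`; the numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true prices and predictions
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_pred))
```

A score of 1.0 means perfect predictions, 0.0 means no better than predicting the mean, and negative values are possible for models that do worse than that.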

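Summary

This article gave a rough walk-through of data analysis with scikit-learn. In practice, training a model with a machine-learning algorithm is not the most important part of data analysis; most of the time goes into preprocessing the data. I have more than once heard machine-learning experts say that in data analysis, what matters most is not the algorithm but the data. For more of scikit-learn's machine-learning algorithms, see the official documentation, which has many examples to help you get started quickly.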
