
Andrew Ng's "Machine Learning", Study Notes 7: Logistic Regression (Binary Classification) Code

Published: 2024/7/23


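I. Logistic Regression Without Regularization
- 1. Problem description
- 2. Importing modules
- 3. Preparing the data
- 4. Hypothesis function
- 5. Cost function
- 6. Gradient descent
- 7. Fitting the parameters
- 8. Predicting and validating on the training set
- 9. Finding the decision boundary
II. Regularized Logistic Regression
- 1. Preparing the data
- 2. Feature mapping
- 3. Regularized cost function
- 4. Regularized gradient
- 5. Fitting the parameters
- 6. Prediction
- 7. Plotting the decision boundary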

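Course link: https://www.bilibili.com/video/BV164411b7dx?from=search&seid=5329376196520099118

These notes follow directly on the previous two sets, which covered the logistic regression model and regularization. Here a classification problem is solved with logistic regression and regularization. In my view, machine learning needs both theory and code: even once the theory is clear, the code is another hurdle, so it pays to practice a lot.

Dataset used in these notes: https://pan.baidu.com/s/1h5Ygse5q2wkTeXA9Pwq2RA
Extraction code: 5rd4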

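I. Logistic Regression Without Regularization

1. Problem description

Build a logistic regression model to predict whether a student is admitted to a university. Each applicant's chance of admission is determined from the results of two exams. Historical data from previous applicants is available and serves as the training set for logistic regression.

Goals of the Python implementation: first, build a classifier by solving for the three parameters θ0, θ1, θ2, which gives the dividing line. Note: θ1 corresponds to the 'Exam 1' score and θ2 to the 'Exam 2' score. Second, set a threshold and decide the admission outcome from it. Note: the threshold is applied to the predicted probability to turn that probability into a class. Typically a probability of 0.5 or more means admitted and a probability below 0.5 means not admitted.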
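The thresholding step described above can be sketched in a few lines. The probabilities here are made up for illustration, and counting a probability of exactly 0.5 as admitted is an assumption that matches the predict function used later in these notes.

```python
import numpy as np

# Hypothetical predicted admission probabilities for five applicants
probs = np.array([0.92, 0.50, 0.31, 0.77, 0.08])

threshold = 0.5
# 1 = admitted, 0 = not admitted; a probability equal to the threshold counts as admitted
labels = (probs >= threshold).astype(int)
print(labels.tolist())  # [1, 1, 0, 1, 0]
```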

2. Importing modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report  # text report of classification metrics

plt.style.use('fivethirtyeight')  # nicer plot style

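1. Seaborn is a Python visualization package built on matplotlib. It provides a high-level interface that makes it easy to produce attractive statistical charts.

Seaborn wraps matplotlib in a higher-level API, which makes plotting easier. In most cases seaborn alone produces attractive figures, while matplotlib allows more customized ones. Seaborn should be seen as a complement to matplotlib, not a replacement. It also works well with numpy and pandas data structures and with the statistical routines in scipy and statsmodels.

2. plt.style.use() sets the overall style of the figures. plt.style.available lists the styles that can be used. See the documentation of plt.style.use() for details.

3. classification_report in sklearn produces a text report of the main classification metrics, showing precision, recall, F1 score and related information for each class. See the documentation of classification_report for details.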
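To make the three metrics in that report concrete, here is a minimal NumPy sketch that computes them by hand for one class. The labels, the predictions and the helper name prf are all hypothetical, and the sketch assumes every class is both present and predicted at least once, so that no division by zero occurs.

```python
import numpy as np

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_hat = np.array([1, 0, 1, 0, 1, 0, 1, 0])

def prf(y_true, y_hat, positive):
    """Precision, recall and F1, treating `positive` as the positive class."""
    tp = np.sum((y_hat == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_hat == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_hat != positive) & (y_true == positive))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for cls in (0, 1):
    p, r, f = prf(y_true, y_hat, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With these toy arrays each class has 3 true positives, 1 false positive and 1 false negative, so all three metrics come out as 0.75 for both classes.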

3. Preparing the data

data = pd.read_csv('work/ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
data.head()  # first five rows

data.describe()

After reading the data, visualize it to see how it is distributed:

# Style settings: darkgrid theme (grey background, white grid) and a 2-color palette
sns.set(context="notebook", style="darkgrid", palette=sns.color_palette("RdBu", 2))

# hue overlays the classes named by that column in one plot;
# fit_reg controls whether a fitted regression line is drawn
sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
           height=6,
           fit_reg=False,
           scatter_kws={"s": 50})
plt.show()  # see what the data looks like

The three functions below extract the features X, extract the labels y, and standardize the features.

def get_X(df):
    """Read the features.

    Uses concat to add the intercept column without side effects on df;
    not efficient for big datasets though.
    """
    ones = pd.DataFrame({'ones': np.ones(len(df))})  # m x 1 DataFrame of ones
    data = pd.concat([ones, df], axis=1)  # axis=1: align on rows and join the columns side by side
    return data.iloc[:, :-1].to_numpy()  # returns an ndarray, not a np.matrix


def get_y(df):
    """Read the labels, assuming the last column is the target."""
    return np.array(df.iloc[:, -1])  # df.iloc[:, -1] is the last column of df


def normalize_feature(df):
    """Apply z-score scaling to every column of the DataFrame."""
    # Feature scaling applies to logistic regression as well
    return df.apply(lambda column: (column - column.mean()) / column.std())

Extract the features and labels:

X = get_X(data)
print(X.shape)

y = get_y(data)
print(y.shape)
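The standardization performed by normalize_feature can be illustrated without pandas. This is a minimal sketch on made-up scores; the array name scores is hypothetical. It uses ddof=1 because pandas computes the sample standard deviation by default, so this matches what column.std() does.

```python
import numpy as np

# Hypothetical exam scores: rows are students, columns are exam1 and exam2
scores = np.array([[34.6, 78.0],
                   [60.2, 86.3],
                   [79.0, 75.3],
                   [45.1, 56.3]])

# Subtract each column's mean and divide by its sample standard deviation
scaled = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)

print(scaled.mean(axis=0))         # approximately 0 for each column
print(scaled.std(axis=0, ddof=1))  # approximately 1 for each column
```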

4. Hypothesis function

The hypothesis function of the logistic regression model:
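For reference, the standard form of this hypothesis is written out here. The symbols h_θ and g follow the usual course notation and are an assumption, since the formula itself is not spelled out in the text.

```latex
h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}
```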

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
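One practical caveat: for large negative z, np.exp(-z) overflows and NumPy emits a RuntimeWarning, although the result still rounds to 0. A possible variant that avoids the overflow is sketched below. The helper name stable_sigmoid is hypothetical and not part of the original notes.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that only ever exponentiates non-positive numbers (hypothetical helper)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))  # always in (0, 1], so it cannot overflow
    # For z >= 0 use 1 / (1 + e^-z); for z < 0 use the equivalent e^z / (1 + e^z)
    return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

# Extreme inputs give 0, 0.5 and 1 without an overflow warning
print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```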

Plot the sigmoid function:

fig, ax = plt.subplots(figsize=(8, 6))
z = np.arange(-10, 10, step=0.01)
ax.plot(z, sigmoid(z))
ax.set_ylim((-0.1, 1.1))  # displayed range of the y axis
ax.set_xlabel('z', fontsize=18)
ax.set_ylabel('g(z)', fontsize=18)
ax.set_title('sigmoid function', fontsize=18)
plt.show()

5. Cost function

Initialize the parameters:

theta = np.zeros(3)  # X is m x n, so theta has n entries
theta

Define the cost function:

def cost(theta, X, y):
    """Cost function: -l(theta), the quantity to minimize."""
    costf = np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))
    return costf
# Hint: X @ theta is equivalent to X.dot(theta)

Compute the initial value of the cost function:

cost(theta, X, y)
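A quick sanity check on this initial value: with theta set to all zeros, every prediction is sigmoid(0) = 0.5, so the cost equals ln 2 ≈ 0.693 regardless of the data. The sketch below confirms this on a small made-up dataset. X_toy and y_toy are hypothetical, and sigmoid and cost are restated only so the block runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

# Hypothetical toy data: an intercept column plus two features
X_toy = np.array([[1.0, 0.5, 1.2],
                  [1.0, -1.5, 0.3],
                  [1.0, 2.0, -0.7],
                  [1.0, 0.1, 0.9]])
y_toy = np.array([1, 0, 1, 0])

print(cost(np.zeros(3), X_toy, y_toy))  # equals ln 2
print(np.log(2))
```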

6. Gradient descent

This is batch gradient descent.
In vectorized form:
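The formulas are not reproduced in the text, so here is the standard form they take, written to agree term by term with the cost and gradient code in these notes. The notation (m training examples, design matrix X, sigmoid g) is assumed from the course.

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}

\nabla_\theta J(\theta) = \frac{1}{m}\, X^T \left(g(X\theta) - y\right)
```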

Define the gradient accordingly:

def gradient(theta, X, y):
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)

Compute the initial value of the gradient:

gradient(theta, X, y)
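A common way to gain confidence in an analytic gradient like this one is to compare it with a finite-difference approximation of the cost. The sketch below does that on made-up data. The helper numerical_gradient and the toy arrays are hypothetical, and sigmoid, cost and gradient are restated so the block is self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)

def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference approximation of the gradient of f at theta (hypothetical helper)."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad

# Hypothetical toy data and an arbitrary parameter vector
X_toy = np.array([[1.0, 0.5, 1.2],
                  [1.0, -1.5, 0.3],
                  [1.0, 2.0, -0.7],
                  [1.0, 0.1, 0.9]])
y_toy = np.array([1, 0, 1, 0])
theta_toy = np.array([0.1, -0.2, 0.3])

analytic = gradient(theta_toy, X_toy, y_toy)
numeric = numerical_gradient(lambda t: cost(t, X_toy, y_toy), theta_toy)
print(np.max(np.abs(analytic - numeric)))  # should be very small
```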

7. Fitting the parameters

Instead of writing a custom parameter-update function, we use scipy.optimize.minimize to find the parameters automatically.

import scipy.optimize as opt

res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='Newton-CG', jac=gradient)
print(res)

In the result, fun is the value of the cost function after optimization and x holds the three optimized parameter values. With that, training is complete.
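For comparison, here is what a hand-written update loop could look like. This is plain batch gradient descent, not the Newton-CG method used above, and it runs on synthetic data instead of the exam dataset. The learning rate alpha, the iteration count and the data-generating parameters are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)

# Hypothetical synthetic data: labels are sampled from a logistic model,
# so the two classes overlap and the optimum is finite
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
X_toy = np.column_stack([np.ones(200), features])
true_theta = np.array([-0.5, 2.0, -1.0])
y_toy = (rng.random(200) < sigmoid(X_toy @ true_theta)).astype(int)

theta_gd = np.zeros(3)
alpha = 0.5  # learning rate (assumed value)
for _ in range(2000):
    # Batch update: every step uses the gradient over all training examples
    theta_gd = theta_gd - alpha * gradient(theta_gd, X_toy, y_toy)

print(cost(np.zeros(3), X_toy, y_toy), cost(theta_gd, X_toy, y_toy))
print(np.linalg.norm(gradient(theta_gd, X_toy, y_toy)))  # near zero at the optimum
```

A library optimizer is usually the better choice in practice because it picks step sizes itself, which is why the notes rely on scipy.optimize.minimize.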

8. Predicting and validating on the training set

Since no validation set is provided, the training set is used for both prediction and validation: the trained model predicts on the training set, and the predictions are compared with the true labels.

def predict(x, theta):
    prob = sigmoid(x @ theta)
    return (prob >= 0.5).astype(int)  # convert booleans to 0/1 integers

final_theta = res.x
y_pred = predict(X, final_theta)

print(classification_report(y, y_pred))

9. Finding the decision boundary

The decision boundary is the line on which θ0 + θ1·x1 + θ2·x2 = 0:

print(res.x)  # this is the final theta

coef = -(res.x / res.x[2])  # find the equation of the line
print(coef)

# Separate names so that the label vector y is not overwritten
x_line = np.arange(130, step=0.1)
y_line = coef[0] + coef[1] * x_line

Look at the data description again to determine the ranges of x and y:

data.describe()  # find the range of x and y

# Uses the notebook context by default; context scales the size of the plot elements
sns.set(context="notebook", style="ticks", font_scale=1.5)

sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
           height=6,
           fit_reg=False,
           scatter_kws={"s": 25})

plt.plot(x_line, y_line, 'grey')
plt.xlim(0, 130)
plt.ylim(0, 130)
plt.title('Decision Boundary')
plt.show()
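The coef computation in this section follows from solving θ0 + θ1·x1 + θ2·x2 = 0 for x2. A small sketch with made-up parameters shows the algebra and checks that points on the resulting line really give θ^T x = 0, which corresponds to a predicted probability of 0.5. The values in theta_fit are hypothetical and are not the fitted result from these notes.

```python
import numpy as np

# Hypothetical fitted parameters (theta0, theta1, theta2), for illustration only
theta_fit = np.array([-25.0, 0.2, 0.2])

# On the boundary: theta0 + theta1 * x1 + theta2 * x2 = 0,
# so x2 = -(theta0 / theta2) - (theta1 / theta2) * x1
coef = -(theta_fit / theta_fit[2])
intercept, slope = coef[0], coef[1]

x1 = np.array([30.0, 60.0, 90.0])
x2 = intercept + slope * x1
print(x2)  # exam2 scores on the boundary for these exam1 scores

# Each boundary point, with the intercept feature prepended, satisfies theta^T x = 0
points = np.column_stack([np.ones_like(x1), x1, x2])
print(points @ theta_fit)  # approximately zero
```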

II. Regularized Logistic Regression

1. Preparing the data

This part uses a new dataset:

df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
df.head()

sns.set(context="notebook", style="ticks", font_scale=1.5)

sns.lmplot(x='test1', y='test2', hue='accepted', data=df,
           height=6,
           fit_reg=False,
           scatter_kws={"s": 50})

plt.title('Regularized Logistic Regression')
plt.show()

Judging from this distribution, no straight line can separate the two classes well. So we apply a feature mapping: on top of the two existing features we add combinations of their higher powers, so that the decision boundary can become a curve that separates the classes well.

2. Feature mapping

Here I map the data to the following set of features, namely every term x1^(i-p) · x2^p with 0 ≤ p ≤ i ≤ 6:

There are 28 terms in total. Each of these combined features can be treated as an independent feature, x1, x2, ..., x28, and the problem is then solved with logistic regression.

def feature_mapping(x, y, power, as_ndarray=False):
    """Return the mapped features as an ndarray or a DataFrame."""
    data = {"f{}{}".format(i - p, p): np.power(x, i - p) * np.power(y, p)
            for i in np.arange(power + 1)
            for p in np.arange(i + 1)}

    if as_ndarray:
        return pd.DataFrame(data).to_numpy()
    else:
        return pd.DataFrame(data)


x1 = np.array(df.test1)
x2 = np.array(df.test2)

data = feature_mapping(x1, x2, power=6)
print(data.shape)
data.head()
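To see where the count of 28 comes from, the exponent pairs generated by the two nested loops in feature_mapping can be listed directly. This sketch uses plain Python, and the variable name exponents is introduced only for illustration.

```python
# Exponent pairs (a, b) of the terms x1**a * x2**b with total degree a + b <= power,
# generated in the same order as the loops in feature_mapping
power = 6
exponents = [(i - p, p) for i in range(power + 1) for p in range(i + 1)]

print(len(exponents))  # 28
print(exponents[:6])   # [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]

# Degree i contributes i + 1 terms, so the total is (power + 1) * (power + 2) / 2
print((power + 1) * (power + 2) // 2)  # 28
```

The first pair, (0, 0), is the constant term 1, which plays the role of the intercept column.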

Below is the dataset after feature mapping. It now has 28 feature dimensions:

data.describe()

3. Regularized cost function

Compared with the earlier expression, there is an additional regularization penalty term.
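Written out in the usual course notation (assumed here, since the formula is not reproduced in the text), the regularized cost and its gradient take the following form, which matches the regularized_cost and regularized_gradient code in these notes. Note that θ0 is not penalized.

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2

\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_0^{(i)}

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad (j \ge 1)
```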

theta = np.zeros(data.shape[1])
X = feature_mapping(x1, x2, power=6, as_ndarray=True)
print(X.shape)

y = get_y(df)
print(y.shape)

def regularized_cost(theta, X, y, l=1):
    theta_j1_to_n = theta[1:]  # theta_0 is not regularized
    regularized_term = (l / (2 * len(X))) * np.power(theta_j1_to_n, 2).sum()

    return cost(theta, X, y) + regularized_term

Compute the initial value of the cost function:

regularized_cost(theta, X, y, l=1)

Because theta is initialized to 0, the regularized cost should equal the unregularized cost.

4. Regularized gradient

def regularized_gradient(theta, X, y, l=1):
    theta_j1_to_n = theta[1:]  # theta_0 is excluded
    regularized_theta = (l / len(X)) * theta_j1_to_n

    # Prepend 0 so that theta_0 receives no penalty
    regularized_term = np.concatenate([np.array([0]), regularized_theta])

    return gradient(theta, X, y) + regularized_term

Compute the initial value of the gradient:

regularized_gradient(theta, X, y)

5. Fitting the parameters

import scipy.optimize as opt

print('init cost = {}'.format(regularized_cost(theta, X, y)))

res = opt.minimize(fun=regularized_cost, x0=theta, args=(X, y), method='Newton-CG', jac=regularized_gradient)
res

6. Prediction

final_theta = res.x
y_pred = predict(X, final_theta)

print(classification_report(y, y_pred))

7. Plotting the decision boundary

We need to find all x that satisfy X×θ = 0. Rather than solving the polynomial expression, we build a sufficiently dense grid and compute X×θ at every grid point. If the absolute value of the result is below a small threshold, such as 10^-3, the point is treated as lying on the boundary. Going through every point of the grid gives an approximate boundary.

def draw_boundary(power, l):
    """
    power: polynomial power for the mapped features
    l: lambda constant
    """
    density = 1000
    threshhold = 2 * 10**-3

    final_theta = feature_mapped_logistic_regression(power, l)
    x, y = find_decision_boundary(density, power, final_theta, threshhold)

    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    sns.lmplot(x='test1', y='test2', hue='accepted', data=df,
               height=6, fit_reg=False, scatter_kws={"s": 100})

    plt.scatter(x, y, c='r', s=10)
    plt.title('Decision boundary')
    plt.show()


def feature_mapped_logistic_regression(power, l):
    """For drawing purposes only; not a well-generalized logistic regression.

    power: int, polynomial power to which x1 and x2 are raised
    l: lambda constant for the regularization term
    """
    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    x1 = np.array(df.test1)
    x2 = np.array(df.test2)
    y = get_y(df)

    X = feature_mapping(x1, x2, power, as_ndarray=True)
    theta = np.zeros(X.shape[1])

    res = opt.minimize(fun=regularized_cost,
                       x0=theta,
                       args=(X, y, l),
                       method='TNC',
                       jac=regularized_gradient)
    final_theta = res.x

    return final_theta


def find_decision_boundary(density, power, theta, threshhold):
    """Return the grid points that lie approximately on the decision boundary."""
    t1 = np.linspace(-1, 1.5, density)  # `density` samples per axis (1000 here)
    t2 = np.linspace(-1, 1.5, density)

    cordinates = [(x, y) for x in t1 for y in t2]
    x_cord, y_cord = zip(*cordinates)
    mapped_cord = feature_mapping(x_cord, y_cord, power)  # this is a DataFrame

    inner_product = mapped_cord.to_numpy() @ theta

    decision = mapped_cord[np.abs(inner_product) < threshhold]

    return decision.f10, decision.f01
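The grid-scanning idea can be tried on a case where the true boundary is known. In the sketch below the decision function is a made-up stand-in whose zero set is the unit circle, not the fitted model, and the grid is built with np.meshgrid instead of a list comprehension. The names decision_value, density and threshold are chosen for illustration.

```python
import numpy as np

def decision_value(x, y):
    """Hypothetical decision function; it is zero exactly on the unit circle."""
    return x**2 + y**2 - 1.0

density = 400     # grid points per axis
threshold = 2e-3  # how close to zero a value must be to count as boundary

t = np.linspace(-1.5, 1.5, density)
xx, yy = np.meshgrid(t, t)        # every grid point at once, no Python loop
values = decision_value(xx, yy)

# Keep only the grid points whose decision value is nearly zero
on_boundary = np.abs(values) < threshold
bx, by = xx[on_boundary], yy[on_boundary]

print(bx.size)                                 # number of grid points kept
print(np.max(np.abs(np.hypot(bx, by) - 1.0)))  # kept points lie very close to the circle
```

The trade-off is the same as in draw_boundary: a denser grid or a larger threshold keeps more points, while a threshold that is too small for the grid spacing can leave gaps in the plotted boundary.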

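Now let's see how different regularization coefficients change the decision boundary.

draw_boundary(power=6, l=1)  # lambda = 1

draw_boundary(power=6, l=0)  # lambda = 0, no regularization

draw_boundary(power=6, l=100)  # lambda = 100, strong regularization

The three examples above show a good fit, overfitting, and underfitting, respectively.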
