DNN: Logistic Regression and Softmax Regression
Chapter 4: Softmax Regression
UFLDL Tutorial translation series: http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
Background: see Artificial Intelligence: A Modern Approach, Chapter 21 (Reinforcement Learning), p. 703.
The softmax function is the generalization of the logistic function to multiple classes.
Why use softmax: the backpropagation and parameter-update rules are simple, direct, and intuitive.
1. Do the exercise first

Exercise: Softmax Regression
In this problem set, you will use softmax regression to classify MNIST images. The goal of this exercise is to build a softmax classifier that you will be able to reuse in later exercises and on other classification problems you might encounter.

In the file softmax_exercise.zip, we have provided some starter code. You should write your code in the places indicated by "YOUR CODE HERE" in the files. In the starter code, you will need to modify softmaxCost.m and softmaxPredict.m for this exercise. We have also provided softmaxExercise.m, which walks you through the steps of the exercise.

Dependencies. The following additional files are required for this exercise:
You will also need:
If you have not completed the exercises listed above, we strongly suggest you complete them first.

Step 0: Initialize constants and parameters. We've provided the code for this step in softmaxExercise.m. Two constants, inputSize and numClasses, corresponding to the size of each input vector and the number of class labels, have been defined in the starter code. This will allow you to reuse your code on a different data set in a later exercise. We also initialize lambda, the weight decay parameter, here.

Step 1: Load data. The starter code loads the MNIST images and labels into inputData and labels respectively. The images are pre-processed to scale the pixel values to the range [0,1], and the label 0 is remapped to 10 for convenience of implementation, so that the labels take values in {1, ..., 10}. You will not need to change any code in this step for this exercise, but note that your code should be general enough to operate on data of arbitrary size belonging to any number of classes.

Step 2: Implement softmaxCost (the cost function). In softmaxCost.m, implement code to compute the softmax cost function J(θ). Remember to include the weight decay term in the cost as well. Your code should also compute the appropriate gradients, as well as the predictions for the input data (which will be used in the cross-validation step later). It is important to vectorize your code so that it runs quickly. We also provide several implementation tips below.

Note: In the provided starter code, theta is a matrix where the j-th row is θ_j^T (the parameters for class j).

Implementation Tip: Computing the ground truth matrix. In your code, you may need to compute the ground truth matrix M, such that M(r, c) is 1 if y(c) = r and 0 otherwise. This can be done quickly, without a loop, using the MATLAB functions sparse and full. Specifically, the command M = sparse(r, c, v) creates a sparse matrix such that M(r(i), c(i)) = v(i) for all i. That is, the vectors r and c give the positions of the elements whose values we wish to set, and v gives the corresponding values. Running full on a sparse matrix gives a "full" representation of the matrix (meaning that MATLAB will no longer try to represent it as a sparse matrix in memory). The code for using sparse and full to compute the ground truth matrix has already been included in softmaxCost.m.
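For reference, here is a minimal sketch of the sparse/full trick in Octave/MATLAB, assuming labels is an m-by-1 vector of class indices in 1..numClasses; the explicit size arguments are an extra safeguard added here (in case some class never appears), not necessarily what the starter code does:

numCases    = numel(labels);
groundTruth = full(sparse(labels, 1:numCases, 1, numClasses, numCases));
% groundTruth(r, c) is 1 exactly when labels(c) == r, i.e. each column is the
% one-hot encoding of that example's label.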
Implementation Tip: Preventing overflow. When the products θ_j^T x^(i) are large, the exponential function will become very large and possibly overflow. When this happens, you will not be able to compute your hypothesis. However, there is an easy solution: observe that we can multiply the top and bottom of the hypothesis by some constant without changing the output. Hence, to prevent overflow, simply subtract some large constant value from each of the θ_j^T x^(i) terms before computing the exponential. In practice, for each example, you can use the maximum of the θ_j^T x^(i) terms as the constant. Assuming you have a matrix M containing these terms, such that M(r, c) is θ_r^T x^(c), you can use the following code to accomplish this:

% M is the matrix as described in the text
M = bsxfun(@minus, M, max(M, [], 1));

max(M) yields a row vector with each element giving the maximum value in that column. bsxfun (short for binary singleton expansion function) applies minus along each row of M, hence subtracting the maximum of each column from every element in the column.

Implementation Tip: Computing the predictions. You may also find bsxfun useful in computing your predictions: if you have a matrix M containing the e^{θ_j^T x^(i)} terms, such that M(r, c) contains the term e^{θ_r^T x^(c)}, you can use the following code to compute the hypothesis (by dividing all elements in each column by their column sum):

% M is the matrix as described in the text
M = bsxfun(@rdivide, M, sum(M))

The operation of bsxfun in this case is analogous to the earlier example.

Step 3: Gradient checking. Once you have written the softmax cost function, you should check your gradients numerically. In general, whenever implementing any learning algorithm, you should always check your gradients numerically before proceeding to train the model. The norm of the difference between the numerical gradient and your analytical gradient should be small, on the order of 10^-9.

Implementation Tip: Faster gradient checking. When debugging, you can speed up gradient checking by reducing the number of parameters your model uses. In this case, we have included code for reducing the size of the input data, using the first 8 pixels of the images instead of the full 28x28 images. This code can be used by setting the variable DEBUG to true, as described in step 1 of the code.

Step 4: Learning parameters (training the model). Now that you've verified that your gradients are correct, you can train your softmax model using the function softmaxTrain in softmaxTrain.m. softmaxTrain uses the L-BFGS algorithm, via the function minFunc. Training the model on the entire MNIST training set of 60000 28x28 images should be rather quick, and take less than 5 minutes for 100 iterations. Factoring softmaxTrain out as a function means that you will be able to easily reuse it to train softmax models on other data sets in the future by invoking the function with different parameters. Use the following parameter when training your softmax classifier: lambda = 1e-4.

Step 5: Testing (measuring accuracy). Now that you've trained your model, you will test it against the MNIST test set, comprising 10000 28x28 images. However, to do so, you will first need to complete the function softmaxPredict in softmaxPredict.m, a function which generates predictions for input data under a trained softmax model.
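Putting the two bsxfun tips together — and this is essentially what softmaxPredict also needs to do — an overflow-safe hypothesis and the resulting predictions might look like the following sketch. It assumes theta is numClasses-by-inputSize and data is inputSize-by-m; these shapes follow the exercise's convention but are stated here as assumptions, not as the official starter-code interface.

M = theta * data;                              % M(r, c) = theta_r' * x^(c)
M = bsxfun(@minus, M, max(M, [], 1));          % subtract each column's max to avoid overflow
expM  = exp(M);
probs = bsxfun(@rdivide, expM, sum(expM, 1));  % each column now sums to 1
[~, pred] = max(probs, [], 1);                 % most probable class for each example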
Once that is done, you will be able to compute the accuracy (the proportion of correctly classified images) of your model using the code provided. Our implementation achieved an accuracy of 92.6%. If your model's accuracy is significantly lower (less than 91%), check your code, ensure that you are using the trained weights, and that you are training your model on the full 60000 training images. Conversely, if your accuracy is too high (99-100%), ensure that you have not accidentally trained your model on the test set as well.
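The accuracy check itself amounts to comparing predicted and true labels; a one-line sketch, assuming pred comes from softmaxPredict and labels is the test-label vector (both as vectors of class indices):

acc = mean(pred(:) == labels(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);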
2. Model walkthrough: Softmax Regression
Introduction
In these notes, we describe the softmax regression model. This model generalizes logistic regression to classification problems where the class label y can take on more than two possible values. This will be useful for problems such as MNIST digit classification, where the goal is to distinguish between 10 different numerical digits. Softmax regression is a supervised learning algorithm, but we will later be using it in conjunction with our deep learning/unsupervised feature learning methods.
(Softmax regression extends the logistic regression model into a multi-class classifier.)
Recall that in logistic regression, we had a training set {(x^(1), y^(1)), …, (x^(m), y^(m))} of m labeled examples, where the input features are x^(i) ∈ R^(n+1). (In this set of notes, we will use the notational convention of letting the feature vectors x be n+1 dimensional, with x_0 = 1 corresponding to the intercept term.) With logistic regression, we were in the binary classification setting, so the labels were y^(i) ∈ {0, 1}. Our hypothesis took the form:

\[
h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)},
\]

and the model parameters θ were trained to minimize the cost function

\[
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\bigl(1 - h_\theta(x^{(i)})\bigr) \right].
\]

(For binary classification, then, we use the sigmoid hypothesis above and train the model parameters θ to minimize this cost function.)
In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label y can take on k different values, rather than only two. Thus, in our training set {(x^(1), y^(1)), …, (x^(m), y^(m))}, we now have that y^(i) ∈ {1, 2, …, k}. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have k = 10 different classes.

(For multi-class classification, the labels therefore take values in {1, …, k}.)
Given a test input x, we want our hypothesis to estimate the probability p(y = j | x) for each value of j = 1, …, k. That is, we want to estimate the probability of the class label taking on each of the k different possible values. Thus, our hypothesis will output a k-dimensional vector (whose elements sum to 1) giving us our k estimated probabilities. Concretely, our hypothesis h_θ(x) takes the form:

(Our hypothesis is thus a k-dimensional vector, with elements summing to 1, giving the k estimated class probabilities. It has the form:)

\[
h_\theta(x^{(i)}) =
\begin{bmatrix}
p(y^{(i)} = 1 \mid x^{(i)}; \theta) \\
p(y^{(i)} = 2 \mid x^{(i)}; \theta) \\
\vdots \\
p(y^{(i)} = k \mid x^{(i)}; \theta)
\end{bmatrix}
=
\frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}}
\begin{bmatrix}
e^{\theta_1^T x^{(i)}} \\
e^{\theta_2^T x^{(i)}} \\
\vdots \\
e^{\theta_k^T x^{(i)}}
\end{bmatrix}
\]

Here θ_1, θ_2, …, θ_k ∈ R^(n+1) are the parameters of our model. Notice that the term 1 / Σ_{j=1}^k e^{θ_j^T x^(i)} normalizes the distribution, so that it sums to one.
For convenience, we will also write θ to denote all the parameters of our model. When you implement softmax regression, it is usually convenient to represent θ as a k-by-(n+1) matrix obtained by stacking up θ_1^T, θ_2^T, …, θ_k^T in rows, so that

\[
\theta =
\begin{bmatrix}
\theta_1^T \\
\theta_2^T \\
\vdots \\
\theta_k^T
\end{bmatrix}
\]

(θ collects all of our model parameters; stacking the vectors θ_j^T as the rows of a single matrix is the most convenient representation.)
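As a small illustration of this convention (a sketch, not the official starter code): if an optimizer passes the parameters around as one long vector, they can be unrolled into the k-by-(n+1) matrix described above. numClasses and inputSize here stand in for k and n+1.

theta = reshape(thetaVec, numClasses, inputSize);   % row j holds theta_j transposed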
Cost Function
We now describe the cost function that we'll use for softmax regression. In the equation below, 1{·} is the indicator function, so that 1{a true statement} = 1, and 1{a false statement} = 0. For example, 1{2 + 2 = 4} evaluates to 1, whereas 1{1 + 1 = 5} evaluates to 0. Our cost function will be:

\[
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y^{(i)} = j\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \right]
\]
Notice that this generalizes the logistic regression cost function, which could also have been written:

\[
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\{y^{(i)} = j\} \log p(y^{(i)} = j \mid x^{(i)}; \theta) \right]
\]
The softmax cost function is similar, except that we now sum over the k different possible values of the class label. Note also that in softmax regression, we have

\[
p(y^{(i)} = j \mid x^{(i)}; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}.
\]
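To connect this back to the exercise, here is a hedged vectorized sketch of computing J(θ) (without the weight decay term yet). The names theta, data, and groundTruth follow the earlier sketches and are assumptions rather than the official starter-code interface.

M = theta * data;                                % class scores, one column per example
M = bsxfun(@minus, M, max(M, [], 1));            % numerical stability
probs = bsxfun(@rdivide, exp(M), sum(exp(M), 1));
m = size(data, 2);
cost = -(1/m) * sum(sum(groundTruth .* log(probs)));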
There is no known closed-form way to solve for the minimum of J(θ), and thus as usual we'll resort to an iterative optimization algorithm such as gradient descent or L-BFGS. Taking derivatives, one can show that the gradient is:

\[
\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = j\} - p(y^{(i)} = j \mid x^{(i)}; \theta) \right) \right]
\]
Recall the meaning of the "∇_{θ_j}" notation. In particular, ∇_{θ_j} J(θ) is itself a vector, so that its l-th element is ∂J(θ)/∂θ_{jl}, the partial derivative of J(θ) with respect to the l-th element of θ_j. (∇_{θ_j} J(θ) is itself a vector; we take the partial derivative with respect to each of its elements.)
Armed with this formula for the derivative, one can then plug it into an algorithm such as gradient descent, and have it minimize J(θ). For example, with the standard implementation of gradient descent, on each iteration we would perform the update

\[
\theta_j := \theta_j - \alpha \nabla_{\theta_j} J(\theta) \quad (\text{for each } j = 1, \ldots, k).
\]
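Continuing the earlier sketch, the gradient and one plain gradient-descent step could look like the following; alpha is a hypothetical step size, and probs/groundTruth are assumed to be computed as in the previous snippet.

thetaGrad = -(1/m) * (groundTruth - probs) * data';   % numClasses-by-inputSize
alpha = 0.1;                                          % hypothetical learning rate
theta = theta - alpha * thetaGrad;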
When implementing softmax regression, we will typically use a modified version of the cost function described above; specifically, one that incorporates weight decay. We describe the motivation and details below.
Properties of softmax regression parameterization
Softmax regression has an unusual property: it has a "redundant" set of parameters. To explain what this means, suppose we take each of our parameter vectors θ_j and subtract some fixed vector ψ from it, so that every θ_j is now replaced with θ_j − ψ (for every j = 1, …, k). Our hypothesis now estimates the class label probabilities as

\[
p(y^{(i)} = j \mid x^{(i)}; \theta)
= \frac{e^{(\theta_j - \psi)^T x^{(i)}}}{\sum_{l=1}^{k} e^{(\theta_l - \psi)^T x^{(i)}}}
= \frac{e^{\theta_j^T x^{(i)}} \, e^{-\psi^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}} \, e^{-\psi^T x^{(i)}}}
= \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}}.
\]
(Softmax regression is "over-parameterized": it has more parameters than are needed to specify any hypothesis we might fit to the data.)
In other words, subtracting ψ from every θ_j does not affect our hypothesis' predictions at all! This shows that softmax regression's parameters are "redundant." More formally, we say that our softmax model is overparameterized, meaning that for any hypothesis we might fit to the data, there are multiple parameter settings that give rise to exactly the same hypothesis function h_θ mapping from inputs x to the predictions.
Further, if the cost function J(θ) is minimized by some setting of the parameters (θ_1, θ_2, …, θ_k), then it is also minimized by (θ_1 − ψ, θ_2 − ψ, …, θ_k − ψ) for any value of ψ. Thus, the minimizer of J(θ) is not unique. (Interestingly, J(θ) is still convex, and thus gradient descent will not run into local optima problems. But the Hessian is singular/non-invertible, which causes a straightforward implementation of Newton's method to run into numerical problems.)
Notice also that by setting ψ = θ_1, one can always replace θ_1 with θ_1 − ψ = 0 (the vector of all 0's), without affecting the hypothesis. Thus, one could "eliminate" the vector of parameters θ_1 (or any other θ_j, for any single value of j), without harming the representational power of our hypothesis. Indeed, rather than optimizing over the k(n+1) parameters (θ_1, θ_2, …, θ_k) (where θ_j ∈ R^(n+1)), one could instead set θ_1 = 0 and optimize only with respect to the (k−1)(n+1) remaining parameters, and this would work fine.
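A quick numerical sanity check of this invariance (a toy Octave/MATLAB sketch; all sizes and values here are made up purely for illustration):

% Subtracting the same vector psi from every theta_j leaves the predicted
% probabilities unchanged.
theta = randn(4, 6);                 % k = 4 classes, n + 1 = 6 features
x     = randn(6, 1);
psi   = randn(1, 6);

softmaxProbs = @(T, x) exp(T * x) ./ sum(exp(T * x));
p1 = softmaxProbs(theta, x);
p2 = softmaxProbs(bsxfun(@minus, theta, psi), x);
disp(max(abs(p1 - p2)));             % on the order of machine epsilon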
In practice, however, it is often cleaner and simpler to implement the version which keeps all the parameters, without arbitrarily setting one of them to zero. But we will make one change to the cost function: adding weight decay. This will take care of the numerical problems associated with softmax regression's overparameterized representation.
Weight Decay
We will modify the cost function by adding a weight decay term (λ/2) Σ_{i=1}^k Σ_{j=0}^n θ_{ij}^2, which penalizes large values of the parameters. Our cost function is now

\[
J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y^{(i)} = j\} \log \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(i)}}} \right]
+ \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{ij}^2
\]
With this weight decay term (for any λ > 0), the cost function J(θ) is now strictly convex, and is guaranteed to have a unique solution. The Hessian is now invertible, and because J(θ) is convex, algorithms such as gradient descent, L-BFGS, etc. are guaranteed to converge to the global minimum.
To apply an optimization algorithm, we also need the derivative of this new definition of J(θ). One can show that the derivative is:

\[
\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = j\} - p(y^{(i)} = j \mid x^{(i)}; \theta) \right) \right] + \lambda \theta_j
\]
By minimizing J(θ) with respect to θ, we will have a working implementation of softmax regression.
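Carrying the earlier cost/gradient sketch forward, the weight decay term changes only two lines. Here lambda is the weight decay parameter (the exercise uses 1e-4), and cost/thetaGrad/theta are the variables from those sketches — an assumed naming, not the official code.

cost      = cost + (lambda / 2) * sum(theta(:).^2);   % add the penalty on the parameters
thetaGrad = thetaGrad + lambda * theta;               % and its contribution to the gradient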
Relationship to Logistic Regression
In the special case where k = 2, one can show that softmax regression reduces to logistic regression. This shows that softmax regression is a generalization of logistic regression. Concretely, when k = 2, the softmax regression hypothesis outputs

\[
h_\theta(x) = \frac{1}{e^{\theta_1^T x} + e^{\theta_2^T x}}
\begin{bmatrix} e^{\theta_1^T x} \\ e^{\theta_2^T x} \end{bmatrix}
\]
Taking advantage of the fact that this hypothesis is overparameterized and setting ψ = θ_1, we can subtract θ_1 from each of the two parameters, giving us

\[
h_\theta(x)
= \frac{1}{e^{\vec{0}^T x} + e^{(\theta_2 - \theta_1)^T x}}
\begin{bmatrix} e^{\vec{0}^T x} \\ e^{(\theta_2 - \theta_1)^T x} \end{bmatrix}
= \begin{bmatrix} \dfrac{1}{1 + e^{(\theta_2 - \theta_1)^T x}} \\[1ex] \dfrac{e^{(\theta_2 - \theta_1)^T x}}{1 + e^{(\theta_2 - \theta_1)^T x}} \end{bmatrix}
= \begin{bmatrix} \dfrac{1}{1 + e^{(\theta_2 - \theta_1)^T x}} \\[1ex] 1 - \dfrac{1}{1 + e^{(\theta_2 - \theta_1)^T x}} \end{bmatrix}
\]
Thus, replacing θ_2 − θ_1 with a single parameter vector θ', we find that softmax regression predicts the probability of one of the classes as 1 / (1 + e^{θ'^T x}), and that of the other class as 1 − 1 / (1 + e^{θ'^T x}), the same as logistic regression.
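A small numerical illustration of this reduction (a sketch with arbitrary made-up values): the k = 2 softmax probability of the second class matches the logistic sigmoid applied to (θ_2 − θ_1)^T x.

theta1 = randn(1, 6);  theta2 = randn(1, 6);  x = randn(6, 1);
pSoftmax  = exp(theta2 * x) / (exp(theta1 * x) + exp(theta2 * x));   % P(y = 2 | x) under softmax
pLogistic = 1 / (1 + exp(-(theta2 - theta1) * x));                   % sigmoid with theta' = theta2 - theta1
disp(abs(pSoftmax - pLogistic));   % essentially zero, up to floating-point error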
Softmax Regression vs. k Binary Classifiers
Suppose you are working on a music classification application, and there are k types of music that you are trying to recognize. Should you use a softmax classifier, or should you build k separate binary classifiers using logistic regression?
This will depend on whether the four classes are mutually exclusive. For example, if your four classes are classical, country, rock, and jazz, then assuming each of your training examples is labeled with exactly one of these four class labels, you should build a softmax classifier with k = 4. (If there are also some examples that are none of the above four classes, then you can set k = 5 in softmax regression, and also have a fifth, "none of the above," class.)
If however your categories are has_vocals, dance, soundtrack, and pop, then the classes are not mutually exclusive; for example, there can be a piece of pop music that comes from a soundtrack and in addition has vocals. In this case, it would be more appropriate to build 4 binary logistic regression classifiers. This way, for each new musical piece, your algorithm can separately decide whether it falls into each of the four categories.
Now, consider a computer vision example, where you're trying to classify images into three different classes. (i) Suppose that your classes are indoor_scene, outdoor_urban_scene, and outdoor_wilderness_scene. Would you use softmax regression or three logistic regression classifiers? (ii) Now suppose your classes are indoor_scene, black_and_white_image, and image_has_people. Would you use softmax regression or multiple logistic regression classifiers?
In the first case, the classes are mutually exclusive, so a softmax regression classifier would be appropriate. In the second case, it would be more appropriate to build three separate logistic regression classifiers.
For a multi-class problem, should you build one softmax classifier or several binary classifiers?
It depends on whether the classes are mutually exclusive: a softmax classifier treats the categories as a partition of the label space, not as overlapping sets.
...................
Consider a computer vision problem where you classify images. If the categories are indoor scene, outdoor urban scene, and outdoor wilderness scene, would you use softmax regression or three binary classifiers? If instead the categories are indoor scene, black-and-white image, and image containing people, which would you use?
Answer: use softmax regression for the first case and multiple binary classifiers for the second, because the first set of categories is a partition while the second set overlaps.
3. Another translation
Original post: http://blog.csdn.net/celerychen2009/article/details/9014797
A note on an open-source softmax regression implementation
An open-source deep learning codebase contains a softmax snippet; it can be downloaded here:
https://github.com/yusugomori/DeepLearning
This project implements common deep-network algorithms, with versions in several languages including C, C++, Java, and Python.
The softmax regression code includes the following function:
void LogisticRegression_softmax(LogisticRegression *this, double *x) {
  int i;
  double max = 0.0;
  double sum = 0.0;
  for (i = 0; i < this->n_out; i++) if (max < x[i]) max = x[i];
  for (i = 0; i < this->n_out; i++) {
    x[i] = exp(x[i] - max);   /* shift by the maximum before exponentiating */
    sum += x[i];
  }
  for (i = 0; i < this->n_out; i++) x[i] /= sum;
}

Tip: at first glance this code looks different from the softmax update formulas in the literature. In fact, if you drop the step that finds the maximum and just exponentiate directly, it computes the same result. The benefit of subtracting the maximum first is numerical: it keeps the values from having extreme magnitudes, so the computation does not get stuck at zero and fail to proceed. It is the same idea as working with the log-likelihood to keep probabilities from becoming vanishingly small.
Translating the rest again from scratch would not add much!