Machine Learning: Logistic Regression
In this section we discuss logistic regression, which is a discriminative model for binary classification.
Example
Consider a binary classification problem where the two classes are equally probable, the class-0 conditional density is a standard multivariate normal distribution in two dimensions, and the class-1 conditional density is a multivariate normal distribution with mean $[1, 1]$ and covariance $I$ (the $2 \times 2$ identity matrix). Find the class boundary for the Bayes classifier.
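Solution. The Bayes classifier is $h(\mathbf{x}) = \operatorname{argmax}_{c \in \{0, 1\}} p_c f_c(\mathbf{x})$, where $p_0 = p_1 = \tfrac{1}{2}$ are the class probabilities and, writing $\mathbf{x} = (x_1, x_2)$, the class conditional densities are

$$f_0(\mathbf{x}) = \frac{1}{2\pi} e^{-(x_1^2 + x_2^2)/2}, \qquad f_1(\mathbf{x}) = \frac{1}{2\pi} e^{-((x_1 - 1)^2 + (x_2 - 1)^2)/2}.$$

By symmetry, the classifier will predict class 1 for every point above the line $x_1 + x_2 = 1$ and class 0 for every point below the line. We can obtain the same result by solving the equation $p_0 f_0(\mathbf{x}) = p_1 f_1(\mathbf{x})$. We get

$$\frac{1}{2} \cdot \frac{1}{2\pi} e^{-(x_1^2 + x_2^2)/2} = \frac{1}{2} \cdot \frac{1}{2\pi} e^{-((x_1 - 1)^2 + (x_2 - 1)^2)/2},$$

which simplifies to $x_1 + x_2 = 1$, as desired.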
Example
Find the regression function for the example above. Plot a heatmap of this function.
Solution. Let's use the multivariate normal type MvNormal from the Distributions package.
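    using Plots, Distributions, Optim

    # Plots.jl defaults: GR backend, square aspect ratio, custom color gradient
    mycgrad = cgrad([:MidnightBlue, :SeaGreen, :Gold, :Tomato])
    gr(aspect_ratio=1, fillcolor=mycgrad)

    # Class conditional distributions
    A = MvNormal([0, 0], [1.0 0; 0 1])   # class 0
    B = MvNormal([1, 1], [1.0 0; 0 1])   # class 1

    xgrid = -5:1/2^5:5
    ygrid = -5:1/2^5:5

    # Regression function: probability of class 1 given the point (x, y)
    r(x, y) = 0.5pdf(B, [x, y]) / (0.5pdf(A, [x, y]) + 0.5pdf(B, [x, y]))
    heatmap(xgrid, ygrid, r)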
We can see from the heatmap that restricting the regression function to any line of slope 1 yields a function which approaches 0 in the southwest direction and 1 in the northeast direction, increasing smoothly in between. Such a function is called a sigmoid function.
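Given the regression function $r$, we can recover the Bayes classifier by predicting class 1 whenever $r(\mathbf{x}) > \tfrac{1}{2}$ and class 0 whenever $r(\mathbf{x}) < \tfrac{1}{2}$. However, the value of the regression function also conveys the degree of confidence associated with the prediction. If $r(\mathbf{x}_1)$ is only slightly greater than $\tfrac{1}{2}$ and $r(\mathbf{x}_2)$ is close to 1, then observations at $\mathbf{x}_1$ and $\mathbf{x}_2$ are both predicted as class 1, but the latter with much more confidence.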
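The thresholding rule above can be sketched in a few lines of Julia. This is a minimal illustration that uses only Base and writes out the two Gaussian densities by hand; the names `density`, `regression`, and `classify`, and the two test points, are hypothetical choices made for this sketch.

```julia
# Density of a two-dimensional normal distribution with mean μ and identity covariance.
density(x, μ) = exp(-sum(abs2, x .- μ) / 2) / (2π)

# Regression function r(x) = P(class 1 | x) when both classes have probability 1/2.
function regression(x)
    p0 = 0.5 * density(x, [0.0, 0.0])   # class 0: standard normal
    p1 = 0.5 * density(x, [1.0, 1.0])   # class 1: mean [1, 1]
    return p1 / (p0 + p1)
end

# Bayes classifier recovered by thresholding the regression function at 1/2.
classify(x) = regression(x) > 0.5 ? 1 : 0

near = [0.6, 0.5]   # hypothetical point just above the class boundary
far  = [3.0, 3.0]   # hypothetical point far above the class boundary

for p in (near, far)
    println("x = ", p, ": class ", classify(p), ", r(x) = ", round(regression(p), digits=3))
end
```

Both points receive the same predicted class, but the value of `regression` is much closer to 1 at the second point, which is the confidence information that the classifier alone discards.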
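The graph in the example above suggests modeling $r$ parametrically as a composition of a linear map and a sigmoid function. Specifically, we posit the model $r(\mathbf{x}) = \sigma(\alpha + \boldsymbol{\beta} \cdot \mathbf{x})$, where $\sigma(u) = \frac{1}{1 + e^{-u}}$, $\alpha \in \mathbb{R}$, and $\boldsymbol{\beta} \in \mathbb{R}^d$ (with $d$ the number of features).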
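To select the parameters $\alpha$ and $\boldsymbol{\beta}$, we penalize lack of confident correctness for each training sample. We give a sample $\mathbf{x}_i$ of class 1 the penalty $\log(1/r(\mathbf{x}_i))$ (which is large when $r(\mathbf{x}_i)$ is close to 0 and close to 0 when $r(\mathbf{x}_i)$ is close to 1). Likewise, we give a sample $\mathbf{x}_i$ of class 0 the penalty $\log(1/(1 - r(\mathbf{x}_i)))$. The loss is the sum of these penalties over all training samples.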
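To get a feel for how the penalties behave, it may help to tabulate them at a few predicted probabilities. The sketch below assumes the penalty $\log(1/r)$ for a class-1 sample and $\log(1/(1-r))$ for a class-0 sample, which matches the loss used in the code in this section; the function names and the three probabilities are illustrative choices.

```julia
# Penalty for a class-1 sample whose predicted class-1 probability is r:
# near 0 when r is close to 1, large when r is close to 0.
penalty_class1(r) = log(1 / r)

# Penalty for a class-0 sample: the mirror image of the class-1 penalty.
penalty_class0(r) = log(1 / (1 - r))

for r in (0.01, 0.5, 0.99)
    println("r = ", r,
            ": class-1 penalty ", round(penalty_class1(r), digits=3),
            ", class-0 penalty ", round(penalty_class0(r), digits=3))
end
```

A confident wrong prediction is penalized far more heavily than an uncertain one, and a confident correct prediction costs almost nothing.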
Exercise
Experiment with the values of $\alpha$ and $\beta$ for the one-dimensional data in the code below and get the loss value below 2.45.
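    using Optim

    Z = [-1.2, -0.8, -0.7, 0.4, -2.4, 1.13]   # observations of class 0
    O = [2.2, 1.3, 0.8, 2.5, 2.62]            # observations of class 1

    # Logistic model in one dimension
    f(α, β, x) = 1/(1 + exp(-α - β*x))

    # Sum of the penalties over both classes
    function loss(Z, O, θ)
        α, β = θ
        sum(log(1/(1 - f(α, β, x))) for x in Z) + sum(log(1/f(α, β, x)) for x in O)
    end

    # Minimize the loss over θ = [α, β], starting from [0.0, 1.0]
    optimize(θ -> loss(Z, O, θ), [0.0, 1.0])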
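The exercise can also be explored by hand, evaluating the loss at a few parameter pairs before handing the problem to an optimizer. The sketch below repeats the data and loss definition from the code above so that it runs on its own with Base only; the three parameter pairs are arbitrary illustrative choices, not values from the original exercise.

```julia
Z = [-1.2, -0.8, -0.7, 0.4, -2.4, 1.13]   # observations of class 0
O = [2.2, 1.3, 0.8, 2.5, 2.62]            # observations of class 1

f(α, β, x) = 1 / (1 + exp(-α - β * x))

function loss(Z, O, θ)
    α, β = θ
    sum(log(1 / (1 - f(α, β, x))) for x in Z) + sum(log(1 / f(α, β, x)) for x in O)
end

# Hand-picked parameter pairs [α, β] to compare (illustrative only)
for θ in ([0.0, 0.0], [0.0, 1.0], [-1.0, 2.0])
    println("α = ", θ[1], ", β = ", θ[2], ": loss = ", round(loss(Z, O, θ), digits=3))
end
```

With $\alpha = \beta = 0$ the model predicts probability $\tfrac{1}{2}$ everywhere, so the loss equals $11 \log 2$; any useful choice of parameters should do better than that baseline.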
Example
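Sample 1000 points by choosing one of the two multivariate Gaussian distributions uniformly at random and then sampling from the selected distribution. Find the function of the form $r(\mathbf{x}) = \sigma(\alpha + \boldsymbol{\beta} \cdot \mathbf{x})$ which minimizes

$$L(r) = \sum_{i=1}^{1000} \left[ y_i \log \frac{1}{r(\mathbf{x}_i)} + (1 - y_i) \log \frac{1}{1 - r(\mathbf{x}_i)} \right],$$

where $y_i \in \{0, 1\}$ is the class of the $i$th sample $\mathbf{x}_i$.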
Solution. We begin by sampling the points as suggested.
    # Each observation is a (point, class) pair: class 0 is drawn from A, class 1 from B
    observations = [rand(Bool) ? (rand(A), 0) : (rand(B), 1) for i in 1:1000]
    cs = [c for ((x, y), c) in observations]
    scatter([(x, y) for ((x, y), c) in observations], group=cs, markersize=2)
Next, we define the loss function and minimize it:
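    σ(u) = 1/(1 + exp(-u))

    # Model: β[1] plays the role of α, and β[2:3] multiply the coordinates of x.
    # Note that this two-argument method replaces the earlier r(x, y).
    r(β, x) = σ(β'*[1; x])

    # Penalty for a single observation xᵢ with class yᵢ
    C(β, xᵢ, yᵢ) = yᵢ*log(1/r(β, xᵢ)) + (1 - yᵢ)*log(1/(1 - r(β, xᵢ)))

    # Total loss over the sample
    L(β) = sum(C(β, xᵢ, yᵢ) for (xᵢ, yᵢ) in observations)

    β̂ = optimize(L, ones(3), BFGS()).minimizer
    heatmap(xgrid, ygrid, (x, y) -> r(β̂, [x, y]))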
We can see that the resulting heatmap looks quite similar to the actual regression function.
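One way to make "quite similar" more concrete is to compare the fitted function with the exact regression function at the sampled points. The sketch below is a self-contained variant under stated assumptions: it uses plain gradient descent on the average loss instead of Optim's BFGS, so that only the standard library is needed, and the names `sample_points`, `exact_r`, and `fit_logistic`, the seed, the step size, and the iteration count are all choices made for this illustration.

```julia
using Random

Random.seed!(1)   # fixed seed so that a run is reproducible

σ(u) = 1 / (1 + exp(-u))

# Sample n labeled points: class 0 from N([0,0], I), class 1 from N([1,1], I), equally likely.
function sample_points(n)
    ys = rand(Bool, n)
    xs = [randn(2) .+ (y ? 1.0 : 0.0) for y in ys]
    return xs, ys
end

# Exact regression function from the two Gaussian densities
# (the equal class probabilities and the factor 1/(2π) cancel).
function exact_r(x)
    f0 = exp(-sum(abs2, x) / 2)
    f1 = exp(-sum(abs2, x .- 1) / 2)
    return f1 / (f0 + f1)
end

# Plain gradient descent on the average log loss, with θ = [α, β₁, β₂].
# The gradient of one sample's penalty with respect to θ is (r - y) * [1; x].
function fit_logistic(xs, ys; stepsize=0.5, iterations=2000)
    θ = zeros(3)
    for _ in 1:iterations
        gradient = zeros(3)
        for (x, y) in zip(xs, ys)
            z = [1; x]
            gradient .+= (σ(θ' * z) - y) .* z
        end
        θ .-= stepsize .* gradient ./ length(xs)
    end
    return θ
end

xs, ys = sample_points(1000)
θ̂ = fit_logistic(xs, ys)
fitted_r(x) = σ(θ̂' * [1; x])

mean_gap = sum(abs(fitted_r(x) - exact_r(x)) for x in xs) / length(xs)
println("fitted parameters [α, β₁, β₂]: ", round.(θ̂, digits=2))
println("mean |fitted - exact| over the sample: ", round(mean_gap, digits=3))
```

A small average gap indicates that the fitted model tracks the true regression function where the data actually lie, which is the numerical counterpart of the two heatmaps looking alike.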
Example
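In the example above, is it true that $r(\mathbf{x}) = \sigma(\alpha + \boldsymbol{\beta} \cdot \mathbf{x})$ for some $\alpha$ and $\boldsymbol{\beta}$?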
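Solution. Writing $\mathbf{x} = (x_1, x_2)$ and $f_0$, $f_1$ for the class conditional densities, we calculate

$$r(\mathbf{x}) = \frac{\frac{1}{2} f_1(\mathbf{x})}{\frac{1}{2} f_0(\mathbf{x}) + \frac{1}{2} f_1(\mathbf{x})} = \frac{1}{1 + f_0(\mathbf{x})/f_1(\mathbf{x})} = \frac{1}{1 + \exp(1 - x_1 - x_2)},$$

which is equal to $\sigma(\alpha + \boldsymbol{\beta} \cdot \mathbf{x})$ if $\alpha = -1$ and $\boldsymbol{\beta} = [1, 1]$. So the assumption was correct in this example.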
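The identity can also be checked numerically. The following sketch compares the regression function computed from the two Gaussian densities with the logistic form for $\alpha = -1$ and $\boldsymbol{\beta} = [1, 1]$; the function names and the test points are arbitrary choices for illustration.

```julia
σ(u) = 1 / (1 + exp(-u))

# Unnormalized class densities: the common factor 1/(2π) and the equal
# class probabilities cancel in the ratio below.
f0(x) = exp(-sum(abs2, x) / 2)        # class 0: mean [0, 0], identity covariance
f1(x) = exp(-sum(abs2, x .- 1) / 2)   # class 1: mean [1, 1], identity covariance

exact_r(x) = f1(x) / (f0(x) + f1(x))

# Candidate parameters to be checked
α, β = -1.0, [1.0, 1.0]
model_r(x) = σ(α + β' * x)

test_points = [[0.0, 0.0], [0.5, 0.5], [2.0, -1.0], [-1.5, 3.0]]
largest_gap = maximum(abs(exact_r(x) - model_r(x)) for x in test_points)
println("largest difference over the test points: ", largest_gap)
```

The difference should be at the level of floating-point rounding, consistent with the two expressions being equal for every point.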
Exercise
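Consider a binary classification problem for which the regression function $r$ satisfies $r(\mathbf{x}) = \sigma(\alpha + \boldsymbol{\beta} \cdot \mathbf{x})$ for some $\alpha$ and $\boldsymbol{\beta}$. Show that the decision boundary is linear.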
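Solution. We solve $r(\mathbf{x}) = \tfrac{1}{2}$ to find the decision boundary. Since $\sigma(u) = \tfrac{1}{2}$ exactly when $u = 0$, this equation is equivalent to $\alpha + \boldsymbol{\beta} \cdot \mathbf{x} = 0$, the solution set of which is linear (by definition, since the equation is linear).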
This exercise shows that directly applying logistic regression always yields linear decision boundaries. However, we can use logistic regression to find nonlinear decision boundaries by appending components to the feature vectors which are derived from the original features. For example, if we apply the map $(x_1, x_2) \mapsto (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$ to each feature vector, then the linear boundary we discover in $\mathbb{R}^5$ will correspond to a quadric curve in the original feature space $\mathbb{R}^2$.
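As a small illustration of this idea, the sketch below applies a quadratic feature map to two-dimensional points and evaluates a logistic model that is linear in the new features. The map `quadratic_features` and the hand-picked parameters are assumptions made for this example (nothing is fitted here); they are chosen so that the boundary $\alpha + \boldsymbol{\beta} \cdot \phi(\mathbf{x}) = 0$ is the unit circle $x_1^2 + x_2^2 = 1$.

```julia
σ(u) = 1 / (1 + exp(-u))

# Illustrative feature map: append the quadratic monomials to the original features.
quadratic_features(x) = [x[1], x[2], x[1]^2, x[2]^2, x[1] * x[2]]

# Hypothetical parameters chosen by hand, not fitted:
# α + β ⋅ φ(x) = x₁² + x₂² - 1, so the decision boundary is the unit circle.
α = -1.0
β = [0.0, 0.0, 1.0, 1.0, 0.0]

r(x) = σ(α + β' * quadratic_features(x))
classify(x) = r(x) > 0.5 ? 1 : 0

for x in ([0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [-2.0, 0.0])
    println("x = ", x, ": class ", classify(x))
end
```

Points inside the circle are assigned class 0 and points outside are assigned class 1, a boundary that no logistic model on the original two features could produce.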