# [3] ML-Logistic regression

It has been a long time since I wrote ML-Linear regression, mostly because I was busy dealing with things related to my graduation (and enjoying the parties). Anyway, I will catch up with my plan and update this blog as soon as I can.

# Definition

In classification problems, linear regression is not a good way to predict the output y. One reason is that the hypothesis h~θ~(x) may be greater than 1 or less than 0, while logistic regression limits the output of h~θ~(x) to between 0 and 1.

We call it the Logistic function (or Sigmoid function):
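$$h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$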

The hypothesis output can be interpreted as a **probability**. Concretely, h~θ~(x) = P(y=1 | x; θ), the probability that y = 1 given input x, parameterized by θ. So we predict whether h~θ~(x) is greater or less than 0.5 to decide whether the output y is 1 or 0.

Looking at the graph of h~θ~(x) = g(z), g(z) equals 0.5 when z equals 0. So the problem becomes predicting whether z = θ^T^x is larger or smaller than 0.

The **decision boundary** visualizes this goal very well. Here are two examples: a **linear decision boundary** and a **non-linear decision boundary**.

Moreover, you can use different decision boundaries, and which one you get depends on what kind of function you use for z: a linear function of the features gives a linear boundary, while polynomial features give a non-linear one.
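To make this concrete, here is a minimal NumPy sketch of how a fitted model turns the sigmoid output into a 0/1 prediction. The `theta` values here are made-up placeholders for illustration, not fitted parameters:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    # h_theta(x) = g(theta^T x); predict y = 1 when h >= 0.5,
    # which is the same as checking whether theta^T x >= 0
    probs = sigmoid(X @ theta)
    return (probs >= 0.5).astype(int)

# Made-up parameters giving the linear decision boundary x1 + x2 = 3
theta = np.array([-3.0, 1.0, 1.0])

# Each row is [1, x1, x2]; the leading 1 is the intercept term
X = np.array([[1.0, 1.0, 1.0],   # below the boundary -> 0
              [1.0, 2.0, 2.0]])  # above the boundary -> 1

print(predict(theta, X))  # [0 1]
```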

# Cost Function

Next, given a training set (x~1~, y~1~), (x~2~, y~2~), …, (x~n~, y~n~), we should choose parameters θ so that the resulting decision boundary fits the data set best. Logistic regression is the same as linear regression in this respect: build a cost function.
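For a single training example, the cost is:

$$\mathrm{Cost}\big(h_\theta(x), y\big) = \begin{cases} -\log\big(h_\theta(x)\big) & \text{if } y = 1 \\[4pt] -\log\big(1 - h_\theta(x)\big) & \text{if } y = 0 \end{cases}$$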

or we could write this in a single expression:
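$$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \Big[\, y_i \log h_\theta(x_i) + (1 - y_i) \log\big(1 - h_\theta(x_i)\big) \Big]$$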

To better understand this cost function, we should know why we use the **log** function. If we plot −log(h~θ~(x)) (used when y = 1) and −log(1 − h~θ~(x)) (used when y = 0), we can see that h~θ~(x) ranges from 0 to 1 while the cost ranges from 0 to infinity.

We penalize the learning algorithm with a very, very large cost when it is confidently wrong, and that is captured by the cost going to infinity as h~θ~(x) approaches 0 while y equals 1. For example, if y = 1 and h~θ~(x) = 0.01, the cost is −log(0.01) ≈ 4.6, and it grows without bound as h~θ~(x) → 0.

Similar to linear regression, we can use gradient descent to minimize the cost function **J(θ)**; the update rule looks identical, and the main difference is just the definition of h~θ~(x).
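We repeat until convergence, updating every θ~j~ simultaneously (here x~i,j~ denotes the j-th feature of example x~i~):

$$\theta_j := \theta_j - \frac{\alpha}{n} \sum_{i=1}^{n} \big( h_\theta(x_i) - y_i \big)\, x_{i,j}$$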

Besides gradient descent, many other optimization algorithms can also minimize J(θ). Some properties are listed below:

Algorithm | Property |
---|---|
Gradient descent | Simple |
Conjugate gradient / BFGS / L-BFGS | Pros: no need to pick α; often faster than gradient descent. Cons: more complex |
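As a rough end-to-end sketch (not a tuned implementation), here is gradient descent minimizing J(θ) on a tiny made-up data set; the learning rate and iteration count are arbitrary choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    # X carries a leading column of ones for the intercept term
    n = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)      # predictions for all n examples
        grad = X.T @ (h - y) / n    # gradient of J(theta)
        theta -= alpha * grad       # simultaneous update of every theta_j
    return theta

# Tiny made-up 1-D data set: y flips from 0 to 1 around x = 2.5
x = np.array([0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1  ])
X = np.column_stack([np.ones_like(x), x])

theta = gradient_descent(X, y)
print(sigmoid(X @ theta).round(2))  # probabilities rise from ~0 to ~1
```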

To summarize linear regression and logistic regression, I made a table comparing some key functions of the two.
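Function | Linear regression | Logistic regression |
---|---|---|
Hypothesis h~θ~(x) | θ^T^x | 1 / (1 + e^(−θ^T^x)) |
Cost function J(θ) | (1/2n) Σ (h~θ~(x~i~) − y~i~)² | −(1/n) Σ [y~i~ log h~θ~(x~i~) + (1 − y~i~) log(1 − h~θ~(x~i~))] |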


[…] Details of Logistic Regression […]