Ever wondered how logistic regression really works?

You’re not alone. It’s confusing.

This post drills down into the mechanics of logistic regression. We’ll go through the mathematical aspects and also construct our own logistic regression function in R. Exciting!

## What is logistic regression?

In linear regression the dependent variable takes continuous values, like house prices for an area or the income of an individual.

Logistic regression differs in that it is used for modelling categorical data. Typically you model data that falls into two categories, but using logistic regression with more than two categories is certainly possible.

We will assume for this post that our response variable falls into exactly two categories. Here are some examples where you might want to use logistic regression:

- If a plant lives/dies
- If someone was accepted/rejected for a home loan
- If a student graduates/drops out of university
- Owning/not owning a TV/laptop/segway/aircraft

We represent one of the categories by $1$ and the other by $0$. For example, a plant that lives could be represented by $y = 1$, and a dead plant by $y = 0$.

## Mathematical representation

Let us have $n$ data points, or training samples. Each one of these training samples is comprised of measurements of $k$ independent variables, which are denoted by $X_1, X_2, \ldots, X_k$. The realised values of these variables are given by $x_1, x_2, \ldots, x_k$.

Each training sample also has a value of the dependent variable $Y$ associated with it, and we let $y$ be the realised value of $Y$. We have that $y \in \{0, 1\}$, where either $y = 0$ or $y = 1$.

The $i$th training sample will consist of the data points $x_{i1}, x_{i2}, \ldots, x_{ik}$ and the observed response value $y_i$. Because our data is all realised values (we know the actual values of the variables) we use lowercase notation to refer to it.

We refer to individual column vectors like

$$\mathbf{x}_{(j)} = (x_{1j}, x_{2j}, \ldots, x_{nj})^T$$

and row vectors like

$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik}).$$

We will find it useful to define the $n \times (k+1)$ matrix $\mathbf{X}$, which is known as the design matrix. Let

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}$$

where the first column $\mathbf{x}_0$ is a vector of 1’s that allows us to estimate the intercept term.

In addition, let $p_i = P(y_i = 1 \mid \mathbf{x}_i)$, the probability of the response being 1 given the data $\mathbf{x}_i$. You can then set the vector of probabilities as a column vector $\mathbf{p} = (p_1, p_2, \ldots, p_n)^T$, where $p_i$ is the probability of $y_i = 1$.

The logistic regression model can be written as

$$\log\left(\frac{p_i}{1 - p_i}\right) = \mathbf{x}_i \boldsymbol{\beta}$$

where $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)^T$ is the vector of coefficients. Solving for $p_i$ gives

$$p_i = \frac{e^{\mathbf{x}_i \boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i \boldsymbol{\beta}}}.$$
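To make the link between the linear predictor and a probability concrete, here’s a tiny numeric sketch in R. The coefficient values here are made up purely for illustration, not fitted to anything:

```r
# hypothetical coefficients: intercept -1, slope 0.5
beta = c(-1, 0.5)

# a single observation with x = 3 (plus the leading 1 for the intercept)
x = c(1, 3)

# the linear predictor, i.e. the log-odds
log_odds = sum(x * beta)  # -1 + 0.5*3 = 0.5

# invert the logit to get a probability
p = exp(log_odds) / (1 + exp(log_odds))
round(p, 4)  # 0.6225
```

Whatever value the linear predictor takes on the real line, the transformation squashes it into $(0, 1)$, which is what lets us interpret the output as a probability.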

### Likelihood function

Looking at our data, for each vector of features $\mathbf{x}_i$ we have an observed class $y_i$. We’ll be able to fit a likelihood to the data.

This is how we’ll find the coefficients – by maximising the likelihood function with respect to $\boldsymbol{\beta}$.

I’ll only provide the key findings below, skipping over some of the mathematical steps. Feel free to try the derivations yourself and see if you can get to the same answer.

We have that $y_i = 1$ with probability $p_i$, and $y_i = 0$ with probability $1 - p_i$. From this we can formulate the likelihood function

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$

and hence the log likelihood

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

which can be rearranged to get

$$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i \mathbf{x}_i \boldsymbol{\beta} - \log\left(1 + e^{\mathbf{x}_i \boldsymbol{\beta}}\right) \right]$$

by substituting in the definition of $p_i$.
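As a sanity check on that rearrangement, here’s a small R sketch that evaluates the log likelihood in both forms for a toy dataset and an assumed $\boldsymbol{\beta}$ (every number here is invented for illustration):

```r
# toy design matrix: two observations, intercept column plus one predictor
X = matrix(c(1, 2,
             1, 4), nrow = 2, byrow = TRUE)
y = c(1, 0)

# an assumed coefficient vector, purely for illustration
beta = c(-1, 0.5)

# linear predictor and fitted probabilities
eta = as.vector(X %*% beta)
p = exp(eta) / (1 + exp(eta))

# log likelihood in both of the equivalent forms above
ll1 = sum(y * log(p) + (1 - y) * log(1 - p))
ll2 = sum(y * eta - log(1 + exp(eta)))

all.equal(ll1, ll2)  # TRUE -- the two forms agree
```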

We can take the derivative of the log likelihood and see for what value of $\boldsymbol{\beta}$ it equals zero: in other words, looking for a place where the slope is zero. This is the standard way of maximising the log likelihood and ends up being

$$\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} (y_i - p_i)\, x_{ij} = 0 \quad \text{for } j = 0, 1, \ldots, k$$

which is not just one equation, but $k + 1$ of them!

In fact, matrix notation is the easiest way that we can represent this. We can rewrite it as

$$\frac{\partial \ell}{\partial \boldsymbol{\beta}} = \mathbf{X}^T (\mathbf{y} - \mathbf{p}) = \mathbf{0}$$

which is a lot easier to work with in R.

To solve this we’re going to use an iterative algorithm called the Newton-Raphson method. In order to implement the method, it turns out we’re going to require the second derivative of the log likelihood. Let’s just grab that now:

$$\frac{\partial^2 \ell}{\partial \beta_j \partial \beta_m} = -\sum_{i=1}^{n} x_{ij}\, x_{im}\, p_i (1 - p_i)$$

Again we can rewrite this in matrix form, but this time we’re going to define a new matrix to help us out. Let $\mathbf{W}$ be a diagonal matrix where the $i$th diagonal element of $\mathbf{W}$ is equal to $p_i(1 - p_i)$. Another way of saying this is $W_{ii} = p_i(1 - p_i)$, where $W_{ij} = 0$ for $i \neq j$.

Rewriting gives

$$\frac{\partial^2 \ell}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^T} = -\mathbf{X}^T \mathbf{W} \mathbf{X}.$$

## The Newton-Raphson Method

Lacking an explicit solution for $\hat{\boldsymbol{\beta}}$ like we had for linear regression, we are forced to turn to approximate, iterative methods. One of these methods is the Newton-Raphson method.

The Newton-Raphson method is an iterative method used to calculate maximum likelihood estimators. Lucky for us, $\hat{\boldsymbol{\beta}}$ is a maximum likelihood estimator!

I won’t go into the method in depth, but good explanations of Newton-Raphson are easy to find elsewhere. The important thing to remember is that it’s an iterative method where each iteration (ideally) moves $\boldsymbol{\beta}$ closer to its best value. The term “best value” here means the $\boldsymbol{\beta}$ with the highest likelihood, so the likelihood should increase with each new $\boldsymbol{\beta}$.
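To get a feel for the method on its own, here’s a minimal one-dimensional sketch in R (the function and starting point are invented for illustration). It maximises the toy function $f(x) = -(x - 2)^2$ by repeatedly applying the Newton update $x \leftarrow x - f'(x)/f''(x)$:

```r
# maximise f(x) = -(x - 2)^2, whose maximum is clearly at x = 2
f_prime  = function(x) -2 * (x - 2)  # first derivative
f_second = function(x) -2            # second derivative (constant)

x = 0  # initial guess
for (i in 1:10) {
  x = x - f_prime(x) / f_second(x)  # the Newton update
}
x  # 2
```

Because this toy function is quadratic, the update actually lands on the maximum in a single step; for the logistic log likelihood the curvature changes with $\boldsymbol{\beta}$, so we need to iterate until convergence.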

The good news for us is that we’ve already got everything we need to perform the Newton-Raphson method. We’re going to have to clear up the notation a little, however.

Define $\boldsymbol{\beta}^{(t+1)}$ as the vector that we get after performing a Newton update, and $\boldsymbol{\beta}^{(t)}$ as the vector from the previous iteration.

A Newton update is then:

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \left( \frac{\partial^2 \ell}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^T} \right)^{-1} \frac{\partial \ell}{\partial \boldsymbol{\beta}}$$

Substituting in the values that we got earlier gives

$$\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p}).$$

In each Newton update (or iteration), our vector of probabilities $\mathbf{p}$ will update as we recalculate it using the updated $\boldsymbol{\beta}^{(t)}$. $\mathbf{W}$ will also change with each iteration as a consequence of its direct reliance on $\mathbf{p}$.

### Applying the Newton-Raphson method

The algorithm can be split into two sections: setup for the iterative loop, and the iterative loop itself.

For the setup:

- Set $\boldsymbol{\beta}^{(0)}$ to some initial value – a guess for the coefficients of the logistic regression. You can achieve quicker convergence if you already have a rough idea of what your coefficients are likely to be, since you can set the initial guess to a value close to the true value. This’ll lead to fewer iterations, but the difference shouldn’t be too significant.
- Set a threshold value $\epsilon$. The threshold value is used to check whether the logistic regression algorithm is converging. The idea is that the amount $\boldsymbol{\beta}$ changes by should get smaller and smaller each iteration and eventually drop below our threshold value. If it never drops below the threshold value, then we can say that the algorithm isn’t converging and terminate it.
- Set an iterations counter. We’ll use this counter to track the number of iterations, and if we find that $\boldsymbol{\beta}$ isn’t converging after some predetermined number of iterations (e.g. 100) then we’ll end the algorithm.

The iterative part then goes like this:

- Calculate $\mathbf{p}$ using $p_i = \frac{e^{\mathbf{x}_i \boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i \boldsymbol{\beta}}}$.
- Calculate $\mathbf{W}$ using the updated $\mathbf{p}$. Remember that $\mathbf{W}$ is a diagonal matrix where $W_{ii} = p_i(1 - p_i)$.
- Update $\boldsymbol{\beta}$ by using $\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} + \left( \mathbf{X}^T \mathbf{W} \mathbf{X} \right)^{-1} \mathbf{X}^T (\mathbf{y} - \mathbf{p})$, where $\boldsymbol{\beta}^{(t)}$ is the value from the previous iteration.
- See how much $\boldsymbol{\beta}$ has changed in the current iteration. If the change is less than the threshold value, end the loop and return $\boldsymbol{\beta}$.
- See if the number of iterations we’ve done is higher than some predetermined number (like 100). If so, end the algorithm and display an error message along the lines of “the function failed to converge”.

## In R

We’ll first generate some sample data. We’ll have 30 sample points and three independent variables, which we’ll imaginatively name $x_1$, $x_2$ and $x_3$. Our dependent variable is denoted by $y$.

```r
set.seed(2016)

#simulate data
#independent variables
x1 = rnorm(30, 3, 2) + 0.1 * c(1:30)
x2 = rbinom(30, 1, 0.3)
x3 = rpois(n = 30, lambda = 4)
x3[16:30] = x3[16:30] - rpois(n = 15, lambda = 2)

#dependent variable
y = c(rbinom(5, 1, 0.1), rbinom(10, 1, 0.25), rbinom(10, 1, 0.75), rbinom(5, 1, 0.9))
```

Next we construct our design matrix $\mathbf{X}$, which contains our data for $x_1$, $x_2$ and $x_3$ as well as $x_0$, a column of 1’s. Remember that we need $x_0$ in order to estimate the intercept $\beta_0$.

```r
x0 = rep(1, 30) #bias
X = cbind(x0, x1, x2, x3)
```

Here’s the algorithm. The code pretty much follows the steps described above.

```r
manual_logistic_regression = function(X, y, threshold = 1e-10, max_iter = 100)
#A function to find logistic regression coefficients
#Takes four inputs: the design matrix X, the response y,
#a convergence threshold and a maximum iteration count
{
  #A function to return p, given X and beta
  #We'll need this function in the iterative section
  calc_p = function(X, beta)
  {
    beta = as.vector(beta)
    return(exp(X %*% beta) / (1 + exp(X %*% beta)))
  }

  #### setup bit ####

  #initial guess for beta
  beta = rep(0, ncol(X))

  #initial value bigger than threshold so that we can enter our while loop
  diff = 10000

  #counter to ensure we're not stuck in an infinite loop
  iter_count = 0

  #### iterative bit ####
  while(diff > threshold) #tests for convergence
  {
    #calculate probabilities using current estimate of beta
    p = as.vector(calc_p(X, beta))

    #calculate matrix of weights W
    W = diag(p * (1 - p))

    #calculate the change in beta
    beta_change = solve(t(X) %*% W %*% X) %*% t(X) %*% (y - p)

    #update beta
    beta = beta + beta_change

    #calculate how much we changed beta by in this iteration
    #if this is less than the threshold, we'll break the while loop
    diff = sum(beta_change^2)

    #see if we've hit the maximum number of iterations
    iter_count = iter_count + 1
    if(iter_count > max_iter) {
      stop("This isn't converging, mate.")
    }
  }

  #make it pretty
  coef = c("(Intercept)" = beta[1], x1 = beta[2], x2 = beta[3], x3 = beta[4])
  return(coef)
}
```

One thing to note is that I defined calc_p() (the function to calculate $\mathbf{p}$) inside the manual_logistic_regression() function. This means that calc_p() is only accessible from within manual_logistic_regression(), and other functions won’t be able to call it. Keep this in mind if you want to extend this code or test calc_p() on its own.
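Here’s a minimal demonstration of that scoping behaviour using a stand-in pair of functions (outer() and inner() are hypothetical names, chosen just for this illustration):

```r
outer = function(x) {
  # inner() is defined inside outer(), mirroring how calc_p()
  # lives inside manual_logistic_regression()
  inner = function(y) y + 1
  inner(x) * 2
}

outer(3)         # 8: inner() is visible from inside outer()
exists("inner")  # FALSE: inner() is not visible from outside
```

If you do want to unit-test the helper, the usual fix is to define it at the top level instead of inside the wrapper.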

We can go ahead and see how well our version compares against R’s implementation of logistic regression. We usually run logistic regression by calling the glm() function whilst setting the family argument to “binomial”.

Let’s compare the two:

```r
#using R package
M1 = glm(y ~ x1 + x2 + x3, family = "binomial")
M1$coefficients
# (Intercept)          x1          x2          x3
#  -1.3512086   0.3191309   0.2033449  -0.0832102

#our version
manual_logistic_regression(X, y)
# (Intercept)          x1          x2          x3
#  -1.3512086   0.3191309   0.2033449  -0.0832102

#they're the same! Amazing!
```

And that’s all there is to it.

The objective of this article was to show how logistic regression can be implemented without using built-in functions, rather than to give an introduction to logistic regression itself.